[HN Gopher] LlamaCloud and LlamaParse
___________________________________________________________________
LlamaCloud and LlamaParse
Author : eferreira_
Score : 111 points
Date : 2024-02-20 17:20 UTC (5 hours ago)
(HTM) web link (blog.llamaindex.ai)
(TXT) w3m dump (blog.llamaindex.ai)
| coding123 wrote:
| What's a RAG application?
| seldo wrote:
| Retrieval-Augmented Generation, where you ask an LLM to answer
| a question by giving it some context information that you have
| retrieved from your own data rather than just the data it was
| trained on.
| simonw wrote:
| RAG stands for Retrieval Augmented Generation.
|
| It's the trick where a user asks you a question: "Who worked on
| the billing UI refresh last year?" - and you turn that question
| into a search against a bunch of private documents, find the
| top matches, copy them into a big prompt to an LLM and ask it
| to use that data to answer the user's question.
|
| There's a HUGE amount of depth to building this well - it's one
| of the most actively explored parts of LLM/generative-AI at the
| moment, because being able to ask human-language questions of
| large private datasets is incredibly useful.
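| In Python the whole loop is roughly this (a sketch only; the
| OpenAI client is just an example, and search_documents stands in
| for whatever search you run over your private docs):
|
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def answer_with_rag(question, search_documents, top_k=5):
|       # 1. Turn the question into a search and keep the top matches.
|       matches = search_documents(question, top_k=top_k)
|       context = "\n\n".join(doc.text for doc in matches)
|       # 2. Copy the matches into a big prompt and ask the model to
|       #    answer from that data.
|       response = client.chat.completions.create(
|           model="gpt-4-turbo-preview",
|           messages=[
|               {"role": "system",
|                "content": "Answer using only the provided context."},
|               {"role": "user",
|                "content": f"Context:\n{context}\n\nQuestion: {question}"},
|           ],
|       )
|       return response.choices[0].message.content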
| pierre wrote:
| I'm part of the team that built LlamaParse. It's a net improvement
| compared to other PDF->structured-text extractors (I've built
| several in the past, including https://github.com/axa-group/Parsr).
|
| For character extraction, LlamaParse uses a mixture of OCR and
| character extraction from the PDF (it's the only parser I'm aware
| of that addresses some of the buggy PDF font issues; check the
| 'text' mode to see the raw document before reconstruction), then
| uses a mixture of heuristics and machine learning models to
| reconstruct the document.
|
| Once plugged into a recursive retrieval strategy, it gets you SOTA
| results on question answering over complex text (see notebook:
| https://github.com/run-
| llama/llama_parse/blob/main/examples/...).
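|
| Basic usage looks roughly like this (an illustrative sketch; the
| notebook above has the full recursive retrieval setup, and the
| index step needs an embedding key such as OPENAI_API_KEY):
|
|   # pip install llama-parse llama-index
|   import os
|   from llama_parse import LlamaParse
|   from llama_index.core import VectorStoreIndex
|
|   parser = LlamaParse(
|       api_key=os.environ["LLAMA_CLOUD_API_KEY"],
|       result_type="markdown",  # or "text" for the raw output
|   )
|   documents = parser.load_data("./complex_report.pdf")
|
|   # Plain vector index over the parsed markdown; query is illustrative.
|   index = VectorStoreIndex.from_documents(documents)
|   print(index.as_query_engine().query("What was Q3 revenue?"))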
|
| AMA
| binarymax wrote:
| Cool! Which OCR engine/model do you use?
| pierre wrote:
| EasyOCR; we may switch to PaddleOCR in the future.
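|
| For reference, EasyOCR returns (bounding box, text, confidence)
| per detected region, which a reconstruction step can then stitch
| back into a document. Minimal call (file name is illustrative):
|
|   # pip install easyocr
|   import easyocr
|
|   reader = easyocr.Reader(["en"])
|   for box, text, conf in reader.readtext("scanned_page.png"):
|       print(text, conf)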
| vikp wrote:
| You may want to try https://github.com/VikParuchuri/surya
| (I'm the author). I've only benchmarked against tesseract,
| but it outperforms it by a lot (benchmarks in repo). Happy
| to discuss.
|
| You could also try https://github.com/VikParuchuri/marker
| for general PDF parsing (I'm also the author) - it seems
| like you're more focused on tables.
| amelius wrote:
| Can it detect and strip out advertisements?
| chasd00 wrote:
| One of the things I've been helping a team with is dealing with
| mountains of ppt decks, converted to pdf, and then
| parsed/chunked/embedded into vector storage. It doesn't work
| that well because a ppt is not a document. What are your
| thoughts on dealing with other formats that are first converted
| to pdf?
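|
| For concreteness, the current pipeline is roughly this (a sketch;
| paths and chunk size are illustrative, package layout from a
| recent llama-index):
|
|   # pip install llama-index
|   from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
|   from llama_index.core.node_parser import SentenceSplitter
|
|   # Decks already exported to PDF, then parsed, chunked, embedded
|   # (embedding uses the default model, so it needs an API key).
|   documents = SimpleDirectoryReader("./decks_as_pdf").load_data()
|   nodes = SentenceSplitter(chunk_size=512).get_nodes_from_documents(documents)
|   index = VectorStoreIndex(nodes)  # embeds into the default vector store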
| a2code wrote:
| Does it work with other file types converted into PDFs? For
| example docx, ppt, png, etc.
| bx376 wrote:
| What will the pricing be like?
| ldjkfkdsjnv wrote:
| Modern playbook:
|
| 1. Build janky open source code base
|
| 2. Sell compute to run it
|
| 3. Build features that create compute lock-in (Vercel is a master
| at this)
| tslmy wrote:
| Here's an alternative:
|
| Spend the seed round on building solid software but not on
| building an income stream that satisfies investors, thus receive
| no new funding and let the company die.
| johnsutor wrote:
| I wonder how LlamaParse compares head to head with
| https://unstructured.io
| justanotheratom wrote:
| not clear to me why this got downvoted. sensible question.
| infecto wrote:
| I would also like to know how it compares to any of the
| commercial offerings from Azure/AWS/GCP. They all have document
| parsing tools that I have found better than tools like
| Unstructured. Sure, you don't have some of the "magic" of
| segmenting text for vectorization and RAG, but imo that's the
| easy part. The hard part is pulling data, forms, tables, and
| text out of the PDF, and I find the cloud tools do a superior
| job of that.
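|
| For example, with the AWS offering (Textract; Azure Document
| Intelligence and GCP Document AI have equivalents, the file name
| is illustrative, and multi-page PDFs go through the async APIs):
|
|   # pip install boto3
|   import boto3
|
|   textract = boto3.client("textract")
|   with open("scanned_page.png", "rb") as f:
|       response = textract.analyze_document(
|           Document={"Bytes": f.read()},
|           FeatureTypes=["TABLES", "FORMS"],
|       )
|
|   # Blocks cover lines, key-value pairs, and table cells, which
|   # you can reassemble before chunking for RAG.
|   for block in response["Blocks"]:
|       if block["BlockType"] == "LINE":
|           print(block["Text"])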
| srameshc wrote:
| I don't understand why this was posted on Medium. Medium doesn't
| even let me read anymore. Host the post on your own blog so your
| audience can reach you.
| eferreira_ wrote:
| Also, there's an X (formerly Twitter) thread:
|
| https://x.com/llama_index/status/1759987390435996120?s=20
| diggan wrote:
| Which also isn't really available to unregistered users; you can
| only see the first tweet: https://i.imgur.com/SJA2Gzs.png
| seldo wrote:
| We are planning to move our blog off of Medium (we've been
| busy!), but this post is public so you can actually just click
| through the nag screen if you see one.
| miohtama wrote:
| > PDFs are specifically a problem: I have complex docs with lots
| of messy formatting. How do I represent this in the right way so
| the LLM can understand it?
|
| 40 years after PostScript, this is still a problem that one
| needs to throw AI at. I feel software development and human-
| computer interaction took a wrong turn along the way. What
| happened to the semantic web?
| avhon1 wrote:
| It turns out that it takes thought and effort to semantically
| tag/classify everything consistently and completely, so rather
| than make those decisions, it's easier to just not do it.
| madeofpalk wrote:
| What?
|
| We still have 'the web'. PDFs are something different and
| separate.
| lxe wrote:
| LlamaParse solves exactly the problem I've encountered over and
| over with RAG. Getting structured info from unstructured data is
| a pain.
| pknerd wrote:
| Sorry for going off-topic: are there any LLM services I can use
| in the cloud, similar to OpenAI? I don't have a good enough
| MacBook to run different models locally.
| tslmy wrote:
| Hmmm... OpenAI itself?
|
| Did you intend to rule out OpenAI from consideration?
|
| You mentioned hardware being a constraint, but that doesn't
| tell me why you specifically wanted to find an alternative to
| OpenAI.
| simion314 wrote:
| Not OP, but in my case OpenAI doesn't want my money; they only
| accept credit cards. Netflix, for example, wants my money, so
| they offer more payment options.
|
| I would also like to pay for an equivalent alternative that is
| less censored. ChatGPT had a bug one day where it refused to
| tell me how to force a type cast in TypeScript and showed me a
| moderation error instead. So I want an AI that is aimed at
| adults, not at children in some religious school in the USA.
| pknerd wrote:
| I have used OpenAI but I want to try several other LLMs as
| well.
| adhamsalama wrote:
| I think Anthropic and Mistral offer this but you have to join
| their waiting lists first.
| sebastiennight wrote:
| If you use the LLM Chatbot arena[1], you can get two bots to
| compete to solve your prompts!
|
| [1]: https://chat.lmsys.org/?arena
| technics256 wrote:
| LlamaParse looks nice. Is there a way to also return page
| numbers with the markdown? This is important for our use case.
| behnamoh wrote:
| > This is where LlamaParse comes in. We've developed a
| proprietary parsing service that is incredibly good at parsing
| PDFs with complex tables into a well-structured markdown format.
|
| This is my problem with projects that start off as open source
| and become famous because of their community contributions and
| attention, then the project leaders get that sweet VC money (or
| not) and make something proprietary.
|
| We've seen it with Langchain and several other "fake open source"
| projects.
| siquick wrote:
| LlamaParse is proprietary but the main LI package isn't and you
| don't need the former to use the latter.
|
| Why shouldn't they make money? LI is a fantastic way to do RAG.
| zmmmmm wrote:
| I don't disagree but I think it's not a question of "why" but
| "how".
|
| It could still be licensed in a restricted way, but keeping
| secret _how_ it works is unfortunate - it breaks the chain of
| learning that is happening across the open ecosystem and, if
| the technique is any good, all it does is force open models
| to build an actually open equivalent so that further progress
| can be made (and if it's not really any good then it's snake
| oil, which is worse). Even if it's great, it essentially
| becomes a dead end for the people who actually need and want
| an open model ecosystem.
| baby wrote:
| what is RAG?
| doublerabbit wrote:
| retrieval augmented generation.
|
| Explained by gpt itself as if you were a teddy bear.
|
| ----
|
| Okay little teddybears, let me explain what retrieval augmented
| generation is in a way you can understand!
|
| You see, sometimes when big AI models like Claude want to talk
| about something, they may not know all the facts. But they have
| a friend named the knowledge base who knows lots of
| information!
|
| When Claude wants to talk about something new, he first asks
| the knowledge base "What do you know about X?". The knowledge
| base looks through all its facts and finds the most helpful
| ones. Then it shares them with Claude so he has more context
| before talking.
|
| This process of Claude asking the knowledge base for facts is
| called retrieval augmented generation. It helps Claude sound
| smarter and avoid mistakes, because he has extra information
| from his knowledgeable friend the knowledge base.
|
| The next time Claude wants to chat with you teddybears, he will
| be even better prepared with facts from the knowledge base to
| have an interesting conversation!
| _pdp_ wrote:
| Question: why build this when you can use LLMs to extract the
| data in the most appropriate format to begin with? Isn't this a
| bit redundant? Perhaps it makes sense in the short term due to
| cost, but in the long run this problem can be solved generically
| with LLMs.
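|
| To make the suggestion concrete, a sketch of what I mean (model
| name and prompt are illustrative): dump the raw PDF text and let
| the model restructure it.
|
|   # pip install pypdf openai
|   from pypdf import PdfReader
|   from openai import OpenAI
|
|   raw_text = "\n".join(p.extract_text() for p in PdfReader("doc.pdf").pages)
|
|   client = OpenAI()
|   response = client.chat.completions.create(
|       model="gpt-4-turbo-preview",
|       messages=[
|           {"role": "system",
|            "content": "Reconstruct this extracted PDF text as clean "
|                       "markdown, preserving tables."},
|           {"role": "user", "content": raw_text},
|       ],
|   )
|   print(response.choices[0].message.content)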
| lolpanda wrote:
| I think LlamaParse is trying to solve a hard problem. Many
| enterprise customers I know have a strong need to parse PDF
| files and extract data accurately. I found the interface a bit
| confusing, though. From your blog post, LlamaParse can extract
| numbers in tables, but it appears that the output isn't provided
| in tabular format. Instead, access to these numbers is only
| available through question answering. Is this accurate?
___________________________________________________________________
(page generated 2024-02-20 23:00 UTC)