[HN Gopher] LlamaCloud and LlamaParse
       ___________________________________________________________________
        
       LlamaCloud and LlamaParse
        
       Author : eferreira_
       Score  : 111 points
       Date   : 2024-02-20 17:20 UTC (5 hours ago)
        
 (HTM) web link (blog.llamaindex.ai)
 (TXT) w3m dump (blog.llamaindex.ai)
        
       | coding123 wrote:
       | What's a RAG application?
        
         | seldo wrote:
         | Retrieval-Augmented Generation, where you ask an LLM to answer
         | a question by giving it some context information that you have
         | retrieved from your own data rather than just the data it was
         | trained on.
        
         | simonw wrote:
         | RAG stands for Retrieval Augmented Generation.
         | 
         | It's the trick where a user asks you a question: "Who worked on
         | the billing UI refresh last year?" - and you turn that question
         | into a search against a bunch of private documents, find the
         | top matches, copy them into a big prompt to an LLM and ask it
         | to use that data to answer the user's question.
         | 
         | There's a HUGE amount of depth to building this well - it's one
         | of the most actively explored parts of LLM/generative-AI at the
         | moment, because being able to ask human-language questions of
         | large private datasets is incredibly useful.
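
Editor's sketch of the flow described above: turn the question into a
search over private documents, keep the top matches, and paste them
into a prompt. This is a minimal illustration, not any library's API.
The bag-of-words cosine scoring, the `retrieve` and `build_prompt`
names, and the sample documents are all assumptions made for the
example; real systems typically use embeddings and a vector store, and
the final call to an LLM is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, top_k=2):
    """Return up to top_k documents that share words with the question."""
    q = Counter(tokenize(question))
    scored = [(cosine(q, Counter(tokenize(d))), d) for d in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for score, d in scored[:top_k] if score > 0]


def build_prompt(question, documents, top_k=2):
    """Copy the top matches into a prompt that would be sent to an LLM."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents, top_k))
    return (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


# Hypothetical private documents, invented for this example.
DOCS = [
    "Alice and Bob worked on the billing UI refresh in 2023.",
    "The cafeteria menu changes every Monday.",
    "Carol maintains the payroll export job.",
]

prompt = build_prompt(
    "Who worked on the billing UI refresh last year?", DOCS, top_k=1
)
print(prompt)
```

The retrieval step is where most of the "depth" mentioned above lives:
swapping the word-overlap score for embeddings, reranking, or better
chunking changes `retrieve` while the prompt-building step stays the
same.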
        
       | pierre wrote:
       | I'm part of the team that built LlamaParse. It's a net
       | improvement compared to other PDF->Structured Text extractors (I
       | built several in the past, including
       | https://github.com/axa-group/Parsr).
       | 
       | For character extraction, LlamaParse uses a mixture of OCR and
       | character extraction from the PDF (it's the only parser I'm
       | aware of that addresses some of the buggy PDF font issues; check
       | the 'text' mode to see the raw document before reconstruction),
       | then uses a mixture of heuristics and machine learning models to
       | reconstruct the document.
       | 
       | Once plugged into a recursive retrieval strategy, it allows you
       | to get SOTA results on question answering over complex text
       | (see notebook:
       | https://github.com/run-llama/llama_parse/blob/main/examples/...).
       | 
       | AMA
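
Editor's sketch of what a recursive retrieval strategy can look like in
general: small index nodes (for example, short summaries of tables) are
searched first, and a matching node is then followed to the larger
object it points at. This is a toy illustration under stated
assumptions, not LlamaParse's or LlamaIndex's implementation. The
`recursive_retrieve` function, the node layout, the word-overlap
scoring, and the sample data are all hypothetical.

```python
import re


def tokens(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


# Hypothetical store of full tables, keyed by an id.
TABLES = {
    "table_1": "| Year | Revenue |\n| 2022 | 10 |\n| 2023 | 14 |",
    "table_2": "| Region | Headcount |\n| EMEA | 40 |\n| APAC | 25 |",
}

# Index nodes: short searchable text, plus an optional reference
# to a full table. Plain text nodes have ref=None.
NODES = [
    {"text": "Summary: revenue by year for 2022 and 2023", "ref": "table_1"},
    {"text": "Summary: headcount by region", "ref": "table_2"},
    {"text": "Company history: founded by two engineers.", "ref": None},
]


def recursive_retrieve(question, nodes, tables, top_k=1):
    """Rank nodes by word overlap, then follow references to full tables."""
    q = tokens(question)
    ranked = sorted(
        nodes, key=lambda n: len(q & tokens(n["text"])), reverse=True
    )
    results = []
    for node in ranked[:top_k]:
        ref = node["ref"]
        # A node with a reference resolves to the referenced table;
        # a plain node resolves to its own text.
        results.append(tables[ref] if ref is not None else node["text"])
    return results


hits = recursive_retrieve("What was the revenue in 2023?", NODES, TABLES)
print(hits[0])
```

The point of the indirection is that a short summary is easier to match
against a question than a raw table, while the full table is what the
LLM needs in its prompt to answer accurately.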
        
         | binarymax wrote:
         | Cool! Which OCR engine/model do you use?
        
           | pierre wrote:
           | EasyOCR, may switch to paddleOCR in the future.
        
             | vikp wrote:
             | You may want to try https://github.com/VikParuchuri/surya
             | (I'm the author). I've only benchmarked against tesseract,
             | but it outperforms it by a lot (benchmarks in repo). Happy
             | to discuss.
             | 
             | You could also try https://github.com/VikParuchuri/marker
             | for general PDF parsing (I'm also the author) - it seems
             | like you're more focused on tables.
        
         | amelius wrote:
         | Can it detect and strip out advertisements?
        
         | chasd00 wrote:
         | One of the things I've been helping a team with is dealing with
         | mountains of ppt decks, converted to pdf, and then
         | parsed/chunked/embedded into vector storage. It doesn't work
         | that well because a ppt is not a document. What are your
         | thoughts when dealing with other formats first converted to
         | pdf?
        
         | a2code wrote:
         | Does it work with other file types converted into PDFs? For
         | example docx, ppt, png, etc.
        
       | bx376 wrote:
       | What will the pricing be like?
        
       | ldjkfkdsjnv wrote:
       | Modern playbook:
       | 
       | 1. Build janky open source code base
       | 
       | 2. Sell compute to run it
       | 
       | 3. Build features that create compute lock in (vercel is a master
       | at this)
        
         | tslmy wrote:
         | Here's an alternative:
         | 
         | Spend seed-round investment on building solid software but not
         | an income stream that can satisfy investors, thus receiving no
         | new funding and letting the company die.
        
       | johnsutor wrote:
       | I wonder how LlamaParse compares head to head with
       | https://unstructured.io
        
         | justanotheratom wrote:
         | not clear to me why this got downvoted. sensible question.
        
         | infecto wrote:
         | I would also like to know how it compares to any of the
         | commercial offerings from Azure/AWS/GCP. They all have document
         | parsing tools that I have found better than tools like
         | unstructured. Sure, you don't have some of the "magic" of
         | segmenting text for vectorization and RAG, but imo that's the
         | easy part. The hard part is pulling data, forms, tables, and
         | text out of the PDF, which I find the cloud tools do a
         | superior job at.
        
       | srameshc wrote:
       | I don't understand why you would post this on Medium. Medium
       | doesn't even let me read anymore. If you have a blog, post on it
       | so your audience can reach you.
        
         | eferreira_ wrote:
         | Also, there's an X (formerly Twitter) thread:
         | 
         | https://x.com/llama_index/status/1759987390435996120?s=20
        
           | diggan wrote:
           | Which also isn't really available to unregistered users; you
           | can only see the first tweet: https://i.imgur.com/SJA2Gzs.png
        
         | seldo wrote:
         | We are planning to move our blog off of Medium (we've been
         | busy!), but this post is public so you can actually just click
         | through the nag screen if you see one.
        
       | miohtama wrote:
       | > PDFs are specifically a problem: I have complex docs with lots
       | of messy formatting. How do I represent this in the right way so
       | the LLM can understand it?
       | 
       | 40 years after PostScript and this is still a problem that one
       | needs to throw AI at. I feel that software development and
       | human-computer interaction took a wrong turn along the way.
       | What happened to the semantic web?
        
         | avhon1 wrote:
         | It turns out that it takes thought and effort to semantically
         | tag/classify everything consistently and completely, so rather
         | than make the decisions, it's easier to just not do it.
        
         | madeofpalk wrote:
         | What?
         | 
         | We still have 'the web'. PDFs are something different and
         | separate.
        
       | lxe wrote:
       | LlamaParse solves exactly the problem I've encountered over and
       | over with RAG. Getting structured info from unstructured data is
       | a pain.
        
       | pknerd wrote:
       | Sorry for the off-topic question: are there any LLM services
       | that I can use in the cloud, similar to OpenAI? I don't have a
       | good enough MacBook to run different models locally.
        
         | tslmy wrote:
         | Hmmm... OpenAI itself?
         | 
         | Did you intend to rule out OpenAI from consideration?
         | 
         | You mentioned hardware being a constraint, but that doesn't
         | tell me why you specifically wanted to find an alternative to
         | OpenAI.
        
           | simion314 wrote:
           | Not OP, but in my case OpenAI does not want my money: they
           | only accept credit cards. Netflix, for example, wants my
           | money, so they offer more payment options.
           | 
           | Also, I would like to pay for an equivalent alternative that
           | is less censored. ChatGPT had a bug one day where it refused
           | to tell me how to force a type cast in TypeScript and showed
           | me a moderation error. So I want an AI that is targeted at
           | adults and not children in some religious school in the
           | USA.
        
           | pknerd wrote:
           | I have used OpenAI, but I want to try several other LLMs as
           | well.
        
         | adhamsalama wrote:
         | I think Anthropic and Mistral offer this but you have to join
         | their waiting lists first.
        
         | sebastiennight wrote:
         | If you use the LLM Chatbot arena[1], you can get two bots to
         | compete to solve your prompts!
         | 
         | [1]: https://chat.lmsys.org/?arena
        
       | technics256 wrote:
       | LlamaParse looks nice. Is there a way to also return page
       | numbers with the markdown? This is important for our use case.
        
       | behnamoh wrote:
       | > This is where LlamaParse comes in. We've developed a
       | proprietary parsing service that is incredibly good at parsing
       | PDFs with complex tables into a well-structured markdown format.
       | 
       | This is my problem with projects that start off as open source
       | and become famous because of their community contributions and
       | attention, then the project leaders get that sweet VC money (or
       | not) and make something proprietary.
       | 
       | We've seen it with Langchain and several other "fake open source"
       | projects.
        
         | siquick wrote:
         | LlamaParse is proprietary but the main LI package isn't and you
         | don't need the former to use the latter.
         | 
         | Why shouldn't they make money? LI is a fantastic way to do RAG.
        
           | zmmmmm wrote:
           | I don't disagree but I think it's not a question of "why" but
           | "how".
           | 
           | It could still be licensed in a restricted way, but keeping
           | secret _how_ it works is unfortunate - it breaks the chain of
           | learning that is happening across the open ecosystem and, if
           | the technique is any good, all it does is force open models
           | to build an actually open equivalent so that further progress
           | can be made (and if it's not really any good then it's snake
           | oil, which is worse). Even if it's great it essentially
           | becomes a dead end for the people who actually need and want
           | an open model ecosystem.
        
       | baby wrote:
       | what is RAG?
        
         | doublerabbit wrote:
         | retrieval augmented generation.
         | 
         | Explained by gpt itself as if you were a teddy bear.
         | 
         | ----
         | 
         | Okay little teddybears, let me explain what retrieval augmented
         | generation is in a way you can understand!
         | 
         | You see, sometimes when big AI models like Claude want to talk
         | about something, they may not know all the facts. But they have
         | a friend named the knowledge base who knows lots of
         | information!
         | 
         | When Claude wants to talk about something new, he first asks
         | the knowledge base "What do you know about X?". The knowledge
         | base looks through all its facts and finds the most helpful
         | ones. Then it shares them with Claude so he has more context
         | before talking.
         | 
         | This process of Claude asking the knowledge base for facts is
         | called retrieval augmented generation. It helps Claude sound
         | smarter and avoid mistakes, because he has extra information
         | from his knowledgeable friend the knowledge base.
         | 
         | The next time Claude wants to chat with you teddybears, he will
         | be even better prepared with facts from the knowledge base to
         | have an interesting conversation!
        
       | _pdp_ wrote:
       | Question: why build this when you can use LLMs to extract the
       | data in the most appropriate format to begin with? Isn't this a
       | bit redundant? Perhaps it makes sense in the short term due to
       | cost, but in the long run this problem can be solved generically
       | with LLMs.
        
       | lolpanda wrote:
       | I think LlamaParse is trying to solve a hard problem. Many
       | enterprise customers I know have a strong need to parse PDF
       | files and extract data accurately. I found the interface a bit
       | confusing. From your blog post, LlamaParse can extract numbers
       | in tables, but it appears that the output isn't provided in
       | tabular format. Instead, access to these numbers is only
       | available through question answering. Is this accurate?
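
Editor's sketch related to the question above: the blog post quoted
earlier in the thread says tables come back as markdown, and if that is
the case, rows can be recovered from the markdown text with a few lines
of code. This does not answer whether LlamaParse offers tabular output
itself. The `parse_markdown_table` helper and the sample table are
hypothetical, and the parsing is deliberately simple (it does not
handle escaped pipes inside cells).

```python
def parse_markdown_table(markdown):
    """Parse a simple pipe-delimited markdown table into a list of dicts."""
    rows = []
    for line in markdown.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore anything that is not a table row
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header separator row, e.g. |-----|:---:|
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    # Pair each body row with the header names.
    return [dict(zip(header, row)) for row in body]


# Hypothetical parser output, invented for this example.
SAMPLE = """
| Quarter | Revenue |
|---------|---------|
| Q1      | 120     |
| Q2      | 135     |
"""

records = parse_markdown_table(SAMPLE)
print(records)
```

All values come back as strings; converting numeric columns is left to
the caller, since the right type depends on the table.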
        
       ___________________________________________________________________
       (page generated 2024-02-20 23:00 UTC)