[HN Gopher] Don't bother parsing: Just use images for RAG
       ___________________________________________________________________
        
       Don't bother parsing: Just use images for RAG
        
       Author : Adityav369
       Score  : 126 points
       Date   : 2025-07-21 17:16 UTC (5 hours ago)
        
 (HTM) web link (www.morphik.ai)
 (TXT) w3m dump (www.morphik.ai)
        
       | pilooch wrote:
       | Some colleagues and I implemented exactly this six
       | months ago for a French gov agency.
       | 
       | It's open source and available here:
       | https://github.com/jolibrain/colette
       | 
       | It's not our primary business so it's just lying there and we
       | don't advertise much, but it works, somehow and with some tweaks
       | to get it really efficient.
       | 
       | The true genius though is that the whole thing can be made fully
       | differentiable, unlocking the ability to finetune the viz rag on
       | targeted datasets.
       | 
       | The layout model can also be customized for fine grained document
       | understanding.
        
         | Adityav369 wrote:
         | Yeah the fine tuning is definitely the best part.
         | 
         | Often, the blocker becomes high quality eval sets (which I
         | guess always is the blocker).
        
         | ted_dunning wrote:
         | You don't have a license in your repository top-level. That
         | means that nobody who takes licensing at all seriously can use
         | your stuff, even just for reference.
        
           | JSR_FDED wrote:
           | Great, thanks for sharing your code. Could you please add a
           | license so I and others can understand if we're able to use
           | it?
        
           | wryun wrote:
           | They do have: https://github.com/jolibrain/colette/blob/main/
           | pyproject.tom...
           | 
           | I agree it's better to have the full licence at top level,
           | but is there a legal reason why this would be inadequate?
        
           | pilooch wrote:
           | Good catch, will add it tomorrow. License is Apache2.
        
       | tobyhinloopen wrote:
       | This is something I've done as well - I wanted to scan all
       | invoices that came into my mail so I just exported ALL
       | ATTACHMENTS from my mailbox and used a script to upload them one
       | by one, forcing a tool call to extract "is invoice: yes / no" and
       | a bunch of invoice line, company name, date, invoice number, etc
       | fields.
       | 
       | It had a surprisingly high hit rate. It took over 3 hours of LLM
       | calls but who cares - It was completely hands-off. I then
       | compared the invoices to my bank statements (aka I asked an LLM
       | to do it) and it just missed a few invoices that weren't included
       | as attachments (like those "click to download" mails). It did a
       | pretty poor job matching invoices to bank statements (like "oh
       | this invoice is a few dollars off but i'm sure its this
       | statement") so I'm afraid I still need an accountant for a while.
       | 
       | "What did it cost"? I don't know. I used a cheap-ish model,
       | Claude 3.7 I think.
        
         | taberiand wrote:
         | In your use case, for that simple data matching that it errors
         | on I think it would be better to have the LLM write the code
         | that can be used to process the input files (the raw text that
         | it produced from images and the bank statements), rather than
         | have the LLM try to match up the data in the files itself.
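
A minimal sketch of the deterministic matching step suggested above,
assuming the LLM has already extracted invoice fields into structured
records. The `Invoice` and `BankLine` types, their field names, and the
45-day window are illustrative assumptions, not anything from the thread.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:  # hypothetical shape of the LLM-extracted fields
    number: str
    amount: Decimal
    issued: date


@dataclass(frozen=True)
class BankLine:  # hypothetical shape of one bank statement row
    ref: str
    amount: Decimal
    booked: date


def match_invoices(invoices, bank_lines, max_days=45):
    """Pair each invoice with one unused bank line of the exact same amount.

    Returns (matches, unmatched_invoices). Amounts must agree to the cent,
    so an invoice that is "a few dollars off" is reported as unmatched
    instead of being guessed at.
    """
    unused = list(bank_lines)
    matches, unmatched = [], []
    for inv in sorted(invoices, key=lambda i: i.issued):
        candidates = [
            b for b in unused
            if b.amount == inv.amount
            and 0 <= (b.booked - inv.issued).days <= max_days
        ]
        if not candidates:
            unmatched.append(inv)
            continue
        # Prefer the payment booked soonest after the invoice date.
        best = min(candidates, key=lambda b: (b.booked - inv.issued).days)
        unused.remove(best)
        matches.append((inv, best))
    return matches, unmatched
```

Anything left in the unmatched list can then go to a human (or the
accountant) for review.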
        
       | abc03 wrote:
       | Related question: what is today's best solution for invoices?
        
         | ArnavAgrawal03 wrote:
         | This would depend on the exact use case. Feeding in the invoice
         | directly to the model is - in my opinion - the best way to
         | approach this. If you need to search over them, then directly
         | embedding them as images is definitely a strong approach.
         | Here's something we wrote explaining the process:
         | https://www.morphik.ai/docs/concepts/colpali
        
       | themanmaran wrote:
       | Hey we've done a lot of research on this side [1] (OCR vs direct
       | image + general LLM benchmarking).
       | 
       | The biggest problem with direct image extraction is multipage
       | documents. We found that single page extraction (OCR=>LLM vs
       | Image=>LLM) slightly favored the direct image extraction. But
       | anything beyond 5 images had a sharp fall off in accuracy
       | compared to OCR first.
       | 
       | Which makes sense, long context recall over text is already a
       | hard problem, but that's what LLMs are optimized for. Long
       | context recall over images is still pretty bad.
       | 
       | [1] https://getomni.ai/blog/ocr-benchmark
        
         | ArnavAgrawal03 wrote:
         | That's an interesting point. We've found that for most use
         | cases, over 5 pages of context is overkill. Having a small LLM
         | conversion layer on top of images also ends up working pretty
         | well (i.e. instead of direct OCR, passing batches of 5 images -
         | if you really need that many - to smaller vision models and
         | having them extract the most important points from the
         | document).
         | 
         | We're currently researching surgery on the cache or attention
         | maps for LLMs to have larger batches of images work better.
         | Seems like Sliding window or Infinite Retrieval might be
         | promising directions to go into.
         | 
         | Also - and this is speculation - I think that the jump in
         | multimodal capabilities that we're seeing from models is only
         | going to increase, meaning long-context for images is probably
         | not going to be a huge blocker as models improve.
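
A rough sketch of the batching idea described above: split the page
images into groups of at most five and have a smaller vision model
condense each group. `summarize_batch` is a hypothetical callable
standing in for that model call; no specific API is implied.

```python
def batched(pages, size=5):
    """Yield consecutive slices of `pages` with at most `size` items each."""
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def summarize_document(pages, summarize_batch, size=5):
    """Run the (hypothetical) vision-model call once per batch of pages.

    Returns one summary per batch, in page order, so the summaries can be
    passed to a larger LLM as plain text.
    """
    return [summarize_batch(batch) for batch in batched(pages, size)]
```

With 12 pages and the default batch size this makes three calls, on
pages 1-5, 6-10, and 11-12.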
        
           | themanmaran wrote:
           | This just depends a lot on how well you can parse down the
           | context prior to passing to an LLM.
           | 
           | Ex: Reading contracts or legal documents. Usually a 50 page
           | document that you can't very effectively cherry pick from.
           | Since different clauses or sections will be referenced
           | multiple times across the full document.
           | 
           | In these scenarios, it's almost always better to pass the
           | full document into the LLM rather than running RAG. And if
           | you're passing the full document it's better as text rather
           | than images.
        
       | jasonthorsness wrote:
       | It makes sense that a lossy transformation (OCR which removes
       | structure) would be worse than perceptually lossless (because
       | even if the PDF file has additional information, you only see the
       | rendered visual). But it's cool and a little surprising that the
       | multi-modal models are getting this good at interpreting images!
        
       | emanuer wrote:
       | Could someone please help me understand how a multi-modal RAG
       | does not already solve this issue?[1]
       | 
       | What am I missing?
       | 
       | Flash 2.5, Sonnet 3.7, etc. always provided me with very
       | satisfactory image analysis. And, I might be making this up, but
       | to me it feels like some models provide better responses when I
       | give them the text as an image, instead of feeding "just" the
       | text.
       | 
       | [1] https://www.youtube.com/watch?v=p7yRLIj9IyQ
        
         | ArnavAgrawal03 wrote:
         | Multimodal RAG is exactly what we argue for. In their original
         | state, though, multivectors (that form the basis for multi-
         | modal RAG) are very unwieldy - computing the similarity scores
         | is very expensive and so scaling them up in this state is hard.
         | 
         | You need to apply things like quantization, single-vector
         | conversions (using fixed dimensional encodings), and better
         | indexing to ensure that multimodal RAG works at scale.
         | 
         | That is exactly what we're doing at Morphik :)
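
To illustrate why raw multivectors are expensive, here is a small
pure-Python sketch of ColBERT/ColPali-style late-interaction scoring
(MaxSim). Every query vector is compared against every vector of a
page, so one page costs len(query_vecs) * len(doc_vecs) dot products.
The `binarize` helper shows one simple form of quantization (sign bits)
and is an assumption for illustration only; the thread does not say
which scheme Morphik uses.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score of one document (page) for one query.

    For each query vector take its best dot product against any document
    vector, then sum those maxima over all query vectors.
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def binarize(vec):
    """1-bit quantization: keep only the sign of each dimension."""
    return [1 if x > 0 else -1 for x in vec]


query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

# page_a scores higher, so it would be ranked first.
ranked = sorted(
    [("page_a", maxsim(query, page_a)), ("page_b", maxsim(query, page_b))],
    key=lambda item: item[1],
    reverse=True,
)
```

Single-vector conversions and better indexing aim to avoid running this
nested loop over every page in the corpus.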
        
       | urbandw311er wrote:
       | Something just feels a bit off about this piece. It seems to
       | labour the point about how "beautiful" or "perfect" their
       | solution is a few times too many, to the point where it starts to
       | feel more like marketing than any sort of useful technical
       | observation.
        
         | programjames wrote:
         | I disagree. It feels like something you would say when you
         | finally come across the "obviously right" solution, that's
         | easier to implement and simpler to describe. As Kolmogorov
         | said, the simplest solution is exponentially more correct than
         | the others.
        
       | ianbicking wrote:
       | Using modern tools I would naturally be inclined to:
       | 
       | 1. Have the LLM see the image and produce a text version using a
       | kind of semantic markup (even hallucinated markup)
       | 
       | 2. Use that text for most of the RAG
       | 
       | 3. If the focus (of analysis or conversation) converges on one
       | image, include that image in the context in addition to the text
       | 
       | If I use a simple prompt with GPT 4o on the Palantir slide from
       | the article I get this:
       | https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... -
       | seems pretty good!
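
One way to sketch step 3 of the list above: retrieve over the text
versions, and attach the original page image only when most retrieved
chunks come from the same page. The `(page_id, text)` hit format and
the 0.6 threshold are assumptions made for this example.

```python
from collections import Counter


def build_context(hits, focus_share=0.6):
    """Assemble prompt context from retrieved text chunks.

    hits: list of (page_id, text) tuples from a text-only retriever.
    Returns (texts, image_page), where image_page is the page whose image
    should also be attached, or None when the hits are spread out.
    """
    texts = [text for _, text in hits]
    if not hits:
        return texts, None
    page, count = Counter(page_id for page_id, _ in hits).most_common(1)[0]
    image_page = page if count / len(hits) >= focus_share else None
    return texts, image_page
```

The text stays in context either way; the image is only added when the
conversation has narrowed to a single page.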
        
       | ashishb wrote:
       | I speak from experience that this is a bad idea.
       | 
       | There are cases where documents contain text with letters that
       | look the same in many fonts. For example, 0 and O look identical
       | in many fonts. So if you have a doc/xls/PDF/html then you lose
       | information by converting it into an image.
       | 
       | For cases like serial numbers, not even humans can distinguish 0
       | vs O (or l vs I) by looking at them.
        
         | weego wrote:
         | This is within the context of using it as an alternative to
         | OCR, which would suffer the same issues, with more duct tape
         | and string infrastructure and cost.
        
           | ashishb wrote:
           | You can win any race if you can cherry-pick your competitors.
        
         | zffr wrote:
         | PDFs don't always contain actual text. Sometimes they just
         | contain instructions to draw the letters.
         | 
         | For that reason, IMO rendering a PDF page as an image is a very
         | reasonable way to extract information out of it.
         | 
         | For the other formats you mentioned, I agree that it is
         | probably better to parse the document instead.
        
           | ArnavAgrawal03 wrote:
           | Completely agree with this. This is what we've observed in
           | production too. Embedding images makes the RAG a lot more
           | robust to the "inner workings" of a document.
        
         | ArnavAgrawal03 wrote:
         | For HTML, in a lot of cases, using the tags to chunk things
         | better works. However, I've found that when I'm trying to
         | design a page, showing models the actual image of the page
         | leads to way better debugging than just sending the code back.
         | 
         | 1 vs I or 0 vs O are valid issues, but in practice - and
         | there's probably selection bias here - we've seen documents
         | with a ton of diagrams and charts (that are much simpler to
         | deal with as images).
        
       | serjester wrote:
       | There are multiple fundamental problems people need to be aware of.
       | 
       | - LLMs are typically pre-trained on 4k text tokens and then
       | extrapolated out to longer context windows (it's easy to go from
       | 4000 text tokens to 4001). This is not possible with images due
       | to how they're tokenized. As a result, you're out of distribution
       | - hallucinations become a huge problem once you're dealing with
       | more than a couple of images.
       | 
       | - PDFs at 1536 x 2048 use 3 to 5X more tokens than the raw text
       | (ie higher inference costs and slower responses). Going lower
       | results in blurry images.
       | 
       | - Images are inherently a much heavier representation in raw size
       | too, you're adding latency to every request to just download all
       | the needed images.
       | 
       | Their very small benchmark is obviously going to outperform basic
       | text chunking on finance docs heavy with charts and tables. I
       | would be far more interested in seeing an OCR step added with
       | Gemini (which can annotate images) and then comparing results.
       | 
       | An end to end image approach makes sense in certain cases (like
       | patents, architecture diagrams, etc) but it's a last resort.
        
         | ArnavAgrawal03 wrote:
         | You can add OCR with Gemini, and presumably that would lead to
         | better results than the OCR model we compared against. However,
         | it's important to note that then you're guaranteeing that the
         | entire corpus of documents you're processing will go through a
         | large VLM. That can be prohibitively expensive and slow.
         | 
         | Definitely trade-offs to be made here, we found this to be the
         | most effective in most cases.
        
         | pilooch wrote:
         | True but modern models such as gemma3 pan & scan and other
         | tricks such as training from multiple resolutions do alleviate
         | these issues.
         | 
         | An interesting property of the gemma3 family is that increasing
         | the input image size actually does not increase processing
         | memory requirements, because a second stage encoder actually
         | compresses it into fixed size tokens. Very neat in practice.
        
       | jamesblonde wrote:
       | "The results transformed our system, and our query latency went
       | from 3-4s to 30ms."
       | 
       | Ignoring the trade-offs introduced, the MUVERA paper presented a
       | drop of 90% in latency with evidence in the form of a research
       | paper. Yet, you are reporting "99%" drops in latency. Big claims
       | require big evidence.
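
For reference, the "99%" figure follows from the quoted latencies. A
quick check of the arithmetic, taking 3-4s as 3000-4000 ms:

```python
def reduction_pct(before_ms, after_ms):
    """Percentage drop in latency going from before_ms to after_ms."""
    return 100.0 * (before_ms - after_ms) / before_ms


low = reduction_pct(3000, 30)   # starting from 3 s
high = reduction_pct(4000, 30)  # starting from 4 s
print(f"{low:.2f}% to {high:.2f}%")  # prints "99.00% to 99.25%"
```

This only confirms what the quoted numbers imply; it says nothing about
whether they hold up under the trade-offs mentioned above.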
        
       | thor-rodrigues wrote:
       | I spent a good amount of time last year working on a system to
       | analyse patent documents.
       | 
       | Patents are difficult as they can include anything from abstract
       | diagrams, chemical formulas, to mathematical equations, so it
       | tends to be really tricky to prepare the data in a way that later
       | can be used by an LLM.
       | 
       | The simplest approach I found was to "take a picture" of each
       | page of the document, and ask for an LLM to generate a JSON
       | explaining the content (plus some other metadata such as page
       | number, number of visual elements, and so on)
       | 
       | If any complicated image is present, simply ask for the model to
       | describe it. Once that is done, you have a JSON file that can be
       | embedded into your vector store of choice.
       | 
       | I can't speak to the price-to-performance ratio, but this
       | approach seems to be easier and more efficient than what the
       | author is proposing.
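
A minimal sketch of the per-page JSON approach described above. The
`PageRecord` fields mirror the metadata mentioned in the comment (page
number, number of visual elements); `describe_page` is a hypothetical
callable wrapping whichever vision LLM is used, assumed here to return
a JSON string with "content" and "visuals" keys.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class PageRecord:
    page_number: int
    content: str
    num_visual_elements: int = 0
    visual_descriptions: list = field(default_factory=list)


def build_records(page_images, describe_page):
    """Turn rendered page images into JSON-ready records, one per page."""
    records = []
    for number, image in enumerate(page_images, start=1):
        # Hypothetical model call; the response schema is an assumption.
        data = json.loads(describe_page(image))
        visuals = data.get("visuals", [])
        records.append(PageRecord(
            page_number=number,
            content=data["content"],
            num_visual_elements=len(visuals),
            visual_descriptions=visuals,
        ))
    return records


def to_embedding_text(record):
    """Serialize one record to the JSON string that gets embedded."""
    return json.dumps(asdict(record), sort_keys=True)
```

Each string from `to_embedding_text` can then be embedded and stored in
the vector store of choice.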
        
         | cheschire wrote:
         | how often has the model hallucinated the image though?
        
         | monkeyelite wrote:
         | This is a great example of how to use LLMs, thanks.
         | 
         | But it also illustrates to me that the opportunities with LLMs
         | right now are primarily about reclassifying or reprocessing
         | existing sources of value like patent documents. In the 90-00s
         | many successful SW businesses were building databases to
         | replace traditional filing.
         | 
         | Creating fundamentally new collections of value which require
         | upfront investment seems to still be challenging for our
         | economy.
        
         | Adityav369 wrote:
         | You can ask the model to describe the image, but that is
         | inherently lossy. What if it is a chart and the model gets most
         | x, y pairs, but the user asks about a missing "x" or "y" value?
         | Presenting the image at inference is effective since you're
         | guaranteeing that the LLM is able to answer exactly the user's
         | question. The only blocker here becomes how good retrieval is,
         | and that's a smaller problem to solve. This approach allows us
         | to only solve for passing in relevant context, the rest is
         | taken care of by the LLM, otherwise the problem space expands
         | to correct OCR, parsing, and getting all possible descriptions
         | to images from the model.
        
       ___________________________________________________________________
       (page generated 2025-07-21 23:00 UTC)