[HN Gopher] Don't bother parsing: Just use images for RAG
___________________________________________________________________
Don't bother parsing: Just use images for RAG
Author : Adityav369
Score : 126 points
Date : 2025-07-21 17:16 UTC (5 hours ago)
(HTM) web link (www.morphik.ai)
(TXT) w3m dump (www.morphik.ai)
| pilooch wrote:
| Some colleagues and I implemented exactly this six months ago
| for a French gov agency.
|
| It's open source and available here:
| https://github.com/jolibrain/colette
|
| It's not our primary business so it's just lying there and we
| don't advertise much, but it works, albeit with some tweaks
| needed to make it really efficient.
|
| The true genius though is that the whole thing can be made fully
| differentiable, unlocking the ability to finetune the viz rag on
| targeted datasets.
|
| The layout model can also be customized for fine grained document
| understanding.
| Adityav369 wrote:
| Yeah the fine tuning is definitely the best part.
|
| Often, the blocker becomes high quality eval sets (which I
| guess always is the blocker).
| ted_dunning wrote:
| You don't have a license in your repository top-level. That
| means that nobody who takes licensing at all seriously can use
| your stuff, even just for reference.
| JSR_FDED wrote:
| Great, thanks for sharing your code. Could you please add a
| license so I and others can understand if we're able to use
| it?
| wryun wrote:
| They do have: https://github.com/jolibrain/colette/blob/main/
| pyproject.tom...
|
| I agree it's better to have the full licence at top level,
| but is there a legal reason why this would be inadequate?
| pilooch wrote:
| Good catch, will add it tomorrow. License is Apache2.
| tobyhinloopen wrote:
| This is something I've done as well - I wanted to scan all
| invoices that came into my mail so I just exported ALL
| ATTACHMENTS from my mailbox and used a script to upload them one
| by one, forcing a tool call to extract "is invoice: yes / no" and
| a bunch of fields: invoice lines, company name, date, invoice
| number, etc.
|
| It had a surprisingly high hit rate. It took over 3 hours of LLM
| calls but who cares - It was completely hands-off. I then
| compared the invoices to my bank statements (aka I asked an LLM
| to do it) and it just missed a few invoices that weren't included
| as attachments (like those "click to download" mails). It did a
| pretty poor job matching invoices to bank statements (like "oh
| this invoice is a few dollars off but i'm sure its this
| statement") so I'm afraid I still need an accountant for a while.
|
| "What did it cost"? I don't know. I used a cheap-ish model,
| Claude 3.7 I think.
| taberiand wrote:
| In your use case, for the simple data matching that it gets
| wrong, I think it would be better to have the LLM write code
| that processes the input files (the raw text that it produced
| from images and the bank statements), rather than have the LLM
| try to match up the data in the files itself.
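|
| A minimal sketch of what such generated code could look like,
| assuming the invoices and bank transactions have already been
| extracted into structured records. The Invoice and Transaction
| types, the field names, and the 45-day window are hypothetical
| choices for illustration, not anything from the original setup.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:  # hypothetical record shape
    number: str
    amount: Decimal
    issued: date


@dataclass(frozen=True)
class Transaction:  # hypothetical record shape
    reference: str
    amount: Decimal
    booked: date


def match_invoices(invoices, transactions, max_days=45):
    """Pair each invoice with at most one transaction of the exact same amount.

    Returns (matches, unmatched_invoices). Amounts must be equal to the cent,
    so an invoice that is "a few dollars off" is reported as unmatched.
    """
    window = timedelta(days=max_days)
    remaining = list(transactions)
    matches, unmatched = [], []
    for inv in sorted(invoices, key=lambda i: i.issued):
        candidates = [
            t for t in remaining
            if t.amount == inv.amount
            and timedelta(0) <= t.booked - inv.issued <= window
        ]
        if candidates:
            # Prefer the earliest payment, and never reuse a transaction.
            best = min(candidates, key=lambda t: t.booked)
            remaining.remove(best)
            matches.append((inv, best))
        else:
            unmatched.append(inv)
    return matches, unmatched


invoices = [
    Invoice("INV-1", Decimal("120.00"), date(2025, 1, 3)),
    Invoice("INV-2", Decimal("87.50"), date(2025, 1, 10)),
]
transactions = [
    Transaction("TX-A", Decimal("120.00"), date(2025, 1, 9)),
    Transaction("TX-B", Decimal("85.50"), date(2025, 1, 15)),
]
matches, unmatched = match_invoices(invoices, transactions)
```

| Here INV-2 stays unmatched because TX-B differs by two dollars,
| which leaves the judgment call to a human instead of a guess.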
| abc03 wrote:
| Related question: what is today's best solution for invoices?
| ArnavAgrawal03 wrote:
| This would depend on the exact use case. Feeding in the invoice
| directly to the model is - in my opinion - the best way to
| approach this. If you need to search over them, then directly
| embedding them as images is definitely a strong approach.
| Here's something we wrote explaining the process:
| https://www.morphik.ai/docs/concepts/colpali
| themanmaran wrote:
| Hey we've done a lot of research on this side [1] (OCR vs direct
| image + general LLM benchmarking).
|
| The biggest problem with direct image extraction is multipage
| documents. We found that single page extraction (OCR=>LLM vs
| Image=>LLM) slightly favored the direct image extraction. But
| anything beyond 5 images had a sharp fall off in accuracy
| compared to OCR first.
|
| Which makes sense, long context recall over text is already a
| hard problem, but that's what LLMs are optimized for. Long
| context recall over images is still pretty bad.
|
| [1] https://getomni.ai/blog/ocr-benchmark
| ArnavAgrawal03 wrote:
| That's an interesting point. We've found that for most use
| cases, over 5 pages of context is overkill. Having a small LLM
| conversion layer on top of images also ends up working pretty
| well (i.e. instead of direct OCR, passing batches of 5 images -
| if you really need that many - to smaller vision models and
| having them extract the most important points from the
| document).
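|
| The batching step above can be sketched as follows. This is only
| an illustration: the summarize callback stands in for a call to
| a small vision model, and fake_summarize, condense, and the file
| names are hypothetical.

```python
from typing import Callable, Iterator, List, Sequence


def batches(pages: Sequence[str], size: int = 5) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` pages."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(pages), size):
        yield pages[start:start + size]


def condense(
    pages: Sequence[str],
    summarize: Callable[[Sequence[str]], str],
    size: int = 5,
) -> List[str]:
    """Run `summarize` once per batch and collect the extracted key points."""
    return [summarize(batch) for batch in batches(pages, size)]


def fake_summarize(batch: Sequence[str]) -> str:
    # Placeholder for a real vision-model call (assumption, not a real API).
    return f"summary of {batch[0]}..{batch[-1]}"


page_images = [f"page_{n}.png" for n in range(1, 13)]
summaries = condense(page_images, fake_summarize)
```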
|
| We're currently researching surgery on the cache or attention
| maps for LLMs to have larger batches of images work better.
| Seems like Sliding window or Infinite Retrieval might be
| promising directions to go into.
|
| Also - and this is speculation - I think that the jump in
| multimodal capabilities that we're seeing from models is only
| going to increase, meaning long-context for images is probably
| not going to be a huge blocker as models improve.
| themanmaran wrote:
| This just depends a lot on how well you can parse down the
| context prior to passing to an LLM.
|
| Ex: Reading contracts or legal documents. Usually a 50 page
| document that you can't very effectively cherry pick from.
| Since different clauses or sections will be referenced
| multiple times across the full document.
|
| In these scenarios, it's almost always better to pass the
| full document into the LLM rather than running RAG. And if
| you're passing the full document it's better as text rather
| than images.
| jasonthorsness wrote:
| It makes sense that a lossy transformation (OCR which removes
| structure) would be worse than perceptually lossless (because
| even if the PDF file has additional information, you only see the
| rendered visual). But it's cool and a little surprising that the
| multi-modal models are getting this good at interpreting images!
| emanuer wrote:
| Could someone please help me understand how a multi-modal RAG
| does not already solve this issue?[1]
|
| What am I missing?
|
| Flash 2.5, Sonnet 3.7, etc. always provided me with very
| satisfactory image analysis. And, I might be making this up, but
| to me it feels like some models provide better responses when I
| give them the text as an image, instead of feeding "just" the
| text.
|
| [1] https://www.youtube.com/watch?v=p7yRLIj9IyQ
| ArnavAgrawal03 wrote:
| Multimodal RAG is exactly what we argue for. In their original
| state, though, multivectors (that form the basis for multi-
| modal RAG) are very unwieldy - computing the similarity scores
| is very expensive and so scaling them up in this state is hard.
|
| You need to apply things like quantization, single-vector
| conversions (using fixed dimensional encodings), and better
| indexing to ensure that multimodal RAG works at scale.
|
| That is exactly what we're doing at Morphik :)
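|
| To see why the similarity scores are expensive, here is a
| minimal sketch of multivector scoring, assuming ColBERT-style
| late interaction (MaxSim) as used by ColPali. It is pure Python
| for clarity and is not Morphik's implementation; the toy vectors
| are made up.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query: List[Vector], doc: List[Vector]) -> float:
    """For each query vector take its best dot product against any
    document vector, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc) for q in query)


def rank(query: List[Vector], docs: List[List[Vector]]) -> List[int]:
    """Return document indices, best score first.

    Brute force: every query vector is compared with every vector of
    every document, which is what makes the raw form hard to scale.
    """
    scores = [maxsim(query, doc) for doc in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


query = [[1.0, 0.0], [0.0, 1.0]]
docs = [
    [[0.9, 0.1], [0.2, 0.8]],  # page 0
    [[0.5, 0.5], [0.4, 0.4]],  # page 1
]
order = rank(query, docs)
```

| The number of dot products per query grows with (query vectors)
| x (vectors per page) x (pages), which is the cost that
| quantization, single-vector conversions, and indexing aim to cut.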
| urbandw311er wrote:
| Something just feels a bit off about this piece. It seems to
| labour the point about how "beautiful" or "perfect" their
| solution is a few times too many, to the point where it starts to
| feel more like marketing than any sort of useful technical
| observation.
| programjames wrote:
| I disagree. It feels like something you would say when you
| finally come across the "obviously right" solution, that's
| easier to implement and simpler to describe. As Kolmogorov
| said, the simplest solution is exponentially more correct than
| the others.
| ianbicking wrote:
| Using modern tools I would naturally be inclined to:
|
| 1. Have the LLM see the image and produce a text version using a
| kind of semantic markup (even hallucinated markup)
|
| 2. Use that text for most of the RAG
|
| 3. If the focus (of analysis or conversation) converges on one
| image, include that image in the context in addition to the text
|
| If I use a simple prompt with GPT 4o on the Palantir slide from
| the article I get this:
| https://gist.github.com/ianb/7a380a66c033c638c2cd1163ea7b2e9... -
| seems pretty good!
| ashishb wrote:
| I speak from experience that this is a bad idea.
|
| There are cases where documents contain text with characters
| that look the same in many fonts. For example, 0 and O look
| identical in many fonts. So if you have a doc/xls/PDF/html then
| you lose information by converting it into an image.
|
| For cases like serial numbers, not even humans can distinguish 0
| vs O (or l vs I) by looking at them.
| weego wrote:
| This is within the context of using it as an alternative to
| OCR, which would suffer the same issues, with more duct tape
| and string infrastructure and cost.
| ashishb wrote:
| You can win any race if you can cherry-pick your competitors.
| zffr wrote:
| PDFs don't always contain actual text. Sometimes they just
| contain instructions to draw the letters.
|
| For that reason, IMO rendering a PDF page as an image is a very
| reasonable way to extract information out of it.
|
| For the other formats you mentioned, I agree that it is
| probably better to parse the document instead.
| ArnavAgrawal03 wrote:
| Completely agree with this. This is what we've observed in
| production too. Embedding images makes the RAG a lot more
| robust to the "inner workings" of a document.
| ArnavAgrawal03 wrote:
| For HTML, in a lot of cases, using the tags to chunk things
| better works. However, I've found that when I'm trying to
| design a page, showing models the actual image of the page
| leads to way better debugging than just sending the code back.
|
| 1 vs I or 0 vs O are valid issues, but in practice - and
| there's probably selection bias here - we've seen documents
| with a ton of diagrams and charts (that are much simpler to
| deal with as images).
| serjester wrote:
| There are multiple fundamental problems people need to be aware of.
|
| - LLMs are typically pre-trained on 4k text tokens and then
| extrapolated out to longer context windows (it's easy to go from
| 4000 text tokens to 4001). This is not possible with images due
| to how they're tokenized. As a result, you're out of distribution
| - hallucinations become a huge problem once you're dealing with
| more than a couple of images.
|
| - PDFs at 1536 x 2048 use 3 to 5x more tokens than the raw text
| (i.e. higher inference costs and slower responses). Going lower
| results in blurry images.
|
| - Images are inherently a much heavier representation in raw size
| too, you're adding latency to every request to just download all
| the needed images.
|
| Their very small benchmark is obviously going to outperform basic
| text chunking on finance docs heavy with charts and tables. I
| would be far more interested in seeing an OCR step added with
| Gemini (which can annotate images) and then comparing results.
|
| An end to end image approach makes sense in certain cases (like
| patents, architecture diagrams, etc) but it's a last resort.
| ArnavAgrawal03 wrote:
| You can add OCR with Gemini, and presumably that would lead to
| better results than the OCR model we compared against. However,
| it's important to note that then you're guaranteeing that the
| entire corpus of documents you're processing will go through a
| large VLM. That can be prohibitively expensive and slow.
|
| Definitely trade-offs to be made here, we found this to be the
| most effective in most cases.
| pilooch wrote:
| True but modern models such as gemma3 pan & scan and other
| tricks such as training from multiple resolutions do alleviate
| these issues.
|
| An interesting property of the gemma3 family is that increasing
| the input image size actually does not increase processing
| memory requirements, because a second stage encoder actually
| compresses it into fixed size tokens. Very neat in practice.
| jamesblonde wrote:
| "The results transformed our system, and our query latency went
| from 3-4s to 30ms."
|
| Ignoring the trade-offs introduced, the MUVERA paper presented a
| drop of 90% in latency with evidence in the form of a research
| paper. Yet, you are reporting "99%" drops in latency. Big claims
| require big evidence.
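|
| The arithmetic behind the "99%" figure, worked from the numbers
| quoted above (3-4s before, 30ms after). The helper names are
| just for illustration.

```python
def reduction_pct(before_ms: float, after_ms: float) -> float:
    """Percentage drop in latency going from before_ms to after_ms."""
    return 100.0 * (before_ms - after_ms) / before_ms


def speedup_from_reduction(pct: float) -> float:
    """Speedup factor implied by a given percentage drop in latency."""
    return 100.0 / (100.0 - pct)


low = reduction_pct(3000, 30)    # starting from 3s
high = reduction_pct(4000, 30)   # starting from 4s

claimed_speedup_low = 3000 / 30             # factor implied by 3s -> 30ms
paper_speedup = speedup_from_reduction(90)  # factor implied by a 90% drop
```

| So the quoted numbers correspond to a 99% to 99.25% drop, i.e.
| at least a 100x speedup, versus the 10x implied by a 90% drop.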
| thor-rodrigues wrote:
| I spent a good amount of time last year working on a system to
| analyse patent documents.
|
| Patents are difficult as they can include anything from abstract
| diagrams, chemical formulas, to mathematical equations, so it
| tends to be really tricky to prepare the data in a way that later
| can be used by an LLM.
|
| The simplest approach I found was to "take a picture" of each
| page of the document, and ask for an LLM to generate a JSON
| explaining the content (plus some other metadata such as page
| number, number of visual elements, and so on)
|
| If any complicated image is present, simply ask for the model to
| describe it. Once that is done, you have a JSON file that can be
| embedded into your vector store of choice.
|
| I can't speak to the price-to-performance ratio, but this
| approach seems easier and more efficient than what the author
| is proposing.
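|
| A rough sketch of what such a per-page JSON record could look
| like. The PageRecord type and its field names are hypothetical,
| the content would come from the LLM looking at the page image,
| and the embedding step itself is left out.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PageRecord:  # hypothetical schema, for illustration only
    page_number: int
    content: str  # the model's explanation of the page
    visual_descriptions: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record, adding the count of visual elements."""
        payload = asdict(self)
        payload["num_visual_elements"] = len(self.visual_descriptions)
        return json.dumps(payload, sort_keys=True)

    def embedding_text(self) -> str:
        """Flatten the record into one string to hand to an embedding model."""
        figures = [f"[figure] {d}" for d in self.visual_descriptions]
        return "\n".join([self.content] + figures)


record = PageRecord(
    page_number=3,
    content="Claim 1 describes a valve assembly.",
    visual_descriptions=["Cross-section of the valve housing."],
)
as_json = record.to_json()
text = record.embedding_text()
```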
| cheschire wrote:
| how often has the model hallucinated the image though?
| monkeyelite wrote:
| This is a great example of how to use LLMs, thanks.
|
| But it also illustrates to me that the opportunities with LLMs
| right now are primarily about reclassifying or reprocessing
| existing sources of value like patent documents. In the 90-00s
| many successful SW businesses were building databases to
| replace traditional filing.
|
| Creating fundamentally new collections of value which require
| upfront investment seems to still be challenging for our
| economy.
| Adityav369 wrote:
| You can ask the model to describe the image, but that is
| inherently lossy. What if it is a chart and the model gets most
| x, y pairs, but the user asks about a missing "x" or "y" value?
| Presenting the image at inference is effective since you're
| guaranteeing that the LLM is able to answer exactly the user's
| question. The only blocker here becomes how good retrieval is,
| and that's a smaller problem to solve. This approach allows us
| to only solve for passing in relevant context; the rest is
| taken care of by the LLM. Otherwise the problem space expands
| to correct OCR, parsing, and getting all possible descriptions
| of images from the model.
___________________________________________________________________
(page generated 2025-07-21 23:00 UTC)