[HN Gopher] Benchmarking vision-language models on OCR in dynami...
___________________________________________________________________
Benchmarking vision-language models on OCR in dynamic video
environments
Author : ashu_trv
Score : 131 points
Date : 2025-02-14 07:26 UTC (15 hours ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| nolok wrote:
| I have lots of customer files, and I've looked around among
| all these AI tools for something, paid or self-hosted or
| whatever, where I point it at a folder of xlsx and pdf files
| and can then query "What's the end date of M Smith's contract"
| or "How much does M Smith still owe". I've been very
| disappointed: it's either very complicated, or they break down
| with non-text-based PDFs, or...
|
| It feels to me that if you need to provide a schema and
| preprocess the data and this and that, then in the end all AI
| provides is a way to do some SQL in natural language. Yes,
| that's better, but it doesn't remove the actual pain point if
| you're a tech user.
|
| Then again maybe I'm wrong, didn't find the right tool or didn't
| understand it.
|
| Is what I'm looking for something that actually exists (and
| works, not just on simple cases)?
| fhd2 wrote:
| I worked on this a bit 1-2 years ago. Back then, LLMs weren't
| really up to the task, but I found them OK for suggestions
| that a human double-checks. That brings us to the Ironies of
| Automation, though (human oversight of automation via a review
| process doesn't really work; it's a paper worth reading).
|
| We tried several dedicated services for extracting structured
| data and factoids like that from documents: first Google
| Document AI, then a dedicated provider focusing solely on our
| niche. Back then, that gave the best results.
|
| There wasn't enough budget to go deeper into this, and we just
| reverted to doing it manually. But I think a really cool way
| to do this would be a user-friendly UI where users can see
| suggestions and the text snippets they were extracted from as
| they skim through the document, with a simple way to modify
| and accept them. I think that would scale the process quite a
| bit: basically, focusing the human's attention on the relevant
| parts of the document.
|
| I haven't worked in this space since then, but I'm pretty
| bearish on fully automated fact extraction. Getting stuff in
| contracts and invoices wrong is typically not acceptable. I
| think a solid human-in-the-loop approach is probably still the
| way to go.
| tpm wrote:
| I'm not completely up to date, but a few months ago Qwen2-VL
| (runnable locally) was able to read text from images
| perfectly. So I'd say you would still need to preprocess that
| folder into text to get any reasonable query speed, but after
| that, if you feed the data to an LLM with a long enough
| context, it should just work. If, on the other hand, it's too
| much data and the LLM is required to use tools, then it is
| indeed still too soon. But it is coming.
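|
| For the preprocessing step, running Qwen2-VL locally looks
| roughly like this (a sketch following the Hugging Face model
| card; the file path and prompt are placeholders):
|
|     # Sketch: local OCR of one image with Qwen2-VL (7B).
|     from transformers import (AutoProcessor,
|                               Qwen2VLForConditionalGeneration)
|     from qwen_vl_utils import process_vision_info
|
|     model_id = "Qwen/Qwen2-VL-7B-Instruct"
|     model = Qwen2VLForConditionalGeneration.from_pretrained(
|         model_id, torch_dtype="auto", device_map="auto")
|     processor = AutoProcessor.from_pretrained(model_id)
|
|     messages = [{"role": "user", "content": [
|         {"type": "image", "image": "file:///docs/scan_001.png"},
|         {"type": "text",
|          "text": "Transcribe all text in this image."},
|     ]}]
|     text = processor.apply_chat_template(
|         messages, tokenize=False, add_generation_prompt=True)
|     images, videos = process_vision_info(messages)
|     inputs = processor(text=[text], images=images,
|                        videos=videos, padding=True,
|                        return_tensors="pt").to(model.device)
|     out = model.generate(**inputs, max_new_tokens=512)
|     # Decode only the newly generated tokens.
|     print(processor.batch_decode(
|         out[:, inputs["input_ids"].shape[1]:],
|         skip_special_tokens=True)[0])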
| silveraxe93 wrote:
| Posted 4 days ago:
|
| > Three state of the art VLMs - Claude-3, Gemini-1.5, and GPT-4o
|
| Literally none of those are state of the art. Academia is
| completely unprepared to deal with the speed at which AI
| develops. This is extremely common in research papers.
|
| That's literally in the abstract. If I can see a completely wrong
| sentence 5 seconds into reading the paper, why should I read the
| rest?
| lisnake wrote:
| They may have been SotA at the time of writing.
| silveraxe93 wrote:
| Sure, but they posted this 4 days ago. The minimum I'd expect
| for quality research is for them to skim the abstract before
| posting and change that line to:
|
| "Models from leading AI labs" or similar. Leaving it like now
| signals either sloppiness or dishonesty
| michaelt wrote:
| What models would you recommend instead, for sophisticated OCR
| applications?
|
| Honestly I thought Claude-3 and GPT-4o were some of the newest
| major models with vision support, and that models like o1 and
| deepseek were more reasoning-oriented than OCR-oriented.
| silveraxe93 wrote:
| For Google, definitely flash-2.0; it's a way better model.
| GPT-4o is kind of dated now. o1 is the one I'd pick for
| OpenAI; it's basically their "main" model now.
|
| I'm not that familiar with Claude for vision. I don't think
| Anthropic focuses on that. But the 3.5 family of models is way
| better. If 3.5 Sonnet supports vision, that's what I'd use.
| thelittleone wrote:
| Anthropic has a beta endpoint for PDFs which has produced
| impressive results for me with long and complex PDFs (tables,
| charts, etc.).
| diggan wrote:
| > For Google, definitely flash-2.0;
|
| It was literally launched February 5th, ~10 days ago. I'm no
| researcher, and I know "academia moves slow" is of course true
| too, but I don't think we can expect research papers to
| include things that were probably launched after the reviews
| of said paper were finished.
|
| Maybe papers aren't the right approach here at all, but it
| doesn't feel like a fair complaint that they don't include
| models released less than two weeks ago.
| silveraxe93 wrote:
| It was officially launched 10 days ago, but has been
| openly available for way longer.
|
| Also, this is arxiv, the website that's explicitly about
| posting research before peer review.
| diggan wrote:
| > It was officially launched 10 days ago, but has been
| openly available for way longer.
|
| So for how long? How long did the papers you've written in
| the past take to write? AFAIK, it takes some time.
|
| And peer review is not the only review a paper goes through,
| and it wasn't the review I was referring to.
| silveraxe93 wrote:
| Honestly? I don't know how long it's been available. But I do
| know it's been some time already. Enough to be aware of it
| when posting this on arxiv.
|
| I'm not even disagreeing that it takes time to write
| papers, and it's "common" for this to happen. But it's
| just more evidence for what I said in my original
| comment:
|
| > Academia is completely unprepared to deal with the speed at
| which AI develops
| guyomes wrote:
| My anecdotal tests and several benchmarks suggest that
| Qwen2-VL-72b [0] is better than the tested models (even
| better than Claude 3.5 Sonnet), notably for OCR applications.
| It has been available since October 2024.
|
| [0]: https://huggingface.co/Qwen/Qwen2-VL-72B-Instruct
| eptcyka wrote:
| The speed of publishing is just too slow. If you want to apply
| any kind of scientific rigor and have your peers check what
| you're doing (not even doing a full peer review), things take
| more time than just posting on blogs and iterating.
| _stillmind wrote:
| The paper says, "GPT-4o achieves the highest overall accuracy,
| while Gemini-1.5 Pro demonstrates the lowest word error rate."
| Saying Gemini "beats everyone" in this benchmark is misleading.
| nolist_policy wrote:
| Notably, they tested Gemini-1.5 Pro while Gemini 2.0 is another
| step up.[1]
|
| Between this and their 1M token context it is getting hard to
| ignore Google's models.
|
| [1] https://news.ycombinator.com/item?id=42952605
| spwa4 wrote:
| This looks a lot like "compared to a bunch of approaches that
| are 10 years behind (non-transformer, vision-only models), and
| to people who aren't trying (aren't optimizing for OCR),
| Google is doing real well".
|
| EasyOCR is LSTM-CTC from 2007, RapidOCR is a ConvNet approach
| from 2021, and both are focused on speed. Both will vastly
| outperform almost any transformer model, and certainly a big
| one, on speed and memory usage, but they aren't state of the
| art on accuracy. This has been well known for a decade at this
| point, and for two decades in the case of LSTM-CTC.
|
| Plus, I must say the GPT-4o results look a lot saner. "COCONUT"
| (GPT-4o) vs "CONU CNBC" (Gemini) vs Ground Truth "C CONU CNBC".
| And, obviously the ground truth should be "COCONUT MILK" (the
| word milk is almost entirely out of the picture, but is still the
| right answer that a human would give). The "C CONU" comes from
| the first O of COCONUT being somewhat obscured by a drawing of
| ... I don't know what the hell that is. It's still very obvious
| it's meant to be "COCONUT MILK", so the GPT-4o answer is still
| not quite perfect, but heaps better than all the others.
|
| Now this looks very much like it might be temperature related,
| and I can find nothing in the paper about changing the
| temperature, which is IMHO a very big gap. Temperature gives
| transformer models more freedom to choose more creative
| answers; the better performance of GPT-4o might well be the
| result of such a creative choice, and might also explain why
| Gemini tries so hard to stay so very close to the ground
| truth. It's still quite the accomplishment to succeed, but
| GPT-4o is still better.
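|
| For what it's worth, pinning the temperature is a one-liner in
| most APIs; a minimal sketch with the OpenAI SDK (not the
| paper's actual harness, and the prompt is a placeholder):
|
|     # Sketch: near-deterministic decoding for a benchmark run.
|     from openai import OpenAI
|
|     client = OpenAI()
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         temperature=0,  # removes most of the sampling freedom
|         messages=[{"role": "user",
|                    "content": "Transcribe the visible text."}],
|     )
|     print(resp.choices[0].message.content)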
| yorwba wrote:
| What would you say is currently the most accurate OCR solution
| if you're not concerned about speed and memory usage?
| OJFord wrote:
| Not GP, but it depends what you mean by accuracy. If you want
| inference like the 'coconut milk' case described, then
| obviously an LLM. If you want accurate as-written
| transcription, then I don't know the state of the art, but
| it'll be something purpose-built for CV & handwriting
| recognition.
|
| It'll also depend on whether you care about tabular data,
| whether a 'minor' numerical error (like 0 & 8 occasionally
| mismatched) is significantly worse than a 'typo', as it were,
| in recognising a word, etc.
| spwa4 wrote:
| Accuracy should always be measured against the answer you
| want, which is the most useful answer for applications. That
| is "coconut milk", not "coconut cnbc". Maybe "cnbc" should
| even be included, but it definitely shouldn't replace the word
| "milk" in that location.
| infecto wrote:
| There are lots of factors to rank on, but generally speaking
| I don't find any of the open-source options usable. They all
| either take a long time to tune or are just not accurate
| enough. Commercial services from one of the cloud players have
| hit the sweet spot for me.
| guyomes wrote:
| For handwritten texts, the tool that works best for me is
| Qwen2.5-VL-72b [0]. It is also available online [1]. I'm
| surprised that it is not mentioned in the article since even
| the previous model (Qwen2-VL-72b) was better than the other
| VLMs I tried for OCR on handwritten texts.
|
| [0]: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct
|
| [1]: https://chat.qwenlm.ai
| driscoll42 wrote:
| So, I did some OCR research early last year, that didn't
| include any VLMs, on some 1960s era English scanned documents
| with a mix of typed and handwritten (about 80/20), and here's
| what I found (in terms of cosine similarity):
|     Model             Overall | Handwritten |  Typed
|     Google Vision:     98.80% |      93.29% | 99.37%
|     Amazon Textract:   98.80% |      95.37% | 99.15%
|     surya:             97.41% |      87.16% | 98.48%
|     azure:             96.09% |      92.83% | 96.46%
|     trocr:             95.92% |      79.04% | 97.65%
|     paddleocr:         92.96% |      52.16% | 97.23%
|     tesseract:         92.38% |      42.56% | 97.59%
|     nougat:            92.37% |      89.25% | 92.77%
|     easy_ocr:          89.91% |      35.13% | 95.62%
|     keras_ocr:         89.70% |      41.34% | 94.71%
|
| Handwritten is a weighted average of handwritten and typed. I
| also computed Jaccard and Levenshtein distances, but the
| results were similar enough that I'm leaving them out for the
| sake of space.
|
| Overall, if you want the _best_: if you're an enterprise, just
| use whichever of AWS/GCP/Azure you're on; if you're an
| individual, pick between those. While some of the open-source
| solutions do quite well, surya took 188 seconds to process 88
| pages on my RTX 3080, while the cloud ones took only a few
| seconds to upload the docs and download the results. But if
| you do want open source, seriously consider surya, tesseract,
| and nougat depending on your needs. Surya is the best overall,
| while nougat was pretty good at handwriting. Tesseract is just
| blazingly fast, from 121-200 seconds depending on whether you
| use tessdata-fast or tessdata-best, but that's CPU-based and
| trivially parallelizable; on my 5950X using all the cores, it
| took only 10 seconds to run through all 88 pages.
|
| But really, you need to generate some of your own sample test
| data and examples and run them through the models to see
| what's best. Given, frankly, how little this paper tested, I
| really should redo my study, add VLMs, and write a small blog
| post/paper; I've been meaning to for years now.
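|
| If you want to roll your own scoring, the cosine-similarity
| part is only a few lines; here's a sketch with scikit-learn
| character n-grams (my exact vectorization may have differed):
|
|     # Sketch: cosine similarity of OCR output vs ground truth.
|     from sklearn.feature_extraction.text import CountVectorizer
|     from sklearn.metrics.pairwise import cosine_similarity
|
|     def ocr_similarity(predicted: str, truth: str) -> float:
|         vec = CountVectorizer(analyzer="char", ngram_range=(2, 3))
|         m = vec.fit_transform([predicted, truth])
|         return float(cosine_similarity(m[0], m[1])[0, 0])
|
|     print(ocr_similarity("C CONU CNBC", "COCONUT MILK CNBC"))
|
| The same harness then just loops over pages and models.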
| pqdbr wrote:
| I've been looking for handwriting benchmarks for a while and
| would love to read that blog post.
| michaelt wrote:
| _> And, obviously the ground truth should be "COCONUT MILK"
| (the word milk is almost entirely out of the picture, but is
| still the right answer that a human would give)._
|
| Maybe? Seems application-dependent to me.
|
| If you're OCRing checks or invoices or car license plates or
| tables in PDF documents, you might prefer a model that's more
| conservative when it comes to filling in the blanks!
|
| And even when recognising packaged coconut products, you've
| also got your organic coconut oil, organic coconut milk with
| reduced fat, organic coconut cream, organic coconut flakes,
| organic coconut desiccated chips, organic coconut and
| strawberry bites, organic coconut milk powder, organic coconut
| milk block, organic coconut milk 9% fat, organic coconut
| yoghurt, organic coconut milk long life barista-style drink,
| organic coconut kefir, organic coconut banana and pear baby
| food pouches, organic coconut banana and pineapple smoothie,
| organic coconut scented body wash and so on.
| infecto wrote:
| It's not obvious at all; it depends on the use case.
|
| You also didn't really counter the paper. Sure, the OCR models
| are old, but what should they have tested instead? Are there
| better open-source OCR models available that would have made
| for a fairer comparison?
| speerer wrote:
| I think it's useful to add the context that CNBC is correct and
| does appear at the top right of that picture. CNBC is not a
| mis-transcription of MILK, and the letters M, I, L and K are
| not actually visible in the picture.
| pilooch wrote:
| The question is: what is OCR for? If it's to answer questions
| and work with a document, then VLMs actually contain self-
| correcting mechanisms. That is, the end-to-end image + text
| input to text output is statistically grounded, by training.
| So the question to ask is what you need OCR for. Feeding an
| LLM? Then feed the VLM instead. Some other usage? Well, to be
| decided. But from now on, CTC and LSTMs are done with, because
| VLMs do everything: finding the area to read, reading,
| embedding, and answering. OCR was a mid-step; it's going away.
| dylan604 wrote:
| >The "C CONU" comes from the first O of COCONUT being somewhat
| obscured by a drawing of ... I don't know what the hell that
| is.
|
| It's clearly the stem from the bell pepper in front of the can.
| You're complaining that the software is lesser than a human,
| yet it appears your human needs better training in
| understanding context too.
| spwa4 wrote:
| Why would a can of coconut milk have a drawing of a bell
| pepper obscuring the writing? How does that make ANY sense at
| all?
| dylan604 wrote:
| Yup, definitely the human needs better context training.
| Then again, for an account that's only 6 months old, it's
| possible you're not really a human.
|
| Edit to insert: WHAT DRAWING? There's a can of coconut milk
| that is turned so the word coconut is not fully visible. In
| front of that can is a real red bell pepper with a green
| stem still attached that is partially obstructed by the
| bowls in the foreground. What you're attempting to claim as
| a drawing is just a real life object in the table top
| setup. Since this is a CNBC branding image, I'm assuming
| this is a still frame from a video clip. Based on being a
| video type person, this view probably changes based on time
| with different things being obstructed/revealed by the
| camera's movement.
|
| Your RLHF could really use some improvement. To be this
| argumentative when you're clearly wrong is quite amusing, but
| not in an entertaining way. It just reinforces my sentiments
| about the joke this industry has become.
| breadislove wrote:
| The systems they tested against the LLMs are mostly used as
| part of a larger system. A fairer comparison would use
| something like MinerU [1] with proper benchmarks such as OHR-
| Bench [2] and Reducto's table bench [3]. This paper is really
| bad...
|
| [1]: https://github.com/opendatalab/MinerU [2]:
| https://github.com/opendatalab/OHR-Bench [3]:
| https://github.com/reductoai/rd-tablebench
| croes wrote:
| Does everyone also need huge data centers and lots of energy?
| retskrad wrote:
| People say CPU benchmarks are meaningless (what does even
| 10-15% better mean in practice?), but LLM benchmarks are even
| more of a mystery. The same LLM will produce a novel output
| every time you give it the exact same prompt.
| stavros wrote:
| No it won't.
| casey2 wrote:
| It's not surprising that google has such a huge mote with their
| highly illegal and unethical activity of scanning and digitizing
| billions of pages of copyrighted work to train their models. Oh
| wait, google books search was fair use. I got it confused with
| LLMs.
| Terretta wrote:
| > _It 's not surprising that google has such a huge mote with
| their highly illegal and unethical activity of scanning and
| digitizing billions of pages of copyrighted work to train their
| models._
|
| Excellent Freudian slip (proverb allusion suggesting Google has
| a blind spot, while discussing OCR).
| alberto-m wrote:
| It seems to me that the software is occasionally doing better
| than the supposed "ground truth" (who annotated that?), and I
| don't understand why the authors blindly follow the latter, or
| why the reviewers apparently approved it.
|
| In Figure 1 the authors complain that Gemini "misreads 'ss ety!'
| as 'ness ety!'", but even a casual look at the image reveals that
| Gemini's reading is correct.
|
| In Figure 11, they state that Claude is "altering the natural
| sequence of ideas in the ground truth", except that the sequence
| in the ground truth makes no sense, while Claude's order does
| (only the initial "the" is misplaced).
| virgilp wrote:
| I think the goal here was to convince the AI to actually read
| characters ("OCR") rather than speculate about what might be
| written on the paper/in the image. Hence the ground truth
| explicitly removes the letters and word parts that are
| obscured, even when they can be guessed.
|
| TBH, I'm not sure it's a good test. I can somewhat see the
| argument against "BASELINE" for ground truth - the underlying
| text might have been BASE(IAKS), for all we know. But IMO the
| ground truth should have been "Direction & ess" at the very
| least. And, more significantly than that, it's a fake scenario
| that we don't care about in practice. Why use that? Use
| invoices with IDs that sound like words but are not. Use
| license plates and the like. Heck, use large prints of random
| characters mixed with handwritten gibberish.
|
| For at least some of the images they used, the expectation of
| a good text reader is actually to understand context and not
| blindly OCR. Take "Trader Joe's": we *know* that's an 's', but
| only from outside context; from OCR alone, it might have been
| an 8, there's really no way to tell. Why accept the "s" in the
| ground truth, but reject the full word "Coconut" (which is
| obviously what is written on the can, even if partially
| obscured)? Furthermore, a human would know what kinds of
| products are sold by Trader Joe's and, coupling that with the
| visible tops of the letters "M I L", would deduce that it's
| Coconut Milk. So really, Claude nailed that one.
| 8organicbits wrote:
| I think there are multiple possible goals we could imagine in
| text recognition tasks. Should the AI guess the occluded text?
| That could be really helpful in some instances. But if the goal
| is OCR, then it should only recognize characters optically, and
| any guessing at occluded characters is undesired.
| abecedarius wrote:
| Maybe a better goal is some representation for "COCONUT [with
| these 3 letters occluded]". Then the consumer might combine
| this with other evidence about the occluded parts, or review
| it if questions come up about how accurate the OCR was in
| this case.
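|
| Something along these lines, say (a sketch; the field names
| are made up for illustration):
|
|     # Sketch: OCR output that keeps occlusion info around.
|     from dataclasses import dataclass
|
|     @dataclass
|     class OcrToken:
|         guess: str           # best guess at the full token
|         visible: str         # characters actually seen
|         occluded: list[int]  # indices inferred, not read
|         confidence: float
|
|     token = OcrToken(guess="COCONUT", visible="C CONU",
|                      occluded=[1, 6], confidence=0.9)
|
| A consumer can then decide whether to trust the guess or fall
| back to the visible characters.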
| bufferoverflow wrote:
| In the very first example (occluded text) the "ground truth" is
| just incorrect.
| Vt71fcAqt7 wrote:
| >reviewers apparently approved that.
|
| What reviewers?
| dimatura wrote:
| Re: reviewers, I don't see any mention of this being accepted
| into a peer-reviewed venue. Peer review isn't necessary for
| arxiv submissions.
| belter wrote:
| Gemini is so bad I gladly cancelled my paid account. But hey,
| maybe AI and 50B dollars is what was needed to get a better
| OCR...
| bobjordan wrote:
| Really? This surprises me, because I use OpenAI Pro for $200
| per month and I still fall back to my $20-per-month Gemini
| account a lot these days. I like the new 2.0 experimental
| model's speed and how it defaults to diving straight into
| producing usable code, whereas OpenAI's pro mode will spend a
| few minutes giving me an initial answer that beats around the
| bush at a much higher level. So my workflow has evolved into
| using Gemini to iterate on my initial thinking and frame out
| requirements and first-draft code. Then, when I have about
| 2,000-3,000 lines for a detailed initial pro-mode prompt, I
| send that to OpenAI pro mode, and that's where it shines. But
| I really like starting with the Gemini 2.0 model first. The
| main thing I dislike about Gemini is that I often need to tell
| it "please continue" when it reaches its output limit, but it
| nearly always picks up right where it left off and continues.
| This is critical when using Gemini.
| malanj wrote:
| If you're wondering how they prompt the models:
|
| "Perform OCR on this image. Return only the text found in the
| image as a single continuous string without any newlines,
| additional text, or commentary. Separate words with single
| spaces. For any truncated, partially visible, or occluded text,
| include only the visible portions without attempting to complete
| or guess the full text. If no text is present, return empty
| double quotes."
|
| Found in: https://github.com/video-db/ocr-
| benchmark/blob/main/prompts....
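|
| Wired into a model API, that prompt looks roughly like this (a
| sketch with the OpenAI SDK and a base64-encoded frame; the
| repo's actual harness may differ):
|
|     # Sketch: run the paper's OCR prompt on one frame.
|     import base64
|     from openai import OpenAI
|
|     PROMPT = ("Perform OCR on this image. Return only the "
|               "text found in the image as a single continuous "
|               "string ...")  # full prompt as quoted above
|
|     client = OpenAI()
|     with open("frame.png", "rb") as f:  # placeholder frame
|         b64 = base64.b64encode(f.read()).decode()
|
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user", "content": [
|             {"type": "text", "text": PROMPT},
|             {"type": "image_url", "image_url":
|                 {"url": f"data:image/png;base64,{b64}"}},
|         ]}],
|     )
|     print(resp.choices[0].message.content)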
| Terretta wrote:
| TL;DR: For the original object's truth rather than the image's
| truth, this paper shows VLMs are superior, even though the
| prompt shows the authors are "holding it wrong".
|
| Yet another paper where the authors don't address what tokens
| are. It's like publishing _Rolling pin fails at math_ or
| _Calculator fails to turn dough ball into round pizza_.
|
| While I can understand where they're coming from in a desire
| to avoid hallucination when doing letter-for-letter
| transcription from an image, certainly most times you reach
| for OCR you want the _original_ copy, despite damage to its
| representation (paper tears, coffee stains, hands in front of
| it). Turns out token conjunction probability conjectures come
| in handy here!
|
| Whether the image of an object, or the object itself, is
| "ground truth" is an exercise left to the user's goal. Almost
| all use cases would want what was originally written on the
| object, not its presently occluded representation.
| hubraumhugo wrote:
| As someone building in this space, we've found that raw OCR
| accuracy is just one piece (and it's becoming a commodity).
|
| The real challenge is building reliable and accurate ETL
| pipelines (document ingestion from the web, OCR,
| classification, validation, etc.) that work at scale in
| production.
|
| The best products will be defined by everything "non-AI":
| UX, performance, and a human-in-the-loop feedback loop for
| non-techies.
|
| Avoiding over-reliance on specific models also helps. With good
| internal eval data and benchmarks, you can easily switch or fine-
| tune models.
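|
| Concretely, a thin interface plus an internal eval set is most
| of the decoupling; a sketch (the names are illustrative, not a
| real library):
|
|     # Sketch: swap OCR backends behind one interface and
|     # score them on an internal eval set.
|     from typing import Protocol
|
|     class OcrBackend(Protocol):
|         def transcribe(self, image: bytes) -> str: ...
|
|     def eval_accuracy(backend: OcrBackend,
|                       cases: list[tuple[bytes, str]]) -> float:
|         hits = sum(backend.transcribe(img).strip() == truth
|                    for img, truth in cases)
|         return hits / len(cases)
|
| Switching or fine-tuning models then reduces to comparing
| eval_accuracy across backends.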
| mtrovo wrote:
| That's the point of using AI in the first place. If your
| product is just a polished interface on top of a prompt, then
| your moat isn't that strong, and chances are your product will
| be commoditized soon.
|
| By building a good UX and integrating it with other processes
| that require traditional collaboration, you increase the
| chances that replicating your secret sauce is either infeasible
| or too difficult for newcomers to bother.
| HannesWes wrote:
| This looks very interesting. I conducted some explorations of
| whether LLMs can be used to extract information from hand-written
| forms [0][1]. Such a system could allow users to snap pictures of
| forms and other legal documents, automatically extract structured
| information, and use this information to e.g. automatically fill
| out new forms or determine whether the user has the right to a
| government benefit.
|
| The initial results were quite promising, as GPT-4o could
| reliably identify the correct place in the form for the
| information, and moderately reliably extract the values, even if
| the image was blurry or the text was sloppily written. Excited to
| see how Gemini 2.0 would do on this task!
|
| [0] https://arxiv.org/abs/2412.15260
|
| [1]
| https://github.com/hwestermann/AI4A2J_analyzing_images_of_le...
| (code and data)
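|
| The core of the extraction loop is small; a sketch with the
| OpenAI SDK and JSON-mode output (the field names here are
| illustrative, not from our forms):
|
|     # Sketch: pull structured fields out of a scanned form.
|     import base64, json
|     from openai import OpenAI
|
|     client = OpenAI()
|     with open("form.jpg", "rb") as f:  # placeholder scan
|         b64 = base64.b64encode(f.read()).decode()
|
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         response_format={"type": "json_object"},
|         messages=[{"role": "user", "content": [
|             {"type": "text", "text":
|                 "Extract applicant_name, date_of_birth and "
|                 "monthly_income from this form as JSON."},
|             {"type": "image_url", "image_url":
|                 {"url": f"data:image/jpeg;base64,{b64}"}},
|         ]}],
|     )
|     print(json.loads(resp.choices[0].message.content))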
| deivid wrote:
| Are there any "good" OCR models that run in restricted/small
| environments? I'm thinking about local models for phone-sized
| CPUs.
|
| Obviously these models would have lower accuracy, but running
| at all would be nice.
| Terretta wrote:
| See comment above:
|
| https://news.ycombinator.com/item?id=43048326
|
| Throw those into a Google search along with the term iOS or
| Android.
| echelon wrote:
| Are there any benchmarks (speed, accuracy, etc.) for non-OCR use
| cases? I want to label images and videos, but don't really care
| about text.
___________________________________________________________________
(page generated 2025-02-14 23:00 UTC)