[HN Gopher] Show HN: Benchmarking VLMs vs. Traditional OCR
___________________________________________________________________
Show HN: Benchmarking VLMs vs. Traditional OCR
Vision models have been gaining popularity as a replacement for
traditional OCR, especially with Gemini 2.0 becoming cost-
competitive with the cloud platforms. We've been continuously
evaluating different models since we released the Zerox package
last year (https://github.com/getomni-ai/zerox), and we wanted to
put some numbers behind it. So we're open sourcing our internal OCR
benchmark + evaluation datasets.

Full writeup + data explorer: https://getomni.ai/ocr-benchmark
Github: https://github.com/getomni-ai/benchmark
Huggingface: https://huggingface.co/datasets/getomni-ai/ocr-benchmark

A couple of notes on the methodology:

1. We are using JSON accuracy as our primary metric. The end goal
is to evaluate how well each OCR provider can prepare the data for
LLM ingestion.

2. This methodology differs from a lot of OCR benchmarks because it
doesn't rely on text similarity. We believe text similarity
measurements are heavily biased towards the exact layout of the
ground truth text, and penalize correct OCR that has slight layout
differences.

3. Every document goes Image => OCR => Predicted JSON, and we
compare the predicted JSON against the annotated ground truth JSON.
While the VLMs are capable of going Image => JSON directly, we are
primarily trying to measure OCR accuracy here. We're planning to
release a separate report on direct JSON accuracy next week.

This is a continuous work in progress! There are at least 10
additional providers we plan to add to the list. The next big
roadmap items are:

- Comparing OCR vs. direct extraction. Early results here show a
slight accuracy improvement, but it's highly variable with page
length.
- A multilingual comparison. Right now the evaluation data is
English only.
- A breakdown of the data by type (best model for handwriting,
tables, charts, photos, etc.)
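As a rough illustration of the JSON-accuracy idea (a hypothetical
sketch, not the benchmark's actual scoring code; all function and
key names below are made up), one could flatten both the predicted
and ground-truth JSON into dot-separated key paths and score the
fraction of ground-truth leaf fields the prediction matches:

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into {key_path: leaf_value} pairs."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Leaf value: record it under its accumulated path.
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten(value, path))
    return flat

def json_accuracy(predicted, ground_truth):
    """Fraction of ground-truth leaf fields the prediction gets right."""
    truth = flatten(ground_truth)
    pred = flatten(predicted)
    if not truth:
        return 1.0
    correct = sum(1 for path, value in truth.items()
                  if pred.get(path) == value)
    return correct / len(truth)
```

Because the comparison happens on extracted fields rather than raw
text, a prediction with a different page layout but the same field
values would still score perfectly, which is the point of a
layout-insensitive metric.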
Author : themanmaran
Score : 24 points
Date : 2025-02-20 18:49 UTC (3 days ago)
(HTM) web link (getomni.ai)
(TXT) w3m dump (getomni.ai)
| fzysingularity wrote:
| What VLMs are you using when you list OmniAI - is this mostly
| wrapping the model providers, like your zerox repo?
| betula_ai wrote:
| Thank you for sharing this. Some of the other public models that
| we can host ourselves may in practice perform better than the
| models listed - e.g. Qwen 2.5 VL:
| https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file
| EarlyOom wrote:
| OCR seems to be mostly solved for 'normal' text laid out
| according to Latin alphabet norms (left to right, normal spacing
| etc.), but would love to see more adversarial examples. We've
| seen lots of regressions around faxed or scanned documents where
| the text boxes may be slightly rotated (e.g. https://www.cad-
| notes.com/autocad-tip-rotate-multiple-texts-...) not to mention
| handwriting and poorly scanned docs. Then there's contextually
| dependent information, like X-axis labels that are implicit from
| a legend somewhere, so it's not clear even with the bounding
| boxes what the numbers refer to. This is where VLMs really shine:
| can extract text then use similar examples from the page to map
| them into their output values when the bounding box doesn't
| provide this for free.
| codelion wrote:
| That's a great point about the limitations of traditional OCR
| with rotated or poorly scanned documents. I agree that VLMs
| really shine when it comes to understanding context and
| extracting information beyond just the text itself. It's pretty
| cool how they can map implicit relationships, like those X-axis
| labels you mentioned.
| jasonjmcghee wrote:
| A big takeaway for me is that Gemini Flash 2.0 is a great
| solution for OCR, considering accessibility, cost, accuracy, and
| speed.
|
| It also has a 1M token context window, though from personal
| experience it seems to perform better with smaller inputs.
|
| Seems like Google models have been slowly improving. It wasn't so
| long ago I completely dismissed them.
| shawabawa3 wrote:
| And from my personal experience, Gemini 2.0 Flash vs. 2.0 Pro
| is not even close.
|
| I had Gemini 2.0 Pro read my entire handwritten, stain-covered,
| half-English, half-French family cookbook perfectly the first
| time.
|
| It's _crazy_ good. I had it output the whole thing in LaTeX
| format to generate a printable document immediately too.
___________________________________________________________________
(page generated 2025-02-23 23:00 UTC)