[HN Gopher] Show HN: Benchmarking VLMs vs. Traditional OCR
       ___________________________________________________________________
        
       Show HN: Benchmarking VLMs vs. Traditional OCR
        
       Vision models have been gaining popularity as a replacement for
       traditional OCR, especially with Gemini 2.0 becoming cost
       competitive with the cloud OCR platforms. We've been continuously
       evaluating different models since we released the Zerox package
       last year (https://github.com/getomni-ai/zerox), and we wanted to
       put some numbers behind it. So we're open sourcing our internal
       OCR benchmark + evaluation datasets.

       Full writeup + data explorer: https://getomni.ai/ocr-benchmark
       Github: https://github.com/getomni-ai/benchmark
       Huggingface: https://huggingface.co/datasets/getomni-ai/ocr-benchmark

       A couple of notes on the methodology:

       1. We are using JSON accuracy as our primary metric. The end goal
       is to evaluate how well each OCR provider can prepare the data
       for LLM ingestion.

       2. This methodology differs from a lot of OCR benchmarks because
       it doesn't rely on text similarity. We believe text-similarity
       measurements are heavily biased towards the exact layout of the
       ground-truth text, and penalize correct OCR that has slight
       layout differences.

       3. Every document goes Image => OCR => Predicted JSON, and we
       compare the predicted JSON against the annotated ground-truth
       JSON. The VLMs are capable of Image => JSON directly, but here we
       are primarily trying to measure OCR accuracy. We're planning to
       release a separate report on direct JSON accuracy next week. A
       rough sketch of this comparison step is below.
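       To make the metric concrete, here is a minimal sketch of
       field-level JSON accuracy, assuming flattened key paths and
       exact-match scoring. This is an illustration only, not
       necessarily the exact scoring used in the benchmark repo.

           def flatten(obj, prefix=""):
               """Flatten nested dicts/lists into {"a.b[0]": value} pairs."""
               items = {}
               if isinstance(obj, dict):
                   for key, value in obj.items():
                       path = f"{prefix}.{key}" if prefix else key
                       items.update(flatten(value, path))
               elif isinstance(obj, list):
                   for i, value in enumerate(obj):
                       items.update(flatten(value, f"{prefix}[{i}]"))
               else:
                   items[prefix] = obj
               return items

           def json_accuracy(predicted: dict, ground_truth: dict) -> float:
               """Fraction of ground-truth fields reproduced exactly."""
               pred, truth = flatten(predicted), flatten(ground_truth)
               if not truth:
                   return 1.0
               hits = sum(1 for k, v in truth.items() if pred.get(k) == v)
               return hits / len(truth)

           # Example: a prediction that gets one of two fields right.
           truth = {"invoice_number": "INV-001", "total": 1250.0}
           pred = {"invoice_number": "INV-001", "total": 1200.0}
           print(json_accuracy(pred, truth))  # 0.5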
       This is a continuous work in progress! There are at least 10
       additional providers we plan to add to the list. The next big
       roadmap items are:

       - Comparing OCR vs. direct extraction. Early results here show a
         slight accuracy improvement, but it's highly variable with page
         length.

       - A multilingual comparison. Right now the evaluation data is
         English only.

       - A breakdown of the data by type (best model for handwriting,
         tables, charts, photos, etc.).
        
       Author : themanmaran
       Score  : 24 points
       Date   : 2025-02-20 18:49 UTC (3 days ago)
        
       | fzysingularity wrote:
       | What VLMs do you use when you're listing OmniAI - is this mostly
       | wrapping the model providers, like your zerox repo?
        
       | betula_ai wrote:
       | Thank you for sharing this. Some of the other public models that
       | we can host ourselves may perform better in practice than the
       | models listed - e.g. Qwen 2.5 VL
       | https://github.com/QwenLM/Qwen2.5-VL?tab=readme-ov-file
        
       | EarlyOom wrote:
       | OCR seems to be mostly solved for 'normal' text laid out
       | according to Latin alphabet norms (left to right, normal spacing
       | etc.), but would love to see more adversarial examples. We've
       | seen lots of regressions around faxed or scanned documents where
       | the text boxes may be slightly rotated (e.g.
       | https://www.cad-notes.com/autocad-tip-rotate-multiple-texts-...)
       | not to mention
       | handwriting and poorly scanned docs. Then there's contextually
       | dependent information like X-axis labels that are implicit from a
       | legend somewhere, so it's not clear, even with the bounding
       | boxes, what the numbers refer to. This is where VLMs really
       | shine: they can extract the text, then use similar examples from
       | the page to map it onto the correct output values when the
       | bounding box doesn't provide this for free.
        
         | codelion wrote:
         | That's a great point about the limitations of traditional OCR
         | with rotated or poorly scanned documents. I agree that VLMs
         | really shine when it comes to understanding context and
         | extracting information beyond just the text itself. It's pretty
         | cool how they can map implicit relationships, like those X-axis
         | labels you mentioned.
        
       | jasonjmcghee wrote:
       | A big takeaway for me is that Gemini Flash 2.0 is a great
       | solution for OCR, considering accessibility, cost, accuracy, and
       | speed.
       | 
       | It also has a 1M token context window, though from personal
       | experience it seems to work better the less of that context you
       | actually use.
       | 
       | Google's models seem to have been slowly improving. It wasn't so
       | long ago that I completely dismissed them.
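       | 
       | For anyone curious, a minimal sketch of what an OCR-style call
       | looks like with the google-generativeai Python SDK (the model
       | id, prompt, and file name here are illustrative):
       | 
       |     import google.generativeai as genai
       |     import PIL.Image
       | 
       |     genai.configure(api_key="YOUR_API_KEY")
       |     model = genai.GenerativeModel("gemini-2.0-flash")
       | 
       |     # Send the page image plus a transcription prompt.
       |     page = PIL.Image.open("scanned_page.png")
       |     response = model.generate_content(
       |         ["Transcribe all the text on this page as markdown.", page]
       |     )
       |     print(response.text)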
        
         | shawabawa3 wrote:
         | And from my personal experience, Gemini 2.0 Flash vs. 2.0 Pro
         | is not even close.
         | 
         | I had Gemini 2.0 Pro read my entire handwritten, stain-covered,
         | half-English, half-French family cookbook perfectly the first
         | time.
         | 
         | It's _crazy_ good. I had it output the whole thing in LaTeX
         | format to generate a printable document immediately, too.
        
       ___________________________________________________________________
       (page generated 2025-02-23 23:00 UTC)