[HN Gopher] Show HN: Qwen-2.5-32B is now the best open source OC...
       ___________________________________________________________________
        
       Show HN: Qwen-2.5-32B is now the best open source OCR model
        
        Last week was big for open source LLMs. We got:

        - Qwen 2.5 VL (72b and 32b)
        - Gemma-3 (27b)
        - DeepSeek-v3-0324

        And a couple of weeks ago we got the new mistral-ocr model. We
        updated our OCR benchmark to include the new models. We evaluated
        1,000 documents for JSON extraction accuracy. Major takeaways:

        - Qwen 2.5 VL (72b and 32b) are by far the most impressive. Both
          landed right around 75% accuracy (equivalent to GPT-4o's
          performance). Qwen 72b was only 0.4% above 32b, within the
          margin of error.
        - Both Qwen models surpassed mistral-ocr (72.2%), which is
          specifically trained for OCR.
        - Gemma-3 (27B) only scored 42.9%, which is particularly
          surprising given that its architecture is based on Gemini 2.0,
          which still tops the accuracy chart.

        The dataset and benchmark runner are fully open source. You can
        check out the code and reproduction steps here:

        - https://getomni.ai/blog/benchmarking-open-source-models-for-...
        - https://github.com/getomni-ai/benchmark
        - https://huggingface.co/datasets/getomni-ai/ocr-benchmark
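
        For anyone curious what a single eval roughly looks like: each
        document image is sent to the model with a target schema, and the
        JSON reply is scored against ground truth. A minimal sketch (not
        the actual benchmark runner linked above; the endpoint, model id,
        and schema here are illustrative assumptions):

            # Sketch only: one document image plus a schema goes to an
            # OpenAI-compatible endpoint serving Qwen2.5-VL; the JSON
            # reply is what gets compared against ground truth.
            import base64
            import json
            from openai import OpenAI

            client = OpenAI(base_url="https://openrouter.ai/api/v1",
                            api_key="...")

            def extract_json(image_path: str, schema: dict) -> dict:
                with open(image_path, "rb") as f:
                    b64 = base64.b64encode(f.read()).decode()
                resp = client.chat.completions.create(
                    model="qwen/qwen2.5-vl-32b-instruct",  # illustrative id
                    messages=[{
                        "role": "user",
                        "content": [
                            {"type": "text",
                             "text": "Extract the fields described by this "
                                     "JSON schema from the document. Return "
                                     "only JSON:\n" + json.dumps(schema)},
                            {"type": "image_url",
                             "image_url": {"url": f"data:image/png;base64,{b64}"}},
                        ],
                    }],
                    temperature=0,
                )
                return json.loads(resp.choices[0].message.content)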
        
       Author : themanmaran
       Score  : 98 points
       Date   : 2025-04-01 17:00 UTC (5 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | sandreas wrote:
       | What about mini cpm v2.6?
        
       | azinman2 wrote:
       | News update: OCR company touts new benchmark that shows its own
       | products are the most performant.
        
         | jauntywundrkind wrote:
          | I searched for any link between OmniAI and Alibaba's Qwen, but
          | couldn't find one. Do you know something I don't?
         | 
         | All of these models are open source (I think?). They could
         | presumably build their work on any of these options. It
         | behooves them to pick well. And establish some authority along
         | the way.
        
           | rustc wrote:
            | The model with the best accuracy in the linked benchmark is
            | "OmniAI" (OP's company), which looks like a paid model, not
            | open source [1].
           | 
           | [1]: https://getomni.ai/pricing
        
         | johnisgood wrote:
          | Someone should try to reproduce it and post the results here.
          | I can't; my PC is about 15 years old. :(
         | 
         | (It is not a joke.)
        
           | rustc wrote:
            | Reproducing the whole benchmark would be expensive; OmniAI
            | starts at $250/month.
        
             | themanmaran wrote:
             | Generally running the whole benchmark is ~$200, since all
             | the providers cost money. But if anyone wants to
             | specifically benchmark Omni just drop us a note and we'll
             | make the credits available.
        
             | johnisgood wrote:
             | So not all of them are local and open source? Ugh.
        
               | qingcharles wrote:
               | I don't see why you couldn't run any of those locally if
               | you buy the right hardware?
        
         | kapitalx wrote:
         | To be fair, they didn't include themselves at all in the graph.
        
           | azinman2 wrote:
           | They did. It's in the #1 spot
           | 
            | Update: looks like they removed themselves from the graph
            | since I saw it earlier today!
        
       | daemonologist wrote:
       | You mention that you measured cost and latency in addition to
       | accuracy - would you be willing to share those results as well?
       | (I understand that for these open models they would vary between
       | providers, but it would be useful to have an approximate
       | baseline.)
        
         | themanmaran wrote:
          | Yes, I'll add that to the writeup! You're right, I initially
          | excluded it because it was really dependent on the providers,
          | so there was a lot of variance, especially with the Qwen
          | models.
         | 
         | High level results were:
         | 
         | - Qwen 32b => $0.33/1000 pages => 53s/page
         | 
         | - Qwen 72b => $0.71/1000 pages => 51s/page
         | 
         | - Llama 90b => $8.50/1000 pages => 44s/page
         | 
         | - Llama 11b => $0.21/1000 pages => 08s/page
         | 
         | - Gemma 27b => $0.25/1000 pages => 22s/page
         | 
         | - Mistral => $1.00/1000 pages => 03s/page
        
           | dylan604 wrote:
            | One of these things is not like the others. $8.50/1000?? Any
            | chance that's a typo? Otherwise, for someone who has no
            | experience with LLM pricing models, why is Llama 90b so
            | expensive?
        
             | themanmaran wrote:
              | That was the cost when we ran Llama 90b using TogetherAI.
              | But it's quite hard to standardize, since it depends a lot
              | on who is hosting the model (e.g. Together, OpenRouter,
              | Groq, etc.).
              | 
              | I think in order to run a proper cost comparison, we would
              | need to run each model on an AWS GPU instance and compare
              | the runtime required.
        
             | int_19h wrote:
             | It's not uncommon when using brokers to see outliers like
             | this. What happens basically is that some models are very
             | popular and have many different providers, and are priced
             | "close to the metal" since the routing will normally pick
             | the cheapest option with the specified requirements (like
             | context size). But then other models - typically more
             | specialized ones - are only hosted by a single provider,
             | and said provider can then price it much higher than raw
             | compute cost.
             | 
             | E.g. if you look at
             | https://openrouter.ai/models?order=pricing-high-to-low,
             | you'll see that there are some 7B and 8B models that are
             | more expensive than Claude Sonnet 3.7.
        
           | esafak wrote:
            | A 2D plot would be great.
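            | 
            | Roughly, using the numbers from the parent comment (a quick
            | matplotlib sketch, values copied as-is):
            | 
            |     # Cost vs. latency scatter from the figures quoted above.
            |     import matplotlib.pyplot as plt
            | 
            |     models = {
            |         # name: (USD per 1,000 pages, seconds per page)
            |         "Qwen 32b": (0.33, 53),
            |         "Qwen 72b": (0.71, 51),
            |         "Llama 90b": (8.50, 44),
            |         "Llama 11b": (0.21, 8),
            |         "Gemma 27b": (0.25, 22),
            |         "Mistral": (1.00, 3),
            |     }
            | 
            |     fig, ax = plt.subplots()
            |     for name, (cost, latency) in models.items():
            |         ax.scatter(cost, latency)
            |         ax.annotate(name, (cost, latency))
            |     ax.set_xscale("log")  # Llama 90b is ~10x pricier
            |     ax.set_xlabel("$ per 1,000 pages")
            |     ax.set_ylabel("seconds per page")
            |     plt.show()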
        
       | jauntywundrkind wrote:
        | The 32b sounds like it has some useful small tweaks: tweaks to
        | make the output more human friendly, better mathematical
        | reasoning, and better fine-grained understanding.
        | https://qwenlm.github.io/blog/qwen2.5-vl-32b/
        | https://news.ycombinator.com/item?id=43464068
        | 
        | Qwen2.5-VL-72b was released two months ago (to little fanfare in
        | submissions, I think, but with some very enthusiastic comments,
        | such as rabid enthusiasm for its handwriting recognition) and was
        | already very interesting. It's actually one of the releases that
        | kind of turned me on to AI, that broke through some of my
        | skepticism & grumpiness. There are pretty good release notes
        | detailing its capabilities here; a well-done blog post.
       | https://qwenlm.github.io/blog/qwen2.5-vl/
       | 
        | One thing that really piqued my interest was Qwen's HTML output,
        | where it can provide bounding boxes in HTML format for its
        | output. That really closes the loop for me; it makes the output
        | something I can imagine quickly building useful visual feedback
        | around, or easily using the structured data from. I can't
        | imagine an easier-to-use output format.
        
       | CSMastermind wrote:
        | I've been very impressed with Qwen in my testing; I think people
        | are underestimating it.
        
       | ks2048 wrote:
       | I've been doing some experiments with the OCR API on macOS lately
       | and wonder how it compares to these LLMs.
       | 
        | Overall, it's very impressive, but it makes some mistakes (on
        | easy images, i.e. obviously wrong ones) that require human
        | intervention.
        | 
        | I would like to compare it to these models, but this benchmark
        | goes beyond OCR: it extracts structured JSON.
        
       | ks2048 wrote:
       | I suppose none of these models can output bounding box
       | coordinates for extracted text? That seems to be a big advantage
       | of traditional OCR over LLMs.
       | 
       | For applications I'm interested in, until we can get to 95+%
       | accuracy, it will require human double-checking / corrections,
       | which seems unfeasible w/o bounding boxes to quickly check for
       | errors.
        
         | jsight wrote:
         | I'd guess that it wouldn't be a huge effort to fine tune them
         | to produce bounding boxes.
         | 
         | I haven't done it with OCR tasks, but I have fine tuned other
         | models to produce them instead of merely producing descriptive
         | text. I'm not sure if there are datasets for this already, but
         | creating one shouldn't be very difficult.
        
         | kapitalx wrote:
          | If you're limited to open source models, that's very true. But
          | for larger models, and depending on your document needs, we're
          | definitely seeing very high accuracy (95%-99%) for
          | direct-to-JSON extraction (no intermediate markdown step) with
          | our solution at https://doctly.ai.
        
           | kapitalx wrote:
            | In addition, Gemini 2.5 Pro does really well with bounding
            | boxes, but yeah, not open source :(
        
         | michaelt wrote:
         | qwen2.5-vl-72b-instruct seems perfectly happy outputting
         | bounding boxes in my testing.
         | 
         | There's also a paper https://arxiv.org/pdf/2409.12191 where
         | they explicitly say some of their training included bounding
         | boxes and coordinates.
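          | 
          | A rough sketch of what that kind of request looks like
          | (endpoint, model id, and prompt wording are assumptions; the
          | coordinate convention the model uses should be checked against
          | its actual output):
          | 
          |     # Sketch: ask Qwen2.5-VL for text plus bounding boxes via an
          |     # OpenAI-compatible chat endpoint (e.g. a local vLLM server).
          |     import base64
          |     from openai import OpenAI
          | 
          |     client = OpenAI(base_url="http://localhost:8000/v1",
          |                     api_key="none")
          | 
          |     with open("page.png", "rb") as f:
          |         b64 = base64.b64encode(f.read()).decode()
          | 
          |     resp = client.chat.completions.create(
          |         model="Qwen/Qwen2.5-VL-72B-Instruct",
          |         messages=[{
          |             "role": "user",
          |             "content": [
          |                 {"type": "text",
          |                  "text": "OCR this page. Return a JSON list of "
          |                          "objects with 'text' and 'bbox' "
          |                          "[x1, y1, x2, y2] for each line."},
          |                 {"type": "image_url",
          |                  "image_url": {"url": f"data:image/png;base64,{b64}"}},
          |             ],
          |         }],
          |         temperature=0,
          |     )
          |     print(resp.choices[0].message.content)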
        
       | fpgaminer wrote:
       | I've been consistently surprised by Gemini's OCR capabilities.
       | And yeah, Qwen is climbing the vision ladder _fast_.
       | 
       | In my workflows I often have multiple models competing side-by-
       | side, so I get to compare the same task executed on, say, 4o,
       | Gemini, and Qwen. And I deal with a very wide range of vision
       | related tasks. The newest Qwen models are not only overall better
       | than their previous release by a good margin, but also much more
       | stable (less prone to glitching) and easier to finetune. I'm not
       | at all surprised they're topping the OCR benchmark.
       | 
       | What bugs me though is OpenAI. Outside of OCR, 4o is still king
       | in terms of overall understanding of images. But 4o is now almost
       | a year old, and in all that time they have neither improved the
       | vision performance in any newer releases, nor have they improved
       | OCR. OpenAI's OCR has been bad for a long time, and it's both odd
       | and annoying.
       | 
        | Take this with a grain of salt since, again, I've only had it in
        | my workflow for about a week or two, but I'd say Qwen 2.5 VL 72b
       | beats Gemini for general vision. That lands it in second place
       | for me. And it can be run _locally_. That's nuts. I'm going to
       | laugh if Qwen drops another iteration in a couple months that
       | beats 4o.
        
       | WillAdams wrote:
       | How does one configure an LLM interface using this to process
       | multiple files with a single prompt?
        
       | ianhawes wrote:
       | Is there a reason Surya isn't included?
        
       ___________________________________________________________________
       (page generated 2025-04-01 23:00 UTC)