[HN Gopher] Replace OCR with Vision Language Models
       ___________________________________________________________________
        
       Replace OCR with Vision Language Models
        
       Author : EarlyOom
       Score  : 97 points
       Date   : 2025-02-26 19:29 UTC (3 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | gfiorav wrote:
        | I wonder what the speed of this approach is vs. traditional
        | OCR techniques. Also, curious whether this could be used for
        | text detection (finding a bounding box containing text
        | within an image).
        
         | vunderba wrote:
          | Was just coming here to say this: there does not yet exist
          | a multimodal vision LLM approach that is capable of
          | identifying bounding boxes of where the text occurs. I
          | suppose you could manually cut the image up and send each
          | part separately to the LLM, but that feels like a kludge
          | and it's still inexact.
        
           | EarlyOom wrote:
           | We can do bounding boxes too :) we just call it visual
           | grounding https://github.com/vlm-run/vlmrun-
           | cookbook/blob/main/noteboo...
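            | 
            | Roughly, each extracted field can carry box coordinates
            | alongside its text. An illustrative sketch (not the
            | exact cookbook schema):
            | 
            |     from pydantic import BaseModel
            | 
            |     class BBox(BaseModel):
            |         # normalized [0, 1] image coordinates
            |         x1: float
            |         y1: float
            |         x2: float
            |         y2: float
            | 
            |     class GroundedText(BaseModel):
            |         text: str
            |         bbox: BBox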
        
             | vunderba wrote:
              | Wait, what? That's pretty neat. I'm on my phone right
              | now, so I can't really view the notebook very easily.
              | How does this work? Are you using some kind of
              | continual partitioning of the image and re-feeding
              | that back into the LLM to sort of pseudo-zoom in/out
              | on the parts that contain non-cut-off text, until you
              | can resolve that into rough coordinates?
        
           | chpatrick wrote:
            | Qwen 2.5 VL was specifically trained to produce bounding
            | boxes, I believe.
        
       | submeta wrote:
        | Can I use this to convert flowcharts to YAML representations?
        
         | EarlyOom wrote:
          | We convert to a JSON schema, but it would be trivial to
          | convert the output to YAML. There are some minor
          | differences in, e.g., the number of tokens required to
          | output JSON vs. YAML, which is why we've opted for this
          | strategy.
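          | 
          | E.g., a minimal sketch of the conversion, assuming PyYAML
          | on top of whatever JSON the extraction step returns:
          | 
          |     import json
          |     import yaml  # PyYAML
          | 
          |     # stand-in for the extracted JSON string
          |     result = '{"nodes": ["start", "end"]}'
          |     print(yaml.safe_dump(
          |         json.loads(result), sort_keys=False))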
        
       | orliesaurus wrote:
        | I think OCR tools are good at what they say on the box:
        | recognizing characters on a piece of paper, etc. If I
        | understand this right, the advantage of using a vision
        | language model is the added logic, so you can say things
        | like: "Clearly this is a string, but does it look like a
        | timestamp or something else?"
        
         | EarlyOom wrote:
          | VLMs are able to take context into account when filling in
          | fields, following either a global or a field-specific
          | prompt. This is great for, e.g., unlabeled axes, or
          | checking a legend for units to be suffixed after a number.
          | Also, you catch lots of really simple errors with type
          | hints (e.g. dates, addresses, country codes, etc.).
        
       | ekidd wrote:
        | I've been experimenting with vlm-run (plus custom form
        | definitions), and it works surprisingly well with Gemini 2.0
        | Flash. Costs, as I understand it, are also quite low for
        | Gemini. You'll get the best results with simple- to medium-
        | complexity forms, roughly the same ones you could ask a
        | human to process with less than 10 minutes of training.
       | 
       | If you need something like this, it's definitely good enough that
       | you should consider kicking the tires.
        
         | fzysingularity wrote:
         | Very cool! If you have more examples / schemas you'd be
         | interested in sharing, feel free to add to the `contrib`
         | section.
        
       | Eisenstein wrote:
       | If you just want to play with using a vision model to do OCR, I
       | made a little script that uses KoboldCpp to do it locally.
       | 
       | * https://github.com/jabberjabberjabber/LLMOCR
        
       | LeoPanthera wrote:
       | What's the characters-per-Wh of an LLM compared to traditional
       | OCR?
        
         | fzysingularity wrote:
         | That's a tough one to answer right now, but to be perfectly
         | honest, we're off by 2-3 orders of magnitude in terms of
         | chars/W.
         | 
          | That said, VLMs are extremely powerful visual learners with
          | LLM-like reasoning capabilities, making them more versatile
          | than OCR for practically all imaging domains.
          | 
          | Within a few years, I think we'll see models that are far
          | more cost-performant via distillation, quantization, and
          | the multitude of tricks you can use to reduce inference
          | overhead.
        
         | mlyle wrote:
          | A lot worse. But higher-quality OCR will reduce the amount
          | of human post-processing needed and, in turn, will allow
          | us to reduce the number of humans. Since humans are
          | relatively expensive in energy use, this can be expected
          | to save a lot of energy.
        
           | rafram wrote:
           | > Since humans are relatively expensive in energy use
           | 
           | Are they? I'm seeing figures around 80 watts at rest, and 150
           | when exercising. The brain itself only uses about 20 watts
           | [1]. That's 1/35 of a single H100's power consumption (700
           | watts - which doesn't even take into account the energy
           | required to cool the data center, the humans who build and
           | maintain it, ...).
           | 
           | [1]: https://www.humanbrainproject.eu/en/follow-
           | hbp/news/2023/09/...
        
             | mlyle wrote:
              | The PUE of humans for that 80 watts is terrible,
              | though. Ridiculous multiples of additional energy are
              | needed to convert solar power into a form of energy
              | that they can use, and even the manufacturing
              | lifecycle and transport of humans to the datacenter
              | is energy-inefficient.
        
       | tgtweak wrote:
        | Not really interested until this can run locally without API
        | keys :\
        
         | EarlyOom wrote:
          | You can! It works with Ollama:
          | https://github.com/vlm-run/vlmrun-hub
          | 
          | At the end of the day it's just schemas. You can decide for
          | yourself if it's worth upgrading to a larger, more
          | expensive model.
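          | 
          | Rough sketch of the local flow (illustrative; assumes a
          | recent Ollama with structured outputs and a vision model
          | you've already pulled, not our hosted API):
          | 
          |     import ollama
          |     from pydantic import BaseModel
          | 
          |     class Receipt(BaseModel):
          |         merchant: str
          |         total: float
          | 
          |     resp = ollama.chat(
          |         model="llama3.2-vision",  # any local VLM
          |         messages=[{
          |             "role": "user",
          |             "content": "Extract the receipt fields.",
          |             "images": ["receipt.jpg"],
          |         }],
          |         # constrain decoding to the schema
          |         format=Receipt.model_json_schema(),
          |     )
          |     receipt = Receipt.model_validate_json(
          |         resp["message"]["content"])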
        
       | beebaween wrote:
        | What's the best way to run this if I prefer to use local
        | GPUs?
        
         | EarlyOom wrote:
          | You can try out some of our schemas with Ollama if you
          | want: https://github.com/vlm-run/vlmrun-hub (instructions
          | in the README)
        
       | mmusson wrote:
        | Lol. The resume includes an "expert in Mia Khalifa" easter
        | egg.
        
       | gunian wrote:
       | replaced it with real humans -> nano tech in their brain ->
       | transmit to server getting almost 99% accuracy
        
       | intalentive wrote:
       | What's the value-add here? The schemas?
        
         | vlmrunadmin007 wrote:
          | Basically, there is no model + schema combination that
          | just works out of the box: if you prompt an open-source
          | model with the schema directly, it doesn't produce results
          | in the expected format. The main contribution is making
          | these models conform to your specific needs, in a
          | structured format.
        
           | idiliv wrote:
           | Wait, but we're doing that already, and it works well (Qwen
           | 2.5 VL)? If need be, you can always resort to structured
           | generation to enforce schema conformity?
        
         | fzysingularity wrote:
          | We've seen so many different schemas and ways of prompting
          | the VLMs. We're just standardizing them here and making it
          | dead simple to try them out across model providers.
        
       | TZubiri wrote:
       | Wow thanks!
       | 
        | There's a client who had a startup idea that involved
        | analyzing PDFs; I used Textract, but it was too cumbersome
        | and unreliable.
        | 
        | Maybe I can reach out to see if he wants to give it another
        | go with this!
        
       | skbjml wrote:
       | This is awesome!
        
       | rafram wrote:
       | It's an interesting idea, but still way too unreliable to use in
       | production IMO. When a traditional OCR model can't read the text,
       | it'll output gibberish with low confidence; when a VLM can't read
       | the text, it'll output something confidently made up, and it has
       | no way to report confidence. (You can ask it to, but the number
       | will itself be made up.)
       | 
       | I tried using a VLM to recognize handwritten text in genealogical
       | sources, and it made up names and dates that sort of fit the vibe
       | of the document when it couldn't read the text! They sounded
       | right for the ethnicity and time period but were entirely fake.
       | There's no way to ground the model using the source text when the
       | model _is_ your OCR.
        
         | EarlyOom wrote:
          | This is the main focus of VLM Run and typed extraction more
          | generally. If you provide proper type constraints (e.g.
          | with Pydantic), you can dramatically reduce the surface
          | area for hallucination. Then there's actual fine-tuning on
          | your dataset (we're working on this) to push accuracy
          | beyond what you get from an unspecialized frontier model.
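          | 
          | Concretely, the idea is to give the model a typed escape
          | hatch instead of room to invent values. A sketch, not our
          | exact schema code:
          | 
          |     from typing import Optional
          |     from pydantic import BaseModel, Field
          | 
          |     class BirthRecord(BaseModel):
          |         # Optional fields let the model return null
          |         # instead of a plausible-sounding guess
          |         given_name: Optional[str] = None
          |         surname: Optional[str] = None
          |         # bounded range rather than free text
          |         birth_year: Optional[int] = Field(
          |             None, ge=1500, le=2025)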
        
       | syntaxing wrote:
        | Maybe I'm being greedy, but is it possible to have a vLLM
        | detect when a portion is an image? I want to convert some
        | handwritten notes into markdown, but some portions are
        | diagrams. I want the vLLM to extract the diagrams to embed
        | into the markdown output.
        
         | vlmrunadmin007 wrote:
          | We have successfully tested the model with vLLM and plan to
          | release it across multiple inference server frameworks,
          | including vLLM and Ollama.
        
       ___________________________________________________________________
       (page generated 2025-02-26 23:00 UTC)