[HN Gopher] Replace OCR with Vision Language Models
___________________________________________________________________
Replace OCR with Vision Language Models
Author : EarlyOom
Score : 97 points
Date : 2025-02-26 19:29 UTC (3 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| gfiorav wrote:
| I wonder what the speed of this approach is vs. traditional OCR
| techniques. Also, curious whether this could be used for text
| detection (finding a bounding box containing text within an
| image).
| vunderba wrote:
| Was just coming here to say this: there does not yet exist a
| multimodal vision LLM approach capable of identifying bounding
| boxes of where the text occurs. I suppose you could manually cut
| the image up and send each part separately to the LLM, but that
| feels like a kludge and it's still inexact.
| EarlyOom wrote:
| We can do bounding boxes too :) We just call it visual grounding:
| https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...
| vunderba wrote:
| Wait, what? That's pretty neat. I'm on my phone right now, so I
| can't really view the notebook very easily. How does this work?
| Are you using some kind of continual partitioning of the image,
| refeeding it back into the LLM to pseudo-zoom in/out on the
| parts that contain non-cut-off text, until you can resolve that
| into rough coordinates?
| chpatrick wrote:
| Qwen 2.5 VL was specifically trained to produce bounding boxes,
| I believe.
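| If so, something like this sketch would work against an OpenAI-
| compatible endpoint serving it (the endpoint URL, model name,
| and the exact JSON format the model emits are assumptions; check
| the Qwen docs):
|
|     # Sketch: asking Qwen 2.5 VL for text bounding boxes via
|     # an OpenAI-compatible server (e.g. a local vLLM).
|     import base64
|     import json
|     from openai import OpenAI
|
|     client = OpenAI(base_url="http://localhost:8000/v1",
|                     api_key="none")
|
|     with open("scan.png", "rb") as f:
|         b64 = base64.b64encode(f.read()).decode()
|
|     resp = client.chat.completions.create(
|         model="Qwen/Qwen2.5-VL-7B-Instruct",
|         messages=[{"role": "user", "content": [
|             {"type": "image_url", "image_url":
|                 {"url": f"data:image/png;base64,{b64}"}},
|             {"type": "text", "text":
|                 "Locate every block of text. Reply with JSON: "
|                 '[{"bbox_2d": [x1, y1, x2, y2], '
|                 '"label": "..."}]'},
|         ]}],
|     )
|     boxes = json.loads(resp.choices[0].message.content)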
| submeta wrote:
| Can I use this to convert flowcharts to YAML representations?
| EarlyOom wrote:
| We convert to a JSON schema, but it would be trivial to convert
| this to YAML. There are some minor differences in e.g. the
| tokens required to output JSON vs. YAML, which is why we've
| opted for our strategy.
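| For instance, the conversion is a two-liner with PyYAML (a
| sketch; the input JSON here is a made-up flowchart):
|
|     # Sketch: JSON -> YAML with PyYAML.
|     import json
|     import yaml
|
|     data = json.loads(
|         '{"nodes": ["start", "end"],'
|         ' "edges": [["start", "end"]]}')
|     print(yaml.safe_dump(data, sort_keys=False))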
| orliesaurus wrote:
| I think OCR tools are good at what it says on the box:
| recognizing characters on a piece of paper, etc. If I understand
| this right, the advantage of using a vision language model is
| the added logic: you can say things like, "Clearly this is a
| string, but does it look like a timestamp or something else?"
| EarlyOom wrote:
| VLMs are able to take context into account when filling in
| fields, following either a global or a field-specific prompt.
| This is great for e.g. unlabeled axes, or checking a legend for
| units to be suffixed after a number. You also catch lots of
| really simple errors with type hints (e.g. dates, addresses,
| country codes, etc.).
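| A sketch of what those type hints look like in a Pydantic
| schema (field names are illustrative, not from our hub):
|
|     # Sketch: typed fields that reject trivially wrong values
|     # before they ever reach downstream code.
|     from datetime import date
|     from pydantic import BaseModel, Field
|
|     class InvoiceFields(BaseModel):
|         issued_on: date  # rejects "13/45/2024"
|         country_code: str = Field(pattern=r"^[A-Z]{2}$")
|         total: float = Field(ge=0)  # no negative totals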
| ekidd wrote:
| I've been experimenting with vlm-run (plus custom form
| definitions), and it works surprisingly well with Gemini 2.0
| Flash. Costs, as I understand it, are also quite low for Gemini.
| You'll get the best results with simple- to medium-complexity
| forms, roughly the same ones you could ask a human to process
| with less than 10 minutes of training.
|
| If you need something like this, it's definitely good enough that
| you should consider kicking the tires.
| fzysingularity wrote:
| Very cool! If you have more examples / schemas you'd be
| interested in sharing, feel free to add to the `contrib`
| section.
| Eisenstein wrote:
| If you just want to play with using a vision model to do OCR, I
| made a little script that uses KoboldCpp to do it locally.
|
| * https://github.com/jabberjabberjabber/LLMOCR
| LeoPanthera wrote:
| What's the characters-per-Wh of an LLM compared to traditional
| OCR?
| fzysingularity wrote:
| That's a tough one to answer right now, but to be perfectly
| honest, we're off by 2-3 orders of magnitude in terms of
| chars/Wh.
|
| That said, VLMs are extremely powerful visual learners with
| LLM-like reasoning capabilities, making them more versatile than
| OCR for practically all imaging domains.
|
| Within a few years, I think we'll see models that are more
| cost-performant via distillation, quantization, and the
| multitude of tricks you can use to reduce inference overhead.
| mlyle wrote:
| A lot worse. But higher-quality OCR will reduce the amount of
| human post-processing needed and, in turn, will allow us to
| reduce the number of humans. Since humans are relatively
| expensive in energy use, this can be expected to save a lot of
| energy.
| rafram wrote:
| > Since humans are relatively expensive in energy use
|
| Are they? I'm seeing figures around 80 watts at rest, and 150
| when exercising. The brain itself only uses about 20 watts
| [1]. That's 1/35 of a single H100's power consumption (700
| watts - which doesn't even take into account the energy
| required to cool the data center, the humans who build and
| maintain it, ...).
|
| [1]: https://www.humanbrainproject.eu/en/follow-hbp/news/2023/09/...
| mlyle wrote:
| The PUE of humans for that 80 watts is terrible, though.
| Ridiculous multiples of additional energy are needed to convert
| solar power into a form of energy that they can use, and even
| the manufacturing lifecycle and transport of humans to the
| datacenter are energy-inefficient.
| tgtweak wrote:
| Not really interested until this can run locally without API
| keys :\
| EarlyOom wrote:
| You can! It works with Ollama:
| https://github.com/vlm-run/vlmrun-hub
|
| At the end of the day it's just schemas. You can decide for
| yourself if it's worth upgrading to a larger, more expensive
| model.
| beebaween wrote:
| What's the best way to run this if I prefer to use local GPUs?
| EarlyOom wrote:
| You can try out some of our schemas with Ollama if you want:
| https://github.com/vlm-run/vlmrun-hub (instructions in the
| README)
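| Roughly, with the ollama Python client and its structured-
| output `format` parameter (a sketch; the model and schema are
| placeholders, not one of our hub schemas):
|
|     # Sketch: local schema-constrained extraction via Ollama.
|     import ollama
|     from pydantic import BaseModel
|
|     class Receipt(BaseModel):
|         merchant: str
|         total: float
|
|     resp = ollama.chat(
|         model="llama3.2-vision",
|         messages=[{
|             "role": "user",
|             "content": "Extract the receipt fields as JSON.",
|             "images": ["receipt.jpg"],
|         }],
|         # constrain decoding to the schema
|         format=Receipt.model_json_schema(),
|     )
|     receipt = Receipt.model_validate_json(
|         resp.message.content)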
| mmusson wrote:
| Lol. The resume includes an "expert in Mia Khalifa" easter egg.
| gunian wrote:
| Replace it with real humans -> nanotech in their brains ->
| transmit to a server, getting almost 99% accuracy.
| intalentive wrote:
| What's the value-add here? The schemas?
| vlmrunadmin007 wrote:
| Basically, there is no standard model-schema combination. If you
| prompt an open-source model with a schema directly, it doesn't
| produce results in the expected format. The main contribution is
| making these models conform to your specific needs, in a
| structured format.
| idiliv wrote:
| Wait, but we're doing that already, and it works well (Qwen
| 2.5 VL)? If need be, you can always resort to structured
| generation to enforce schema conformity?
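| (For reference, the structured-generation route with the
| Outlines library looks roughly like this. A sketch against its
| pre-1.0 API, text-only for brevity; Outlines also has a
| vision-model loader.)
|
|     # Sketch: constrained decoding with Outlines (0.x API).
|     import outlines
|     from pydantic import BaseModel
|
|     class Entry(BaseModel):
|         name: str
|         date: str
|
|     model = outlines.models.transformers(
|         "Qwen/Qwen2.5-7B-Instruct")
|     generate = outlines.generate.json(model, Entry)
|     entry = generate("Extract name and date from: ...")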
| fzysingularity wrote:
| We've seen so many different schemas and ways of prompting the
| VLMs. We're just standardizing it here, and making it dead-
| simple to try it out across model providers.
| TZubiri wrote:
| Wow thanks!
|
| There's a client who had a startup idea that involved analyzing
| PDFs. I used Textract, but it was too cumbersome and unreliable.
|
| Maybe I can reach out to see if he wants to give it another go
| with this!
| skbjml wrote:
| This is awesome!
| rafram wrote:
| It's an interesting idea, but still way too unreliable to use in
| production IMO. When a traditional OCR model can't read the text,
| it'll output gibberish with low confidence; when a VLM can't read
| the text, it'll output something confidently made up, and it has
| no way to report confidence. (You can ask it to, but the number
| will itself be made up.)
|
| I tried using a VLM to recognize handwritten text in genealogical
| sources, and it made up names and dates that sort of fit the vibe
| of the document when it couldn't read the text! They sounded
| right for the ethnicity and time period but were entirely fake.
| There's no way to ground the model using the source text when the
| model _is_ your OCR.
| EarlyOom wrote:
| This is the main focus of VLM Run and typed extraction more
| generally. If you provide proper type constraints (e.g. with
| Pydantic), you can dramatically reduce the surface area for
| hallucination. Then there's actual fine-tuning on your dataset
| (we're working on this) to push accuracy beyond what you get
| from an unspecialized frontier model.
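| Concretely, making uncertain fields Optional gives the model a
| typed way to say "unreadable" instead of inventing a value. A
| sketch, not our exact schema:
|
|     # Sketch: Optional fields as an escape hatch against
|     # hallucinated values (illustrative field names).
|     from typing import Optional
|     from pydantic import BaseModel, Field
|
|     class GenealogyRecord(BaseModel):
|         surname: Optional[str] = Field(
|             default=None,
|             description="null if handwriting is illegible")
|         birth_year: Optional[int] = Field(
|             default=None, ge=1500, le=2025)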
| syntaxing wrote:
| Maybe I'm being greedy, but is it possible to have a VLM detect
| when a portion is an image? I want to convert some handwritten
| notes into markdown, but some portions are diagrams. I want the
| VLM to extract the diagrams and embed them into the markdown
| output.
| vlmrunadmin007 wrote:
| We have successfully tested the model with vLLM and plan to
| release it across multiple inference-server frameworks,
| including vLLM and Ollama.
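| With vLLM's OpenAI-compatible server, for example, schema
| conformity can be enforced at decode time via its guided-
| decoding `guided_json` extra, roughly like this (a sketch;
| model name, schema, and image URL are placeholders):
|
|     # Sketch: schema-constrained extraction through vLLM's
|     # OpenAI-compatible server with guided decoding.
|     from openai import OpenAI
|     from pydantic import BaseModel
|
|     class Diagram(BaseModel):
|         caption: str
|         kind: str  # e.g. "flowchart"
|
|     client = OpenAI(base_url="http://localhost:8000/v1",
|                     api_key="none")
|     resp = client.chat.completions.create(
|         model="Qwen/Qwen2.5-VL-7B-Instruct",
|         messages=[{"role": "user", "content": [
|             {"type": "image_url", "image_url":
|                 {"url": "https://example.com/page.png"}},
|             {"type": "text",
|              "text": "Describe the diagram as JSON."},
|         ]}],
|         extra_body={
|             "guided_json": Diagram.model_json_schema()},
|     )
|     print(Diagram.model_validate_json(
|         resp.choices[0].message.content))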
___________________________________________________________________
(page generated 2025-02-26 23:00 UTC)