[HN Gopher] Mistral OCR
___________________________________________________________________
Mistral OCR
Author : littlemerman
Score : 834 points
Date : 2025-03-06 17:39 UTC (5 hours ago)
(HTM) web link (mistral.ai)
(TXT) w3m dump (mistral.ai)
| vessenes wrote:
| Dang. Super fast and significantly more accurate than google,
| Claude and others.
|
| Pricing : $1/1000 pages, or per 2k pages if "batched". I'm not
| sure what batching means in this case: multiple pdfs? Why not
| split them to halve the cost?
|
| Anyway this looks great at pdf to markdown.
| jacksnipe wrote:
| I would assume this is 1 request containing 2k pages vs N
| requests whose total pages add up to 1000.
| abiraja wrote:
| Batching likely means the response is not real-time. You set up
| a batch job and they send you the results later.
| vessenes wrote:
| That makes sense. Idle time is nearly free after all.
| ozim wrote:
| If only business people I work with would understand 100GB
| even transfer over the network is not going to return
| immediately results ;)
| sophiebits wrote:
| Batched often means a higher latency option (minutes/hours
| instead of seconds), which providers can schedule more
| efficiently on their GPUs.
| Tostino wrote:
| Usually (With OpenAI, I haven't checked Mistral yet) it means
| an async api rather than a sync api.
|
| e.g. you submit multiple requests (pdfs) in one call, and get
| back an id for the batch. You then can check on the status of
| that batch and get the results for everything when done.
|
| It lets them use their available hardware to it's full capacity
| much better.
| odiroot wrote:
| May I ask as a layperson, how would you about using this to OCR
| multiple hundreds of pages? I tried the chat but it pretty much
| stops after the 2nd page.
| sneak wrote:
| Submit the pages via the API.
| odiroot wrote:
| This worked indeed. Although I had to cut my document into
| smaller chunks. 900 pages at once ended with a timeout.
| beklein wrote:
| You can check the example code on the Mistral documentation,
| you would _only_ have to change the value of the variable
| `document_url` to the URL of your uploaded PDF... and you
| need to change the `MISTRAL_API_KEY` to the value of your
| specific key that you can get from the Le Platforme webpage.
|
| https://docs.mistral.ai/capabilities/document/#ocr-with-pdf
| odiroot wrote:
| Thanks!
| kapitalx wrote:
| From my testing so far, it seems it's super fast and responded
| synchronously. But it decided that the entire page is an image
| and returned `` with coordinates in
| the metadata for the image, which is the entire page.
|
| Our tool, doctly.ai is much slower and async, but much more
| accurate and gets you the content itself as an markdown.
| ralusek wrote:
| I thought we stopped -ly company names ~8 years ago?
| yieldcrv wrote:
| if you talk to people gen-x and older, you still _need_
| .com domains
|
| for all those people that aren't just clicking on a link on
| their social media feed, chat group, or targeted ad
| kapitalx wrote:
| Haha for sure. Naming isn't just the hardest problem in
| computer science, it's always hard. But at some point you
| just have to pick something and move forward.
| newfocogi wrote:
| They say: "releasing the API mistral-ocr-latest at 1000 pages /
| $"
|
| I had to reread that a few times. I assume this means 1000pg/$1
| but I'm still not sure about it.
| bredren wrote:
| Ya, presumably it is missing the number `1.00`.
| groby_b wrote:
| Not really. When you go 60 mph (or km/h) you don't specify
| the 1.00 for the hours either. pages/$ is the unit, 1000 is
| the value.
| svachalek wrote:
| Yeah you can read it as "pages per dollar" or as a unit
| "pages/$", it all comes out the same meaning.
| dgfl wrote:
| Great example of how information is sometimes compartmentalized
| arbitrarily in the brain: I imagine you have never been
| confused by sentences such as "I'm running at 10 km/h".
| mkl wrote:
| Dollar signs go before the number, not after it like units.
| It needs to be 1000 pages/$1 to make sense, whereas 10km and
| 10h and 10/h all make sense so 10km/h does. I imagine you
| would be confused by km/h 10 but not $10.
| amelius wrote:
| Hmm, can it read small print? ;)
| pawelduda wrote:
| It outperforms the competition significantly AND can extract
| embedded images from the text. I really like LLMs for OCR more
| and more. Gemini was already pretty good at it
| sbarre wrote:
| 6 years ago I was working with a very large enterprise that was
| struggling to solve this problem, trying to scan millions of
| arbitrary forms and documents per month to clearly understand key
| points like account numbers, names and addresses, policy numbers,
| phone numbers, embedded images or scribbled notes, and also draw
| relationships between these values on a given form, or even
| across forms.
|
| I wasn't there to solve that specific problem but it was
| connected to what we were doing so it was fascinating to hear
| that team talk through all the things they'd tried, from brute-
| force training on templates (didn't scale as they had too many
| kinds of forms) to every vendor solution under the sun (none
| worked quite as advertised on their data)..
|
| I have to imagine this is a problem shared by so many companies.
| jcuenod wrote:
| Just tested with a multilingual (bidi) English/Hebrew document.
|
| The Hebrew output had no correspondence to the text whatsoever
| (in context, there was an English translation, and the Hebrew
| produced was a back-translation of that).
|
| Their benchmark results are impressive, don't get me wrong. But
| I'm a little disappointed. I often read multilingual document
| scans in the humanities. Multilingual (and esp. bidi) OCR is
| challenging, and I'm always looking for a better solution for a
| side-project I'm working on (fixpdfs.com).
|
| Also, I thought OCR implied that you could get bounding boxes for
| text (and reconstruct a text layer on a scan, for example). Am I
| wrong, or is this term just overloaded, now?
| nicodjimenez wrote:
| You can get bounding boxes from our pdf api at Mathpix.com
|
| Disclaimer, I'm the founder
| kergonath wrote:
| Mathpix is ace. That's the best results I got so far for
| scientific papers and reports. It understands the layout of
| complex documents very well, it's quite impressive. Equations
| are perfect, figures extraction works well.
|
| There are a few annoying issues, but overall I am very happy
| with it.
| nicodjimenez wrote:
| Thanks for the kind words. What are some of the annoying
| issues?
| notepad0x90 wrote:
| I was just watching a science-related video containing math
| equations. I wondered how soon will I be able to ask the video
| player "What am I looking at here, describe the equations" and it
| will OCR the frames, analyze them and explain them to me.
|
| It's only a matter of time before "browsing" means navigating
| HTTP sites via LLM prompts. although, I think it is critical that
| LLM input should NOT be restricted to verbal cues. Not everyone
| is an extrovert that longs to hear the sound of their own voices.
| A lot of human communication is non-verbal.
|
| Once we get over the privacy implications (and I do believe this
| can only be done by worldwide legislative efforts), I can imagine
| looking at a "website" or video, and my expressions, mannerisms
| and gestures will be considered prompts.
|
| At least that is what I imagine the tech would evolve into in 5+
| years.
| devmor wrote:
| Good lord, I dearly hope not. That sounds like a coddled
| hellscape world, something you'd see made fun of in Disney's
| Wall-E.
| notepad0x90 wrote:
| hence my comment about privacy and need for legislation :)
|
| It isn't the tech that's the problem but the people that will
| abuse it.
| devmor wrote:
| While those are concerns, my point was that having
| everything on the internet navigated to, digested and
| explained to me sounds unpleasant and overall a drain on my
| ability to think and reason for myself.
|
| It is _specifically_ how you describe using the tech that
| provokes a feeling of revulsion to me.
| notepad0x90 wrote:
| Then I think you misunderstand. The ML system would know
| when you want things digested to you or not. Right now
| companies are assuming this and forcing LLM interaction.
| But when properly done, the system would know based on
| your behavior or explicit prompts what you want and
| provide the service. If you're staring at a paragraph
| intently and confused, it might start highlighting common
| phrases or parts of the text/picture that might be hard
| to grasp and based on your reaction to that, it might
| start describing things via audio,tool tips,side
| pane,etc.. In other words, if you don't like how and when
| you're interacting with the LLM ecosystem, then that is
| an immature and failing ecosystem, in my vision this
| would be a largely solved problems, like how we interact
| with keyboards,mouse and touchscreens today.
| abrichr wrote:
| > I wondered how soon will I be able to ask the video player
| "What am I looking at here, describe the equations" and it will
| OCR the frames, analyze them and explain them to me.
|
| Seems like https://aiscreenshot.app might fit the bill.
| groby_b wrote:
| Now? OK, you need to screencap and upload to LLM, but that's
| well established tech by now. (Where by "well established", I
| mean at least 9 months old ;)
|
| Same goes for "navigating HTTP sites via LLM prompts". Most
| LLMs have web search integration, and the "Deep Research"
| variants do more complex navigation.
|
| Video chat is there partially, as well. It doesn't really pay
| much attention to gestures & expressions, but I'd put the
| "earliest possible" threshold for that a good chunk closer than
| 5 years.
| notepad0x90 wrote:
| Yeah, all these things are possible today, but getting them
| well polished and integrated is another story. Imagine all
| this being supported by "HTML6" lol. When apple gets around
| to making this part of safari, then we know it's ready.
| groby_b wrote:
| That's a great upper-bound estimator ;)
|
| But kidding aside - I'm not sure people _want_ this being
| supported by web standards. We could be a _huge_ step
| closer to that future had we decided to actually take RDF
| /Dublin Core/Microdata seriously. (LLMs perform a lot
| better with well-annotated data)
|
| The unanimous verdict across web publishers was "looks like
| a lot of work, let's not". That is, ultimately, why we need
| to jump through all the OCR hoops. Not only did the world
| not annotate the data, it then proceeded to remove as many
| traces of machine readability as possible.
|
| So, the likely gating factor is probably not Apple & Safari
| & "HTML6" (shudder!)
|
| If I venture my best bet what's preventing polished
| integration: It's really hard to do via foundational models
| only, and the number of people who want to have deep &
| well-informed conversations via a polished app enough that
| they're willing to pay for an app that does that is low
| enough that it's not the hot VC space. (Yet?)
|
| Crystal ball: Some OSS project will probably get within
| spitting distance of something really useful, but also
| probably flub the UX. Somebody else will take up these
| ideas while it's hot and polish it in a startup. So, 18-36
| months for an integrated experience from here?
| andoando wrote:
| Bit unrelated but is there anything that can help with really low
| resolution text? My neighbor got hit and run the other day for
| example, and I've been trying every tool I can to make out some
| of the letters/numbers on the plate
|
| https://ibb.co/mr8QSYnj
| busymom0 wrote:
| There are photo enhancers online. But your picture is way too
| pixelated to get any useful info from it.
| tjoff wrote:
| If you know the font in advance (which you often do in these
| cases) you can do insane reconstructions. Also keep in mind
| that it doesn't have to be a perfect match, with the help of
| the color and other facts (such as likely location) about the
| car you can narrow it down significantly.
| zellyn wrote:
| Maybe if you had multiple frames, and used something very
| clever?
| flutas wrote:
| Looks like a paper temp tag. Other than that, I'm not sure much
| can be had from it.
| zinglersen wrote:
| Finding the right subreddit and asking there is probably a
| better approach if you want to maximize the chances of getting
| the plate 'decrypted'.
| dewey wrote:
| To even get started on this you'd also need to share some
| contextual information like continent, country etc. I'd say.
| andoando wrote:
| Its in CA, looks like paper plates which follow a specific
| format and the last two seem to be the numbers '64'. Police
| should be able to search for temp tag with partial match and
| match the make/model. Was curious to see if any software
| could help though
| rvnx wrote:
| If it's a video, sharing a few frames can help as well
| TriangleEdge wrote:
| One of my hobby projects while in University was to do OCR on
| book scans. Doing character recognition was solved, but finding
| the relationship between characters was very difficult. I tried
| "primitive" neural nets, but edge cases would often break what I
| built. Super cool to me to see such an order of magnitude in
| improvement here.
|
| Does it do hand written notes and annotations? What about meta
| information like highlighting? I am also curious if LLMs will get
| better because more access to information if it can be
| effectively extracted from PDFs.
| jcuenod wrote:
| * Character recognition on monolingual text in a narrow domain
| is solved
| jervant wrote:
| I wonder how it compares to USPS workers at deciphering illegible
| handwriting.
| linklater12 wrote:
| Document processing is where b2b SAAS is at.
| opwieurposiu wrote:
| Related, does anyone know of an app that can read gauges from an
| image and log the number to influx? I have a solar power meter in
| my crawlspace, it is inconvenient to go down there. I want to
| point an old phone at it and log it so I can check it easily. The
| gauge is digital and looks like this:
|
| https://www.pvh2o.com/solarShed/firstPower.jpg
| dehrmann wrote:
| You'll be happier finding a replacement meter that has an
| interface to monitor it directly or a second meter. An old
| phone and OCR will be very brittle.
| haswell wrote:
| Not OP, but it sounds like the kind of project I'd undertake.
|
| Happiness for me is about exploring the problem within
| constraints and the satisfaction of building the solution.
| Brittleness is often of less concern than the fun factor.
|
| And some kinds of brittleness can be managed/solved, which
| adds to the fun.
| arcfour wrote:
| I would posit that learning how the device works, and how
| to integrate with a newer digital monitoring device would
| be just as interesting _and_ less brittle.
| haswell wrote:
| Possibly! But I've recently wanted to dabble with
| computer vision, so I'd be looking at a project like this
| as a way to scratch a specific itch. Again, not OP so I
| don't know what their priorities are, but just offering
| one angle for why one might choose a less "optimal"
| approach.
| renewiltord wrote:
| 4o transcribes it perfectly. You can usually root an old
| Android and write this app in ~2h with LLMs if unfamiliar. The
| hard part will be maintaining camera lens cleanliness and
| alignment etc.
|
| The time cost is so low that you should give it a gander.
| You'll be surprised how fast you can do it. If you just take
| screenshots every minute it should suffice.
| pavl wrote:
| What software-tools do you usw to Programm the APP?
| ubergeek42 wrote:
| This[1] is something I've come across but not had a chance to
| play with, designed for reading non-smart meters that might
| work for you. I'm not sure if there's any way to run it on an
| old phone though.
|
| [1] https://github.com/jomjol/AI-on-the-edge-device
| timc3 wrote:
| I use this for a watermeter. Works quite well as long as you
| have a good SD card
| jasonjayr wrote:
| Wow. I was looking at hooking my water meter into home
| assistant, and was going to investigate just counting an
| optical pulse (it has a white portion on the gear that is in
| a certain spot every .1 gal) This is like the same meter I
| use, and perfect.
|
| (It turns out my electric meter, though analog, blasts out
| it's reading on RF every 10 seconds unencrypted. I got that
| via my RTL-SDR reciever :) )
| ramses0 wrote:
| https://www.home-assistant.io/integrations/seven_segments/
|
| https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
|
| https://github.com/tesseract-ocr/tesseract
|
| https://community.home-assistant.io/t/ocr-on-camera-image-fo...
|
| https://www.google.com/search?q=home+assistant+ocr+integrati...
|
| https://www.google.com/search?q=esphome+ocr+sensor
|
| https://hackaday.com/2021/02/07/an-esp-will-read-your-meter-...
|
| ...start digging around and you'll likely find something. HA
| has integrations which can support writing to InfluxDB (local
| for sure, and you can probably configure it for a remote
| influxdb).
|
| You're looking at 1xRaspberry PI, 1xUSB Webcam, 1x"Power
| Management / humidity management / waterproof electrical box"
| to stuff it into, and then either YOLO and DIY to shoot over to
| your influxdb, or set up a Home Assistant and "attach" your
| frankenbox as some sort of "sensor" or "integration" which
| spits out metrics and yadayada...
| BonoboIO wrote:
| Gemini Free Tier would surely work
| ChemSpider wrote:
| "World's best OCR model" - that is quite a statement. Are there
| any well-known benchmarks for OCR software?
| xnx wrote:
| https://huggingface.co/spaces/echo840/ocrbench-leaderboard
| ChemSpider wrote:
| Interesting. But no mistral on it yet?
| themanmaran wrote:
| We published this benchmark the other week. We'll can update
| and run with Mistral today!
|
| https://github.com/getomni-ai/benchmark
| kergonath wrote:
| Excellent. I am looking forward to it.
| cdolan wrote:
| Came here to see if you all had run a benchmark on it yet :)
| WhitneyLand wrote:
| It's interesting that none of the existing models can decode a
| Scrabble board screen shot and give an accurate grid of
| characters.
|
| I realize it's not a common business case, came across it
| testing how well LLMs can solve simple games. On a side note,
| if you bypass OCR and give models a text layout of a board
| standard LLMs cannot solve Scrabble boards but the thinking
| models usually can.
| resource_waste wrote:
| Its Mistral, they are the only homegrown AI Europe has, so
| people pretend they are meaningful.
|
| I'll give it a try, but I'm not holding my breath. I'm a huge
| AI Enthusiast and I've yet to be impressed with anything
| they've put out.
| sashank_1509 wrote:
| Really cool, thanks Mistral!
| z2 wrote:
| Is there a reliable handwriting OCR benchmark out there (updated,
| not a blog post)? Despite the gains claimed for printed text, I
| found (anecdotally) that trying to use Mistral OCR on my messy
| cursive handwriting to be much less accurate than GPT-4o, in the
| ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.
|
| Edit: answered in another post:
| https://huggingface.co/spaces/echo840/ocrbench-leaderboard
| dannyobrien wrote:
| Simon Willison linked to an impressive demo of Qwen2-VL in this
| area: I haven't found a version of it that I could run locally
| yet to corroborate.
| https://simonwillison.net/2024/Sep/4/qwen2-vl/
| aperrien wrote:
| Is this model open source?
| daemonologist wrote:
| No (nor is it open-weights).
| cxie wrote:
| The new Mistral OCR release looks impressive - 94.89% overall
| accuracy and significantly better multilingual support than
| competitors. As someone who's built document processing systems
| at scale, I'm curious about the real-world implications.
|
| Has anyone tried this on specialized domains like medical or
| legal documents? The benchmarks are promising, but OCR has always
| faced challenges with domain-specific terminology and formatting.
|
| Also interesting to see the pricing model ($1/1000 pages) in a
| landscape where many expected this functionality to eventually be
| bundled into base LLM offerings. This feels like a trend where
| previously encapsulated capabilities are being unbundled into
| specialized APIs with separate pricing.
|
| I wonder if this is the beginning of the componentization of AI
| infrastructure - breaking monolithic models into specialized
| services that each do one thing extremely well.
| unboxingelf wrote:
| We'll just stick LLM Gateway LLM in front of all the
| specialized LLMs. MicroLLMs Architecture.
| cxie wrote:
| I actually think you're onto something there. The "MicroLLMs
| Architecture" could mirror how microservices revolutionized
| web architecture.
|
| Instead of one massive model trying to do everything, you'd
| have specialized models for OCR, code generation, image
| understanding, etc. Then a "router LLM" would direct queries
| to the appropriate specialized model and synthesize
| responses.
|
| The efficiency gains could be substantial - why run a 1T
| parameter model when your query just needs a lightweight OCR
| specialist? You could dynamically load only what you need.
|
| The challenge would be in the communication protocol between
| models and managing the complexity. We'd need something like
| a "prompt bus" for inter-model communication with
| standardized inputs/outputs.
|
| Has anyone here started building infrastructure for this kind
| of model orchestration yet? This feels like it could be the
| Kubernetes moment for AI systems.
| arcfour wrote:
| This is already done with agents. Some agents only have
| tools and the one model, some agents will orchestrate with
| other LLMs to handle more advanced use cases. It's pretty
| obvious solution when you think about how to get good
| performance out of a model on a complex task when useful
| context length is limited: just run multiple models with
| their own context and give them a supervisor model--just
| like how humans organize themselves in real life.
| fnordpiglet wrote:
| I'm doing this personally for my own project - essentially
| building an agent graph that starts with the image output,
| orients and cleans, does a first pass with tesseract LSTM
| best models to create PDF/HOCR/Alto, then pass to other
| LLMs and models based on their strengths to further refine
| towards markdown and latex. My goal is less about RAG
| database population but about preserving in a non manually
| typeset form the structure and data and analysis, and there
| seems to be pretty limited tooling out there since the goal
| generally seems to be the obviously immediately commercial
| goal of producing RAG amenable forms that defer the "heavy"
| side of chart / graphic / tabular reproduction to a future
| time.
| kergonath wrote:
| > Has anyone tried this on specialized domains like medical or
| legal documents?
|
| I'll try it on a whole bunch of scientific papers ASAP. Quite
| excited about this.
| stavros wrote:
| What do you mean by "free"? Using the OpenAI vision API, for
| example, for OCR is quite a bit more expensive than $1/1k
| pages.
| salynchnew wrote:
| Also interesting to see that parts of the training
| infrastructure to create frontier models is itself being
| monetized.
| PeterStuer wrote:
| I'd love to try it for my domain (regulation), but $1/1000
| pages is significantly more expensive than my current local
| Docling based setup that already does a great job of processing
| PDF's for my needs.
| yawnxyz wrote:
| I think for regulated fields / high impact fields $1/1000 is
| well-worth the price; if the accuracy is close to 100% this
| is way better than using people, who are still error-prone
| janalsncm wrote:
| I have done OCR on leases. It's hard. You have to be accurate
| and they all have bespoke formatting.
|
| It would almost be easier to switch everyone to a common format
| and spell out important entities (names, numbers) multiple
| times similar to how cheques do.
|
| The utility of the system really depends on the makeup of that
| last 5%. If problematic documents are consistently predictable,
| it's possible to do a second pass with humans. But if they're
| random, then you have to do every doc with humans and it
| doesn't save you any time.
| epolanski wrote:
| At my client we want to provide an AI that can retrieve
| relevant information from documentation (home building
| business, documents detail how to install a solar panel or a
| shower, etc) and we've set up an entire system with benchmarks,
| agents, etc, yet the bottleneck is OCR!
|
| We have millions and millions of pages of documents and an off
| by 1 % error means it compounds with the AI's own error, which
| compounds with documentation itself being incorrect at times,
| which leads it all to be not production ready (and indeed the
| project has never been released), not even close.
|
| We simply cannot afford to give our customers incorrect
| informatiin
|
| We have set up a backoffice app that when users have questions,
| it sends it to our workers along the response given by our AI
| application and the person can review it, and ideally correct
| the ocr output.
|
| Honestly after an year of working it feels like AI right now
| can only be useful when supervised all the time (such as when
| coding). Otherwise I just find LLMs still too unreliable
| besides basic bogus tasks.
| PeterStuer wrote:
| As someone who has had a home built, and nearly all my
| friends and acquaintances report the same thing, having a 1%
| error on information in this business would mean not a 10x
| but a 50x improvement over the current practice in the field.
|
| If nobody is supervising building documents all the time
| during the process, every house would be a pile of rubbish.
| And even when you do stuff stills creeps in and has to be
| redone, often more than once.
| themanmaran wrote:
| Excited to test this our on our side as well. We recently built
| an OCR benchmarking framework specifically for VLMs[1][2], so
| we'll do a test run today.
|
| From our last benchmark run, some of these numbers from Mistral
| seem a little bit optimistic. Side by side of a few models:
|
| model | omni | mistral |
|
| gemini | 86% | 89% |
|
| azure | 85% | 89% |
|
| gpt-4o | 75% | 89% |
|
| google | 68% | 83% |
|
| Currently adding the Mistral API and we'll get results out
| today!
|
| [1] https://github.com/getomni-ai/benchmark
|
| [2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
| jaggs wrote:
| By optimistic, do you mean 'tweaked'? :)
| kbyatnal wrote:
| re: real world implications, LLMs and VLMs aren't magi, and
| anyone who goes in expecting 100% automation is in for a
| surprise (especially in domains like medical or legal).
|
| IMO there's still a large gap for businesses in going from raw
| OCR outputs --> document processing deployed in prod for
| mission-critical use cases.
|
| e.g. you still need to build and label datasets, orchestrate
| pipelines (classify -> split -> extract), detect uncertainty
| and correct with human-in-the-loop, fine-tune, and a lot more.
| You can certainly get close to full automation over time, but
| it's going to take time and effort.
|
| But for RAG and other use cases where the error tolerance is
| higher, I do think these OCR models will get good enough to
| just solve that part of the problem.
|
| Disclaimer: I started a LLM doc processing company to help
| companies solve problems in this space (https://extend.app/)
| janis1234 wrote:
| $1 for 1000 pages seems high to me. Doing a google search
|
| Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from
| $1.35/hour
|
| I just don't know if in 1 hour and with a A100 I can process
| more than a 1000 pages. I'm guessing yes.
| blackoil wrote:
| Is the model Open Source/Weight? Else the cost is for the
| model, not GPU.
| alberth wrote:
| Curious to see how this performance against more real world usage
| of someone taking a photo of text (which the text then becomes
| slightly blurred) and performing OCR on it.
|
| I can't exactly tell if the "Mistral 7B" image is an example of
| this exact scenario.
| roboben wrote:
| Le chat doesn't seem to know about this change despite the blog
| post stating it. Can anyone explain how to use it in Le Chat?
| kapitalx wrote:
| Looks to be API only for now. Documentation here:
| https://docs.mistral.ai/capabilities/document/
| troyvit wrote:
| I asked LeChat this question:
|
| If I upload a small PDF to you are you able to convert it to
| markdown?
|
| LeChat said yes and away we went.
| jacooper wrote:
| Pretty cool, would love to use this with paperless, but I just
| can't bring myself to send a photo of all my documents to a third
| party, especially legal and sensitive documents, which is what I
| use Paperless for.
|
| Because of that I'm stuck with crappy vision on Ollama (Thanks to
| AMDs crappy ROCm support for Vllm)
| jbverschoor wrote:
| Ohhh. Gonna test it out with some 100+ year old scribbles :)
| WhitneyLand wrote:
| 1. There's no simple page / sandbox to upload images and try it.
| Fine, I'll code it up.
|
| 2. "Explore the Mistral AI APIs" (https://docs.mistral.ai) links
| to all apis except OCR.
|
| 3. The docs on the api params refer to document chunking and
| image chunking but no details on how their chunking works?
|
| So much unnecessary friction smh.
| cooperaustinj wrote:
| There is an OCR page on the link you provided. It includes a
| very, very simple curl command (like most of their docs).
|
| I think the friction here exists outside of Mistral's control.
| kergonath wrote:
| > There is an OCR page on the link you provided.
|
| I don't see it either. There might be some caching issue.
| deadbabe wrote:
| LLM based OCR is a disaster, great potential for hallucinations
| and no estimate of confidence. Results might seem promising but
| you'll always be wondering.
| menaerus wrote:
| CNN-based OCR also have "hallucinations" and Transformers
| aren't that much different in that respect. This is a problem
| solved with domain specific post-processing.
| leumon wrote:
| well already in 2013 ocr systems used in xerox scanners (turned
| on by default!) randomly altered numbers, so its not an issue
| only occuring in llms.
| bob1029 wrote:
| > It takes images and PDFs as input
|
| If you are working with PDF, I would suggest a hybrid process.
|
| It is feasible to extract information with 100% accuracy from
| PDFs that were generated using the mappable acrofields approach.
| In many domains, you have a fixed set of forms you need to
| process and this can be leveraged to build a custom tool for
| extracting the data.
|
| Only if the PDFs are unknown or were created by way of a
| cellphone camera, multifunction office device, etc should you
| need to reach for OCR.
|
| The moment you need to use this kind of technology you are in a
| completely different regime of what the business will (should)
| tolerate.
| themanmaran wrote:
| > Only if the PDFs are unknown or were created by way of a
| cellphone camera, multifunction office device, etc should you
| need to reach for OCR.
|
| It's always safer to OCR on every file. Sometimes you'll have a
| "clean" pdf that has a screenshot of an Excel table. Or a
| scanned image that has already been OCR'd by a lower quality
| tool (like the built in Adobe OCR). And if you rely on this
| you're going to get pretty unpredictable results.
|
| It's way easier (and more standardized) to run OCR on every
| file, rather than trying to guess at the contents based on the
| metadata.
| bob1029 wrote:
| It's not guessing if the form is known and you can read the
| information directly.
|
| This is a common scenario at many banks. You can expect
| nearly perfect metadata for anything pushed into their
| document storage system within the last decade.
| themanmaran wrote:
| Oh yea if the form is known and standardized everything is
| a lot easier.
|
| But we work with banks on our side, and one of the most
| common scenarios is customers uploading
| financials/bills/statements from 1000's of different
| providers. In which case it's impossible to know every
| format in advance.
| SilentM68 wrote:
| I would like to see how it performs with massively warped and
| skewed scanned text images, basically a scanned image where the
| text lines are wavy as opposed as straight horizontal, where the
| letters are elongated. One where the line widths are different
| depending on the position on the scanned image. I once had to
| deal with such a task that somebody gave me with OCR software,
| Acrobat, and other tools could not decode the mess so I had to
| recreate the 30 pages myself, manually. Not a fun thing to do but
| that is a real use case.
| arcfour wrote:
| Garbage in, garbage out?
| edude03 wrote:
| "Yes" but if a human could do it "AI" should be able to do it
| too.
| janalsncm wrote:
| The hard ones are things like contracts, leases, and financial
| documents which 1) don't have a common format 2) are filled with
| numbers proper nouns and addresses which it's _really_ important
| not to mess up 3) cannot be inferred from context.
|
| Typical OCR pipeline would be to pass the doc through a
| character-level OCR system then correct errors with a statistical
| model like an LLM. An LLM can help correct "crodit card" to
| "credit card" but it cannot correct names or numbers. It's really
| bad if it replaces a 7 with a 2.
| groby_b wrote:
| Perusing the web site, it's depressing how much behind Mistral is
| on basic "how can I make this a compelling hook for customers"
| for the page.
|
| The notebook link? An ACL'd doc
|
| The examples don't even include a small text-to-markdown sample.
|
| The before/after slider is cute, but useless - SxS is a much
| better way to compare.
|
| Trying it in "Le Chat" requires a login.
|
| It's like an example of "how can we implement maximum loss across
| our entire funnel". (I have no doubt the underlying tech does
| well, but... damn, why do you make it so hard to actually see it,
| Mistral?)
|
| If anybody tried it and has shareable examples - can you post a
| link? Also, anybody tried it with handwriting yet?
| dehrmann wrote:
| Is this burying the lede? OCR is a solved problem, but
| structuring document data from scans isn't.
| jslezak wrote:
| Has anyone tried it for handwriting?
|
| So far Gemini is the only model I can get decent output from for
| a particular hard handwriting task
| pqdbr wrote:
| I tried with both PDFs and PNGs in Le Chat and the results were
| the worst I've ever seen when compared to any other model
| (Claude, ChatGPT, Gemini).
|
| So bad that I think I need to enable the OCR function somehow,
| but couldn't find it.
| computergert wrote:
| I'm experiencing the same. Maybe the sentence "Mistral OCR
| capabilities are free to try on le Chat." was a hallucination.
| troyvit wrote:
| It worked perfectly for me with a simple 2 page PDF that
| contained no graphics or formatting beyond headers and list
| items. Since it was so small I had the time to proof-read it
| and there were no errors. It added some formatting, such as
| bolding headers in list items and putting tics around file and
| function names. I won't complain.
| sunami-ai wrote:
| Making Transformers the same cost as CNN's (which are used in
| character-level ocr, as opposed to image-patch-level) is a good
| thing. The problem with CNN based character-level OCR is not the
| recognition models but the detection models. In a former life, I
| found a way to increase detection accuracy, and, therefore,
| overall OCR accuracy, and used that as an enhancement on top of
| Amazon and Google OCR. It worked really well. But the transformer
| approach is more powerful and if it can be done for $1 per 1000
| pages, that is a game changer, IMO, at least of incumbents
| offering traditional character-level OCR.
| menaerus wrote:
| It certainly isn't the same cost if expressed as a non-
| subsidized $$$ one needs for the Transformers compute aka
| infra.
|
| CNNs trained specifically for OCR can run in real time on as
| small compute as a mobile device is.
| srinathkrishna wrote:
| Given the fact that multi-modal LLMs are getting so good at OCR
| these days, is it a shame that we can't do local OCR with high
| accuracy in the near-term?
| coolspot wrote:
| This is $1 per 1000 pages.
|
| For comparison, Azure Document Intelligence is $1.5/1000 pages
| for general OCR and $30/1000 pages for "custom extraction".
| kapitalx wrote:
| Co-founder of doctly.ai here (OCR tool)
|
| I love mistral and what they do. I got really excited about this,
| but a little disappointed after my first few tests.
|
| I tried a complex table that we use as a first test of any new
| model, and Mistral OCR decided the entire table should just be
| extracted as an 'image' and returned this markdown:
|
| ```  ```
|
| I'll keep testing, but so far, very disappointing :(
|
| This document I try is the entire reason we created Doctly to
| begin with. We needed an OCR tool for regulatory documents we use
| and nothing could really give us the right data.
|
| Doctly uses a judge, OCRs a document against multiple LLMs and
| decides which one to pick. It will continue to run the page until
| the judge scores above a certain score.
|
| I would have loved to add this into the judge list, but might
| have to skip it.
| infecto wrote:
| Why pay more for doctly than an AWS Textract?
| kapitalx wrote:
| Great question. The language models are definitely beating
| the old tools. Take a look at Gemini for example.
|
| Doctly runs a tournament style judge. It will run multiple
| generations across LLMs and pick the best one. Outperforming
| single generation and single model.
| the_mitsuhiko wrote:
| Would love to see the test file.
| Starlord2048 wrote:
| would be glad to see benchmarking results
| kapitalx wrote:
| This is a good idea. We should publish a benchmark
| results/comparison.
| fnordpiglet wrote:
| Interestingly I'm currently going through and scanning the
| hundreds of journal papers my grandfather authored in medicine
| and thinking through what to do about graphs. I was expecting
| to do some form of multiphase agent based generation of LaTeX
| or SVG rather than a verbal summary of the graphs. At least in
| his generation of authorship his papers clearly explained the
| graphs already. I was pretty excited to see your post naturally
| but when I looked at the examples what I saw was, effectively,
| a more verbose form of
|
| ```  ```
|
| I'm assuming this is partially because your use case is
| targeting RAG under various assumptions bur also partially
| because multimodal models aren't near what I would need to be
| successful with?
| kapitalx wrote:
| We need to update the examples on the front page. Currently
| for things that are considered charts/graphs/figures we
| convert to a description. For things like logos or images we
| do an image tag. You can also choose to exclude them.
|
| The difference with this is that it took the entire page as
| an image tag (it's just a table of text in my document).
| rather than being more selective.
|
| I do like that they give you coordinates for the images
| though, we need to do something like that.
|
| Give the actual tool a try. Would love to get your feedback
| for that use case. It gives you 100 free credits initially
| but if you email me (ali@doctly.ai), I can give you an extra
| 500 (goes for anyone else here also)
| niwtsol wrote:
| If you have a judge system, and Mistral performs well on other
| tests, wouldn't you want to include it so if it scores the
| highest by your judges ranking it would select the most
| accurate result? Or are you saying that mistral's image
| markdown would score higher on your judge score?
| kapitalx wrote:
| We'll definitely be doing more tests, but the results I got
| on the complex tests would result in a lower score and might
| not be worth the extra cost of the judgement itself.
|
| In our current setup Gemini wins most often. We enter
| multiple generations from each model into the 'tournament',
| sometimes one generation from gemini could be at the top
| while another in the bottom, for the same tournament.
| bambax wrote:
| Where did you test it? At the end of the post they say:
|
| > _Mistral OCR capabilities are free to try on le Chat_
|
| but when asked, Le Chat responds:
|
| > _can you do ocr?_
|
| > _I don 't have the capability to perform Optical Character
| Recognition (OCR) directly. However, if you have an image with
| text that you need to extract, you can describe the text or
| provide details, and I can help you with any information or
| analysis related to that text. If you need OCR functionality,
| you might need to use a specialized tool or service designed
| for that purpose._
|
| Edit: Tried anyway by attaching an image; it said it could do
| OCR and then output... completely random text that had
| absolutely nothing to do with the text in the image!...
| Concerning.
|
| Tried again with a better definition image, output only the
| first twenty words or so of the page.
|
| Did you try using the API?
| kapitalx wrote:
| Yes I used the API. They have examples here:
|
| https://docs.mistral.ai/capabilities/document/
|
| I used base64 encoding of the image of the pdf page. The
| output was an object that has the markdown, and coordinates
| for the images:
|
| [OCRPageObject(index=0, markdown='',
| images=[OCRImageObject(id='img-0.jpeg', top_left_x=140,
| top_left_y=65, bottom_right_x=2136, bottom_right_y=1635,
| image_base64=None)], dimensions=OCRPageDimensions(dpi=200,
| height=1778, width=2300))] model='mistral-
| ocr-2503-completion'
| usage_info=OCRUsageInfo(pages_processed=1,
| doc_size_bytes=634209)
| Grosvenor wrote:
| Does doctly do handwritten forms like dates?
|
| I have a lot of "This document filed and registered in the
| county of ______ on ______ of _____ 2023" sort of thing.
| kapitalx wrote:
| We've been getting great results with those aswell. But
| ofcourse there is always some chance of not getting it
| perfect, specially with different handwritings.
|
| Give it a try, no credit cards needed to try it. If you email
| me (ali@doctly.ai) i can give you extra free credits for
| testing.
| Grosvenor wrote:
| Just tried it. Got all the dates correct and even extracted
| signatures really well.
|
| Now to figure out how many millions of pages I have.
| owenpalmer wrote:
| This is incredibly exciting. I've been pondering/experimenting on
| a hobby project that makes reading papers and textbooks easier
| and more effective. Unfortunately the OCR and figure extraction
| technology just wasn't there yet. This is a game changer.
|
| Specifically, this allows you to associate figure references with
| the actual figure, which would allow me to build a UI that solves
| the annoying problem of looking for a referenced figure on
| another page, which breaks up the flow of reading.
|
| It also allows a clean conversion to HTML, so you can add cool
| functionality like clicking on unfamiliar words for definitions,
| or inserting LLM generated checkpoint questions to verify
| understanding. I would like to see if I can automatically
| integrate Andy Matuschak's Orbit[0] SRS into any PDF.
|
| Lots of potential here.
|
| [0] https://docs.withorbit.com/
| generalizations wrote:
| Wait does this deal with images?
| ezfe wrote:
| The output includes images from the input. You can see that
| on one of the examples where a logo is cropped out of the
| source and included in the result.
| NalNezumi wrote:
| >a UI that solves the annoying problem of looking for a
| referenced figure on another page, which breaks up the flow of
| reading.
|
| A tangent but this exact issue is what I was frustrated for a
| long time with pdf reader and reading science papers. Then I
| found sioyek that pops up a small window when you hover over
| links (references and equations and figures) and it solved it.
|
| Granted, the pdf file must be in right format, so OCR could
| make this experience better. Just saying the UI component of
| that already exist
|
| https://sioyek.info/
| PerryStyle wrote:
| Zotero's PDF viewer also does this now. Being able to
| annotate PDFs and having a reference manager has been a life
| saver.
| polytely wrote:
| I don't need AGI just give me superhuman OCR so we can turn all
| existing pdfs into text* and cheaply host it.
|
| Feels like we are almost there.
|
| *: https://annas-archive.org/blog/critical-window.html
| coolspot wrote:
| This is $1 per 1000 pages. For comparison, Azure Document
| Intelligence is $1.5/1000 pages for general OCR and $30/1000
| pages for "custom extraction".
| 0cf8612b2e1e wrote:
| Given the wide variety of pricing on all of these providers, I
| keep wondering how the economics work. Do they have fantastic
| margin on some of these products or is it a matter of
| subsidizing the costs, hoping to capture the market? Last I
| heard, OpenAI is still losing money.
| thegabriele wrote:
| I'm using gemini to solve textual CAPTCHA with some good results
| (better than untrained OCR).
|
| I will give this a shot
| bugglebeetle wrote:
| Congrats to Mistral for yet again releasing another closed source
| thing that costs more than running an open source equivalent:
|
| https://github.com/DS4SD/docling
| Squarex wrote:
| I am all for open source, but where do you see benchmarks that
| conclude that it's just equivalent?
| bugglebeetle wrote:
| Where do you see open source benchmark results that confirm
| Mistral's performance?
| anonymousd3vil wrote:
| Back in my days Mistral used to torrent models.
| Asraelite wrote:
| I never thought I'd see the day where technology finally advanced
| far enough that we can edit a PDF.
| randomNumber7 wrote:
| I never thought driving a car is harder than editing a pdf.
| pzo wrote:
| It's not about harder but about what error you can tolerate.
| Here if you have accuracy 99% for many applications it's
| enough. If you have 99% accuracy per trip of no crash during
| self driving then you gonna be dead within a year very
| likely.
|
| For cars we need accuracy at least 99.99% and that's very
| hard.
| rtsil wrote:
| I doubt most people have 99% accuracy. The threshold of
| tolerance for error is just much lower for any self-driving
| system (and with good reason, because we're not familiar
| with them yet).
| KeplerBoy wrote:
| How do you define 99% accuracy?
|
| I guess something like success rate for a trip (or mile)
| would be a more reasonable metric. Most people have a
| success rate far higher than 99% for averages trips.
|
| Most people who commute daily are probably doing
| something like a 1000 car rides a year and have minor
| accidents every few years. 99% success rates would mean
| monthly accidents.
| Apofis wrote:
| Foxit PDF exists...
| toephu2 wrote:
| I've been able to edit PDFs (95%+ of them) accurately for the
| past 10 years...
| thiago_fm wrote:
| For general use this will be good.
|
| But I bet that simple ML will lead to better OCRs when you are
| doing anything specialized, such as, medical documents, invoices
| etc.
| sureglymop wrote:
| Looks good but in the first hover/slider demo one can see how it
| could lead to confusion when handling side by side content.
|
| Table 1 is referred to in section `2 Architectural details` but
| before `2.1 Multimodal Decoder`. In the generated markdown though
| it is below the latter section, as if it was in/part of that
| section.
|
| Of course I am nitpicking here but just the first thing I
| noticed.
| 0cf8612b2e1e wrote:
| Does anything handle dual columns well? Despite being the
| academic standard, it seemingly throws off every generic tool.
| serjester wrote:
| This is cool! With that said for anyone looking to use this in
| RAG, the downside to specialized models instead of general VLMs
| is you can't easily tune it to your use specific case. So for
| example, we use Gemini to add very specific alt text to images in
| the extracted Markdown. It's also 2 - 3X the cost of Gemini Flash
| - hopefully the increased performance is significant.
|
| Regardless excited to see more and more competition in the space.
|
| Wrote an article on it: https://www.sergey.fyi/articles/gemini-
| flash-2-tips
| hyuuu wrote:
| gemini flash is notorious for hallucinating the output of the
| OCR, be careful with it. For straight forward, semi-structured,
| low page count (under 5) it should perform well, but the more
| the context window is stretched the more the output gets more
| unreliable
| oysterville wrote:
| Dupe of an hour previous post
| https://news.ycombinator.com/item?id=43282489
| beebaween wrote:
| Wonder how it does with table data in pdfs / page-long tabular
| data?
| blackeyeblitzar wrote:
| A similar but different product that was discussed on HN is
| OlmOCR from AI2, which is open source:
|
| https://news.ycombinator.com/item?id=43174298
| hubraumhugo wrote:
| It will be interesting to see how all the companies in the
| document processing space adapt as OCR becomes a commodity.
|
| The best products will be defined by everything "non-AI", like
| UX, performance and reliability at scale, and human-in-the loop
| feedback for domain experts.
| trollied wrote:
| They will offer integrations into enterprise systems, just like
| they do today.
|
| Lots of big companies don't like change. The existing document
| processing companies will just silently start using this sort
| of service to up their game, and keep their existing
| relationships.
| hyuuu wrote:
| I 100% agree with this, I think you can even extend this to any
| AI, in the end, IMO, as the llm is more commoditized, the
| surface of which the value is delivered will matter more
| lokl wrote:
| Tried with a few historical handwritten German documents,
| accuracy was abysmal.
| rvnx wrote:
| Probably they are overfitting the benchmarks, since other users
| also complain of the low accuracy
| Thaxll wrote:
| HTR ( Handwritten Text Recognition ) is a completely different
| space than OCR. What were you expecting exactly?
| riquito wrote:
| It fits the "use cases" mentioned in the article
|
| > Preserving historical and cultural heritage: Organizations
| and nonprofits that are custodians of heritage have been
| using Mistral OCR to digitize historical documents and
| artifacts, ensuring their preservation and making them
| accessible to a broader audience.
| Thaxll wrote:
| There is a difference between historical document and "my
| doctor prescription".
|
| Someone coming here and saying it does not work with my old
| german hanwriting doesn't say much.
| riquito wrote:
| You're making a strawman, the parent specifically
| mentioned "historical handwritten documents"
| anothermathbozo wrote:
| Optical Character Recognition (OCR) and Handwritten Text
| Recognition (HTR) are different tasks
| lysace wrote:
| Semi-OT (similar language): The national archives in Sweden and
| Finland published a model for OCR:ing handwritten Swedish text
| from the 1600s to the 1800s with what to me seems like a _very_
| level of accuracy given the source material. (4% character
| error rate)
|
| https://readcoop.eu/model/the-swedish-lion-i/
|
| https://www.transkribus.org/success-story/creating-the-swedi...
|
| https://huggingface.co/Riksarkivet
|
| They have also published a fairly large volume of OCR:ed texts
| (IIRC birth/death notices from church records) using this model
| online. As a beginner genealogist it's been fun to follow.
| thadt wrote:
| Also working with historical handwritten German documents. So
| far Gemini seems to be the least wrong of the ones I've tried -
| any recommendations?
| evmar wrote:
| I noticed on the Arabic example they lost a space after the first
| letter on the third to last line, can any native speakers
| confirm? (I only know enough Arabic to ask dumb questions like
| this, curious to learn more.)
|
| Edit: it looks like they also added a vowel mark not present in
| the input on the line immediately after.
|
| Edit2: here's a picture of what I'm talking about, the
| before/after: https://ibb.co/v6xcPMHv
| resiros wrote:
| Arabic speaker here. No, it's perfect.
| evmar wrote:
| I am pretty sure it added a kasrah not present in the input
| on the 2nd to last line. (Not saying it's not super
| impressive, and also that almost certainly is the right word,
| but I think that still means not quite "perfect"?)
| gl-prod wrote:
| Yes, it looks like it did add a kasrah to the word Zhry
| yoda97 wrote:
| Yep, and fmin too, this is not just OCR, it made some
| post-processing corrections or "enhancements". That could
| be good, but it could also be trouble the 1% chance it
| makes a mistake in critical documents.
| gl-prod wrote:
| He means the space between the waw (w) and the word
| evmar wrote:
| I added a pic to the original comment, sorry for not being
| clear!
| albatrosstrophy wrote:
| And here I thought after reading the headline: finally a
| reliable Arabic OCR. I've never in my life found a good that
| does the job decently especially for a scanned document. Or is
| there something out there I don't know about?
| th0ma5 wrote:
| A great question for people wanting to use OCR in business is...
| Which digits in monetary amounts can you tolerate being
| incorrect?
| kbyatnal wrote:
| We're approaching the point where OCR becomes "solved" -- very
| exciting! Any legacy vendors providing pure OCR are going to get
| steamrolled by these VLMs.
|
| However IMO, there's still a large gap for businesses in going
| from raw OCR outputs --> document processing deployed in prod for
| mission-critical use cases. LLMs and VLMs aren't magic, and
| anyone who goes in expecting 100% automation is in for a
| surprise.
|
| You still need to build and label datasets, orchestrate pipelines
| (classify -> split -> extract), detect uncertainty and correct
| with human-in-the-loop, fine-tune, and a lot more. You can
| certainly get close to full automation over time, but it's going
| to take time and effort. But the future is on the horizon!
|
| Disclaimer: I started a LLM doc processing company to help
| companies solve problems in this space (https://extend.app/)
| risyachka wrote:
| >> Any legacy vendors providing pure OCR are going to get
| steamrolled by these VLMs.
|
| -OR- they can just use these APIs, and considering that they
| have a client base - which would prefer to not rewrite
| integrations to get the same result - they can get rid of most
| code base, replace it with llm api and increase margins by 90%
| and enjoy good life.
| esafak wrote:
| They're going to become commoditized unless they add value
| elsewhere. Good news for customers.
| TeMPOraL wrote:
| They are (or at least could easily be) adding value in form
| of SLA - charging money for giving guarantees on accuracy.
| This is both better for customer, who gets concrete
| guarantees and someone to shift liability to, and for the
| vendor, that can focus on creating techniques and systems
| for getting that extra % of reliability out of the LLM OCR
| process.
|
| All of the above are things companies - particularly larger
| ones - are happy to pay for, because ORC is just a cog in
| the machine, and this makes it more reliable and
| predictable.
|
| On top of the above, there are auxiliary value-adds such a
| vendor could provide - such as, being fully compliant with
| every EU directive and regulation that's in power, or about
| to be. There's plenty of those, they overlap, and no one
| wants to deal with it if they can outsource it to someone
| who already figured it out (and will take the blame for
| fuckups).
| dml2135 wrote:
| One problem I've encountered at my small startup in evaluating
| OCR technologies is precisely convincing stakeholders that the
| "human-in-the-loop" part is both unavoidable, and ultimately
| beneficial.
|
| PMs want to hear that an OCR solution will be fully automated
| out-of-the-box. My gut says that anything offering that is
| snake-oil, and I try to convey that the OCR solution they want
| _is_ possible, but if you are unwilling to pay the tuning cost,
| it's going to flop out of the gate. At that point they lose
| interest and move on to other priorities.
| kbyatnal wrote:
| Yup definitely, and this is exactly why I built my startup.
| I've heard this a bunch across startups & large enterprises
| that we work with. 100% automation is an impossible target,
| because even humans are not 100% perfect. So how we can
| expect LLMs to be?
|
| But that doesn't mean you have to abandon the effort. You can
| still definitely achieve production-grade accuracy! It just
| requires having the right tooling in place, which reduces the
| upfront tuning cost. We typically see folks get there on the
| order of days or 1-2 weeks (it doesn't necessarily need to
| take months).
| techwizrd wrote:
| The challenge I have is how to get bounding boxes for the OCR,
| for things like redaction/de-identification.
| kbyatnal wrote:
| yeah that's a fun challenge -- what we've seen work well is a
| system that forces the LLM to generate citations for all
| extracted data, map that back to the original OCR content,
| and then generate bounding boxes that way. Tons of edge cases
| for sure that we've built a suite of heuristics for over
| time, but overall works really well.
| dontlikeyoueith wrote:
| Why would you do this and not use Textract?
| dontlikeyoueith wrote:
| AWS Textract works pretty well for this and is much cheaper
| than running LLMs.
| daemonologist wrote:
| Textract is more expensive than this (for your first 1M
| pages per month at least) and significantly more than
| something like Gemini Flash. I agree it works pretty well
| though - definitely better than any of the open source pure
| OCR solutions I've tried.
| einpoklum wrote:
| An LLM with billions of parameters for extracting text from a
| PDF (which isn't even a rasterized image) really does not
| "solve OCR".
| mvac wrote:
| Great progress, but unfortunately, for our use case (converting
| medical textbooks from PDF to MD), the results are not as good as
| those by MinerU/PDF-Extract-Kit [1].
|
| Also the collab link in the article is broken, found a functional
| one [2] in the docs.
|
| [1] https://github.com/opendatalab/MinerU [2]
| https://colab.research.google.com/github/mistralai/cookbook/...
| owenpalmer wrote:
| I've been searching relentlessly for something like this! I
| wonder why it's been so hard to find... is it the Chinese?
|
| In any case, thanks for sharing.
| 101008 wrote:
| Is this free in LeChat? I uploaded a handwritten text and it
| stopped after the 4th word.
| bsnnkv wrote:
| Someone working there has good taste to include a Nizar Qabbani
| poem.
| rvz wrote:
| > "Fastest in its category"
|
| Not one mention of the company that they have partnered with and
| that is Cerebras AI and that is the reason they have fast
| inference [0]
|
| Literally no-one here is talking about them and they are about to
| IPO.
|
| [0] https://cerebras.ai/blog/mistral-le-chat
| pilooch wrote:
| But what's the need exactly for OCR when you have multimodal LLMs
| that can read the same info and directly answer any questions
| about it ?
|
| For a VLLM, my understanding is that OCR corresponds to a sub-
| field of questions, of the type 'read exactly what's written in
| this document'.
| daemonologist wrote:
| It's useful to have the plain text down the line for operations
| not involving a language model (e.g. search). Also if you have
| a bunch of prompts you want to run it's potentially cheaper,
| although perhaps less accurate, to run the OCR once and save
| yourself some tokens or even use a smaller model for subsequent
| prompts.
| ks2048 wrote:
| Tons of uses: Storage (text instead of images), search (user
| typing in a text box and you want instant retrieval from a
| dataset), etc. And costs: run on images once - then the rest of
| your queries will only need to run on text.
| simonw wrote:
| The biggest risk of vision LLMs for OCR is that they might
| accidentally follow instructions is the text that they are
| meant to be processing.
|
| (I asked Mistral if their OCR system was vulnerable to this and
| they said "should be robust, but curious to see if you find any
| fun examples" -
| https://twitter.com/simonw/status/1897713755741368434 and
| https://twitter.com/sophiamyang/status/1897719199595720722 )
| pilooch wrote:
| Fun, but LLMs would follow them post OCR anyways ;)
|
| I see OCR much like phonemes in speech, once you have end to
| end systems, they become latent constructs from the past.
|
| And that is actually good, more code going into models
| instead.
| troyvit wrote:
| Getting PDFs into #$@ Confluence apparently. Just had to do
| this and Mistral saved me a ton of hassle compared to this:
| https://community.atlassian.com/forums/Confluence-questions/...
| gatienboquet wrote:
| I feel like i can't create an agent with their OCR model yet ? Is
| it something planned or it's only API?
| simonw wrote:
| What do you mean by agent?
| gatienboquet wrote:
| La Plateforme agent builder -
| https://console.mistral.ai/build/agents/new
| kiratp wrote:
| It's shocking how much our industry fails to see past its own
| nose.
|
| Not a single example on that page is a Purchase Order, Invoice
| etc. Not a single example shown is relevant to industry at scale.
| kashnote wrote:
| Fwiw, they have an example of a parking receipt in a cookbook:
| https://colab.research.google.com/github/mistralai/cookbook/...
| guiomie wrote:
| Agreed. In general I've had such bad performance for complex
| table based invoice parsing, that every few months I try the
| latest models to see if its better. It does say "96.12" on top-
| tier benchmark under the Table category.
| mtillman wrote:
| We find CV models to be better (higher midpoint on an ROC
| curve) for the types of docs you mention.
| simpaticoder wrote:
| Another good example would be contracts of any kind. Imagine
| photographing a contract (like a car loan) and on the spot
| getting an AI to read it, understand it, forecast scenarious,
| highlight red flags, and do some comparison shopping for you.
| JBiserkov wrote:
| ... imagining ...
|
| ... hallucinating during read ...
|
| ... hallucinating during understand ...
|
| ... hallucinating during forecast ...
|
| ... highlighting a hallucination as red flag ...
|
| ... missing an actual red flag ...
|
| ... consuming water to cool myself...
|
| Phew, being an AI is hard!
| merb wrote:
| Mistral is Europe based where invoices are more or less sent
| digitally in like 95% of all the cases anyway. Some are even
| digital invoices, which will at some point in the eu be
| mandatory. For orders there are proposals for that, too. And
| basically invoice data extraction is a different beast.
| wolfi1 wrote:
| even in Europe this is still a thing, I know of systems which
| still are unable to read items having more than one line
| (costing s sh*tload of money)
| napolux wrote:
| Can confirm, in Italy electronic invoicing is mandatory since
| 2019
| revnode wrote:
| So an invoice attached to an email as a PDF is sent digitally
| ... those unfamiliar with PDF will think text and data
| extraction is trivial then, but this isn't true. You can have
| a fully digital, non-image PDF that is vector based and has
| what looks like text, but doesn't have a single piece of
| extractable text in it. It's all about how the PDF was
| generated. Tables can be formatted in a million ways, etc.
|
| Your best bet is to always convert it to an image and OCR it
| to extract structured data.
| codetrotter wrote:
| One use-case is digitising receipts from business related
| travels for expenses that employees paid for out of their own
| pocket and which they are submitting pictures to the business
| for reimbursement.
|
| Bus travels, meals including dinners and snacks, etc. for
| which the employee has receipts on paper.
| kiratp wrote:
| This isn't even close to true.
|
| Source: We have large EU customers.
| arpinum wrote:
| Businesses at scale use EDI to handle purchase orders and
| invoices, no OCR needed.
| cdolan wrote:
| Thats simply not a factual statement.
|
| Scaled businesses do USE edi, but they still receive hundreds
| of thousands of PDF documents a month
|
| source: built a saas product that handles pdfs for a specific
| industry
| mentalgear wrote:
| To be fair: Reading the blog post, the main objective seems to
| have been to enable information extraction with high confidence
| for the academic sector (e.g. unlocking all these paper pdfs),
| and not necessarily to be another receipt scanner.
| kiratp wrote:
| It hilarious that the academic sector 1. publishes as PDF 2.
| spends all this energy on how to extract that info back from
| PDF 3. publishes that research as PDF as well.
|
| Receipt scanning is a multiple orders of magnitude more
| valuable business. Mistral at this point is looking for a
| commercial niche (like how Claude is aiming at software
| development)
| sha16 wrote:
| I wanted to apply OCR to my company's invoicing since they
| basically did purchasing for a bunch of other large companies,
| but the variability in the conversion was not tolerable. Even
| rounding something differently could catch an accountant's eye,
| let alone detecting a "8" as a "0" or worse.
| dotnetkow wrote:
| Agreed, though in this case, they are going for general-purpose
| OCR. That's fine in some cases, but purpose-built models
| trained on receipts, invoices, tax documents, etc., definitely
| perform better. We've got a similar API solution coming out
| soon (https://digital.abbyy.com/code-extract-automate-your-new-
| mus...) that should work better for businesses automating their
| docs at scale.
| qwertox wrote:
| We developers seem to really dislike PDFs, to a degree that we'll
| build LLMs and have them translate it into Markdown.
|
| Jokes aside, PDFs really serve a good purpose, but getting data
| out of them is usually really hard. They should have something
| like an embedded Markdown version with a JSON structure
| describing the layout, so that machines can easily digest the
| data they contain.
| jgalt212 wrote:
| I think you might be looking for PDF/A.
|
| https://www.adobe.com/uk/acrobat/resources/document-files/pd...
|
| For example, if you print a word doc to PDF, you get the raw
| text in PDF form, not an image of the text.
| gpvos wrote:
| PDF/A doesn't require preserving the document structure, only
| that any text is extractable.
| d_llon wrote:
| It's disappointing to see that the benchmark results are so
| opaque. I hope we see reproducible results soon, and hopefully
| from Mistral themselves.
|
| 1. We don't know what the evaluation setup is. It's very possible
| that the ranking would be different with a bit of prompt
| engineering.
|
| 2. We don't know how large each dataset is (or even how the
| metrics are calculated/aggregated). The metrics are all reported
| as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW%
| -- is just noise.[1]
|
| 3. We don't know how the datasets were mined or filtered. Mistral
| could have (even accidentally!) filtered out particularly data
| points that their model struggled with. (E.g., imagine good-
| meaning engineer testing a document with Mistral OCR first,
| finding it doesn't work, and deducing that it's probably bad data
| and removing it.)
|
| [1] https://medium.com/towards-data-science/digit-
| significance-i...
| s4i wrote:
| I wonder how good it would be to convert sheet music to MusicXML.
| All the current tools more or less suck with this task, or maybe
| I'm just ignorant and don't know what lego bricks to put
| together.
| adrianh wrote:
| Try our machine-learning powered sheet music scanning engine at
| Soundslice:
|
| https://www.soundslice.com/sheet-music-scanner/
|
| Definitely doesn't suck.
| protonbob wrote:
| Wow this basically "solves" DRM for books as well as opening up
| the door for digitizing old texts more accurately.
| shmoogy wrote:
| What's the general time for something like this to hit
| openrouter? I really hate having accounts everywhere when I'm
| trying to test new things.
| bondolo wrote:
| Such a shame that PDF doesn't just, like, include the semantic
| structure of the document by default. It is brilliant that we
| standardized on an archival document format that doesn't include
| direct access to the document text or structure as a core
| intrinsic default feature.
|
| I say this with great anger as someone who works in accessibility
| and has had PDF as a thorn in my side for 30 years.
| andai wrote:
| Tables? I regularly run into PDFs where even the body text is
| mangled!
| NeutralForest wrote:
| I agree with this so much. I've tried to sometimes push friends
| and family to use text formats (at least I sent them something
| like Markdown), which is very easy to render in the browser
| anyways. But often you have to fall back to PDF, which I
| dislike very much. There's so much content like books and
| papers that are in PDF as well. Why did we pick a binary blob
| as shareable format again?
| meatmanek wrote:
| > Why did we pick a binary blob as shareable format again?
|
| PDF was created to solve the problem of being able to render
| a document the same way on different computers, and it mostly
| achieved that goal. Editable formats like .doc, .html, .rtf
| were unreliable -- different software would produce different
| results, and even if two computers have the exact same
| version of Microsoft Word, they might render differently
| because they have different fonts available. PDFs embed the
| fonts needed for the document, and specify exactly where each
| character goes, so they're fully self-contained.
|
| After Acrobat Reader became free with version 2 in 1994,
| everybody with a computer ended up downloading it after
| running across a PDF they needed to view. As it became more
| common for people to be able to view PDFs, it became more
| convenient to produce PDFs when you needed everybody to be
| able to view your document consistently. Eventually, the
| ability to produce PDFs became free (with e.g. Office 2007 or
| Mac OS X's ability to print to PDF), which cemented PDF's
| popularity.
|
| Notably, the original goals of PDF had nothing to do with
| being able to copy text out of them -- the goal was simply to
| produce a perfect reproduction of the document on
| screen/paper. That wasn't enough of an inconvenience to
| prevent PDF from becoming popular. (Some people saw the
| inability for people to easily copy text from them as a
| benefit -- basically a weak form of text DRM.)
| cess11 wrote:
| PDF is pretty strictly modeled on printed documents and their
| mainstream typography at the time of invention of Postscript
| and so on.
|
| Printed documents do not have any structure beyond the paper
| and placement of ink on them.
| lukasb wrote:
| Even assuming you could get people to do the work (probably the
| real issue here) could a single schema syntax capture the
| semantics of the universe of documents that exist as PDFs? PDFs
| succeeded because they could reproduce anything.
| euleriancon wrote:
| html
| OrvalWintermute wrote:
| I'm happy to see this development after being underwhelmed with
| Chatgpt OCR!
| climb_stealth wrote:
| Does this support Japanese? They list a table of language
| comparisons againat other approaches but I can't tell if it is
| exhaustive.
|
| I'm hoping that something like this will be able to handle
| 3000-page Japanese car workshop manuals. Because traditional OCR
| really struggles with it. It has tables, graphics, text in
| graphics, the whole shebang.
| hyuuu wrote:
| It's a weird timing because I just launched https://dochq.io - ai
| document extraction where you can define what you need to get out
| your documents in plain English, I legitimately thought that this
| was going to be such a niche product but hell, there has been a
| very rapid rise for AI-based OCR lately, an article/tweet even
| went viral 2 weeks ago I think? About using Gemini to do OCR, fun
| times.
| sixhobbits wrote:
| Nice demos but I wonder how well it does on longer files. I've
| been experimenting with passing some fairly neat PDFs to various
| LLMs for data extraction. They're created from Excel exports and
| some of the data is cut off or badly laid out, but it's all
| digitally extractable.
|
| The challenge isn't so much the OCR part, but just the length.
| After one page the LLMs get "lazy" and just skip bits or stop
| entirely.
|
| And page by page isn't trivial as header rows are repeated or
| missing etc.
|
| So far my experience has definitely been that the last 2% of the
| content still takes the most time to accurately extract for large
| messy documents, and LLMs still don't seem to have a one-shot
| solve for that. Maybe this is it?
| hack_ml wrote:
| You will have to send one page at a time, most of this work has
| to be done via RAG. Adding a large context (like a whole PDF),
| still does not work that well in my experience.
| lysace wrote:
| Nit: Please change the URL from
|
| https://mistral.ai/fr/news/mistral-ocr
|
| to
|
| https://mistral.ai/news/mistral-ocr
|
| The article is the same, but the site navigation is in English
| instead of French.
|
| Unless it's a silent statement, of course. =)
| lblume wrote:
| For me, the second page redirects to the first. (And I don't
| live in France.)
| anovick wrote:
| How does one use it to identify bounding rectangles of
| images/diagrams in the PDF?
| Oras wrote:
| I feel this is created for RAG. I tried a document [0] that I
| tested with OCR; it got all the table values correctly, but the
| page's footer was missing.
|
| Headers and footers are a real pain with RAG applications, as
| they are not required, and most OCR or PDF parsers will return
| them, and there is extract work to do to remove them.
|
| [0]
| https://github.com/orasik/parsevision/blob/main/example/Mult...
| maCDzP wrote:
| Oh - on premise solution - awesome!
| cavisne wrote:
| Its funny how Gemini consistently beats googles dedicated
| document API.
| jjice wrote:
| I'm not surprised honestly - it's just the newer better things
| vs their older offering
| atemerev wrote:
| So, the only thing that stopped AI from learning from all our
| science and taking over the world was the difficulty of
| converting PDFs of academic papers to more computer readable
| formats.
|
| Not anymore.
| noloz wrote:
| Are there any open source projects with the same goal?
| submeta wrote:
| Is this able to convert pdf flowcharts into yaml or json
| representations of them? I have been experimenting with Claude
| 3.5. It has been very good at readig / understanding/ converting
| into representations of flow charts.
|
| So I am wondering if this is more capable. Will try definitely,
| but maybe someone can chime in.
| bambax wrote:
| It's not bad! But it still hallucinates. Here's an example of an
| (admittedly difficult) image:
|
| https://i.imgur.com/jcwW5AG.jpeg
|
| For the blocks in the center, it outputs:
|
| > _Claude, duc de Saint-Simon, pair et chevalier des ordres,
| gouverneur de Blaye, Senlis, etc., ne le 16 aout 1607 , 3 mai
| 1693 ; ep. 1*, le 26 septembre 1644, Diane - Henriette de Budos
| de Portes, morte le 2 decembre 1670; 2*, le 17 octobre 1672,
| Charlotte de l 'Aubespine, morte le 6 octobre 1725._
|
| This is perfect! But then the next one:
|
| > _Louis, commandeur de Malte, Louis de Fay Laurent bre 1644,
| Diane - Henriette de Budos de Portes, de Cressonsac. du
| Chastelet, mortilhomme aux gardes, 2 juin 1679._
|
| This is really bad because
|
| 1/ a portion of the text of the previous bloc is repeated
|
| 2/ a portion of the next bloc is imported here where it shouldn't
| be ("Cressonsac"), and of the right most bloc ("Chastelet")
|
| 3/ but worst of all, a whole word is invented, "mortilhomme" that
| appears nowhere in the original. (The word doesn't exist in
| French so in that case it would be easier to spot; but the risk
| is when words are invented, that do exist and "feel right" in the
| context.)
|
| (Correct text for the second bloc should be:
|
| > _Louis, commandeur de Malte, capitaine aux gardes, 2 juin
| 1679._ )
| bambax wrote:
| Another test with a text in English, which is maybe more fair
| (although Mistral is a French company ;-). This image is from
| Parliamentary debates of the parliament of New Zealand in
| 1854-55:
|
| https://i.imgur.com/1uVAWx9.png
|
| Here's the output of the first paragraph, with mistakes in
| brackets:
|
| > _drafts would be laid on the table, and a long discussion
| would ensue; whereas a Committee would be able to frame a
| document which, with perhaps a few verbal emundations_
| [emendations] _, would be adopted; the time of the House would
| thus be saved, and its business expected_ [expedited] _. With
| regard to the question of the comparative advantages of The-
| day_ [Tuesday]* and Friday, he should vote for the amendment,
| on the principle that the wishes of members from a distance
| should be considered on all sensations _[occasions]_ where a
| principle would not be compromised or the convenience of the
| House interfered with. He hoped the honourable member for the
| Town of Christchurch would adopt the suggestion he (Mr.
| Forssith _[Forsaith]_ ) had thrown out and said _[add]_ to his
| motion the names of a Committee.*
|
| Some mistakes are minor (emnundations/emendations or
| Forssith/Forsaith), but others are very bad, because they are
| unpredictable and don't correspond to any pattern, and
| therefore can be very hard to spot: _sensations_ instead of
| occasions, or _expected_ in lieu of expedited... That last one
| really changes the meaning of the sentence.
| spudlyo wrote:
| I want to rejoice that OCR is now a "solved" problem, but I
| feel like hallucinations are just as problematic as the kind of
| stuff I have to put up with tesseract -- both require careful
| manual proofreading for an acceptable degree of confidence. I
| guess I'll have to try it and see for myself just how much
| better these solutions are for my public domain archive.org
| Latin language reader & textbook projects.
| layer8 wrote:
| > This is perfect!
|
| Just a nit, but I wouldn't call it perfect when using U+25CB *
| WHITE CIRCLE instead of what should be U+00BA o MASCULINE
| ORDINAL INDICATOR, or alternatively a superscript "o". These
| are https://fr.wikipedia.org/wiki/Adverbe_ordinal.
|
| There's also extra spaces after both "1607" and "1693", and
| around the hyphen in "Diane-Henriette".
|
| Lastly, U+2019 instead of U+0027 would be more appropriate for
| the apostrophe, all the more since in the image it looks like
| the former and not like the latter.
| TeMPOraL wrote:
| This is "reasoning model" stuff even for humans :).
| layer8 wrote:
| There is OCR software that analyses which language is used,
| and than applies heuristics for the recognized language to
| steer the character recognition in terms of character
| sequence likelihoods and punctuation rules.
|
| I don't think you need a reasoning model for that, just
| better training; although conversely a reasoning model
| should hopefully notice the errors.
| neom wrote:
| I gave it a bunch of my wifes 18th century English scans to
| transcribe, mostly couldn't do it, and it's been doing this for
| 15 minutes now, not sure why but i find quite amusing:
| https://share.zight.com/L1u2jZYl
| Zopieux wrote:
| Saving you a click: no, it cannot be self hosted (unless you have
| a few million dollars laying around)
| dotnetkow wrote:
| Congrats to the Mistral team for launching! A general-purpose OCR
| model is useful, of course. However, more purpose-built solutions
| are a must to convert business documents reliably. AI models pre-
| trained on specific document types perform better and are more
| accurate. Coming soon from the ABBYY team, we're shipping a new
| OCR API designed to be consistent, reliable, and hallucination-
| free. Check it out if you're looking for best-in-class DX:
| https://digital.abbyy.com/code-extract-automate-your-new-mus...
| riffic wrote:
| It'd be great if this could be tested against genealogical
| documents written in cursive like oh most of the documents on
| microfilm stored by the LDS on familysearch, or eastern european
| archival projects etc.
| peterburkimsher wrote:
| Does it work for video subtitles? And in Chinese? I'm looking to
| transcribe subtitles of live music recordings from ANHOP and
| KHOP.
| nyeah wrote:
| It's not fair to call it a "Mistrial" just because it
| hallucinates a little bit.
___________________________________________________________________
(page generated 2025-03-06 23:00 UTC)