hngopher.com

       [HN Gopher] Mistral OCR
       ___________________________________________________________________
        
       Mistral OCR
        
       Author : littlemerman
       Score  : 834 points
       Date   : 2025-03-06 17:39 UTC (5 hours ago)
        
 (HTM) web link (mistral.ai)
 (TXT) w3m dump (mistral.ai)
        
       | vessenes wrote:
       | Dang. Super fast and significantly more accurate than google,
       | Claude and others.
       | 
       | Pricing : $1/1000 pages, or per 2k pages if "batched". I'm not
       | sure what batching means in this case: multiple pdfs? Why not
       | split them to halve the cost?
       | 
       | Anyway this looks great at pdf to markdown.
        
         | jacksnipe wrote:
         | I would assume this is 1 request containing 2k pages vs N
         | requests whose total pages add up to 1000.
        
         | abiraja wrote:
         | Batching likely means the response is not real-time. You set up
         | a batch job and they send you the results later.
        
           | vessenes wrote:
           | That makes sense. Idle time is nearly free after all.
        
           | ozim wrote:
           | If only business people I work with would understand 100GB
           | even transfer over the network is not going to return
           | immediately results ;)
        
         | sophiebits wrote:
         | Batched often means a higher latency option (minutes/hours
         | instead of seconds), which providers can schedule more
         | efficiently on their GPUs.
        
         | Tostino wrote:
         | Usually (With OpenAI, I haven't checked Mistral yet) it means
         | an async api rather than a sync api.
         | 
         | e.g. you submit multiple requests (pdfs) in one call, and get
         | back an id for the batch. You then can check on the status of
         | that batch and get the results for everything when done.
         | 
         | It lets them use their available hardware to it's full capacity
         | much better.
        
         | odiroot wrote:
         | May I ask as a layperson, how would you about using this to OCR
         | multiple hundreds of pages? I tried the chat but it pretty much
         | stops after the 2nd page.
        
           | sneak wrote:
           | Submit the pages via the API.
        
             | odiroot wrote:
             | This worked indeed. Although I had to cut my document into
             | smaller chunks. 900 pages at once ended with a timeout.
        
           | beklein wrote:
           | You can check the example code on the Mistral documentation,
           | you would _only_ have to change the value of the variable
           | `document_url` to the URL of your uploaded PDF... and you
           | need to change the `MISTRAL_API_KEY` to the value of your
           | specific key that you can get from the Le Platforme webpage.
           | 
           | https://docs.mistral.ai/capabilities/document/#ocr-with-pdf
        
             | odiroot wrote:
             | Thanks!
        
         | kapitalx wrote:
         | From my testing so far, it seems it's super fast and responded
         | synchronously. But it decided that the entire page is an image
         | and returned `![img-0.jpeg](img-0.jpeg)` with coordinates in
         | the metadata for the image, which is the entire page.
         | 
         | Our tool, doctly.ai is much slower and async, but much more
         | accurate and gets you the content itself as an markdown.
        
           | ralusek wrote:
           | I thought we stopped -ly company names ~8 years ago?
        
             | yieldcrv wrote:
             | if you talk to people gen-x and older, you still _need_
             | .com domains
             | 
             | for all those people that aren't just clicking on a link on
             | their social media feed, chat group, or targeted ad
        
             | kapitalx wrote:
             | Haha for sure. Naming isn't just the hardest problem in
             | computer science, it's always hard. But at some point you
             | just have to pick something and move forward.
        
       | newfocogi wrote:
       | They say: "releasing the API mistral-ocr-latest at 1000 pages /
       | $"
       | 
       | I had to reread that a few times. I assume this means 1000pg/$1
       | but I'm still not sure about it.
        
         | bredren wrote:
         | Ya, presumably it is missing the number `1.00`.
        
           | groby_b wrote:
           | Not really. When you go 60 mph (or km/h) you don't specify
           | the 1.00 for the hours either. pages/$ is the unit, 1000 is
           | the value.
        
         | svachalek wrote:
         | Yeah you can read it as "pages per dollar" or as a unit
         | "pages/$", it all comes out the same meaning.
        
         | dgfl wrote:
         | Great example of how information is sometimes compartmentalized
         | arbitrarily in the brain: I imagine you have never been
         | confused by sentences such as "I'm running at 10 km/h".
        
           | mkl wrote:
           | Dollar signs go before the number, not after it like units.
           | It needs to be 1000 pages/$1 to make sense, whereas 10km and
           | 10h and 10/h all make sense so 10km/h does. I imagine you
           | would be confused by km/h 10 but not $10.
        
         | amelius wrote:
         | Hmm, can it read small print? ;)
        
       | pawelduda wrote:
       | It outperforms the competition significantly AND can extract
       | embedded images from the text. I really like LLMs for OCR more
       | and more. Gemini was already pretty good at it
        
       | sbarre wrote:
       | 6 years ago I was working with a very large enterprise that was
       | struggling to solve this problem, trying to scan millions of
       | arbitrary forms and documents per month to clearly understand key
       | points like account numbers, names and addresses, policy numbers,
       | phone numbers, embedded images or scribbled notes, and also draw
       | relationships between these values on a given form, or even
       | across forms.
       | 
       | I wasn't there to solve that specific problem but it was
       | connected to what we were doing so it was fascinating to hear
       | that team talk through all the things they'd tried, from brute-
       | force training on templates (didn't scale as they had too many
       | kinds of forms) to every vendor solution under the sun (none
       | worked quite as advertised on their data)..
       | 
       | I have to imagine this is a problem shared by so many companies.
        
       | jcuenod wrote:
       | Just tested with a multilingual (bidi) English/Hebrew document.
       | 
       | The Hebrew output had no correspondence to the text whatsoever
       | (in context, there was an English translation, and the Hebrew
       | produced was a back-translation of that).
       | 
       | Their benchmark results are impressive, don't get me wrong. But
       | I'm a little disappointed. I often read multilingual document
       | scans in the humanities. Multilingual (and esp. bidi) OCR is
       | challenging, and I'm always looking for a better solution for a
       | side-project I'm working on (fixpdfs.com).
       | 
       | Also, I thought OCR implied that you could get bounding boxes for
       | text (and reconstruct a text layer on a scan, for example). Am I
       | wrong, or is this term just overloaded, now?
        
         | nicodjimenez wrote:
         | You can get bounding boxes from our pdf api at Mathpix.com
         | 
         | Disclaimer, I'm the founder
        
           | kergonath wrote:
           | Mathpix is ace. That's the best results I got so far for
           | scientific papers and reports. It understands the layout of
           | complex documents very well, it's quite impressive. Equations
           | are perfect, figures extraction works well.
           | 
           | There are a few annoying issues, but overall I am very happy
           | with it.
        
             | nicodjimenez wrote:
             | Thanks for the kind words. What are some of the annoying
             | issues?
        
       | notepad0x90 wrote:
       | I was just watching a science-related video containing math
       | equations. I wondered how soon will I be able to ask the video
       | player "What am I looking at here, describe the equations" and it
       | will OCR the frames, analyze them and explain them to me.
       | 
       | It's only a matter of time before "browsing" means navigating
       | HTTP sites via LLM prompts. although, I think it is critical that
       | LLM input should NOT be restricted to verbal cues. Not everyone
       | is an extrovert that longs to hear the sound of their own voices.
       | A lot of human communication is non-verbal.
       | 
       | Once we get over the privacy implications (and I do believe this
       | can only be done by worldwide legislative efforts), I can imagine
       | looking at a "website" or video, and my expressions, mannerisms
       | and gestures will be considered prompts.
       | 
       | At least that is what I imagine the tech would evolve into in 5+
       | years.
        
         | devmor wrote:
         | Good lord, I dearly hope not. That sounds like a coddled
         | hellscape world, something you'd see made fun of in Disney's
         | Wall-E.
        
           | notepad0x90 wrote:
           | hence my comment about privacy and need for legislation :)
           | 
           | It isn't the tech that's the problem but the people that will
           | abuse it.
        
             | devmor wrote:
             | While those are concerns, my point was that having
             | everything on the internet navigated to, digested and
             | explained to me sounds unpleasant and overall a drain on my
             | ability to think and reason for myself.
             | 
             | It is _specifically_ how you describe using the tech that
             | provokes a feeling of revulsion to me.
        
               | notepad0x90 wrote:
               | Then I think you misunderstand. The ML system would know
               | when you want things digested to you or not. Right now
               | companies are assuming this and forcing LLM interaction.
               | But when properly done, the system would know based on
               | your behavior or explicit prompts what you want and
               | provide the service. If you're staring at a paragraph
               | intently and confused, it might start highlighting common
               | phrases or parts of the text/picture that might be hard
               | to grasp and based on your reaction to that, it might
               | start describing things via audio,tool tips,side
               | pane,etc.. In other words, if you don't like how and when
               | you're interacting with the LLM ecosystem, then that is
               | an immature and failing ecosystem, in my vision this
               | would be a largely solved problems, like how we interact
               | with keyboards,mouse and touchscreens today.
        
         | abrichr wrote:
         | > I wondered how soon will I be able to ask the video player
         | "What am I looking at here, describe the equations" and it will
         | OCR the frames, analyze them and explain them to me.
         | 
         | Seems like https://aiscreenshot.app might fit the bill.
        
         | groby_b wrote:
         | Now? OK, you need to screencap and upload to LLM, but that's
         | well established tech by now. (Where by "well established", I
         | mean at least 9 months old ;)
         | 
         | Same goes for "navigating HTTP sites via LLM prompts". Most
         | LLMs have web search integration, and the "Deep Research"
         | variants do more complex navigation.
         | 
         | Video chat is there partially, as well. It doesn't really pay
         | much attention to gestures & expressions, but I'd put the
         | "earliest possible" threshold for that a good chunk closer than
         | 5 years.
        
           | notepad0x90 wrote:
           | Yeah, all these things are possible today, but getting them
           | well polished and integrated is another story. Imagine all
           | this being supported by "HTML6" lol. When apple gets around
           | to making this part of safari, then we know it's ready.
        
             | groby_b wrote:
             | That's a great upper-bound estimator ;)
             | 
             | But kidding aside - I'm not sure people _want_ this being
             | supported by web standards. We could be a _huge_ step
             | closer to that future had we decided to actually take RDF
             | /Dublin Core/Microdata seriously. (LLMs perform a lot
             | better with well-annotated data)
             | 
             | The unanimous verdict across web publishers was "looks like
             | a lot of work, let's not". That is, ultimately, why we need
             | to jump through all the OCR hoops. Not only did the world
             | not annotate the data, it then proceeded to remove as many
             | traces of machine readability as possible.
             | 
             | So, the likely gating factor is probably not Apple & Safari
             | & "HTML6" (shudder!)
             | 
             | If I venture my best bet what's preventing polished
             | integration: It's really hard to do via foundational models
             | only, and the number of people who want to have deep &
             | well-informed conversations via a polished app enough that
             | they're willing to pay for an app that does that is low
             | enough that it's not the hot VC space. (Yet?)
             | 
             | Crystal ball: Some OSS project will probably get within
             | spitting distance of something really useful, but also
             | probably flub the UX. Somebody else will take up these
             | ideas while it's hot and polish it in a startup. So, 18-36
             | months for an integrated experience from here?
        
       | andoando wrote:
       | Bit unrelated but is there anything that can help with really low
       | resolution text? My neighbor got hit and run the other day for
       | example, and I've been trying every tool I can to make out some
       | of the letters/numbers on the plate
       | 
       | https://ibb.co/mr8QSYnj
        
         | busymom0 wrote:
         | There are photo enhancers online. But your picture is way too
         | pixelated to get any useful info from it.
        
           | tjoff wrote:
           | If you know the font in advance (which you often do in these
           | cases) you can do insane reconstructions. Also keep in mind
           | that it doesn't have to be a perfect match, with the help of
           | the color and other facts (such as likely location) about the
           | car you can narrow it down significantly.
        
           | zellyn wrote:
           | Maybe if you had multiple frames, and used something very
           | clever?
        
         | flutas wrote:
         | Looks like a paper temp tag. Other than that, I'm not sure much
         | can be had from it.
        
         | zinglersen wrote:
         | Finding the right subreddit and asking there is probably a
         | better approach if you want to maximize the chances of getting
         | the plate 'decrypted'.
        
         | dewey wrote:
         | To even get started on this you'd also need to share some
         | contextual information like continent, country etc. I'd say.
        
           | andoando wrote:
           | Its in CA, looks like paper plates which follow a specific
           | format and the last two seem to be the numbers '64'. Police
           | should be able to search for temp tag with partial match and
           | match the make/model. Was curious to see if any software
           | could help though
        
         | rvnx wrote:
         | If it's a video, sharing a few frames can help as well
        
       | TriangleEdge wrote:
       | One of my hobby projects while in University was to do OCR on
       | book scans. Doing character recognition was solved, but finding
       | the relationship between characters was very difficult. I tried
       | "primitive" neural nets, but edge cases would often break what I
       | built. Super cool to me to see such an order of magnitude in
       | improvement here.
       | 
       | Does it do hand written notes and annotations? What about meta
       | information like highlighting? I am also curious if LLMs will get
       | better because more access to information if it can be
       | effectively extracted from PDFs.
        
         | jcuenod wrote:
         | * Character recognition on monolingual text in a narrow domain
         | is solved
        
       | jervant wrote:
       | I wonder how it compares to USPS workers at deciphering illegible
       | handwriting.
        
       | linklater12 wrote:
       | Document processing is where b2b SAAS is at.
        
       | opwieurposiu wrote:
       | Related, does anyone know of an app that can read gauges from an
       | image and log the number to influx? I have a solar power meter in
       | my crawlspace, it is inconvenient to go down there. I want to
       | point an old phone at it and log it so I can check it easily. The
       | gauge is digital and looks like this:
       | 
       | https://www.pvh2o.com/solarShed/firstPower.jpg
        
         | dehrmann wrote:
         | You'll be happier finding a replacement meter that has an
         | interface to monitor it directly or a second meter. An old
         | phone and OCR will be very brittle.
        
           | haswell wrote:
           | Not OP, but it sounds like the kind of project I'd undertake.
           | 
           | Happiness for me is about exploring the problem within
           | constraints and the satisfaction of building the solution.
           | Brittleness is often of less concern than the fun factor.
           | 
           | And some kinds of brittleness can be managed/solved, which
           | adds to the fun.
        
             | arcfour wrote:
             | I would posit that learning how the device works, and how
             | to integrate with a newer digital monitoring device would
             | be just as interesting _and_ less brittle.
        
               | haswell wrote:
               | Possibly! But I've recently wanted to dabble with
               | computer vision, so I'd be looking at a project like this
               | as a way to scratch a specific itch. Again, not OP so I
               | don't know what their priorities are, but just offering
               | one angle for why one might choose a less "optimal"
               | approach.
        
         | renewiltord wrote:
         | 4o transcribes it perfectly. You can usually root an old
         | Android and write this app in ~2h with LLMs if unfamiliar. The
         | hard part will be maintaining camera lens cleanliness and
         | alignment etc.
         | 
         | The time cost is so low that you should give it a gander.
         | You'll be surprised how fast you can do it. If you just take
         | screenshots every minute it should suffice.
        
           | pavl wrote:
           | What software-tools do you usw to Programm the APP?
        
         | ubergeek42 wrote:
         | This[1] is something I've come across but not had a chance to
         | play with, designed for reading non-smart meters that might
         | work for you. I'm not sure if there's any way to run it on an
         | old phone though.
         | 
         | [1] https://github.com/jomjol/AI-on-the-edge-device
        
           | timc3 wrote:
           | I use this for a watermeter. Works quite well as long as you
           | have a good SD card
        
           | jasonjayr wrote:
           | Wow. I was looking at hooking my water meter into home
           | assistant, and was going to investigate just counting an
           | optical pulse (it has a white portion on the gear that is in
           | a certain spot every .1 gal) This is like the same meter I
           | use, and perfect.
           | 
           | (It turns out my electric meter, though analog, blasts out
           | it's reading on RF every 10 seconds unencrypted. I got that
           | via my RTL-SDR reciever :) )
        
         | ramses0 wrote:
         | https://www.home-assistant.io/integrations/seven_segments/
         | 
         | https://www.unix-ag.uni-kl.de/~auerswal/ssocr/
         | 
         | https://github.com/tesseract-ocr/tesseract
         | 
         | https://community.home-assistant.io/t/ocr-on-camera-image-fo...
         | 
         | https://www.google.com/search?q=home+assistant+ocr+integrati...
         | 
         | https://www.google.com/search?q=esphome+ocr+sensor
         | 
         | https://hackaday.com/2021/02/07/an-esp-will-read-your-meter-...
         | 
         | ...start digging around and you'll likely find something. HA
         | has integrations which can support writing to InfluxDB (local
         | for sure, and you can probably configure it for a remote
         | influxdb).
         | 
         | You're looking at 1xRaspberry PI, 1xUSB Webcam, 1x"Power
         | Management / humidity management / waterproof electrical box"
         | to stuff it into, and then either YOLO and DIY to shoot over to
         | your influxdb, or set up a Home Assistant and "attach" your
         | frankenbox as some sort of "sensor" or "integration" which
         | spits out metrics and yadayada...
        
         | BonoboIO wrote:
         | Gemini Free Tier would surely work
        
       | ChemSpider wrote:
       | "World's best OCR model" - that is quite a statement. Are there
       | any well-known benchmarks for OCR software?
        
         | xnx wrote:
         | https://huggingface.co/spaces/echo840/ocrbench-leaderboard
        
           | ChemSpider wrote:
           | Interesting. But no mistral on it yet?
        
         | themanmaran wrote:
         | We published this benchmark the other week. We'll can update
         | and run with Mistral today!
         | 
         | https://github.com/getomni-ai/benchmark
        
           | kergonath wrote:
           | Excellent. I am looking forward to it.
        
           | cdolan wrote:
           | Came here to see if you all had run a benchmark on it yet :)
        
         | WhitneyLand wrote:
         | It's interesting that none of the existing models can decode a
         | Scrabble board screen shot and give an accurate grid of
         | characters.
         | 
         | I realize it's not a common business case, came across it
         | testing how well LLMs can solve simple games. On a side note,
         | if you bypass OCR and give models a text layout of a board
         | standard LLMs cannot solve Scrabble boards but the thinking
         | models usually can.
        
         | resource_waste wrote:
         | Its Mistral, they are the only homegrown AI Europe has, so
         | people pretend they are meaningful.
         | 
         | I'll give it a try, but I'm not holding my breath. I'm a huge
         | AI Enthusiast and I've yet to be impressed with anything
         | they've put out.
        
       | sashank_1509 wrote:
       | Really cool, thanks Mistral!
        
       | z2 wrote:
       | Is there a reliable handwriting OCR benchmark out there (updated,
       | not a blog post)? Despite the gains claimed for printed text, I
       | found (anecdotally) that trying to use Mistral OCR on my messy
       | cursive handwriting to be much less accurate than GPT-4o, in the
       | ballpark of 30% wrong vs closer to 5% wrong for GPT-4o.
       | 
       | Edit: answered in another post:
       | https://huggingface.co/spaces/echo840/ocrbench-leaderboard
        
         | dannyobrien wrote:
         | Simon Willison linked to an impressive demo of Qwen2-VL in this
         | area: I haven't found a version of it that I could run locally
         | yet to corroborate.
         | https://simonwillison.net/2024/Sep/4/qwen2-vl/
        
       | aperrien wrote:
       | Is this model open source?
        
         | daemonologist wrote:
         | No (nor is it open-weights).
        
       | cxie wrote:
       | The new Mistral OCR release looks impressive - 94.89% overall
       | accuracy and significantly better multilingual support than
       | competitors. As someone who's built document processing systems
       | at scale, I'm curious about the real-world implications.
       | 
       | Has anyone tried this on specialized domains like medical or
       | legal documents? The benchmarks are promising, but OCR has always
       | faced challenges with domain-specific terminology and formatting.
       | 
       | Also interesting to see the pricing model ($1/1000 pages) in a
       | landscape where many expected this functionality to eventually be
       | bundled into base LLM offerings. This feels like a trend where
       | previously encapsulated capabilities are being unbundled into
       | specialized APIs with separate pricing.
       | 
       | I wonder if this is the beginning of the componentization of AI
       | infrastructure - breaking monolithic models into specialized
       | services that each do one thing extremely well.
        
         | unboxingelf wrote:
         | We'll just stick LLM Gateway LLM in front of all the
         | specialized LLMs. MicroLLMs Architecture.
        
           | cxie wrote:
           | I actually think you're onto something there. The "MicroLLMs
           | Architecture" could mirror how microservices revolutionized
           | web architecture.
           | 
           | Instead of one massive model trying to do everything, you'd
           | have specialized models for OCR, code generation, image
           | understanding, etc. Then a "router LLM" would direct queries
           | to the appropriate specialized model and synthesize
           | responses.
           | 
           | The efficiency gains could be substantial - why run a 1T
           | parameter model when your query just needs a lightweight OCR
           | specialist? You could dynamically load only what you need.
           | 
           | The challenge would be in the communication protocol between
           | models and managing the complexity. We'd need something like
           | a "prompt bus" for inter-model communication with
           | standardized inputs/outputs.
           | 
           | Has anyone here started building infrastructure for this kind
           | of model orchestration yet? This feels like it could be the
           | Kubernetes moment for AI systems.
        
             | arcfour wrote:
             | This is already done with agents. Some agents only have
             | tools and the one model, some agents will orchestrate with
             | other LLMs to handle more advanced use cases. It's pretty
             | obvious solution when you think about how to get good
             | performance out of a model on a complex task when useful
             | context length is limited: just run multiple models with
             | their own context and give them a supervisor model--just
             | like how humans organize themselves in real life.
        
             | fnordpiglet wrote:
             | I'm doing this personally for my own project - essentially
             | building an agent graph that starts with the image output,
             | orients and cleans, does a first pass with tesseract LSTM
             | best models to create PDF/HOCR/Alto, then pass to other
             | LLMs and models based on their strengths to further refine
             | towards markdown and latex. My goal is less about RAG
             | database population but about preserving in a non manually
             | typeset form the structure and data and analysis, and there
             | seems to be pretty limited tooling out there since the goal
             | generally seems to be the obviously immediately commercial
             | goal of producing RAG amenable forms that defer the "heavy"
             | side of chart / graphic / tabular reproduction to a future
             | time.
        
         | kergonath wrote:
         | > Has anyone tried this on specialized domains like medical or
         | legal documents?
         | 
         | I'll try it on a whole bunch of scientific papers ASAP. Quite
         | excited about this.
        
         | stavros wrote:
         | What do you mean by "free"? Using the OpenAI vision API, for
         | example, for OCR is quite a bit more expensive than $1/1k
         | pages.
        
         | salynchnew wrote:
         | Also interesting to see that parts of the training
         | infrastructure to create frontier models is itself being
         | monetized.
        
         | PeterStuer wrote:
         | I'd love to try it for my domain (regulation), but $1/1000
         | pages is significantly more expensive than my current local
         | Docling based setup that already does a great job of processing
         | PDF's for my needs.
        
           | yawnxyz wrote:
           | I think for regulated fields / high impact fields $1/1000 is
           | well-worth the price; if the accuracy is close to 100% this
           | is way better than using people, who are still error-prone
        
         | janalsncm wrote:
         | I have done OCR on leases. It's hard. You have to be accurate
         | and they all have bespoke formatting.
         | 
         | It would almost be easier to switch everyone to a common format
         | and spell out important entities (names, numbers) multiple
         | times similar to how cheques do.
         | 
         | The utility of the system really depends on the makeup of that
         | last 5%. If problematic documents are consistently predictable,
         | it's possible to do a second pass with humans. But if they're
         | random, then you have to do every doc with humans and it
         | doesn't save you any time.
        
         | epolanski wrote:
         | At my client we want to provide an AI that can retrieve
         | relevant information from documentation (home building
         | business, documents detail how to install a solar panel or a
         | shower, etc) and we've set up an entire system with benchmarks,
         | agents, etc, yet the bottleneck is OCR!
         | 
         | We have millions and millions of pages of documents and an off
         | by 1 % error means it compounds with the AI's own error, which
         | compounds with documentation itself being incorrect at times,
         | which leads it all to be not production ready (and indeed the
         | project has never been released), not even close.
         | 
         | We simply cannot afford to give our customers incorrect
         | informatiin
         | 
         | We have set up a backoffice app that when users have questions,
         | it sends it to our workers along the response given by our AI
         | application and the person can review it, and ideally correct
         | the ocr output.
         | 
         | Honestly after an year of working it feels like AI right now
         | can only be useful when supervised all the time (such as when
         | coding). Otherwise I just find LLMs still too unreliable
         | besides basic bogus tasks.
        
           | PeterStuer wrote:
           | As someone who has had a home built, and nearly all my
           | friends and acquaintances report the same thing, having a 1%
           | error on information in this business would mean not a 10x
           | but a 50x improvement over the current practice in the field.
           | 
           | If nobody is supervising building documents all the time
           | during the process, every house would be a pile of rubbish.
           | And even when you do stuff stills creeps in and has to be
           | redone, often more than once.
        
         | themanmaran wrote:
         | Excited to test this our on our side as well. We recently built
         | an OCR benchmarking framework specifically for VLMs[1][2], so
         | we'll do a test run today.
         | 
         | From our last benchmark run, some of these numbers from Mistral
         | seem a little bit optimistic. Side by side of a few models:
         | 
         | model | omni | mistral |
         | 
         | gemini | 86% | 89% |
         | 
         | azure | 85% | 89% |
         | 
         | gpt-4o | 75% | 89% |
         | 
         | google | 68% | 83% |
         | 
         | Currently adding the Mistral API and we'll get results out
         | today!
         | 
         | [1] https://github.com/getomni-ai/benchmark
         | 
         | [2] https://huggingface.co/datasets/getomni-ai/ocr-benchmark
        
           | jaggs wrote:
           | By optimistic, do you mean 'tweaked'? :)
        
         | kbyatnal wrote:
         | re: real world implications, LLMs and VLMs aren't magi, and
         | anyone who goes in expecting 100% automation is in for a
         | surprise (especially in domains like medical or legal).
         | 
         | IMO there's still a large gap for businesses in going from raw
         | OCR outputs --> document processing deployed in prod for
         | mission-critical use cases.
         | 
         | e.g. you still need to build and label datasets, orchestrate
         | pipelines (classify -> split -> extract), detect uncertainty
         | and correct with human-in-the-loop, fine-tune, and a lot more.
         | You can certainly get close to full automation over time, but
         | it's going to take time and effort.
         | 
         | But for RAG and other use cases where the error tolerance is
         | higher, I do think these OCR models will get good enough to
         | just solve that part of the problem.
         | 
         | Disclaimer: I started a LLM doc processing company to help
         | companies solve problems in this space (https://extend.app/)
        
         | janis1234 wrote:
         | $1 for 1000 pages seems high to me. Doing a google search
         | 
         | Rent and Reserve NVIDIA A100 GPU 80GB - Pricing Starts from
         | $1.35/hour
         | 
         | I just don't know if in 1 hour and with a A100 I can process
         | more than a 1000 pages. I'm guessing yes.
        
           | blackoil wrote:
           | Is the model Open Source/Weight? Else the cost is for the
           | model, not GPU.
        
       | alberth wrote:
       | Curious to see how this performance against more real world usage
       | of someone taking a photo of text (which the text then becomes
       | slightly blurred) and performing OCR on it.
       | 
       | I can't exactly tell if the "Mistral 7B" image is an example of
       | this exact scenario.
        
       | roboben wrote:
       | Le chat doesn't seem to know about this change despite the blog
       | post stating it. Can anyone explain how to use it in Le Chat?
        
         | kapitalx wrote:
         | Looks to be API only for now. Documentation here:
         | https://docs.mistral.ai/capabilities/document/
        
         | troyvit wrote:
         | I asked LeChat this question:
         | 
         | If I upload a small PDF to you are you able to convert it to
         | markdown?
         | 
         | LeChat said yes and away we went.
        
       | jacooper wrote:
       | Pretty cool, would love to use this with paperless, but I just
       | can't bring myself to send a photo of all my documents to a third
       | party, especially legal and sensitive documents, which is what I
       | use Paperless for.
       | 
       | Because of that I'm stuck with crappy vision on Ollama (Thanks to
       | AMDs crappy ROCm support for Vllm)
        
       | jbverschoor wrote:
       | Ohhh. Gonna test it out with some 100+ year old scribbles :)
        
       | WhitneyLand wrote:
       | 1. There's no simple page / sandbox to upload images and try it.
       | Fine, I'll code it up.
       | 
       | 2. "Explore the Mistral AI APIs" (https://docs.mistral.ai) links
       | to all apis except OCR.
       | 
       | 3. The docs on the api params refer to document chunking and
       | image chunking but no details on how their chunking works?
       | 
       | So much unnecessary friction smh.
        
         | cooperaustinj wrote:
         | There is an OCR page on the link you provided. It includes a
         | very, very simple curl command (like most of their docs).
         | 
         | I think the friction here exists outside of Mistral's control.
        
           | kergonath wrote:
           | > There is an OCR page on the link you provided.
           | 
           | I don't see it either. There might be some caching issue.
        
       | deadbabe wrote:
       | LLM based OCR is a disaster, great potential for hallucinations
       | and no estimate of confidence. Results might seem promising but
       | you'll always be wondering.
        
         | menaerus wrote:
         | CNN-based OCR also have "hallucinations" and Transformers
         | aren't that much different in that respect. This is a problem
         | solved with domain specific post-processing.
        
         | leumon wrote:
         | well already in 2013 ocr systems used in xerox scanners (turned
         | on by default!) randomly altered numbers, so its not an issue
         | only occuring in llms.
        
       | bob1029 wrote:
       | > It takes images and PDFs as input
       | 
       | If you are working with PDF, I would suggest a hybrid process.
       | 
       | It is feasible to extract information with 100% accuracy from
       | PDFs that were generated using the mappable acrofields approach.
       | In many domains, you have a fixed set of forms you need to
       | process and this can be leveraged to build a custom tool for
       | extracting the data.
       | 
       | Only if the PDFs are unknown or were created by way of a
       | cellphone camera, multifunction office device, etc should you
       | need to reach for OCR.
       | 
       | The moment you need to use this kind of technology you are in a
       | completely different regime of what the business will (should)
       | tolerate.
        
         | themanmaran wrote:
         | > Only if the PDFs are unknown or were created by way of a
         | cellphone camera, multifunction office device, etc should you
         | need to reach for OCR.
         | 
         | It's always safer to OCR on every file. Sometimes you'll have a
         | "clean" pdf that has a screenshot of an Excel table. Or a
         | scanned image that has already been OCR'd by a lower quality
         | tool (like the built in Adobe OCR). And if you rely on this
         | you're going to get pretty unpredictable results.
         | 
         | It's way easier (and more standardized) to run OCR on every
         | file, rather than trying to guess at the contents based on the
         | metadata.
        
           | bob1029 wrote:
           | It's not guessing if the form is known and you can read the
           | information directly.
           | 
           | This is a common scenario at many banks. You can expect
           | nearly perfect metadata for anything pushed into their
           | document storage system within the last decade.
        
             | themanmaran wrote:
             | Oh yea if the form is known and standardized everything is
             | a lot easier.
             | 
             | But we work with banks on our side, and one of the most
             | common scenarios is customers uploading
             | financials/bills/statements from 1000's of different
             | providers. In which case it's impossible to know every
             | format in advance.
        
       | SilentM68 wrote:
       | I would like to see how it performs with massively warped and
       | skewed scanned text images, basically a scanned image where the
       | text lines are wavy as opposed as straight horizontal, where the
       | letters are elongated. One where the line widths are different
       | depending on the position on the scanned image. I once had to
       | deal with such a task that somebody gave me with OCR software,
       | Acrobat, and other tools could not decode the mess so I had to
       | recreate the 30 pages myself, manually. Not a fun thing to do but
       | that is a real use case.
        
         | arcfour wrote:
         | Garbage in, garbage out?
        
           | edude03 wrote:
           | "Yes" but if a human could do it "AI" should be able to do it
           | too.
        
       | janalsncm wrote:
       | The hard ones are things like contracts, leases, and financial
       | documents which 1) don't have a common format 2) are filled with
       | numbers proper nouns and addresses which it's _really_ important
       | not to mess up 3) cannot be inferred from context.
       | 
       | Typical OCR pipeline would be to pass the doc through a
       | character-level OCR system then correct errors with a statistical
       | model like an LLM. An LLM can help correct "crodit card" to
       | "credit card" but it cannot correct names or numbers. It's really
       | bad if it replaces a 7 with a 2.
        
       | groby_b wrote:
       | Perusing the web site, it's depressing how much behind Mistral is
       | on basic "how can I make this a compelling hook for customers"
       | for the page.
       | 
       | The notebook link? An ACL'd doc
       | 
       | The examples don't even include a small text-to-markdown sample.
       | 
       | The before/after slider is cute, but useless - SxS is a much
       | better way to compare.
       | 
       | Trying it in "Le Chat" requires a login.
       | 
       | It's like an example of "how can we implement maximum loss across
       | our entire funnel". (I have no doubt the underlying tech does
       | well, but... damn, why do you make it so hard to actually see it,
       | Mistral?)
       | 
       | If anybody tried it and has shareable examples - can you post a
       | link? Also, anybody tried it with handwriting yet?
        
       | dehrmann wrote:
       | Is this burying the lede? OCR is a solved problem, but
       | structuring document data from scans isn't.
        
       | jslezak wrote:
       | Has anyone tried it for handwriting?
       | 
       | So far Gemini is the only model I can get decent output from for
       | a particular hard handwriting task
        
       | pqdbr wrote:
       | I tried with both PDFs and PNGs in Le Chat and the results were
       | the worst I've ever seen when compared to any other model
       | (Claude, ChatGPT, Gemini).
       | 
       | So bad that I think I need to enable the OCR function somehow,
       | but couldn't find it.
        
         | computergert wrote:
         | I'm experiencing the same. Maybe the sentence "Mistral OCR
         | capabilities are free to try on le Chat." was a hallucination.
        
         | troyvit wrote:
         | It worked perfectly for me with a simple 2 page PDF that
         | contained no graphics or formatting beyond headers and list
         | items. Since it was so small I had the time to proof-read it
         | and there were no errors. It added some formatting, such as
         | bolding headers in list items and putting tics around file and
         | function names. I won't complain.
        
       | sunami-ai wrote:
       | Making Transformers the same cost as CNN's (which are used in
       | character-level ocr, as opposed to image-patch-level) is a good
       | thing. The problem with CNN based character-level OCR is not the
       | recognition models but the detection models. In a former life, I
       | found a way to increase detection accuracy, and, therefore,
       | overall OCR accuracy, and used that as an enhancement on top of
       | Amazon and Google OCR. It worked really well. But the transformer
       | approach is more powerful and if it can be done for $1 per 1000
       | pages, that is a game changer, IMO, at least of incumbents
       | offering traditional character-level OCR.
        
         | menaerus wrote:
         | It certainly isn't the same cost if expressed as a non-
         | subsidized $$$ one needs for the Transformers compute aka
         | infra.
         | 
         | CNNs trained specifically for OCR can run in real time on as
         | small compute as a mobile device is.
        
       | srinathkrishna wrote:
       | Given the fact that multi-modal LLMs are getting so good at OCR
       | these days, is it a shame that we can't do local OCR with high
       | accuracy in the near-term?
        
       | coolspot wrote:
       | This is $1 per 1000 pages.
       | 
       | For comparison, Azure Document Intelligence is $1.5/1000 pages
       | for general OCR and $30/1000 pages for "custom extraction".
        
       | kapitalx wrote:
       | Co-founder of doctly.ai here (OCR tool)
       | 
       | I love mistral and what they do. I got really excited about this,
       | but a little disappointed after my first few tests.
       | 
       | I tried a complex table that we use as a first test of any new
       | model, and Mistral OCR decided the entire table should just be
       | extracted as an 'image' and returned this markdown:
       | 
       | ``` ![img-0.jpeg](img-0.jpeg) ```
       | 
       | I'll keep testing, but so far, very disappointing :(
       | 
       | This document I try is the entire reason we created Doctly to
       | begin with. We needed an OCR tool for regulatory documents we use
       | and nothing could really give us the right data.
       | 
       | Doctly uses a judge, OCRs a document against multiple LLMs and
       | decides which one to pick. It will continue to run the page until
       | the judge scores above a certain score.
       | 
       | I would have loved to add this into the judge list, but might
       | have to skip it.
        
         | infecto wrote:
         | Why pay more for doctly than an AWS Textract?
        
           | kapitalx wrote:
           | Great question. The language models are definitely beating
           | the old tools. Take a look at Gemini for example.
           | 
           | Doctly runs a tournament style judge. It will run multiple
           | generations across LLMs and pick the best one. Outperforming
           | single generation and single model.
        
         | the_mitsuhiko wrote:
         | Would love to see the test file.
        
           | Starlord2048 wrote:
           | would be glad to see benchmarking results
        
             | kapitalx wrote:
             | This is a good idea. We should publish a benchmark
             | results/comparison.
        
         | fnordpiglet wrote:
         | Interestingly I'm currently going through and scanning the
         | hundreds of journal papers my grandfather authored in medicine
         | and thinking through what to do about graphs. I was expecting
         | to do some form of multiphase agent based generation of LaTeX
         | or SVG rather than a verbal summary of the graphs. At least in
         | his generation of authorship his papers clearly explained the
         | graphs already. I was pretty excited to see your post naturally
         | but when I looked at the examples what I saw was, effectively,
         | a more verbose form of
         | 
         | ``` ![img-0.jpeg](img-0.jpeg) ```
         | 
         | I'm assuming this is partially because your use case is
         | targeting RAG under various assumptions bur also partially
         | because multimodal models aren't near what I would need to be
         | successful with?
        
           | kapitalx wrote:
           | We need to update the examples on the front page. Currently
           | for things that are considered charts/graphs/figures we
           | convert to a description. For things like logos or images we
           | do an image tag. You can also choose to exclude them.
           | 
           | The difference with this is that it took the entire page as
           | an image tag (it's just a table of text in my document).
           | rather than being more selective.
           | 
           | I do like that they give you coordinates for the images
           | though, we need to do something like that.
           | 
           | Give the actual tool a try. Would love to get your feedback
           | for that use case. It gives you 100 free credits initially
           | but if you email me (ali@doctly.ai), I can give you an extra
           | 500 (goes for anyone else here also)
        
         | niwtsol wrote:
         | If you have a judge system, and Mistral performs well on other
         | tests, wouldn't you want to include it so if it scores the
         | highest by your judges ranking it would select the most
         | accurate result? Or are you saying that mistral's image
         | markdown would score higher on your judge score?
        
           | kapitalx wrote:
           | We'll definitely be doing more tests, but the results I got
           | on the complex tests would result in a lower score and might
           | not be worth the extra cost of the judgement itself.
           | 
           | In our current setup Gemini wins most often. We enter
           | multiple generations from each model into the 'tournament',
           | sometimes one generation from gemini could be at the top
           | while another in the bottom, for the same tournament.
        
         | bambax wrote:
         | Where did you test it? At the end of the post they say:
         | 
         | > _Mistral OCR capabilities are free to try on le Chat_
         | 
         | but when asked, Le Chat responds:
         | 
         | > _can you do ocr?_
         | 
         | > _I don 't have the capability to perform Optical Character
         | Recognition (OCR) directly. However, if you have an image with
         | text that you need to extract, you can describe the text or
         | provide details, and I can help you with any information or
         | analysis related to that text. If you need OCR functionality,
         | you might need to use a specialized tool or service designed
         | for that purpose._
         | 
         | Edit: Tried anyway by attaching an image; it said it could do
         | OCR and then output... completely random text that had
         | absolutely nothing to do with the text in the image!...
         | Concerning.
         | 
         | Tried again with a better definition image, output only the
         | first twenty words or so of the page.
         | 
         | Did you try using the API?
        
           | kapitalx wrote:
           | Yes I used the API. They have examples here:
           | 
           | https://docs.mistral.ai/capabilities/document/
           | 
           | I used base64 encoding of the image of the pdf page. The
           | output was an object that has the markdown, and coordinates
           | for the images:
           | 
           | [OCRPageObject(index=0, markdown='![img-0.jpeg](img-0.jpeg)',
           | images=[OCRImageObject(id='img-0.jpeg', top_left_x=140,
           | top_left_y=65, bottom_right_x=2136, bottom_right_y=1635,
           | image_base64=None)], dimensions=OCRPageDimensions(dpi=200,
           | height=1778, width=2300))] model='mistral-
           | ocr-2503-completion'
           | usage_info=OCRUsageInfo(pages_processed=1,
           | doc_size_bytes=634209)
        
         | Grosvenor wrote:
         | Does doctly do handwritten forms like dates?
         | 
         | I have a lot of "This document filed and registered in the
         | county of ______ on ______ of _____ 2023" sort of thing.
        
           | kapitalx wrote:
           | We've been getting great results with those aswell. But
           | ofcourse there is always some chance of not getting it
           | perfect, specially with different handwritings.
           | 
           | Give it a try, no credit cards needed to try it. If you email
           | me (ali@doctly.ai) i can give you extra free credits for
           | testing.
        
             | Grosvenor wrote:
             | Just tried it. Got all the dates correct and even extracted
             | signatures really well.
             | 
             | Now to figure out how many millions of pages I have.
        
       | owenpalmer wrote:
       | This is incredibly exciting. I've been pondering/experimenting on
       | a hobby project that makes reading papers and textbooks easier
       | and more effective. Unfortunately the OCR and figure extraction
       | technology just wasn't there yet. This is a game changer.
       | 
       | Specifically, this allows you to associate figure references with
       | the actual figure, which would allow me to build a UI that solves
       | the annoying problem of looking for a referenced figure on
       | another page, which breaks up the flow of reading.
       | 
       | It also allows a clean conversion to HTML, so you can add cool
       | functionality like clicking on unfamiliar words for definitions,
       | or inserting LLM generated checkpoint questions to verify
       | understanding. I would like to see if I can automatically
       | integrate Andy Matuschak's Orbit[0] SRS into any PDF.
       | 
       | Lots of potential here.
       | 
       | [0] https://docs.withorbit.com/
        
         | generalizations wrote:
         | Wait does this deal with images?
        
           | ezfe wrote:
           | The output includes images from the input. You can see that
           | on one of the examples where a logo is cropped out of the
           | source and included in the result.
        
         | NalNezumi wrote:
         | >a UI that solves the annoying problem of looking for a
         | referenced figure on another page, which breaks up the flow of
         | reading.
         | 
         | A tangent but this exact issue is what I was frustrated for a
         | long time with pdf reader and reading science papers. Then I
         | found sioyek that pops up a small window when you hover over
         | links (references and equations and figures) and it solved it.
         | 
         | Granted, the pdf file must be in right format, so OCR could
         | make this experience better. Just saying the UI component of
         | that already exist
         | 
         | https://sioyek.info/
        
           | PerryStyle wrote:
           | Zotero's PDF viewer also does this now. Being able to
           | annotate PDFs and having a reference manager has been a life
           | saver.
        
       | polytely wrote:
       | I don't need AGI just give me superhuman OCR so we can turn all
       | existing pdfs into text* and cheaply host it.
       | 
       | Feels like we are almost there.
       | 
       | *: https://annas-archive.org/blog/critical-window.html
        
       | coolspot wrote:
       | This is $1 per 1000 pages. For comparison, Azure Document
       | Intelligence is $1.5/1000 pages for general OCR and $30/1000
       | pages for "custom extraction".
        
         | 0cf8612b2e1e wrote:
         | Given the wide variety of pricing on all of these providers, I
         | keep wondering how the economics work. Do they have fantastic
         | margin on some of these products or is it a matter of
         | subsidizing the costs, hoping to capture the market? Last I
         | heard, OpenAI is still losing money.
        
       | thegabriele wrote:
       | I'm using gemini to solve textual CAPTCHA with some good results
       | (better than untrained OCR).
       | 
       | I will give this a shot
        
       | bugglebeetle wrote:
       | Congrats to Mistral for yet again releasing another closed source
       | thing that costs more than running an open source equivalent:
       | 
       | https://github.com/DS4SD/docling
        
         | Squarex wrote:
         | I am all for open source, but where do you see benchmarks that
         | conclude that it's just equivalent?
        
           | bugglebeetle wrote:
           | Where do you see open source benchmark results that confirm
           | Mistral's performance?
        
         | anonymousd3vil wrote:
         | Back in my days Mistral used to torrent models.
        
       | Asraelite wrote:
       | I never thought I'd see the day where technology finally advanced
       | far enough that we can edit a PDF.
        
         | randomNumber7 wrote:
         | I never thought driving a car is harder than editing a pdf.
        
           | pzo wrote:
           | It's not about harder but about what error you can tolerate.
           | Here if you have accuracy 99% for many applications it's
           | enough. If you have 99% accuracy per trip of no crash during
           | self driving then you gonna be dead within a year very
           | likely.
           | 
           | For cars we need accuracy at least 99.99% and that's very
           | hard.
        
             | rtsil wrote:
             | I doubt most people have 99% accuracy. The threshold of
             | tolerance for error is just much lower for any self-driving
             | system (and with good reason, because we're not familiar
             | with them yet).
        
               | KeplerBoy wrote:
               | How do you define 99% accuracy?
               | 
               | I guess something like success rate for a trip (or mile)
               | would be a more reasonable metric. Most people have a
               | success rate far higher than 99% for averages trips.
               | 
               | Most people who commute daily are probably doing
               | something like a 1000 car rides a year and have minor
               | accidents every few years. 99% success rates would mean
               | monthly accidents.
        
         | Apofis wrote:
         | Foxit PDF exists...
        
         | toephu2 wrote:
         | I've been able to edit PDFs (95%+ of them) accurately for the
         | past 10 years...
        
       | thiago_fm wrote:
       | For general use this will be good.
       | 
       | But I bet that simple ML will lead to better OCRs when you are
       | doing anything specialized, such as, medical documents, invoices
       | etc.
        
       | sureglymop wrote:
       | Looks good but in the first hover/slider demo one can see how it
       | could lead to confusion when handling side by side content.
       | 
       | Table 1 is referred to in section `2 Architectural details` but
       | before `2.1 Multimodal Decoder`. In the generated markdown though
       | it is below the latter section, as if it was in/part of that
       | section.
       | 
       | Of course I am nitpicking here but just the first thing I
       | noticed.
        
         | 0cf8612b2e1e wrote:
         | Does anything handle dual columns well? Despite being the
         | academic standard, it seemingly throws off every generic tool.
        
       | serjester wrote:
       | This is cool! With that said for anyone looking to use this in
       | RAG, the downside to specialized models instead of general VLMs
       | is you can't easily tune it to your use specific case. So for
       | example, we use Gemini to add very specific alt text to images in
       | the extracted Markdown. It's also 2 - 3X the cost of Gemini Flash
       | - hopefully the increased performance is significant.
       | 
       | Regardless excited to see more and more competition in the space.
       | 
       | Wrote an article on it: https://www.sergey.fyi/articles/gemini-
       | flash-2-tips
        
         | hyuuu wrote:
         | gemini flash is notorious for hallucinating the output of the
         | OCR, be careful with it. For straight forward, semi-structured,
         | low page count (under 5) it should perform well, but the more
         | the context window is stretched the more the output gets more
         | unreliable
        
       | oysterville wrote:
       | Dupe of an hour previous post
       | https://news.ycombinator.com/item?id=43282489
        
       | beebaween wrote:
       | Wonder how it does with table data in pdfs / page-long tabular
       | data?
        
       | blackeyeblitzar wrote:
       | A similar but different product that was discussed on HN is
       | OlmOCR from AI2, which is open source:
       | 
       | https://news.ycombinator.com/item?id=43174298
        
       | hubraumhugo wrote:
       | It will be interesting to see how all the companies in the
       | document processing space adapt as OCR becomes a commodity.
       | 
       | The best products will be defined by everything "non-AI", like
       | UX, performance and reliability at scale, and human-in-the loop
       | feedback for domain experts.
        
         | trollied wrote:
         | They will offer integrations into enterprise systems, just like
         | they do today.
         | 
         | Lots of big companies don't like change. The existing document
         | processing companies will just silently start using this sort
         | of service to up their game, and keep their existing
         | relationships.
        
         | hyuuu wrote:
         | I 100% agree with this, I think you can even extend this to any
         | AI, in the end, IMO, as the llm is more commoditized, the
         | surface of which the value is delivered will matter more
        
       | lokl wrote:
       | Tried with a few historical handwritten German documents,
       | accuracy was abysmal.
        
         | rvnx wrote:
         | Probably they are overfitting the benchmarks, since other users
         | also complain of the low accuracy
        
         | Thaxll wrote:
         | HTR ( Handwritten Text Recognition ) is a completely different
         | space than OCR. What were you expecting exactly?
        
           | riquito wrote:
           | It fits the "use cases" mentioned in the article
           | 
           | > Preserving historical and cultural heritage: Organizations
           | and nonprofits that are custodians of heritage have been
           | using Mistral OCR to digitize historical documents and
           | artifacts, ensuring their preservation and making them
           | accessible to a broader audience.
        
             | Thaxll wrote:
             | There is a difference between historical document and "my
             | doctor prescription".
             | 
             | Someone coming here and saying it does not work with my old
             | german hanwriting doesn't say much.
        
               | riquito wrote:
               | You're making a strawman, the parent specifically
               | mentioned "historical handwritten documents"
        
         | anothermathbozo wrote:
         | Optical Character Recognition (OCR) and Handwritten Text
         | Recognition (HTR) are different tasks
        
         | lysace wrote:
         | Semi-OT (similar language): The national archives in Sweden and
         | Finland published a model for OCR:ing handwritten Swedish text
         | from the 1600s to the 1800s with what to me seems like a _very_
         | level of accuracy given the source material. (4% character
         | error rate)
         | 
         | https://readcoop.eu/model/the-swedish-lion-i/
         | 
         | https://www.transkribus.org/success-story/creating-the-swedi...
         | 
         | https://huggingface.co/Riksarkivet
         | 
         | They have also published a fairly large volume of OCR:ed texts
         | (IIRC birth/death notices from church records) using this model
         | online. As a beginner genealogist it's been fun to follow.
        
         | thadt wrote:
         | Also working with historical handwritten German documents. So
         | far Gemini seems to be the least wrong of the ones I've tried -
         | any recommendations?
        
       | evmar wrote:
       | I noticed on the Arabic example they lost a space after the first
       | letter on the third to last line, can any native speakers
       | confirm? (I only know enough Arabic to ask dumb questions like
       | this, curious to learn more.)
       | 
       | Edit: it looks like they also added a vowel mark not present in
       | the input on the line immediately after.
       | 
       | Edit2: here's a picture of what I'm talking about, the
       | before/after: https://ibb.co/v6xcPMHv
        
         | resiros wrote:
         | Arabic speaker here. No, it's perfect.
        
           | evmar wrote:
           | I am pretty sure it added a kasrah not present in the input
           | on the 2nd to last line. (Not saying it's not super
           | impressive, and also that almost certainly is the right word,
           | but I think that still means not quite "perfect"?)
        
             | gl-prod wrote:
             | Yes, it looks like it did add a kasrah to the word Zhry
        
               | yoda97 wrote:
               | Yep, and fmin too, this is not just OCR, it made some
               | post-processing corrections or "enhancements". That could
               | be good, but it could also be trouble the 1% chance it
               | makes a mistake in critical documents.
        
           | gl-prod wrote:
           | He means the space between the waw (w) and the word
        
             | evmar wrote:
             | I added a pic to the original comment, sorry for not being
             | clear!
        
         | albatrosstrophy wrote:
         | And here I thought after reading the headline: finally a
         | reliable Arabic OCR. I've never in my life found a good that
         | does the job decently especially for a scanned document. Or is
         | there something out there I don't know about?
        
       | th0ma5 wrote:
       | A great question for people wanting to use OCR in business is...
       | Which digits in monetary amounts can you tolerate being
       | incorrect?
        
       | kbyatnal wrote:
       | We're approaching the point where OCR becomes "solved" -- very
       | exciting! Any legacy vendors providing pure OCR are going to get
       | steamrolled by these VLMs.
       | 
       | However IMO, there's still a large gap for businesses in going
       | from raw OCR outputs --> document processing deployed in prod for
       | mission-critical use cases. LLMs and VLMs aren't magic, and
       | anyone who goes in expecting 100% automation is in for a
       | surprise.
       | 
       | You still need to build and label datasets, orchestrate pipelines
       | (classify -> split -> extract), detect uncertainty and correct
       | with human-in-the-loop, fine-tune, and a lot more. You can
       | certainly get close to full automation over time, but it's going
       | to take time and effort. But the future is on the horizon!
       | 
       | Disclaimer: I started a LLM doc processing company to help
       | companies solve problems in this space (https://extend.app/)
        
         | risyachka wrote:
         | >> Any legacy vendors providing pure OCR are going to get
         | steamrolled by these VLMs.
         | 
         | -OR- they can just use these APIs, and considering that they
         | have a client base - which would prefer to not rewrite
         | integrations to get the same result - they can get rid of most
         | code base, replace it with llm api and increase margins by 90%
         | and enjoy good life.
        
           | esafak wrote:
           | They're going to become commoditized unless they add value
           | elsewhere. Good news for customers.
        
             | TeMPOraL wrote:
             | They are (or at least could easily be) adding value in form
             | of SLA - charging money for giving guarantees on accuracy.
             | This is both better for customer, who gets concrete
             | guarantees and someone to shift liability to, and for the
             | vendor, that can focus on creating techniques and systems
             | for getting that extra % of reliability out of the LLM OCR
             | process.
             | 
             | All of the above are things companies - particularly larger
             | ones - are happy to pay for, because ORC is just a cog in
             | the machine, and this makes it more reliable and
             | predictable.
             | 
             | On top of the above, there are auxiliary value-adds such a
             | vendor could provide - such as, being fully compliant with
             | every EU directive and regulation that's in power, or about
             | to be. There's plenty of those, they overlap, and no one
             | wants to deal with it if they can outsource it to someone
             | who already figured it out (and will take the blame for
             | fuckups).
        
         | dml2135 wrote:
         | One problem I've encountered at my small startup in evaluating
         | OCR technologies is precisely convincing stakeholders that the
         | "human-in-the-loop" part is both unavoidable, and ultimately
         | beneficial.
         | 
         | PMs want to hear that an OCR solution will be fully automated
         | out-of-the-box. My gut says that anything offering that is
         | snake-oil, and I try to convey that the OCR solution they want
         | _is_ possible, but if you are unwilling to pay the tuning cost,
         | it's going to flop out of the gate. At that point they lose
         | interest and move on to other priorities.
        
           | kbyatnal wrote:
           | Yup definitely, and this is exactly why I built my startup.
           | I've heard this a bunch across startups & large enterprises
           | that we work with. 100% automation is an impossible target,
           | because even humans are not 100% perfect. So how we can
           | expect LLMs to be?
           | 
           | But that doesn't mean you have to abandon the effort. You can
           | still definitely achieve production-grade accuracy! It just
           | requires having the right tooling in place, which reduces the
           | upfront tuning cost. We typically see folks get there on the
           | order of days or 1-2 weeks (it doesn't necessarily need to
           | take months).
        
         | techwizrd wrote:
         | The challenge I have is how to get bounding boxes for the OCR,
         | for things like redaction/de-identification.
        
           | kbyatnal wrote:
           | yeah that's a fun challenge -- what we've seen work well is a
           | system that forces the LLM to generate citations for all
           | extracted data, map that back to the original OCR content,
           | and then generate bounding boxes that way. Tons of edge cases
           | for sure that we've built a suite of heuristics for over
           | time, but overall works really well.
        
             | dontlikeyoueith wrote:
             | Why would you do this and not use Textract?
        
           | dontlikeyoueith wrote:
           | AWS Textract works pretty well for this and is much cheaper
           | than running LLMs.
        
             | daemonologist wrote:
             | Textract is more expensive than this (for your first 1M
             | pages per month at least) and significantly more than
             | something like Gemini Flash. I agree it works pretty well
             | though - definitely better than any of the open source pure
             | OCR solutions I've tried.
        
         | einpoklum wrote:
         | An LLM with billions of parameters for extracting text from a
         | PDF (which isn't even a rasterized image) really does not
         | "solve OCR".
        
       | mvac wrote:
       | Great progress, but unfortunately, for our use case (converting
       | medical textbooks from PDF to MD), the results are not as good as
       | those by MinerU/PDF-Extract-Kit [1].
       | 
       | Also the collab link in the article is broken, found a functional
       | one [2] in the docs.
       | 
       | [1] https://github.com/opendatalab/MinerU [2]
       | https://colab.research.google.com/github/mistralai/cookbook/...
        
         | owenpalmer wrote:
         | I've been searching relentlessly for something like this! I
         | wonder why it's been so hard to find... is it the Chinese?
         | 
         | In any case, thanks for sharing.
        
       | 101008 wrote:
       | Is this free in LeChat? I uploaded a handwritten text and it
       | stopped after the 4th word.
        
       | bsnnkv wrote:
       | Someone working there has good taste to include a Nizar Qabbani
       | poem.
        
       | rvz wrote:
       | > "Fastest in its category"
       | 
       | Not one mention of the company that they have partnered with and
       | that is Cerebras AI and that is the reason they have fast
       | inference [0]
       | 
       | Literally no-one here is talking about them and they are about to
       | IPO.
       | 
       | [0] https://cerebras.ai/blog/mistral-le-chat
        
       | pilooch wrote:
       | But what's the need exactly for OCR when you have multimodal LLMs
       | that can read the same info and directly answer any questions
       | about it ?
       | 
       | For a VLLM, my understanding is that OCR corresponds to a sub-
       | field of questions, of the type 'read exactly what's written in
       | this document'.
        
         | daemonologist wrote:
         | It's useful to have the plain text down the line for operations
         | not involving a language model (e.g. search). Also if you have
         | a bunch of prompts you want to run it's potentially cheaper,
         | although perhaps less accurate, to run the OCR once and save
         | yourself some tokens or even use a smaller model for subsequent
         | prompts.
        
         | ks2048 wrote:
         | Tons of uses: Storage (text instead of images), search (user
         | typing in a text box and you want instant retrieval from a
         | dataset), etc. And costs: run on images once - then the rest of
         | your queries will only need to run on text.
        
         | simonw wrote:
         | The biggest risk of vision LLMs for OCR is that they might
         | accidentally follow instructions is the text that they are
         | meant to be processing.
         | 
         | (I asked Mistral if their OCR system was vulnerable to this and
         | they said "should be robust, but curious to see if you find any
         | fun examples" -
         | https://twitter.com/simonw/status/1897713755741368434 and
         | https://twitter.com/sophiamyang/status/1897719199595720722 )
        
           | pilooch wrote:
           | Fun, but LLMs would follow them post OCR anyways ;)
           | 
           | I see OCR much like phonemes in speech, once you have end to
           | end systems, they become latent constructs from the past.
           | 
           | And that is actually good, more code going into models
           | instead.
        
         | troyvit wrote:
         | Getting PDFs into #$@ Confluence apparently. Just had to do
         | this and Mistral saved me a ton of hassle compared to this:
         | https://community.atlassian.com/forums/Confluence-questions/...
        
       | gatienboquet wrote:
       | I feel like i can't create an agent with their OCR model yet ? Is
       | it something planned or it's only API?
        
         | simonw wrote:
         | What do you mean by agent?
        
           | gatienboquet wrote:
           | La Plateforme agent builder -
           | https://console.mistral.ai/build/agents/new
        
       | kiratp wrote:
       | It's shocking how much our industry fails to see past its own
       | nose.
       | 
       | Not a single example on that page is a Purchase Order, Invoice
       | etc. Not a single example shown is relevant to industry at scale.
        
         | kashnote wrote:
         | Fwiw, they have an example of a parking receipt in a cookbook:
         | https://colab.research.google.com/github/mistralai/cookbook/...
        
         | guiomie wrote:
         | Agreed. In general I've had such bad performance for complex
         | table based invoice parsing, that every few months I try the
         | latest models to see if its better. It does say "96.12" on top-
         | tier benchmark under the Table category.
        
         | mtillman wrote:
         | We find CV models to be better (higher midpoint on an ROC
         | curve) for the types of docs you mention.
        
         | simpaticoder wrote:
         | Another good example would be contracts of any kind. Imagine
         | photographing a contract (like a car loan) and on the spot
         | getting an AI to read it, understand it, forecast scenarious,
         | highlight red flags, and do some comparison shopping for you.
        
           | JBiserkov wrote:
           | ... imagining ...
           | 
           | ... hallucinating during read ...
           | 
           | ... hallucinating during understand ...
           | 
           | ... hallucinating during forecast ...
           | 
           | ... highlighting a hallucination as red flag ...
           | 
           | ... missing an actual red flag ...
           | 
           | ... consuming water to cool myself...
           | 
           | Phew, being an AI is hard!
        
         | merb wrote:
         | Mistral is Europe based where invoices are more or less sent
         | digitally in like 95% of all the cases anyway. Some are even
         | digital invoices, which will at some point in the eu be
         | mandatory. For orders there are proposals for that, too. And
         | basically invoice data extraction is a different beast.
        
           | wolfi1 wrote:
           | even in Europe this is still a thing, I know of systems which
           | still are unable to read items having more than one line
           | (costing s sh*tload of money)
        
           | napolux wrote:
           | Can confirm, in Italy electronic invoicing is mandatory since
           | 2019
        
           | revnode wrote:
           | So an invoice attached to an email as a PDF is sent digitally
           | ... those unfamiliar with PDF will think text and data
           | extraction is trivial then, but this isn't true. You can have
           | a fully digital, non-image PDF that is vector based and has
           | what looks like text, but doesn't have a single piece of
           | extractable text in it. It's all about how the PDF was
           | generated. Tables can be formatted in a million ways, etc.
           | 
           | Your best bet is to always convert it to an image and OCR it
           | to extract structured data.
        
           | codetrotter wrote:
           | One use-case is digitising receipts from business related
           | travels for expenses that employees paid for out of their own
           | pocket and which they are submitting pictures to the business
           | for reimbursement.
           | 
           | Bus travels, meals including dinners and snacks, etc. for
           | which the employee has receipts on paper.
        
           | kiratp wrote:
           | This isn't even close to true.
           | 
           | Source: We have large EU customers.
        
         | arpinum wrote:
         | Businesses at scale use EDI to handle purchase orders and
         | invoices, no OCR needed.
        
           | cdolan wrote:
           | Thats simply not a factual statement.
           | 
           | Scaled businesses do USE edi, but they still receive hundreds
           | of thousands of PDF documents a month
           | 
           | source: built a saas product that handles pdfs for a specific
           | industry
        
         | mentalgear wrote:
         | To be fair: Reading the blog post, the main objective seems to
         | have been to enable information extraction with high confidence
         | for the academic sector (e.g. unlocking all these paper pdfs),
         | and not necessarily to be another receipt scanner.
        
           | kiratp wrote:
           | It hilarious that the academic sector 1. publishes as PDF 2.
           | spends all this energy on how to extract that info back from
           | PDF 3. publishes that research as PDF as well.
           | 
           | Receipt scanning is a multiple orders of magnitude more
           | valuable business. Mistral at this point is looking for a
           | commercial niche (like how Claude is aiming at software
           | development)
        
         | sha16 wrote:
         | I wanted to apply OCR to my company's invoicing since they
         | basically did purchasing for a bunch of other large companies,
         | but the variability in the conversion was not tolerable. Even
         | rounding something differently could catch an accountant's eye,
         | let alone detecting a "8" as a "0" or worse.
        
         | dotnetkow wrote:
         | Agreed, though in this case, they are going for general-purpose
         | OCR. That's fine in some cases, but purpose-built models
         | trained on receipts, invoices, tax documents, etc., definitely
         | perform better. We've got a similar API solution coming out
         | soon (https://digital.abbyy.com/code-extract-automate-your-new-
         | mus...) that should work better for businesses automating their
         | docs at scale.
        
       | qwertox wrote:
       | We developers seem to really dislike PDFs, to a degree that we'll
       | build LLMs and have them translate it into Markdown.
       | 
       | Jokes aside, PDFs really serve a good purpose, but getting data
       | out of them is usually really hard. They should have something
       | like an embedded Markdown version with a JSON structure
       | describing the layout, so that machines can easily digest the
       | data they contain.
        
         | jgalt212 wrote:
         | I think you might be looking for PDF/A.
         | 
         | https://www.adobe.com/uk/acrobat/resources/document-files/pd...
         | 
         | For example, if you print a word doc to PDF, you get the raw
         | text in PDF form, not an image of the text.
        
           | gpvos wrote:
           | PDF/A doesn't require preserving the document structure, only
           | that any text is extractable.
        
       | d_llon wrote:
       | It's disappointing to see that the benchmark results are so
       | opaque. I hope we see reproducible results soon, and hopefully
       | from Mistral themselves.
       | 
       | 1. We don't know what the evaluation setup is. It's very possible
       | that the ranking would be different with a bit of prompt
       | engineering.
       | 
       | 2. We don't know how large each dataset is (or even how the
       | metrics are calculated/aggregated). The metrics are all reported
       | as XY.ZW%, but it's very possible that the .ZW% -- or even Y.ZW%
       | -- is just noise.[1]
       | 
       | 3. We don't know how the datasets were mined or filtered. Mistral
       | could have (even accidentally!) filtered out particularly data
       | points that their model struggled with. (E.g., imagine good-
       | meaning engineer testing a document with Mistral OCR first,
       | finding it doesn't work, and deducing that it's probably bad data
       | and removing it.)
       | 
       | [1] https://medium.com/towards-data-science/digit-
       | significance-i...
        
       | s4i wrote:
       | I wonder how good it would be to convert sheet music to MusicXML.
       | All the current tools more or less suck with this task, or maybe
       | I'm just ignorant and don't know what lego bricks to put
       | together.
        
         | adrianh wrote:
         | Try our machine-learning powered sheet music scanning engine at
         | Soundslice:
         | 
         | https://www.soundslice.com/sheet-music-scanner/
         | 
         | Definitely doesn't suck.
        
       | protonbob wrote:
       | Wow this basically "solves" DRM for books as well as opening up
       | the door for digitizing old texts more accurately.
        
       | shmoogy wrote:
       | What's the general time for something like this to hit
       | openrouter? I really hate having accounts everywhere when I'm
       | trying to test new things.
        
       | bondolo wrote:
       | Such a shame that PDF doesn't just, like, include the semantic
       | structure of the document by default. It is brilliant that we
       | standardized on an archival document format that doesn't include
       | direct access to the document text or structure as a core
       | intrinsic default feature.
       | 
       | I say this with great anger as someone who works in accessibility
       | and has had PDF as a thorn in my side for 30 years.
        
         | andai wrote:
         | Tables? I regularly run into PDFs where even the body text is
         | mangled!
        
         | NeutralForest wrote:
         | I agree with this so much. I've tried to sometimes push friends
         | and family to use text formats (at least I sent them something
         | like Markdown), which is very easy to render in the browser
         | anyways. But often you have to fall back to PDF, which I
         | dislike very much. There's so much content like books and
         | papers that are in PDF as well. Why did we pick a binary blob
         | as shareable format again?
        
           | meatmanek wrote:
           | > Why did we pick a binary blob as shareable format again?
           | 
           | PDF was created to solve the problem of being able to render
           | a document the same way on different computers, and it mostly
           | achieved that goal. Editable formats like .doc, .html, .rtf
           | were unreliable -- different software would produce different
           | results, and even if two computers have the exact same
           | version of Microsoft Word, they might render differently
           | because they have different fonts available. PDFs embed the
           | fonts needed for the document, and specify exactly where each
           | character goes, so they're fully self-contained.
           | 
           | After Acrobat Reader became free with version 2 in 1994,
           | everybody with a computer ended up downloading it after
           | running across a PDF they needed to view. As it became more
           | common for people to be able to view PDFs, it became more
           | convenient to produce PDFs when you needed everybody to be
           | able to view your document consistently. Eventually, the
           | ability to produce PDFs became free (with e.g. Office 2007 or
           | Mac OS X's ability to print to PDF), which cemented PDF's
           | popularity.
           | 
           | Notably, the original goals of PDF had nothing to do with
           | being able to copy text out of them -- the goal was simply to
           | produce a perfect reproduction of the document on
           | screen/paper. That wasn't enough of an inconvenience to
           | prevent PDF from becoming popular. (Some people saw the
           | inability for people to easily copy text from them as a
           | benefit -- basically a weak form of text DRM.)
        
         | cess11 wrote:
         | PDF is pretty strictly modeled on printed documents and their
         | mainstream typography at the time of invention of Postscript
         | and so on.
         | 
         | Printed documents do not have any structure beyond the paper
         | and placement of ink on them.
        
         | lukasb wrote:
         | Even assuming you could get people to do the work (probably the
         | real issue here) could a single schema syntax capture the
         | semantics of the universe of documents that exist as PDFs? PDFs
         | succeeded because they could reproduce anything.
        
           | euleriancon wrote:
           | html
        
       | OrvalWintermute wrote:
       | I'm happy to see this development after being underwhelmed with
       | Chatgpt OCR!
        
       | climb_stealth wrote:
       | Does this support Japanese? They list a table of language
       | comparisons againat other approaches but I can't tell if it is
       | exhaustive.
       | 
       | I'm hoping that something like this will be able to handle
       | 3000-page Japanese car workshop manuals. Because traditional OCR
       | really struggles with it. It has tables, graphics, text in
       | graphics, the whole shebang.
        
       | hyuuu wrote:
       | It's a weird timing because I just launched https://dochq.io - ai
       | document extraction where you can define what you need to get out
       | your documents in plain English, I legitimately thought that this
       | was going to be such a niche product but hell, there has been a
       | very rapid rise for AI-based OCR lately, an article/tweet even
       | went viral 2 weeks ago I think? About using Gemini to do OCR, fun
       | times.
        
       | sixhobbits wrote:
       | Nice demos but I wonder how well it does on longer files. I've
       | been experimenting with passing some fairly neat PDFs to various
       | LLMs for data extraction. They're created from Excel exports and
       | some of the data is cut off or badly laid out, but it's all
       | digitally extractable.
       | 
       | The challenge isn't so much the OCR part, but just the length.
       | After one page the LLMs get "lazy" and just skip bits or stop
       | entirely.
       | 
       | And page by page isn't trivial as header rows are repeated or
       | missing etc.
       | 
       | So far my experience has definitely been that the last 2% of the
       | content still takes the most time to accurately extract for large
       | messy documents, and LLMs still don't seem to have a one-shot
       | solve for that. Maybe this is it?
        
         | hack_ml wrote:
         | You will have to send one page at a time, most of this work has
         | to be done via RAG. Adding a large context (like a whole PDF),
         | still does not work that well in my experience.
        
       | lysace wrote:
       | Nit: Please change the URL from
       | 
       | https://mistral.ai/fr/news/mistral-ocr
       | 
       | to
       | 
       | https://mistral.ai/news/mistral-ocr
       | 
       | The article is the same, but the site navigation is in English
       | instead of French.
       | 
       | Unless it's a silent statement, of course. =)
        
         | lblume wrote:
         | For me, the second page redirects to the first. (And I don't
         | live in France.)
        
       | anovick wrote:
       | How does one use it to identify bounding rectangles of
       | images/diagrams in the PDF?
        
       | Oras wrote:
       | I feel this is created for RAG. I tried a document [0] that I
       | tested with OCR; it got all the table values correctly, but the
       | page's footer was missing.
       | 
       | Headers and footers are a real pain with RAG applications, as
       | they are not required, and most OCR or PDF parsers will return
       | them, and there is extract work to do to remove them.
       | 
       | [0]
       | https://github.com/orasik/parsevision/blob/main/example/Mult...
        
       | maCDzP wrote:
       | Oh - on premise solution - awesome!
        
       | cavisne wrote:
       | Its funny how Gemini consistently beats googles dedicated
       | document API.
        
         | jjice wrote:
         | I'm not surprised honestly - it's just the newer better things
         | vs their older offering
        
       | atemerev wrote:
       | So, the only thing that stopped AI from learning from all our
       | science and taking over the world was the difficulty of
       | converting PDFs of academic papers to more computer readable
       | formats.
       | 
       | Not anymore.
        
       | noloz wrote:
       | Are there any open source projects with the same goal?
        
       | submeta wrote:
       | Is this able to convert pdf flowcharts into yaml or json
       | representations of them? I have been experimenting with Claude
       | 3.5. It has been very good at readig / understanding/ converting
       | into representations of flow charts.
       | 
       | So I am wondering if this is more capable. Will try definitely,
       | but maybe someone can chime in.
        
       | bambax wrote:
       | It's not bad! But it still hallucinates. Here's an example of an
       | (admittedly difficult) image:
       | 
       | https://i.imgur.com/jcwW5AG.jpeg
       | 
       | For the blocks in the center, it outputs:
       | 
       | > _Claude, duc de Saint-Simon, pair et chevalier des ordres,
       | gouverneur de Blaye, Senlis, etc., ne le 16 aout 1607 , 3 mai
       | 1693 ; ep. 1*, le 26 septembre 1644, Diane - Henriette de Budos
       | de Portes, morte le 2 decembre 1670; 2*, le 17 octobre 1672,
       | Charlotte de l 'Aubespine, morte le 6 octobre 1725._
       | 
       | This is perfect! But then the next one:
       | 
       | > _Louis, commandeur de Malte, Louis de Fay Laurent bre 1644,
       | Diane - Henriette de Budos de Portes, de Cressonsac. du
       | Chastelet, mortilhomme aux gardes, 2 juin 1679._
       | 
       | This is really bad because
       | 
       | 1/ a portion of the text of the previous bloc is repeated
       | 
       | 2/ a portion of the next bloc is imported here where it shouldn't
       | be ("Cressonsac"), and of the right most bloc ("Chastelet")
       | 
       | 3/ but worst of all, a whole word is invented, "mortilhomme" that
       | appears nowhere in the original. (The word doesn't exist in
       | French so in that case it would be easier to spot; but the risk
       | is when words are invented, that do exist and "feel right" in the
       | context.)
       | 
       | (Correct text for the second bloc should be:
       | 
       | > _Louis, commandeur de Malte, capitaine aux gardes, 2 juin
       | 1679._ )
        
         | bambax wrote:
         | Another test with a text in English, which is maybe more fair
         | (although Mistral is a French company ;-). This image is from
         | Parliamentary debates of the parliament of New Zealand in
         | 1854-55:
         | 
         | https://i.imgur.com/1uVAWx9.png
         | 
         | Here's the output of the first paragraph, with mistakes in
         | brackets:
         | 
         | > _drafts would be laid on the table, and a long discussion
         | would ensue; whereas a Committee would be able to frame a
         | document which, with perhaps a few verbal emundations_
         | [emendations] _, would be adopted; the time of the House would
         | thus be saved, and its business expected_ [expedited] _. With
         | regard to the question of the comparative advantages of The-
         | day_ [Tuesday]* and Friday, he should vote for the amendment,
         | on the principle that the wishes of members from a distance
         | should be considered on all sensations _[occasions]_ where a
         | principle would not be compromised or the convenience of the
         | House interfered with. He hoped the honourable member for the
         | Town of Christchurch would adopt the suggestion he (Mr.
         | Forssith _[Forsaith]_ ) had thrown out and said _[add]_ to his
         | motion the names of a Committee.*
         | 
         | Some mistakes are minor (emnundations/emendations or
         | Forssith/Forsaith), but others are very bad, because they are
         | unpredictable and don't correspond to any pattern, and
         | therefore can be very hard to spot: _sensations_ instead of
         | occasions, or _expected_ in lieu of expedited... That last one
         | really changes the meaning of the sentence.
        
         | spudlyo wrote:
         | I want to rejoice that OCR is now a "solved" problem, but I
         | feel like hallucinations are just as problematic as the kind of
         | stuff I have to put up with tesseract -- both require careful
         | manual proofreading for an acceptable degree of confidence. I
         | guess I'll have to try it and see for myself just how much
         | better these solutions are for my public domain archive.org
         | Latin language reader & textbook projects.
        
         | layer8 wrote:
         | > This is perfect!
         | 
         | Just a nit, but I wouldn't call it perfect when using U+25CB *
         | WHITE CIRCLE instead of what should be U+00BA o MASCULINE
         | ORDINAL INDICATOR, or alternatively a superscript "o". These
         | are https://fr.wikipedia.org/wiki/Adverbe_ordinal.
         | 
         | There's also extra spaces after both "1607" and "1693", and
         | around the hyphen in "Diane-Henriette".
         | 
         | Lastly, U+2019 instead of U+0027 would be more appropriate for
         | the apostrophe, all the more since in the image it looks like
         | the former and not like the latter.
        
           | TeMPOraL wrote:
           | This is "reasoning model" stuff even for humans :).
        
             | layer8 wrote:
             | There is OCR software that analyses which language is used,
             | and than applies heuristics for the recognized language to
             | steer the character recognition in terms of character
             | sequence likelihoods and punctuation rules.
             | 
             | I don't think you need a reasoning model for that, just
             | better training; although conversely a reasoning model
             | should hopefully notice the errors.
        
       | neom wrote:
       | I gave it a bunch of my wifes 18th century English scans to
       | transcribe, mostly couldn't do it, and it's been doing this for
       | 15 minutes now, not sure why but i find quite amusing:
       | https://share.zight.com/L1u2jZYl
        
       | Zopieux wrote:
       | Saving you a click: no, it cannot be self hosted (unless you have
       | a few million dollars laying around)
        
       | dotnetkow wrote:
       | Congrats to the Mistral team for launching! A general-purpose OCR
       | model is useful, of course. However, more purpose-built solutions
       | are a must to convert business documents reliably. AI models pre-
       | trained on specific document types perform better and are more
       | accurate. Coming soon from the ABBYY team, we're shipping a new
       | OCR API designed to be consistent, reliable, and hallucination-
       | free. Check it out if you're looking for best-in-class DX:
       | https://digital.abbyy.com/code-extract-automate-your-new-mus...
        
       | riffic wrote:
       | It'd be great if this could be tested against genealogical
       | documents written in cursive like oh most of the documents on
       | microfilm stored by the LDS on familysearch, or eastern european
       | archival projects etc.
        
       | peterburkimsher wrote:
       | Does it work for video subtitles? And in Chinese? I'm looking to
       | transcribe subtitles of live music recordings from ANHOP and
       | KHOP.
        
       | nyeah wrote:
       | It's not fair to call it a "Mistrial" just because it
       | hallucinates a little bit.
        
       ___________________________________________________________________
       (page generated 2025-03-06 23:00 UTC)