[HN Gopher] Nanonets-OCR-s - OCR model that transforms documents...
       ___________________________________________________________________
        
       Nanonets-OCR-s - OCR model that transforms documents into
       structured markdown
        
       Author : PixelPanda
       Score  : 265 points
       Date   : 2025-06-16 06:14 UTC (16 hours ago)
        
 (HTM) web link (huggingface.co)
 (TXT) w3m dump (huggingface.co)
        
       | PixelPanda wrote:
       | Full disclaimer: I work at Nanonets
       | 
        | Excited to share Nanonets-OCR-s, a powerful, lightweight
        | 3B-parameter VLM that converts documents into clean, structured
        | Markdown. The model is trained to understand document structure
        | and content context (tables, equations, images, plots,
        | watermarks, checkboxes, etc.). Key Features:
       | 
       | LaTeX Equation Recognition Converts inline and block-level math
       | into properly formatted LaTeX, distinguishing between $...$ and
       | $$...$$.
       | 
       | Image Descriptions for LLMs Describes embedded images using
       | structured <img> tags. Handles logos, charts, plots, and so on.
       | 
       | Signature Detection & Isolation Finds and tags signatures in
       | scanned documents, outputting them in <signature> blocks.
       | 
        | Watermark Extraction Extracts watermark text and stores it within
        | a <watermark> tag for traceability.
       | 
        | Smart Checkbox & Radio Button Handling Converts checkboxes and
        | radio buttons to Unicode symbols (☐, ☑, ☒) for reliable parsing
        | in downstream apps.
       | 
       | Complex Table Extraction Handles multi-row/column tables,
       | preserving structure and outputting both Markdown and HTML
       | formats.
       | 
       | Huggingface / GitHub / Try it out:
       | https://huggingface.co/nanonets/Nanonets-OCR-s
       | 
       | Try it with Docext in Colab:
       | https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
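        | 
        | The semantic tags above can be pulled out of the generated
        | Markdown with a few regexes. An illustrative sketch in pure
        | Python (the extract_tags helper is not part of the model's
        | tooling):

```python
import re

# Semantic tags Nanonets-OCR-s emits alongside Markdown (per the list above).
TAGS = ("img", "signature", "watermark", "page_number")

def extract_tags(markdown: str) -> dict:
    """Collect the contents of each semantic tag from model output."""
    found = {}
    for tag in TAGS:
        # Non-greedy match keeps multiple occurrences on a page separate.
        found[tag] = re.findall(rf"<{tag}>(.*?)</{tag}>", markdown, re.DOTALL)
    return found

out = extract_tags(
    "# Invoice\n"
    "<watermark>CONFIDENTIAL</watermark>\n"
    "Signed: <signature>J. Smith</signature>\n"
    "<page_number>3</page_number>\n"
)
```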
        
         | mvac wrote:
         | Correct link for Docext:
         | https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
        
         | generalizations wrote:
         | Does it have a way to extract the images themselves, or is that
         | still a separate process later?
        
           | j45 wrote:
           | If you are after extracting images from pdfs there's plenty
           | of tools that do that just fine without LLMs.
        
             | generalizations wrote:
             | I mean, ideally it would be in context, so the generated
             | markdown references the correct image at the correct
             | location in the doc. Unless that's what you're talking
             | about? In which case I don't know about those tools.
        
         | RicoElectrico wrote:
          | Could it be used (maybe with the help of a downstream LLM) to
          | parse a photo/PDF of a restaurant menu into a JSON file
          | conforming to a schema? Or would bigger, hosted multimodal
          | LLMs work better in such a case?
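          | 
          | For a regular enough menu, the downstream step could even be a
          | deterministic parser over the OCR Markdown. A pure-Python
          | sketch; the "name - price" line format and the JSON schema are
          | invented for illustration:

```python
import json
import re

def menu_markdown_to_json(md: str) -> str:
    """Parse 'Name - 12.50' style lines from OCR Markdown into JSON.

    Hypothetical schema: {"items": [{"name": ..., "price": ...}]}.
    """
    items = []
    for line in md.splitlines():
        # Optional bullet, item name, dash separator, then a price.
        m = re.match(r"[-*]?\s*(.+?)\s*[-–]\s*(\d+(?:\.\d+)?)\s*$", line)
        if m:
            items.append({"name": m.group(1), "price": float(m.group(2))})
    return json.dumps({"items": items})

result = json.loads(
    menu_markdown_to_json("## Mains\n- Ceviche - 28.50\n- Lomo Saltado - 32")
)
```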
        
         | gibsonf1 wrote:
         | Does it hallucinate with the LLM being used?
        
           | nattaylor wrote:
           | The base model is Qwen2.5-VL-3B and the announcement says a
           | limitation is "Model can suffer from hallucination"
        
             | gibsonf1 wrote:
             | Seems a bit scary that the "source" text from the pdfs
             | could actually be hallucinated.
        
           | michaelt wrote:
           | Sometimes. I just fed the huggingface demo an image
           | containing some rather improbable details [1] and it OCRed
           | "Page 1000000000000" with one extra trailing zero.
           | 
           | Honestly I was expecting the opposite - a repetition penalty
           | to kick in having repeated zero too many times, resulting in
           | too _few_ zeros - but apparently not. So you might want to
           | steer clear of this model if your document has a trillion
           | pages.
           | 
           | Other than that, it did a solid job - I've certainly seen
           | worse attempts to OCR a table.
           | 
           | [1] https://imgur.com/a/8rJeHf8
        
       | silversmith wrote:
       | I'm curious, how does it do with non-english texts? It's my
       | understanding that LLM-based OCR solutions fall way behind
       | traditional ones once you introduce other languages.
        
         | wickedsight wrote:
         | Understanding or experience?
         | 
          | Because my experience is not at all like that. If I use both
          | Google Translate and ChatGPT on an image, ChatGPT is pretty
          | much always better. It can even translate handwritten Japanese
          | menus quite well, with the added benefit of being able to add
          | context and explain what the dishes are.
        
           | silversmith wrote:
            | I'm passively interested in small, local LLM OCR, due to a
            | couple of ideas kicking around between my ears. I tried some
            | a while ago, but most of my recent knowledge is second-hand.
            | I'm waiting for someone to exclaim "hey, this works now!"
            | before committing more time :)
           | 
           | With the big commercial offerings like chatgpt I'd fully
           | expect them to work fine, due to the absolutely massive
           | horsepower in use.
        
       | raus22 wrote:
        | With models like these, when multilingual support is not
        | mentioned, they tend to perform really badly on real-life
        | non-English PDFs.
        
         | souvik3333 wrote:
         | The model was primarily trained on English documents, which is
         | why English is listed as the main language. However, the
         | training data did include a smaller proportion of Chinese and
         | various European languages. Additionally, the base model
         | (Qwen-2.5-VL-3B) is multilingual. Someone on Reddit mentioned
         | it worked on Chinese:
         | https://www.reddit.com/r/LocalLLaMA/comments/1l9p54x/comment...
        
       | progval wrote:
       | It's not open-source (nor open-weight):
       | https://huggingface.co/nanonets/Nanonets-OCR-s/discussions/2
        
         | souvik3333 wrote:
         | Hi, author of the model here. It is an open-weight model, you
         | can download it from here:
         | https://huggingface.co/nanonets/Nanonets-OCR-s
        
           | gardnr wrote:
           | Interestingly, another OCR model based on Qwen2.5-VL-3B just
           | dropped which also publishes as Apache 2. It's right next to
           | Nanonets-OCR-s on the HF "Trending" list.
           | 
           | https://huggingface.co/echo840/MonkeyOCR/blob/main/Recogniti.
           | ..
        
           | CaptainFever wrote:
           | IMO weights being downloadable doesn't mean it's open weight.
           | 
            | My understanding:
            | 
            | - Weight available: You can download the weights.
            | - Open weight: You can download the weights, and they are
            |   licensed freely (e.g. public domain, CC BY-SA, MIT).
            | - Open source: (Debated) You can download the weights, they
            |   are licensed freely, and the training dataset is also
            |   available and licensed freely.
           | 
           | For context:
           | 
           | > You're right. The Apache-2.0 license was mistakenly listed,
           | and I apologize for the confusion. Since it's a derivative of
           | Qwen-2.5-VL-3B, it will have the same license as the base
           | model (Qwen RESEARCH LICENSE AGREEMENT). Thanks for pointing
           | this out.
        
       | tensor wrote:
        | There are no benchmarks or accuracy measures on a hold-out set?
        
         | souvik3333 wrote:
          | Hi, author of the model here.
         | 
          | We have a benchmark for evaluating VLMs on document
          | understanding tasks: https://idp-leaderboard.org/ .
          | Unfortunately, it does not include image-to-markdown as a
          | task. The problem with evaluating image-to-markdown is that an
          | output can be correct even if its blocks come out in a
          | different order. E.g., if the image shows seller info and
          | buyer info side by side, one model can extract the seller info
          | first and another the buyer info first. Both are correct, but
          | depending on the ground truth, fuzzy matching will score one
          | model higher than the other.
         | 
          | Normally, a company will train and test on data annotated with
          | the same order convention (either left block first or right
          | block first), and all other models can get a low score on that
          | benchmark because they were trained on the opposite annotation
          | order.
        
           | krapht wrote:
           | If this is the only issue, can't this be addressed by
           | normalizing the post-processed data before scoring? (that is,
           | if it really is just a matter of block ordering)
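            | 
            | A sketch of that normalization: split both outputs into
            | blank-line-separated blocks, sort them canonically, then
            | fuzzy-match. (Illustrative only; a real benchmark would want
            | proper block alignment rather than plain sorting.)

```python
from difflib import SequenceMatcher

def order_invariant_score(pred: str, truth: str) -> float:
    """Fuzzy-match two block-structured texts, ignoring block order.

    Blocks are blank-line-separated; sorting them makes 'seller first'
    and 'buyer first' extractions compare equal.
    """
    def norm(text: str) -> str:
        blocks = [b.strip() for b in text.split("\n\n") if b.strip()]
        return "\n".join(sorted(blocks))

    return SequenceMatcher(None, norm(pred), norm(truth)).ratio()

a = "Seller: Acme Corp\n\nBuyer: Jane Doe"
b = "Buyer: Jane Doe\n\nSeller: Acme Corp"
score = order_invariant_score(a, b)
```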
        
           | tensor wrote:
           | The more important thing to me with any VLM is base OCR
           | performance and hallucinations. It's not too hard to get
           | improved average accuracy on very low quality scans using
           | language models. Unfortunately these also typically produce
           | large numbers of hallucinations, which are a deal breaker if
           | you are trying to get out values for financial or legal
           | purposes.
           | 
           | OCR that has lower accuracy, but where the inaccurate parts
           | are left blank or flagged are far superior. Mistral OCR also
           | suffers from this problem.
           | 
            | If your model produced bounding boxes for every text line
            | and you ran traditional OCR on each, this could alleviate
            | it. At the very least, bounding boxes let users
            | cross-correlate with the output from traditional OCR.
           | 
           | Also a small note, it's probably best not to say your product
           | beats Mistral when it's not even tested against it. Having
           | more features doesn't make a product better if the accuracy
           | is not better on those features.
           | 
           | I don't mean to be discouraging, this is an important space
           | and it looks like you have a very feature rich model. I'd
           | like to see a good solution be developed!
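            | 
            | The cross-correlation idea could be sketched as flagging any
            | VLM-produced line with no close fuzzy match in a traditional
            | OCR pass (the threshold here is arbitrary, for illustration
            | only):

```python
from difflib import SequenceMatcher

def flag_suspect_lines(vlm_lines, ocr_lines, threshold=0.6):
    """Flag VLM lines with no close counterpart in a traditional OCR pass.

    A line whose best fuzzy-match ratio against the OCR output falls
    below the threshold is a hallucination candidate for human review.
    """
    suspects = []
    for line in vlm_lines:
        best = max(
            (SequenceMatcher(None, line, o).ratio() for o in ocr_lines),
            default=0.0,
        )
        if best < threshold:
            suspects.append(line)
    return suspects

suspects = flag_suspect_lines(
    ["Total due: $1,234.00", "Approved by the board on 2031-01-01"],
    ["Total due: $1,234.00", "Invoice #42"],
)
```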
        
       | Eisenstein wrote:
       | How does it do with handwriting?
        
         | souvik3333 wrote:
          | We have not trained explicitly on handwriting datasets
          | (completely handwritten documents), but there is a lot of form
          | data with handwriting present in training. So do try it on
          | your files; there is a huggingface demo where you can quickly
          | test: https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s
         | 
         | We are currently working on creating completely handwritten
         | document datasets for our next model release.
        
           | Eisenstein wrote:
           | Document:
           | 
           | * https://imgur.com/cAtM8Qn
           | 
           | Result:
           | 
           | * https://imgur.com/ElUlZys
           | 
           | Perhaps it needed more than 1K tokens? But it took about an
           | hour (number 28 in queue) to generate that and I didn't feel
           | like trying again.
           | 
           | How many tokens does it usually take to represent a page of
           | text with 554 characters?
        
             | souvik3333 wrote:
              | Hey, the reason for the long processing time is that lots
              | of people are using it, probably with larger documents. I
              | tested your file locally and it seems to work correctly:
              | https://ibb.co/C36RRjYs
             | 
             | Regarding the token limit, it depends on the text. We are
             | using the qwen-2.5-vl tokenizer in case you are interested
             | in reading about it.
             | 
             | You can run it very easily in a Colab notebook. This should
             | be faster than the demo https://github.com/NanoNets/docext/
             | blob/main/PDF2MD_README.m...
             | 
                | There are incorrect words in the extraction, so I would
                | suggest you wait for the handwritten text model's
                | release.
        
               | mdaniel wrote:
               | > I tested your file locally seems to be working
               | correctly
               | 
               | Apologies if there's some unspoken nuance in this
               | exchange, but by "working correctly" did you just mean
               | that it ran to completion? I don't even recognize some of
               | the unicode characters that it emitted (or maybe you're
               | using some kind of strange font, I guess?)
               | 
               | Don't misunderstand me, a ginormous number of floating
               | point numbers attempting to read that handwriting is
               | already doing better than _I_ can, but I was just trying
               | to understand if you thought that outcome is what was
               | expected
        
               | Eisenstein wrote:
                | It actually did a decent job. Perhaps the font is
                | weird? For reference, here is the 'ground truth'
                | content, not in markdown:
               | 
               | Page# 8
               | 
               | Log: MA 6100 2.03.15
               | 
               | 34 cement emitter resistors - 0.33R 5W 5% measure 0.29R
               | 0.26R
               | 
               | 35 replaced R436, R430 emitter resistors on R-chn P.O.
               | brd w/new WW 5W .33R 5% w/ ceramic lead insulators
               | 
               | 36 applied de-oxit d100 to speaker outs, card terminals,
               | terminal blocks, output trans jacks
               | 
               | 37 replace R-chn drivers and class A BJTs w/ BD139/146, &
               | TIP31AG
               | 
               | 38 placed boards back in
               | 
               | 39 desoldered grnd lug from volume control
               | 
               | 40 contact cleaner, Deoxit D5, faderlube on pots &
               | switches teflon lube on rotor joint
               | 
               | 41 cleaned ground lug & resoldered, reattached panel
        
               | souvik3333 wrote:
               | This is the result. ``` Page 1 of 1 Page #
               | &lt;page_number&gt;8&lt;/page_number&gt;
               | 
               | Log: MA 6100 Z. O 3. 15
               | 
               | <table> <tr> <td>34</td> <td>cement emitter resistors
               | -</td> </tr> <tr> <td></td> <td>0.33 R SW 5% measure</td>
               | </tr> <tr> <td></td> <td>0.29 R, 0.26 R</td> </tr> <tr>
               | <td>35</td> <td>replaced R'4 36, R4 30</td> </tr> <tr>
               | <td></td> <td>emitter resistor on R-44</td> </tr> <tr>
               | <td></td> <td>0.0. 3rd w/ new WW 5W .33R</td> </tr> <tr>
               | <td>36</td> <td>% w/ ceramic lead insulators</td> </tr>
               | <tr> <td></td> <td>applied de-oat d100 to Speak</td>
               | </tr> <tr> <td></td> <td>outs, card terminals,
               | terminal</td> </tr> <tr> <td></td> <td>blocks, output
               | tran jacks</td> </tr> <tr> <td>37</td> <td>replace &-clun
               | diviers</td> </tr> <tr> <td></td> <td>and class A BJTs w/
               | BD139/140</td> </tr> <tr> <td></td> <td>& TIP37A2</td>
               | </tr> <tr> <td>38</td> <td>placed boards back in</td>
               | </tr> <tr> <td>39</td> <td>desoldered ground lus from
               | volume</td> </tr> <tr> <td></td> <td>(con 48)</td> </tr>
               | <tr> <td>40</td> <td>contact cleaner, Deox. t DS,
               | facel/42</td> </tr> <tr> <td></td> <td>on pots &
               | switches</td> </tr> <tr> <td></td> <td>* teflon lube on
               | rotor joint</td> </tr> <tr> <td>41</td> <td>reably
               | cleaned ground lus &</td> </tr> <tr> <td></td>
               | <td>resoldered, reattatched panel</td> </tr> </table> ```
               | 
               | You can paste it in https://markdownlivepreview.com/ and
               | see the extraction. This is using the Colab notebook I
               | have shared before.
               | 
               | Which Unicode characters are you mentioning here?
        
       | mvac wrote:
       | How does it compare to Datalab/Marker https://github.com/datalab-
       | to/marker ? We evaluated many PDF->MD converters and this one
       | performed the best, though it is not perfect.
        
         | wittjeff wrote:
         | I am just getting started with my own cross-comparison, would
         | appreciate your list of considered candidates if you have it
         | handy.
        
         | nxobject wrote:
          | As anecdotal evidence, it serves my complex-enough purposes
          | very well: mathematics and code interspersed together. One of
          | my "litmus test" papers is an old paper on a Fortran inverse-
          | Laplace transform algorithm [1] that intersperses inline and
          | display equations with monospace code blocks while requiring
          | OCR from scratch; very few models currently do a satisfactory
          | job. E.g., in the following page transcribed by Marker,
         | 
         | https://imgur.com/a/Q7UYIfW
         | 
         | the inline $\sigma_0$ is mangled as "<sup>s</sup> 0", and
         | $f(t)$ is mangled as " _f~_ ~t*!". The current model gets them
         | both correct.
        
       | ks2048 wrote:
       | It's a shame all these models target markdown and not something
       | with more structure and a specification. There are different
       | flavors of Markdown and limited support for footnotes,
       | references, figures, etc.
        
         | souvik3333 wrote:
         | Actually, we have trained the model to convert to markdown and
         | do semantic tagging at the same time. Eg, the equations will be
         | extracted as LaTeX equations, and images (plots, figures, and
         | so on) will be described within the `<img>` tags. Same with
         | `<signature>`, `<watermark>`, <page_number>.
         | 
         | Also, we extract the tables as HTML tables instead of markdown
         | for complex tables.
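          | 
          | Consumers who want pure Markdown can flatten the simple cases
          | back down. An illustrative sketch for tables without
          | rowspan/colspan (this helper is not part of any Nanonets
          | tooling):

```python
from html.parser import HTMLParser

class SimpleTableToMarkdown(HTMLParser):
    """Flatten a simple (no rowspan/colspan) HTML table into rows."""

    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell, self.cell = True, []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
            self.row.append("".join(self.cell).strip())
        elif tag == "tr":
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def table_to_markdown(html: str) -> str:
    """Treat the first row as the header and emit a Markdown table."""
    parser = SimpleTableToMarkdown()
    parser.feed(html)
    header, *body = parser.rows
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)

md = table_to_markdown(
    "<table><tr><th>Chapter</th><th>Page</th></tr>"
    "<tr><td>Preface</td><td>3</td></tr></table>"
)
```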
        
           | jtbayly wrote:
           | What happens to footnotes?
        
             | souvik3333 wrote:
              | They will be extracted as normal text on a new line. It
              | will be the last line of the page.
        
               | jtbayly wrote:
               | So I'm left to manually link them up?
               | 
               | Have you considered using something like Pandoc's method
               | of marking them up? Footnotes are a fairly common part of
               | scanned pages, and markdown that doesn't indicate that a
               | footnote is a footnote can be fairly incomprehensible.
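                | 
                | For OCR output that at least preserves bracketed
                | numeric markers, a post-processing pass could rewrite
                | them into Pandoc footnote syntax. A sketch (the marker
                | convention is an assumption, not something the model
                | guarantees):

```python
import re

def to_pandoc_footnotes(md: str) -> str:
    """Rewrite '[1]'-style references into Pandoc footnote syntax.

    A trailing line '[1] note text' becomes the definition
    '[^1]: note text'; an inline '[1]' becomes '[^1]'.
    """
    # Definition lines first, so the inline pass leaves them alone.
    md = re.sub(r"^\[(\d+)\]\s+", r"[^\1]: ", md, flags=re.MULTILINE)
    # Inline references: skip anything already rewritten.
    md = re.sub(r"(?<!\^)\[(\d+)\](?!:)", r"[^\1]", md)
    return md

src = "Sea cucumbers[1] are edible.\n\n[1] In some cuisines."
out = to_pandoc_footnotes(src)
```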
        
           | mgr86 wrote:
           | Have you considered XML. TEI, for example, is very robust and
           | mature for marking up documents.
        
             | esafak wrote:
             | First I heard of it.
             | https://en.wikipedia.org/wiki/Text_Encoding_Initiative
        
               | mgr86 wrote:
                | Understandable. I work in academic publishing, and while
                | the "XML is everywhere" crowd is graying, retiring, or
                | even dying :( it still remains an excellent option for
                | document markup. Additionally, a lot of government data
                | produced in the US and EU makes heavy use of XML
                | technologies. I imagine they could be interested
                | consumers of Nanonets-OCR. TEI could be a good choice,
                | as well-tested and mature conversions exist to other
                | popular, less structured formats.
        
               | jxramos wrote:
               | maybe even epub, which is xhtml
        
               | agoose77 wrote:
                | Do check out MyST Markdown (https://mystmd.org)!
                | Academic publishing is a space where MyST is being used,
                | e.g. https://www.elementalmicroscopy.com/ via Curvenote.
               | 
               | (I'm a MyST contributor)
        
               | viraptor wrote:
                | Do you know why MyST got traction instead of RST, which
                | seems to have all the custom tagging and extensibility
                | built in from the beginning?
        
             | lukev wrote:
             | Yeah this really hurts. If your goal is to precisely mark
             | up a document with some structural elements, XML is
             | strictly superior to Markdown.
             | 
             | The fact that someone would go to all the work to build a
             | model to extract the structure of documents, then choose an
             | output format strictly less expressive than XML, speaks
             | poorly of the state of cross-generational knowledge sharing
             | within the industry.
        
         | starkparker wrote:
         | I was more excited to hear about "structured Markdown" than the
         | LLM OCR model, but the extent of it just seems to be tagging
         | certain elements. It's useful in the LLM context but not as
         | much outside of it.
        
           | agoose77 wrote:
           | Feel free to check out MyST Markdown, which very much aims to
           | specify "structured Markdown": https://mystmd.org
        
       | el_don_almighty wrote:
       | I have been looking for something that would ingest a decade of
       | old Word and PowerPoint documents and convert them into a
       | standardized format where the individual elements could be
       | repurposed for other formats. This seems like a critical building
       | block for a system that would accomplish this task.
       | 
       | Now I need a catalog, archive, or historian function that
       | archives and pulls the elements easily. Amazing work!
        
         | pxc wrote:
         | Can't you just start with unoconv or pandoc, then maybe use an
         | LLM to clean up after converting to plain text?
        
       | constantinum wrote:
       | It would be interesting to know how it compares with Llamaparse,
       | LLMWhisperer, Marker, Reducto
        
       | nehalem wrote:
       | How does it do with multi-column text and headers and footers?
        
         | souvik3333 wrote:
         | We have trained the model on tables with hierarchical column
         | headers and with rowspan and colspan >1. So it should work
         | fine. This is the reason we predict the table in HTML instead
         | of markdown.
        
           | nehalem wrote:
            | Thank you. I was rather thinking of magazine-like layouts,
            | with columns of text and with headers and footers on every
            | page holding the article title and page number.
        
             | souvik3333 wrote:
              | It should work there also. We have trained on research
              | papers with two columns of text. Generally, papers have
              | references in the footer and contain page numbers.
        
       | b0a04gl wrote:
       | the interesting bit is it's tagging semantics during parsing
       | itself. knowing something's a signature or watermark or checkbox
       | before layout reconstruction. most pipelines bolt that on later
       | using heuristics or classifiers.
       | 
       | > curious what that pre-tagging does to downstream
       | simplification, especially for converting into json/html without
       | extra passes.
       | 
        | >also wondering how they handle ambiguity in visual cues without
        | layout metadata
        
       | kordlessagain wrote:
       | I created a Powershell script to run this locally on any PDF:
       | https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...
       | 
       | It does work, but it is very slow on my older GPU (Nvidia 1080
       | 8GB). I would say it's taking at least 5 minutes per page right
       | now, but maybe more.
       | 
        | Edit: If anyone is interested in trying a PDF to markdown
        | conversion utility built on this and hosted on Cloud Run (with
        | GPU support), let me know. It should be done in about an hour
        | or so, and I will post a link here when it's done.
        
         | kordlessagain wrote:
          | Reporting back on this, here's some sample output from
          | https://www.sidis.net/animate.pdf:
          | 
          |     THE ANIMATE AND THE INANIMATE
          |     WILLIAM JAMES SIDIS
          |     <img>A black-and-white illustration of a figure holding a
          |     book with the Latin phrase "ARTI et VERITATI" below
          |     it.</img>
          |     BOSTON
          |     RICHARD G. BADGER, PUBLISHER
          |     THE GORHAM PRESS
          |     Digitized by Google
          | 
          | I haven't seen ANY errors in what it has done, which is quite
          | impressive.
         | 
          | Here, it's doing tables of contents (I used a slightly
          | different copy of the PDF than I linked to):
          | 
          |     <table>
          |       <tr><td>Chapter</td><td>Page</td></tr>
          |       <tr><td>PREFACE</td><td>3</td></tr>
          |       <tr><td>I. THE REVERSE UNIVERSE</td><td>9</td></tr>
          |       <tr><td>II. REVERSIBLE LAWS</td><td>14</td></tr>
         | 
          | Other than the fact that it is ridiculously slow, this seems
          | to be quite good at doing what it says it does.
        
         | 2pointsomone wrote:
         | Very very interested!
        
       | ZQ-Dev8 wrote:
       | How's this compare with docling (https://github.com/docling-
       | project/docling)?
        
       | temp0826 wrote:
       | I have a Shipibo (indigenous Peruvian language) to Spanish
       | dictionary that I've been trying to translate into a Shipibo to
       | English dictionary using a couple different llms but keep
       | struggling with formatting (two columns, strange line breaks, but
       | also both Shipibo and Spanish in the definitions make it
       | difficult to grok). That all plus being pretty poorly scanned.
       | May need to give this a try.
        
       | Bestora wrote:
        | How does it handle documents with multi-column or multi-row
        | tables?
        | 
        | E.g. https://www.japanracing.de/Teilegutachten/Teilegutachten-
        | JR1... (page 1: rowspan; page 29: colspan)
        
       ___________________________________________________________________
       (page generated 2025-06-16 23:00 UTC)