[HN Gopher] Nanonets-OCR-s - OCR model that transforms documents...
___________________________________________________________________
Nanonets-OCR-s - OCR model that transforms documents into
structured markdown
Author : PixelPanda
Score : 265 points
Date : 2025-06-16 06:14 UTC (16 hours ago)
(HTM) web link (huggingface.co)
(TXT) w3m dump (huggingface.co)
| PixelPanda wrote:
| Full disclaimer: I work at Nanonets
|
| Excited to share Nanonets-OCR-s, a powerful and lightweight (3B)
| VLM that converts documents into clean, structured Markdown.
| The model is trained to understand document structure and
| content context (like tables, equations, images, plots,
| watermarks, checkboxes, etc.). Key features:
|
| LaTeX Equation Recognition Converts inline and block-level math
| into properly formatted LaTeX, distinguishing between $...$ and
| $$...$$.
|
| Image Descriptions for LLMs Describes embedded images using
| structured <img> tags. Handles logos, charts, plots, and so on.
|
| Signature Detection & Isolation Finds and tags signatures in
| scanned documents, outputting them in <signature> blocks.
|
| Watermark Extraction Extracts watermark text and stores it within
| <watermark> tag for traceability.
|
| Smart Checkbox & Radio Button Handling Converts checkboxes to
| Unicode symbols like ☐, ☑, and ☒ for reliable parsing in
| downstream apps.
|
| Complex Table Extraction Handles multi-row/column tables,
| preserving structure and outputting both Markdown and HTML
| formats.
|
| Huggingface / GitHub / Try it out:
| https://huggingface.co/nanonets/Nanonets-OCR-s
|
| Try it with Docext in Colab:
| https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
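As a minimal illustration of consuming these semantic tags downstream, here is a hypothetical parser. The tag names (`<img>`, `<signature>`, `<watermark>`, `<page_number>`) and the checkbox symbols follow the announcement above, but the sample document and the helper function are invented for illustration:

```python
import re

# Invented sample in the announced output format.
sample = (
    "<page_number>8</page_number>\n"
    "# Quarterly Report\n"
    "<watermark>CONFIDENTIAL</watermark>\n"
    "<img>A bar chart of revenue by region.</img>\n"
    "- \u2611 Approved\n"
    "- \u2610 Rejected\n"
    "<signature>J. Smith</signature>\n"
)

def extract_tags(markdown: str) -> dict:
    """Collect the contents of each semantic tag into lists."""
    tags = {}
    for name in ("img", "signature", "watermark", "page_number"):
        tags[name] = re.findall(rf"<{name}>(.*?)</{name}>", markdown, re.DOTALL)
    # Checkboxes are plain Unicode symbols, so a simple scan suffices.
    tags["checked"] = markdown.count("\u2611")
    tags["unchecked"] = markdown.count("\u2610")
    return tags

result = extract_tags(sample)
```

Because the tags are flat and non-nested, plain regexes are enough here; a real pipeline might prefer an HTML-aware parser.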
| mvac wrote:
| Correct link for Docext:
| https://github.com/NanoNets/docext/blob/main/PDF2MD_README.m...
| generalizations wrote:
| Does it have a way to extract the images themselves, or is that
| still a separate process later?
| j45 wrote:
| If you are after extracting images from pdfs there's plenty
| of tools that do that just fine without LLMs.
| generalizations wrote:
| I mean, ideally it would be in context, so the generated
| markdown references the correct image at the correct
| location in the doc. Unless that's what you're talking
| about? In which case I don't know about those tools.
| RicoElectrico wrote:
| Could it be used (maybe with the help of a downstream LLM) to
| parse a photo/PDF of a restaurant menu into a JSON file
| conforming to a schema? Or would bigger, hosted multimodal LLMs
| work better in such a case?
| gibsonf1 wrote:
| Does it hallucinate with the LLM being used?
| nattaylor wrote:
| The base model is Qwen2.5-VL-3B and the announcement says a
| limitation is "Model can suffer from hallucination"
| gibsonf1 wrote:
| Seems a bit scary that the "source" text from the pdfs
| could actually be hallucinated.
| michaelt wrote:
| Sometimes. I just fed the huggingface demo an image
| containing some rather improbable details [1] and it OCRed
| "Page 1000000000000" with one extra trailing zero.
|
| Honestly I was expecting the opposite - a repetition penalty
| to kick in having repeated zero too many times, resulting in
| too _few_ zeros - but apparently not. So you might want to
| steer clear of this model if your document has a trillion
| pages.
|
| Other than that, it did a solid job - I've certainly seen
| worse attempts to OCR a table.
|
| [1] https://imgur.com/a/8rJeHf8
| silversmith wrote:
| I'm curious, how does it do with non-english texts? It's my
| understanding that LLM-based OCR solutions fall way behind
| traditional ones once you introduce other languages.
| wickedsight wrote:
| Understanding or experience?
|
| Because my experience is not at all like that. If I use both
| Google Translate and ChatGPT on an image, ChatGPT is pretty
| much always better. It can even translate Japanese hand written
| menus quite well. With the added benefit of it being able to
| add context and explain what the dishes are.
| silversmith wrote:
| I'm passively interested in small, local LLM OCR, due to
| couple ideas kicking around between my ears. Tried some a
| while ago, but most of my recent knowledge is second-hand.
| Waiting for someone to exclaim "hey this works now!" before
| committing more time :)
|
| With the big commercial offerings like chatgpt I'd fully
| expect them to work fine, due to the absolutely massive
| horsepower in use.
| raus22 wrote:
| With models like these, when multilingual support is not
| mentioned, they tend to perform really badly on real-life
| non-English PDFs.
| souvik3333 wrote:
| The model was primarily trained on English documents, which is
| why English is listed as the main language. However, the
| training data did include a smaller proportion of Chinese and
| various European languages. Additionally, the base model
| (Qwen-2.5-VL-3B) is multilingual. Someone on Reddit mentioned
| it worked on Chinese:
| https://www.reddit.com/r/LocalLLaMA/comments/1l9p54x/comment...
| progval wrote:
| It's not open-source (nor open-weight):
| https://huggingface.co/nanonets/Nanonets-OCR-s/discussions/2
| souvik3333 wrote:
| Hi, author of the model here. It is an open-weight model, you
| can download it from here:
| https://huggingface.co/nanonets/Nanonets-OCR-s
| gardnr wrote:
| Interestingly, another OCR model based on Qwen2.5-VL-3B just
| dropped which is also published as Apache 2. It's right next to
| Nanonets-OCR-s on the HF "Trending" list.
|
| https://huggingface.co/echo840/MonkeyOCR/blob/main/Recogniti.
| ..
| CaptainFever wrote:
| IMO weights being downloadable doesn't mean it's open weight.
|
| My understanding:
|
| - Weight available: You can download the weights.
| - Open weight: You can download the weights, and it is licensed
| freely (e.g. public domain, CC BY-SA, MIT).
| - Open source: (Debated) You can download the weights, it is
| licensed freely, and the training dataset is also available and
| licensed freely.
|
| For context:
|
| > You're right. The Apache-2.0 license was mistakenly listed,
| and I apologize for the confusion. Since it's a derivative of
| Qwen-2.5-VL-3B, it will have the same license as the base
| model (Qwen RESEARCH LICENSE AGREEMENT). Thanks for pointing
| this out.
| tensor wrote:
| There are no benchmarks or accuracy measures on a hold out set?
| souvik3333 wrote:
| Hi, author of the model here.
|
| We have a benchmark for evaluating VLMs on document
| understanding tasks: https://idp-leaderboard.org/ . But
| unfortunately, it does not include image-to-markdown as a task.
| The problem with evaluating image-to-markdown is that even if
| the order of two blocks is different, the output can still be
| correct. E.g.: if the image has seller info and buyer info side
| by side, one model can extract the seller info first, and
| another model can extract the buyer info first. Both models
| would be correct, but depending on the ground truth, fuzzy
| matching will give one model higher accuracy than the other.
|
| Normally, a company will train and test on datasets annotated in
| the same order (either left block first or right block first),
| and all other models can get a low score on their benchmark
| because they were trained on the opposite annotation order.
| krapht wrote:
| If this is the only issue, can't this be addressed by
| normalizing the post-processed data before scoring? (that is,
| if it really is just a matter of block ordering)
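The normalization idea could be sketched as an order-invariant score: greedily match each predicted block to its best-scoring ground-truth block before averaging. This is a toy illustration under that assumption, not Nanonets' evaluation code:

```python
from difflib import SequenceMatcher

def block_score(pred: str, truth: str) -> float:
    """Fuzzy-match score that ignores the order of top-level blocks."""
    pred_blocks = [b.strip() for b in pred.split("\n\n") if b.strip()]
    truth_blocks = [b.strip() for b in truth.split("\n\n") if b.strip()]
    remaining = list(truth_blocks)
    total = 0.0
    for pb in pred_blocks:
        if not remaining:
            break
        # Greedily pair each predicted block with its best remaining match.
        best = max(remaining, key=lambda tb: SequenceMatcher(None, pb, tb).ratio())
        total += SequenceMatcher(None, pb, best).ratio()
        remaining.remove(best)
    return total / max(len(truth_blocks), 1)

truth = "Seller: Acme Corp\n\nBuyer: Widget Inc"
swapped = "Buyer: Widget Inc\n\nSeller: Acme Corp"
score = block_score(swapped, truth)
```

Under this scoring, an output that extracts the buyer block before the seller block is not penalized; greedy matching is a simplification, and an optimal assignment (e.g. Hungarian algorithm) would be more robust.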
| tensor wrote:
| The more important thing to me with any VLM is base OCR
| performance and hallucinations. It's not too hard to get
| improved average accuracy on very low quality scans using
| language models. Unfortunately these also typically produce
| large numbers of hallucinations, which are a deal breaker if
| you are trying to get out values for financial or legal
| purposes.
|
| OCR that has lower accuracy, but where the inaccurate parts
| are left blank or flagged are far superior. Mistral OCR also
| suffers from this problem.
|
| If your OCR produced bounding boxes for every text line, and
| ran a traditional OCR on the text, this could alleviate it.
| Or at the very least bounding boxes let users cross-correlate
| with output from traditional OCR.
|
| Also a small note, it's probably best not to say your product
| beats Mistral when it's not even tested against it. Having
| more features doesn't make a product better if the accuracy
| is not better on those features.
|
| I don't mean to be discouraging, this is an important space
| and it looks like you have a very feature rich model. I'd
| like to see a good solution be developed!
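The cross-correlation idea above could look roughly like this toy sketch: flag words in the VLM output that a traditional OCR pass never produced anywhere on the page. The function name, token sets, and inputs are invented for illustration; a real pipeline would compare per line using bounding boxes rather than a whole-page bag of tokens:

```python
def flag_hallucinations(vlm_text: str, traditional_tokens: set) -> list:
    """Return words from the VLM output absent from the traditional OCR pass."""
    suspects = []
    for word in vlm_text.split():
        cleaned = word.strip(".,;:!?").lower()
        if cleaned and cleaned.isalnum() and cleaned not in traditional_tokens:
            suspects.append(cleaned)
    return suspects

# Invented example: a traditional engine saw these tokens on the page.
ocr_tokens = {"invoice", "total", "due", "1250"}
flags = flag_hallucinations("Invoice total due: 1250 immediately", ocr_tokens)
```

Flagged words could then be left blank or surfaced for human review, which is exactly the failure mode preferred here over silent hallucination.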
| Eisenstein wrote:
| How does it do with handwriting?
| souvik3333 wrote:
| We have not trained explicitly on handwriting datasets
| (completely handwritten documents), but there is a lot of form
| data with handwriting present in training. So do try your files;
| there is a Hugging Face demo where you can quickly test:
| https://huggingface.co/spaces/Souvik3333/Nanonets-ocr-s
|
| We are currently working on creating completely handwritten
| document datasets for our next model release.
| Eisenstein wrote:
| Document:
|
| * https://imgur.com/cAtM8Qn
|
| Result:
|
| * https://imgur.com/ElUlZys
|
| Perhaps it needed more than 1K tokens? But it took about an
| hour (number 28 in queue) to generate that and I didn't feel
| like trying again.
|
| How many tokens does it usually take to represent a page of
| text with 554 characters?
| souvik3333 wrote:
| Hey, the reason for the long processing time is that lots of
| people are using the demo, probably with larger documents. I
| tested your file locally and it seems to work correctly:
| https://ibb.co/C36RRjYs
|
| Regarding the token limit, it depends on the text. We are
| using the qwen-2.5-vl tokenizer in case you are interested
| in reading about it.
|
| You can run it very easily in a Colab notebook. This should
| be faster than the demo https://github.com/NanoNets/docext/
| blob/main/PDF2MD_README.m...
|
| There are incorrect words in the extraction, so I would
| suggest you wait for the handwritten text model's release.
| mdaniel wrote:
| > I tested your file locally seems to be working
| correctly
|
| Apologies if there's some unspoken nuance in this
| exchange, but by "working correctly" did you just mean
| that it ran to completion? I don't even recognize some of
| the unicode characters that it emitted (or maybe you're
| using some kind of strange font, I guess?)
|
| Don't misunderstand me, a ginormous number of floating
| point numbers attempting to read that handwriting is
| already doing better than _I_ can, but I was just trying
| to understand if you thought that outcome is what was
| expected
| Eisenstein wrote:
| It actually did a decent job. Perhaps the font is weird? For
| reference, here is the 'ground truth' content, not in
| markdown:
|
| Page# 8
|
| Log: MA 6100 2.03.15
|
| 34 cement emitter resistors - 0.33R 5W 5% measure 0.29R
| 0.26R
|
| 35 replaced R436, R430 emitter resistors on R-chn P.O.
| brd w/new WW 5W .33R 5% w/ ceramic lead insulators
|
| 36 applied de-oxit d100 to speaker outs, card terminals,
| terminal blocks, output trans jacks
|
| 37 replace R-chn drivers and class A BJTs w/ BD139/146, &
| TIP31AG
|
| 38 placed boards back in
|
| 39 desoldered grnd lug from volume control
|
| 40 contact cleaner, Deoxit D5, faderlube on pots &
| switches teflon lube on rotor joint
|
| 41 cleaned ground lug & resoldered, reattached panel
| souvik3333 wrote:
| This is the result.
|
|     ```
|     Page 1 of 1
|     Page # <page_number>8</page_number>
|
|     Log: MA 6100 Z. O 3. 15
|
|     <table>
|     <tr> <td>34</td> <td>cement emitter resistors -</td> </tr>
|     <tr> <td></td> <td>0.33 R SW 5% measure</td> </tr>
|     <tr> <td></td> <td>0.29 R, 0.26 R</td> </tr>
|     <tr> <td>35</td> <td>replaced R'4 36, R4 30</td> </tr>
|     <tr> <td></td> <td>emitter resistor on R-44</td> </tr>
|     <tr> <td></td> <td>0.0. 3rd w/ new WW 5W .33R</td> </tr>
|     <tr> <td>36</td> <td>% w/ ceramic lead insulators</td> </tr>
|     <tr> <td></td> <td>applied de-oat d100 to Speak</td> </tr>
|     <tr> <td></td> <td>outs, card terminals, terminal</td> </tr>
|     <tr> <td></td> <td>blocks, output tran jacks</td> </tr>
|     <tr> <td>37</td> <td>replace &-clun diviers</td> </tr>
|     <tr> <td></td> <td>and class A BJTs w/ BD139/140</td> </tr>
|     <tr> <td></td> <td>& TIP37A2</td> </tr>
|     <tr> <td>38</td> <td>placed boards back in</td> </tr>
|     <tr> <td>39</td> <td>desoldered ground lus from volume</td> </tr>
|     <tr> <td></td> <td>(con 48)</td> </tr>
|     <tr> <td>40</td> <td>contact cleaner, Deox. t DS, facel/42</td> </tr>
|     <tr> <td></td> <td>on pots & switches</td> </tr>
|     <tr> <td></td> <td>* teflon lube on rotor joint</td> </tr>
|     <tr> <td>41</td> <td>reably cleaned ground lus &</td> </tr>
|     <tr> <td></td> <td>resoldered, reattatched panel</td> </tr>
|     </table>
|     ```
|
| You can paste it in https://markdownlivepreview.com/ and
| see the extraction. This is using the Colab notebook I
| have shared before.
|
| Which Unicode characters are you mentioning here?
| mvac wrote:
| How does it compare to Datalab/Marker https://github.com/datalab-
| to/marker ? We evaluated many PDF->MD converters and this one
| performed the best, though it is not perfect.
| wittjeff wrote:
| I am just getting started with my own cross-comparison, would
| appreciate your list of considered candidates if you have it
| handy.
| nxobject wrote:
| As anecdotal evidence, it serves my complex-enough purposes
| very well - mathematics and code interspersed together. One of
| my "litmus test" papers is this old paper on a Fortran inverse-
| Laplace transform algorithm [1] that intersperses inline and
| display equations, and monospace code blocks, while requiring
| OCR from scratch. Very few models currently do a satisfactory
| job; e.g., in the following page transcribed by Marker,
|
| https://imgur.com/a/Q7UYIfW
|
| the inline $\sigma_0$ is mangled as "<sup>s</sup> 0", and
| $f(t)$ is mangled as " _f~_ ~t*!". The current model gets them
| both correct.
| ks2048 wrote:
| It's a shame all these models target markdown and not something
| with more structure and a specification. There are different
| flavors of Markdown and limited support for footnotes,
| references, figures, etc.
| souvik3333 wrote:
| Actually, we have trained the model to convert to markdown and
| do semantic tagging at the same time. Eg, the equations will be
| extracted as LaTeX equations, and images (plots, figures, and
| so on) will be described within the `<img>` tags. Same with
| `<signature>`, `<watermark>`, and `<page_number>`.
|
| Also, we extract complex tables as HTML instead of markdown.
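A downstream consumer of the HTML table output can flatten it into rows with the standard library alone. This is a minimal sketch that ignores rowspan/colspan attributes, which real outputs for complex tables can contain:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect <td>/<th> cell text into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell, self._in_cell = [], [], [], False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell, self._cell = True, []
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._row.append("".join(self._cell).strip())
            self._in_cell = False
        elif tag == "tr":
            self.rows.append(self._row)
    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

p = TableRows()
p.feed("<table><tr><td>Chapter</td><td>Page</td></tr>"
       "<tr><td>Preface</td><td>3</td></tr></table>")
```

Handling spanning cells properly is exactly why HTML is a better target than markdown here; markdown tables cannot express them at all.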
| jtbayly wrote:
| What happens to footnotes?
| souvik3333 wrote:
| They will be extracted in a new line as normal text. It
| will be the last line.
| jtbayly wrote:
| So I'm left to manually link them up?
|
| Have you considered using something like Pandoc's method
| of marking them up? Footnotes are a fairly common part of
| scanned pages, and markdown that doesn't indicate that a
| footnote is a footnote can be fairly incomprehensible.
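A rough post-processing pass along these lines, assuming (hypothetically; real outputs may differ) that the OCR marks references as `[1]` in the body and emits `[1] ...` definition lines at the end, could rewrite them into Pandoc's footnote syntax:

```python
import re

def pandoc_footnotes(markdown: str) -> str:
    """Rewrite [1]-style references and trailing definitions as Pandoc footnotes."""
    # Definition lines like "[1] text" become "[^1]: text".
    out = re.sub(r"^\[(\d+)\] ", r"[^\1]: ", markdown, flags=re.MULTILINE)
    # In-text references like "[1]" become "[^1]" (skip rewritten definitions).
    out = re.sub(r"(?<!\^)\[(\d+)\](?!:)", r"[^\1]", out)
    return out

doc = "Widgets are load-bearing.[1]\n\n[1] See the 1998 spec."
converted = pandoc_footnotes(doc)
```

This only works if the model emits recognizable markers in the first place, which is the feature request being made here.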
| mgr86 wrote:
| Have you considered XML. TEI, for example, is very robust and
| mature for marking up documents.
| esafak wrote:
| First I heard of it.
| https://en.wikipedia.org/wiki/Text_Encoding_Initiative
| mgr86 wrote:
| Understandable. I work in academic publishing, and while
| the "XML is everywhere" crowd is graying, retiring, or even
| dying :( it still remains an excellent option for
| document markup. Additionally, a lot of government data
| produced in the US and EU make heavy use of XML
| technologies. I imagine they could be an interested
| consumer of Nanonets-OCR. TEI could be a good choice, as
| well-tested and developed conversions exist to other popular,
| less structured formats.
| jxramos wrote:
| maybe even epub, which is xhtml
| agoose77 wrote:
| Do check out MyST Markdown (https://mystmd.org)! Academic
| publishing is a space that MyST is being used, such as
| https://www.elementalmicroscopy.com/ via Curvenote.
|
| (I'm a MyST contributor)
| viraptor wrote:
| Do you know why MyST got traction, instead of RST, which
| seems to have all the custom tagging and extensibility
| built in from the beginning?
| lukev wrote:
| Yeah this really hurts. If your goal is to precisely mark
| up a document with some structural elements, XML is
| strictly superior to Markdown.
|
| The fact that someone would go to all the work to build a
| model to extract the structure of documents, then choose an
| output format strictly less expressive than XML, speaks
| poorly of the state of cross-generational knowledge sharing
| within the industry.
| starkparker wrote:
| I was more excited to hear about "structured Markdown" than the
| LLM OCR model, but the extent of it just seems to be tagging
| certain elements. It's useful in the LLM context but not as
| much outside of it.
| agoose77 wrote:
| Feel free to check out MyST Markdown, which very much aims to
| specify "structured Markdown": https://mystmd.org
| el_don_almighty wrote:
| I have been looking for something that would ingest a decade of
| old Word and PowerPoint documents and convert them into a
| standardized format where the individual elements could be
| repurposed for other formats. This seems like a critical building
| block for a system that would accomplish this task.
|
| Now I need a catalog, archive, or historian function that
| archives and pulls the elements easily. Amazing work!
| pxc wrote:
| Can't you just start with unoconv or pandoc, then maybe use an
| LLM to clean up after converting to plain text?
| constantinum wrote:
| It would be interesting to know how it compares with Llamaparse,
| LLMWhisperer, Marker, Reducto
| nehalem wrote:
| How does it do with multi-column text and headers and footers?
| souvik3333 wrote:
| We have trained the model on tables with hierarchical column
| headers and with rowspan and colspan >1. So it should work
| fine. This is the reason we predict the table in HTML instead
| of markdown.
| nehalem wrote:
| Thank you. I was rather thinking of magazine like layouts
| with columns of text and headers and footers on every page
| holding article title and page number.
| souvik3333 wrote:
| It should work there also. We have trained on research
| papers with two columns of text. Generally, papers have
| references as footers and contain page numbers.
| b0a04gl wrote:
| the interesting bit is it's tagging semantics during parsing
| itself. knowing something's a signature or watermark or checkbox
| before layout reconstruction. most pipelines bolt that on later
| using heuristics or classifiers.
|
| > curious what that pre-tagging does to downstream
| simplification, especially for converting into json/html without
| extra passes.
|
| >also wondering how they handle ambiguity in visual cues without
| layout metadata
| kordlessagain wrote:
| I created a Powershell script to run this locally on any PDF:
| https://gist.github.com/kordless/652234bf0b32b02e39cef32c71e...
|
| It does work, but it is very slow on my older GPU (Nvidia 1080
| 8GB). I would say it's taking at least 5 minutes per page right
| now, but maybe more.
|
| Edit: If anyone is interested in trying a PDF-to-markdown
| conversion utility built on this and hosted on Cloud Run (with
| GPU support), let me know. It should be done in about an hour or
| so and I will post a link here when it's done.
| kordlessagain wrote:
| Reporting back on this, here's some sample output from
| https://www.sidis.net/animate.pdf:
|
|     THE ANIMATE AND THE INANIMATE
|     WILLIAM JAMES SIDIS
|     <img>A black-and-white illustration of a figure holding a book
|     with the Latin phrase "ARTI et VERITATI" below it.</img>
|     BOSTON
|     RICHARD G. BADGER, PUBLISHER
|     THE GORHAM PRESS
|     Digitized by Google
|
| I haven't seen ANY errors in what it has done, which is quite
| impressive.
|
| Here, it's doing tables of contents (I used a slightly
| different copy of the PDF than I linked to):
|
|     <table>
|     <tr> <td>Chapter</td> <td>Page</td> </tr>
|     <tr> <td>PREFACE</td> <td>3</td> </tr>
|     <tr> <td>I. THE REVERSE UNIVERSE</td> <td>9</td> </tr>
|     <tr> <td>II. REVERSIBLE LAWS</td> <td>14</td> </tr>
|
| Other than the fact it is ridiculously slow, this seems to be
| quite good at doing what it says it does.
| 2pointsomone wrote:
| Very very interested!
| ZQ-Dev8 wrote:
| How's this compare with docling (https://github.com/docling-
| project/docling)?
| temp0826 wrote:
| I have a Shipibo (indigenous Peruvian language) to Spanish
| dictionary that I've been trying to translate into a Shipibo to
| English dictionary using a couple different llms but keep
| struggling with formatting (two columns, strange line breaks, but
| also both Shipibo and Spanish in the definitions make it
| difficult to grok). That all plus being pretty poorly scanned.
| May need to give this a try.
| Bestora wrote:
| How does it handle documents with multi column or multi row
| tables?
|
| e.g. https://www.japanracing.de/Teilegutachten/Teilegutachten-
| JR1... (page 1: rowspan, page 29: colspan)
___________________________________________________________________
(page generated 2025-06-16 23:00 UTC)