[HN Gopher] Donut: OCR-Free Document Understanding Transformer
       ___________________________________________________________________
        
       Donut: OCR-Free Document Understanding Transformer
        
       Author : hectormalot
       Score  : 269 points
       Date   : 2023-05-29 08:19 UTC (14 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | AmazingTurtle wrote:
       | I tested it out with a bunch of personal documents. Results were
       | disappointing. Did not match up with the promised scores, not
       | even slightly.
       | 
        | I think the traditional approach to scanning and classifying
        | without AI/ML is the way to go, for the next 5 years at the
        | very least.
        
         | paddw wrote:
          | For documents that are mostly pretty clean, you are probably
          | right. The ceiling for AI/ML is definitely higher, though, and
          | it's very useful right now if you know specifically what type
          | of document you expect to look at but expect it to be messy.
        
         | loudmax wrote:
          | Developments in this space are coming really fast, and reading
          | words is squarely within the capabilities of neural engines. 5
          | years is a very long time in AI years.
        
         | jrpt wrote:
         | How does it compare to something like https://docalysis.com?
        
         | iamflimflam1 wrote:
          | I think the online demos have been fine-tuned to work on
          | receipts.
        
         | nogridbag wrote:
         | What would you recommend for classifying documents? Most of the
         | companies I've evaluated market their product as using fancy
         | AI/ML, but instead they have hundreds of people, usually in
         | India, manually classifying the documents.
        
           | refulgentis wrote:
           | I strongly believe everything just has to go through OpenAI
           | or Anthropic, for now. These models are significantly better
           | than any NLP models I try swapping in.
           | 
           | But this isn't much help if you must classify images.
        
         | jstummbillig wrote:
         | https://cloud.google.com/use-cases/ocr
         | 
          | For my use cases, this has already beaten all "traditional
          | approaches" for at least a few months now. That's just
          | inferring from when I first stumbled across it. No clue how
          | long it's been a thing.
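          | 
          | If it helps, the Python call is tiny. A minimal sketch,
          | assuming the google-cloud-vision package is installed and
          | credentials are already configured (file name is just a
          | placeholder):
          | 
          |   # document_text_detection is the variant tuned for dense
          |   # document text (vs. text_detection for photos)
          |   from google.cloud import vision
          | 
          |   client = vision.ImageAnnotatorClient()
          |   with open("scan.png", "rb") as f:
          |       image = vision.Image(content=f.read())
          | 
          |   response = client.document_text_detection(image=image)
          |   print(response.full_text_annotation.text)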
        
           | subbu wrote:
            | Have you tried Azure's OCR? https://learn.microsoft.com/en-
            | us/azure/cognitive-services/c.... Is it comparable to
            | Google's?
        
             | driscoll42 wrote:
              | I did some OCR tests on some 1960s-era documents (all in
              | English), a mix of typed and handwritten. My results were:
              | 
              | Google Vision: 95.62% HW - 99.4% Typed
              | 
              | Amazon Textract: 95.63% HW - 99.3% Typed
              | 
              | Azure: 95.9% HW - 98.1% Typed
              | 
              | If curious, TrOCR was the best FOSS solution at 79.5% HW
              | and 97.4% Typed. (However, it took roughly 200x longer
              | than Tesseract, which was 43% HW and 97.0% Typed.)
        
               | dimatura wrote:
               | When did you do this test? I don't have any numbers
               | handy, but a couple years ago I compared google's OCR vs
               | AWS's on "text in the wild" pictures. AWS' wasn't bad,
               | but it was definitely outperformed by the google one. The
               | open-source solutions I tried (tesseract and some
               | academic deep-learning code) were far behind.
        
               | driscoll42 wrote:
               | This was a couple months ago now, so not that long ago.
               | For OCR I have found that it _highly_ depends on the type
               | of image you are looking at. In my case these were all
               | scanned documents of good but not great scan quality, all
                | in English. I expect if you were talking about random
                | photos with text in them, you'd see the FOSS solutions
                | do much worse, and much more variance between Google,
                | Amazon, and Azure. I would be curious about the academic
                | deep learning one you tried.
        
               | dimatura wrote:
               | The main one was https://github.com/JaidedAI/EasyOCR,
               | mostly because, as promised, it was pretty easy to use,
               | and uses pytorch (which I preferred in case I wanted to
               | tweak it). It has been updated since, but at the time it
               | was using CRNN, which is a solid model, especially for
               | the time - it wasn't (academic) SOTA but not far behind
               | that. I'm sure I could've coaxed better performance than
               | I got out of it with some retraining and hyperparameter
               | tuning.
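                | 
                | For anyone who wants to try it, basic usage is about as
                | simple as it gets. A minimal sketch against a current
                | easyocr install (file name is a placeholder):
                | 
                |   import easyocr
                | 
                |   # models download on first run; gpu=True if available
                |   reader = easyocr.Reader(['en'], gpu=False)
                |   # returns a list of (bounding box, text, confidence)
                |   for box, text, conf in reader.readtext('scan.png'):
                |       print(f"{conf:.2f}  {text}")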
        
           | j16sdiz wrote:
           | Do they feed your data to recaptcha ?
        
             | threeseed wrote:
              | Google has scanned 40 million+ physical books and
              | magazines, which it ran through OCR to produce digital
              | text.
              | 
              | So that's one source of training data, at least.
        
             | jstummbillig wrote:
             | I did not check. I also never checked if they share my
             | mails on google search with you -- but I trust their
             | ambition to not be sued into the ground for doing something
             | _immensely_ stupid.
             | 
             | Leaking sensitive data of enterprise customers as training
             | material for public recaptchas falls in that category.
        
         | mhitza wrote:
         | Do you (or anyone else) know which would be a good open source
         | OCR for PDFs and images?
        
           | bobosha wrote:
            | EasyOCR was the best of the bunch - however, it still leaves
            | quite a bit to be desired.
        
           | lelandfe wrote:
           | OCRmyPDF is the typical answer:
           | https://github.com/ocrmypdf/OCRmyPDF
           | 
           | It uses Tesseract under the hood. Results tend to just be OK
           | in my experience.
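            | 
            | It also exposes a Python API if you'd rather drive it from a
            | script. A minimal sketch (assumes tesseract and ghostscript
            | are installed; file names are placeholders):
            | 
            |   import ocrmypdf
            | 
            |   ocrmypdf.ocr(
            |       "input.pdf", "output.pdf",
            |       deskew=True,      # straighten slightly rotated pages
            |       force_ocr=True,   # re-OCR even if a text layer exists
            |   )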
        
             | xrd wrote:
              | Tesseract is amazing. It is simple and generally gives
              | good results.
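              | 
              | The whole thing is a couple of lines with pytesseract. A
              | rough sketch (the --psm/--oem values are just ones worth
              | experimenting with, not a recommendation):
              | 
              |   from PIL import Image
              |   import pytesseract
              | 
              |   img = Image.open("scan.png")
              |   # --psm sets page segmentation, --oem 1 picks the
              |   # LSTM engine; preprocessing the image also helps
              |   text = pytesseract.image_to_string(
              |       img, lang="eng", config="--psm 6 --oem 1")
              |   print(text)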
        
               | version_five wrote:
               | I was just playing with tesseract last week (I'd used it
               | years ago) and wasn't too happy. I had a pretty simple
               | pdf that was in what you could think of as an old
               | typewritten font, but easily legible, and I got all kinds
               | of word fragments and nonsense characters in the output.
               | I know that high quality ocr systems include a language
               | model to coerce the read text into the most probable
               | words. Is tesseract just supposed to be the first stage
               | of such a system?
               | 
               | I'll note that when I put the tesseract output into
               | chatgpt and prompted it saying it was ocr'd text and
               | asking to clean it up, it worked very well.
        
               | flaviut wrote:
                | I was just processing a document with tesseract &
                | ocrmypdf, and noticed two things:
                | 
                | The first time I processed it, I used
                | `ocrmypdf --redo-ocr` because it looked like there was
                | some existing OCR. After processing, the OCR was crap
                | because ocrmypdf didn't recognize the existing layer as
                | OCR output and treated it as real text in the document
                | that should be kept. This was fixable using
                | `ocrmypdf --force-ocr`.
               | 
                | Before realizing this, I discovered that Tesseract 4 & 5
                | use neural network-based recognition. I then came
               | across this step-by-step guide on fine-tuning Tesseract
               | for a specific document set:
               | https://www.statworx.com/en/content-hub/blog/fine-tuning-
               | tes...
               | 
               | I didn't end up following the fine-tuning process because
               | at this point `ocrmypdf --force-ocr` worked excellently,
               | but I thought the draw_box_file_data.py script from their
               | example was particularly useful: https://gist.github.com/
               | flaviut/d901be509425098645e4ae527a9e...
        
               | denysvitali wrote:
               | FWIW, I'm using Google's ML Kit which runs completely on-
               | device and doesn't send the documents to Google. It works
               | better than tesseract for my use case.
               | 
               | I did a presentation on the topic recently: https://clis-
               | everywhere.k8s.best/16
               | 
               | I'll soon make the stack open source, but it shouldn't be
               | hard to recreate given the inputs I've already provided.
        
               | [deleted]
        
               | dataflow wrote:
               | I could never get tesseract to give good output. It
               | baffles me when people say it's good. Do I need to tune
               | it somehow or something?
        
           | driscoll42 wrote:
            | Tesseract is generally the overall best for typed documents,
            | though it struggles with handwriting. TrOCR is better than
            | Tesseract, especially with handwriting, but requires a GPU to
            | have any speed. Tesseract in my tests was roughly 200x faster
            | than TrOCR (not an exaggeration).
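            | 
            | For reference, a minimal TrOCR sketch with Hugging Face
            | transformers looks roughly like this; note it works on a
            | single text line, so line segmentation has to happen up
            | front (file name is a placeholder):
            | 
            |   from transformers import (TrOCRProcessor,
            |                             VisionEncoderDecoderModel)
            |   from PIL import Image
            | 
            |   name = "microsoft/trocr-base-handwritten"
            |   processor = TrOCRProcessor.from_pretrained(name)
            |   model = VisionEncoderDecoderModel.from_pretrained(name)
            | 
            |   line = Image.open("line.png").convert("RGB")
            |   pixels = processor(images=line,
            |                      return_tensors="pt").pixel_values
            |   ids = model.generate(pixels)
            |   print(processor.batch_decode(
            |       ids, skip_special_tokens=True)[0])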
        
           | wahnfrieden wrote:
           | Not open but free - Apple's
        
             | occamrazor wrote:
             | Which Apple product?
        
               | imaurer wrote:
                | Two places I use it: Preview on my Mac and Photos on my
                | phone. Haven't seen an API yet.
        
               | momo93 wrote:
               | https://developer.apple.com/documentation/vision/recogniz
               | ing...
        
               | vosper wrote:
                | If you want to OCR specific text you can use TextSniper
                | on Mac and draw a box around whatever part of the screen
                | you want to capture. I'm guessing under the hood it's
                | just using Apple's OCR tech, which does work very well
                | (at least if you're on Apple Silicon; it's not quite so
                | fast on my 2015 Intel MacBook Pro).
        
               | wahnfrieden wrote:
               | VNRecognizeTextRequest and DataScannerViewController
        
           | kcorbitt wrote:
           | When I was evaluating options a few months ago I found
           | https://github.com/PaddlePaddle/PaddleOCR to be a very strong
           | contender for my use case (reading product labels), but
           | you'll definitely want to put together some representative
           | docs/images and test a bunch of solutions to see what works
           | for you.
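            | 
            | For reference, basic PaddleOCR usage looks roughly like this
            | (a sketch; the exact return format varies a bit between
            | versions, and the file name is a placeholder):
            | 
            |   from paddleocr import PaddleOCR
            | 
            |   # downloads detection/recognition models on first run
            |   ocr = PaddleOCR(use_angle_cls=True, lang="en")
            |   result = ocr.ocr("label.jpg", cls=True)
            |   # recent versions return one list per input image,
            |   # each entry being [bounding box, (text, confidence)]
            |   for box, (text, conf) in result[0]:
            |       print(f"{conf:.2f}  {text}")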
        
         | jameshart wrote:
         | Five years? Where's that number coming from?
         | 
         | This seems like exactly the kind of problem that will see rapid
         | improvements as people point more LLMs at multimodal input.
         | 
         | Right now making predictions for ML capabilities on a five year
         | timeframe seems foolhardy.
        
       | ryanjshaw wrote:
       | This is really cool if it delivers. I tried building an app to
       | scan till receipts. The image to text APIs out there really don't
        | perform as well as you'd think. AWS Textract performed far
       | better than GCP and Azure equivalents and traditional OCR
       | solutions, but it still made some really annoying errors that I
       | had to fix with heuristics.
        
         | werdnapk wrote:
         | Was Tesseract one of the APIs you tried?
        
           | ryanjshaw wrote:
           | Yup, it was solid but not as good as AWS out of the box. IIRC
           | preprocessing the image did help, but I didn't have enough
           | time to spend on fleshing that out for an MVP. (I gave up on
           | the project when I realised that recently introduced
           | protection of personal information laws in my country would
           | have made this project too risky to continue work on. The
           | intention was to automatically extract spending habits from
           | receipts to improve personal finance management.)
        
         | [deleted]
        
         | DannyBee wrote:
         | Unfortunately, trying this out, it seems to be nowhere near the
         | claimed quality. Definitely not ready for prime time.
         | 
         | Feels like someone trying to throw a stake in the ground rather
         | than releasing a quality product, honestly.
        
       | xavriley wrote:
       | There's a model for music transcription (audio to midi) called
       | MT3 which takes an end-to-end transformer approach and claims
       | SOTA on some datasets. However, from my own research and
       | comparing with other models it seems that MT3 is very prone to
       | overfitting and the real world results are not as impressive. A
        | similar story seems to be playing out in the comments here.
        
         | onnodigcomplex wrote:
         | What would you say is a good model for audio to midi
         | transcription?
        
       | vosper wrote:
       | I want to build an application that scans restaurant and cafe
       | menus (PDFs, photos, webpages) to identify which items are
       | vegetarian or vegan. Would this work for that? If not, I would
        | love to hear people's ideas and suggestions.
        
         | Alifatisk wrote:
         | We have a similar idea, but mine includes a few other
         | categories!
        
           | amelius wrote:
           | I bet in a year or so you don't need a specialized app for
           | it, but you just ask your phone whatever you want to know
           | about anything around you, including menus.
        
             | Alifatisk wrote:
              | Google Maps is pretty close to that; I can already find
              | lots of places or restaurants based on what I want to eat.
              | And now, thanks to crowdsourcing, you can filter by a
              | range of options.
             | 
             | The only problem is that the data is closely walled by
             | Google and you can only access it through their api.
             | 
              | What I want to create is a tiny search engine that collects
              | all menus (somehow collected from images) and lets users
              | find and filter what they like, and even get
              | recommendations nearby!
        
               | amelius wrote:
               | > The only problem is that the data is closely walled by
               | Google and you can only access it through their api.
               | 
               | Perhaps if iOS would allow an AI to access the screen
               | pixels directly ...
        
               | Alifatisk wrote:
               | And then OCR from the screen?
        
               | vosper wrote:
               | Possibly like this
               | 
               | https://news.ycombinator.com/item?id=34315782
        
         | tuardoui wrote:
          | You should look at LayoutLM models for an NER task. Your
          | pipeline would then look like this: identify the menu
          | sub-structure (title, item list, ...), then classify each item
          | with 2 labels.
          | 
          | The training process is not hard, but the data gathering /
          | cleaning / labelling can be a little long.
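          | 
          | A rough sketch of the classification step with LayoutLMv3 in
          | transformers (the label set is purely illustrative, and the
          | classification head still has to be fine-tuned on your
          | labelled menus before the predictions mean anything):
          | 
          |   from transformers import (LayoutLMv3Processor,
          |                             LayoutLMv3ForTokenClassification)
          |   from PIL import Image
          | 
          |   # illustrative label set
          |   labels = ["O", "TITLE", "ITEM-VEG", "ITEM-NON-VEG"]
          |   processor = LayoutLMv3Processor.from_pretrained(
          |       "microsoft/layoutlmv3-base")  # runs Tesseract OCR itself
          |   model = LayoutLMv3ForTokenClassification.from_pretrained(
          |       "microsoft/layoutlmv3-base", num_labels=len(labels))
          | 
          |   image = Image.open("menu.png").convert("RGB")
          |   encoding = processor(image, return_tensors="pt")
          |   outputs = model(**encoding)
          |   preds = outputs.logits.argmax(-1)  # one label id per token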
        
           | vosper wrote:
           | Thanks! I haven't heard of LayoutLM but something that can
           | understand structure from a few examples could be just what I
           | need.
        
         | 7moritz7 wrote:
          | The non-quantized models look relatively large, 800 MB. So
          | you'd probably need to do inference on a server and somehow
          | monetize that. Sounds difficult.
        
         | kolinko wrote:
          | With vegan you can't tell 100% from the menu alone, because
          | the sauce and other minor ingredients can be animal-based.
         | 
         | If you want to do it, using "plant based" is probably better
         | than "vegan", and it's always good to make sure your users are
         | aware that the mark can be wrong and they should double-check
         | with the waiter.
         | 
          | As for your question - I haven't played with Donut, but
          | ocr+gpt, or multimodal gpt4 once released, should handle this
          | smoothly.
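          | 
          | Something like this is roughly what I mean by ocr+gpt. A
          | sketch only (the prompt is illustrative, and it assumes
          | pytesseract plus the openai package with OPENAI_API_KEY set):
          | 
          |   import openai          # reads OPENAI_API_KEY from the env
          |   import pytesseract
          |   from PIL import Image
          | 
          |   menu_text = pytesseract.image_to_string(
          |       Image.open("menu.jpg"))
          |   prompt = ("For each item on this menu, say whether it is "
          |             "plant-based, vegetarian, or neither, and flag "
          |             "items where sauces or stock make it uncertain:"
          |             "\n\n" + menu_text)
          |   resp = openai.ChatCompletion.create(
          |       model="gpt-4",
          |       messages=[{"role": "user", "content": prompt}])
          |   print(resp.choices[0].message.content)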
        
           | ada1981 wrote:
           | 20+ years vegan here.
           | 
            | You could combine ingredient search, looking for symbols that
            | actually designate vegan as some places do, along with
            | long/lat data to determine what the restaurant actually is,
            | and then check it against a database you maintain.
            | 
            | So I could scan a menu, then ask the owner or server about
            | certain dishes, and then crowdsource an update.
            | 
            | It would be great if there was a standard API for all
            | restaurants that included all menu items, prices,
            | ingredients, preparation and sourcing information. It _could_
            | be maintained like a wiki I suppose, and restaurants could be
            | incentivized by having their restaurants included.
        
             | vosper wrote:
             | Yeah crowd-sourcing updates is the way I'd like to go. I'm
             | hoping people will submit photos of menus, because lots of
             | bakeries, cafes etc don't have much of an online presence
             | or keep their menus up-to-date.
             | 
              | I'm trying to solve the problem of scanning through menus
              | for multiple restaurants to find something a vegan or
              | vegetarian can eat by instead just showing all the
              | individual menu options in the area as a list.
        
       | tkanarsky wrote:
       | > Donut: DOcumeNt Understanding Transformer
       | 
       | Author: phew! I'm glad there's an 'n' in there somewhere
        
         | sebzim4500 wrote:
         | As AI goes that isn't too bad. See LION = evoLved sIgn
         | mOmeNtum, which I have to assume is parody.
        
         | saretup wrote:
         | Is there an online tool/piece of code that can find words like
         | this in a string?
        
           | layer8 wrote:
           | See https://stackoverflow.com/questions/56414347/find-the-
           | words-.... Replace _needles_ with a dictionary file of your
           | choice.
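            | 
            | The core check is just an in-order subsequence test, so a
            | few lines of Python also work. A sketch (the word-list path
            | depends on your system):
            | 
            |   def is_subsequence(word, text):
            |       it = iter(text)         # consume letters in order
            |       return all(ch in it for ch in word)
            | 
            |   phrase = "documentunderstandingtransformer"
            |   with open("/usr/share/dict/words") as f:
            |       words = [w.strip().lower() for w in f
            |                if len(w.strip()) >= 4]
            | 
            |   hits = [w for w in words if is_subsequence(w, phrase)]
            |   print(hits[:20])   # 'donut' should show up if listed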
        
           | IanCal wrote:
           | I'm sure everyone is kinda tired of this answer, but gpt4. At
           | least I have the share thing now so those who want to avoid
           | it don't have to see a big pasted output.
           | 
           | https://chat.openai.com/share/25124525-0bad-4c13-ae5a-ae4bea.
           | ..
           | 
           | 3.5 doesn't really get it, 4 does. There are some that aren't
           | great but the context is pretty small and it can be a decent
           | launching point.
        
             | TeMPOraL wrote:
              | I took your prompt and tried to get it to generate some
              | less serious-sounding backronyms. GPT-4 is good at this.
              | Me, not so much.
             | 
             | https://cloud.typingmind.com/share/b49794e0-4b2b-4213-ae2e-
             | 8...
             | 
             | I'll grant one thing: the paper must go well with cheese.
        
               | IanCal wrote:
                | Fantastic. I tried a few dafter ones based on this
                | https://chat.openai.com/share/d9a31442-96e5-4a18-92d9-d217c7...
        
               | leononame wrote:
               | This made me laugh out loud on the toilet. The cheese
               | ones are great, I really like FETA
        
               | JasonFruit wrote:
               | That was a beautiful misunderstanding, right out of 1960s
               | scifi.
        
             | JasonFruit wrote:
             | People are justifiably excited about these language tools,
             | but we're getting tired of this answer because it's not a
             | good answer: "Use GPT-4! Some of the answers aren't that
             | great, but it's at least a starting point." That's like if
             | you asked how to sort a list and remove duplicates, and the
             | answer was to import it into a spreadsheet program and
             | follow these steps instead of just "sort | uniq". It's
             | suggesting a general-purpose tool to do a specific job
             | maybe kinda acceptably instead of suggesting the right tool
             | for the job.
             | 
             | It reminds me of the microwave cookery books that came out
             | after consumer microwaves became available: there are
             | things a microwave is good at, but those books used it for
             | _everything_ , just like we're using GPT-4 today. We'll
             | calm down eventually.
        
               | IanCal wrote:
               | > but we're getting tired of this answer because it's not
               | a good answer: "Use GPT-4! Some of the answers aren't
               | that great, but it's at least a starting point."
               | 
                | Some of the answers are straight up usable; for others,
                | if you prefer, you can go from there, because _this is a
                | creative language task_.
                | 
                | And there isn't really a specific tool for this, is
                | there? It's nothing like your comparison to a very well
                | specified problem. "Identify what this thing does, and
                | come up with a title that also contains a word, and the
                | word is related to the topic" is not the same as
                | sort|uniq vs a spreadsheet.
        
               | adrianmonk wrote:
               | There are two issues here.
               | 
               | (1) The problem is not clearly defined. Does the word
               | need to be thematically related to the topic? (As far as
               | I can tell, "Donut" isn't thematically related to
               | document understanding.) Maybe you could say it's a nice,
               | optional bonus if it's related.
               | 
                | (2) The best solution would be good at _two_ things: (A)
                | satisfying constraints and (B) creativity. ChatGPT is
                | unlikely to be good at A, and a non-AI algorithm that
                | just finds valid words can't do B.
               | 
               | Regarding #1, if people don't all have the same idea of
               | the problem, they're not going to agree on the solution.
               | 
               | Regarding #2, maybe a combined solution would be best.
               | Generate all allowable words, then feed them to ChatGPT
               | and have it say which ones are thematically good.
        
               | IanCal wrote:
               | Tbh I think gpt4 shines at this. People's requirements
               | will be different, in weird ways. Duckdb things are all
               | duck related. Rust is related to crabs. Your project may
               | all be sweets related. It might be serious or fun. Maybe
               | you want a name easy to draw.
               | 
               | These are hard to encode.
               | 
                | Instead I just asked "Make them more fun, and related to
                | literary characters", then Muppets and awkward ones based
                | on Harry Potter, which it described as "certainly a
                | unique request". It's faster than getting a word list
                | related to that. And they are frankly great - better than
                | I'd come up with given much longer.
               | 
               | https://chat.openai.com/share/d9a31442-96e5-4a18-92d9-d21
               | 7c7...
               | 
                | This problem is great for LLMs. It's language, works well
                | with a back and forth discussing good and bad options,
                | has no well-defined output requirements but is easy to
                | explain to a person, has a human in the loop _and_ has
                | almost zero cost if it's wrong.
               | 
               | > QUIXOTE: Quality Unstructured Information Extraction
               | and Organization Through End-to-end transformer
        
               | yyyk wrote:
               | >That's like if you asked how to sort a list and remove
               | duplicates, and the answer was to import it into a
               | spreadsheet program and follow these steps instead of
               | just "sort | uniq".
               | 
                | That's _exactly_ how many Windows office users do it:
                | they paste a list into Excel and use it to remove
                | duplicates. There are alternatives even on Windows, but
                | it's much easier for them to use a single general-
                | purpose graphical tool (and let's not get started with
                | the abominations VLOOKUP is used for).
               | 
               | I used to look down on that but then I realized that
               | using a graphical program for list manipulation is kinda
               | cool and that this program is rather capable, and could
               | create combinations that are rather difficult to do with
               | the more specialized tools. I still use these specialized
               | tools (I'm used to them, and I can do some stuff they
               | can't easily do in Excel).
        
               | TeMPOraL wrote:
                | Yeah, I used to look down on that kind of thing too, and
                | a decade later, I find myself doing those very things.
               | 
               | Yes, I know sort | uniq. I even have a couple Linux
               | shells open on my Windows work system. But I can't for
               | the life of me remember the magic flags, so I'll either
               | paste the list to Emacs and M-x sort-lines + M-x delete-
               | duplicate-lines, or paste it to Excel and do it there, or
               | do something even more cheesy - whatever is least likely
               | to break my flow.
               | 
               | There are tools more or less optimized for any specific
               | job, but the _best_ tool for the job is the one you have
               | handy, and are experienced in using.
               | 
                | I too increasingly often find myself using GPT-4 for
                | random, ad-hoc tasks. There may or may not be better
                | tools out there. I may even have some installed. But none
                | of them beat being able to just describe what you want,
                | paste some data, and get the results out a few seconds
                | later.
        
               | thelastparadise wrote:
               | > Is there an online tool/piece of code that can find
               | words like this in a string?
               | 
               | ...but he's using a language model for a language task.
        
               | JasonFruit wrote:
               | Not all language tasks, even, are going to be best
               | handled by these models.
        
               | heyoni wrote:
              | But if there isn't a better one, then this isn't the
              | microwave analogy you think it is.
        
       | i2cmaster wrote:
        | I've started using Microsoft's TrOCR (another transformer OCR
        | model) to read the cursive in my pocket journal. (I have a habit
        | of writing programs there first while I'm out and then typing
        | them in manually; I just focus better that way.)
        | 
        | It's surprisingly accurate, although you have to write your own
        | program to segment the image into lines. I think with some fine-
        | tuning I could have the machine read my notebook with minimal
        | corrections.
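        | 
        | For the line segmentation, a horizontal projection profile goes
        | a long way as a first pass. A sketch (assuming reasonably clean,
        | dark-on-light scans and roughly level lines):
        | 
        |   import cv2
        | 
        |   img = cv2.imread("page.jpg", cv2.IMREAD_GRAYSCALE)
        |   _, ink = cv2.threshold(img, 0, 255,
        |                          cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        | 
        |   row_ink = ink.sum(axis=1)                  # ink per pixel row
        |   has_text = row_ink > 0.02 * row_ink.max()  # rows with writing
        | 
        |   lines, start = [], None
        |   for y, on in enumerate(has_text):
        |       if on and start is None:
        |           start = y
        |       elif not on and start is not None:
        |           lines.append(img[start:y, :])      # crop one line
        |           start = None
        | 
        |   for i, line in enumerate(lines):
        |       cv2.imwrite(f"line_{i:02d}.png", line)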
        
         | driscoll42 wrote:
          | Have you looked into CRAFT or EAST for segmenting the image
          | into lines? Those two work decently.
        
       | aosmith wrote:
       | So is this why IA had an outage? Timing is perfect.
        
       | dkatz23238 wrote:
        | As a developer who has been building IDP solutions, I can assert
        | that although this model is a lot larger (more weights) than a
        | Graph Neural Network on OCR tokens (the industry standard before
        | transformers), it outperforms them given enough data. Depending
        | on how heterogeneous the data is, usually 200 documents are
        | enough to reach human levels of accuracy, scoring by Levenshtein
        | ratio.
        | 
        | Smaller graph models could get away with using less data. The
        | problem with the "traditional" approach was that the quality of
        | the OCR was the bottleneck for overall model performance. It
        | amazes me how this problem shifted from a node classification
        | problem to an image-to-text problem.
        | 
        | Training on CPU was possible with a GCN but not with Donut.
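        | 
        | For anyone unfamiliar with the scoring: it's just string
        | similarity between predicted and ground-truth values. A sketch
        | (difflib is a stdlib stand-in here; python-Levenshtein gives a
        | proper Levenshtein ratio and is faster):
        | 
        |   from difflib import SequenceMatcher
        | 
        |   def score(pred: str, gold: str) -> float:
        |       # 1.0 means an exact match; lower means more edits needed
        |       return SequenceMatcher(None, pred, gold).ratio()
        | 
        |   print(score("Invoice No. 12345", "Invoice No 12345"))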
        
       | dowakin wrote:
        | If you want to train Donut, check out this notebook on Kaggle.
        | It trains Donut to read plots for a competition, and it contains
        | the full pipeline for fine-tuning:
        | https://www.kaggle.com/code/nbroad/donut-train-benetech
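        | 
        | Once fine-tuned, inference with transformers is short. A rough
        | sketch using the publicly released receipt checkpoint, where the
        | task-prompt token selects what the decoder was trained to emit
        | (file name is a placeholder):
        | 
        |   from transformers import (DonutProcessor,
        |                             VisionEncoderDecoderModel)
        |   from PIL import Image
        | 
        |   name = "naver-clova-ix/donut-base-finetuned-cord-v2"
        |   processor = DonutProcessor.from_pretrained(name)
        |   model = VisionEncoderDecoderModel.from_pretrained(name)
        | 
        |   image = Image.open("receipt.jpg").convert("RGB")
        |   pixels = processor(image, return_tensors="pt").pixel_values
        |   prompt_ids = processor.tokenizer(
        |       "<s_cord-v2>", add_special_tokens=False,
        |       return_tensors="pt").input_ids
        | 
        |   out = model.generate(pixels, decoder_input_ids=prompt_ids,
        |                        max_length=512)
        |   seq = processor.batch_decode(out)[0]
        |   print(processor.token2json(seq))  # structured fields as dict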
        
       | armchairhacker wrote:
       | These OCR tools are bringing us closer to msPaint as a viable IDE
        
       | siddiqi123 wrote:
        | What are some available prompts? For example, I see there are
        | "<s_synthdog>" and "<s_iitcdip>" - what are some other options? I
        | tried made-up ones such as "<s_what>" and they seem to also
        | work - why? What is the meaning of prompts here? Thanks
        
       | nestorD wrote:
        | I will have to investigate this. I am dreaming of a system that
        | can take a PDF scan of a book as input and produce one or more
        | properly formatted (headings, italic, bold, underline, etc.)
        | markdown files. In my tests, LLMs have proved very good at
        | cleaning up raw OCR, but they need formatting information to get
        | me all the way.
        
         | jcuenod wrote:
         | It's not ready to take a book, but I'm building an app that
         | takes scans of book chapters/journal articles (which I often
         | receive from my college library) and turns them into well
         | formatted PDFs (with OCR, consistent margins, rotation...)
         | https://fixpdfs.com
        
           | nestorD wrote:
            | I don't think it supports a markdown / formatted text export,
            | but it looks fantastic as far as PDF cleanup goes (I
            | currently rely on Adobe Scan for that when I am working from
            | a paper copy). I will try it soon.
        
             | jcuenod wrote:
             | No, it doesn't support markdown and it doesn't do analysis
             | of headers/page numbers. It's mainly aimed at making
             | academic PDFs better for reading and annotating (especially
             | on iPad-like devices). Hoping to start charging for it at
             | some point, but I'm still trialing it...
             | 
             | Some users have expected it to "unwarp" bad scans, it also
             | doesn't do that unfortunately. But that's a much harder
             | problem to solve...
        
               | nestorD wrote:
               | I found perspective transformation to be good enough for
               | 90% of cases. Going further would be a lot of effort
               | better spent elsewhere. Does it deal fine with
               | illustrations and pdf compression?
        
       ___________________________________________________________________
       (page generated 2023-05-29 23:00 UTC)