[HN Gopher] Donut: OCR-Free Document Understanding Transformer
___________________________________________________________________
Donut: OCR-Free Document Understanding Transformer
Author : hectormalot
Score : 269 points
Date : 2023-05-29 08:19 UTC (14 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| AmazingTurtle wrote:
| I tested it out with a bunch of personal documents. Results were
| disappointing. Did not match up with the promised scores, not
| even slightly.
|
| I think the traditional approach to scanning and classifying
| without AI/ML is the way to go, for the next 5 years at the very
| least.
| paddw wrote:
| For documents which are mostly pretty clean you are probably
| right. The ceiling for AI/ML is definitely higher though, and
| very useful right now if you know specifically what type of
| document you expect to look at, but expect it to be messy.
| loudmax wrote:
  | Developments in this space are coming really fast, and reading
  | words is squarely within the capabilities of neural engines. 5
  | years is a very long time in AI years.
| jrpt wrote:
| How does it compare to something like https://docalysis.com?
| iamflimflam1 wrote:
| I think the online demos have been fine-tuned to work on
| receipts.
| nogridbag wrote:
| What would you recommend for classifying documents? Most of the
| companies I've evaluated market their product as using fancy
| AI/ML, but instead they have hundreds of people, usually in
| India, manually classifying the documents.
| refulgentis wrote:
| I strongly believe everything just has to go through OpenAI
| or Anthropic, for now. These models are significantly better
| than any NLP models I try swapping in.
|
| But this isn't much help if you must classify images.
| jstummbillig wrote:
| https://cloud.google.com/use-cases/ocr
|
| For my use cases, this already beats all "traditional
| approaches", and has for at least a few months now. That's just inferring
| from when I first stumbled across it. No clue for how long it's
| been a thing.
| subbu wrote:
| have you tried Azure's OCR? https://learn.microsoft.com/en-
| us/azure/cognitive-services/c.... Is it comparable to
| Google's?
| driscoll42 wrote:
| I did some OCR tests on some 1960s era documents (all in
| English). Mix of typed and handwritten. I had as results:
|
| Google Vision: 95.62% HW - 99.4% Typed
|
    | Amazon Textract: 95.63% HW - 99.3% Typed
|
| Azure: 95.9% HW - 98.1% Typed
|
| Then if curious, TrOCR was the best FOSS solution at 79.5%
| HW and 97.4% Typed. (However it took roughly 200x longer
| than Tesseract which was 43% HW and 97.0% Typed)
| dimatura wrote:
| When did you do this test? I don't have any numbers
| handy, but a couple years ago I compared google's OCR vs
| AWS's on "text in the wild" pictures. AWS' wasn't bad,
| but it was definitely outperformed by the google one. The
| open-source solutions I tried (tesseract and some
| academic deep-learning code) were far behind.
| driscoll42 wrote:
| This was a couple months ago now, so not that long ago.
| For OCR I have found that it _highly_ depends on the type
| of image you are looking at. In my case these were all
| scanned documents of good but not great scan quality, all
        | in English. I expect if you were talking about random
        | photos with text in them, you'd see the FOSS solutions
        | do much worse, and much more variance between Google,
        | Amazon, and Azure. I would be curious about the academic
| deep learning one you tried.
| dimatura wrote:
| The main one was https://github.com/JaidedAI/EasyOCR,
| mostly because, as promised, it was pretty easy to use,
| and uses pytorch (which I preferred in case I wanted to
| tweak it). It has been updated since, but at the time it
| was using CRNN, which is a solid model, especially for
| the time - it wasn't (academic) SOTA but not far behind
| that. I'm sure I could've coaxed better performance than
| I got out of it with some retraining and hyperparameter
| tuning.
| j16sdiz wrote:
  | Do they feed your data to recaptcha?
| threeseed wrote:
    | Google has scanned 40 million+ physical books and magazines,
    | on which it used OCR to produce digital text.
|
| So one source of training data at least.
| jstummbillig wrote:
| I did not check. I also never checked if they share my
| mails on google search with you -- but I trust their
| ambition to not be sued into the ground for doing something
| _immensely_ stupid.
|
| Leaking sensitive data of enterprise customers as training
| material for public recaptchas falls in that category.
| mhitza wrote:
| Do you (or anyone else) know which would be a good open source
| OCR for PDFs and images?
| bobosha wrote:
| easyocr was the best of the bunch - however it still leaves
| quite a bit to be desired.
| lelandfe wrote:
| OCRmyPDF is the typical answer:
| https://github.com/ocrmypdf/OCRmyPDF
|
| It uses Tesseract under the hood. Results tend to just be OK
| in my experience.
| xrd wrote:
    | Tesseract is amazing. It is simple and generally gives good
    | results.
| version_five wrote:
| I was just playing with tesseract last week (I'd used it
| years ago) and wasn't too happy. I had a pretty simple
| pdf that was in what you could think of as an old
| typewritten font, but easily legible, and I got all kinds
| of word fragments and nonsense characters in the output.
| I know that high quality ocr systems include a language
| model to coerce the read text into the most probable
| words. Is tesseract just supposed to be the first stage
| of such a system?
|
      | I'll note that when I put the tesseract output into
      | chatgpt, told it that it was OCR'd text, and asked it to
      | clean it up, it worked very well.
| flaviut wrote:
| I was just processing a document with tesseract &
| ocrmypdf, and two things:
|
| My first time processing it, I used `ocrmypdf --redo-ocr`
| because it looked like there was some existing OCR. After
| processing, the OCR was crap because ocrmypdf didn't
| realize it was OCR but thought it was real text in the
| document that should be kept. This was fixable using
| `ocrmypdf --force-ocr`.
|
        | Before realizing this, I discovered that Tesseract 4 & 5
        | use neural network-based recognition. I then came
| across this step-by-step guide on fine-tuning Tesseract
| for a specific document set:
| https://www.statworx.com/en/content-hub/blog/fine-tuning-
| tes...
|
| I didn't end up following the fine-tuning process because
| at this point `ocrmypdf --force-ocr` worked excellently,
| but I thought the draw_box_file_data.py script from their
| example was particularly useful: https://gist.github.com/
| flaviut/d901be509425098645e4ae527a9e...
| denysvitali wrote:
| FWIW, I'm using Google's ML Kit which runs completely on-
| device and doesn't send the documents to Google. It works
| better than tesseract for my use case.
|
| I did a presentation on the topic recently: https://clis-
| everywhere.k8s.best/16
|
| I'll soon make the stack open source, but it shouldn't be
| hard to recreate given the inputs I've already provided.
| [deleted]
| dataflow wrote:
| I could never get tesseract to give good output. It
| baffles me when people say it's good. Do I need to tune
| it somehow or something?
| driscoll42 wrote:
  | Tesseract is generally the overall best for typed documents,
| though it struggles with handwriting. TrOCR is better than
| Tesseract, especially with handwriting, but requires a GPU to
| have any speed. Tesseract from my tests was roughly 200X
| faster than TrOCR (not an exaggeration)
| wahnfrieden wrote:
| Not open but free - Apple's
| occamrazor wrote:
| Which Apple product?
| imaurer wrote:
| Two places I use it: Preview on my Mac, photos on my
| phone. Haven't seen an api yet.
| momo93 wrote:
| https://developer.apple.com/documentation/vision/recogniz
| ing...
| vosper wrote:
| If you want to OCR specific text you can use Textsniper
| on Mac and draw a box on whatever part of the screen you
| want to capture. I'm guessing under the hood it's just
| using Apple's OCR tech, which does work very well (at
| least if you're on Apple Silicon, it's not quite so fast
| on my 2015 Intel Macbook Pro)
| wahnfrieden wrote:
| VNRecognizeTextRequest and DataScannerViewController
| kcorbitt wrote:
| When I was evaluating options a few months ago I found
| https://github.com/PaddlePaddle/PaddleOCR to be a very strong
| contender for my use case (reading product labels), but
| you'll definitely want to put together some representative
| docs/images and test a bunch of solutions to see what works
| for you.
| jameshart wrote:
| Five years? Where's that number coming from?
|
| This seems like exactly the kind of problem that will see rapid
| improvements as people point more LLMs at multimodal input.
|
| Right now making predictions for ML capabilities on a five year
| timeframe seems foolhardy.
| ryanjshaw wrote:
| This is really cool if it delivers. I tried building an app to
| scan till receipts. The image to text APIs out there really don't
| perform as well as you'd think. AWS Textract performed far
| better than GCP and Azure equivalents and traditional OCR
| solutions, but it still made some really annoying errors that I
| had to fix with heuristics.
| werdnapk wrote:
| Was Tesseract one of the APIs you tried?
| ryanjshaw wrote:
| Yup, it was solid but not as good as AWS out of the box. IIRC
| preprocessing the image did help, but I didn't have enough
| time to spend on fleshing that out for an MVP. (I gave up on
| the project when I realised that recently introduced
| protection of personal information laws in my country would
| have made this project too risky to continue work on. The
| intention was to automatically extract spending habits from
| receipts to improve personal finance management.)
| [deleted]
| DannyBee wrote:
| Unfortunately, trying this out, it seems to be nowhere near the
| claimed quality. Definitely not ready for prime time.
|
| Feels like someone trying to put a stake in the ground rather
| than releasing a quality product, honestly.
| xavriley wrote:
| There's a model for music transcription (audio to midi) called
| MT3 which takes an end-to-end transformer approach and claims
| SOTA on some datasets. However, from my own research and
| comparing with other models it seems that MT3 is very prone to
| overfitting and the real-world results are not as impressive. A
| similar story seems to be playing out in the comments here.
| onnodigcomplex wrote:
| What would you say is a good model for audio to midi
| transcription?
| vosper wrote:
| I want to build an application that scans restaurant and cafe
| menus (PDFs, photos, webpages) to identify which items are
| vegetarian or vegan. Would this work for that? If not, I would
| love to hear peoples ideas and suggestions.
| Alifatisk wrote:
| We have a similar idea, but mine includes a few other
| categories!
| amelius wrote:
    | I bet in a year or so you won't need a specialized app for
    | it; you'll just ask your phone whatever you want to know
    | about anything around you, including menus.
| Alifatisk wrote:
| Google maps is pretty close to that, I can already find
| lots of places or restaurants based on what I want to eat.
      | And now, thanks to crowdsourcing, you can filter by a
| range of options.
|
| The only problem is that the data is closely walled by
| Google and you can only access it through their api.
|
| What I want to create is a tiny search engine that collects
| all menus (somehow collected from images) and let users
| find and filter what they like, and even get
| recommendations nearby!
| amelius wrote:
| > The only problem is that the data is closely walled by
| Google and you can only access it through their api.
|
| Perhaps if iOS would allow an AI to access the screen
| pixels directly ...
| Alifatisk wrote:
| And then OCR from the screen?
| vosper wrote:
| Possibly like this
|
| https://news.ycombinator.com/item?id=34315782
| tuardoui wrote:
  | You should look at LayoutLM models for an NER task. Then your
  | pipeline would look like: identify the menu substructure
  | (title, item list, ...), then classify each item with 2 labels.
|
| The training process is not hard, but the data gathering /
| cleaning / labelling can be a little long.
| vosper wrote:
| Thanks! I haven't heard of LayoutLM but something that can
| understand structure from a few examples could be just what I
| need.
| 7moritz7 wrote:
      | The non-quantized models look relatively large, 800 MB. So
      | you'd probably need to do inference on a server and somehow
      | monetize that. Sounds difficult.
| kolinko wrote:
  | With vegan you can't estimate it 100% from the menu alone,
  | because the sauce and other minor ingredients can be animal
  | based.
|
| If you want to do it, using "plant based" is probably better
| than "vegan", and it's always good to make sure your users are
| aware that the mark can be wrong and they should double-check
| with the waiter.
|
| As for your question - I didn't play with Donut, but ocr+gpt or
| multimodal gpt4 once released should handle this smoothly.
| ada1981 wrote:
| 20+ years vegan here.
|
    | You could combine ingredient search, looking for symbols that
    | actually designate vegan (as some places do), along with
    | lat/long data to determine which restaurant it actually is,
    | and then check it against a database you maintain.
|
    | So I could scan a menu, then ask the owner or server about
    | certain dishes, and then crowdsource the updated answers.
|
    | It would be great if there was a standard API for all
    | restaurants that included all menu items, prices,
    | ingredients, preparation and sourcing information. It
    | _could_ be maintained like a wiki I suppose, and restaurants
    | could be incentivized to include their own listings.
| vosper wrote:
| Yeah crowd-sourcing updates is the way I'd like to go. I'm
| hoping people will submit photos of menus, because lots of
| bakeries, cafes etc don't have much of an online presence
| or keep their menus up-to-date.
|
      | I'm trying to solve the problem of scanning through menus
      | for multiple restaurants to find something a vegan or
      | vegetarian can eat, by instead just showing all the
      | individual menu options in the area as a list.
| tkanarsky wrote:
| > Donut: DOcumeNt Understanding Transformer
|
| Author: phew! I'm glad there's an 'n' in there somewhere
| sebzim4500 wrote:
| As AI goes that isn't too bad. See LION = evoLved sIgn
| mOmeNtum, which I have to assume is parody.
| saretup wrote:
| Is there an online tool/piece of code that can find words like
| this in a string?
| layer8 wrote:
| See https://stackoverflow.com/questions/56414347/find-the-
| words-.... Replace _needles_ with a dictionary file of your
| choice.
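For a quick offline pass, the subsequence check behind these backronyms is only a few lines of Python. A minimal sketch (the function name and the tiny word list are mine, not from the linked answer -- a real dictionary file would replace `candidates`):

```python
def contains_in_order(word: str, phrase: str) -> bool:
    """True if the letters of `word` appear in order (not necessarily
    adjacently) among the letters of `phrase` -- the test behind
    backronyms like Donut = DOcumeNt Understanding Transformer."""
    letters = iter(c for c in phrase.lower() if c.isalpha())
    # `c in letters` consumes the iterator, so matches must occur in order.
    return all(c in letters for c in word.lower())

# A tiny stand-in for a real dictionary file:
candidates = ["donut", "demon", "pizza"]
matches = [w for w in candidates
           if contains_in_order(w, "Document Understanding Transformer")]
```

Greedy earliest matching is sufficient here: if the greedy scan fails, no subsequence embedding exists.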
| IanCal wrote:
| I'm sure everyone is kinda tired of this answer, but gpt4. At
| least I have the share thing now so those who want to avoid
| it don't have to see a big pasted output.
|
| https://chat.openai.com/share/25124525-0bad-4c13-ae5a-ae4bea.
| ..
|
| 3.5 doesn't really get it, 4 does. There are some that aren't
| great but the context is pretty small and it can be a decent
| launching point.
| TeMPOraL wrote:
      | I took your prompt and tried to get it to generate some
      | less serious-sounding backronyms. GPT-4 is good at this.
      | Me, not so much.
|
| https://cloud.typingmind.com/share/b49794e0-4b2b-4213-ae2e-
| 8...
|
| I'll grant one thing: the paper must go well with cheese.
| IanCal wrote:
| Fantastic. I tried a few dafter ones based on this https:
| //chat.openai.com/share/d9a31442-96e5-4a18-92d9-d217c7...
| leononame wrote:
| This made me laugh out loud on the toilet. The cheese
| ones are great, I really like FETA
| JasonFruit wrote:
| That was a beautiful misunderstanding, right out of 1960s
| scifi.
| JasonFruit wrote:
| People are justifiably excited about these language tools,
| but we're getting tired of this answer because it's not a
| good answer: "Use GPT-4! Some of the answers aren't that
| great, but it's at least a starting point." That's like if
| you asked how to sort a list and remove duplicates, and the
| answer was to import it into a spreadsheet program and
| follow these steps instead of just "sort | uniq". It's
| suggesting a general-purpose tool to do a specific job
| maybe kinda acceptably instead of suggesting the right tool
| for the job.
|
| It reminds me of the microwave cookery books that came out
| after consumer microwaves became available: there are
| things a microwave is good at, but those books used it for
| _everything_ , just like we're using GPT-4 today. We'll
| calm down eventually.
| IanCal wrote:
| > but we're getting tired of this answer because it's not
| a good answer: "Use GPT-4! Some of the answers aren't
| that great, but it's at least a starting point."
|
        | Some of the answers are straight up usable; for others,
        | if you prefer, you can go from there, because _this is a
        | creative language task_.
|
| And there isn't really a specific tool for this, is
| there? It's nothing like your comparison to a very well
| specified problem. "Identify what this thing does, and
| come up with a title that also contains a word, and the
| word is related to the topic" is not the same as
| sort|uniq Vs a spreadsheet.
| adrianmonk wrote:
| There are two issues here.
|
| (1) The problem is not clearly defined. Does the word
| need to be thematically related to the topic? (As far as
| I can tell, "Donut" isn't thematically related to
| document understanding.) Maybe you could say it's a nice,
| optional bonus if it's related.
|
          | (2) The best solution would be good at _two_ things: (A)
          | satisfying constraints and (B) creativity. ChatGPT is
          | unlikely to be good at A, and a non-AI algorithm that
          | just finds valid words can't do B.
|
| Regarding #1, if people don't all have the same idea of
| the problem, they're not going to agree on the solution.
|
| Regarding #2, maybe a combined solution would be best.
| Generate all allowable words, then feed them to ChatGPT
| and have it say which ones are thematically good.
| IanCal wrote:
| Tbh I think gpt4 shines at this. People's requirements
| will be different, in weird ways. Duckdb things are all
| duck related. Rust is related to crabs. Your project may
| all be sweets related. It might be serious or fun. Maybe
| you want a name easy to draw.
|
| These are hard to encode.
|
| Instead I just asked "Make them more fun, and related to
| literary characters", then Muppets and awkward ones based
| on Harry potter which it described as "certainly a unique
| request". It's faster than getting a word list related to
| that. And they are frankly great - better than I'd come
| up with given much longer.
|
| https://chat.openai.com/share/d9a31442-96e5-4a18-92d9-d21
| 7c7...
|
| This problem is great for llms. It's language, works well
| with a back and forth discussing good and bad options,
| has no well defined output requirements but is easy to
| explain to a person, has a human in the loop _and_ has
| almost zero cost if it 's wrong.
|
| > QUIXOTE: Quality Unstructured Information Extraction
| and Organization Through End-to-end transformer
| yyyk wrote:
| >That's like if you asked how to sort a list and remove
| duplicates, and the answer was to import it into a
| spreadsheet program and follow these steps instead of
| just "sort | uniq".
|
| That's _exactly_ how many Windows office users do it,
| they paste a list into Excel and use it to remove
        | duplicates. There are alternatives even on Windows, but
        | it's much easier for them to use a single general-
        | purpose graphical tool (and let's not get started with
        | the abominations VLOOKUP is used for).
|
| I used to look down on that but then I realized that
| using a graphical program for list manipulation is kinda
| cool and that this program is rather capable, and could
| create combinations that are rather difficult to do with
| the more specialized tools. I still use these specialized
| tools (I'm used to them, and I can do some stuff they
| can't easily do in Excel).
| TeMPOraL wrote:
          | Yeah, I used to look down on that kind of thing too, and
          | a decade later, I find myself doing those very things.
|
| Yes, I know sort | uniq. I even have a couple Linux
| shells open on my Windows work system. But I can't for
| the life of me remember the magic flags, so I'll either
| paste the list to Emacs and M-x sort-lines + M-x delete-
| duplicate-lines, or paste it to Excel and do it there, or
| do something even more cheesy - whatever is least likely
| to break my flow.
|
| There are tools more or less optimized for any specific
| job, but the _best_ tool for the job is the one you have
| handy, and are experienced in using.
|
| I too increasingly often find myself using GPT-4 for
          | random, ad-hoc tasks. There may or may not be better
          | tools out there. I may even have some installed. But
          | none of them beats being able to just describe what you
          | want, paste some data, and get the results out a few
          | seconds later.
| thelastparadise wrote:
| > Is there an online tool/piece of code that can find
| words like this in a string?
|
| ...but he's using a language model for a language task.
| JasonFruit wrote:
| Not all language tasks, even, are going to be best
| handled by these models.
| heyoni wrote:
        | But if there isn't a better one, then this isn't the
        | microwave analogy you think it is.
| i2cmaster wrote:
| I've started using Microsoft's TROCR (another transformer OCR
| model) to read the cursive in my pocket journal (I have a habit
| of writing programs there first while I'm out and then typing
| them in manually, I just focus better that way.)
|
| It's surprisingly accurate although you have to write your own
| program to segment the image into lines. I think with some fine
| tuning I could have the machine read my notebook with minimal
| corrections.
| driscoll42 wrote:
| Have you looked into Craft or EAST for segmenting the image
| into lines? Those two work decently.
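For reasonably clean scans with horizontal text, a much cruder alternative to CRAFT or EAST is a horizontal ink-projection profile: mark each row that contains any ink, then take maximal runs of inked rows as line bands. A minimal pure-Python sketch over a binary image stored as nested lists (the function name and representation are mine):

```python
def segment_lines(binary_image):
    """Return (top, bottom) row ranges, one per text line, where
    `binary_image` is a list of pixel rows and 1 marks an ink pixel."""
    has_ink = [any(row) for row in binary_image]
    bands, start = [], None
    for y, ink in enumerate(has_ink):
        if ink and start is None:
            start = y                     # entering a text band
        elif not ink and start is not None:
            bands.append((start, y))      # leaving a text band
            start = None
    if start is not None:                 # ink runs to the bottom edge
        bands.append((start, len(has_ink)))
    return bands

# Two "lines" of ink separated by a blank gap:
page = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
```

Each band can then be cropped and fed to TrOCR one line at a time. This breaks down on skewed pages or lines whose ascenders/descenders touch, which is exactly where learned detectors like CRAFT and EAST earn their keep.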
| aosmith wrote:
| So is this why IA had an outage? Timing is perfect.
| dkatz23238 wrote:
| As a developer who has been building IDP solutions, I can assert
| that although this model is a lot larger (more weights) than a
| graph neural network over OCR tokens, the industry standard
| before transformers, it outperforms given enough data. Depending
| on how heterogeneous the data is, usually around 200 documents
| are enough to reach human-level accuracy, scoring by Levenshtein
| ratio.
|
| Smaller graph models could get away with using less data. The
| problem the "traditional" approach had is that the quality of
| the OCR was the bottleneck for overall model performance. It
| amazes me how this problem shifted from a node classification
| problem to an image-to-text problem.
|
| Training on CPU was possible with GCN but not with Donut.
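Levenshtein-ratio scoring as described above is plain dynamic programming; a sketch (the exact normalization is an assumption on my part -- conventions differ between libraries):

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions),
    computed with one rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                # delete from a
                curr[j - 1] + 1,            # insert into a
                prev[j - 1] + (ca != cb),   # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

def levenshtein_ratio(predicted: str, truth: str) -> float:
    """One common normalization: 1 - distance / longer length."""
    if not predicted and not truth:
        return 1.0
    return 1.0 - levenshtein(predicted, truth) / max(len(predicted), len(truth))
```

Comparing model output against ground-truth transcriptions this way gives a per-document score in [0, 1], which averages cleanly across a test set.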
| dowakin wrote:
| If you want to train Donut, check out this notebook on Kaggle.
| It trains Donut to read plots for a competition, and contains
| the full pipeline for fine-tuning:
| https://www.kaggle.com/code/nbroad/donut-train-benetech
| armchairhacker wrote:
| These OCR tools are bringing us closer to msPaint as a viable IDE
| siddiqi123 wrote:
| What are some available prompts? For example, I see there are
| "<s_synthdog>" and "<s_iitcdip>"; what are some other options? I
| tried made-up ones such as "<s_what>" and they seem to work too.
| Why? What is the meaning of prompts here? Thanks
| nestorD wrote:
| I will have to investigate this. I am dreaming of a system that
| can take a PDF scan of a book as input and produce one or more
| properly formatted (headings, italic, bold, underline, etc.)
| markdown files. In my tests, LLMs have proved very good at
| cleaning up raw OCR, but they need formatting information to get
| me all the way.
| jcuenod wrote:
| It's not ready to take a book, but I'm building an app that
| takes scans of book chapters/journal articles (which I often
| receive from my college library) and turns them into well
| formatted PDFs (with OCR, consistent margins, rotation...)
| https://fixpdfs.com
| nestorD wrote:
    | I don't think it supports a markdown / formatted text
    | export, but it looks fantastic as far as PDF cleanup goes (I
    | currently rely on Adobe Scan for that when I am working from
    | a paper copy). I will try it soon.
| jcuenod wrote:
| No, it doesn't support markdown and it doesn't do analysis
| of headers/page numbers. It's mainly aimed at making
| academic PDFs better for reading and annotating (especially
| on iPad-like devices). Hoping to start charging for it at
| some point, but I'm still trialing it...
|
      | Some users have expected it to "unwarp" bad scans; it
      | doesn't do that either, unfortunately. But that's a much
      | harder problem to solve...
| nestorD wrote:
| I found perspective transformation to be good enough for
| 90% of cases. Going further would be a lot of effort
| better spent elsewhere. Does it deal fine with
| illustrations and pdf compression?
___________________________________________________________________
(page generated 2023-05-29 23:00 UTC)