[HN Gopher] Ask HN: What is the best method for turning a scanne...
___________________________________________________________________
Ask HN: What is the best method for turning a scanned book as a PDF
into text?
I like reading philosophy, particularly from the original authors
rather than secondhand accounts. However, I often find that these
works come as scanned documents, Discourses on Livy and Politics
Among Nations for example. I would greatly benefit from turning
these into text. I can snipping-tool pages and put them into
ChatGPT, and it comes out perfect; classic OCR methods often screw
up words. My final goal is to turn these into audiobooks (or even
just to make them easier to copy-paste into my personal notes).
Given the state of AI, I'm wondering what my options are. I don't
mind paying.
Author : resource_waste
Score : 107 points
Date : 2025-02-14 14:32 UTC (2 days ago)
| ssuds wrote:
| Perhaps building an agent in something like Gumloop that loops
| page by page, does AI OCR, and then exports to a Google Doc?
| Should take like 10 minutes to set up.
| brudgers wrote:
| https://linux.die.net/man/1/pdftotext
|
| is the simplest thing that might work.
|
| It is free and mature.
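|
| A minimal sketch of scripting it from Python, assuming
| poppler-utils is installed and a hypothetical book.pdf:
|
|     import subprocess
|
|     # pdftotext reads the PDF's embedded text layer; scanned
|     # pages with no text layer come out empty.
|     subprocess.run(["pdftotext", "book.pdf", "book.txt"],
|                    check=True)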
| jbaiter wrote:
| That will not work for scanned PDFs without a text layer, and
| even if one is present, the extracted text isn't guaranteed to
| be accurate.
| lquist wrote:
| My understanding is that Gemini OCR is now considered state of
| the art and a material step forward in OCR accuracy
| Etheryte wrote:
| Is this from the article that was on the front page a few days
| ago? If so, it's not true. The title was intentionally
| misleading: they claimed to be the best, but if you read the
| article, they were only the best at one subproblem, not at OCR
| overall.
| kumarm wrote:
| We do this in our text-to-speech app (Read4Me):
| https://apps.apple.com/us/app/read4me-talk-browser-pdf-doc/i...
|
| You can scan a book and listen (and also copy and paste the
| extracted text into other apps).
|
| If you are looking to do this at scale in your own UI, I would
| recommend either of Google's solutions:
|
| 1. Google Cloud Vision API
| (https://cloud.google.com/vision?hl=en)
|
| 2. Using the Gemini API's OCR capabilities. (Start here:
| https://aistudio.google.com/prompts/new_chat)
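|
| A minimal Cloud Vision sketch in Python, assuming the
| google-cloud-vision package, configured credentials, and a
| hypothetical page.png:
|
|     from google.cloud import vision
|
|     client = vision.ImageAnnotatorClient()
|     with open("page.png", "rb") as f:
|         image = vision.Image(content=f.read())
|     # document_text_detection is tuned for dense text such
|     # as book pages
|     response = client.document_text_detection(image=image)
|     print(response.full_text_annotation.text)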
| constantinum wrote:
| Give LLMWhisperer a try. Here is a playground for testing
| https://pg.llmwhisperer.unstract.com/
| chpatrick wrote:
| I think you answered it yourself: stick it into a multimodal LLM.
| Larrikin wrote:
| Paperless uses the latest traditional method. There are LLM
| enhancements you can download
| ekianjo wrote:
| Docling
| maurycy wrote:
| I had very good experience with `gemini-2.0-flash-exp`:
|
| https://github.com/maurycy/gemini-json-ocr
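|
| Not the repo's code, just a minimal sketch of the same idea
| with the google-generativeai package (the page.png input is a
| hypothetical):
|
|     import google.generativeai as genai
|     from PIL import Image
|
|     genai.configure(api_key="YOUR_API_KEY")
|     model = genai.GenerativeModel("gemini-2.0-flash-exp")
|     # Send the page image together with a transcription prompt
|     resp = model.generate_content(
|         [Image.open("page.png"),
|          "Transcribe this page verbatim."])
|     print(resp.text)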
| staplung wrote:
| It's hard to know what to make of this: you've included the
| output JSON but not the input PDF, so there's no way to judge
| what it's actually doing.
| maurycy wrote:
| Give it a try on any PDF! This is just 100 LOC, easy to
| audit.
| fucalost wrote:
| Depending on the length of these texts -- and your technical
| ability -- you might want to check out AWS Textract.
|
| It would be easy to set up a pipeline like:
|
| > drop PDF in S3 bucket > EventBridge triggers Step Function >
| Step Function calls Textract > output saved to S3 & emailed to
| you
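|
| The Textract call at the heart of that pipeline might look
| like this (a sketch assuming boto3 and a hypothetical bucket;
| the EventBridge/Step Functions wiring is omitted):
|
|     import time
|     import boto3
|
|     textract = boto3.client("textract")
|     # Kick off async text detection on a PDF already in S3
|     job = textract.start_document_text_detection(
|         DocumentLocation={"S3Object": {"Bucket": "my-bucket",
|                                        "Name": "book.pdf"}})
|     # Poll until the job finishes (SNS is nicer in production)
|     while True:
|         result = textract.get_document_text_detection(
|             JobId=job["JobId"])
|         if result["JobStatus"] != "IN_PROGRESS":
|             break
|         time.sleep(5)
|     # First page of results only; follow NextToken for more
|     lines = [b["Text"] for b in result["Blocks"]
|              if b["BlockType"] == "LINE"]
|     print("\n".join(lines))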
| srameshc wrote:
| Google Cloud Document AI is amazing, I love it
| https://cloud.google.com/document-ai?hl=en
|
| It can correctly read many languages other than English, if
| that is something you need. I previously tried others and there
| were many errors in conversion; this one does it well.
| arctangos wrote:
| I'm curious about this api. I'd looked at it before but it
| didn't seem like it could handle arbitrary input that didn't
| fit one of the predefined schemas. I also wasn't sure how much
| training data it needed. What has your experience been like?
| pronoiac wrote:
| I made a high-quality scan of PAIP (Paradigms of Artificial
| Intelligence Programming), and worked on OCR'ing and
| incorporating that into an admittedly imperfect git repo of
| Markdown files. I used Scantailor to deskew and do other
| adjustments before applying Tesseract, via OCRmyPDF. I wrote
| notes for some of my process over at
| https://github.com/norvig/paip-lisp/releases/tag/v1.2 .
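|
| For reference, that OCRmyPDF step also has a Python API; a
| minimal sketch, assuming ocrmypdf and Tesseract are installed:
|
|     import ocrmypdf
|
|     # Deskew and add a searchable Tesseract text layer
|     ocrmypdf.ocr("scan.pdf", "scan-ocr.pdf", deskew=True)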
|
| I'd also tried ocrit, which uses Apple's Vision framework for
| OCR, with some success - https://github.com/insidegui/ocrit
|
| It's an ongoing, iterative process. I'll watch this thread with
| interest.
|
| Some recent threads that might be helpful:
|
| * https://news.ycombinator.com/item?id=42443022 - Show HN:
| Adventures in OCR
|
| * https://news.ycombinator.com/item?id=43045801 - Benchmarking
| vision-language models on OCR in dynamic video environments -
| driscoll42 posted some stats from research
|
| * https://news.ycombinator.com/item?id=43043671 - OCR4all
|
| (Meaning, I have these browser tabs open, I haven't fully
| digested them yet)
| kingkongjaffa wrote:
| Was technology the right approach here? Is it essentially done
| now? I couldn't tell whether it was completed entirely.
|
| I can't help but think a few amateur humans could have read the
| PDF with their eyes and written the Markdown by hand if the OCR
| was a little sketchy.
| pronoiac wrote:
| It's still in progress! It's looong - about a thousand pages.
| There's an ebook, but the printed book got more editing.
| lherron wrote:
| Also this:
|
| https://news.ycombinator.com/item?id=42952605 - Ingesting PDFs
| and why Gemini 2.0 changes everything
| tyronehed wrote:
| For years I have been printing PDFs off on regular paper and
| then binding them into books.
|
| 1. Print it at work when no one is looking.
|
| 2. Get two rigid boards and squeeze the stack of paper
| together. I customarily use two wooden armrests that originally
| came from a garden-furniture lounger.
|
| 3. Squeeze the paper with just a 1/4-inch showing.
|
| 4. Use wood glue and, with your finger working like a
| toothbrush, work the glue into the pages at the gluing end.
|
| 5. Get a 14-inch x 4-inch strip of canvas. I use cutoff
| painter's canvas.
|
| 6. Hang all this by the boards and put glue also on top of the
| canvas strip.
|
| 7. When it dries, remove the boards and glue down the sides.
|
| You have a strong, bound book out of those printed pages.
| thinkmassive wrote:
| It's unclear how this is related to the article, but I'm
| intrigued by your simple DIY bookbinding process.
|
| It seems straightforward except for the canvas strip (I assume
| this is part of the binding?), and whether you add thicker
| pages/boards on each side as covers.
|
| Do you have any photos of the process, or at least of a
| finished product? Thanks!
| therealmarv wrote:
| Docling is great for PDFs https://github.com/DS4SD/docling but
| if the input is really only images (in a PDF) then cloud-based
| AI solutions (like the latest models from Google) may be
| better.
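|
| A minimal Docling sketch, assuming the docling package and a
| hypothetical book.pdf:
|
|     from docling.document_converter import DocumentConverter
|
|     converter = DocumentConverter()
|     result = converter.convert("book.pdf")
|     # Export the parsed document as Markdown
|     print(result.document.export_to_markdown())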
| sciencesama wrote:
| As of now Google Gemini
| knicholes wrote:
| I'm biased as an employee, but who knows PDFs better than Adobe?
| Use their PDF text extraction API.
| aragonite wrote:
| I did this very recently for a 19th-century book in German,
| with occasionally some Greek. The method that produces the
| highest level of accuracy I've found is to use ImageMagick to
| extract each page as an image, then send each image file to
| Claude Sonnet (encoded as base64) with a simple user prompt
| like "Transcribe the complete text from this image verbatim
| with no additional commentary or explanations". The whole thing
| completes in under an hour, and the result is near perfect and
| certainly much better than from standard OCR software.
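|
| A minimal sketch of that loop, assuming the anthropic package,
| pages already exported as page-*.png (e.g. with ImageMagick:
| magick -density 300 book.pdf page-%04d.png), and a current
| Sonnet model alias:
|
|     import base64
|     import pathlib
|     import anthropic
|
|     client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY
|     for png in sorted(pathlib.Path(".").glob("page-*.png")):
|         data = base64.b64encode(png.read_bytes()).decode()
|         msg = client.messages.create(
|             model="claude-3-5-sonnet-latest",
|             max_tokens=4096,
|             messages=[{"role": "user", "content": [
|                 {"type": "image",
|                  "source": {"type": "base64",
|                             "media_type": "image/png",
|                             "data": data}},
|                 {"type": "text",
|                  "text": "Transcribe the complete text from"
|                          " this image verbatim with no"
|                          " additional commentary or"
|                          " explanations."},
|             ]}])
|         print(msg.content[0].text)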
| HarHarVeryFunny wrote:
| Is it really necessary to split it into pages? Not so bad if
| you automate it I suppose, but aren't there models that will
| accept a large PDF directly (I know Sonnet has a 32MB limit)?
| 7thpower wrote:
| They are limited in how much they can output, and there is
| generally an inverse relationship between the number of tokens
| you send and the output quality after the first 20-30 thousand
| tokens.
| siva7 wrote:
| They all accept large PDFs (or any kind of input) but the
| quality of the output will suffer for various reasons.
| ant6n wrote:
| I recently did some OCRing with OpenAI. I found o3-mini-hi to
| be imagining and changing text, whereas the older (?) o4 was
| more accurate. It's a bit worrying that some of the models
| screw around with the text.
| jazzyjackson wrote:
| There's GPT-4, then GPT-4o (o for "omni", as in multimodal),
| then o1 (chain of thought / internal reasoning), then o3
| (because o2 is a stadium in London that, I guess, is very
| litigious about its trademark?). o3-mini is the latest, but
| yes, optimized to be faster and cheaper.
| polshaw wrote:
| o2 is the UK's largest mobile network operator. They bought
| naming rights to what was known as the Millennium Dome (not
| even a stadium).
| jazzyjackson wrote:
| Ahh makes sense :)
| dotancohen wrote:
| What is the o3 model good for? Is it just an evolution of
| o1 (chain of thought / internal reasoning)?
| Oras wrote:
| Quick and easy: Gemini Flash 2
|
| More of a system: AWS Textract or Azure Document Intelligence.
| This option requires some coding and the cost is higher than
| using a vision model.
| __rito__ wrote:
| I have tried a bunch of things. This is what worked best for
| me: Surya [0]. It can run fully locally on your laptop. I also
| tried EasyOCR [1], which is also quite good (a minimal sketch
| follows the links below). I haven't tried it myself, but I
| would look at Paddle [2] if the previous two don't float your
| boat.
|
| All of these are OSS, and you don't need to pay a dime to anyone.
|
| [0]: https://github.com/VikParuchuri/surya
|
| [1]: https://github.com/JaidedAI/EasyOCR
|
| [2]: https://github.com/PaddlePaddle/Paddle
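|
| A minimal EasyOCR sketch, assuming a hypothetical page.png:
|
|     import easyocr
|
|     # Downloads detection/recognition models on first run
|     reader = easyocr.Reader(["en"])
|     # detail=0 returns just the recognized strings
|     for line in reader.readtext("page.png", detail=0):
|         print(line)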
| carlosjobim wrote:
| I would like to pay a dime and more for any of these solutions
| discussed in the thread as a normal macOS program with a
| graphical user interface.
| RobGR wrote:
| There was a thread here recently about OCR4All (I haven't used
| any of these tools recently, but I'm keeping track because I
| might be doing so soon).
|
| https://news.ycombinator.com/item?id=43043671
|
| https://www.ocr4all.org/
| quuxplusone wrote:
| Copyright issues aside (e.g. if your thing is public domain), the
| galaxy-brain approach is to upload your raw scanned PDF to the
| Internet Archive (archive.org), fill in the appropriate metadata,
| wait about 24 hours for their post-upload format-conversion tasks
| to run automatically, and then download the size-optimized and
| OCR-ized PDF from them.
|
| I've done this with a few documents from the French and Spanish
| national archives, which were originally provided as enormous
| non-OCRed PDFs but shrank to 10% the size (or less) after passage
| through archive.org and incidentally became full-text-searchable.
| huijzer wrote:
| Last time I checked, a few months ago, LLMs were more accurate
| than the OCR the archive is using. The web archive version
| is/was not using context to figure out that, for example, "in
| the garden was a trge" should be "in the garden was a tree".
| LLMs, depending on the prompt, do this.
| ritvikpandey21 wrote:
| Hey, I recommend checking out the previous HN thread [1] on why
| LLMs shouldn't be used in production-grade OCR, especially if
| accuracy is super important (as in the audiobook case).
|
| We wrote the post and created Pulse [2] for these exact use
| cases; feel free to reach out for more info!
|
| [1]: https://news.ycombinator.com/item?id=42966958
|
| [2]: https://runpulse.com
| dr_dshiv wrote:
| We are working on a project to put the original language on the
| left and a translation on the facing page. Instead of perfect
| translations or OCR, we try to report error rates via random
| sampling. We plan to make the texts editable on a Wikimedia
| server. Curious if you know of similar efforts!
| 7thpower wrote:
| I have not seen this answer so I'll chime in:
|
| There is a lot of enthusiasm around language models for OCR,
| and I have found that they generally work well. However, I have
| had much better results, especially if there are tables etc.,
| by sending the raw page image to the LLM along with the OCR'd
| page and asking it to transcribe from the image while
| validating words and character sequences against the OCR
| output.
|
| This largely prevents numbers and other details from being
| jumbled or hallucinated.
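|
| A minimal sketch of that cross-check with the openai package
| (model choice, file names, and prompt wording are all
| assumptions):
|
|     import base64
|     from openai import OpenAI
|
|     client = OpenAI()
|     with open("page.png", "rb") as f:
|         b64 = base64.b64encode(f.read()).decode()
|     ocr_text = open("page.txt").read()  # classic OCR output
|     resp = client.chat.completions.create(
|         model="gpt-4o",
|         messages=[{"role": "user", "content": [
|             {"type": "text",
|              "text": "Transcribe this page from the image,"
|                      " validating words and numbers against"
|                      " this OCR output:\n" + ocr_text},
|             {"type": "image_url",
|              "image_url": {"url":
|                  "data:image/png;base64," + b64}},
|         ]}])
|     print(resp.choices[0].message.content)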
|
| I recently tested llamaparse after trying it a year prior and was
| very impressed. You may be able to do your project on the free
| tier, and it will do a lot of this for you.
| knallfrosch wrote:
| I had great results with "Azure Document Intelligence Studio",
| followed by OpenAI's LLM. But this was half a year back and I
| wanted it to work via API.
| mindcrime wrote:
| Two possibilities are "top of mind" for me:
|
| You could script it using Gemini via the API[1].
|
| Or use Tesseract[2].
|
| [1]: https://ai.google.dev/
|
| [2]: https://github.com/tesseract-ocr/tesseract
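|
| For the Tesseract route, a minimal sketch via pytesseract
| (assuming Tesseract itself is installed, plus a hypothetical
| page.png):
|
|     import pytesseract
|     from PIL import Image
|
|     # OCR a single scanned page image to plain text
|     print(pytesseract.image_to_string(Image.open("page.png")))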
| xyst wrote:
| Seeing blind recommendations for AI slop is very disappointing
| for HN.
|
| For OP, there is a library written in Rust that can do exactly
| what you need, with very high accuracy and performance [1].
|
| You would need the OCR dependencies to get it to work on
| scanned books [2].
|
| [1] https://github.com/yobix-ai/extractous
|
| [2] https://github.com/yobix-ai/extractous?tab=readme-ov-
| file#-s...
| cess11 wrote:
| That looks rather nice, actually. Thanks.
|
| I especially like the approach to graalify Tika.
| geebee wrote:
| I've found there's a big difference in OCR accuracy when it
| comes to handwriting. For printed text I've used Tesseract, but
| it seems to miss a lot for handwriting. In my experience,
| Google Cloud Vision is far more accurate at transcribing
| handwriting. I haven't tried other cloud-based tools, so I
| couldn't tell you if it's the best, but overall the cloud-based
| ones seem much better at handwriting or oddly formed text,
| while for basic typeset printed text, open source apps like
| Tesseract do well.
| netfortius wrote:
| ocrmypdf with some specific options, depending on the source
| (language, force OCR, etc.), has worked for me most of the
| time. The biggest issue I have not yet found a solution for is
| proper conversion of PDF to EPUB. I read a lot on my phone, and
| the inflexibility of the PDF format, with the ugliness of
| "reflow" as the only apparent way to give reading the look of a
| true EPUB on phones, is frustrating.
| briga wrote:
| I was actually just working on a project like this, digitizing
| an old manuscript. I used a PDF scanning app (there are plenty;
| I used NAPS2, simple but it works) and then piped the images
| into tesseract-ocr. This extracts the text from the images but
| won't deal with formatting or obvious typos. For that, you'll
| want to feed the text into an LLM with a prompt telling the
| model to correct errors, fix formatting, and return clean text.
| Smaller local models (<70B parameters) do not work very well on
| this task for big documents, but I found ChatGPT's reasoning
| model does a fine job. My goal is to find a model that can run
| locally with similar performance.
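|
| The cleanup step might look like this (a sketch with the
| openai package; model and prompt wording are assumptions):
|
|     from openai import OpenAI
|
|     client = OpenAI()
|     raw = open("chapter.txt").read()  # tesseract output
|     resp = client.chat.completions.create(
|         model="o1",
|         messages=[{"role": "user",
|                    "content": "Correct OCR errors and fix the"
|                               " formatting of this text;"
|                               " return only the cleaned"
|                               " text:\n\n" + raw}])
|     print(resp.choices[0].message.content)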
| arctangos wrote:
| It's surprising to me that no one has mentioned llamaparse. My
| team has been using it for a while and is quite satisfied. If
| other people think that other services are better then I'd be
| interested in hearing why.
| brighac wrote:
| The best pdf tool ever. Ilovepdf.com
| Electricniko wrote:
| For classic books like those you mentioned, Project Gutenberg has
| text versions along with pdfs/epubs/etc.
|
| For instance, Discourses on Livy:
|
| https://www.gutenberg.org/cache/epub/10827/pg10827-images.ht...
|
| https://www.gutenberg.org/ebooks/10827
| tyrust wrote:
| Even better is when Standard Ebooks publishes a version:
| https://standardebooks.org/ebooks/niccolo-machiavelli/discou...
| jmrm wrote:
| Some time ago I was toying around with a library called MuPDF
| (https://www.mupdf.com/) for something related, and with that
| library and a small Python script you can programmatically OCR
| any book you want.
|
| The library is free for personal or open-source projects, but
| paid for commercial ones.
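|
| A minimal PyMuPDF sketch; rendering pages is the core part,
| and the direct-OCR call assumes Tesseract is installed, since
| PyMuPDF's OCR support shells out to it:
|
|     import fitz  # PyMuPDF
|
|     doc = fitz.open("book.pdf")
|     for page in doc:
|         # Render the scanned page at 300 dpi for external OCR
|         pix = page.get_pixmap(dpi=300)
|         pix.save(f"page-{page.number:04}.png")
|         # ...or OCR it in place via Tesseract
|         tp = page.get_textpage_ocr(dpi=300, full=True)
|         print(page.get_text(textpage=tp))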
| jonnycoder wrote:
| I recently used AWS Textract and had good results. There are
| accuracy benchmarks out there (I wish I had saved the links),
| but I recall Gemini 2.0 and Textract being toward the top in
| terms of accuracy. I also read that an LLM can extrapolate or
| conjure up cropped text, so my idea would be to combine
| traditional OCR with an LLM to flag conflicts.
| eigenvalue wrote:
| I made a site recently that works pretty well for this on a lot
| of sample scanned PDFs I tried; you might get good results:
|
| https://fixmydocuments.com/
|
| I also made a simple iOS app that basically just uses the
| built-in OCR functionality on iPhones and automatically applies
| it to all the pages of a PDF. It won't preserve formatting, but
| it's quite accurate in terms of OCR:
|
| https://apps.apple.com/us/app/super-pdf-ocr/id6479674248
___________________________________________________________________
(page generated 2025-02-16 23:00 UTC)