Running OCR against PDFs and images directly in the browser
Author : simonw
Score : 125 points
Date : 2024-03-30 18:33 UTC (4 hours ago)
(HTM) web link (simonwillison.net)
(TXT) w3m dump (simonwillison.net)
| kgbcia wrote:
| I was thinking of doing something like this for visually impaired
| users. The next step is to pipe it into the JavaScript web speech
| synthesis API.
|
| https://mdn.github.io/dom-examples/web-speech-api/speak-easy...
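|
| A rough sketch of that piping step, not the commenter's code. It
| assumes the OCR text is already in a string; chunkText, speak, and
| the 200-character limit are hypothetical names and values picked
| for illustration. Only speechSynthesis.speak and
| SpeechSynthesisUtterance come from the Web Speech API.

```javascript
// Split text into sentence-sized chunks so each utterance stays short.
// maxLen is an illustrative limit, not a value from any specification.
// A single sentence longer than maxLen is kept whole.
function chunkText(text, maxLen = 200) {
  const normalized = text.replace(/\s+/g, " ").trim();
  const sentences = normalized.match(/[^.!?]+[.!?]*\s*/g) || [];
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    // Start a new chunk when adding this sentence would exceed the limit.
    if (current && (current + sentence).length > maxLen) {
      chunks.push(current.trim());
      current = "";
    }
    current += sentence;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}

// Queue each chunk with the Web Speech API. Returns the number of
// utterances queued, or 0 where the API is unavailable (e.g. in Node).
function speak(text) {
  if (typeof speechSynthesis === "undefined") return 0;
  const chunks = chunkText(text);
  for (const chunk of chunks) {
    speechSynthesis.speak(new SpeechSynthesisUtterance(chunk));
  }
  return chunks.length;
}
```

| In a page you would call speak(text) with the string returned by
| the OCR step. Chunking by sentence is a design choice here, meant
| to keep individual utterances short.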
| mwcampbell wrote:
| We already have our own tools for that, either integrated into
| screen readers or available as add-ons. Thanks for the thought,
| though.
| reliablereason wrote:
| Safari already does that. Quite a useful feature.
| minimaxir wrote:
| Specifically, only Apple Silicon allows automatic OCR. Works on
| iOS too.
| pvg wrote:
| It works on Intel Safari as well.
| minimaxir wrote:
| It doesn't work on my Intel Macs unless it's really slow.
| aabhay wrote:
| I was really impressed until I realized that the app is basically
| a wrapper around tesseract.js, which is the actually cool part.
| Tesseract has a wasm port that can operate inside of a webworker.
|
| Not saying that the article was being misleading about this, just
| saying that the LLM part is basically doing some standard
| interfacing and HTML/CSS/JS around that core engine, which wasn't
| immediately obvious to me when scanning the screenshots.
| simonw wrote:
| The LLM part is almost irrelevant to the final result to be
| honest: I used LLMs to help me build an initial prototype in
| five minutes that would otherwise have taken me about an hour,
| but the code really isn't very complex.
|
| The point here is more about highlighting that browsers can do
| this stuff, and it doesn't take much to wire it all together
| into a useful interface.
| codazoda wrote:
| This is timely. I just completed a few experiments and wrote a
| little about doing OCR on my handwritten notes.
|
| https://notes.joeldare.com/handwritten-text-recognition
|
| Tesseract was one of the tools I tested, although I used the CLI
| instead of the WASM version.
| ignoramous wrote:
| Wow, this is _promising_. I tried it on a few poorly scanned papers
| I have lying about. A few observations:
|
| 1. Pre-process PDF images to detect letters better?
|
| 2. Use LLMs to spell/grammar check and perhaps even auto-complete
| missing pieces?
|
| 3. Employ rich text to capture style (ex: lexical.dev)?
|
| Unsure if it is feasible to bundle it all up for web.
|
| See also: https://github.com/RajSolai/TextSnatcher /
| https://github.com/VikParuchuri/surya
| simonw wrote:
| I've been trying out alternative versions of this that pass
| images through to e.g. the Claude 3 vision models, but they're
| harder to share with people because they need an API key!
| euazOn wrote:
| In case you wanted to add a pre-processing step, I found this
| ImageMagick script useful:
| https://www.fmwconcepts.com/imagemagick/textcleaner/index.ph...
|
| Not sure how difficult it is to run it in the browser,
| though.
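|
| For a sense of what an in-browser pre-processing step could
| look like, here is a much simpler stand-in: grayscale plus a
| fixed threshold over RGBA pixels. This is not the textcleaner
| script and does far less; binarize and the default threshold
| of 128 are assumptions made for illustration.

```javascript
// Convert RGBA pixels (the layout used by canvas ImageData.data) to pure
// black and white using a fixed luminance threshold. Returns a new array
// and leaves the input untouched.
function binarize(rgba, threshold = 128) {
  const out = new Uint8ClampedArray(rgba.length);
  for (let i = 0; i < rgba.length; i += 4) {
    // Rec. 601 luma weights for red, green, and blue.
    const luma = 0.299 * rgba[i] + 0.587 * rgba[i + 1] + 0.114 * rgba[i + 2];
    const value = luma >= threshold ? 255 : 0;
    out[i] = value;
    out[i + 1] = value;
    out[i + 2] = value;
    out[i + 3] = rgba[i + 3]; // keep the original alpha
  }
  return out;
}
```

| In a browser the input could come from a canvas context's
| getImageData(), and the result could be wrapped in a new
| ImageData and drawn back with putImageData() before OCR.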
| CharlesW wrote:
| FYI, cert is expired.
| CaffeinatedDev wrote:
| This is cool! I've also used tesseract OCR and found it to be
| pretty amazing in terms of speed and accuracy.
|
| I use it to ingest image and PDF files for my own website chat
| tool: tinydesk.ai!
|
| I run the backend on an Express.js server, so it's all JS as well.
|
| I process smaller docs on the client side, but larger ones
| (>1.5 MB) take forever there, so those get processed in the backend.
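|
| That size-based split could be sketched as below. The 1.5 MB
| cutoff is the figure from the comment; chooseOcrTarget and the
| constant name are hypothetical, and treating 1 MB as 1024 * 1024
| bytes is an assumption.

```javascript
// Cutoff above which OCR is handed to the backend instead of the browser.
const CLIENT_SIDE_LIMIT_BYTES = 1.5 * 1024 * 1024;

// Decide where a document should be processed based on its size in bytes.
function chooseOcrTarget(sizeBytes, limit = CLIENT_SIDE_LIMIT_BYTES) {
  return sizeBytes > limit ? "server" : "client";
}
```

| In the browser, sizeBytes could be the size property of a File
| taken from a file input or a drop event.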
| fbdab103 wrote:
| The example on the Tesseract.js page shows it highlighting the
| rectangles of where the selected text originated. Does this level
| of information get surfaced through the library for consumption?
|
| I just grabbed a two-column academic PDF, which performed as well
| as you would expect. If I were returned a JSON list of text +
| coordinates, I could do some dirty munging (e.g., footer is anything
| below this y index, column 1 is between these x ranges, column 2
| is between these other x ranges) to self-assemble it a bit
| better.
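|
| The munging described above could be sketched like this. It assumes
| a list of word objects shaped like {text, bbox: {x0, y0, x1, y1}}
| in pixel coordinates, which is my understanding of how Tesseract.js
| reports word boxes, so check the version you use. assembleColumns,
| readColumn, the layout object, and the 10-pixel line tolerance are
| all hypothetical.

```javascript
// words: [{ text, bbox: { x0, y0, x1, y1 } }]
// layout: { footerY, columns: [[minX, maxX], ...] }, values picked by hand
// for each document. Returns the column texts joined by a blank line.
function assembleColumns(words, layout) {
  const columns = layout.columns.map(() => []);
  for (const word of words) {
    const { x0, y0, x1 } = word.bbox;
    if (y0 >= layout.footerY) continue; // anything at or below footerY is footer
    const centerX = (x0 + x1) / 2;
    const index = layout.columns.findIndex(
      ([minX, maxX]) => centerX >= minX && centerX < maxX
    );
    if (index !== -1) columns[index].push(word);
  }
  return columns.map((column) => readColumn(column)).join("\n\n");
}

// Group one column's words into lines by their top edge, then read each
// line left to right.
function readColumn(words, lineTolerance = 10) {
  const sorted = [...words].sort(
    (a, b) => a.bbox.y0 - b.bbox.y0 || a.bbox.x0 - b.bbox.x0
  );
  const lines = [];
  for (const word of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(word.bbox.y0 - last.y) <= lineTolerance) {
      last.words.push(word);
    } else {
      lines.push({ y: word.bbox.y0, words: [word] });
    }
  }
  return lines
    .map((line) =>
      line.words
        .sort((a, b) => a.bbox.x0 - b.bbox.x0)
        .map((w) => w.text)
        .join(" ")
    )
    .join("\n");
}
```

| Words whose horizontal center falls outside every x range are
| dropped, which is a deliberate simplification.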
| simonw wrote:
| Yes it does, but I've not dug into the more sophisticated parts
| of the API at all yet. I'm using it in the most basic way
| possible right now:
|
|     const {data: {text}} = await worker.recognize(imageUrl);
| Oras wrote:
| This is nice but Tesseract does not perform well when it comes to
| tables, at least when I tried it on multiple documents.
|
| It would miss some cells from a table, or fail to recognise all
| the numbers when they contain commas.
| simonw wrote:
| Tables are still the big unsolved problem for me.
|
| There are a ton of potential tools out there like Tabula and
| AWS Textract table mode but none of them have felt like the
| perfect solution.
|
| I've been trying Gemini Pro 1.5 and Claude 3 Opus and they
| looked like they worked... but in both cases I spotted them
| getting confused and copying in numbers from the wrong rows.
|
| I think the best I've tried is the camera import mode in iOS
| Excel! Just wish there was an API for calling that one
| programmatically.
___________________________________________________________________
(page generated 2024-03-30 23:00 UTC)