[HN Gopher] Running OCR against PDFs and images directly in the ...
       ___________________________________________________________________
        
       Running OCR against PDFs and images directly in the browser
        
       Author : simonw
       Score  : 125 points
       Date   : 2024-03-30 18:33 UTC (4 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | kgbcia wrote:
       | I was thinking of doing something like this for visually impaired
       | users. The next step is to pipe it into the JavaScript web speech
       | synthesis API.
       | 
       | https://mdn.github.io/dom-examples/web-speech-api/speak-easy...
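A minimal sketch of the "pipe it into speech synthesis" step described above, assuming the OCR text is already available as a string. `splitIntoSentences` and `speakOcrText` are hypothetical helper names, not part of any library; only `speechSynthesis.speak()` and `SpeechSynthesisUtterance` come from the Web Speech API. Splitting into sentences is a design choice made here on the assumption that shorter utterances are easier to pause or cancel; it is not something the API requires.

```javascript
// Split OCR output into sentence-sized chunks.
// Whitespace (including line breaks from OCR) is collapsed first.
function splitIntoSentences(text) {
  const normalized = text.replace(/\s+/g, " ").trim();
  if (normalized === "") return [];
  const matches = normalized.match(/[^.!?]+[.!?]*\s*/g) || [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Queue each sentence on the Web Speech API.
// `synth` and `Utterance` default to the browser globals; they are
// parameters only so the function can be exercised outside a browser.
function speakOcrText(
  text,
  synth = globalThis.speechSynthesis,
  Utterance = globalThis.SpeechSynthesisUtterance
) {
  if (!synth || !Utterance) return 0; // speech synthesis not available
  const sentences = splitIntoSentences(text);
  for (const sentence of sentences) {
    synth.speak(new Utterance(sentence));
  }
  return sentences.length; // number of utterances queued
}
```

In a browser, `speakOcrText(text)` with the string returned by the OCR step would queue the whole page for reading.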
        
         | mwcampbell wrote:
         | We already have our own tools for that, either integrated into
         | screen readers or available as add-ons. Thanks for the thought,
         | though.
        
       | reliablereason wrote:
       | Safari already does that. Quite a useful feature.
        
         | minimaxir wrote:
         | Specifically, only Apple Silicon allows automatic OCR. Works on
         | iOS too.
        
           | pvg wrote:
           | It works on Intel Safari as well.
        
             | minimaxir wrote:
             | It doesn't work on my Intel Macs unless it's really slow.
        
       | aabhay wrote:
       | I was really impressed until I realized that the app is basically
       | a wrapper around tesseract.js, which is the actually cool part.
       | Tesseract has a wasm port that can operate inside of a webworker.
       | 
       | Not saying that the article was being misleading about this, just
       | saying that the LLM part is basically doing some standard
       | interfacing and HTML/CSS/JS around that core engine, which wasn't
       | immediately obvious to me when scanning the screenshots.
        
         | simonw wrote:
         | The LLM part is almost irrelevant to the final result to be
         | honest: I used LLMs to help me build an initial prototype in
         | five minutes that would otherwise have taken me about an hour,
         | but the code really isn't very complex.
         | 
         | The point here is more about highlighting that browsers can do
         | this stuff, and it doesn't take much to wire it all together
         | into a useful interface.
        
       | codazoda wrote:
       | This is timely. I just completed a few experiments and wrote a
       | little about doing OCR on my handwritten notes.
       | 
       | https://notes.joeldare.com/handwritten-text-recognition
       | 
       | Tesseract was one of the tools I tested, although I used the CLI
       | instead of the WASM version.
        
       | ignoramous wrote:
       | Wow, this is _promising_. I tried it on a few poorly scanned
       | papers I have lying about. A few observations:
       | 
       | 1. Pre-process PDF images to detect letters better?
       | 
       | 2. Use LLMs to spell/grammar check and perhaps even auto-complete
       | missing pieces?
       | 
       | 3. Employ rich text to capture style (ex: lexical.dev)?
       | 
       | Unsure if it is feasible to bundle it all up for web.
       | 
       | See also: https://github.com/RajSolai/TextSnatcher /
       | https://github.com/VikParuchuri/surya
        
         | simonw wrote:
         | I've been trying out alternative versions of this that pass
         | images through to e.g. the Claude 3 vision models, but they're
         | harder to share with people because they need an API key!
        
           | euazOn wrote:
            | In case you wanted to add a pre-processing step, I found this
            | ImageMagick script useful:
            | https://www.fmwconcepts.com/imagemagick/textcleaner/index.ph...
           | 
           | Not sure how difficult it is to run it in the browser,
           | though.
        
             | CharlesW wrote:
             | FYI, cert is expired.
        
       | CaffeinatedDev wrote:
       | This is cool! I've also used tesseract OCR and found it to be
       | pretty amazing in terms of speed and accuracy.
       | 
       | I use it to ingest image and PDF files for my own website
       | chatting tool: tinydesk.ai!
       | 
       | I run the backend on an Express.js server, so it's all JS as well.
       | 
       | Smaller docs I process on the client side, but larger ones
       | (>1.5 MB) I've found take forever, so those are processed in the
       | backend.
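The client/backend split described above could be sketched as a simple size check. The function name, the return values, and the reading of "1.5mb" as 1.5 * 1024 * 1024 bytes are all assumptions for illustration; the comment does not say how the threshold is actually implemented.

```javascript
// Assumed threshold: "1.5mb" interpreted as 1.5 MiB.
const CLIENT_SIZE_LIMIT_BYTES = 1.5 * 1024 * 1024;

// Decide where a document should be OCR'd based on its size in bytes.
// Returns "client" for small files and "server" for anything larger
// than the limit.
function chooseOcrTarget(sizeBytes, limitBytes = CLIENT_SIZE_LIMIT_BYTES) {
  return sizeBytes > limitBytes ? "server" : "client";
}
```

In a browser the size would typically come from `file.size` on a `File` object picked by the user, e.g. `chooseOcrTarget(file.size)`.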
        
       | fbdab103 wrote:
       | The example on the Tesseract.js page shows it highlighting the
       | rectangles of where the selected text originated. Does this level
       | of information get surfaced through the library for consumption?
       | 
       | I just grabbed a two-column academic PDF, which performed as well
       | as you would expect. If I was returned a json list of text +
       | coordinates, I could do some dirty munging (eg footer is anything
       | below this y index, column 1 is between these x ranges, column 2
       | is between these other x ranges) to self-assemble it a bit
       | better.
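The "dirty munging" described above might look roughly like this, assuming the OCR step yields a list of text lines with bounding boxes. The input shape `{ text, bbox: { x0, y0, x1, y1 } }`, the function name, and the option names are assumptions for this sketch; check what your OCR library actually returns. Line-level boxes are used rather than word-level ones so that a plain top-to-bottom sort is good enough.

```javascript
// Reassemble multi-column text from OCR lines with bounding boxes.
// - footerY: anything whose box extends below this y value is dropped.
// - columns: array of [minX, maxX] ranges, in reading order.
function assembleColumns(lines, { footerY, columns }) {
  const topToBottom = (a, b) =>
    a.bbox.y0 - b.bbox.y0 || a.bbox.x0 - b.bbox.x0;

  return columns
    .map(([minX, maxX]) =>
      lines
        .filter((l) => l.bbox.y1 <= footerY) // drop footer content
        .filter((l) => l.bbox.x0 >= minX && l.bbox.x0 < maxX) // this column
        .sort(topToBottom)
        .map((l) => l.text)
        .join(" ")
    )
    .join("\n\n"); // blank line between columns
}

// Example with made-up coordinates for a two-column page.
const exampleLines = [
  { text: "right top", bbox: { x0: 320, y0: 10, x1: 600, y1: 30 } },
  { text: "left top", bbox: { x0: 20, y0: 10, x1: 300, y1: 30 } },
  { text: "Page 3", bbox: { x0: 280, y0: 900, x1: 340, y1: 920 } },
  { text: "left bottom", bbox: { x0: 20, y0: 40, x1: 300, y1: 60 } },
];
console.log(
  assembleColumns(exampleLines, {
    footerY: 880,
    columns: [[0, 310], [310, 620]],
  })
);
```

The x ranges and footer position have to be picked per document layout, which is exactly the manual part of the approach.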
        
         | simonw wrote:
         | Yes it does, but I've not dug into the more sophisticated parts
         | of the API at all yet. I'm using it in the most basic way
          | possible right now:
          | 
          |     const {data: {text}} = await worker.recognize(imageUrl);
        
       | Oras wrote:
       | This is nice but Tesseract does not perform well when it comes to
       | tables, at least when I tried it on multiple documents.
       | 
       | It would miss some cells from a table, or fail to recognise all
       | the numbers when they contain commas.
        
         | simonw wrote:
         | Tables are still the big unsolved problem for me.
         | 
         | There are a ton of potential tools out there like Tabula and
         | AWS Textract table mode but none of them have felt like the
         | perfect solution.
         | 
         | I've been trying Gemini Pro 1.5 and Claude 3 Opus and they
         | looked like they worked... but in both cases I spotted them
          | getting confused and copying in numbers from the wrong rows.
         | 
         | I think the best I've tried is the camera import mode in iOS
         | Excel! Just wish there was an API for calling that one
         | programmatically.
        
       ___________________________________________________________________
       (page generated 2024-03-30 23:00 UTC)