[HN Gopher] Tesseract.js - A Javascript port of the Tesseract OC...
       ___________________________________________________________________
        
       Tesseract.js - A Javascript port of the Tesseract OCR engine
        
       Author : kiyanwang
       Score  : 114 points
       Date   : 2021-08-08 11:06 UTC (11 hours ago)
        
 (HTM) web link (tesseract.projectnaptha.com)
 (TXT) w3m dump (tesseract.projectnaptha.com)
        
       | villgax wrote:
       | Tesseract is decent for scanned imagery, whether in actual images
       | or in PDFs, but definitely not for text in the wild.
        
         | PretzelFisch wrote:
         | what would you suggest to use instead?
        
       | postalrat wrote:
       | I wanted to use Tesseract for a project but found it to be a bit
       | too slow for my needs. Doesn't it have options to speed up its
       | recognition or is there another OCR project out there that's made
       | to be faster?
        
       | zeptonix wrote:
       | Tesseract sucked for me. Had a simple use case where I was trying
       | to read numbers (in a computer font) from .png files and at
       | completely predictable locations in the image -- and Tesseract
       | was getting it horribly wrong a huge percent of the time. Went
       | with AWS Rekognition and results were instantly 1000x better.
        
         | ikornaselur wrote:
         | Had the _exact_ same situation! Was just trying to OCR values
         | of screenshots, which were always of the same screen (app
         | screenshots taken by users) and it was so bad. Ended up just
         | using AWS Rekognition and it worked really well.
        
         | kn100 wrote:
         | Post-processing is absolutely essential with Tesseract. Not to
         | self-promote, but I discussed this at some length in this blog
         | post, if you're interested:
         | https://kn100.me/taking-back-data-from-eufy/
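As a rough illustration of the kind of post-processing meant here, the sketch below cleans up OCR output for a field that is known to contain only a number. The confusion table and the function name `clean_numeric_field` are assumptions made for this example; they are not taken from Tesseract or from the linked blog post.

```python
import re
from typing import Optional

# Assumed confusion table: characters an OCR engine may return in place of
# digits. Illustrative only; tune it to the font and engine actually in use.
CONFUSIONS = str.maketrans({
    "O": "0", "o": "0", "D": "0",
    "l": "1", "I": "1", "|": "1",
    "Z": "2", "S": "5", "B": "8",
})


def clean_numeric_field(raw: str) -> Optional[str]:
    """Normalise OCR text for a field known to hold a decimal number.

    Returns the cleaned digits, or None if the text still does not look
    like a number after substitution (so the caller can retry or flag it).
    """
    text = raw.strip().replace(" ", "").translate(CONFUSIONS)
    if re.fullmatch(r"\d+(\.\d+)?", text):
        return text
    return None
```

Substitution like this is only safe when the field type is known in advance, as in the fixed-position screenshots described above; applied to free text it would corrupt ordinary words.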
        
         | dheera wrote:
         | We really need better open source {OCR, TTS, dictation, ...}.
         | All of the common FOSS tools for these tasks are so horribly
         | behind the state of the art.
         | 
         | The sad thing is most of the state of the art models and
         | algorithms are open research, they just are usually not written
         | by software engineers and need to be rewritten to be
         | deployable. Usually you just get some shell script like
         | "run_eval.sh" that generates the figures in the paper through a
         | bunch of spaghetti code, and most of the time it will depend on
         | a specific old version of TensorFlow, that probably isn't
         | available for your CUDA version, and probably won't compile on
         | your system without hours of Googling.
        
       | jtdev wrote:
       | Congrats! Why would one create such a project with JS given all
       | of the languages available to them?
        
         | villgax wrote:
         | To reduce server side burden/costs
        
         | simonw wrote:
         | Ease of deployment. Deploying a client-side JavaScript
         | application remains far, far easier and less expensive than
         | anything that runs server-side (or native compiled) code.
         | 
         | Also privacy: running OCR in someone's browser rather than
         | sending the images back to the server keeps them fully in
         | control of the data they are working with.
        
         | Tajnymag wrote:
         | Client-side web apps. With today's smartphones, it does make
         | sense to not do everything solely on the server side.
         | 
         | Theoretically, cross platform support would be another
         | possibility. But one could argue native C code could be bundled
         | as well, albeit with separate integration being needed.
         | (Android and iOS do support such extensions).
        
         | mkl wrote:
         | It's mostly not JavaScript, since it uses the emscripten port
         | of the Tesseract OCR Engine, and if you want to do things in
         | the browser, JavaScript has to be involved.
        
         | IshKebab wrote:
         | To use it on the web.
        
       | tobyhinloopen wrote:
       | 'Drop an image'? Mobile devices exist...
        
       | peteretep wrote:
       | > Tesseract.js wraps an emscripten port of the Tesseract OCR
       | Engine
       | 
       | Calling this "pure JavaScript" seems misleading
        
         | mkl wrote:
         | Yes, it's kind of weird, since there's no benefit to claiming
         | false things like "Tesseract.js is a pure Javascript port
         | [...]". Say it's WASM, since people associate that with speed
         | and newness (and heavyweight dependencies, but there's no
         | hiding that).
        
           | azakai wrote:
           | Skimming the download, this does indeed use wasm, but it's
           | also possible to build to pure JS with emscripten (in WASM=0
           | mode, wasm2js compiles the wasm to JS). Perhaps that's what
           | they used to do and the docs have not been updated or
           | something like that.
        
             | mkl wrote:
             | Still not a "port" though.
        
       | zikero wrote:
       | Nice stuff!
       | 
       | I found an error in the Chinese demo, with the example you
       | provided (4th character wasn't the same). I know no OCR is
       | perfect, but IMHO at least your own demo should be free of
       | errors.
        
         | mdp2021 wrote:
         | > _at least your own demo should be free of errors_
         | 
         | :) That would be a dishonest demo.
         | 
         | You try to show how well it works, not that it works perfectly
         | well (which is false). Edit: especially since we know that OCR
         | is hardly perfect - we expect errors to be minimized, not
         | absent, and the first interest is to see where the engine
         | fails.
        
         | mkl wrote:
         | There's one in the English demo too: "hail!" -> "haill".
         | They're both pretty bad images though. In practice I've found
         | (command line) Tesseract very accurate on 300dpi scans of
         | printed documents, with colour/greyscale, not binary.
        
       | holoduke wrote:
       | I spent 2 months, 2 years ago, building a passport data
       | extractor for KYC (know your customer) purposes. Unfortunately I
       | did not manage to get to a situation where the extracted data was
       | really useful. I just tried this JS version (I'm sure the native
       | one is the same) and without changing anything (apart from the
       | training dataset) I got much better results. Exciting.
        
         | WrtCdEvrydy wrote:
         | For passports, I would use the MRZ instead. All of the passport
         | data is encoded there and it's machine readable.
         | 
         | http://writecodeeveryday.github.io/projects/passportjs/
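One reason the MRZ suits OCR is that its fields carry check digits, so a misread character can usually be detected. Below is a minimal sketch of the check digit calculation as I understand it from ICAO Doc 9303 (repeating weights 7, 3, 1; letters A-Z count as 10-35; the filler `<` counts as 0). The function names are made up for this example and do not come from the linked project.

```python
def mrz_char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value."""
    if "0" <= ch <= "9":
        return ord(ch) - ord("0")
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":
        return 0
    raise ValueError(f"not a valid MRZ character: {ch!r}")


def mrz_check_digit(field: str) -> int:
    """Weighted sum of the field's characters, modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def field_is_consistent(field: str, printed_check_digit: str) -> bool:
    """True if the OCR'd field matches the check digit printed next to it."""
    return str(mrz_check_digit(field)) == printed_check_digit
```

A failed check says that something was misread, not which character, so it works best as a trigger for re-scanning or manual review.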
        
           | holoduke wrote:
           | The MRZ is just an encoded string. But it says nothing. For
           | proper validation you need to get the readable values as well.
        
       | collin08 wrote:
       | I've used this library in the past for prototyping a project to
       | extract Chinese subtitles from YouTube videos in a Chrome
       | extension. It worked pretty well. The only problem is the library
       | couldn't really handle realtime video. Can't really fault it for
       | that though, since I was sending it every frame. The throughput
       | was good but latency kept increasing, probably because I was
       | giving it too much data.
       | 
       | There's a mode where you can increase the number of worker
       | threads. Tesseract is also designed for text documents and the
       | preprocessing filter I made to convert the images to look more
       | like a text document was pretty naive.
       | 
       | I'm taking an online computer vision class next semester and hope
       | to pick the project back up after learning a bit more.
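The growing latency described above is what tends to happen when every frame is queued for a recognizer that is slower than the frame rate. One common remedy, sketched here under that assumption, is to keep only the newest frame and drop the rest. `LatestFrameSlot` is a hypothetical helper, not part of Tesseract.js or any other library.

```python
import threading
from typing import Generic, Optional, TypeVar

T = TypeVar("T")


class LatestFrameSlot(Generic[T]):
    """Holds at most one pending frame; a newer frame overwrites an older one."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._frame: Optional[T] = None
        self.dropped = 0  # frames overwritten before the OCR worker saw them

    def put(self, frame: T) -> None:
        """Called by the capture side for every incoming frame."""
        with self._lock:
            if self._frame is not None:
                self.dropped += 1
            self._frame = frame

    def take(self) -> Optional[T]:
        """Called by the OCR side when it is ready; returns None if idle."""
        with self._lock:
            frame, self._frame = self._frame, None
            return frame
```

With this in place the OCR worker always sees a recent frame, so latency stays bounded by roughly one recognition time, at the cost of skipping frames. For subtitles that persist over many frames, that trade-off is probably acceptable.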
        
       | gtsoukas wrote:
       | Being disappointed by classic open source OCR I started an
       | attempt to package neural net based approaches
       | (https://github.com/gtsoukas/scene_text, don't use it, it is
       | crap), then I found out that Google's ML Kit
       | (https://developers.google.com/ml-kit/vision/text-recognition)
       | gives quite good results, as long as it is for Latin-based
       | character sets.
        
       | mkl wrote:
       | The total size of the download seems to be 3-4MB (based on
       | https://github.com/naptha/tesseract.js/blob/master/docs/loca...),
       | which is actually less than I expected.
        
         | azakai wrote:
         | It could be even smaller, it seems, as the wasm file is
         | base64-encoded (so that it all fits in a single file - which is
         | convenient, but larger).
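To put a number on "larger": base64 turns every 3 input bytes into 4 output characters, so the encoded form is about a third bigger before any transport compression. The sketch below checks that arithmetic against Python's standard `base64` module. The 3 MB payload is a stand-in chosen for this example, not the actual wasm file.

```python
import base64


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


# Stand-in payload: 3,000,000 zero bytes, roughly the download size
# mentioned in the thread (an assumption, not a measurement of the file).
payload = bytes(3_000_000)
encoded = base64.b64encode(payload)

overhead = len(encoded) / len(payload) - 1
print(f"raw: {len(payload)} bytes, base64: {len(encoded)} bytes, "
      f"overhead: {overhead:.1%}")
```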
        
         | turminal wrote:
         | Only English language support is included. Additional downloads
         | are required for other languages.
        
       | laurent123456 wrote:
       | Does anyone have any experience with the JS version of Tesseract?
       | Is it accurate in general? And is it English only or does it work
       | with any language?
        
       | epmaybe wrote:
       | What is the best way, paid or otherwise, to attempt OCR on a PDF
       | of old typewritten text?
        
         | umvi wrote:
         | Upload to Google Drive and it will do it automatically
        
           | epmaybe wrote:
           | I tried that but it wouldn't load for me
        
         | LeSaucy wrote:
         | ABBYY FineReader has always come out ahead for me in terms of
         | OCR accuracy.
        
       | trekhleb wrote:
       | I've used Tesseract.js to recognise the https://** links from the
       | camera input and to make them clickable.
       | 
       | First issue I've encountered was the text recognition
       | performance. Depending on the camera input (if the image
       | contained something that looked like the text or not) I've got
       | 2-20+ seconds per 640x640px image for text recognition on iPhone
       | X. Not so fast, as you can see. But the recognition was pretty
       | accurate though.
       | 
       | The performance, as expected, improves as the image gets smaller
       | and the amount of text in it decreases.
       | 
       | Since I didn't want to recognise the whole text, but only the
       | links, I've used the TensorFlow Object Detection model to quickly
       | find the areas with the text http://**. Then, instead of
       | recognising the whole image I needed to do it only for smaller
       | parts of the image. This gave some improvements to the
       | performance: from the variable 2-20 seconds per frame I've got
       | more stable 0.5-1 seconds. Also not good, but several times
       | faster.
       | 
       | I've described the challenges in more detail here:
       | https://trekhleb.dev/blog/2020/printed-links-detection/. But to
       | sum up, I had good recognition quality with questionable
       | performance with Tesseract.js.
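The "detect first, then OCR only the detected regions" approach described above depends on cropping each detected box, usually with a little padding so characters at the edge are not cut off. Here is a small sketch of that cropping step. The `Box` type, the padding value and the example coordinates are assumptions for illustration; the detector and the OCR call are deliberately left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned region in pixel coordinates (right/bottom exclusive)."""
    left: int
    top: int
    right: int
    bottom: int


def padded_crop(box: Box, pad: int, width: int, height: int) -> Box:
    """Grow a detected box by pad pixels, clamped to the image bounds."""
    return Box(
        max(0, box.left - pad),
        max(0, box.top - pad),
        min(width, box.right + pad),
        min(height, box.bottom + pad),
    )


def area_fraction(box: Box, width: int, height: int) -> float:
    """Share of the full image that the OCR engine would have to process."""
    return ((box.right - box.left) * (box.bottom - box.top)) / (width * height)


# Hypothetical detection on a 640x640 frame: one line of text near the bottom.
detected = Box(10, 600, 300, 635)
region = padded_crop(detected, pad=8, width=640, height=640)
print(region, f"{area_fraction(region, 640, 640):.1%} of the frame")
```

Handing the engine a few percent of the frame instead of all of it fits the speed-up reported above, although the real gain depends on the detector's own cost.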
        
       ___________________________________________________________________
       (page generated 2021-08-08 23:01 UTC)