[HN Gopher] Tesseract.js - A Javascript port of the Tesseract OC...
___________________________________________________________________
Tesseract.js - A Javascript port of the Tesseract OCR engine
Author : kiyanwang
Score : 114 points
Date : 2021-08-08 11:06 UTC (11 hours ago)
(HTM) web link (tesseract.projectnaptha.com)
(TXT) w3m dump (tesseract.projectnaptha.com)
| villgax wrote:
| Tesseract is decent for scanned imagery, whether in actual images
| or in PDFs but definitely not for text in the wild.
| PretzelFisch wrote:
| what would you suggest to use instead?
| postalrat wrote:
| I wanted to use Tesseract for a project but found it to be a bit
| too slow for my needs. Doesn't it have options to speed up its
| recognition or is there another OCR project out there that's made
| to be faster?
| zeptonix wrote:
| Tesseract sucked for me. Had a simple use case where I was trying
| to read numbers (in a computer font) from .png files and at
| completely predictable locations in the image -- and Tesseract
| was getting it horribly wrong a huge percent of the time. Went
| with AWS Rekognition and results were instantly 1000x better.
| ikornaselur wrote:
| Had the _exact_ same situation! Was just trying to OCR values
| of screenshots, which were always of the same screen (app
| screenshots taken by users) and it was so bad. Ended up just
| using AWS Rekognition and it worked really well.
| kn100 wrote:
| Post processing is absolutely essential with tesseract. Not to
| self promote but I discussed this at some length in this blog
| post, if you're interested:
| https://kn100.me/taking-back-data-from-eufy/
| dheera wrote:
| We really need better open source {OCR, TTS, dictation, ...}.
| All of the common FOSS tools for these tasks are so horribly
| behind the state of the art.
|
| The sad thing is most of the state of the art models and
| algorithms are open research, they just are usually not written
| by software engineers and need to be rewritten to be
| deployable. Usually you just get some shell script like
| "run_eval.sh" that generates the figures in the paper through a
| bunch of spaghetti code, and most of the time it will depend on
| a specific old version of TensorFlow, that probably isn't
| available for your CUDA version, and probably won't compile on
| your system without hours of Googling.
| jtdev wrote:
| Congrats! Why would one create such a project with JS given all
| of the languages available to them?
| villgax wrote:
| To reduce server side burden/costs
| simonw wrote:
| Ease of deployment. Deploying a client-side JavaScript
| application remains far, far easier and less expensive than
| anything that runs server-side (or native compiled) code.
|
| Also privacy: running OCR in someone's browser rather than
| sending the images back to the server keeps them fully in
| control of the data they are working with.
| Tajnymag wrote:
| Client-side web apps. With today's smartphones, it does make
| sense to not do everything solely on the server side.
|
| Theoretically, cross platform support would be another
| possibility. But one could argue native C code could be bundled
| as well, albeit with separate integration being needed.
| (Android and iOS do support such extensions).
| mkl wrote:
| It's mostly not JavaScript, since it uses the emscripten port
| of the Tesseract OCR Engine, and if you want to do things in
| the browser, JavaScript has to be involved.
| IshKebab wrote:
| To use it on the web.
| tobyhinloopen wrote:
| 'Drop an image'? Mobile devices exist...
| peteretep wrote:
| > Tesseract.js wraps an emscripten port of the Tesseract OCR
| Engine
|
| Calling this "pure JavaScript" seems misleading
| mkl wrote:
| Yes, it's kind of weird, since there's no benefit to claiming
| false things like "Tesseract.js is a pure Javascript port
| [...]". Say it's WASM, since people associate that with speed
| and newness (and heavyweight dependencies, but there's no
| hiding that).
| azakai wrote:
| Skimming the download, this does indeed use wasm, but it's
| also possible to build to pure JS with emscripten (in WASM=0
| mode, wasm2js compiles the wasm to JS). Perhaps that's what
| they used to do and the docs have not been updated or
| something like that.
| mkl wrote:
| Still not a "port" though.
| zikero wrote:
| Nice stuff!
|
| I found an error in the Chinese demo, with the example you
| provided (4th character wasn't the same). I know no OCR is
| perfect, but IMHO at least your own demo should be free of
| errors.
| mdp2021 wrote:
| > _at least your own demo should be free of errors_
|
| :) That would be a dishonest demo.
|
| You try to show how well it works, not that it works perfectly
| well (which is false). Edit: especially since we know that OCR
| is hardly perfect - we expect errors to be minimized, not
| absent, and the first interest is to see where the engine
| fails.
| mkl wrote:
| There's one in the English demo too: "hail!" -> "haill".
| They're both pretty bad images though. In practice I've found
| (command line) Tesseract very accurate on 300dpi scans of
| printed documents, with colour/greyscale, not binary.
| holoduke wrote:
| I spent 2 months, 2 years ago, on building a passport data
| extractor. For KYC (know your customer) purposes. Unfortunately I
| did not manage to get to a situation where the extracted data was
| really useful. I just tried this JS version (sure the native one
| is the same) and without changing anything (apart from the
| training dataset) I got much better results. Exciting.
| WrtCdEvrydy wrote:
| For passports, I would use the MRZ instead. All of the passport
| data is encoded there and it's machine readable.
|
| http://writecodeeveryday.github.io/projects/passportjs/
| holoduke wrote:
| The MRZ is just an encoded string. But it says nothing. For proper
| validation you need to get the readable values as well.
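|
| On validation: the MRZ does carry per-field check digits (ICAO Doc
| 9303), so an OCR read can at least be tested for internal
| consistency before it is compared with the printed values. A
| minimal sketch, not code from the linked project; the function name
| mrzCheckDigit is hypothetical, and the sample values are the ICAO
| specimen fields:

```typescript
// Hypothetical helper: ICAO 9303 check digit for a single MRZ field.
// Digits count as themselves, A-Z as 10-35, the filler '<' as 0.
// Weights cycle 7, 3, 1; the check digit is the weighted sum mod 10.
function mrzCheckDigit(field: string): number {
  let sum = 0;
  for (let i = 0; i < field.length; i++) {
    const ch = field.charAt(i);
    let value: number;
    if (ch >= "0" && ch <= "9") {
      value = ch.charCodeAt(0) - 48; // '0' is code 48
    } else if (ch >= "A" && ch <= "Z") {
      value = ch.charCodeAt(0) - 55; // 'A' is code 65, so A maps to 10
    } else if (ch === "<") {
      value = 0;
    } else {
      throw new Error("invalid MRZ character: " + ch);
    }
    const weight = i % 3 === 0 ? 7 : i % 3 === 1 ? 3 : 1;
    sum += value * weight;
  }
  return sum % 10;
}

console.log(mrzCheckDigit("L898902C3")); // document number field, prints 6
console.log(mrzCheckDigit("740812"));    // date of birth field, prints 2
```

| A mismatch between the computed digit and the one read from the MRZ
| means the OCR result for that field should not be trusted.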
| collin08 wrote:
| I've used this library in the past for prototyping a project to
| extract Chinese subtitles from YouTube videos in a Chrome
| extension. It worked pretty well. The only problem is the library
| couldn't really handle realtime video. Can't really fault it for
| that, though: I was sending it every frame. The throughput was good
| but latency kept increasing, probably because I was giving it too
| much data.
|
| There's a mode where you can increase the number of worker
| threads. Tesseract is also designed for text documents and the
| preprocessing filter I made to convert the images to look more
| like a text document was pretty naive.
|
| I'm taking an online computer vision class next semester and hope
| to pick the project back up after learning a bit more.
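|
| One common way to stop latency from growing when every frame is
| submitted is to keep only the newest pending frame and drop stale
| ones while recognition is busy. A minimal sketch under that
| assumption; LatestFrameRunner and the recognize callback are
| hypothetical names standing in for whatever OCR call is used, not
| part of the Tesseract.js API:

```typescript
// Hypothetical sketch: "latest frame wins" backpressure.
// At most one frame is being recognized and at most one is waiting;
// a newer frame replaces the waiting one instead of queueing behind it.
class LatestFrameRunner<F, R> {
  private pending: F | undefined = undefined;
  private busy = false;
  private recognize: (frame: F) => Promise<R>;
  private onResult: (result: R) => void;
  dropped = 0; // frames replaced before they were ever processed
  failed = 0;  // frames whose recognition threw or rejected

  constructor(recognize: (frame: F) => Promise<R>, onResult: (result: R) => void) {
    this.recognize = recognize;
    this.onResult = onResult;
  }

  // Call for every captured frame; returns immediately.
  submit(frame: F): void {
    if (this.pending !== undefined) this.dropped++;
    this.pending = frame;
    if (!this.busy) void this.drain();
  }

  // Process waiting frames one at a time until none is left.
  private async drain(): Promise<void> {
    this.busy = true;
    while (this.pending !== undefined) {
      const frame = this.pending;
      this.pending = undefined;
      try {
        this.onResult(await this.recognize(frame));
      } catch {
        this.failed++; // skip this frame and carry on with the next
      }
    }
    this.busy = false;
  }
}
```

| With this shape the delay is bounded by roughly two recognition
| times, at the cost of skipping frames, which is usually acceptable
| for subtitles that stay on screen for a while.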
| gtsoukas wrote:
| Being disappointed by classic open source OCR I started an
| attempt to package neural net based approaches
| (https://github.com/gtsoukas/scene_text, don't use it, it is
| crap), then I found out that Google's ML Kit
| (https://developers.google.com/ml-kit/vision/text-recognition)
| gives quite good results, as long as it is for Latin-based
| character sets.
| mkl wrote:
| The total size of the download seems to be 3-4MB (based on
| https://github.com/naptha/tesseract.js/blob/master/docs/loca...),
| which is actually less than I expected.
| azakai wrote:
| It could be even smaller, it seems, as the wasm file is
| base64-encoded (so that it all fits in a single file - which is
| convenient, but larger).
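|
| For a sense of scale: padded base64 emits 4 output characters for
| every 3 input bytes, so the encoded form is about a third larger
| than the raw binary. A small sketch of the arithmetic; the 3,000,000
| byte figure is an illustrative assumption, not the measured size of
| the wasm file:

```typescript
// Length in characters of standard padded base64 for `byteCount` bytes:
// every group of up to 3 bytes becomes exactly 4 characters.
function base64Length(byteCount: number): number {
  return 4 * Math.ceil(byteCount / 3);
}

console.log(base64Length(3_000_000)); // prints 4000000
```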
| turminal wrote:
| Only English language support is included. Additional downloads
| are required for other languages.
| laurent123456 wrote:
| Does anyone have any experience with the JS version of Tesseract?
| Is it accurate in general? And is it English only or does it work
| with any language?
| epmaybe wrote:
| What is the best way, paid or otherwise, to attempt OCR on a pdf
| of old typewritten text?
| umvi wrote:
| Upload to Google Drive and it will do it automatically.
| epmaybe wrote:
| I tried that but it wouldn't load for me
| LeSaucy wrote:
| ABBYY FineReader has always come out ahead for me in terms of
| OCR accuracy.
| trekhleb wrote:
| I've used Tesseract.js to recognise the https://** links from the
| camera input and to make them clickable.
|
| First issue I've encountered was the text recognition
| performance. Depending on the camera input (if the image
| contained something that looked like the text or not) I've got
| 2-20+ seconds per 640x640px image for text recognition on iPhone
| X. Not so fast, as you can see. But the recognition was pretty
| accurate though.
|
| The performance, as expected, improves when the image size is
| getting smaller and the amount of text on the image is also
| smaller.
|
| Since I didn't want to recognise the whole text, but only the
| links, I've used the TensorFlow Object Detection model to quickly
| find the areas with the text http://**. Then, instead of
| recognising the whole image I needed to do it only for smaller
| parts of the image. This gave some improvements to the
| performance: from the variable 2-20 seconds per frame I've got
| more stable 0.5-1 seconds. Also not good, but several times
| faster.
|
| I've described the challenges in more detail here:
| https://trekhleb.dev/blog/2020/printed-links-detection/. But to
| sum up, I had good recognition quality with questionable
| performance with Tesseract.js.
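|
| The "detect first, then recognise only the detected parts" step can
| be sketched as follows. This is an illustration, not the code from
| the linked post: the Box type, padAndClamp, and pixelFraction are
| hypothetical names, and the boxes are assumed to come from some
| detector in pixel coordinates. Each box is padded a little so that
| glyph edges are not cut off, then clamped to the image bounds.

```typescript
// A detected region in pixel coordinates (hypothetical shape).
interface Box { x: number; y: number; width: number; height: number; }

// Grow a box by `pad` pixels on every side, then clamp it to the image
// so the crop rectangle never reaches outside the frame.
function padAndClamp(box: Box, pad: number, imgW: number, imgH: number): Box {
  const left = Math.max(0, Math.floor(box.x - pad));
  const top = Math.max(0, Math.floor(box.y - pad));
  const right = Math.min(imgW, Math.ceil(box.x + box.width + pad));
  const bottom = Math.min(imgH, Math.ceil(box.y + box.height + pad));
  return {
    x: left,
    y: top,
    width: Math.max(0, right - left),
    height: Math.max(0, bottom - top),
  };
}

// Share of the full image's pixels covered by the crops (overlaps are
// counted twice). Only a rough proxy for how much less the OCR sees.
function pixelFraction(boxes: Box[], imgW: number, imgH: number): number {
  const cropped = boxes.reduce((sum, b) => sum + b.width * b.height, 0);
  return cropped / (imgW * imgH);
}

const crop = padAndClamp({ x: 100, y: 300, width: 200, height: 30 }, 8, 640, 640);
console.log(crop); // { x: 92, y: 292, width: 216, height: 46 }
console.log(pixelFraction([crop], 640, 640) < 0.03); // prints true
```

| Each crop rectangle would then be passed to the recogniser in place
| of the full 640x640 frame.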
___________________________________________________________________
(page generated 2021-08-08 23:01 UTC)