hngopher.com

       [HN Gopher] Ask HN: OCR Libraries for Receipt Scanning/Parsing?
       ___________________________________________________________________
        
       Ask HN: OCR Libraries for Receipt Scanning/Parsing?
        
       I'm interested in keeping tabs on my spending and comparing prices
       of items I buy at grocery stores, because I tend to not think about
       it when I need something. I am conscious of the extreme price
       discrepancies for the exact same items at stores just blocks apart
       here in NYC, but it's difficult to keep track of the prices of each
       item at various places to optimize shopping.  I want to build a
       system that can keep a running tab of my purchases by item, price,
       and store. I need to find a library that can effectively scan a
       receipt, recognize the store (usually name, number, address and
       logo at the top), and differentiate each item label and its price.
       I plan to manually tag each item label from a store's receipt with
       the item's barcode the first time it is seen.  I have been
       sporadically googling the past 6 months but am still unsure which
       OCR library(s) I should invest my time in, or how low-level I
       should start. Should I grab a library like Tesseract and do my own
       feature extraction, or use libs that spit out semi-structured
       objects with text and hope they return something similar enough
       across store receipts to make sense of consistently?  I'm ok with
       this being an extended project, but I would like some input on
       choosing a solid library with accurate OCR and advice on how to
       approach training/parsing from someone with more experience.
       Other solutions and advice are also welcome++
        
       Author : selbyk
       Score  : 45 points
       Date   : 2021-04-03 18:51 UTC (4 hours ago)
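
A minimal sketch of the storage side of the system described in the post: purchases recorded by item label, price, and store, plus a manual label-to-barcode tag so the same product can be compared across stores. The table names, column names, and helper functions are all hypothetical, chosen only for illustration; it uses Python's built-in sqlite3 module.

```python
import sqlite3


def open_db(path=":memory:"):
    """Create the two illustrative tables (names are assumptions, not from the post)."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS label_map (
            store   TEXT NOT NULL,
            label   TEXT NOT NULL,   -- item text as printed on that store's receipt
            barcode TEXT NOT NULL,   -- tagged by hand the first time the label is seen
            PRIMARY KEY (store, label)
        );
        CREATE TABLE IF NOT EXISTS purchase (
            store       TEXT NOT NULL,
            label       TEXT NOT NULL,
            price_cents INTEGER NOT NULL,
            bought_on   TEXT NOT NULL    -- ISO date string
        );
        """
    )
    return conn


def tag_label(conn, store, label, barcode):
    """Remember which barcode a store-specific receipt label refers to."""
    conn.execute(
        "INSERT OR REPLACE INTO label_map VALUES (?, ?, ?)", (store, label, barcode)
    )


def record_purchase(conn, store, label, price_cents, bought_on):
    """Append one receipt line to the running tab."""
    conn.execute(
        "INSERT INTO purchase VALUES (?, ?, ?, ?)",
        (store, label, price_cents, bought_on),
    )


def cheapest_store(conn, barcode):
    """Return (store, price_cents) for the lowest recorded price, or None."""
    return conn.execute(
        """
        SELECT p.store, p.price_cents
        FROM purchase AS p
        JOIN label_map AS m ON m.store = p.store AND m.label = p.label
        WHERE m.barcode = ?
        ORDER BY p.price_cents ASC
        LIMIT 1
        """,
        (barcode,),
    ).fetchone()
```

Keying the tag on (store, label) reflects the point that each store prints its own abbreviation for the same product; the barcode is what ties them together.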
        
       | wyiske wrote:
       | I built an app to scan receipts for bill splitting, although your
       | use case is certainly interesting.
       | 
       | Google's MLKit is very accurate for on device recognition. You
       | can even feed frames straight from the camera with almost real
       | time results. Your bigger problem will be parsing the results,
       | and handling very inconsistent receipts.
        
       | perssontm wrote:
       | I recently started using paperless-ng, check it out, perhaps you
       | can build on that. It includes Tesseract for OCR, for example.
        
       | gitowiec wrote:
       | I'm familiar with Camelot, it is used by UI called Excalibur. It
       | is more intended to scan invoices or bank statements. It is
       | perfect for tabularized data. It can handle tables without
       | explicit column edges.
        
       | xnx wrote:
       | If you're set on building your own, you're probably not
       | interested in using this:
       | https://blog.google/technology/area-120/stack , but it might be a
       | useful reference.
        
         | rexelhoff wrote:
         | If anyone is interested in keeping long-term records and
         | functionality, I'd suggest steering away from anything Google,
         | with their track record of killing things.
        
           | xnx wrote:
           | Always a possibility. I have no concern using this because:
           | "Stack can also automatically save a copy of your documents
           | to Google Drive. That way, should you ever decide to stop
           | using Stack, your documents will be accessible in Drive and
           | easy to export. "
        
             | eitland wrote:
             | Unless Google closes your account for uploading too many,
             | too few or just wrong documents - or, more likely because
             | of some unspecified violation of something.
        
       | dudus wrote:
       | Google launched an Android app called Stack. It's out of their
       | Area 120, so it's not a fully supported product. But it scans and
       | uploads to Google Drive and does some OCR. It's been pleasant to
       | use.
        
       | rolisz wrote:
       | I worked on such a project 8 years ago. I actually ended up
       | building my own OCR engine, after annotating manually about 50
       | receipts (about 8000 characters if I remember correctly). One of
       | the problems I encountered back then is that snapping a picture
       | of a receipt with your phone will result in weird lighting
       | conditions and angles which will mess with the OCR engine. The
       | second problem is that it's hard to keep the receipt straight
       | while taking the picture, so it will be hard to identify lines in
       | the picture, because they will be curved.
       | 
       | To some extent, all this is solved by some modern APIs, such as
       | what GCP or AWS offer, for doing OCR for you. But as far as I
       | know, there is still one more challenge: interpreting the text.
       | Inferring what each line is, what's the price for which item
       | (some receipts have the price on the same line, some on the next
       | line, some above) is quite hard. I tried to do it with rules
       | (regexes and lots of ifs), but even a 95% accuracy of the OCR
       | engine will trip you up.
       | 
       | You can probably frame this as an ML problem as well, but I don't
       | think you'll find any datasets for this.
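
A rough sketch of the rule-based approach described above (regexes and lots of ifs), limited to two of the layouts mentioned: price on the same line as the item, and price on the line after it. The patterns, the keyword list, and the function names are assumptions for illustration only; real receipts will need store-specific rules, and OCR errors will still break patterns like these.

```python
import re
from decimal import Decimal

# A trailing amount such as "3.49" or "3,49", optionally followed by a one-letter tax flag.
PRICE = re.compile(r"(-?\d+[.,]\d{2})\s*[A-Z]?$")
# Summary lines that are not items (an illustrative, incomplete list).
SKIP = re.compile(r"^(sub\s*total|total|tax|change|cash|visa)\b", re.IGNORECASE)


def to_cents(text):
    """Convert '3.49' or '3,49' to integer cents without float rounding."""
    return int(Decimal(text.replace(",", ".")) * 100)


def parse_items(lines):
    """Return [(label, cents)] from OCR'd receipt lines."""
    items = []
    pending = None  # a label still waiting for a price on the following line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if SKIP.match(line):
            pending = None
            continue
        match = PRICE.search(line)
        if match is None:
            pending = line  # no price here, so the price may be on the next line
            continue
        label = line[: match.start()].strip()
        cents = to_cents(match.group(1))
        if label:
            items.append((label, cents))      # price on the same line
        elif pending:
            items.append((pending, cents))    # price-only line after a label
        pending = None
    return items
```

The "price above the item" layout is deliberately not handled; supporting it means deciding per store which direction a lone price binds, which is where the ifs start to pile up.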
        
         | rolisz wrote:
         | One thing that could help a lot is trying to get the receipt
         | data from some loyalty card some shops have.
         | 
         | In Romania, almost all big stores have mobile apps which allow
         | you to export your purchase lists. Granted, some of them have
         | dumb outputs (Lidl gives you an image of your receipt, so you
         | still have to do OCR on it, Carrefour gives you a PDF), but it
         | does make the problem much easier. Of course, this won't work
         | for your random corner shop.
        
         | wcarss wrote:
         | As I mentioned in another comment, some friends of mine worked
         | at a startup that's entirely built around receipt scanning and
         | itemization, and your comment aligns with what I heard from
         | them ad nauseam: receipts are _hard_, in large part because
         | there's just no standard way of putting things onto them.
         | 
         | How do you show subtotals, taxes, and totals? How do you flag
         | that an item is taxable or not? What's part of the header? What
         | format are the numbers? What kind of subtle background text is
         | on the receipt? Is the receipt at an angle? Is the picture
         | taken of _just_ a receipt, or a receipt held up in the air,
         | with stuff behind it?
         | 
         | Sometimes, there are lines on receipts that are just meant to
         | be ignored, maybe for old tax regulations that don't exist
         | anymore but were important when the receipt-printing software
         | was written.
         | 
         | It's a mess.
        
       | ampdepolymerase wrote:
       | If you need 99%+ accuracy go for AWS Mechanical Turk. They are
       | used by Wave Accounting and other office application companies
       | for receipt OCR. For 85-95%+ accuracy any off-the-shelf solution
       | like Google Cloud ML APIs or AWS Textract will be fine. You can
       | get better results with both the cloud APIs and hand-rolled ML
       | models if you have a good dataset. For this sort of application,
       | a large quantity of well-annotated data is king. If you only have
       | <100 receipts per year and need very high accuracy it might be
       | cheaper to just go with AWS Mechanical Turk end-to-end. You have
       | to pay people to annotate the data anyways if you want to train a
       | model so it might be easier to just stick with humans.
        
       | pjc50 wrote:
       | I had this idea a while ago, tried a number of libraries including
       | Tesseract, and found all the results extremely poor. I'd be
       | interested to see if one that works is suggested.
        
         | foepys wrote:
         | In my experience preprocessing the image is extremely important
         | before feeding it into Tesseract. I tried to do the same as OP
         | on a rainy afternoon but shelved the project after it became
         | more of an ImageMagick research task than about creating a
         | database of my receipts. I got Tesseract to recognize about 80%
         | of the text but it was still missing some letters from slightly
         | worn-out receipts.
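
One possible way to tolerate a few missing letters, assuming the item labels for a store have already been seen and recorded once: match each OCR'd label against the known ones by similarity instead of exact equality. This sketch uses difflib from the Python standard library; the function name, the sample labels, and the 0.8 cutoff are illustrative assumptions, and the cutoff would need tuning on real data.

```python
import difflib


def match_label(ocr_text, known_labels, cutoff=0.8):
    """Return the known label most similar to the OCR output, or None.

    cutoff is the minimum similarity ratio (0..1) accepted as a match.
    """
    candidates = difflib.get_close_matches(
        ocr_text.upper(), known_labels, n=1, cutoff=cutoff
    )
    return candidates[0] if candidates else None


# Usage: "ORG BANNAS" has lost a letter but still resolves to a known label.
known = ["ORG BANANAS", "WHOLE MILK 1GAL", "SOURDOUGH LOAF"]
best = match_label("org bannas", known)
```

This only helps with labels, not prices: a misread digit still yields a plausible but wrong number, so amounts need a separate check (for example, that the items sum to the printed total).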
        
       | wcarss wrote:
       | I've had friends work at Sensibill[1] which sells tools (mostly
       | to banks) to build some of what you're imagining having right
       | into banking+expense tracking apps. Not sure if they have
       | anything a la carte but they might have _something_ of value to
       | look at.
       | 
       | 1 - https://getsensibill.com/
        
       | mkl wrote:
       | I've been experimenting with using tesseract to get information
       | out of scanned tutorial roll sheets, with surprising success. If
       | you ask it for tsv or hocr output, it will give you a bounding
       | box for each word. To extract a student's attendance information,
       | I grep the tsv files for a student ID number or name, get the y
       | position with sed, and combine slices of the page images with
       | ImageMagick (in my case I want to see all the handwritten ticks
       | and numbers). You might be able to do something similar looking
       | for numbers on the same line as key words like "Total" or
       | "apples" or whatever. Some of your success will depend on how
       | well you scan the receipts.
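
A sketch of the bounding-box idea above, assuming the OCR step has already produced Tesseract's TSV output as text. The column names (level, left, top, height, text, and so on) are the ones Tesseract writes in its TSV header, and level 5 marks word rows; the function names, the amount pattern, and the "same row" tolerance are assumptions for illustration.

```python
import csv
import io
import re

# An amount such as "12.34", "12,34" or "$12.34".
AMOUNT = re.compile(r"^\$?\d+[.,]\d{2}$")


def words_from_tsv(tsv_text):
    """Return word boxes (text, left edge, vertical midpoint, height) from Tesseract TSV."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row["text"] or "").strip()
        if row["level"] != "5" or not text:
            continue  # keep only word-level rows that contain text
        top, height = int(row["top"]), int(row["height"])
        words.append(
            {
                "text": text,
                "left": int(row["left"]),
                "mid_y": top + height / 2,
                "height": height,
            }
        )
    return words


def amounts_beside(words, keyword):
    """Return amounts to the right of `keyword` at roughly the same y position."""
    hits = []
    for key in words:
        if key["text"].lower().rstrip(":") != keyword.lower():
            continue
        for word in words:
            same_row = abs(word["mid_y"] - key["mid_y"]) <= key["height"] / 2
            if same_row and word["left"] > key["left"] and AMOUNT.match(word["text"]):
                hits.append(word["text"])
    return hits
```

Comparing y positions directly, instead of relying on Tesseract's own line numbers, is a deliberate choice here: on a receipt the label and the price are often far apart horizontally and may be assigned to different blocks.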
        
       | screye wrote:
       | Microsoft's Form Recognizer is pretty good.
       | (https://docs.microsoft.com/en-us/azure/cognitive-services/fo...)
       | 
       | discl@imer - I verk 4 not-Macro-Hard. But, I have no connection
       | to this team.
       | 
       | edit: this might be terribly extra for personal use.
        
       | phenkdo wrote:
       | I found easyocr to be the most accurate. Tesseract was meh.
        
       | jonahbenton wrote:
       | As others have suggested, this is not a project where stitching
       | together OSS OCR bits is going to yield anywhere near useful
       | results. Overall at multiple levels of the stack the error bars
       | on the tech bits are really wide and narrowing them is still a
       | research project. This is why most of the suggestions are: if you
       | want a workable _solution_, brute-force cheap human Mechanical
       | Turk is the only option.
       | 
       | However, if you are looking for a _project_, picking _one_
       | grocery store with one receipt format and generally
       | limited/consistent product coding schemes is a reasonable thing
       | to plug away on. Speaking personally I did this with Whole Foods
       | receipts for a while and was able to get to almost, kinda usable.
       | But then the pandemic hit and I started ordering delivery, which
       | obviates the whole receipt ingestion thing because I can get all
       | those details directly from Amazon (modulo doing some data
       | scraping).
       | 
       | Analytics on food purchases are a tremendously interesting and
       | deeply underexplored space in which there is lots of future
       | commercial potential.
        
       | jka wrote:
       | For "middle ground" projects like this (criteria: a common enough
       | problem that lots of people _should_ have thought about it -- but
       | it may not be a lucrative core business area -- and there aren't
       | any household-name open source projects that cover it), I often
       | turn to GitHub repository search to see what's available.
       | 
       | Based on that, your best bet might be
       | https://github.com/ReceiptManager/receipt-parser-legacy, which is
       | a Python library built on top of the Tesseract OCR engine. You
       | can use it containerized, in Android/iOS applications, or via
       | your own Python scripts.
        
         | jka wrote:
         | NB: it's also previously been discussed on HN:
         | https://news.ycombinator.com/item?id=10338199
        
       | misiti3780 wrote:
       | Use textract. Super easy to integrate and results are pretty
       | impressive. Also, it is cheap.
        
       | MattGaiser wrote:
       | Why not just use Mechanical Turk? You can get receipts done for
       | pennies.
        
         | arbitrage wrote:
         | because that is not an OCR Library for Receipt Scanning/Parsing
        
           | yjftsjthsd-h wrote:
           | It's a service so it requires sending your data over the
           | network, but the rest is just ignorable implementation
           | details.
        
       | verelo wrote:
       | I was the tech founder at a company that built this exact
       | technology. checkout51.com (still running but we sold it and I've
       | since moved on)
       | 
       | If you want to chat feel free to reach out, I could talk all day
       | about this stuff.
        
       | eastendguy wrote:
       | For free OCR and quick prototyping, I use
       | https://ocr.space/receiptscanning - It is easy to use and has a
       | generous free tier of 25,000 free scans each month.
       | 
       | Having said that, I am sure there must be some existing
       | accounting software with built-in OCR? Probably even an app?
        
       | scandox wrote:
       | Well you should evaluate ABBYY to see how well it performs as it
       | is one of the most widely used commercial applications for OCR.
       | 
       | I used it for years to scan our bank statements (before our bank
       | could export data).
       | 
       | It was the only thing I ever found that handled tabular data
       | properly.
        
       | sandreas wrote:
       | Maybe this is helpful: https://nanonets.com/blog/receipt-ocr/
       | 
       | In my opinion Tesseract is the most sophisticated "free" OCR
       | solution out there. The problem with Tesseract is not its
       | recognition capabilities, but more the preprocessing steps:
       | - thresholding
       | - deskewing
       | - segmentation
       | - ...
       | 
       | There is a C# library (non-free) that improves recognition A
       | LOT, just by providing these abilities:
       | https://www.vintasoft.com/vsocr-dotnet-index.html
       | 
       | If you find a good Open Source solution, I would be interested,
       | too...
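
To make the first preprocessing step above concrete, here is a small pure-Python sketch of global thresholding using Otsu's method, which picks the cut-off that best separates dark ink from light paper. It assumes the image is already a flat list of 8-bit grayscale values (0 to 255); the function names are illustrative. This is a teaching sketch, not what any particular library does internally; an image library would do the same job much faster on real scans, and deskewing and segmentation are not covered.

```python
def otsu_threshold(pixels):
    """Return the threshold t in 0..255 that maximizes between-class variance.

    Pixels <= t are treated as dark (ink), pixels > t as light (paper).
    """
    hist = [0] * 256
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    dark_count = 0   # number of pixels at or below the candidate threshold
    dark_sum = 0     # sum of their gray levels
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        if dark_count == 0:
            continue          # nothing on the dark side yet
        light_count = total - dark_count
        if light_count == 0:
            break             # everything is on the dark side from here on
        dark_mean = dark_sum / dark_count
        light_mean = (sum_all - dark_sum) / light_count
        between_var = dark_count * light_count * (dark_mean - light_mean) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to pure black (0) or pure white (255)."""
    return [0 if value <= t else 255 for value in pixels]
```

A single global threshold tends to struggle with the uneven lighting of phone photos mentioned elsewhere in the thread; local (adaptive) thresholding is the usual next thing to try in that case.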
        
       ___________________________________________________________________
       (page generated 2021-04-03 23:01 UTC)