Document Form Extraction

Last updated: February 10, 2021

Overview

Much of society's most valuable data lives in formulaic documents, aka forms. Common documents like driver's licenses, passports, receipts, invoices, pay stubs, and--most recently and urgently--vaccination records are all forms. Forms have standardized fields, but each instance has its own values. This pay stub, for example, has fields Pay period, Pay Day, Company, Employee, etc. that are the same for all Gusto pay stubs.

[Image: Example of a pay stub form. Source: Gusto blog.]

We would like to make the information in each document programmatically accessible. If we were verifying income, for example, we would want to convert the pay stub above into a key-value map like:

```
{
  "Pay period": "Jul 12, 2019 - Jul 25, 2019",
  "Pay Day": "Aug 2, 2019",
  "Company": "Lockman-Klein",
  "Employee": "Belle Tremblay",
  "Gross Earnings": "$1,480.00"
}
```

Downstream processing might further parse the dates, date ranges, and currencies.

Extracting form information is not the coolest topic, but it's extremely valuable and challenging. A large line-up of tools promises state-of-the-art results with the latest and greatest AI, but we put those claims to the test and came away unimpressed.

Our study

We annotated a small dataset of television ad invoices and ran it through four off-the-shelf APIs for generic form extraction. We graded each product on functionality, accuracy, response time, ease of use, and business considerations like cost and data security. Please see the Methodology section for more detail about our test protocol and evaluation criteria. As a rule, we do not accept affiliate commissions, to keep our reports as unbiased as possible.

Recommendations

* None of the products we tested were able to consistently find all of the form data. The best one only found 2.8 of 4 correct form field pairs (key and value, paired together) on average. For complex documents, this technology simply isn't ready.
* For simpler documents, we suspect the results would be much better. In this case, we recommend Google Form Parser--it was the most accurate service in our test and by far the fastest. For PDF documents, it's also the only service fast enough to be part of a synchronous pipeline.
* One alternative strategy is to first use unstructured text extraction, then write your own post-processing to match keys to values. For this approach, we recommend Amazon Textract. With Textract's unstructured text output and a fairly naive post-processing heuristic, we found 2.9 out of 4 correct fields on average--better than any of the dedicated form extraction tools, and substantially cheaper.
* ABBYY Cloud OCR SDK seems to require codification of form layouts in advance. This is not possible for the applications we have in mind, where many different flavors of a form have different layouts.
* Microsoft Form Recognizer requires a custom model to be trained before extracting data. To us, this feels like a lot of extra overhead, which would only be justified if the model accuracy were far superior to competing products. Unfortunately, Form Recognizer's average recall was the worst of the services we tested.
* There are specialized services to extract text from receipts, invoices, business cards, and tax documents.
  Given our poor results with general-purpose form extraction, we suggest exploring these specialized products if you have one of these form types.

Our pick: Google Form Parser

Ratings

We organized our notes and test results for each form extraction service into five dimensions, scored from 0 (total failure) to 100 (perfect). These were then combined into an overall score with a weighted average. The dimensions and weights are:

* Functionality (35%): does the product do what we need and what it says it does?
* Business considerations (20%): cost, data policies, platform ecosystem
* Accuracy (20%): response rate, and recall of correct key-value pairs
* Speed (15%): response time
* Ease of use (10%): documentation quality, understandable response format, web-based demo, etc.

Please see the Methodology page for detailed definitions of each dimension. The blue radar plots below show the dimension scores for each product. The gray background plot shows the breakdown for the best-rated product, for context.

* Google Form Parser: score 75
* Amazon Textract: score 53
* Microsoft Form Recognizer: score 47
* ABBYY Cloud OCR SDK: no score

Domain Guide

Forms are documents with standardized fields and variable values. They are one of the most elemental ways to store and communicate information, so they pop up everywhere. Some common examples include:

* ID cards
* tax forms
* invoices and receipts
* health and vaccination records

Why you might want document form extraction

Document form extraction is the process of turning forms into actionable data, in an automated, scalable fashion. With a pay stub, for example, we want to turn the document:

[Image: Pay stub example. Source: Gusto blog.]

into a key-value map:

```
{
  "Pay period": "Jul 12, 2019 - Jul 25, 2019",
  "Pay Day": "Aug 2, 2019",
  "Company": "Lockman-Klein",
  "Employee": "Belle Tremblay",
  "Gross Earnings": "$1,480.00"
}
```

After a little bit of extra processing to cast the extracted strings into dates and numbers, we could use this data to verify the customer's employment, or help them track and forecast their savings over time, or compare their earnings to the industry standard--whatever our business use case might be.

General-purpose document form extraction is relatively easy for most people, but very hard to automate. The pay stub example shows why.

[Image: Pay stub example. Source: Gusto blog. Annotations our own.]

* Some values have no explicit keys at all.^1 Others have two keys because they're in tables, where the row and column labels together define the field. Tables also have the additional problem of substantial distance between the keys and the values.
* The association between key and value depends on a subjective reading of page layout, punctuation, and style. Some keys and values are arranged vertically, others horizontally. Some keys are delineated by colons, others by bold font.
* Every payment processor uses a different layout. We could hard-code the location of the fields in the Gusto form, but the layout, style, and field names of ADP and Paychex pay stubs are different, even though the underlying information is the same.

A sprawling marketplace of solutions

The marketplace of text extraction products was vast and confusing before the AI revolution, and it has only grown worse.
Broadly speaking, these tools operate at one of three levels.

[Figure: Product types by complexity and value]

At the most basic level is Optical Character Recognition (OCR), which extracts raw text from images. This is a well-established technology, but it doesn't do much to unlock the business value in form documents.

The most potentially valuable task, at the top of our pyramid, is template-filling. In this scenario, we have our own fixed schema of keys and we want to find the values from each document that "fill" each slot in the template. As far as we can tell, this remains an ambitious research goal, rather than a solved technology.

For this study, we focused on the second level: key-value mapping. These tools construct key-value pairs from extracted text, but they don't attempt to match the information to a predetermined schema.

What to look for in a form extraction product

Within the class of key-value mapping tools, form extraction products differ along several dimensions.

Functionality
* Does the service generally do what it claims?
* Can we get an answer, regardless of accuracy and speed?
* Range of input types and sizes allowed
* Quotas and rate limits

Business considerations
* Pricing model and estimated total cost
* Data policies: privacy, encryption, retention
* Active iteration on product development
* Customer support
* Reliability: service level agreement, up time
* Ecosystem: how vibrant and developed is the surrounding platform?

Accuracy
* Does the product find the values that we're looking for?
* Does the product find the keys that reference the correct values, even if those keys aren't matched to a standard schema?
* Does the product correctly associate keys with values?
* Is the tool more accurate than heuristic post-processing of unstructured text?

Speed
* Synchronous vs. asynchronous options: under what constraints is a synchronous call possible?
* Distribution of response times

Ease of use
* Navigating the vendor's product landscape
* Documentation quality and completeness
* Is there a GUI demo for getting started and sanity checking?
* API design
* Format of the output. Is it human-readable? Can it be serialized? How much post-processing is needed?
* Other engineering "gotchas" or unpleasant surprises

Another key question to ask that doesn't quite fit into this rubric is whether a specialized tool is available for your use case. For invoices and receipts, Taggun, Rossum, Google Procurement DocAI, and the Microsoft Azure Form Recognizer Prebuilt Receipt model all explicitly target these kinds of documents. For identity verification, try Onfido, Trulioo, or Jumio.

Feature comparison

We have compiled information about functionality and business considerations from each product's website. Please see the individual product reviews for the results of our hands-on evaluation.

Input file formats
* ABBYY Cloud OCR SDK: BMP, DCX, PCX, PNG, JPEG, PDF, TIFF, GIF, DjVu, JBIG2
* Amazon Textract: JPG, PNG, PDF
* Google Form Parser: PDF, TIFF, GIF
* Microsoft Form Recognizer: JPG, PNG, PDF (text or scanned), TIFF

Input file size limit
* ABBYY Cloud OCR SDK: 30 MB, 32K x 32K pixels for images
* Amazon Textract: 10 MB for JPG and PNG files, 500 MB for PDFs
* Google Form Parser: 20 MB
* Microsoft Form Recognizer: 50 MB. For images: 10K x 10K pixels. For PDFs: 17 x 17 inches, 200 pages.

Processing model
* ABBYY Cloud OCR SDK: Asynchronous
* Amazon Textract: Async for all file types, synchronous option for JPG and PNG
* Google Form Parser: Synchronous up to 5 pages, async up to 100 pages or 2,000 pages (the docs are contradictory)
* Microsoft Form Recognizer: API docs call it a "Long-Running Operation (LRO)". The call can be blocking or non-blocking.

Cost
* ABBYY Cloud OCR SDK: Pre-paid or monthly subscription for a fixed number of pages. Subscriptions run from $30/month for 500 pages up to $840/month for 30K pages. Each form field counts as 0.2 pages.
* Amazon Textract: Varies by region and depends on the desired layout complexity: lines, forms, and/or tables. For US-East-2 with unstructured text and forms output (but not tables): $0.05/page up to 1M pages, $0.04/page for pages over 1M.
* Google Form Parser: $0.065/page for the first 5M pages/month, $0.05/page beyond 5M pages/month
* Microsoft Form Recognizer: $0.05/page, but unstructured text and custom forms extraction require separate calls, so $0.10/page total

Quotas
* ABBYY Cloud OCR SDK: Quotas are listed for free trials but not for paid accounts, which might suggest there are no limits for paid accounts?
* Amazon Textract: 2 synchronous transactions/sec for US East and West, 1 synchronous transaction/sec for other supported regions. 2 async submissions/sec for all supported regions. 600 simultaneous async jobs in US East & West, 100 in other regions.
* Google Form Parser: Usage counts against total Google Cloud Project quotas. 1,800 requests/user/min. 600 online requests/project/min. 5 concurrent async batch requests per project. 10,000 pages actively being processed simultaneously.
* Microsoft Form Recognizer: Unclear

Miscellaneous
* Microsoft Form Recognizer: Max training set size: 500 pages

Notes

1. We use the term key to mean the text that names a field within a given form.

Methodology

Scope

This article is meant for applied data scientists and engineers, as well as data science and engineering team leads who want to understand more about document form extraction, or need to choose a document form extraction service.

Scoring Rubric

We grouped our notes and ratings into five areas, based on the dimensions described in the Domain Guide. For each area, we score the products from 0 (nonexistent) to 100 (perfect), then compute the total score as a weighted average of the dimensions. Our weights for the dimensions are:

* Functionality: 35%
* Business considerations: 20%
* Accuracy: 20%
* Speed: 15%
* Ease of use: 10%

Ease of use in this comparison has only 10% weight, which is much lower than for the Data App Frameworks comparison. Data App Frameworks are much more about the development experience; Document Form Extraction tools, on the other hand, are accessed through much more standardized APIs.

Our accuracy measures are recall-based. For each test document, we count how many of the 4 ground-truth key-value pairs a form extraction service returns, ignoring any other output from the API. The final score for that service is the average recall over all test documents.

* We use the Jaro-Winkler string similarity function to compare extracted and ground-truth text and decide if they match.
* Some services return a confidence score with output text. We ignore this; a product scores a match if its output text matches one of the ground-truth pairs, regardless of the confidence score.

In our results table, we also have a row called "Mean recall of unstructured text plus custom key-value mapping". This is a baseline to compare the canonical recall against. For each service, we requested unstructured text, in addition to the semi-structured key-value pairs.
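As a concrete illustration, here is a minimal sketch of what those two requests might look like for one of the services, Amazon Textract, using boto3. The file name and region are placeholders, and the synchronous calls shown here only accept image inputs; multi-page PDFs have to go through Textract's asynchronous API.

```python
# Sketch only: request both unstructured text and form key-value output from
# Amazon Textract via boto3. File name and region are illustrative placeholders.
import boto3

textract = boto3.client("textract", region_name="us-east-2")

with open("invoice_page.png", "rb") as f:
    image_bytes = f.read()

# Unstructured text: LINE blocks, roughly in reading order.
text_response = textract.detect_document_text(Document={"Bytes": image_bytes})
lines = [b["Text"] for b in text_response["Blocks"] if b["BlockType"] == "LINE"]

# Semi-structured output: KEY_VALUE_SET blocks pairing detected keys and values.
forms_response = textract.analyze_document(
    Document={"Bytes": image_bytes}, FeatureTypes=["FORMS"]
)
kv_blocks = [b for b in forms_response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET"]
```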
We then created our own set of key-value candidates by associating each pair of successive text blocks. For example, if the unstructured output was

```
["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]
```

then our heuristic approach would return:

```
{
  "Pay Day:": "Aug 2, 2019",
  "Aug 2, 2019": "912 Silver St.",
  "912 Silver St.": "Suite 1966"
}
```

Most of these candidate pairs are nonsense, but because we evaluate based on recall, this method turns out to be a reasonable baseline.

Selecting the challengers

We first narrowed our set of potential products to those that:

* Have either a free trial or a pay-as-you-go pricing model, to avoid the enterprise sales process
* Claim to be machine learning/AI-based, vs. human processing
* Have a self-service API

Of the tools that met these criteria, we chose the four that seemed to best fit the requirements for our test scenario (details below). For this evaluation, we're not worried about handwritten forms, languages other than English, or images of documents. We also assume we don't have a machine learning team on standby to train a custom model.

The Challenge

To extract metadata from political campaign advertising invoices.

Suppose we want to build a service that helps political campaigns verify and track their ad spending. When a campaign receives an invoice from a broadcaster, they upload it to our hypothetical service, and we respond (quickly, if possible) with a verification that the invoice is legit and matches a planned outlay (or not).

For this challenge, we want to extract the invoice number, advertiser, start date of the invoice, and the gross amount billed. For example, suppose our customer submits this invoice:

[Image: Annotated invoice example 1]

The correct answer for this invoice would be:

```
{
  "Contract #": "26795081",
  "Advertiser:": "House Majority Forward",
  "Flight:": "2/5/20 - 2/18/20",
  "Total $": "$33,500.00"
}
```

To extract answers like this at scale, we need a text extraction service with the following features:

* Key-value mapping, not just OCR for unstructured text
* Accepts PDFs
* Responds quickly, preferably synchronously
* Handles forms with different flavors. Each broadcast network uses its own form, with different layout, style, and keys, even though the information is the same.

Here's a second example with the corresponding correct answer:

[Image: Annotated invoice example 2]

```
{
  "Contract #": "4382934",
  "Advertiser": "Diana Harshbarger-Congress-R (135459)",
  "schedule Dates": "05/28/20-06/03/20",
  "Grand Total:": "$1,230.00"
}
```

The first example uses the key "Flight" to indicate the starting and ending dates of the ad campaign, while the second says "Schedule Dates". There are other subtle differences in punctuation (colons), currency symbols, and date formatting.

Data

The documents in our test set are TV advertisement invoices for 2020 political campaigns. The documents were originally made available by the FCC, but we downloaded them from the Weights & Biases Project DeepForm competition (blog post, competition, code repo). Specifically, we randomly selected 51 documents from the 1,000 listed in Project DeepForm's 2020 manifest and downloaded the documents directly from the Project DeepForm Amazon S3 bucket fcc-updated-sample-2020.

The DeepForm project did create ground truth annotations, but we ignored them. DeepForm is focused on end-to-end template filling, a much more challenging task than what we're asking our challenger products to do. We also noticed more errors in the DeepForm annotations than we were comfortable with.
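For readers who want to pull a similar sample, the download step might look roughly like the sketch below. It assumes boto3 with unauthenticated (public) read access to the bucket; the manifest file name, its column name, and the fixed random seed are illustrative placeholders rather than the exact artifacts we used.

```python
# Sketch only: sample 51 invoices from a local copy of the DeepForm 2020 manifest
# and download them from the project's S3 bucket. Paths and column names are
# illustrative; only the bucket name comes from the text above.
import csv
import random

import boto3
from botocore import UNSIGNED
from botocore.client import Config

BUCKET = "fcc-updated-sample-2020"
SAMPLE_SIZE = 51

with open("deepform_2020_manifest.csv", newline="") as f:  # hypothetical local copy
    rows = list(csv.DictReader(f))

random.seed(0)  # illustrative; fixes the sample for repeat runs
sample = random.sample(rows, SAMPLE_SIZE)

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
for row in sample:
    key = row["file_id"] + ".pdf"  # hypothetical column name
    s3.download_file(BUCKET, key, key)
```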
Creating our own ground truth allowed us to evaluate each service's ability to find relevant key-value pairs, without worrying about how those pairs should slot into our standard schema. Annotating form documents is tricky, and we made many small decisions to keep the comparison as uniform and fair as possible.

* Sometimes a PDF document's embedded text differs from the naked eye interpretation. We've gone with the visible text as much as possible.
* Sometimes a key and value that should be paired are far apart on the page, usually because they're part of a table. The second example above illustrates this: the key "Grand Total:" is separated from its value "$1,230.00" by a different data element. We have included these in our annotations knowing this is a very difficult task for any automated system, although we chose fields that are not usually reported in tables.
* Dates are arranged in many different ways. When presented as a range, we have included the whole range string as the correct answer, but when separated we only include the start date.
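To make the scoring concrete, here is a minimal sketch of the recall measure and the successive-pairs baseline described in the Scoring Rubric above. It assumes the jellyfish package for Jaro-Winkler similarity; the 0.9 match threshold and the helper names are illustrative choices, not parameters quoted from our evaluation.

```python
# Sketch only: recall of ground-truth (key, value) pairs with fuzzy matching,
# plus the naive successive-pairs baseline built from unstructured text blocks.
import jellyfish

def is_match(extracted: str, truth: str, threshold: float = 0.9) -> bool:
    """Fuzzy string match using Jaro-Winkler similarity (threshold is illustrative)."""
    return jellyfish.jaro_winkler_similarity(extracted.strip(), truth.strip()) >= threshold

def recall(predicted_pairs: list[tuple[str, str]],
           truth_pairs: list[tuple[str, str]]) -> float:
    """Fraction of ground-truth pairs found anywhere among the predicted pairs."""
    found = sum(
        any(is_match(p_key, t_key) and is_match(p_val, t_val)
            for p_key, p_val in predicted_pairs)
        for t_key, t_val in truth_pairs
    )
    return found / len(truth_pairs)

def successive_pairs(text_blocks: list[str]) -> list[tuple[str, str]]:
    """Baseline: pair each extracted text block with the block that follows it."""
    return list(zip(text_blocks, text_blocks[1:]))

# Example using the unstructured output shown in the Scoring Rubric:
blocks = ["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]
print(recall(successive_pairs(blocks), [("Pay Day:", "Aug 2, 2019")]))  # -> 1.0
```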