Document Form Extraction

Last updated: February 10, 2021

Overview

Much of society's most valuable data lives in formulaic documents, aka forms. Common documents like driver's licenses, passports, receipts, invoices, pay stubs, and--most recently and urgently--vaccination records are all forms. Forms have standardized fields, but each instance has its own values. This pay stub, for example, has fields Pay period, Pay Day, Company, Employee, etc. that are the same for all Gusto pay stubs.

[Image: Example of a pay stub form. Source: Gusto blog.]

We would like to make the information in each document programmatically accessible. If we were verifying income, for example, we would want to convert the pay stub above into a key-value map like:

```
{
  "Pay period": "Jul 12, 2019 - Jul 25, 2019",
  "Pay Day": "Aug 2, 2019",
  "Company": "Lockman-Klein",
  "Employee": "Belle Tremblay",
  "Gross Earnings": "$1,480.00"
}
```

Downstream processing might further parse the dates, date ranges, and currencies.

Extracting form information is not the coolest topic, but it's extremely valuable and challenging. A large line-up of tools promises state-of-the-art results with the latest and greatest AI, but we put those claims to the test and came away unimpressed.

Our study

We annotated a small dataset of television ad invoices and ran it through four off-the-shelf APIs for generic form extraction. We graded each product on functionality, accuracy, response time, ease of use, and business considerations like cost and data security. Please see the Methodology section for more detail about our test protocol and evaluation criteria. As a rule, we do not accept affiliate commissions, to keep our reports as unbiased as possible.

Recommendations

* None of the products we tested were able to consistently find all of the form data. The best one only found 2.8 of 4 correct form field pairs (key and value, paired together) on average. For complex documents, this technology simply isn't ready.
* For simpler documents, we suspect the results would be much better. In this case, we recommend Google Form Parser--it was the most accurate service in our test and by far the fastest. For PDF documents, it's also the only service fast enough to be part of a synchronous pipeline.
* One alternative strategy is to first use unstructured text extraction, then write your own post-processing to match keys to values. For this approach, we recommend Amazon Textract. With Textract's unstructured text output and a fairly naive post-processing heuristic, we found 2.9 out of 4 correct fields on average--better than any of the dedicated form extraction tools, and substantially cheaper.
* ABBYY Cloud OCR SDK seems to require codification of form layouts in advance. This is not possible for the applications we have in mind, where many different flavors of a form have different layouts.
* Microsoft Form Recognizer requires a custom model to be trained before extracting data. To us, this feels like a lot of extra overhead, which would only be justified if the model accuracy were far superior to competing products. Unfortunately, Form Recognizer's average recall was the worst of the services we tested.
* There are specialized services to extract text from receipts, invoices, business cards, and tax documents.
  Given our poor results with general-purpose form extraction, we suggest exploring these specialized products if you have one of these form types.

Our pick: Google Form Parser

Ratings

We organized our notes and test results for each form extraction service into five dimensions, scored from 0 (total failure) to 100 (perfect). These were then combined into an overall score with a weighted average. The dimensions and weights are:

* Functionality (35%): does the product do what we need and what it says it does?
* Business considerations (20%): cost, data policies, platform ecosystem
* Accuracy (20%): response rate, and recall of correct key-value pairs
* Speed (15%): response time
* Ease of use (10%): documentation quality, understandable response format, web-based demo, etc.

Please see the Methodology page for detailed definitions of each dimension. The blue radar plots below show the dimension scores for each product. The gray background plot shows the breakdown for the best-rated product, for context.

* Google Form Parser: score 75
* Amazon Textract: score 53
* Microsoft Form Recognizer: score 47
* ABBYY Cloud OCR SDK: no score

Domain Guide

Forms are documents with standardized fields and variable values. They are one of the most elemental ways to store and communicate information, so they pop up everywhere. Some common examples include:

* ID cards
* tax forms
* invoices and receipts
* health and vaccination records

Why you might want document form extraction

Document form extraction is the process of turning forms into actionable data, in an automated, scalable fashion. With a pay stub, for example, we want to turn the document:

[Image: Pay stub example. Source: Gusto blog.]

into a key-value map:

```
{
  "Pay period": "Jul 12, 2019 - Jul 25, 2019",
  "Pay Day": "Aug 2, 2019",
  "Company": "Lockman-Klein",
  "Employee": "Belle Tremblay",
  "Gross Earnings": "$1,480.00"
}
```

After a little bit of extra processing to cast the extracted strings into dates and numbers, we could use this data to verify the customer's employment, or help them track and forecast their savings over time, or compare their earnings to the industry standard--whatever our business use case might be.

General-purpose document form extraction is relatively easy for most people, but very hard to automate. The pay stub example shows why.

[Image: Pay stub example. Source: Gusto blog. Annotations our own.]

* Some values have no explicit keys at all.^1 Others have two keys because they're in tables, where the row and column labels together define the field. Tables also have the additional problem of substantial distance between the keys and the values.
* The association between key and value depends on a subjective reading of page layout, punctuation, and style. Some keys and values are arranged vertically, others horizontally. Some keys are delineated by colons, others by bold font.
* Every payment processor uses a different layout. We could hard-code the location of the fields in the Gusto form, but the layout, style, and field names of ADP and Paychex pay stubs are different, even though the underlying information is the same.

A sprawling marketplace of solutions

The marketplace of text extraction products was vast and confusing before the AI revolution, and it has only grown worse.
Broadly speaking, these tools operate at one of three levels.

[Figure: Product types by complexity and value]

At the most basic level is Optical Character Recognition (OCR), which extracts raw text from images. This is a well-established technology, but it doesn't do much to unlock the business value in form documents.

The most potentially valuable task, at the top of our pyramid, is template-filling. In this scenario, we have our own fixed schema of keys and we want to find the values from each document that "fill" each slot in the template. As far as we can tell, this remains an ambitious research goal, rather than a solved technology.

For this study, we focused on the second level: key-value mapping. These tools construct key-value pairs from extracted text, but they don't attempt to match the information to a predetermined schema.

What to look for in a form extraction product

Within the class of key-value mapping tools, form extraction products differ along several dimensions.

Functionality
* Does the service generally do what it claims?
* Can we get an answer, regardless of accuracy and speed?
* Range of input types and sizes allowed
* Quotas and rate limits

Business considerations
* Pricing model and estimated total cost
* Data policies: privacy, encryption, retention
* Active iteration on product development
* Customer support
* Reliability: service level agreement, up time
* Ecosystem: how vibrant and developed is the surrounding platform?

Accuracy
* Does the product find the values that we're looking for?
* Does the product find the keys that reference the correct values, even if those keys aren't matched to a standard schema?
* Does the product correctly associate keys with values?
* Is the tool more accurate than heuristic post-processing of unstructured text?

Speed
* Synchronous vs. asynchronous options: under what constraints is a synchronous call possible?
* Distribution of response times

Ease of use
* Navigating the vendor's product landscape
* Documentation quality and completeness
* Is there a GUI demo for getting started and sanity checking?
* API design
* Format of the output. Is it human-readable? Can it be serialized? How much post-processing is needed?
* Other engineering "gotchas" or unpleasant surprises

Another key question to ask that doesn't quite fit into this rubric is whether a specialized tool is available for your use case. For invoices and receipts, Taggun, Rossum, Google Procurement DocAI, and the Microsoft Azure Form Recognizer Prebuilt Receipt model all explicitly target these kinds of documents. For identity verification, try Onfido, Trulioo, or Jumio.

Feature comparison

We have compiled information about functionality and business considerations from each product's website. Please see the individual product reviews for the results of our hands-on evaluation.

Input file formats
* ABBYY Cloud OCR SDK: BMP, DCX, PCX, PNG, JPEG, PDF, TIFF, GIF, DjVu, JBIG2
* Amazon Textract: JPG, PNG, PDF
* Google Form Parser: PDF, TIFF, GIF
* Microsoft Form Recognizer: JPG, PNG, PDF (text or scanned), TIFF

Input file size limit
* ABBYY Cloud OCR SDK: 30 MB, 32K x 32K pixels for images
* Amazon Textract: 10 MB for JPG and PNG files, 500 MB for PDFs
* Google Form Parser: 20 MB
* Microsoft Form Recognizer: 50 MB. For images: 10K x 10K pixels. For PDFs: 17 x 17 inches, 200 pages.

Processing model
* ABBYY Cloud OCR SDK: Asynchronous
* Amazon Textract: Async for all file types, synchronous option for JPG and PNG
* Google Form Parser: Synchronous up to 5 pages, async up to 100 pages or 2,000 pages (the docs are contradictory)
* Microsoft Form Recognizer: API docs call it a "Long-Running Operation (LRO)". The call can be blocking or non-blocking.

Cost
* ABBYY Cloud OCR SDK: Pre-paid or monthly subscription for a fixed number of pages. Subscriptions run from $30/month for 500 pages up to $840/month for 30K pages. Each form field counts as 0.2 pages.
* Amazon Textract: Varies by region and depends on the desired layout complexity: lines, forms, and/or tables. For US-East-2 with unstructured text and forms output (but not tables): $0.05/page up to 1M pages, $0.04/page for pages over 1M.
* Google Form Parser: $0.065/page for the first 5M pages/month, $0.05/page beyond 5M pages/month
* Microsoft Form Recognizer: $0.05/page, but unstructured text and custom forms extraction require separate calls, so $0.10/page total

Quotas
* ABBYY Cloud OCR SDK: Quotas are listed for free trials but not for paid accounts, which might suggest there are no limits for paid accounts?
* Amazon Textract: 2 synchronous transactions/sec for US East and West, 1 synchronous transaction/sec for other supported regions. 2 async submissions/sec for all supported regions. 600 simultaneous async jobs in US East & West, 100 in other regions.
* Google Form Parser: Usage counts against total Google Cloud Project quotas. 1,800 requests/user/min. 600 online requests/project/min. 5 concurrent async batch requests per project. 10,000 pages actively being processed simultaneously.
* Microsoft Form Recognizer: Unclear

Miscellaneous
* Microsoft Form Recognizer: Max training set size: 500 pages

Notes

1. We use the term key to mean the text that names a field within a given form.

Methodology

Scope

This article is meant for applied data scientists and engineers, as well as data science and engineering team leads who want to understand more about document form extraction, or need to choose a document form extraction service.

Scoring Rubric

We grouped our notes and ratings into five areas, based on the dimensions described in the Domain Guide. For each area, we score the products from 0 (nonexistent) to 100 (perfect), then compute the total score as a weighted average of the dimensions. Our weights for the dimensions are:

* Functionality: 35%
* Business considerations: 20%
* Accuracy: 20%
* Speed: 15%
* Ease of use: 10%

Ease of use in this comparison has only 10% weight, which is much lower than for the Data App Frameworks comparison. Data App Frameworks are much more about the development experience; Document Form Extraction tools, on the other hand, are accessed through much more standardized APIs.

Our accuracy measures are recall-based. For each test document, we count how many of the 4 ground-truth key-value pairs a form extraction service returns, ignoring any other output from the API. The final score for that service is the average recall over all test documents.

* We use the Jaro-Winkler string similarity function to compare extracted and ground-truth text and decide if they match.
* Some services return a confidence score with output text. We ignore this; a product scores a match if its output text matches one of the ground-truth pairs, regardless of the confidence score.

In our results table, we also have a row called "Mean recall of unstructured text plus custom key-value mapping". This is a baseline to compare the canonical recall against. For each service, we requested unstructured text, in addition to the semi-structured key-value pairs.
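As a concrete illustration, here is a minimal sketch of what those two requests might look like for one of the services, Amazon Textract, using boto3. The file name and region are placeholders, and the synchronous calls shown here only accept image inputs; multi-page PDFs have to go through Textract's asynchronous API.

```python
# Sketch only: request both unstructured text and form key-value output from
# Amazon Textract via boto3. File name and region are illustrative placeholders.
import boto3

textract = boto3.client("textract", region_name="us-east-2")

with open("invoice_page.png", "rb") as f:
    image_bytes = f.read()

# Unstructured text: LINE blocks, roughly in reading order.
text_response = textract.detect_document_text(Document={"Bytes": image_bytes})
lines = [b["Text"] for b in text_response["Blocks"] if b["BlockType"] == "LINE"]

# Semi-structured output: KEY_VALUE_SET blocks pairing detected keys and values.
forms_response = textract.analyze_document(
    Document={"Bytes": image_bytes}, FeatureTypes=["FORMS"]
)
kv_blocks = [b for b in forms_response["Blocks"] if b["BlockType"] == "KEY_VALUE_SET"]
```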
We then created our own set of key-value candidates by associating each pair of successive text blocks. For example, if the unstructured output was

```
["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]
```

then our heuristic approach would return:

```
{
  "Pay Day:": "Aug 2, 2019",
  "Aug 2, 2019": "912 Silver St.",
  "912 Silver St.": "Suite 1966"
}
```

Most of these candidate pairs are nonsense, but because we evaluate based on recall, this method turns out to be a reasonable baseline.

Selecting the challengers

We first narrowed our set of potential products to those that:

* Have either a free trial or a pay-as-you-go pricing model, to avoid the enterprise sales process
* Claim to be machine learning/AI-based, vs. human processing
* Have a self-service API

Of the tools that met these criteria, we chose the four that seemed to best fit the requirements for our test scenario (details below). For this evaluation, we're not worried about handwritten forms, languages other than English, or images of documents. We also assume we don't have a machine learning team on standby to train a custom model.

The Challenge

To extract metadata from political campaign advertising invoices.

Suppose we want to build a service that helps political campaigns verify and track their ad spending. When a campaign receives an invoice from a broadcaster, they upload it to our hypothetical service, and we respond (quickly, if possible) with a verification that the invoice is legit and matches a planned outlay (or not).

For this challenge, we want to extract the invoice number, advertiser, start date of the invoice, and the gross amount billed. For example, suppose our customer submits this invoice:

[Image: Annotated invoice example 1]

The correct answer for this invoice would be:

```
{
  "Contract #": "26795081",
  "Advertiser:": "House Majority Forward",
  "Flight:": "2/5/20 - 2/18/20",
  "Total $": "$33,500.00"
}
```

To extract answers like this at scale, we need a text extraction service with the following features:

* Key-value mapping, not just OCR for unstructured text
* Accepts PDFs
* Responds quickly, preferably synchronously
* Handles forms with different flavors. Each broadcast network uses its own form, with different layout, style, and keys, even though the information is the same.

Here's a second example with the corresponding correct answer:

[Image: Annotated invoice example 2]

```
{
  "Contract #": "4382934",
  "Advertiser": "Diana Harshbarger-Congress-R (135459)",
  "schedule Dates": "05/28/20-06/03/20",
  "Grand Total:": "$1,230.00"
}
```

The first example uses the key "Flight" to indicate the starting and ending dates of the ad campaign, while the second says "Schedule Dates". There are other subtle differences in punctuation (colons), currency symbols, and date formatting.

Data

The documents in our test set are TV advertisement invoices for 2020 political campaigns. The documents were originally made available by the FCC, but we downloaded them from the Weights & Biases Project DeepForm competition (blog post, competition, code repo). Specifically, we randomly selected 51 documents from the 1,000 listed in Project DeepForm's 2020 manifest and downloaded the documents directly from the Project DeepForm Amazon S3 bucket fcc-updated-sample-2020.

The DeepForm project did create ground truth annotations, but we ignored them. DeepForm is focused on end-to-end template filling, a much more challenging task than what we're asking our challenger products to do. We also noticed more errors in the DeepForm annotations than we were comfortable with.
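For readers who want to pull a similar sample, the download step might look roughly like the sketch below. It assumes boto3 with unauthenticated (public) read access to the bucket; the manifest file name, its column name, and the fixed random seed are illustrative placeholders rather than the exact artifacts we used.

```python
# Sketch only: sample 51 invoices from a local copy of the DeepForm 2020 manifest
# and download them from the project's S3 bucket. Paths and column names are
# illustrative; only the bucket name comes from the text above.
import csv
import random

import boto3
from botocore import UNSIGNED
from botocore.client import Config

BUCKET = "fcc-updated-sample-2020"
SAMPLE_SIZE = 51

with open("deepform_2020_manifest.csv", newline="") as f:  # hypothetical local copy
    rows = list(csv.DictReader(f))

random.seed(0)  # illustrative; fixes the sample for repeat runs
sample = random.sample(rows, SAMPLE_SIZE)

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
for row in sample:
    key = row["file_id"] + ".pdf"  # hypothetical column name
    s3.download_file(BUCKET, key, key)
```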
Creating our own ground truth allowed us to evaluate each service's ability to find relevant key-value pairs, without worrying about how those pairs should slot into our standard schema. Annotating form documents is tricky, and we made many small decisions to keep the comparison as uniform and fair as possible.

* Sometimes a PDF document's embedded text differs from the naked eye interpretation. We've gone with the visible text as much as possible.
* Sometimes a key and value that should be paired are far apart on the page, usually because they're part of a table. The second example above illustrates this: the key "Grand Total:" is separated from its value "$1,230.00" by a different data element. We have included these in our annotations knowing this is a very difficult task for any automated system, although we chose fields that are not usually reported in tables.
* Dates are arranged in many different ways. When presented as a range, we have included the whole range string as the correct answer, but when separated we only include the start date.
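To make the scoring concrete, here is a minimal sketch of the recall measure and the successive-pairs baseline described in the Scoring Rubric above. It assumes the jellyfish package for Jaro-Winkler similarity; the 0.9 match threshold and the helper names are illustrative choices, not parameters quoted from our evaluation.

```python
# Sketch only: recall of ground-truth (key, value) pairs with fuzzy matching,
# plus the naive successive-pairs baseline built from unstructured text blocks.
import jellyfish

def is_match(extracted: str, truth: str, threshold: float = 0.9) -> bool:
    """Fuzzy string match using Jaro-Winkler similarity (threshold is illustrative)."""
    return jellyfish.jaro_winkler_similarity(extracted.strip(), truth.strip()) >= threshold

def recall(predicted_pairs: list[tuple[str, str]],
           truth_pairs: list[tuple[str, str]]) -> float:
    """Fraction of ground-truth pairs found anywhere among the predicted pairs."""
    found = sum(
        any(is_match(p_key, t_key) and is_match(p_val, t_val)
            for p_key, p_val in predicted_pairs)
        for t_key, t_val in truth_pairs
    )
    return found / len(truth_pairs)

def successive_pairs(text_blocks: list[str]) -> list[tuple[str, str]]:
    """Baseline: pair each extracted text block with the block that follows it."""
    return list(zip(text_blocks, text_blocks[1:]))

# Example using the unstructured output shown in the Scoring Rubric:
blocks = ["Pay Day:", "Aug 2, 2019", "912 Silver St.", "Suite 1966"]
print(recall(successive_pairs(blocks), [("Pay Day:", "Aug 2, 2019")]))  # -> 1.0
```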