Medusis Tech Blog
Adventures in OCR
These past few weeks I've been working on OCRing an ancient book: a
late 19th century edition of 18th century memoirs, in French: Les
Mémoires de Saint-Simon.
Saint-Simon was a courtier in Versailles during the last part of the
reign of Louis XIV; his enormous memoirs (over 3 million words) are a
first-hand testimony of this time and place, but are more revered
today for their literary value than for their accuracy. They have had
a profound influence on the most prominent French writers of the 19th
and 20th centuries, including Chateaubriand, Stendhal, Hugo, Flaubert,
the Goncourt brothers, Zola, and of course Proust, whose entire
project was to produce a new, fictitious version of the memoirs for
his time. (Although that hasn't stopped many from trying, it's
difficult to truly appreciate Proust if one isn't familiar with
Saint-Simon.) Tolstoy was also a fan, as were many others.
Only abridged and partial translations of the Mémoires are available
in English, which may be a unique case for a French author of such
importance, and likely due to their sheer volume. In French however,
many editions exist. The one still considered the best was made by
Boislisle between 1879 and 1930 (Boislisle himself died in 1908 but
the work was continued by other editors, including his son). This
edition has rich, detailed, fantastic footnotes about any topic or
person. We're talking about 45 volumes with roughly 600 pages each.
The French National Library (Bibliothèque nationale de France)
scanned these physical books years ago, and they're available online
on Gallica, the BnF's digital library, but only as images and through
a pretty clunky interface that makes reading quite difficult.
My goal is simple: create a proper text version that can be read
through (without mangling footnotes and comments into the main text),
that's searchable and that people can actually copy and paste from.
The OCR part itself was the easy bit - parsing what the OCR engine
spits out is where the real challenge lies. Here's a breakdown of the
tasks, issues, and solutions.
The result is available here (only the first volume is online for
now, and not optimized for mobile).
What's at stake
[ss_example_zones_colors]
Pages contain the following possible zones (colors refer to the
image):
* header (blue)
* comments in the margins (green)
* main text (pink)
* footnotes (yellow)
* signature mark (purple)
OCRing the books means correctly parsing the words in those different
zones and reconstructing them properly, so as to produce readable
text (and not just being able to 'randomly' find words on a page,
like what Google Books does, for example).
Preparing the images for OCR
Getting started with Gallica's PDFs is pretty straightforward. You
can download the full scanned books, and there are lots of ways to
extract the images. I went with PDFTK, a free tool that works on both
Linux and Windows:
pdftk document.pdf burst
This command simply splits document.pdf into individual PDFs, one per
page.
Looking at the images in this particular document, they're not great
- low resolution and some are a bit tilted. Since OCR quality heavily
depends on input image quality, I needed to fix these issues by
enlarging the images and straightening them out. ImageMagick turned
out to be perfect for this - it's a command-line tool that's
incredibly versatile.
Here's the command I used to process all PDF files from one directory
to another:
for file in "$SOURCE_DIR"/*.pdf; do
    if [ -f "$file" ]; then
        filename=$(basename "$file" .pdf)
        echo "Converting $file to $TARGET_DIR/$filename.jpg"
        convert -density 300 -deskew 30 "$file" -quality 100 "$TARGET_DIR/$filename.jpg"
    fi
done
Now we've got a directory full of processed images ready to feed into
an OCR engine.
Sending the images to an OCR engine
For the actual OCR work (turning pictures into text), I went with
Google Vision, which is excellent, and very reasonably priced. Here's
the Python code to send an image to their API:
from google.cloud import vision

client = vision.ImageAnnotatorClient()
with open(image_path, 'rb') as image_content:
    image = vision.Image(content=image_content.read())

request = vision.AnnotateImageRequest(
    image=image,
    features=[{"type_": vision.Feature.Type.DOCUMENT_TEXT_DETECTION}],
)
response = client.annotate_image(request=request)
The API sends back a JSON structure with two main parts:
* text_annotations: a list of words, where the first item is the
  full page text and all other items are individual words with
  their bounding boxes (four corner points each)
* full_text_annotation: a list of blocks and letters, also with
bounding boxes (I didn't end up using this)
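For reference, pulling the words and their coordinates out of that
response looks roughly like this; the shape of the word records (a
dict with text, x and y) is just a convention I'll reuse in the
sketches below, not the actual code:

# The first annotation is the whole page; the rest are individual words.
full_text = response.text_annotations[0].description

words = []
for annotation in response.text_annotations[1:]:
    vertices = annotation.bounding_poly.vertices  # the 4 corners of the word's box
    words.append({
        "text": annotation.description,
        "x": min(v.x for v in vertices),  # left edge of the word
        "y": min(v.y for v in vertices),  # top edge of the word
    })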
(For some reason, italics are not recognized, which is a problem,
but not a huge one. See the discussion of possible alternatives at
the end of this post.)
Getting to a readable text
[ss_example_zones]
The words in text_annotations come roughly in document order, from
top-left to bottom-right. You'd think the full text from the first
element would be exactly what we need, but there's a catch. Several
catches, actually:
* Many pages have comments or subtitles in the margins that don't
belong in the main text, but the OCR mixes them in following the
page flow, creating a mess
* There are extensive footnotes that need to be properly tagged
since they're not part of the main text
* Each page has a header that we need to remove for smooth reading
across pages
* Every 16 pages, there's a 'signature mark' at the bottom. These
need to go too
So we need to process the OCR output to properly identify all these
elements and mark paragraphs correctly.
Digression: more details about traditional book printing and the role
of the signature mark
Books used to be printed on large sheets that were later folded in 8
(in octavo) or in 4 (in quarto) and then sewn together to make a
book. Here's an example of an in-octavo printing sheet from a
Shakespeare play:
[Venus-and-Adonis-image-of-A-sheet-1024x668]
The folding and assembling were done by different teams than the ones
operating the printing press, and they needed instructions so as to
fold and group pages in the correct order. That's the purpose of the
signature mark: to tell the binder how to fold the sheets and, once
folded, how to group the resulting booklets.
The signature mark therefore typically contains the name of the book
(so that it doesn't end up mixed with other books in the printing
factory) and a number or a letter that tells its order. It's printed
on just one side of the sheet, where the first page should appear
when the sheet is folded; so for an in-octavo it will appear once
every 16 pages in the printed book (or on the scans).
We need to remove it because otherwise it will end up polluting the
final text. A very crude method would be to remove the last line
every 16 pages, but that would not be very robust if there were
missing scans or inserts, etc. I prefer to check the last line of
every page against the expected signature-mark text, measuring a
Levenshtein distance to account for OCR errors.
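In code, that check might look something like this sketch (the
signature text and the distance threshold below are illustrative
values, not the actual ones):

def levenshtein(a, b):
    # Classic dynamic-programming edit distance between two strings.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

SIGNATURE_TEXT = "MEMOIRES DE SAINT-SIMON. I."  # hypothetical mark for volume 1

def is_signature_mark(last_line, max_distance=5):
    # The last line of a page is a signature mark if it is 'close enough'
    # to the expected text, despite OCR errors.
    return levenshtein(last_line.strip().upper(), SIGNATURE_TEXT) <= max_distance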
Center vs margin
First up: figuring out what's main text and what's margin commentary.
Pages aren't all exactly the same size or layout, so we can't use
fixed coordinates. But the main text is always full-justified, and
words come in (roughly) document order.
So in order to identify the left and right margins, I do a first pass
on all words in the page to identify leftmost and rightmost words
(and mark them as such):
* Track each word's horizontal position
* A word is leftmost when its x position is less than the previous
  word's (the reading order has wrapped to a new line)
* A rightmost word is the one just before a leftmost word
The margins will be where the 'majority' of those leftmost and
rightmost words are.
To understand, let's look at the left margin for example:
* some words are way to the left because they belong to comments in
the left margin, so if we simply took the lowest horizontal
position of any word on the page, we would end up with a left
margin too far to the left (and therefore, comments would be
mixed up inside the main text)
* some words, while being the first word on a line, have a
horizontal offset, either because they mark the start of a
paragraph (left indent) or because they're quotes or verse, etc.
By grouping leftmost words into buckets, and selecting the largest
bucket, we can find words that are "aligned" along the true left
margin, and therefore, find the left margin.
We can find these "majority" positions either through statistical
methods like interquartile range, or by rounding x positions and
picking the most common values.
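A rough sketch of the bucketing version (the 10-pixel bucket width is
an arbitrary value, to be tuned to the scans):

from collections import Counter

def find_left_margin(leftmost_xs, bucket_width=10):
    # Round the x positions of leftmost words into buckets of bucket_width
    # pixels, then take the most populated bucket as the true left margin.
    buckets = Counter(x // bucket_width for x in leftmost_xs)
    best_bucket, _ = buckets.most_common(1)[0]
    return best_bucket * bucket_width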
Once we've got these two groups of edge words, anything further left
than the left group is comment in the left margin, and anything
further right than the right group is comment in the right margin.
Then we can do another pass to properly sort everything into either
margin comments, or center block.
Lines
Since we're not using the full text but individual words, we have to
build lines (and then paragraphs), using words' coordinates.
Making lines is pretty straightforward: we group words by vertical
position in each section, then sort each line by the x position of
the words.
To group by vertical position, I just go through words in order:
* If a word is close enough vertically to the previous word (within
some threshold), it goes in the current line
* If it's too far, it starts a new line
Finding the right threshold takes some trial and error. Too big and
you'll merge different lines; too small and you'll split single lines
unnecessarily.
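A minimal sketch of that grouping, reusing the word records from
above (the default threshold is a placeholder):

def group_into_lines(words, y_threshold=12):
    # y_threshold is a placeholder: too large merges distinct lines,
    # too small splits single lines.
    lines = []
    prev_y = None
    for word in words:  # words come roughly in document order
        if prev_y is not None and abs(word["y"] - prev_y) <= y_threshold:
            lines[-1].append(word)  # close enough vertically: same line
        else:
            lines.append([word])    # too far: start a new line
        prev_y = word["y"]
    # Within each line, sort words left to right.
    return [sorted(line, key=lambda w: w["x"]) for line in lines]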
But getting the proper vertical position is trickier than it sounds.
Even on perfectly straight pages, words on the same line often have
different y-positions, because their bounding boxes include different
letter parts (like the tail of a 'p' or the top of a 't').
As it is, the word 'maman' with no tall or hanging letters will have
a different y-position than 'parent' which has both. What we would
really need is the baseline, but unfortunately we don't have that.
[baseline_image]
While processing lines, we also:
* Remove the header (it's always the first line, except on title
pages)
* Spot and remove signature marks by checking their content (see
above)
Identifying the footnotes block
One main goal here is making those rich footnotes accessible, so
identifying them properly matters. But doing it automatically isn't
simple.
I tried spotting footnotes by word or line density, but that wasn't
reliable enough.
What we know about footnotes:
* They're separated from the main text by a gap; yet there are
other gaps too, so we can't just take the first one we find
* Working up from the bottom to the first gap works better, but
footnotes can have gaps too
* Footnotes usually start with numbers and the first one on a page
is typically '1.' - that helps. But sometimes footnotes continue
from the previous page with no number, and some pages have
footnotes-within-footnotes using letters (a. b. c.)
Using all this info, I built a system that gets it right about 90% of
the time.
For the rest, I had to add manual input (more on that in the web
section).
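To give an idea, here is a stripped-down version of that kind of
heuristic (the gap threshold is an assumption, lines are assumed to
carry their text and top y position, and the harder cases above are
deliberately ignored):

import re

# '1.' for the first footnote of a page, 'a.' 'b.' 'c.' for notes within notes.
FOOTNOTE_START = re.compile(r"^\s*(1\.|[a-c]\.)")

def find_footnote_start(lines, gap_threshold=30):
    # Walk up from the bottom of the page: a large vertical gap followed by a
    # line that looks like a first footnote marks the start of the block.
    # Returns None when nothing is found (e.g. a note continued from the
    # previous page), which is where manual input takes over.
    for i in range(len(lines) - 1, 0, -1):
        gap = lines[i]["y"] - lines[i - 1]["y"]
        if gap > gap_threshold and FOOTNOTE_START.match(lines[i]["text"]):
            return i
    return None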
Finding paragraphs
Finding paragraphs should be simple - they start with a positive
indent from the left margin. But because, even after deskewing,
images aren't perfectly straight, the true left margin isn't a
vertical line, so some words look indented when they're really not.
We can try using the rightmost of the leftmost words as the reference
margin (excluding outliers, of course), but then we miss some actual
paragraphs (the classic tradeoff between false positives and false
negatives).
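The indent test itself is trivial, as this sketch shows (the
threshold is an assumption); choosing the reference margin is the
hard part:

def starts_paragraph(line, left_margin, indent_threshold=15):
    # A line opens a paragraph when its first word sits clearly to the right
    # of the left margin. Too small a threshold flags skew artifacts as
    # paragraphs; too large misses shallow indents.
    return line[0]["x"] - left_margin > indent_threshold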
The real problem with skewed images is that they're not just rotated,
they're distorted, which means the angle changes across the page.
Standard deskewing can't fix this, and I haven't found a way to
correct the distortion that actually works and is fast enough,
whether using ImageMagick or Python libraries like PIL or scipy.
Truly distorted images are fairly rare so I didn't spend too much
time looking, though.
Logs
I cannot recommend logging highly enough! It stores all sorts of
information during parsing so that one can examine it later to
identify problems, spot outliers for further review, and in general
keep a record of what happened. Python's excellent logging module is
versatile and efficient. Here's an example output:
[example_log]
And here are the corresponding configuration parameters:
import logging
import os

logging.basicConfig(
    filename = os.sep.join([tomes_path, logfile]),
    encoding = "utf-8",
    filemode = "a",
    format = "%(asctime)s - %(levelname)s - %(filename)s - %(funcName)s: %(message)s",
    style = "%",
    datefmt = "%Y-%m-%d %H:%M",
    level = logging.DEBUG,
)
Web interface
I built a web interface to make it easy to input manual page info and
check OCR quality. It shows recognized zones and line numbers for
each page. If the zones are wrong, you can just type in where the
footnotes actually start. It can be done in no time (perhaps 10
minutes for a 600-page book, if that).
You can check out where we're at here. Hovering over the OCR text
highlights the corresponding zone in the page image. I'm planning to
add text correction features too.
Things left to do
Spellcheck
I've done basic spellcheck for hyphenated words: the program checks a
French word list for the unhyphenated version and uses it if found.
This fixes hyphenations at line ends.
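Something along these lines, assuming french_words is a set of known
word forms loaded from the list:

def dehyphenate(first_part, second_part, french_words):
    # first_part ends a line with a hyphen ('con-'), second_part starts the
    # next line ('nue'); join them if the result is a known French word.
    joined = first_part.rstrip("-") + second_part
    if joined.lower() in french_words:
        return joined                    # e.g. 'connue': drop the hyphen
    return first_part + second_part      # keep the hyphen (compound, proper noun...)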
Since proper nouns for persons or places aren't in standard French
dictionaries, the program keeps track of unknown words for manual
review and dictionary updates.
I still need to build a more complete spellcheck system, though.
Lies, Damned Lies, and LLMs (about footnotes references)
Footnotes references show up twice: before the footnotes themselves
in normal text, and next to words in the main text as superscript.
But the OCR often misses the superscript ones - they're too small and
the scans too rough. Some turn into apostrophes or asterisks, but
most just vanish.
I tried using AI for this. The idea was to have a model place
footnote references in the main text based on what the footnotes were
about. To make sure it caught everything, I asked the model to give
me:
* Number of footnotes found
* Number of references placed
* Whether these matched
It was a complete flop. Using OpenRouter, I tested over 200 models.
More than 70% couldn't even count the footnotes right, but that
wasn't the worst part.
The "best" models just made stuff up to meet the requirements. They
lied in three ways:
* Basic (stupid) lies: wrong counts but claiming they matched
  ('footnotes: 5, references: 3, match: true')
* Better lies: claiming they placed references when they hadn't
* Premium lies: making up new text to attach footnotes to when they
weren't sure where they went (against explicit instructions in
the prompt never to do that)
Other general approaches
The main difficulty of this project lies in correctly identifying
page zones; wouldn't it be possible to properly find the zones during
the OCR phase itself, instead of rebuilding them afterwards?
Tesseract for example can in theory do page segmentation but it's
brittle and I could never get it to work reliably. (Its quality in
OCR is also way lower than Google Vision in my experience.) Trying to
get LLMs with vision to properly identify zones also proved slow and
unreliable, and the risk of hallucinated results is unacceptable,
especially as a first step. Non-deterministic systems
may be fine for creative projects, but not here. (Once we have a
reliable reference we can then play with LLMs and if necessary,
control the results by measuring the distance to the source.)
But the more fundamental reason why I think it's better to first do
OCR and then analyze the results to reconstruct the text is that OCR
is costly and parsing JSON isn't, all the more so when the number of
pages is large.
When OCR fails and you have to do it again, it takes time and money.
Parsing JSON with Python, by contrast, is essentially instantaneous,
doesn't require any special hardware, and can therefore be improved
and run again and again until the result is satisfactory.
Human review
After these experiments, it's clear some human review is needed for
the text, including spelling fixes and footnote placement. Initial
tests show it takes 1-2 minutes per page, so about two days per
volume (10-20 hours) or six months for the whole book. It's a lot,
but doable. That's our next step.
Still, I keep improving the automatic parsing described above. It's
important not to jump into manual text corrections too early - once
we start making those changes by hand, running the automatic parsing
again becomes tricky or impossible. So for now, we'll focus on making
the automatic process as good as it can get.
2024-12-10