[HN Gopher] Show HN: Adventures in OCR
___________________________________________________________________
Show HN: Adventures in OCR
Hello HN! In a recent "Ask HN: What are you working on?" thread, I
mentioned I was working on OCRing a large book:
https://news.ycombinator.com/item?id=41971614 The post generated
some interest, so I thought I would keep HN posted.

The book is Saint-Simon's Memoirs -- an invaluable historical
account of the French court under Louis XIV, full of wit, sharp
observations, and of incredible literary value. I'm OCRing the
reference edition, published between 1879 and 1930, which contains
a lot of comments and footnotes: 45 volumes, ~27,000 pages.

Here's a link to a blog post that describes the techniques used so
far (the project is still ongoing):
https://blog.medusis.com/38_Adventures+in+OCR.html

But you may also directly access the result here:
https://divers.medusis.net/boislisle/pub

This web app (not optimized for mobile, sorry) solves a tricky
problem of preloading images efficiently. In short: preloading the
next image isn't enough, since browsers will repaint if an image is
moved or scaled. And browsers won't paint at all if visibility is
hidden or opacity is zero; they will paint only when those values
change. On an average, slow machine, this takes visible time. But
if an image is simply behind another element, it will be painted,
and removing the covering element or changing the z-index will not
trigger a repaint. (Preloading is important because it lets one
review results fast; if one has to wait 150-200 ms between images,
it's simply discouraging.)

Would love to hear feedback; happy to answer any question!
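
A minimal sketch of that trick in the browser, for illustration
(element structure and z-index values are assumptions, not taken
from the actual app):

    // TypeScript / DOM sketch. The currently displayed scan is assumed
    // to sit in `container` at z-index 1. The next scan is appended
    // *behind* it, so the browser actually paints it while it is covered.
    function preloadBehind(url: string, container: HTMLElement): HTMLImageElement {
      const img = document.createElement("img");
      img.src = url;
      img.style.position = "absolute";
      img.style.top = "0";
      img.style.left = "0";
      img.style.zIndex = "0"; // behind the visible image, but still painted
      // Do NOT use visibility:hidden or opacity:0 here: the browser would
      // skip painting and only paint when the value changes -- exactly the
      // delay we are trying to avoid.
      container.appendChild(img);
      return img;
    }

    function reveal(img: HTMLImageElement): void {
      // The image was already painted while covered, so raising its
      // z-index does not trigger a repaint.
      img.style.zIndex = "2";
    }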
Author : bambax
Score : 45 points
Date : 2024-12-17 17:00 UTC (5 hours ago)
(HTM) web link (blog.medusis.com)
(TXT) w3m dump (blog.medusis.com)
| complexworld wrote:
| Getting the footnotes right is going to be really tricky.
| Sometimes I couldn't even read the superscript numbering on the
| original scans. And that was after zooming in to the max.
|
| Reliably identifying the superscript locations should be enough
| since they are in the same order as the footnotes.
|
| It's a little early for feature requests... but I would love to
| see an EPUB edition! It shouldn't be too hard once the hard work
| of getting the data structured is done.
| bambax wrote:
| Yes. The original idea was to have some LLM place footnote
| references in the text, based on the content of the footnotes
| themselves, but as I say in the blog post, that failed
| spectacularly.
|
| Now another idea is to manually put placeholders for footnote
| references in the text, and then number them automatically.
| Before that, I manually enter the number of footnotes on each
| page, for verification. I have already done this for the first
| two volumes; it's pretty fast. Having the number of footnotes
| on a page lets me:
|
| - check that the number of footnotes extracted is correct
|
| - (and therefore) check that the footnote numbers are correct
| (from 1 to n, in order)
|
| - check that the number of footnote references is correct (it
| should exactly match the number of footnotes)
|
| - and finally, properly number the placeholders.
|
| Manually inputting numbers in the main text would be very
| difficult and error-prone, but simply putting placeholders and
| checking them automatically should be much faster and safer.
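
A rough sketch of those checks (the placeholder token "[fn]" and
the page shape are assumptions for illustration, not bambax's
actual format):

    // TypeScript sketch: verify footnote counts for one page, then turn
    // "[fn]" placeholders into sequential references.
    interface Page {
      text: string;        // body text with "[fn]" placeholders
      footnotes: string[]; // footnotes extracted from the page, in order
      expected: number;    // footnote count entered manually for this page
    }

    function checkAndNumber(page: Page): string {
      // 1. manual count vs. footnotes actually extracted
      if (page.footnotes.length !== page.expected)
        throw new Error(`expected ${page.expected} footnotes, found ${page.footnotes.length}`);

      // 2. the footnotes themselves should be numbered 1..n, in order
      page.footnotes.forEach((fn, i) => {
        if (!fn.trimStart().startsWith(String(i + 1)))
          throw new Error(`footnote ${i + 1} does not start with its number`);
      });

      // 3. the number of references must match the number of footnotes
      const refs = (page.text.match(/\[fn\]/g) ?? []).length;
      if (refs !== page.expected)
        throw new Error(`expected ${page.expected} references, found ${refs}`);

      // 4. finally, number the placeholders 1..n
      let n = 0;
      return page.text.replace(/\[fn\]/g, () => `[${++n}]`);
    }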
| ksampath02 wrote:
| You could try Aryn DocParse, which segments your documents first
| before running OCR: https://www.aryn.ai/ (full disclosure: I work
| there).
| bambax wrote:
| I will try that, thanks.
| wll wrote:
| Use a ~SoTA VLM like Gemini 2.0 Flash on the images. It'll
| produce, zero-shot, de-hyphenated text in semantic HTML with
| linked footnotes.
| bambax wrote:
| Hallucinations are problematic, and they're hard to defend
| against when there's only one source of truth. I was surprised
| by the creativity that LLMs showed for the simple task of
| placing footnote references, as I explain in the post.
|
| ... But there's no harm in trying. At the very least it could be
| done in conjunction with traditional OCR, to check for whole
| sentences of pure invention.
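
One possible way to run that cross-check, purely as an illustration
(the sentence splitting and the 0.6 threshold are arbitrary
choices):

    // TypeScript sketch: flag sentences that appear in the LLM
    // transcription but whose words are mostly absent from the raw OCR
    // text -- candidates for pure invention.
    function words(s: string): string[] {
      return s.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, " ").split(/\s+/).filter(Boolean);
    }

    function suspectSentences(llmText: string, ocrText: string, threshold = 0.6): string[] {
      const ocrWords = new Set(words(ocrText));
      // naive sentence split on . ! ? -- good enough for a sanity check
      const sentences = llmText.split(/(?<=[.!?])\s+/);
      return sentences.filter(s => {
        const w = words(s);
        if (w.length === 0) return false;
        const found = w.filter(x => ocrWords.has(x)).length;
        return found / w.length < threshold; // most words missing from the OCR output
      });
    }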
| TacticalCoder wrote:
| I've not done it at scale, but so far I've had a very good
| experience with OCR using AI models. Maintenance bill for my
| car in German: OCR, boom, translation to French in no time.
| Works amazingly well.
| gregschlom wrote:
| "A very crude method would be to remove the last line every 16
| pages but that would not be very robust if there were missing
| scans or inserts, etc. I prefer to check every last line of every
| page for the content of the signature mark, and measuring a
| Levenshtein distance to account for OCR errors."
|
| I'm curious: did you also check whether the signature mark was
| indeed found every 16 pages? Were there any scans missing?
|
| Great project btw!
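
The check quoted above could look roughly like this (a sketch;
Levenshtein is implemented inline, and the signature-mark string is
a placeholder, not what the edition actually prints):

    // TypeScript sketch: classic dynamic-programming Levenshtein distance.
    function levenshtein(a: string, b: string): number {
      const dp = Array.from({ length: a.length + 1 }, (_, i) =>
        Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
      );
      for (let i = 1; i <= a.length; i++)
        for (let j = 1; j <= b.length; j++)
          dp[i][j] = Math.min(
            dp[i - 1][j] + 1,                                  // deletion
            dp[i][j - 1] + 1,                                  // insertion
            dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
          );
      return dp[a.length][b.length];
    }

    // Does the last OCR'd line of a page look like the signature mark,
    // allowing for OCR errors? "MEMOIRES DE SAINT-SIMON" is a placeholder.
    function isSignatureLine(lastLine: string, mark = "MEMOIRES DE SAINT-SIMON", maxDist = 5): boolean {
      return levenshtein(lastLine.trim().toUpperCase(), mark) <= maxDist;
    }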
| bambax wrote:
| Yes, that's one of the (many) benefits of logging!
|
| And in fact, there is a hiatus, because the introduction at the
| beginning is from a different "sub-book", where the pages are
| numbered using Roman numerals. Typically the introduction would
| be written and typeset after the main book had been typeset, so
| its number of pages would not be known in advance, and that's
| why it uses a different numbering system.
|
| So one finds a signature mark on pages 9, 25, 41, 57, 73, 89,
| and then it starts again at pages 93, 109, 125, 141, 157, 173,
| 189, etc. (those numbers come from the filenames of the scans,
| not the numbers printed on the pages).
|
| => Another reason for not starting with the first signature
| mark and simply adding 16 is that it would miss the change of
| sub-book (or any irregular number of pages, for any reason).
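
To catch that kind of irregularity automatically (sub-book
boundary, insert, or missing scan), it is enough to log the scan
numbers where a mark was found and flag any interval other than
16 -- a sketch, reusing the isSignatureLine helper from the
previous snippet:

    // pagesWithMark: scan numbers (from the filenames) whose last line
    // matched the signature mark, e.g. [9, 25, 41, 57, 73, 89, 93, 109, ...]
    function flagIrregularIntervals(pagesWithMark: number[]): void {
      for (let i = 1; i < pagesWithMark.length; i++) {
        const gap = pagesWithMark[i] - pagesWithMark[i - 1];
        if (gap !== 16)
          console.warn(`irregular signature interval ${gap} between scans ` +
                       `${pagesWithMark[i - 1]} and ${pagesWithMark[i]}`);
      }
    }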
| gregschlom wrote:
| For the human review part: maybe crowdsource it? Make the book
| available for reading online, with a UI to submit corrections
| (Wikipedia-style).
| pronoiac wrote:
| Oh wow! I've worked on turning PAIP (Paradigms of Artificial
| Intelligence Programming) from a book into a bunch of Markdown
| files, but that's "only" about a thousand pages long, compared to
| the roughly 27,000 pages of all those volumes. I have advice,
| possibly helpful, possibly not.
|
| Getting higher quality scans could save you some headaches. Check
| the Internet Archive. Or, get library copies, and the right
| camera setup.
|
| Scantailor might help; it lets you semi-automate a chunk of
| things, with interactive adjustments. I don't know how its
| deskewing would compare to ImageMagick. The signature marks might
| be filtered out here.
|
| I wrote out some of my process for handling scans here -
| https://github.com/norvig/paip-lisp/releases/tag/v1.2 . I maybe
| should blog about it.
|
| If you get to the point of collaborative proofreading, I highly
| recommend Semantic Linefeeds - each sentence gets its own line.
| https://rhodesmill.org/brandon/2012/one-sentence-per-line/ I got
| there by:
|
| * giving each paragraph its own line
|
| * then, linefeed at punctuation, maybe with quotation marks and
| parentheses? It's been a while
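
A crude first pass at that, for illustration (splitting only on
sentence-final punctuation, so abbreviations like "M. de
Saint-Simon" will be split incorrectly and need manual fixing):

    // TypeScript sketch: semantic linefeeds. Input: one paragraph per
    // string. Output: one sentence per line, blank line between paragraphs.
    function toSemanticLinefeeds(paragraphs: string[]): string {
      return paragraphs
        .map(p => p.replace(/([.!?]["')\]]?)\s+/g, "$1\n"))
        .join("\n\n");
    }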
| bambax wrote:
| You are right that the quality of the scans is paramount!
| Unfortunately I don't have access to the physical books and
| have to work with the scans as they are (they're not good). But
| I will look at Scantailor; it looks interesting.
|
| For now I reconstruct paragraphs in HTML, but I could do
| markdown just as well (where paragraph breaks are marked by
| double line breaks, and single line breaks don't count).
|
| Collaborative proofreading would be cool but it would require
| some way of properly tracking who wrote what, and I'm not sure
| what to use or if I should build a simple system from scratch.
| Do you have recommendations?
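
The markdown option mentioned above is mostly a matter of how the
reconstructed paragraphs are joined; a sketch, assuming paragraphs
are already available as arrays of OCR'd lines:

    // TypeScript sketch: join each paragraph's lines with spaces (naively
    // rejoining words hyphenated at line ends) and separate paragraphs
    // with a blank line, which is how markdown marks paragraph breaks.
    function paragraphsToMarkdown(paragraphs: string[][]): string {
      return paragraphs
        .map(lines =>
          lines
            .map(l => l.trim())
            .join(" ")
            // naive de-hyphenation; wrong for real compounds like "Saint-Simon"
            .replace(/(\p{L})- (\p{L})/gu, "$1$2")
        )
        .join("\n\n");
    }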
| TacticalCoder wrote:
| If it's to be really 100% automated, I don't think there's much
| of a solution besides recreating the exact layout, using the very
| same font, and then superimposing the "OCR then re-rendered" text
| on the original scan to see if they're close enough. This means
| finding the various fonts, sizes, and styles (italic, bold, etc.).
|
| But we'll get there eventually with AIs. We'll be able to say:
| _"Find me the exact font, styles, etc., and re-render it using
| InDesign (or LaTeX or whatever you fancy), then compare with
| the source and see what you got wrong. Rinse and repeat."_
|
| We'll eventually have the ability to do just that.
| throwaway81523 wrote:
| You could upload the books to the Internet Archive and let their
| OCR pipeline take a try. It is (or at least was) written around
| Abbyy. Results weren't great but they were a start.
|
| I wonder what eventually happened with Ocropus, which was
| supposed to help with page segmentation. I was a bit disappointed
| to see that this article used Google Vision as its OCR engine; I
| was hoping for something self-hosted.
| zozbot234 wrote:
| The book is already being worked on here:
| https://fr.wikisource.org/wiki/Livre:Saint-Simon_-_M%C3%A9mo...
| (volume 1 of 20). Not the same edition as the one OP is working
| with, but it's a start.
___________________________________________________________________
(page generated 2024-12-17 23:00 UTC)