[HN Gopher] Ask HN: OCR for 100 year old (German) handwritten cu...
___________________________________________________________________
Ask HN: OCR for 100 year old (German) handwritten cursive script?
I'm looking for an OCR solution for about 200 pages of text. It's
handwritten German script from about 100 years ago, and I can barely
read the handwriting myself. Google Translate sometimes manages to
OCR certain parts, but nothing useful (I don't need the translation
part of GT). Which solutions out there would be able to recognize
old handwritten script?
Author : jbverschoor
Score : 28 points
Date : 2024-01-15 19:08 UTC (3 hours ago)
| robertknight wrote:
| You could try something like https://aws.amazon.com/textract/ or
| https://cloud.google.com/vision/docs/handwriting. Both have
| support for modern handwriting. I don't know if it will work with
| a script written a century ago though.
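For anyone who wants to try the Textract route on one scanned page, a minimal sketch might look like the following. It assumes boto3 is installed and AWS credentials are already configured. Both function names are made up for illustration, and nothing here says how well the service copes with century-old script.

```python
def lines_from_textract_response(response):
    """Collect the text of every LINE block from a Textract response dict."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def ocr_image_with_textract(path):
    """Send one JPEG or PNG scan to DetectDocumentText and return its lines.

    Assumption: boto3 is installed and AWS credentials are configured.
    """
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("textract")
    with open(path, "rb") as f:
        response = client.detect_document_text(Document={"Bytes": f.read()})
    return lines_from_textract_response(response)
```

Calling `ocr_image_with_textract("page_001.jpg")` (a hypothetical file name) would return the recognized lines as a list of strings. Running it on two or three sample pages first is a cheap way to see whether the output is usable at all.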
| 082349872349872 wrote:
| if it's https://en.wikipedia.org/wiki/Sutterlin I doubt
| anything trained on current script would make any more sense of
| it than we do
| sandreas wrote:
| You can train Tesseract to recognize handwriting[1], but the
| first and most important step would be the preprocessing of your
| documents. I would recommend starting with a local adaptive
| thresholding algorithm[2] like Sauvola for binarization. The
| preprocessing steps would be[3]:
|
|   1) Binarization
|   2) Skew Correction
|   3) Noise Removal
|   4) Thinning and Skeletonization
|
| Probably you are facing "Sütterlin"[4], which differs quite a bit
| from modern German handwriting.
|
| In your case (only 200 pages) it might be easier to use template
| matching[5] to identify similar characters and just
| "transliterate" matches into modern printed letters (like an
| overlay over the original text). This way you would have a quick
| solution while still being accurate enough to just read it.
|
| [1]: https://tesseract-ocr.github.io/tessdoc/#training-for-tesser...
|
| [2]: https://brandonmpetty.github.io/Doxa/WebAssembly/
|
| [3]: https://towardsdatascience.com/pre-processing-in-ocr-fc231c6...
|
| [4]: https://de.wikipedia.org/wiki/S%C3%BCtterlinschrift
|
| [5]: https://docs.opencv.org/3.4/d4/dc6/tutorial_py_template_matc...
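To make the Sauvola binarization step mentioned above concrete, here is a rough pure-Python sketch. It uses the usual per-pixel threshold T = mean * (1 + k * (std / R - 1)) over a local window. The function name, the default window size, and the defaults k = 0.2 and R = 128 are assumptions for illustration, not values taken from the linked tools.

```python
import math


def sauvola_binarize(gray, window=15, k=0.2, R=128.0):
    """Binarize a grayscale image (list of rows, values 0-255) with Sauvola.

    Per-pixel threshold: T = mean * (1 + k * (std / R - 1)), where mean and
    std come from a window x window neighbourhood clipped at the borders.
    Returns rows of 0 (ink) and 255 (background).
    """
    height, width = len(gray), len(gray[0])
    half = window // 2
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            y0, y1 = max(0, y - half), min(height, y + half + 1)
            x0, x1 = max(0, x - half), min(width, x + half + 1)
            values = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            mean = sum(values) / len(values)
            var = sum((v - mean) ** 2 for v in values) / len(values)
            threshold = mean * (1 + k * (math.sqrt(var) / R - 1))
            row.append(0 if gray[y][x] <= threshold else 255)
        out.append(row)
    return out
```

This loop is far too slow for full-resolution scans and is only meant to show the arithmetic. For real pages, a library implementation such as scikit-image's `threshold_sauvola` is probably the better choice.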
| nadermx wrote:
| This is great. I also found
| https://github.com/IgorMeloS/OCR/blob/main/7%20-%20template-...
| which is part of https://github.com/IgorMeloS/OCR and could be
| useful for this as well.
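The template matching idea from this subthread boils down to sliding a small image of one known letter over the page and scoring every position. Below is a minimal pure-Python sketch using zero-mean normalized cross-correlation. The function name and the list-of-rows image format are assumptions for illustration; OpenCV's `cv2.matchTemplate` with `cv2.TM_CCOEFF_NORMED` computes a comparable score much faster.

```python
import math


def match_template(image, template):
    """Slide `template` over `image` and return (best_score, (row, col)).

    Both arguments are lists of rows of grayscale values. The score is the
    zero-mean normalized cross-correlation in [-1, 1]; 1 is a perfect match.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(d * d for d in t_dev))

    best_score, best_pos = -1.0, (0, 0)
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = [image[y + j][x + i] for j in range(th) for i in range(tw)]
            p_mean = sum(patch) / len(patch)
            p_dev = [v - p_mean for v in patch]
            p_norm = math.sqrt(sum(d * d for d in p_dev))
            if t_norm == 0 or p_norm == 0:
                continue  # flat patch or template: correlation is undefined
            score = sum(a * b for a, b in zip(t_dev, p_dev)) / (t_norm * p_norm)
            if score > best_score:
                best_score, best_pos = score, (y, x)
    return best_score, best_pos
```

Handwritten letters vary in size and slant from one occurrence to the next, so a single template per letter will likely miss many instances. Collecting several templates per letter and accepting matches above a score cutoff is one way to soften that.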
| throwup238 wrote:
| The letters from [4] remind me of modern Russian cursive which
| has some similarly interesting changes to some letters to make
| them faster to write. I wonder if there's any research on
| Russian cursive OCR that could help
| lainga wrote:
| Does it look like Sütterlin? Are you familiar with it?
|
| [] https://en.wikipedia.org/wiki/S%C3%BCtterlin
| jbverschoor wrote:
| Not familiar with it, but it doesn't look like that. I wish
| haha
| weinzierl wrote:
| 100 years ago, Sütterlin would be pretty likely. If your
| sample is not Sütterlin, I would consider the possibility that
| it is older.
| ginko wrote:
| Sütterlin was introduced in schools at the beginning of the
| 20th century. Kurrent was still widely used by adults well
| into the century.
| ginko wrote:
| Does it look like Kurrentschrift?
|
| https://de.wikipedia.org/wiki/Deutsche_Kurrentschrift
| sneak wrote:
| For only 200 pages, I'd farm it out to humans.
| TillE wrote:
| "Humans" in this case means specialist historians, but yes.
| telotortium wrote:
| 100-year-old handwriting is probably not _that_ obscure - it's
| not cuneiform or hieroglyphics. There are probably lots of
| people, especially older people, who could transcribe it.
| Finding them would be the issue.
| jbverschoor wrote:
| It's not obscure, but it's like a doctor's handwriting. It
| takes me a lot of effort, if it's possible at all.
| Tomte wrote:
| There are many, many Germans alive who learned Sütterlin and
| similar scripts in school and used them for decades. It's not
| exactly Linear B.
|
| Even I (below 45) read some small Sütterlin texts in school
| (mostly German or history books). Not fluent, but you can
| quite quickly get used to it and decipher things slowly.
| ginko wrote:
| For 100-year-old German handwriting, the historian I'd consult
| would be my grandmother. :)
| huijzer wrote:
| I threw some German medical handwriting images into ChatGPT a
| while back and asked it to transcribe it and it worked pretty
| well. ChatGPT knows a lot about language so that helped in
| filling in the gaps.
| interesse wrote:
| Not OCR, but
| https://www.paul-riebeck-stiftung.de/stiftung/ueber-uns/koop... ?
| AJJB_alt wrote:
| GPT-4 Vision. I have seen some examples of mildly aged-looking
| pages tried.
| re5i5tor wrote:
| I've been scripting GPT-4 Vision to extract structured recipe
| data from handwritten recipe cards, with very good success.
| Can't speak to the German language aspect.
|
| (Edited to clarify more than just transcribing)
| jbverschoor wrote:
| I'll look into that
| serjester wrote:
| I would try this, but the downside is that GPT-4 Vision currently
| doesn't like to extract large text blocks. You could try
| extracting line bounding boxes with PyMuPDF and feeding it
| individual lines.
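If the pages are plain image scans with no text layer to query, one alternative way to get per-line crops is a horizontal projection profile: count ink pixels per row and split wherever a run of empty rows appears. This is a different technique from the PyMuPDF route suggested above, shown here only as a hedged sketch. The function name and the convention that 0 means ink are assumptions.

```python
def find_text_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, for each text line.

    `binary` is a list of rows where 0 means ink and anything else means
    background. A row counts as text if it has at least `min_ink` ink pixels.
    """
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = sum(1 for v in row if v == 0) >= min_ink
        if has_ink and start is None:
            start = y  # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y))  # blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(binary)))  # line runs to the page bottom
    return lines
```

This only works on binarized, deskewed pages. In dense cursive, ascenders and descenders of neighbouring lines can touch, in which case raising `min_ink` or splitting at the rows with the least ink may help.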
| ebbes wrote:
| That's exactly what https://transkribus.ai/ was built for - works
| quite well in my experience, mainly transcribing Deutsche
| Kurrentschrift, c. 1980.
| WalterBright wrote:
| I tried it on an incomprehensible German postcard from 1900,
| and it worked great! Not perfect, but darn good.
| josefritz wrote:
| I've paid for manual transcription before. It's not that
| expensive. Technical solutions are cool, but that option is
| available today.
| sneed_chucker wrote:
| You probably want to put it in front of an actual person and get
| them to transcribe it for you. I don't think there's any off the
| shelf OCR that will work particularly well for it.
|
| I have a close family member who is a historian and frequently
| read and transcribed mid 19th to early 20th century German
| handwriting for his work.
|
| Many historians and archivists in Germany would have the ability
| to transcribe this for you if you reached out to them and paid
| for their time.
| herbst wrote:
| I've had surprisingly good results with
| https://readcoop.eu/transkribus/
| I was going back in time with my family research until I couldn't
| identify a single word anymore. The 'AI' could.
| dimatura wrote:
| I don't know if it will be significantly different from what
| Google Translate does, but I would give the major cloud vendors'
| (Google, Amazon, Microsoft and I guess OpenAI/ChatGPT) OCR
| services a shot. It's pretty simple and cheap to do (like, about
| a dollar for the whole thing). Last time I compared them,
| Google's OCR came out ahead, but it's task-dependent so in your
| case it might be different.
|
| General purpose open-source OCR solutions like Tesseract, TrOCR,
| etc will probably not be as good as the cloud ones, based on my
| experience.
|
| There's some specialized research work out there for antique
| manuscripts, but that will require some digging on your part with
| an uncertain outcome. I think at that point, I would also look
| into manual transcription - for 200 pages, it might be reasonably
| affordable.
| dbish wrote:
| Fwiw, we've found the Azure AI OCR service to be pretty good,
| much better than anything we could get from Tesseract out of
| the box (no tuning).
| WalterBright wrote:
| OCR and Translation are two entirely different endeavors.
| BenoitP wrote:
| I worked in the same space as a company that does this with ML
| (and charges for it), using some form of Recurrent Neural Network
| IIRC. Maybe LSTMs?
|
| They had a contract to index historical French archives composed
| of handwritten Latin documents in Elasticsearch.
|
| Depending on the historical relevance of your documents (read:
| some academic funds), they may be able to help. Doesn't hurt to
| contact them:
|
| https://teklia.com/
| weinzierl wrote:
| Low-hanging fruit when reading these old German scripts is to get
| used to distinguishing the different forms of the letter s. That
| alone will get you far. Same for OCR, it needs to be capable of
| that. Otherwise the result will read as if someone without front
| teeth has written how they speak.
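On the transcription side, a tool or human transcriber that does distinguish the two forms may hand back text containing the long s (ſ, U+017F), which old German scripts use roughly at the start and inside of syllables, with the round s at the end. A small sketch for folding it back into modern spelling, with a function name chosen here purely for illustration:

```python
LONG_S = "\u017f"  # the long s character


def normalize_long_s(text):
    """Replace every long s with a round s, leaving other characters alone."""
    return text.replace(LONG_S, "s")
```

For example, `normalize_long_s("Wa\u017f\u017fer")` gives `"Wasser"`, while a word like "Straße" passes through unchanged.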
| jackhack wrote:
| I ran a sample through my Apple Newton MessagePad: Iss Martha
| auf.
___________________________________________________________________
(page generated 2024-01-15 23:00 UTC)