Subj : Re: King/Gassner/Shanahan
To   : comp.programming,comp.lang.java.programmer,comp.lang.lisp
From : Pascal Bourguignon
Date : Mon Aug 29 2005 03:29 am

"Daniel Dyer" writes:

> On Sun, 28 Aug 2005 22:18:14 +0100, Robert Maas, see
> http://tinyurl.com/uh3t wrote:
>>
>> Does anybody know of a free program that does OCR on PDF files and
>> keeps track of layout so as to convert to reasonable HTML or plain
>> ASCII? Or if I wrote such a program myself would anyone think it was a
>> good thing and pay me money for all that effort? Or am I the only
>> person who thinks that converting a megabyte PDF file to a 30K text
>> file would be a useful utility?
>
> There are various command-line utilities. Search for "pdf2ascii",
> "pdf2html", "pdftohtml", "pdf2txt" etc. Maybe your shell account
> already has one of these available.

But a megabyte PDF that reduces to 30K of text will contain the text
as scanned bitmaps, not as PDF text. On these files, pdf2ascii doesn't
work; you need real OCR.

Happily, it's not too difficult to find free OCR software with Google...

-- 
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?

__Pascal Bourguignon__ http://www.informatimago.com/