[HN Gopher] Show HN: Zerox - document OCR with GPT-4o-mini
___________________________________________________________________
Show HN: Zerox - document OCR with GPT-4o-mini
This started out as a weekend hack with gpt-4o-mini, using the very
basic strategy of "just ask the AI to OCR the document". But it
turned out to perform better than our current Unstructured/Textract
implementation, at pretty much the same cost. I've tested almost
every variant of document OCR over the past year, especially for
things like table/chart extraction, and I've found that rules-based
extraction has always been lacking. Documents are meant to be a
visual representation, after all, with weird layouts, tables,
charts, etc. Using a vision model just makes sense! In general, I'd
categorize this solution as slow, expensive, and non-deterministic.
But 6 months ago it was impossible. And 6 months from now it'll be
fast, cheap, and probably more reliable!
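A minimal sketch of that strategy, assuming the OpenAI Python SDK
(1.x) and a page already rendered to PNG; the prompt and function
name here are illustrative, not Zerox's actual code:

    # Sketch: render a page to PNG elsewhere, then "just ask".
    # Assumes OPENAI_API_KEY is set in the environment.
    import base64
    from openai import OpenAI

    client = OpenAI()

    def ocr_page(image_path: str) -> str:
        with open(image_path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": [
                {"type": "text", "text":
                 "Transcribe this page to markdown. Preserve "
                 "tables. Output only the markdown."},
                {"type": "image_url", "image_url":
                 {"url": f"data:image/png;base64,{b64}"}},
            ]}],
        )
        return resp.choices[0].message.content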
Author : themanmaran
Score : 33 points
Date : 2024-07-23 16:49 UTC (6 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cmpaul wrote:
| Great example of how LLMs are eliminating/simplifying giant
| swathes of complex tech.
|
| I would love to use this in a project if it could also caption
| embedded images to produce something for RAG...
| hpen wrote:
| Yay! Now we can use more RAM, network, energy, etc. to do the
| same thing! I just love hot phones!
| beklein wrote:
| Very interesting project, thank you for sharing.
|
| Do you support the Batch API from OpenAI? This would lower
| costs by 50%. Many OCR tasks are not time-sensitive, so this
| might be a very good tradeoff.
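| A sketch of how that could look, assuming the OpenAI Python
| SDK; the JSONL format and endpoints follow OpenAI's Batch API
| docs, and page_images_b64 is a hypothetical list of base64
| page images:
|
|     import json
|     from openai import OpenAI
|
|     client = OpenAI()
|
|     # One request per page; custom_id ties results to pages.
|     with open("batch.jsonl", "w") as f:
|         for i, b64 in enumerate(page_images_b64):
|             f.write(json.dumps({
|                 "custom_id": f"page-{i}",
|                 "method": "POST",
|                 "url": "/v1/chat/completions",
|                 "body": {"model": "gpt-4o-mini", "messages": [
|                     {"role": "user", "content": [
|                         {"type": "text", "text":
|                          "Transcribe this page to markdown."},
|                         {"type": "image_url", "image_url":
|                          {"url": f"data:image/png;base64,{b64}"}},
|                     ]}]},
|             }) + "\n")
|
|     batch_input = client.files.create(
|         file=open("batch.jsonl", "rb"), purpose="batch")
|     client.batches.create(input_file_id=batch_input.id,
|                           endpoint="/v1/chat/completions",
|                           completion_window="24h")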
| refulgentis wrote:
| FWIW, I have it on good sourcing that OpenAI supplies Tesseract
| output to the LLM, so you're in a great place: best of all
| worlds.
| 8organicbits wrote:
| I'm surprised by the name choice, there's a large company with an
| almost identical name that has products that do this. May be
| worth changing it sooner rather than later.
|
| https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web
| pkaye wrote:
| Maybe call it ZeroPDF?
| froh wrote:
| gpterox
| ot wrote:
| > there's a large company with an almost identical name
|
| Are you suggesting that this wasn't intentional? The name is
| clearly a play on "zero shot" + "xerox".
| UncleOxidant wrote:
| I think they're suggesting that Xerox will likely sue them, so
| they might as well get ahead of that and change the name now.
| 8organicbits wrote:
| > And 6 months from now it'll be fast, cheap, and probably more
| reliable!
|
| I like the optimism.
|
| I've needed to include human review when using previous-
| generation OCR software, when I needed the results to be
| accurate. It's painstaking, but the OCR still offered a speedup
| over fully manual transcription. Have you given any thought to
| human-in-the-loop processes?
| downrightmike wrote:
| Does it also produce a confidence number?
| ravetcofx wrote:
| I don't think OpenAI's API for gpt-4o-mini has any such
| mechanism.
| wildzzz wrote:
| The AI says it's 100% confident that its hallucinations are
| correct.
| surfingdino wrote:
| Xerox tried it a while ago. It didn't end well:
| https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
| merb wrote:
| > This is not an OCR problem (as we switched off OCR on
| purpose)
| yjftsjthsd-h wrote:
| It also says
|
| > This is not an OCR problem, but of course, I can't have a
| look into the software itself, maybe OCR is still fiddling
| with the data even though we switched it off.
|
| But the point stands either way; LLMs are prone to
| hallucinations already, so I would not trust them to not make
| a mistake in OCR because they thought the page would probably
| say something different than it does.
| mlyle wrote:
| > It also says...
|
| It was a problem with the JBIG2 compression codec, which
| cuts and pastes symbols from different parts of the page to
| save space.
|
| > But the point stands either way; LLMs are prone to
| hallucinations already, so I would not trust them to not
| make a mistake in OCR because they thought the page would
| probably say something different than it does.
|
| Anyone trying to solve for the contents of a page uses
| context clues, even humans reading.
|
| You can OCR raw characters (performance is poor); use
| letter frequency information; use a dictionary; use word
| frequencies; or use even more context to know what content
| is more likely. More context is going to result in many
| fewer errors (of course, it may result in a bigger
| proportion of the remaining errors seeming to have
| significant meaning changes).
|
| A small LLM is just a good way to encode this kind of "how
| likely are these given alternatives" knowledge.
| ravetcofx wrote:
| I'd be more curious to see how it performs against local
| models like LLaVA, etc.
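| For anyone wanting to try that, a minimal sketch against a
| local LLaVA served by Ollama (endpoint and fields per Ollama's
| REST API; the model tag and file name are placeholders):
|
|     import base64
|     import requests
|
|     with open("page.png", "rb") as f:
|         b64 = base64.b64encode(f.read()).decode()
|
|     # Ollama accepts base64 images for multimodal models.
|     r = requests.post("http://localhost:11434/api/generate",
|                       json={
|                           "model": "llava",
|                           "prompt": "Transcribe this page to "
|                                     "markdown.",
|                           "images": [b64],
|                           "stream": False,
|                       })
|     print(r.json()["response"])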
| hugodutka wrote:
| I used this approach extensively over the past couple of months
| with GPT-4 and GPT-4o while building https://hotseatai.com. Two
| things that helped me:
|
| 1. Prompt with examples. I included an example image with an
| example transcription as part of the prompt (see the sketch
| after this comment). This reduced GPT's mistakes and improved
| output accuracy.
|
| 2. Confidence score. I extracted the embedded text from the PDF
| and compared the frequency of character triples in the source
| text and GPT's output. If there was a significant difference
| (less than 90% overlap), I would log a warning. This helped
| detect cases where GPT omitted entire paragraphs of text.
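| A sketch of the few-shot prompt from point 1, assuming the
| OpenAI chat completions message format; example_page_b64,
| example_transcription, and target_page_b64 are hypothetical,
| not Hotseat's actual prompt:
|
|     # One worked example (image + known-good transcription),
|     # then the page we actually want transcribed.
|     messages = [
|         {"role": "system",
|          "content": "You transcribe document pages to "
|                     "markdown."},
|         {"role": "user", "content": [
|             {"type": "image_url", "image_url": {"url":
|              f"data:image/png;base64,{example_page_b64}"}}]},
|         {"role": "assistant", "content": example_transcription},
|         {"role": "user", "content": [
|             {"type": "image_url", "image_url": {"url":
|              f"data:image/png;base64,{target_page_b64}"}}]},
|     ]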
| sidmitra wrote:
| > frequency of character triples
|
| What are character triples? Are they trigrams?
| hugodutka wrote:
| I think so. I'd normalize the text first: lowercase it and
| remove all non-alphanumeric characters. E.g., for the phrase
| "What now?" I'd create these trigrams: wha, hat, atn, tno,
| now.
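| A sketch of that check, following the normalization above; the
| 90% threshold comes from the parent comment, and pdf_text /
| gpt_output are hypothetical inputs:
|
|     import re
|     from collections import Counter
|
|     def trigrams(text: str) -> Counter:
|         t = re.sub(r"[^a-z0-9]", "", text.lower())
|         return Counter(t[i:i + 3] for i in range(len(t) - 2))
|
|     # Multiset overlap: what share of the source's trigrams
|     # survive in the OCR output?
|     def overlap(source: str, ocr: str) -> float:
|         a, b = trigrams(source), trigrams(ocr)
|         return sum((a & b).values()) / max(sum(a.values()), 1)
|
|     if overlap(pdf_text, gpt_output) < 0.9:
|         print("warning: GPT may have omitted text")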
| josefritzishere wrote:
| Xerox might want to have a word with you about that name.
___________________________________________________________________
(page generated 2024-07-23 23:06 UTC)