[HN Gopher] Show HN: Zerox - document OCR with GPT-4o-mini
       ___________________________________________________________________
        
       Show HN: Zerox - document OCR with GPT-4o-mini
        
       This started out as a weekend hack with gpt-4o-mini, using the
       very basic strategy of "just ask the AI to OCR the document". But
       it turned out to perform better than our current implementation
       of Unstructured/Textract, at pretty much the same cost. I've
       tested almost every variant of document OCR over the past year,
       especially for things like table / chart extraction, and I've
       found that rules-based extraction has always been lacking.
       Documents are meant to be a visual representation after all, with
       weird layouts, tables, charts, etc., so using a vision model just
       makes sense! In general, I'd categorize this solution as slow,
       expensive, and non-deterministic. But 6 months ago it was
       impossible, and 6 months from now it'll be fast, cheap, and
       probably more reliable!
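
       The core loop is basically just this (a simplified sketch using
       pdf2image and the OpenAI Python SDK, not the exact code in the
       repo):

         from base64 import b64encode
         from io import BytesIO

         from openai import OpenAI
         from pdf2image import convert_from_path

         client = OpenAI()

         def ocr_page(png_bytes):
             # Ask the vision model to transcribe the page as markdown.
             url = "data:image/png;base64," + b64encode(png_bytes).decode()
             content = [
                 {"type": "text", "text": "Convert this page to markdown."},
                 {"type": "image_url", "image_url": {"url": url}},
             ]
             resp = client.chat.completions.create(
                 model="gpt-4o-mini",
                 messages=[{"role": "user", "content": content}],
             )
             return resp.choices[0].message.content

         # Rasterize each PDF page and OCR the pages independently.
         pages = convert_from_path("document.pdf", dpi=200)
         markdown_pages = []
         for page in pages:
             buf = BytesIO()
             page.save(buf, format="PNG")
             markdown_pages.append(ocr_page(buf.getvalue()))
         print("\n\n".join(markdown_pages))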
        
       Author : themanmaran
       Score  : 33 points
       Date   : 2024-07-23 16:49 UTC (6 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cmpaul wrote:
       | Great example of how LLMs are eliminating/simplifying giant
       | swathes of complex tech.
       | 
       | I would love to use this in a project if it could also caption
       | embedded images to produce something for RAG...
        
         | hpen wrote:
         | Yay! Now we can use more RAM, Network, Energy, etc to do the
         | same thing! I just love hot phones!
        
       | beklein wrote:
       | Very interesting project, thank you for sharing.
       | 
       | Do you support the Batch API from OpenAI? That would lower costs
       | by 50%. Many OCR tasks are not time-sensitive, so this might be
       | a very good tradeoff.
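       | 
       | Roughly what I have in mind (an untested sketch with the OpenAI
       | Python SDK's Batch API; the file name and page URLs are made
       | up):
       | 
       |   import json
       |   from openai import OpenAI
       | 
       |   client = OpenAI()
       |   page_image_urls = ["https://example.com/page-0.png"]  # placeholder
       | 
       |   # One JSONL line per page: a normal chat.completions request
       |   # body plus a custom_id to match results back up later.
       |   with open("ocr_requests.jsonl", "w") as f:
       |       for i, url in enumerate(page_image_urls):
       |           f.write(json.dumps({
       |               "custom_id": f"page-{i}",
       |               "method": "POST",
       |               "url": "/v1/chat/completions",
       |               "body": {
       |                   "model": "gpt-4o-mini",
       |                   "messages": [{"role": "user", "content": [
       |                       {"type": "text",
       |                        "text": "Convert this page to markdown."},
       |                       {"type": "image_url",
       |                        "image_url": {"url": url}},
       |                   ]}],
       |               },
       |           }) + "\n")
       | 
       |   batch_file = client.files.create(
       |       file=open("ocr_requests.jsonl", "rb"), purpose="batch")
       |   job = client.batches.create(
       |       input_file_id=batch_file.id,
       |       endpoint="/v1/chat/completions",
       |       completion_window="24h")  # results arrive as an output file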
        
       | refulgentis wrote:
       | Fwiw, I have it on good sourcing that OpenAI supplies Tesseract
       | output to the LLM, so you're in a great place - best of all
       | worlds.
        
       | 8organicbits wrote:
       | I'm surprised by the name choice; there's a large company with
       | an almost identical name whose products do this. It may be worth
       | changing it sooner rather than later.
       | 
       | https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web
        
         | pkaye wrote:
         | Maybe call it ZeroPDF?
        
         | froh wrote:
         | gpterox
        
         | ot wrote:
         | > there's a large company with an almost identical name
         | 
         | Are you suggesting that this wasn't intentional? The name is
         | clearly a play on "zero shot" + "xerox"
        
           | UncleOxidant wrote:
           | I think they're suggesting that Xerox will likely sue them,
           | so they might as well get ahead of that and change the name
           | now.
        
       | 8organicbits wrote:
       | > And 6 months from now it'll be fast, cheap, and probably more
       | reliable!
       | 
       | I like the optimism.
       | 
       | I've needed to include human review when using previous-
       | generation OCR software, at least when I needed the results to
       | be accurate. It's painstaking, but the OCR offered a speedup
       | over fully-manual transcription. Have you given any thought to
       | human-in-the-loop processes?
        
       | downrightmike wrote:
       | Does it also produce a confidence number?
        
         | ravetcofx wrote:
         | I don't think OpenAI's API for gpt-4o-mini has any such
         | mechanism.
        
         | wildzzz wrote:
         | The AI says it's 100% confident that its hallucinations are
         | correct.
        
       | surfingdino wrote:
       | Xerox tried it a while ago. It didn't end well:
       | https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
        
         | merb wrote:
         | > This is not an OCR problem (as we switched off OCR on
         | purpose)
        
           | yjftsjthsd-h wrote:
           | It also says
           | 
           | > This is not an OCR problem, but of course, I can't have a
           | look into the software itself, maybe OCR is still fiddling
           | with the data even though we switched it off.
           | 
           | But the point stands either way; LLMs are prone to
           | hallucinations already, so I would not trust them to not make
           | a mistake in OCR because they thought the page would probably
           | say something different than it does.
        
             | mlyle wrote:
             | > It also says...
             | 
             | It was a problem with employing the JBIG2 compression
             | codec, which cuts and pastes things from different parts of
             | the page to save space.
             | 
             | > But the point stands either way; LLMs are prone to
             | hallucinations already, so I would not trust them to not
             | make a mistake in OCR because they thought the page would
             | probably say something different than it does.
             | 
             | Anyone trying to solve for the contents of a page uses
             | context clues. Even humans reading.
             | 
             | You can OCR raw characters (performance is poor); use
             | letter frequency information; use a dictionary; use word
             | frequencies; or use even more context to know what content
             | is more likely. More context is going to result in many
             | fewer errors (of course, it may result in a bigger
             | proportion of the remaining errors seeming to have
             | significant meaning changes).
             | 
             | A small LLM is just a good way to encode this kind of "how
             | likely are these given alternatives" knowledge.
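             | 
             | A toy version of that "which alternative is more likely"
             | idea, with made-up word frequencies:
             | 
             |   # Pick between OCR candidate readings by word frequency
             |   # (the numbers here are invented for illustration).
             |   WORD_FREQ = {"turn": 5.0e-5, "tum": 1.0e-9,
             |                "modern": 2.1e-5, "modem": 3.0e-6}
             | 
             |   def best_reading(candidates):
             |       # Prefer the candidate that is the more common word.
             |       return max(candidates,
             |                  key=lambda w: WORD_FREQ.get(w.lower(), 0.0))
             | 
             |   print(best_reading(["tum", "turn"]))  # -> "turn"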
        
       | ravetcofx wrote:
       | I'd be more curious to see the performance with local models
       | like LLaVA, etc.
        
       | hugodutka wrote:
       | I used this approach extensively over the past couple of months
       | with GPT-4 and GPT-4o while building https://hotseatai.com. Two
       | things that helped me:
       | 
       | 1. Prompt with examples. I included an example image with an
       | example transcription as part of the prompt. This made GPT make
       | fewer mistakes and improved output accuracy.
       | 
       | 2. Confidence score. I extracted the embedded text from the PDF
       | and compared the frequency of character triples in the source
       | text and GPT's output. If there was a significant difference
       | (less than 90% overlap), I would log a warning. This helped
       | detect cases where GPT omitted entire paragraphs of text.
        
         | sidmitra wrote:
         | >frequency of character triples
         | 
         | What are character triples? Are they trigrams?
        
           | hugodutka wrote:
           | I think so. I'd normalize the text first: lowercase it and
           | remove all non-alphanumeric characters. E.g., for the phrase
           | "What now?" I'd create these trigrams: wha, hat, atn, tno,
           | now.
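           | 
           | Roughly, in code (a minimal untested sketch of that overlap
           | check):
           | 
           |   import re
           |   from collections import Counter
           | 
           |   def trigram_counts(text):
           |       # Lowercase, drop non-alphanumerics, count char triples.
           |       s = re.sub(r"[^a-z0-9]", "", text.lower())
           |       return Counter(s[i:i + 3] for i in range(len(s) - 2))
           | 
           |   def overlap(source_text, ocr_text):
           |       src = trigram_counts(source_text)
           |       out = trigram_counts(ocr_text)
           |       shared = sum(min(n, out[t]) for t, n in src.items())
           |       return shared / max(sum(src.values()), 1)
           | 
           |   # e.g. warn when overlap(pdf_text, gpt_output) < 0.9
           |   print(overlap("What now?", "what now"))  # -> 1.0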
        
       | josefritzishere wrote:
       | Xerox might want to have a word with you about that name.
        
       ___________________________________________________________________
       (page generated 2024-07-23 23:06 UTC)