[HN Gopher] Show HN: Zerox - Document OCR with GPT-mini
___________________________________________________________________
Show HN: Zerox - Document OCR with GPT-mini
This started out as a weekend hack with gpt-4o-mini, using the
very basic strategy of "just ask the AI to OCR the document". But
this turned out to perform better than our current implementation
of Unstructured/Textract, at pretty much the same cost. I've
tested almost every variant of document OCR over the past year,
especially things like table / chart extraction, and I've found
that rules-based extraction has always been lacking. Documents
are meant to be a visual representation after all, with weird
layouts, tables, charts, etc. Using a vision model just makes
sense! In general, I'd categorize this solution as slow,
expensive, and non-deterministic. But 6 months ago it was
impossible. And 6 months from now it'll be fast, cheap, and
probably more reliable!
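A minimal sketch of the idea (not the actual Zerox source; the
helper name and prompt are illustrative), assuming each PDF page
has already been rendered to a PNG buffer:

    import OpenAI from "openai";

    const openai = new OpenAI();

    // "Just ask the AI to OCR the document": send the rendered
    // page image to a vision model and ask for markdown back.
    async function ocrPage(pageImage: Buffer): Promise<string> {
      const response = await openai.chat.completions.create({
        model: "gpt-4o-mini",
        messages: [{
          role: "user",
          content: [
            { type: "text", text: "Convert this page to markdown." },
            { type: "image_url", image_url: {
              url: `data:image/png;base64,${pageImage.toString("base64")}`,
            } },
          ],
        }],
      });
      return response.choices[0].message.content ?? "";
    }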
Author : themanmaran
Score : 217 points
Date : 2024-07-23 16:49 UTC (1 day ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| cmpaul wrote:
| Great example of how LLMs are eliminating/simplifying giant
| swathes of complex tech.
|
| I would love to use this in a project if it could also caption
| embedded images to produce something for RAG...
| hpen wrote:
| Yay! Now we can use more RAM, Network, Energy, etc to do the
| same thing! I just love hot phones!
| hpen wrote:
| Oops guess I'm not sippin' the koolaid huh?
| beklein wrote:
| Very interesting project, thank you for sharing.
|
| Are you supporting the Batch API from OpenAI? This would lower
| costs by 50%. Many OCR tasks are not time-sensitive, so this
| might be a very good tradeoff.
| themanmaran wrote:
| That's definitely the plan. Using batch requests would move
| this closer to the $2/1000 pages mark, which is effectively
| the AWS pricing.
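| A sketch of what that could look like (the JSONL line format
| is the documented Batch API shape; the helper is illustrative):
|
|     import fs from "node:fs";
|     import OpenAI from "openai";
|
|     const openai = new OpenAI();
|
|     // Each line of the JSONL file is one page request, e.g.
|     // {"custom_id": "page-1", "method": "POST",
|     //  "url": "/v1/chat/completions", "body": {...}}
|     async function submitBatch(jsonlPath: string) {
|       const file = await openai.files.create({
|         file: fs.createReadStream(jsonlPath),
|         purpose: "batch",
|       });
|       return openai.batches.create({
|         input_file_id: file.id,
|         endpoint: "/v1/chat/completions",
|         completion_window: "24h", // results in 24h at ~half price
|       });
|     }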
| refulgentis wrote:
| Fwiw, I have it on good sourcing that OpenAI supplies Tesseract
| output to the LLM, so you're in a great place; best of all worlds.
| davedunkin wrote:
| At inference time or during training?
| refulgentis wrote:
| Inference
| 8organicbits wrote:
| I'm surprised by the name choice, there's a large company with an
| almost identical name that has products that do this. May be
| worth changing it sooner rather than later.
|
| https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web
| pkaye wrote:
| Maybe call it ZeroPDF?
| froh wrote:
| gpterox
| ot wrote:
| > there's a large company with an almost identical name
|
| Are you suggesting that this wasn't intentional? The name is
| clearly a play on "zero shot" + "xerox"
| UncleOxidant wrote:
| I think they're suggesting that Xerox will likely sue them so
| might as well get ahead of that and change the name now.
| 8organicbits wrote:
| Even if they don't sue, do you really want to deal with
| people getting confused and thinking you mean one of the
| many pre-existing OCR tools that Xerox produces? A search
| for "Zerox OCR" will lead to Xerox products, for example.
| Not worth the headache.
|
| https://duckduckgo.com/?q=Zerox+OCR
| themanmaran wrote:
| Yup definitely a play on the name. Also the idea of
| photocopying a page, since we do pdf => image => markdown.
|
| We're not planning to name a company after it or anything,
| just the open-source tool. And if Xerox sues I'm sure we could
| rename the repo lol.
| wewtyflakes wrote:
| It still seems reasonable that someone may be confused,
| especially since the one letter of the company name that was
| changed has an identical pronunciation (x --> z). It is like
| offering "Phacebook" or "Netfliks" competitors, but even less
| obviously different.
| qingcharles wrote:
| Surprisingly, http://phacebook.com/ is for sale.
| ned_at_codomain wrote:
| I would happily contribute to the legal defense fund.
| ssl-3 wrote:
| I was involved in a somewhat similar trademark issue once.
|
| I actually had a leg to stand on (my use was not infringing
| at all when I started using it), and I came out of it
| somewhat cash-positive, but I absolutely never want to go
| through anything like that ever again.
|
| > Yup definitely a play on the name. Also the idea of
| photocopying a page,
|
| But you? My God, man.
|
| With these words you have already doomed yourself.
|
| Best wishes.
| neilv wrote:
| > _With these words you have already doomed yourself._
|
| At least they didn't say " _xeroxing_ a page ".
| haswell wrote:
| If they sue, this comment will be used to make their case.
|
| I guess I just don't understand - how are you proceeding as
| if this is an acceptable starting point?
|
| With all respect, I don't think you're taking this
| seriously, and it reflects poorly on the team building the
| tool. It looks like this is also a way to raise awareness
| for Omni AI? If so, I've gotta be honest - this makes me
| want to steer clear.
|
| Bottom line, it's a bad idea/decision. And when bad ideas
| are this prominent, it makes me question the rest of the
| decisions underlying the product and whether I want to be
| trusting those decision makers in the many other ways trust
| is required to choose a vendor.
|
| Not trying to throw shade; just sharing how this hits me as
| someone who has built products and has been the person
| making decisions about which products to bring in. Start
| taking this seriously for your own sake.
| blacksmith_tb wrote:
| If imitation is the sincerest form of flattery, I'd have gone
| with "Xorex" myself.
| kevin_thibedeau wrote:
| We'll see what the new name is when the C&D is delivered.
| actionfromafar wrote:
| Let me xerox that C&D letter first...
| 627467 wrote:
| The commercial service is called OmniAI; zerox is just the name
| of a component (GitHub repo, library) in a possible software
| stack.
|
| am I the only one who finds these sorts of takes silly in a
| globalized world with instant communications? There are so many
| things to be named, everything named is instantly available
| around the world, so many jurisdictions to cover - not all
| providing the same levels of protections to "trademarks".
|
| Are we really suggesting this issue is worth defending and
| spending resources on?
|
| What are the grounds for confusion here? That a developer
| stumbles on this and thinks zerox is developed/maintained by
| Xerox? Would this developer get confused but not simply check
| who owns the repository? What if there's a variable called
| zerox?
|
| I mean, I get it: the whole point of IP at this point is really
| just to create revenue streams for the legal/admin industry, so
| we should all be scared and spend unproductive time naming a
| software dependency.
| HumblyTossed wrote:
| > so we should all be scared and spend unproductive time
| naming a software dependency
|
| All 5 minutes it would take to name it something else?
| 8organicbits wrote:
| > Are we really suggesting this issue is worth defending and
| spending resources on?
|
| Absolutely.
|
| Sure, sometimes non-competing products have the same name. Or
| products sold exclusively in one country use the same name as
| a competitor in a different country. There are also companies
| that don't trademark or protect their names. Often no one
| even notices the common name.
|
| That's not what's happening here. Xerox is famously litigious
| about its trademark; it's often used as a case study. The
| product competes with Xerox OCR products in the same
| countries.
|
| It's a strange thing to be cavalier about and to openly
| document intent to use a sound-alike name. Besides, do you
| really want people searching for "Zerox OCR" to land on a
| Xerox page? There's no shortage of other names.
| HumblyTossed wrote:
| I'm sure that was on purpose.
|
| Edit: Reading the comments below, yes, it was.
|
| Very disrespectful behavior.
| 8organicbits wrote:
| > And 6 months from now it'll be fast, cheap, and probably more
| reliable!
|
| I like the optimism.
|
| I've needed to include human review when using previous-
| generation OCR software, when I needed the results to be
| accurate. It's painstaking, but the OCR offered a speedup over
| fully-manual transcription. Have you given any thought to human-
| in-the-loop processes?
| themanmaran wrote:
| I've been surprised so far by LLMs' capabilities, so I hope it
| continues.
|
| On the human-in-the-loop side, it's really use-case specific. For a
| lot of my company's work, it's focused on getting trends from
| large sets of documents.
|
| Ex: "categorize building permits by municipality". If the OCR
| was wrong on a few documents, it's still going to capture the
| general trend. If the use case was "pull bank account info from
| wire forms" I would want a lot more double checking. But that
| said, humans also have a tendency to transpose numbers
| incorrectly.
| 8organicbits wrote:
| Hmm, sounds like different goals. I don't work on that
| project any longer but it was a very small set of documents
| and they needed to be transcribed perfectly. Every typo in
| the original needed to be preserved.
|
| That said, there's huge value in lossy transcription
| elsewhere, as long as you can account for the errors it
| introduces.
| raisedbyninjas wrote:
| Our human-in-the-loop process with traditional OCR uses
| confidence scores from regions of interest and the page
| coordinates to speed up the review process. I wish the LLM
| could provide that, but both seem far off on the horizon.
| throwthrowuknow wrote:
| Have you tried the GraphRAG approach of just rerunning the
| same prompts multiple times and then giving the results back to
| the model with a prompt telling it to extract the true text and
| fix any mistakes? With mini this seems like a very workable
| solution. You could even incorporate one or more attempts from
| whatever OCR you were using previously.
|
| I think that is one of the key findings from the GraphRAG
| paper: the GPT can replace the human in the loop.
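| Roughly this, as a sketch (assumes an initialized OpenAI
| client; ocrPage is a hypothetical single-pass OCR helper, and
| the reconciliation prompt is made up):
|
|     // Run several independent OCR passes, then ask the model
|     // to reconcile the drafts into one corrected transcription.
|     async function ensembleOcr(image: Buffer, runs = 3) {
|       const drafts = await Promise.all(
|         Array.from({ length: runs }, () => ocrPage(image))
|       );
|       const res = await openai.chat.completions.create({
|         model: "gpt-4o-mini",
|         messages: [{
|           role: "user",
|           content:
|             "These are independent OCR attempts of the same " +
|             "page. Extract the true text and fix any mistakes:\n\n" +
|             drafts
|               .map((d, i) => `--- Attempt ${i + 1} ---\n${d}`)
|               .join("\n\n"),
|         }],
|       });
|       return res.choices[0].message.content ?? "";
|     }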
| downrightmike wrote:
| Does it also produce a confidence number?
| ravetcofx wrote:
| I don't think OpenAI's API for GPT-4o-mini has any such
| mechanism.
| wildzzz wrote:
| The AI says it's 100% confident that its hallucinations are
| correct.
| tensor wrote:
| No, there is no vision LLM that produces confidence numbers to
| my knowledge.
| ndr_ wrote:
| The only thing close is the "logprobs":
| https://cookbook.openai.com/examples/using_logprobs
|
| However, commenters around here have noted that these have
| likely not been fine-tuned to correlate with accuracy for
| plaintext LLM uses. Would be interested in hearing findings for
| MLLM use cases!
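| For example (logprobs and top_logprobs are real chat-completions
| parameters; whether the values are calibrated for OCR accuracy
| is the open question):
|
|     const res = await openai.chat.completions.create({
|       model: "gpt-4o-mini",
|       logprobs: true,
|       top_logprobs: 3, // also return 3 alternatives per token
|       messages: [
|         // image content part omitted for brevity
|         { role: "user", content: "Transcribe the attached page." },
|       ],
|     });
|
|     // Crude per-page confidence proxy: mean token log prob.
|     const tokens = res.choices[0].logprobs?.content ?? [];
|     const avg =
|       tokens.reduce((s, t) => s + t.logprob, 0) /
|       (tokens.length || 1);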
| surfingdino wrote:
| Xerox tried it a while ago. It didn't end well:
| https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
| merb wrote:
| > This is not an OCR problem (as we switched off OCR on
| purpose)
| yjftsjthsd-h wrote:
| It also says
|
| > This is not an OCR problem, but of course, I can't have a
| look into the software itself, maybe OCR is still fiddling
| with the data even though we switched it off.
|
| But the point stands either way; LLMs are prone to
| hallucinations already, so I would not trust them to not make
| a mistake in OCR because they thought the page would probably
| say something different than it does.
| mlyle wrote:
| > It also says...
|
| It was a problem with employing the JBIG2 compression
| codec, which cuts and pastes things from different parts of
| the page to save space.
|
| > But the point stands either way; LLMs are prone to
| hallucinations already, so I would not trust them to not
| make a mistake in OCR because they thought the page would
| probably say something different than it does.
|
| Anyone trying to solve for the contents of a page uses
| context clues. Even humans reading.
|
| You can OCR raw characters (performance is poor); use
| letter frequency information; use a dictionary; use word
| frequencies; or use even more context to know what content
| is more likely. More context is going to result in many
| fewer errors (of course, it may result in a bigger
| proportion of the remaining errors seeming to have
| significant meaning changes).
|
| A small LLM is just a good way to encode this kind of "how
| likely are these given alternatives" knowledge.
| surfingdino wrote:
| It's all fun and games until you need to prove something
| in court or to the tax office. I don't think that
| throwing an LLM into this mix helps.
| wmf wrote:
| Generally when OCRing documents you should keep the
| original scans so you can refer back to them in case of
| any questions or disputes.
| tensor wrote:
| Traditional OCR neural networks like Tesseract crucially
| have strong measures of their accuracy levels,
| including when they employ dictionaries or the like to
| help with accuracy. LLMs, on the other hand, give you
| zero guarantees and have some pretty insane edge cases.
|
| With a traditional OCR architecture maybe you'll get a
| symbol or two wrong, but an LLM can give you entirely new
| words or numbers not in the document, or even omit
| sections of the document. I'd never use an LLM for OCR
| like this.
| qingcharles wrote:
| It depends what your use case is. At a low enough cost this
| would work for a project I'm doing where I really just need
| to be able to mostly search large documents. Less than 100%
| accuracy, with a lost or hallucinated paragraph here and
| there, wouldn't be a deal-killer, especially if the original
| page image is available to the user too.
|
| Additionally, this might work if you are feeding the output
| to a bunch of humans to proof.
| ctm92 wrote:
| That was also the first thing that came to my mind; I guess
| Zerox might be a reference to this.
| ravetcofx wrote:
| I'd be more curious to see the performance of local models like
| LLaVA etc.
| hugodutka wrote:
| I used this approach extensively over the past couple of months
| with GPT-4 and GPT-4o while building https://hotseatai.com. Two
| things that helped me:
|
| 1. Prompt with examples. I included an example image with an
| example transcription as part of the prompt (roughly as in the
| sketch below). This made GPT make fewer mistakes and improved
| output accuracy.
|
| 2. Confidence score. I extracted the embedded text from the PDF
| and compared the frequency of character triples in the source
| text and GPT's output. If there was a significant difference
| (less than 90% overlap) I would log a warning. This helped detect
| cases when GPT omitted entire paragraphs of text.
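| For point 1 the messages looked roughly like this (sketch;
| exampleTranscription / targetImageUrl are placeholders, not my
| real prompt):
|
|     const messages = [
|       { role: "user", content: [
|         { type: "text", text: "Transcribe this page." },
|         { type: "image_url",
|           image_url: { url: "https://example.com/sample.png" } },
|       ] },
|       // The known-good transcription, as if the model replied.
|       { role: "assistant", content: exampleTranscription },
|       { role: "user", content: [
|         { type: "text", text: "Now transcribe this page." },
|         { type: "image_url",
|           image_url: { url: targetImageUrl } },
|       ] },
|     ];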
| sidmitra wrote:
| >frequency of character triples
|
| What are character triples? Are they trigrams?
| hugodutka wrote:
| I think so. I'd normalize the text first: lowercase it and
| remove all non-alphanumeric characters. E.g for the phrase
| "What now?" I'd create these trigrams: wha, hat, atn, tno,
| now.
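| As a set-based sketch of the check (my version compared
| frequencies; this simplification just compares trigram sets):
|
|     function trigrams(text: string): Set<string> {
|       const norm = text.toLowerCase().replace(/[^a-z0-9]/g, "");
|       const grams = new Set<string>();
|       for (let i = 0; i + 3 <= norm.length; i++) {
|         grams.add(norm.slice(i, i + 3));
|       }
|       return grams;
|     }
|
|     // Fraction of source trigrams that survive into the GPT
|     // output; log a warning below ~0.9.
|     function overlap(source: string, output: string): number {
|       const src = trigrams(source);
|       const out = trigrams(output);
|       if (src.size === 0) return 1;
|       let hits = 0;
|       for (const g of src) if (out.has(g)) hits++;
|       return hits / src.size;
|     }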
| themanmaran wrote:
| One option we've been testing is the `maintainFormat` mode.
| This tries to return the markdown in a consistent format by
| passing the output of a prior page in as additional context for
| the next page (see the sketch after the list below). Especially
| useful if you've got tables that span pages. The flow is pretty
| much:
|
| - Request #1 => page_1_image
|
| - Request #2 => page_1_markdown + page_2_image
|
| - Request #3 => page_2_markdown + page_3_image
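| In code the loop is roughly (sketch; assumes a pageImages array
| and an ocrPage helper that accepts the prior page's markdown as
| extra context):
|
|     let priorPage = "";
|     const pages: string[] = [];
|     for (const image of pageImages) {
|       // The prior markdown rides along so tables that span
|       // pages keep a consistent shape.
|       const markdown = await ocrPage(image, priorPage);
|       pages.push(markdown);
|       priorPage = markdown;
|     }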
| nbbaier wrote:
| > I extracted the embedded text from the PDF
|
| What did you use to extract the embedded text during this
| step, other than some other OCR tech?
| josefritzishere wrote:
| Xerox might want to have a word with you about that name.
| fudged71 wrote:
| Llama 3.1 now has image support, right? Could this be adapted
| there as well, maybe with Groq for speed?
| themanmaran wrote:
| Yup! I want to evaluate a couple of different model options
| over time, which should be pretty simple!
|
| The main thing we're doing is converting documents to a series
| of images, and then aggregating the response. So we should be
| model agnostic pretty soon.
| daemonologist wrote:
| Meta trained a vision encoder (page 54 of the Llama 3.1 paper)
| but has not released it as far as I can tell.
| serjester wrote:
| It should be noted that, for some reason, OpenAI prices
| GPT-4o-mini image requests at the same price as GPT-4o. I have a
| similar library, but we found OpenAI has subtle OCR
| inconsistencies with tables (numbers will be inaccurate). Gemini
| Flash, for all its faults, seems to do really well as a
| replacement while being significantly cheaper.
|
| Here's our pricing comparison:
|
| *Gemini Pro*
| - $0.66 per 1k image inputs (batch)
| - $1.88 per 1k output tokens (batch API)
| - 395 pages per dollar
|
| *Gemini Flash*
| - $0.066 per 1k image inputs (batch)
| - $0.53 per 1k output tokens (batch API)
| - 1693 pages per dollar
|
| *GPT-4o*
| - $1.91 per 1k image inputs (batch)
| - $3.75 per 1k output tokens (batch API)
| - 177 pages per dollar
|
| *GPT-4o-mini*
| - $1.91 per 1k image inputs (batch)
| - $0.30 per 1k output tokens (batch API)
| - 452 pages per dollar
|
| [1] https://community.openai.com/t/super-high-token-usage-
| with-g...
|
| [2] https://github.com/Filimoa/open-parse
| themanmaran wrote:
| Interesting. It didn't seem like gpt-4o-mini was priced the
| same as gpt-4o during our testing. We're relying on the OpenAI
| usage page of course, which doesn't give request-by-request
| pricing. But we didn't see any huge usage spike after testing
| all weekend.
|
| For our testing we ran a 1000-page document set, all treated as
| images. We got to about 24M input / 0.4M output tokens for 1000
| pages, which would be a pretty noticeable difference based on
| the listed token prices:
|
| gpt-4o-mini => (24M/1M * $0.15) + (0.4M/1M * $0.60) = $3.84
|
| gpt-4o => (24M/1M * $5.00) + (0.4M/1M * $15.00) = $126.00
| serjester wrote:
| The pricing is strange because the same images will use up
| 30X more tokens with mini. They even show this in the pricing
| calculator.
|
| [1] https://openai.com/api/pricing/
| elvennn wrote:
| Indeed it does. But the price for the OCR's output tokens is
| also cheaper, so in total it's still much cheaper with
| gpt-4o-mini.
| raffraffraff wrote:
| That price compares favourably with AWS Textract. Has anyone
| compared their performance? Because a recent post about OCR had
| Textract at or near the top in terms of quality.
| aman2k4 wrote:
| I'm using AWS Textract for scanning grocery receipts and I
| find it does it very well and fast. Can you say which
| performance metric you have in mind?
| ianhawes wrote:
| Can you locate that post? In my own experience, Google
| Document AI has superior quality but I'm looking for
| something a bit more objective and scientific.
| bearjaws wrote:
| I did this for images using Tesseract for OCR + Ollama for AI.
|
| Check it out, https://cluttr.ai
|
| Runs entirely in browser, using OPFS + WASM.
| jimmyechan wrote:
| Congrats! Cool project! I'd been curious about whether GPT would
| be good for this task. Looks like this answers it!
|
| Why did you choose markdown? Did you try other output formats and
| see if you get better results?
|
| Also, I wonder how HTML performs. It would be a way to handle
| tables with groupings/merged cells.
| themanmaran wrote:
| I think that I'll add an optional configuration for HTML vs
| Markdown, which at the end of the day will just prompt the
| model differently.
|
| I've not seen a meaningful difference between the two, except
| when it comes to tables. It seems like HTML tends to outperform
| markdown tables, especially when you have a lot of complexity
| (i.e. tables within tables, lots of subheaders).
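| For example, merged cells have no markdown-table equivalent, so
| HTML is the only way for the model to express them (illustrative
| output, not from a real run):
|
|     <table>
|       <tr><th rowspan="2">Region</th>
|           <th colspan="2">Revenue</th></tr>
|       <tr><th>2023</th><th>2024</th></tr>
|       <tr><td>EMEA</td><td>1.2</td><td>1.4</td></tr>
|     </table>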
| lootsauce wrote:
| In my own experiments I have had major failures where much of
| the text is fabricated by the LLM, to the point where I just
| find it hard to trust even with great prompt engineering. What
| I have been very impressed with is its ability to take
| medium-quality OCR from Acrobat, with poor formatting, lots of
| errors, and punctuation problems, and render 100% accurate and
| properly formatted output simply by being asked to correct the
| OCR output. This approach, using traditional cheap OCR for
| grounding, might be a really robust and cheap option.
| ipkstef wrote:
| I think I'm missing something... why would I pay to OCR the
| images when I can do it locally for free? Tesseract runs pretty
| well on just a CPU; you wouldn't even need something crazy
| powerful.
| gregolo wrote:
| And OpenAI uses Tesseract in the background; for me it
| sometimes answers that the Hungarian language is not installed
| for Tesseract.
| s5ma6n wrote:
| I would be extremely surprised if that's the case. There are
| "open-source" multimodal LLMs that can extract text from
| images, as proof that the idea works.
|
| Probably the model is hallucinating and adding "Hungarian
| language is not installed for Tesseract" to the response.
| daemonologist wrote:
| Tesseract works great for pure label-the-characters OCR, which
| is sufficient for books and other sources with straightforward
| layouts, but doesn't handle weird layouts (tables, columns,
| tables with columns in each cell, etc.) People will do
| absolutely depraved stuff with Word and PDF documents and you
| often need semantic understanding to decipher it.
|
| That said, sometimes no amount of understanding will improve
| the OCR output because a structure in a document cannot be
| converted to a one-dimensional string (short of using HTML/CSS
| or something). Maybe we'll get image -> HTML models eventually.
| jagermo wrote:
| Ohh, that could finally be a great way to get my TTRPG books
| readable for Kindle. I'll give it a try, thanks for that.
| amluto wrote:
| My intuition is that the best solution here would be a division
| of labor: have the big multimodal model identify tables,
| paragraphs, etc., and output a mapping between segments of the
| document and text output. Then a much simpler model that
| doesn't try to hold entire conversations can process those
| segments into their contents.
|
| This will perform worse in cases where whatever understanding the
| large model has of the contents is needed to recognize indistinct
| symbols. But it will avoid cases where that very same
| understanding causes contents to be understood incorrectly due to
| the model's assumptions of what the contents should be.
|
| At least in my limited experiments with Claude, it's easy for
| models to lose track of where they're looking on the page and to
| omit things entirely. But if segmentation of the page is
| explicit, one can enforce that all contents end up in exactly one
| segment.
| jdthedisciple wrote:
| Very nice, seems to work pretty well!
|
| Just maintainFormat: true
|
| did not seem to have any effect in my testing.
| aman2k4 wrote:
| I am using AWS Textract + LLM (OpenAI/Claude) to read grocery
| receipts for <https://www.5outapp.com>
|
| So far, I have collected over 500 receipts from around 10
| countries with 30 different supermarkets in 5 different
| languages.
|
| What has worked for me so far is having control over OCR and
| processing (for formatting/structuring) separately. I don't have
| the figures to provide a cost structure, but I'm looking for
| other solutions to improve both speed and accuracy. Also, I need
| to figure out a way to put a metric around accuracy. I will
| definitely give this a shot. Thanks a lot.
| sleno wrote:
| Cool design. FYI the "Try now" card looks like it didn't render
| right, just seeing a blank box around the button.
| aman2k4 wrote:
| You mean in the web version? It is supposed to look like a
| blank box in the rectangular grocery-bill shape, but I suppose
| the design can be a bit better there. Thanks for the
| feedback.
| constantinum wrote:
| If you want to do document OCR/PDF text extraction with decent
| accuracy without using an LLM, do give LLMWhisperer[1] a try.
|
| Try with any PDF document in the playground -
| https://pg.llmwhisperer.unstract.com/
|
| [1] - https://unstract.com/llmwhisperer/
| throwthrowuknow wrote:
| Have you compared the results to special-purpose OCR-free
| models that do image-to-text with layout? My intuition is mini
| should be just as good, if not better.
| jerrygenser wrote:
| I would categorize Azure Document AI accuracy as high, not
| "mid", including handwriting. However, at $1.5/1000 pages it
| doesn't include layout detection.
|
| The $10/1000 pages model includes layout detection (headers,
| etc.) as well as key-value pairs and checkbox detection.
|
| I have continued to do proofs of concept with Gemini and GPT,
| and in general any new multimodal model that comes out, but so
| far none are on par with the checkbox detection of Azure.
|
| In fact the results from Gemini/GPT4 aren't even good enough to
| use as a teacher for distillation of a "small" multimodal model
| specializing in layout/checkbox.
|
| I would also like to shout out Surya OCR, which is up and
| coming. It's source-available and free under a certain funding
| or revenue milestone - I think $5M. It doesn't have word-level
| detection yet, but it's one of the more promising
| non-hyperscaler / non-heavy-commercial OCR tools I'm aware of.
| ianhawes wrote:
| Surya OCR is great in my test use cases! Hoping to try it out
| in production soon.
| samuell wrote:
| One problem I've not found any OCR solution to handle well is
| complex column-based layouts in magazines. Perhaps one issue is
| that there are often images spanning anything from one to all
| columns, so the text might flow in sometimes funny ways. But in
| this day and age, this must be possible for the best AI-based
| tools to handle?
| ndr_ wrote:
| Prompts in the background:
|
|     const systemPrompt = `
|       Convert the following PDF page to markdown.
|       Return only the markdown with no explanation text.
|       Do not exclude any content from the page.
|     `;
|
| For each subsequent page:
|
|     messages.push({
|       role: "system",
|       content: `Markdown must maintain consistent formatting
|         with the following page: \n\n"""${priorPage}"""`,
|     });
|
| Could be handy for general-purpose frontend tools.
| markous wrote:
| so this is just a wrapper around gpt-4o mini?
| binalpatel wrote:
| You can do some really cool things now with these models, like
| asking them to extract not just the text but figures/graphs as
| nodes/edges, and it works very well. Back when GPT-4 with
| vision came out I tried this with a simple prompt + dumping in
| a pydantic schema of what I wanted, and it was spot on. Pretty
| much this (before JSON mode was supported):
|
|     You are an expert in PDFs. You are helping a user extract
|     text from a PDF.
|
|     Extract the text from the image as a structured json output.
|
|     Extract the data using the following schema:
|
|     {Page.model_json_schema()}
|
|     Example:
|     {{
|       "title": "Title",
|       "page_number": 1,
|       "sections": [ ... ],
|       "figures": [ ... ]
|     }}
|
| https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
___________________________________________________________________
(page generated 2024-07-24 23:10 UTC)