[HN Gopher] Show HN: Zerox - Document OCR with GPT-4o-mini
       ___________________________________________________________________
        
       Show HN: Zerox - Document OCR with GPT-4o-mini
        
       This started out as a weekend hack with gpt-4o-mini, using the
       very basic strategy of "just ask the AI to OCR the document".
       But this turned out to perform better than our current
       implementation of Unstructured/Textract, at pretty much the same
       cost.  I've tested almost every variant of document OCR over the
       past year, especially things like table / chart extraction, and
       I've found that rules-based extraction has always been lacking.
       Documents are meant to be a visual representation after all,
       with weird layouts, tables, charts, etc. Using a vision model
       just makes sense!  In general, I'd categorize this solution as
       slow, expensive, and non-deterministic. But 6 months ago it was
       impossible. And 6 months from now it'll be fast, cheap, and
       probably more reliable!
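        
       Roughly, the whole trick fits in a few lines. Here's an
       illustrative Python sketch (not the actual repo code;
       pdf2image/poppler and the exact wiring here are stand-ins):
        
           import base64, io
           from openai import OpenAI
           from pdf2image import convert_from_path  # assumes poppler
        
           client = OpenAI()
        
           def page_to_markdown(image) -> str:
               buf = io.BytesIO()
               image.save(buf, format="PNG")
               b64 = base64.b64encode(buf.getvalue()).decode()
               resp = client.chat.completions.create(
                   model="gpt-4o-mini",
                   messages=[
                       {"role": "system", "content":
                        "Convert the following PDF page to markdown. "
                        "Return only the markdown with no explanation text."},
                       {"role": "user", "content": [{"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}"}}]},
                   ],
               )
               return resp.choices[0].message.content
        
           # pdf => image => markdown, one request per page
           pages = convert_from_path("document.pdf", dpi=200)
           markdown = "\n\n".join(page_to_markdown(p) for p in pages)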
        
       Author : themanmaran
       Score  : 217 points
       Date   : 2024-07-23 16:49 UTC (1 day ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | cmpaul wrote:
       | Great example of how LLMs are eliminating/simplifying giant
       | swathes of complex tech.
       | 
       | I would love to use this in a project if it could also caption
       | embedded images to produce something for RAG...
        
         | hpen wrote:
         | Yay! Now we can use more RAM, Network, Energy, etc to do the
         | same thing! I just love hot phones!
        
           | hpen wrote:
           | Oops guess I'm not sippin' the koolaid huh?
        
       | beklein wrote:
       | Very interesting project, thank you for sharing.
       | 
       | Are you supporting the Batch API from OpenAI? This would lower
       | costs by 50%. Many OCR tasks are not time-sensitive, so this
       | might be a very good tradeoff.
        
         | themanmaran wrote:
          | That's definitely the plan. Using batch requests would move
          | this closer to the $2/1000 pages mark, which is effectively
          | the AWS pricing.
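          | 
          | Roughly what that could look like against OpenAI's Batch API
          | (an illustrative sketch; `page_images_b64` is a placeholder
          | for the already-encoded pages):
          | 
          |     import json
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     # One JSONL line per page, same body as a normal request
          |     with open("pages.jsonl", "w") as f:
          |         for i, b64 in enumerate(page_images_b64):
          |             f.write(json.dumps({
          |                 "custom_id": f"page-{i}",
          |                 "method": "POST",
          |                 "url": "/v1/chat/completions",
          |                 "body": {
          |                     "model": "gpt-4o-mini",
          |                     "messages": [
          |                         {"role": "system", "content":
          |                          "Convert the following PDF page to markdown."},
          |                         {"role": "user", "content": [{"type": "image_url",
          |                          "image_url": {"url": f"data:image/png;base64,{b64}"}}]},
          |                     ],
          |                 },
          |             }) + "\n")
          | 
          |     batch_file = client.files.create(file=open("pages.jsonl", "rb"),
          |                                      purpose="batch")
          |     batch = client.batches.create(input_file_id=batch_file.id,
          |                                   endpoint="/v1/chat/completions",
          |                                   completion_window="24h")
          | 
          | One catch: the `maintainFormat` mode discussed below feeds
          | each page's output into the next request, which wouldn't map
          | cleanly onto a batch of independent requests.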
        
       | refulgentis wrote:
        | Fwiw, I have it on good sourcing that OpenAI supplies
        | Tesseract output to the LLM, so you're in a great place - best
        | of all worlds.
        
         | davedunkin wrote:
         | At inference time or during training?
        
           | refulgentis wrote:
           | Inference
        
       | 8organicbits wrote:
       | I'm surprised by the name choice, there's a large company with an
       | almost identical name that has products that do this. May be
       | worth changing it sooner rather than later.
       | 
       | https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web
        
         | pkaye wrote:
         | Maybe call it ZeroPDF?
        
         | froh wrote:
         | gpterox
        
         | ot wrote:
         | > there's a large company with an almost identical name
         | 
         | Are you suggesting that this wasn't intentional? The name is
         | clearly a play on "zero shot" + "xerox"
        
           | UncleOxidant wrote:
           | I think they're suggesting that Xerox will likely sue them so
           | might as well get ahead of that and change the name now.
        
             | 8organicbits wrote:
             | Even if they don't sue, do you really want to deal with
             | people getting confused and thinking you mean one of the
             | many pre-existing OCR tools that Xerox produces? A search
             | for "Zerox OCR" will lead to Xerox products, for example.
             | Not worth the headache.
             | 
             | https://duckduckgo.com/?q=Zerox+OCR
        
           | themanmaran wrote:
           | Yup definitely a play on the name. Also the idea of
           | photocopying a page, since we do pdf => image => markdown.
           | 
            | We're not planning to name a company after it or anything,
            | just the open-source tool. And if Xerox sues I'm sure we
            | could rename the repo lol.
        
             | wewtyflakes wrote:
                | It still seems reasonable that someone may be
                | confused, especially since the one letter of the
                | company name that was changed has identical
                | pronunciation (x --> z). It is like offering a
                | "Phacebook" or "Netfliks" competitor, but even less
                | obviously different.
        
               | qingcharles wrote:
               | Surprisingly, http://phacebook.com/ is for sale.
        
             | ned_at_codomain wrote:
             | I would happily contribute to the legal defense fund.
        
             | ssl-3 wrote:
             | I was involved in a somewhat similar trademark issue once.
             | 
             | I actually had a leg to stand on (my use was not infringing
             | at all when I started using it), and I came out of it
             | somewhat cash-positive, but I absolutely never want to go
             | through anything like that ever again.
             | 
             | > Yup definitely a play on the name. Also the idea of
             | photocopying a page,
             | 
             | But you? My God, man.
             | 
             | With these words you have already doomed yourself.
             | 
             | Best wishes.
        
               | neilv wrote:
               | > _With these words you have already doomed yourself._
               | 
               | At least they didn't say " _xeroxing_ a page ".
        
             | haswell wrote:
             | If they sue, this comment will be used to make their case.
             | 
             | I guess I just don't understand - how are you proceeding as
             | if this is an acceptable starting point?
             | 
             | With all respect, I don't think you're taking this
             | seriously, and it reflects poorly on the team building the
             | tool. It looks like this is also a way to raise awareness
             | for Omni AI? If so, I've gotta be honest - this makes me
             | want to steer clear.
             | 
              | Bottom line, it's a bad idea/decision. And when bad
              | ideas are this prominent, it makes me question the rest
              | of the decisions underlying the product, and whether I
              | want to trust those decision makers in the many other
              | ways trust is required to choose a vendor.
             | 
             | Not trying to throw shade; just sharing how this hits me as
             | someone who has built products and has been the person
             | making decisions about which products to bring in. Start
             | taking this seriously for your own sake.
        
         | blacksmith_tb wrote:
         | If imitation is the sincerest form of flattery, I'd have gone
         | with "Xorex" myself.
        
           | kevin_thibedeau wrote:
           | We'll see what the new name is when the C&D is delivered.
        
             | actionfromafar wrote:
             | Let me xerox that C&D letter first...
        
         | 627467 wrote:
         | the commercial service is called OmniAI. zerox is just the name
         | of a component (github repo, library) in a possible software
         | stack.
         | 
         | am I only one finding these sort of takes silly in a cumulative
         | globalized world with instant communications? There are so many
         | things to be named, everything named is instantly available
         | around the world, so many jurisdictions to cover - not all
         | providing the same levels of protections to "trademarks".
         | 
         | Are we really suggesting this issue is worth defending and
         | spending resources on?
         | 
         | what is the ground for confusion here? that a developer
         | stumbles on here and thinks zerox is developed/maintained by
         | xerox? this developer gets confused but won't simply check who
         | is the owner of the repository? What if there's a variable
         | called zerox?
         | 
         | I mean, I get it: the whole point of IP at this point is really
         | just to create revenue streams for the legal/admin industry so
         | we should all be scared and spend unproductive time naming a
         | software dependency
        
           | HumblyTossed wrote:
           | > so we should all be scared and spend unproductive time
           | naming a software dependency
           | 
           | All 5 minutes it would take to name it something else?
        
           | 8organicbits wrote:
           | > Are we really suggesting this issue is worth defending and
           | spending resources on?
           | 
           | Absolutely.
           | 
            | Sure, sometimes non-competing products have the same name.
            | Or products sold exclusively in one country use the same
            | name as a competitor in a different country. There are
            | also companies that don't trademark or protect their
            | names. Often no one even notices the common name.
            | 
            | That's not what's happening here. Xerox is famously
            | litigious about its trademark; it's often used as a case
            | study. The product competes with Xerox OCR products in the
            | same countries.
           | 
           | It's a strange thing to be cavalier about and to openly
           | document intent to use a sound-alike name. Besides, do you
           | really want people searching for "Zerox OCR" to land on a
           | Xerox page? There's no shortage of other names.
        
         | HumblyTossed wrote:
         | I'm sure that was on purpose.
         | 
         | Edit: Reading the comments below, yes, it was.
         | 
         | Very disrespectful behavior.
        
       | 8organicbits wrote:
       | > And 6 months from now it'll be fast, cheap, and probably more
       | reliable!
       | 
       | I like the optimism.
       | 
        | I've needed to include human review when using previous-
        | generation OCR software, when I needed the results to be
        | accurate. It's painstaking, but the OCR offered a speedup over
        | fully-manual transcription. Have you given any thought to
        | human-in-the-loop processes?
        
         | themanmaran wrote:
          | I've been surprised so far by LLMs' capability, so I hope it
          | continues.
          | 
          | On the human-in-the-loop side, it's really use-case
          | specific. A lot of my company's work is focused on getting
          | trends from large sets of documents.
         | 
         | Ex: "categorize building permits by municipality". If the OCR
         | was wrong on a few documents, it's still going to capture the
         | general trend. If the use case was "pull bank account info from
         | wire forms" I would want a lot more double checking. But that
         | said, humans also have a tendency to transpose numbers
         | incorrectly.
        
           | 8organicbits wrote:
           | Hmm, sounds like different goals. I don't work on that
           | project any longer but it was a very small set of documents
           | and they needed to be transcribed perfectly. Every typo in
           | the original needed to be preserved.
           | 
            | That said, there's huge value in lossy transcription
            | elsewhere, as long as you can account for the errors it
            | introduces.
        
           | raisedbyninjas wrote:
           | Our human in the loop process with traditional OCR uses
           | confidence scores from regions of interest and the page
           | coordinates to speed-up the review process. I wish the LLM
           | could provide that, but both seem far off on the horizon.
        
         | throwthrowuknow wrote:
          | Have you tried the GraphRAG approach of rerunning the same
          | prompt multiple times and then giving the results back to
          | the model, along with a prompt telling it to extract the
          | true text and fix any mistakes? With mini this seems like a
          | very workable solution. You could even incorporate one or
          | more attempts from whatever OCR you were using previously.
          | 
          | I think that is one of the key findings from the GraphRAG
          | paper: the GPT can replace the human in the loop.
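          | 
          | Something like this, roughly (a sketch only; `ocr_page`
          | stands in for whatever single-shot OCR call you already
          | have):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     # N independent transcriptions of the same page
          |     candidates = [ocr_page(page_image) for _ in range(3)]
          | 
          |     # One more call to reconcile them into the "true" text
          |     merge_prompt = (
          |         "Here are several independent OCR transcriptions of the "
          |         "same page. Extract the true text and fix any mistakes:\n\n"
          |         + "\n\n---\n\n".join(candidates)
          |     )
          |     final = client.chat.completions.create(
          |         model="gpt-4o-mini",
          |         messages=[{"role": "user", "content": merge_prompt}],
          |     ).choices[0].message.content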
        
       | downrightmike wrote:
       | Does it also produce a confidence number?
        
         | ravetcofx wrote:
          | I don't think OpenAI's API for gpt-4o-mini has any such
          | mechanism.
        
         | wildzzz wrote:
          | The AI says it's 100% confident that its hallucinations are
          | correct.
        
         | tensor wrote:
         | No, there is no vision LLM that produces confidence numbers to
         | my knowledge.
        
         | ndr_ wrote:
          | The closest thing is the "logprobs":
          | https://cookbook.openai.com/examples/using_logprobs
          | 
          | However, commenters around here have noted that these have
          | likely not been fine-tuned to correlate with accuracy for
          | plaintext LLM uses. Would be interested in hearing findings
          | for MLLM use cases!
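          | 
          | For instance (a sketch; `logprobs=True` is a real chat
          | completions parameter, but treat the numbers as a rough
          | proxy rather than calibrated confidence):
          | 
          |     import math
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     resp = client.chat.completions.create(
          |         model="gpt-4o-mini",
          |         messages=messages,  # the usual image + prompt
          |         logprobs=True,
          |     )
          |     tokens = resp.choices[0].logprobs.content
          | 
          |     # Average token probability, plus the shakiest tokens
          |     avg_p = sum(math.exp(t.logprob) for t in tokens) / len(tokens)
          |     shaky = [t.token for t in tokens if math.exp(t.logprob) < 0.5]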
        
       | surfingdino wrote:
       | Xerox tried it a while ago. It didn't end well
       | https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...
        
         | merb wrote:
         | > This is not an OCR problem (as we switched off OCR on
         | purpose)
        
           | yjftsjthsd-h wrote:
           | It also says
           | 
           | > This is not an OCR problem, but of course, I can't have a
           | look into the software itself, maybe OCR is still fiddling
           | with the data even though we switched it off.
           | 
           | But the point stands either way; LLMs are prone to
           | hallucinations already, so I would not trust them to not make
           | a mistake in OCR because they thought the page would probably
           | say something different than it does.
        
             | mlyle wrote:
             | > It also says...
             | 
             | It was a problem with employing the JBIG2 compression
             | codec, which cuts and pastes things from different parts of
             | the page to save space.
             | 
             | > But the point stands either way; LLMs are prone to
             | hallucinations already, so I would not trust them to not
             | make a mistake in OCR because they thought the page would
             | probably say something different than it does.
             | 
             | Anyone trying to solve for the contents of a page uses
             | context clues. Even humans reading.
             | 
             | You can OCR raw characters (performance is poor); use
             | letter frequency information; use a dictionary; use word
             | frequencies; or use even more context to know what content
             | is more likely. More context is going to result in many
             | fewer errors (of course, it may result in a bigger
             | proportion of the remaining errors seeming to have
             | significant meaning changes).
             | 
             | A small LLM is just a good way to encode this kind of "how
             | likely are these given alternatives" knowledge.
        
               | surfingdino wrote:
               | It's all fun and games until you need to prove something
               | in court or to the tax office. I don't think that
               | throwing an LLM into this mix helps.
        
               | wmf wrote:
               | Generally when OCRing documents you should keep the
               | original scans so you can refer back to them in case of
               | any questions or disputes.
        
               | tensor wrote:
                | Traditional OCR neural networks like Tesseract
                | crucially have strong measures of their accuracy
                | levels, including when they employ dictionaries or the
                | like to help with accuracy. LLMs, on the other hand,
                | give you zero guarantees, and have some pretty insane
                | edge cases.
               | 
               | With a traditional OCR architecture maybe you'll get a
               | symbol or two wrong, but an LLM can give you entirely new
               | words or numbers not in the document, or even omit
               | sections of the document. I'd never use an LLM for OCR
               | like this.
        
             | qingcharles wrote:
              | It depends what your use-case is. At a low enough cost
              | this would work for a project I'm doing where I really
              | just need to be able to mostly search large documents.
              | Less-than-perfect accuracy and a lost or hallucinated
              | paragraph here and there wouldn't be a deal-killer,
              | especially if the original page image is available to
              | the user too.
              | 
              | Additionally, this might also work if you are feeding
              | the output to a bunch of humans to proofread.
        
         | ctm92 wrote:
         | That was also what first came to my mind, I guess Zerox might
         | be a reference to this
        
       | ravetcofx wrote:
        | I'd be more curious to see the performance of local models
        | like LLaVA, etc.
        
       | hugodutka wrote:
       | I used this approach extensively over the past couple of months
       | with GPT-4 and GPT-4o while building https://hotseatai.com. Two
       | things that helped me:
       | 
       | 1. Prompt with examples. I included an example image with an
       | example transcription as part of the prompt. This made GPT make
       | fewer mistakes and improved output accuracy.
       | 
       | 2. Confidence score. I extracted the embedded text from the PDF
       | and compared the frequency of character triples in the source
       | text and GPT's output. If there was a significant difference
       | (less than 90% overlap) I would log a warning. This helped detect
       | cases when GPT omitted entire paragraphs of text.
        
         | sidmitra wrote:
         | >frequency of character triples
         | 
         | What are character triples? Are they trigrams?
        
           | hugodutka wrote:
            | I think so. I'd normalize the text first: lowercase it and
            | remove all non-alphanumeric characters. E.g., for the
            | phrase "What now?" I'd create these trigrams: wha, hat,
            | atn, tno, now.
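            | 
            | In code, roughly (a sketch; the 90% threshold is the one
            | from the parent comment, and the two input variables are
            | placeholders):
            | 
            |     import re
            |     from collections import Counter
            | 
            |     def trigrams(text):
            |         # "What now?" -> "whatnow" -> wha, hat, atn, tno, now
            |         norm = re.sub(r"[^a-z0-9]", "", text.lower())
            |         return Counter(norm[i:i + 3] for i in range(len(norm) - 2))
            | 
            |     def overlap(source_text, ocr_output):
            |         src, out = trigrams(source_text), trigrams(ocr_output)
            |         shared = sum((src & out).values())  # multiset intersection
            |         return shared / max(sum(src.values()), 1)
            | 
            |     if overlap(embedded_pdf_text, gpt_output) < 0.9:
            |         print("warning: GPT may have omitted paragraphs")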
        
         | themanmaran wrote:
          | One option we've been testing is the `maintainFormat` mode.
         | This tries to return the markdown in a consistent format by
         | passing the output of a prior page in as additional context for
         | the next page. Especially useful if you've got tables that span
         | pages. The flow is pretty much:
         | 
         | - Request #1 => page_1_image
         | 
         | - Request #2 => page_1_markdown + page_2_image
         | 
         | - Request #3 => page_2_markdown + page_3_image
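          | 
          | As a loop, that's roughly (an illustrative sketch, not the
          | library internals; the prompts follow the ones quoted
          | elsewhere in this thread, and `page_images_b64` is a
          | placeholder):
          | 
          |     from openai import OpenAI
          | 
          |     client = OpenAI()
          | 
          |     prior_markdown = None
          |     results = []
          |     for b64 in page_images_b64:  # base64-encoded pages
          |         messages = [{"role": "system", "content":
          |                      "Convert the following PDF page to markdown. "
          |                      "Return only the markdown with no explanation text."}]
          |         if prior_markdown:
          |             # The prior page's output rides along as context
          |             messages.append({"role": "system", "content":
          |                 "Markdown must maintain consistent formatting "
          |                 f'with the following page:\n\n"""{prior_markdown}"""'})
          |         messages.append({"role": "user", "content": [{"type": "image_url",
          |             "image_url": {"url": f"data:image/png;base64,{b64}"}}]})
          |         prior_markdown = client.chat.completions.create(
          |             model="gpt-4o-mini", messages=messages,
          |         ).choices[0].message.content
          |         results.append(prior_markdown)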
        
         | nbbaier wrote:
         | > I extracted the embedded text from the PDF
         | 
          | What did you use to extract the embedded text during this
          | step, other than some other OCR tech?
        
       | josefritzishere wrote:
       | Xerox might want to have a word with you about that name.
        
       | fudged71 wrote:
        | Llama 3.1 has image support now, right? Could this be adapted
        | there as well, maybe with Groq for speed?
        
         | themanmaran wrote:
         | Yup! I want to evaluate a couple different model options over
         | time. Which should be pretty simple!
         | 
         | The main thing we're doing is converting documents to a series
         | of images, and then aggregating the response. So we should be
         | model agnostic pretty soon.
        
         | daemonologist wrote:
         | Meta trained a vision encoder (page 54 of the Llama 3.1 paper)
         | but has not released it as far as I can tell.
        
       | serjester wrote:
        | It should be noted that, for some reason, OpenAI prices
        | GPT-4o-mini image requests at the same price as GPT-4o. I have
        | a similar library, but we found OpenAI has subtle OCR
        | inconsistencies with tables (numbers will be inaccurate).
        | Gemini Flash, for all its faults, seems to do really well as a
        | replacement while being significantly cheaper.
       | 
       | Here's our pricing comparison:
       | 
        | *Gemini Pro* - $0.66 per 1k image inputs (batch) - $1.88 per
        | 1k output tokens (batch API) - 395 pages per dollar
        | 
        | *Gemini Flash* - $0.066 per 1k image inputs (batch) - $0.53
        | per 1k output tokens (batch API) - 1693 pages per dollar
        | 
        | *GPT-4o* - $1.91 per 1k image inputs (batch) - $3.75 per 1k
        | output tokens (batch API) - 177 pages per dollar
        | 
        | *GPT-4o-mini* - $1.91 per 1k image inputs (batch) - $0.30 per
        | 1k output tokens (batch API) - 452 pages per dollar
       | 
       | [1] https://community.openai.com/t/super-high-token-usage-
       | with-g...
       | 
       | [2] https://github.com/Filimoa/open-parse
        
         | themanmaran wrote:
         | Interesting. It didn't seem like gpt-4o-mini was priced the
         | same as gpt-4o during our testing. We're relying on OpenAI
         | usage page of course, which doesn't give as much request by
         | request pricing. But we didn't see any huge usage spike after
         | testing all weekend.
         | 
         | For our testing we ran a 1000 page document set, all treated as
         | images. We got to about 25M input / 0.4M output tokens for 1000
         | pages. Which would be a pretty noticeable difference based on
         | the listed token prices.
         | 
         | gpt-4o-mini => (24M/1M * $0.15) + (0.4M/1M * 0.60) = $4.10
         | 
         | gpt-4o => (24M/1M * $5.00) + (0.4M/1M * 15.00) = $126.00
        
           | serjester wrote:
           | The pricing is strange because the same images will use up
           | 30X more tokens with mini. They even show this in the pricing
           | calculator.
           | 
           | [1] https://openai.com/api/pricing/
        
             | elvennn wrote:
              | Indeed it does. But the price for the OCR's output
              | tokens is also cheaper, so in total it's still much
              | cheaper with gpt-4o-mini.
        
         | raffraffraff wrote:
         | That price compares favourably with AWS Textract. Has anyone
         | compared their performance? Because a recent post about OCR had
         | Textract at or near the top in terms of quality.
        
           | aman2k4 wrote:
            | I'm using AWS Textract for scanning grocery receipts and I
            | find it works very well and fast. Can you say which
            | performance metric you have in mind?
        
           | ianhawes wrote:
           | Can you locate that post? In my own experience, Google
           | Document AI has superior quality but I'm looking for
           | something a bit more objective and scientific.
        
       | bearjaws wrote:
       | I did this for images using Tesseract for OCR + Ollama for AI.
       | 
       | Check it out, https://cluttr.ai
       | 
       | Runs entirely in browser, using OPFS + WASM.
        
       | jimmyechan wrote:
       | Congrats! Cool project! I'd been curious about whether GPT would
       | be good for this task. Looks like this answers it!
       | 
       | Why did you choose markdown? Did you try other output formats and
       | see if you get better results?
       | 
        | Also, I wonder how HTML performs. It would be a way to handle
        | tables with groupings/merged cells.
        
         | themanmaran wrote:
         | I think that I'll add an optional configuration for HTML vs
         | Markdown. Which at the end of the day will just prompt the
         | model differently.
         | 
         | I've not seen a meaningful difference between either, except
         | when it comes to tables. It seems like HTML tends to outperform
         | markdown tables, especially when you have a lot of complexity
         | (i.e. tables within tables, lots of subheaders).
        
       | lootsauce wrote:
        | In my own experiments I have had major failures where much of
        | the text is fabricated by the LLM, to the point where I just
        | find it hard to trust even with great prompt engineering. What
        | I have been very impressed with is its ability to take medium-
        | quality OCR from Acrobat - poor formatting, lots of errors and
        | punctuation problems - and render 100% accurate, properly
        | formatted output simply by being asked to correct the OCR
        | output. This approach, using cheap traditional OCR for
        | grounding, might be a really robust and cheap option.
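        | 
        | A minimal sketch of that grounding approach (pytesseract here
        | stands in for Acrobat or any other cheap OCR):
        | 
        |     import pytesseract
        |     from openai import OpenAI
        |     from PIL import Image
        | 
        |     client = OpenAI()
        | 
        |     # Cheap, deterministic OCR pass first
        |     raw = pytesseract.image_to_string(Image.open("page.png"))
        | 
        |     # The LLM only corrects, so it has less room to fabricate
        |     fixed = client.chat.completions.create(
        |         model="gpt-4o-mini",
        |         messages=[{"role": "user", "content":
        |             "Correct the errors, punctuation, and formatting in "
        |             "this OCR output. Do not add or remove content:\n\n" + raw}],
        |     ).choices[0].message.content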
        
       | ipkstef wrote:
        | I think I'm missing something... why would I pay to OCR the
        | images when I can do it locally for free? Tesseract runs
        | pretty well on just a CPU; you wouldn't even need something
        | crazy powerful.
        
         | gregolo wrote:
          | And OpenAI uses Tesseract in the background; it sometimes
          | answers me that the Hungarian language is not installed for
          | Tesseract.
        
           | s5ma6n wrote:
            | I would be extremely surprised if that's the case. There
            | are "open-source" multimodal LLMs that can extract text
            | from images, as proof that the idea works.
            | 
            | Probably the model is hallucinating and adding "Hungarian
            | language is not installed for Tesseract" to the response.
        
         | daemonologist wrote:
         | Tesseract works great for pure label-the-characters OCR, which
         | is sufficient for books and other sources with straightforward
         | layouts, but doesn't handle weird layouts (tables, columns,
         | tables with columns in each cell, etc.) People will do
         | absolutely depraved stuff with Word and PDF documents and you
         | often need semantic understanding to decipher it.
         | 
         | That said, sometimes no amount of understanding will improve
         | the OCR output because a structure in a document cannot be
         | converted to a one-dimensional string (short of using HTML/CSS
         | or something). Maybe we'll get image -> HTML models eventually.
        
       | jagermo wrote:
        | Ohh, that could finally be a great way to get my TTRPG books
        | readable on Kindle. I'll give it a try, thanks for that.
        
       | amluto wrote:
        | My intuition is that the best solution here would be a
        | division of labor: have the big multimodal model identify
        | tables, paragraphs, etc., and output a mapping between
        | segments of the document and textual output. Then a much
        | simpler model that doesn't try to hold entire conversations
        | can process those segments into their contents.
       | 
       | This will perform worse in cases where whatever understanding the
       | large model has of the contents is needed to recognize indistinct
       | symbols. But it will avoid cases where that very same
       | understanding causes contents to be understood incorrectly due to
       | the model's assumptions of what the contents should be.
       | 
       | At least in my limited experiments with Claude, it's easy for
       | models to lose track of where they're looking on the page and to
       | omit things entirely. But if segmentation of the page is
       | explicit, one can enforce that all contents end up in exactly one
       | segment.
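        | 
        | Concretely, pass one could look something like this (a sketch
        | only; the JSON shape, `page_b64`, and `transcribe_crop` are
        | invented, and today's vision models are not reliable at
        | precise pixel coordinates):
        | 
        |     import json
        |     from openai import OpenAI
        |     from PIL import Image
        | 
        |     client = OpenAI()
        | 
        |     seg = client.chat.completions.create(
        |         model="gpt-4o",
        |         response_format={"type": "json_object"},
        |         messages=[
        |             {"role": "system", "content":
        |              'Return JSON: {"segments": [{"kind": '
        |              '"table|paragraph|figure", "box": [x0, y0, x1, y1]}]} '
        |              "covering every region of the page."},
        |             {"role": "user", "content": [{"type": "image_url",
        |              "image_url": {"url": f"data:image/png;base64,{page_b64}"}}]},
        |         ],
        |     )
        |     segments = json.loads(seg.choices[0].message.content)["segments"]
        | 
        |     # Pass two: a simpler model transcribes each crop separately
        |     page = Image.open("page.png")
        |     texts = [transcribe_crop(page.crop(tuple(s["box"])))
        |              for s in segments]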
        
       | jdthedisciple wrote:
        | Very nice, seems to work pretty well!
        | 
        | Just `maintainFormat: true` did not seem to have any effect in
        | my testing.
        
       | aman2k4 wrote:
       | I am using AWS Textract + LLM (OpenAI/Claude) to read grocery
       | receipts for <https://www.5outapp.com>
       | 
       | So far, I have collected over 500 receipts from around 10
       | countries with 30 different supermarkets in 5 different
       | languages.
       | 
       | What has worked for me so far is having control over OCR and
       | processing (for formatting/structuring) separately. I don't have
       | the figures to provide a cost structure, but I'm looking for
       | other solutions to improve both speed and accuracy. Also, I need
       | to figure out a way to put a metric around accuracy. I will
       | definitely give this a shot. Thanks a lot.
        
         | sleno wrote:
         | Cool design. FYI the "Try now" card looks like it didn't render
         | right, just seeing a blank box around the button.
        
           | aman2k4 wrote:
            | You mean in the web version? It is supposed to look like a
            | blank box in the rectangular grocery-bill shape, but I
            | suppose the design can be a bit better there. Thanks for
            | the feedback.
        
       | constantinum wrote:
       | If you want to do document OCR/PDF text extraction with decent
       | accuracy without using an LLM, do give LLMWhisperer[1] a try.
       | 
       | Try with any PDF document in the playground -
       | https://pg.llmwhisperer.unstract.com/
       | 
       | [1] - https://unstract.com/llmwhisperer/
        
       | throwthrowuknow wrote:
       | Have you compared the results to special purpose OCR free models
       | that do image to text with layout? My intuition is mini should be
       | just as good if not better.
        
       | jerrygenser wrote:
        | I would categorize Azure Document AI accuracy as high, not
        | "mid" - including handwriting. However, at $1.50/1000 pages it
        | doesn't include layout detection.
        | 
        | The $10/1000 pages model includes layout detection (headers,
        | etc.) as well as key-value pairs and checkbox detection.
        | 
        | I have continued to do proofs of concept with Gemini and GPT,
        | and in general with any new multimodal model that comes out,
        | but have found they are not on par with the checkbox detection
        | of Azure.
        | 
        | In fact, the results from Gemini/GPT-4 aren't even good enough
        | to use as a teacher for distillation of a "small" multimodal
        | model specializing in layout/checkbox detection.
        | 
        | I would also like to shout out Surya OCR, which is up and
        | coming. It's source-available and free under a certain funding
        | or revenue milestone - I think $5M. It doesn't have word-level
        | detection yet, but it's one of the more promising non-
        | hyperscaler, non-heavy-commercial OCR tools I'm aware of.
        
         | ianhawes wrote:
         | Surya OCR is great in my test use cases! Hoping to try it out
         | in production soon.
        
       | samuell wrote:
        | One problem I've not found any OCR solution to handle well is
        | complex column-based layouts in magazines. Part of the
        | difficulty may be that images often span anything from one to
        | all columns, so the text can flow in funny ways. But in this
        | day and age, this must be possible for the best AI-based tools
        | to handle?
        
       | ndr_ wrote:
        | Prompts in the background:
        | 
        |     const systemPrompt = `
        |       Convert the following PDF page to markdown.
        |       Return only the markdown with no explanation text.
        |       Do not exclude any content from the page.
        |     `;
        | 
        | For each subsequent page:
        | 
        |     messages.push({
        |       role: "system",
        |       content: `Markdown must maintain consistent formatting with
        |         the following page: \n\n """${priorPage}"""`,
        |     });
       | 
       | Could be handy for general-purpose frontend tools.
        
         | markous wrote:
          | So this is just a wrapper around gpt-4o-mini?
        
       | binalpatel wrote:
        | You can do some really cool things now with these models, like
        | asking them to extract not just the text but figures/graphs as
        | nodes/edges, and it works very well. Back when GPT-4 with
        | vision came out I tried this with a simple prompt + dumping in
        | a pydantic schema of what I wanted, and it was spot on. Pretty
        | much this (before JSON mode was supported):
        | 
        |     You are an expert in PDFs. You are helping a user extract
        |     text from a PDF.
        | 
        |     Extract the text from the image as a structured json
        |     output.
        | 
        |     Extract the data using the following schema:
        | 
        |     {Page.model_json_schema()}
        | 
        |     Example:
        |     {{
        |       "title": "Title",
        |       "page_number": 1,
        |       "sections": [
        |         ...
        |       ],
        |       "figures": [
        |         ...
        |       ]
        |     }}
        | 
        | https://binal.pub/2023/12/structured-ocr-with-gpt-vision/
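        | 
        | For reference, a Page schema along those lines might look like
        | this (guessing at the fields beyond the ones shown above):
        | 
        |     from pydantic import BaseModel
        | 
        |     class Section(BaseModel):
        |         heading: str
        |         text: str
        | 
        |     class Figure(BaseModel):
        |         caption: str
        |         description: str
        | 
        |     class Page(BaseModel):
        |         title: str
        |         page_number: int
        |         sections: list[Section]
        |         figures: list[Figure]
        | 
        |     schema = Page.model_json_schema()  # dumped into the prompt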
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:10 UTC)