[HN Gopher] Ingesting PDFs and why Gemini 2.0 changes everything
       ___________________________________________________________________
        
       Ingesting PDFs and why Gemini 2.0 changes everything
        
       Author : serjester
       Score  : 459 points
       Date   : 2025-02-05 18:05 UTC (4 hours ago)
        
 (HTM) web link (www.sergey.fyi)
 (TXT) w3m dump (www.sergey.fyi)
        
       | cedws wrote:
        | 90% accuracy +/- 10%? What could that be useful for? That's
        | awfully low.
        
         | lvzw wrote:
         | > accuracy is measured with the Needleman-Wunsch algorithm
         | 
         | > Crucially, we've seen very few instances where specific
         | numerical values are actually misread. This suggests that most
         | of Gemini's "errors" are superficial formatting choices rather
         | than substantive inaccuracies. We attach examples of these
         | failure cases below [1].
         | 
         | > Beyond table parsing, Gemini consistently delivers near-
         | perfect accuracy across all other facets of PDF-to-markdown
         | conversion.
         | 
         | That seems fairly useful to me, no? Maybe not for mission
         | critical applications, but for a lot of use cases, this seems
         | to be good enough. I'm excited to try these prompts on my own
         | later.
        
         | schainks wrote:
         | This is "good enough" for Banks to use when doing due
         | diligence. You'd be surprised how much noise is in the system
         | with the current state of the art: algorithms/web scrapers and
         | entire buildings of humans in places like India.
        
           | ai-christianson wrote:
           | It's certainly pretty useful for discovery/information
           | filtering purposes. I.e. searching for signal in the noise if
           | you have a large dataset.
        
           | jjtheblunt wrote:
           | due diligence of this sort?
           | 
           | https://en.wikipedia.org/wiki/Know_your_customer
        
         | MattDaEskimo wrote:
         | Switching from manual data entry to approval
        
         | summerlight wrote:
         | I guess 90% is for "benchmark", which is typically tailored to
         | be challenging to parse.
        
         | serjester wrote:
         | Author here -- measuring accuracy in table parsing is
         | surprisingly challenging. Subtle, almost imperceptible
         | differences in how a table is parsed may not affect the
         | reader's understanding but can significantly impact benchmark
         | performance. For all practical purposes, I'd say it's near
         | perfect (also keep in mind the benchmark is on _very_
         | challenging tables).
        
         | raunakchowdhuri wrote:
         | would encourage you to take a look at some of the real data
         | here! https://huggingface.co/spaces/reducto/rd_table_bench
         | 
         | you'll find that most of the errors here are structural issues
         | with the table or inability to parse some special characters.
         | tables can get crazy!
        
         | mattnewton wrote:
         | having seen some of these tables, I would guess that's probably
          | above a layperson's score. Some are very complicated or just
         | misleadingly structured.
        
       | Havoc wrote:
        | Been toying with the flash model. Not the top model, but I think
        | it'll see plenty of use due to the details. It wins on things
        | other than being at the top of benchmark logs:
       | 
       | * Generous free tier
       | 
       | * Huge context window
       | 
       | * Lite version feels basically instant
       | 
       | However
       | 
       | * Lite model seems more prone to repeating itself / looping
       | 
        | * Very confusing naming, e.g. {model}-latest worked for 1.5 but
        | now it's {model}-001? The lite has a date appended, the non-lite
        | does not. Then there is exp and thinking exp...which has a date.
        | wut?
        
         | ai-christianson wrote:
         | > * Huge context window
         | 
         | But how well does it actually handle that context window? E.g.
         | a lot of models support 200K context, but the LLM can only
         | really work with ~80K or so of it before it starts to get
         | confused.
        
           | summerlight wrote:
           | My experience is that Gemini works relatively well on larger
           | contexts. Not perfect, but more reliable.
        
           | Havoc wrote:
           | I'm sure someone will do a haystack test, but from my casual
           | testing it seems pretty good
        
           | asadm wrote:
            | it works REALLY well. I have used it to dump lots of reference
            | code and then help me write new modules etc. I have gone
            | up to 200k tokens I think with no problems in recall.
        
             | ai-christianson wrote:
             | Awesome. Models that can usefully leverage such large
             | context windows are rare at this point.
             | 
             | Something like this opens up a lot of use cases.
        
           | f38zf5vdt wrote:
           | It works okay out to roughly 20-40k tokens. Once the window
           | gets larger than that, it degrades significantly. You can
           | needle in the haystack out to that distance, but asking it
           | for multiple things from the document leads to hallucinations
           | for me.
           | 
           | Ironic, but GPT4o works better for me at longer contexts
           | <128k than Gemini 2.0 flash. And out to 1m is just hopeless,
           | even though you can do it.
        
           | llm_nerd wrote:
           | There is the needle in the haystack measure which is, as you
           | probably guessed, hiding a small fact in a massive set of
           | tokens and asking it to recall it.
           | 
           | Recent Gemini models actually do extraordinarily well.
           | 
           | https://cloud.google.com/blog/products/ai-machine-
           | learning/t...
        
       | daemonologist wrote:
       | I wonder how this compares to open source models (which might be
       | less accurate but even cheaper if self-hosted?), e.g. Llama 3.2.
       | I'll see if I can run the benchmark.
       | 
       | Also regarding the failure case in the footnote, I think Gemini
       | actually got that right (or at least outperformed Reducto) - the
       | original document seems to have what I call a "3D" table where
       | the third axis is rows _within_ each cell, and having multiple
       | headers is probably the best approximation in Markdown.
        
         | mediaman wrote:
         | Everything I tried previously had very disappointing results. I
         | was trying to get rid of Azure's DocumentIntelligence, which is
         | kind of expensive at scale. The models could often output a
         | portion of a table, but it was nearly impossible to get them to
         | produce a structured output of a large table on a single page;
         | they'd often insert "...rest of table follows" and similar
         | terminations, regardless of different kinds of prompting.
         | 
         | Maybe incremental processing of chunks of the table would have
         | worked, with subsequent stitching, but if Gemini can just
         | process it that would be pretty good.
        
       | fecal_henge wrote:
        | Is there an AI platform where I can paste a snip of a graph and
        | it will generate an n-th order polynomial regression of the
        | trace for me?
        
         | CamperBob2 wrote:
         | Either ChatGPT o4 or one of the newer Google models should
         | handle that, since it's a pretty common task. Actually there
         | have been online curve fitters for several years that work
         | pretty well without AI, such as https://curve.fit/ and
         | https://www.standardsapplied.com/nonlinear-curve-fitting-cal...
         | .
         | 
         | I'd probably try those first, since otherwise you're depending
         | on the language model to do the right thing automagically.
        
         | potatoman22 wrote:
         | I've had decent luck using some of the reasoning models for
         | this. It helps if you task them with identifying where the
         | points on the graph are first.
        
       | bt3 wrote:
       | One major takeaway that matches my own investigation is that
       | Gemini 2.0 still materially struggles with bounding boxes on
       | digital content. Google has published[1] some great material on
       | spatial understanding and bounding boxes on photography, but
       | identifying sections of text or digital graphics like icons in a
       | presentation is still very hit and miss.
       | 
       | --
       | 
       | [1]: https://github.com/google-
       | gemini/cookbook/blob/a916686f95f43...
        
         | maeil wrote:
         | Have you seen any models that perform better at this? I last
         | looked into this a year ago but at the time they were indeed
         | quite bad at it across the board.
        
       | scottydelta wrote:
       | This is what I am trying to figure out how to solve.
       | 
       | My problem statement is:
       | 
        | - Ingest PDFs, summarize, and extract important information.
       | 
       | - Have some way to overlay the extracted information on the pdf
       | in the UI.
       | 
       | - User can provide feedback on the overlaid info by accepting or
       | rejecting the highlights as useful or not.
       | 
        | - This info goes back into the model for reinforcement learning.
       | 
       | Hoping to find something that can make this more manageable.
        
         | baxtr wrote:
         | Have you tried cursor or replit for this?
        
         | cccybernetic wrote:
         | Most PDF parsers give you coordinate data (bounding boxes) for
         | extracted text. Use these to draw highlights over your PDF
         | viewer - users can then click the highlights to verify if the
         | extraction was correct.
         | 
         | The tricky part is maintaining a mapping between your LLM
         | extractions and these coordinates.
         | 
         | One way to do it would be with two LLM passes:
         | 1. First pass: Extract all important information from the PDF
         | 2. Second pass: "Hey LLM, find where each extraction appears in
         | these bounded text chunks"
         | 
         | Not the cheapest approach since you're hitting the API twice,
         | but it's straightforward!
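          | 
          | For the overlay itself, a minimal sketch with PyMuPDF (the
          | file name and box coordinates are just placeholders; boxes
          | are (x0, y0, x1, y1) in page points):
          | 
          |   import fitz  # PyMuPDF
          | 
          |   doc = fitz.open("contract.pdf")
          |   page = doc[0]
          |   # boxes you got back from your parser / LLM mapping step
          |   boxes = [(72, 100, 300, 115), (72, 140, 520, 170)]
          |   for box in boxes:
          |       page.add_highlight_annot(fitz.Rect(*box))
          |   doc.save("contract_highlighted.pdf")
          | 
          | Click handling then just means hit-testing the same rectangles
          | in whatever viewer you render the page with.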
        
           | Jimmc414 wrote:
            | Here's a PR that's not accepted yet for some reason and that
            | seems to be having some success with the bounding boxes:
           | 
           | https://github.com/getomni-ai/zerox/pull/44
           | 
           | Related to
           | 
           | https://github.com/getomni-ai/zerox/issues/7
        
       | exabrial wrote:
       | You know what'd be fucking nice? The ability to turn Gemini off.
        
       | fngjdflmdflg wrote:
       | >Unfortunately Gemini really seems to struggle on this, and no
       | matter how we tried prompting it, it would generate wildly
       | inaccurate bounding boxes
       | 
        | This is what I have found as well. From what I've read, LLMs do
        | not work well with fine image details because the image
        | encoders are too lossy. (No idea if this is actually
       | correct.) For now I guess you can use regular OCR to get bounding
       | boxes.
        
         | minimaxir wrote:
         | Modern multimodal encoders for LLMs are fine/not lossy since
         | they do not resize to a small size and can handle arbitrary
         | sizes, although some sizes are obviously better represented in
         | the training set. A 8.5" x 11" paper would be common.
         | 
         | I suspect the issue is prompt engineering related.
         | 
         | > Please provide me strict bounding boxes that encompasses the
         | following text in the attached image? I'm trying to draw a
         | rectangle around the text.
         | 
         | > - Use the top-left coordinate system
         | 
         | > - Values should be percentages of the image width and height
         | (0 to 1)
         | 
          | LLMs have enough trouble with integers (since, token-wise,
          | integers and the text representation of integers are the same);
          | high-precision decimals will be even worse. It might be better
         | to reframe the problem as "this input document is 850 px x 1100
         | px, return the bounding boxes as integers" then parse and
         | calculate the decimals later.
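          | 
          | As a rough sketch of what I mean (page size and box values are
          | made up), the post-processing is trivial:
          | 
          |   # model was asked for integer pixel boxes on an 850x1100 page
          |   WIDTH, HEIGHT = 850, 1100
          | 
          |   def to_percentages(box):
          |       x0, y0, x1, y1 = box  # integers from the model
          |       return (x0 / WIDTH, y0 / HEIGHT,
          |               x1 / WIDTH, y1 / HEIGHT)
          | 
          |   print(to_percentages((85, 110, 425, 220)))
          |   # -> (0.1, 0.1, 0.5, 0.2)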
        
           | fngjdflmdflg wrote:
           | Just tried this and it did not appear to work for me. Prompt:
           | 
           | >Please provide me strict bounding boxes that encompasses the
           | following text in the attached image? I'm trying to draw a
           | rectangle around the text.
           | 
           | > - Use the top-left coordinate system
           | 
           | >this input document is 1080 x 1236 px. return the bounding
           | boxes as integers
        
             | minimaxir wrote:
             | "Might" being the operative word, particularly with models
             | that have less prompt adherence. There's a few other prompt
             | massaging tricks beyond the scope of a HN comment, the
             | decimal issue is just one optimization.
        
             | BoorishBears wrote:
             | https://github.com/google-
             | gemini/cookbook/blob/a916686f95f43...
             | 
              | They say there's no magic prompt but I'd start with their
              | default, since there is usually _some_ format used during
              | post-training to improve performance on tasks like this
        
       | coderstartup wrote:
       | Following this post
        
       | cubefox wrote:
       | Why is Gemini Flash so much cheaper than other models here?
        
         | mattnewton wrote:
         | probably a mix of economies of scale (google workspace and
         | search are already massive customers of these models meaning
         | the build out is already there), and some efficiency dividends
         | from hardware r&d (google has developed the model and the TPU
         | hardware purpose built to run it almost in parallel)
        
       | lazypenguin wrote:
       | I work in fintech and we replaced an OCR vendor with Gemini at
       | work for ingesting some PDFs. After trial and error with
       | different models Gemini won because it was so darn easy to use
       | and it worked with minimal effort. I think one shouldn't
       | underestimate that multi-modal, large context window model in
       | terms of ease-of-use. Ironically this vendor is the best known
       | and most successful vendor for OCR'ing this specific type of PDF
       | but many of our requests failed over to their human-in-the-loop
        | process. Despite it not being their specialization, switching to
       | Gemini was a no-brainer after our testing. Processing time went
       | from something like 12 minutes on average to 6s on average,
       | accuracy was like 96% of that of the vendor and price was
       | significantly cheaper. For the 4% inaccuracies a lot of them are
       | things like the text "LLC" handwritten would get OCR'd as "IIC"
       | which I would say is somewhat "fair". We probably could improve
       | our prompt to clean up this data even further. Our prompt is
       | currently very simple: "OCR this PDF into this format as
       | specified by this json schema" and didn't require some fancy
       | "prompt engineering" to contort out a result.
       | 
       | Gemini developer experience was stupidly easy. Easy to add a file
       | "part" to a prompt. Easy to focus on the main problem with
       | weirdly high context window. Multi-modal so it handles a lot of
       | issues for you (PDF image vs. PDF with data), etc. I can
       | recommend it for the use case presented in this blog (ignoring
       | the bounding boxes part)!
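        | 
        | For anyone curious, a minimal sketch of that kind of call with
        | the google-generativeai Python SDK (file name and schema are
        | placeholders, not our real prompt) looks something like:
        | 
        |   import google.generativeai as genai
        | 
        |   genai.configure(api_key="...")
        |   model = genai.GenerativeModel("gemini-2.0-flash")
        | 
        |   pdf = genai.upload_file("statement.pdf")  # the file "part"
        |   schema = ('{"type": "object", "properties": '
        |             '{"company": {"type": "string"}}}')
        |   cfg = {"response_mime_type": "application/json"}
        |   resp = model.generate_content(
        |       [pdf, "OCR this PDF per this JSON schema: " + schema],
        |       generation_config=cfg,
        |   )
        |   print(resp.text)  # JSON string matching the schema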
        
         | cess11 wrote:
         | What hardware are you using to run it?
        
           | kccqzy wrote:
           | The Gemini model isn't open so it does not matter what
           | hardware you have. You might have confused Gemini with Gemma.
        
             | cess11 wrote:
             | OK, I see, pity. I'm interested in similar applications but
             | in contexts where the material is proprietary and might
             | contain PII.
        
         | panarky wrote:
         | This is a big aha moment for me.
         | 
         | If Gemini can do semantic chunking at the same time as
         | extraction, all for so cheap and with nearly perfect accuracy,
         | and without brittle prompting incantation magic, this is huge.
        
           | potatoman22 wrote:
           | Small point but is it doing semantic chunking, or loading the
           | entire pdf into context? I've heard mixed results on semantic
           | chunking.
        
             | panarky wrote:
             | It loads the entire PDF into context, but then it would be
             | my job to chunk the output for RAG, and just doing
             | arbitrary fixed-size blocks, or breaking on sentences or
             | paragraphs is not ideal.
             | 
             | So I can ask Gemini to return chunks of variable size,
             | where each chunk is a one complete idea or concept, without
             | arbitrarily chopping a logical semantic segment into
             | multiple chunks.
        
               | thelittleone wrote:
                | Fixed-size chunking is holding back a bunch of RAG
                | projects on my backlog. Will be extremely pleased if this
                | semantic chunking solves the issue. Currently we're
                | getting around 78-82% success on fixed-size-chunked RAG,
                | which is far too low. Users assume zero results on a RAG
                | search equates to zero results in the source data.
        
               | refulgentis wrote:
               | FWIW, you might be doing it / ruled it out already:
               | 
               | - BM25 to eliminate the 0 results in source data problem
               | 
               | - Longer term, a peek at Gwern's recent hierarchical
               | embedding article. Got decent early returns even with
               | fixed size chunks
        
               | thelittleone wrote:
               | Much appreciated.
               | 
               | For others interested in BM25 for the use case above, I
               | found this thread informative.
               | 
               | https://news.ycombinator.com/item?id=41034297
        
               | mediaman wrote:
               | Agree, BM25 honestly does an amazing job on its own
               | sometimes, especially if content is technical.
               | 
               | We use it in combination with semantic but sometimes turn
               | off the semantic part to see what happens and are
               | surprised with the robustness of the results.
               | 
               | This would work less well for cross-language or less
               | technical content, however. It's great for acronyms,
               | company or industry specific terms, project names,
               | people, technical phrases, and so on.
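                | 
                | A tiny sketch of the keyword-only side with the
                | rank_bm25 package (the corpus and query here are
                | made up):
                | 
                |   from rank_bm25 import BM25Okapi
                | 
                |   docs = ["invoice from ACME LLC dated 2024",
                |           "purchase order PO-1138 for ACME"]
                |   tokenized = [d.lower().split() for d in docs]
                |   bm25 = BM25Okapi(tokenized)
                |   print(bm25.get_scores("acme invoice".split()))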
        
               | Tostino wrote:
               | I wish we had a local model for semantic chunking. I've
               | been wanting one for ages, but haven't had the time to
               | make a dataset and finetune that task =/.
        
           | fallinditch wrote:
           | If I used Gemini 2.0 for extraction and chunking to feed into
           | a RAG that I maintain on my local network, then what sort of
           | locally-hosted LLM would I need to gain meaningful insights
           | from my knowledge base? Would a 13B parameter model be
           | sufficient?
        
             | jhoechtl wrote:
              | Your local model has little more to do but stitch the
              | already meaningful pieces together.
             | 
             | The pre-step, chunking and semantic understanding is all
             | that counts.
        
         | faxmeyourcode wrote:
         | I've been fighting trying to chunk SEC filings properly,
         | specifically surrounding the strange and inconsistent tabular
         | formats present in company filings.
         | 
         | This is giving me hope that it's possible.
        
           | otoburb wrote:
           | >> _I 've been fighting trying to chunk SEC filings properly,
           | specifically surrounding the strange and inconsistent tabular
           | formats present in company filings._
           | 
           | For this specific use case you can also try edgartools[1]
           | which is a library that was relatively recently released that
           | ingests SEC submissions and filings. They don't use OCR but
           | (from what I can tell) directly parse the XBRL documents
           | submitted by companies and stored in EDGAR, if they exist.
           | 
           | [1] https://github.com/dgunning/edgartools
        
           | barrenko wrote:
           | If you'd kindly tl;dr the chunking strategies you have tried
           | and what works best, I'd love to hear.
        
           | anirudhb99 wrote:
           | (from the gemini team) we're working on it! semantic chunking
           | & extraction will definitely be possible in the coming
           | months.
        
           | jgalt212 wrote:
           | isn't everyone on iXBRL now? Or are you struggling with
           | historical filings?
        
             | faxmeyourcode wrote:
             | XBRL is what I'm using currently, but it's still kind of a
             | mess (maybe I'm just bad at it) for some of the non-
             | standard information that isn't properly tagged.
        
         | yzydserd wrote:
          | How do today's LLMs like Gemini compare with the Document
         | Understanding services google/aws/azure have offered for a few
         | years, particularly when dealing with known forms? I think
         | Google's is Document AI.
        
           | zacmps wrote:
           | I've found the highest accuracy solution is to OCR with one
           | of the dedicated models then feed that text and the original
           | image into an LLM with a prompt like:
           | 
           | "Correct errors in this OCR transcription".
        
             | bradfox2 wrote:
             | This is what we do today. Have you tried it against Gemini
             | 2.0?
        
             | therein wrote:
             | How does it behave if the body of text is offensive or what
             | if it is talking about a recipe to purify UF-6 gas at home?
             | Will it stop doing what it is doing and enter lecturing
             | mode?
             | 
              | I am asking not to be cynical but because, in my limited
              | experience, using LLMs for any task that may operate on
              | offensive or unknown input tends to trigger all sorts of
              | unpredictable moral judgements, and the model gets dragged
              | into generating output I didn't want at all.
              | 
              | If I ask this black box to give me a JSON output containing
              | keywords for a certain text and the text happens to be
              | offensive, it refuses to do that.
              | 
              | How does one tackle that?
        
               | zacmps wrote:
               | It's not something I've needed to deal with personally.
               | 
               | We have run into added content filters in Azure OpenAI on
               | a different application, but we just put in a request to
               | tune them down for us.
        
               | xnx wrote:
               | There are many settings for changing the safety level in
               | Gemini API calls: https://ai.google.dev/gemini-
               | api/docs/safety-settings
        
               | sumedh wrote:
               | Try setting the safety params to none and see if that
               | makes any difference.
        
           | ajcp wrote:
            | GCP's Document AI service is now literally just a UI layer
            | specific to document parsing use-cases, backed by Gemini
            | models. When we realized that, we dumped it and just use
            | Gemini directly.
        
         | depr wrote:
         | So are you mostly processing PDFs with data? Or PDFs with just
         | text, or images, graphs?
        
           | thelittleone wrote:
           | Not the parent, but we process PDFs with text, tables,
           | diagrams. Works well if the schema is properly defined.
        
         | sensecall wrote:
         | Out of interest, did you parse into any sort of defined
         | schema/structure?
        
           | gnat wrote:
           | Parent literally said so ...
           | 
           | > Our prompt is currently very simple: "OCR this PDF into
           | this format as specified by this json schema" and didn't
           | require some fancy "prompt engineering" to contort out a
           | result.
        
         | bionhoward wrote:
          | The Gemini API has a customer noncompete, so it's not an option
          | for AI work. What are you working on that doesn't compete with
          | AI?
        
           | B-Con wrote:
           | You do realize most people aren't working on AI, right?
           | 
           | Also, OP mentioned fintech at the outset.
        
           | novaleaf wrote:
           | what doesn't compete with ai?
        
         | xnx wrote:
         | Your OCR vendor would be smart to replace their own system with
         | Gemini.
        
           | _the_inflator wrote:
           | With "Next generation, extremely sophisticated AI" to be
           | precise, I wait say. ;)
           | 
           | Marketing joke aside, maybe a hybrid approach could serve the
           | vendor well. Best of both worlds if it reaps benefits or even
           | have a look at hugging face for even more specialized aka
           | better LLMs.
        
         | itissid wrote:
          | Wait, isn't there at least a two-step process here - semantic
          | segmentation followed by a method like Textract for text - to
          | avoid hallucinations?
         | 
         | One cannot possibly say that "Text extracted by a multimodal
         | model cannot hallucinate"?
         | 
         | > accuracy was like 96% of that of the vendor and price was
         | significantly cheaper.
         | 
          | I would like to know how this 96% was tested. If you use a
          | human to do random-sample-based testing, how do you adjust
          | the sample for variations in the distribution of errors -
          | e.g. a small set of documents could contain 90% of the errors
          | and yet be only 1% of the docs?
        
           | itissid wrote:
            | For an OCR company I imagine it is unconscionable to do this,
            | because if you were doing, say, OCR for an oral history
            | project for a library and you made hallucination errors,
            | well, you've replaced facts with fiction. Rewriting history?
            | What the actual F.
        
           | themanmaran wrote:
           | One thing people always forget about traditional OCR
           | providers (azure, tesseract, aws textract, etc.) is that
           | they're ~85% accurate.
           | 
           | They are all probabilistic. You literally get back characters
           | + confidence intervals. So when textract gives you back
           | incorrect characters, is that a hallucination?
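            | 
            | You can see those per-word confidences directly with
            | tesseract via pytesseract, for example (rough sketch):
            | 
            |   import pytesseract
            |   from pytesseract import Output
            |   from PIL import Image
            | 
            |   data = pytesseract.image_to_data(
            |       Image.open("page.png"), output_type=Output.DICT)
            |   for word, conf in zip(data["text"], data["conf"]):
            |       if word.strip() and float(conf) < 60:
            |           print(word, conf)  # low-confidence guesses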
        
             | somebehemoth wrote:
             | I know nothing about OCR providers. It seems like OCR
             | failure would result in gibberish or awkward wording that
              | might be easy to spot. Doesn't the LLM failure mode assert
              | made-up truths eloquently, which are more difficult to spot?
        
             | anon373839 wrote:
             | It's a question of scale. When a traditional OCR system
             | makes an error, it's confined to a relatively small part of
             | the overall text. (Think of "Plastics" becoming
             | "PIastics".) When a LLM hallucinates, there is no limit to
             | how much text can be made up. Entire sentences can be
             | rewritten because the model thinks they're more plausible
             | than the sentences that were actually printed. And because
             | the bias is always toward plausibility, it's an especially
             | insidious problem.
        
             | kapitalx wrote:
             | I'm the founder of https://doctly.ai, also pdf extraction.
             | 
             | The hallucination in LLM extraction is much more subtle as
             | it will rewrite full sentences sometimes. It is much harder
             | to spot when reading the document and sounds very
             | plausible.
             | 
             | We're currently working on a version where we send the
             | document to two different LLMs, and use a 3rd to increase
             | confidence. That way you have the option of trading compute
             | and cost for accuracy.
        
             | Scoundreller wrote:
             | > You literally get back characters + confidence intervals.
             | 
             | Oh god, I wish speech to text engines would colour code the
             | whole thing like a heat map to focus your attention to
             | review where it may have over-enthusiastically guessed at
             | what was said.
             | 
             | You no knot.
        
           | basch wrote:
            | Wouldn't the temperature on something like OCR be very low?
            | You want the same result every time. Isn't some part of
            | hallucination due to the randomness of temperature?
        
       | ein0p wrote:
       | > Why Gemini 2.0 Changes Everything
       | 
       | Clickbait. It doesn't change "everything". It makes ingestion for
       | RAG much less expensive (and therefore feasible in a lot more
       | scenarios), at the expense of ~7% reduction in accuracy. Accuracy
       | is already rather poor even before this, however, with the top
       | alternative clocking in at 0.9. Gemini 2.0 is 0.84, although the
       | author seems to suggest that the failure modes are mostly around
       | formatting rather than e.g. mis-recognition or hallucinations.
       | 
       | TL;DR: is this exciting? If you do RAG, yes. Does it "change
       | everything" nope. There's still a very long way to go. Protip for
       | model designers: accuracy is always in greater demand than
       | performance. A slow model that solves the problem is invariably
       | better than a fast one that fucks everything up.
        
         | rvz wrote:
         | In this use-case, accuracy is non-negotiable with zero room for
         | any hallucination.
         | 
         | Overall it changes nothing.
        
       | ChrisArchitect wrote:
       | Related:
       | 
       |  _Gemini 2.0 is now available to everyone_
       | 
       | https://news.ycombinator.com/item?id=42950454
        
       | cccybernetic wrote:
       | Shameless plug: I'm working on a startup in this space.
       | 
       | But the bounding box problem hits close to home. We've found
       | Unstructured's API gives pretty accurate box coordinates, and
       | with some tweaks you can make them even better. The tricky part
       | is implementing those tweaks without burning a hole in your
       | wallet.
        
       | sho_hn wrote:
        | Remember all the hyperbole a year ago about how Google was
        | failing and it was all over?
        
         | latexr wrote:
         | Anyone who cries "<service> is dead" after some new technology
         | is introduced is someone you can safely ignore. For ever.
         | They're hyperbolic clout chasers who will only ever be right by
         | mistake.
         | 
         | As if, when ChatGPT was introduced, Google would just stay
         | still, cross their arms, and say "well, this is based on our
         | research paper but there's nothing we can do, going to just
         | roll over and wait for billions of dollars to run out, we're
         | truly doomed". So unbelievably stupid.
        
       | pockmarked19 wrote:
       | Now, I could look at this relatively popular post about Google
       | and revise my opinion of HN as an echo chamber, but I'm afraid
       | it's just that the downvote loving HNers weren't able to make the
       | cognitive leap from Gemini to Google.
        
       | resource_waste wrote:
       | Google's models have historically been total disappointments
        | compared to chatGPT4. Worse quality, won't answer medical
       | questions either.
       | 
       | I suppose I'll try it again, for the 4th or 5th time.
       | 
       | This time I'm not excited. I'm expecting it to be a letdown.
        
       | beklein wrote:
       | Great article, I couldn't find any details about the prompt...
       | only the snippets of the `CHUNKING_PROMPT` and the
       | `GET_NODE_BOUNDING_BOXES_PROMPT`.
       | 
        | Is there any code example with a full prompt available from
       | OP, or are there any references (such as similar GitHub repos)
       | for those looking to get started within this topic?
       | 
       | Your insights would be highly appreciated.
        
       | nickandbro wrote:
       | I think very soon a new model will destroy whatever startups and
       | services are built around document ingestion. As in a model that
       | can take in a pdf page as a image and transcribe it to text with
       | near perfect accuracy.
        
         | depr wrote:
          | I think Azure Document Intelligence, Google Document AI, and
          | Amazon Textract are among the best, if not the best, services
          | though, and they offer these models.
        
         | layer8 wrote:
         | Extracting plain text isn't that much of a problem, relatively
         | speaking. It's interpreting more complex elements like nested
         | lists, tables, side bars, footnotes/endnotes, cross-references,
         | images and diagrams where things get challenging.
        
       | matthest wrote:
       | This is completely tangential, but does anyone know if AI is
       | creating any new jobs?
       | 
       | Thinking of the OCR vendors who get replaced. Where might they
       | go?
       | 
       | One thing I can think of is that AI could help the space industry
       | take off. But wondering if there are any concrete examples of new
       | jobs being created.
        
       | rjurney wrote:
       | I've been using NotebookLM powered by Gemini 2.0 for three
       | projects and it is _really powerful_ for comprehending large
       | corpuses you can't possibly read and thinking informed by all
       | your sources. It has solid Q&A. When you ask a question or get a
       | summary you like [which often happens] you can save it as a new
       | note, putting it into the corpus for analysis. In this way your
       | conclusions snowball. Yes, this experience actually happens and
       | it is beautiful.
       | 
       | I've tried Adobe Acrobat AI for this and it doesn't work yet.
       | NotebookLM is it. The grounding is the reason it works - you can
       | easily click on anything and it will take you to the source to
       | verify it. My only gripe is that the visual display of the source
       | material is _dogshit ugly_, like exceptionally so. Big blog pink
       | background letters in lines of 24 characters! :) It has trouble
       | displaying PDF columns, but at least it parses them. The ugly
       | will change I'm sure :)
       | 
       | My projects are setup to let me bridge the gaps between the
       | various sources and synthesize something more. It helps to have a
       | goal and organize your sources around that. If you aren't
       | focused, it gets confused. You lay the groundwork in sources and
       | it helps you reason. It works so well I feel _tender_ towards it
       | :) Survey papers provide background then you add specific sources
       | in your area of focus. You can write a profile for how you would
       | like NotebookLM to think - which REALLY helps out.
       | 
       | They are:
       | 
        | * The Stratigrapher - A Lovecraftian short story about the
        | world's first city. All of Seton Lloyd/Faud Safar's work on
        | Eridu. Various sources on Sumerian culture and religion. All of
        | Lovecraft's work and letters. Various sources about opium. Some
        | articles about nonlinear geometries.
        | 
        | * FPGA Accelerated Graph Analytics - An introduction to Verilog.
        | Papers on FPGAs and graph analytics. Papers on Apache Spark
        | architecture. Papers on GraphFrames and a related rant I created
        | about it and graph DBs. A source on Spark-RAPIDS. Papers on
        | subgraph matching, graphlets, network motifs. Papers on random
        | graph models.
       | 
       | * Graph machine learning notebook without a specific goal, which
       | has been less successful. It helps to have a goal for the
       | project. It got confused by how broad my sources were.
       | 
       | I would LOVE to share my projects with you all, but you can only
       | share within a Google Workspaces domain. It will be AWESOME when
       | they open this thing up :)
        
       | ratedgene wrote:
       | Is this something we can run locally? if so what's the license?
        
         | xnx wrote:
         | Gemini are Google cloud/service models. Gemma are the Google
         | local models.
        
       | __jl__ wrote:
       | The numbers in the blog post seem VERY inaccurate.
       | 
       | Quick calculation: Input pricing: Image input in 2.0 Flash is
       | $0.0001935. Let's ignore the prompt. Output pricing: Let's assume
        | 500 tokens per page, which is $0.0003.
       | 
       | Cost per page: $0.0004935
       | 
       | That means 2,026 pages per dollar. Not 6,000!
       | 
       | Might still be cheaper than many solutions but I don't see where
       | these numbers are coming from.
       | 
       | By the way, image input is much more expensive in Gemini 2.0 even
       | for 2.0 Flash Lite.
       | 
       | Edit: The post says batch pricing, which would be 4k pages based
       | on my calculation. Using batch pricing is pretty different
       | though. Great if feasible but not practical in many contexts.
        
         | serjester wrote:
          | Correct, it's with batched Vertex pricing, with slightly lower
          | output tokens per page since a lot of pages are somewhat empty
          | in real-world docs - I wanted a fair comparison to providers
          | that charge per page.
          | 
          | Regardless of what assumptions you use, it's still an order of
          | magnitude (or more) improvement over anything else.
        
       | nothrowaways wrote:
       | Cool
        
       | anirudhb99 wrote:
       | thanks a ton for all the amazing feedback on this thread! if
       | 
       | (a) you have document understanding use cases that you'd like to
       | use gemini for (the more aspirational the better) and/or
       | 
       | (b) there are loss cases for which gemini doesn't work well
       | today,
       | 
       | please feel free to email anirudhbaddepu@google.com and we'd love
       | to help get your use case working & improve quality for our next
       | series of model updates!
        
       | raunakchowdhuri wrote:
       | CTO of Reducto here. Love this writeup!
       | 
       | We've generally found that Gemini 2.0 is a great model and have
       | tested this (and nearly every VLM) very extensively.
       | 
       | A big part of our research focus is incorporating the best of
       | what new VLMs offer without losing the benefits and reliability
       | of traditional CV models. A simple example of this is we've found
       | bounding box based attribution to be a non-negotiable for many of
       | our current customers. Citing the specific region in a document
       | where an answer came from becomes (in our opinion) even MORE
       | important when using large vision models in the loop, as there is
       | a continued risk of hallucination.
       | 
       | Whether that matters in your product is ultimately use case
       | dependent, but the more important challenge for us has been
       | reliability in outputs. RD-TableBench currently uses a single
       | table image on a page, but when testing with real world dense
       | pages we find that VLMs deviate more. Sometimes that involves
       | minor edits (summarizing a sentence but preserving meaning), but
       | sometimes it's a more serious case such as hallucinating large
       | sets of content.
       | 
       | The more extreme case is that internally we fine tuned a version
       | of Gemini 1.5 along with base Gemini 2.0, specifically for
       | checkbox extraction. We found that even with a broad distribution
       | of checkbox data we couldn't prevent frequent checkbox
       | hallucination on both the flash (+17% error rate) and pro model
       | (+8% error rate). Our customers in industries like healthcare
       | expect us to get it right, out of the box, deterministically, and
       | our team's directive is to get as close as we can to that ideal
       | state.
       | 
       | We think that the ideal state involves a combination of the two.
       | The flexibility that VLMs provide, for example with cases like
       | handwriting, is what I think will make it possible to go from 80
        | or 90 percent accuracy to some number very close to 99%. I should
       | note that the Reducto performance for table extraction is with
       | our pre-VLM table parsing pipeline, and we'll have more to share
       | in terms of updates there soon. For now, our focus is entirely on
       | the performance frontier (though we do scale costs down with
       | volume). In the longer term as inference becomes more efficient
       | we want to move the needle on cost as well.
       | 
       | Overall though, I'm very excited about the progress here.
       | 
        | --- One small comment on your footnote: the evaluation script
        | with the Needleman-Wunsch algorithm doesn't actually consider the
        | headers output by the models and looks only at the table
        | structure itself.
        
         | noja wrote:
         | > deterministically
         | 
         | How are you planning to do this?
        
       | jbarrow wrote:
       | > Unfortunately Gemini really seems to struggle on this, and no
       | matter how we tried prompting it, it would generate wildly
       | inaccurate bounding boxes
       | 
       | Qwen2.5 VL was trained on a special HTML format for doing OCR
       | with bounding boxes. [1] The resulting boxes aren't quite as
       | accurate as something like Textract/Surya, but I've found they're
       | much more accurate than Gemini or any other LLM.
       | 
       | [1] https://qwenlm.github.io/blog/qwen2.5-vl/
        
       | xena wrote:
       | I really wish that Google made an endpoint that's compatible with
       | the OpenAI API. That'd make trying Gemini in existing flows so
       | much easier.
        
         | kurtoid wrote:
         | Is that not this? https://ai.google.dev/api/compatibility
        
         | myko wrote:
         | I believe this is already the case, at least the Python
         | libraries are compatible, if not recommended for more than just
         | trying things out:
         | 
         | https://ai.google.dev/gemini-api/docs/openai
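          | 
          | A minimal sketch with the OpenAI client pointed at the
          | compatibility endpoint from those docs (key and model name
          | are placeholders):
          | 
          |   from openai import OpenAI
          | 
          |   client = OpenAI(
          |       api_key="GEMINI_API_KEY",
          |       base_url="https://generativelanguage.googleapis.com"
          |                "/v1beta/openai/",
          |   )
          |   resp = client.chat.completions.create(
          |       model="gemini-2.0-flash",
          |       messages=[{"role": "user", "content": "Say hello"}],
          |   )
          |   print(resp.choices[0].message.content)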
        
           | msp26 wrote:
           | How well do they work when you want to do things like
           | grounding with search?
        
       | devmor wrote:
       | I think this is one of the few functional applications of LLMs
       | that is really undeniably useful.
       | 
       | OCR has always been "untrustworthy" (as in you cannot expect it
       | to be 100% correct and know you must account for that) and we
       | have long used ML algorithms for the process.
        
       | bambax wrote:
       | I'm building a system that does regular OCR and outputs layout-
       | following ASCII; in my admittedly limited tests it works better
       | than most existing offerings.
       | 
       | It will be ready for beta testing this week or the next, and I
       | will be looking for beta testers; if interested please contact
       | me!
        
       | siquick wrote:
       | Strange that LlamaParse is mentioned in the pricing table but not
       | the results. We've used them to process a lot of pages and it's
       | been excellent each time.
        
       | zoogeny wrote:
       | Orthogonal to this post, but this just highlights the need for a
       | more machine readable PDF alternative.
       | 
       | I get the inertia of the whole world being on PDF. And perhaps we
       | can just eat the cost and let LLMs suffer the burden going
       | forwards. But why not use that LLM coding brain power to create a
       | better overall format?
       | 
       | I mean, do we really see printing things out onto paper something
       | we need to worry about for the next 100 years? It reminds me of
       | the TTY interface at the heart of Linux. There was a time it all
       | made sense, but can we just deprecate it all now?
        
         | layer8 wrote:
         | PDF does support incorporating information about the logical
         | document structure, aka Tagged PDF. It's optional, but
         | recommended for accessibility (e.g. PDF/UA). See chapters
         | 14.7-14.8 in [1]. Processing PDF files as rendered images, as
         | suggested elsewhere in this thread, can actually dramatically
         | lose information present in the PDF.
         | 
         | Alternatively, XML document formats and the like do exist.
         | Indeed, HTML was supposed to be a document format. That's not
         | the problem. The problem is having people and systems actually
         | author documents in that way in an unambiguous fashion, and
         | having a uniform visual presentation for it that would be
         | durable in the long term (decades at least).
         | 
         | PDF as a format persists because it supports virtually every
         | feature under the sun (if authors care to use them), while
         | largely guaranteeing a precisely defined visual presentation,
         | and being one of the most stable formats.
         | 
         | [1] https://opensource.adobe.com/dc-acrobat-sdk-
         | docs/pdfstandard...
        
           | zoogeny wrote:
           | I'm not suggesting we re-invent RDF or any other kind of
           | semantic web idea. And the fact that semantic data can be
           | stored in a PDF isn't really the problem being solved by
           | tools such as these. In many cases, PDF is used for things
           | like scanned documents where adding that kind of metadata
           | can't really be done manually - in fact the kinds of tools
           | suggested in the post would be useful for adding that
           | metadata to the PDF after scanning (for example).
           | 
           | Imagine you went to a government office looking for some
            | document from the 1930s, like an ancestor's marriage or death
           | certificate. You might want to digitize a facsimile of that
           | using a camera or a scanner. You have a lot of options to
           | store that, JPG, PNG, PDF. You have even more options to
           | store the metadata (XML, RDF, TXT, SQLite, etc.). You could
           | even get fancy and zip up an HTML doc alongside a directory
           | of images/resources that stitched them all together. But
           | there isn't really a good standard format to do that.
           | 
            | It is the second part of your post that stands out - the
            | kitchen-sink nature of PDFs that makes them so terrible. If
           | they were just wrappers for image data, formatted in a way
           | that made printing them easy, I probably wouldn't dislike
           | them.
        
       | jibuai wrote:
       | I've been working on something similar the past couple months. A
       | few thoughts:
       | 
       | - A lot of natural chunk boundaries span multiple pages, so you
       | need some 'sliding window' mechanism for the best accuracy.
       | 
       | - Passing the entire document hurts throughput too much due to
       | the quadratic complexity of attention. Outputs are also much
       | worse when you use too much context.
       | 
        | - Bounding boxes can be solved by first generating boxes using
        | traditional OCR / layout recognition, then passing that data to
        | the LLM. The LLM can then link its outputs to the boxes.
        | Unfortunately, getting this reliable required a custom sampler,
        | so proprietary models like Gemini are out of the question.
        
       | dasl wrote:
       | How does the Gemini OCR perform against non-English language
       | text?
        
       | an_aparallel wrote:
       | Has anyone in the AEC industry who's reading this worked out a
       | good way to get Bluebeam MEP, electrical layouts into Revit (LOD
        | 200-300)?
       | 
       | Have seen MarkupX as a paid option, but it seems some AI in the
       | loop can greatly speed up exception handling, encode family
       | placement to certain elevations based on building code docs....
        
       | sensecall wrote:
       | This is super interesting.
       | 
       | Would this be suitable for ingesting and parsing wildly variable
       | unstructured data into a structured schema?
        
       | kbyatnal wrote:
       | It's clear that OCR & document parsing are going to be swallowed
       | up by these multimodal models. The best representation of a
       | document at the end of the day is an image.
       | 
       | I founded a doc processing company [1] and in our experience, a
       | lot of the difficulty w/ deploying document processing into
       | production is when accuracy requirements are high (> 97%). This
       | is because OCR and parsing is only one part of the problem, and
       | real world use cases need to bridge the gap between raw outputs
       | and production-ready data.
       | 
       | This requires things like:
       | 
       | - state-of-the-art parsing powered by VLMs and OCR
       | 
       | - multi-step extraction powered by semantic chunking, bounding
       | boxes, and citations
       | 
       | - processing modes for document parsing, classification,
       | extraction, and splitting (e.g. long documents, or multi-document
       | packages)
       | 
       | - tooling that lets nontechnical members quickly iterate, review
       | results, and improve accuracy
       | 
       | - evaluation and benchmarking tools
       | 
       | - fine-tuning pipelines that turn reviewed corrections --> custom
       | models
       | 
        | Very excited to test and benchmark Gemini 2.0 in our product,
       | very excited about the progress here.
       | 
       | [1] https://extend.app/
        
         | anon373839 wrote:
         | > It's clear that OCR & document parsing are going to be
         | swallowed up by these multimodal models.
         | 
         | I don't think this is clear at all. A multimodal LLM can and
         | will hallucinate data at arbitrary scale (phrases, sentences,
         | etc.). Since OCR is the part of the system that extracts the
         | "ground truth" out of your source documents, this is an
         | unacceptable risk IMO.
        
       | ThinkBeat wrote:
        | Hmm, I have been doing a bit of this manually lately for a
        | personal project. I am working on some old books that are far
        | past any copyright, but they are not available anywhere on the
        | net (being in Norwegian makes a book a lot more obscure), so I
        | have been working on creating ebooks out of them.
       | 
       | I have a scanner, and some OCR processes I run things through. I
       | am close to 85% from my automatic process.
       | 
       | The pain of going from 85% to 99% though is considerable. (and in
       | my case manual) (well Perl helps)
       | 
        | I went to try this AI on one of the short poem manuscripts I
        | have.
        | 
        | I told the prompt I wanted PDF to Markdown; it said sure, go
        | ahead, give me the PDF. I went to upload it. It spent a long time
        | spinning, then a quick message came up, something like
       | 
       | "Failed to count tokens"
       | 
       | but it just flashes and goes away.
       | 
       | I guess the PDF is too big? Weird though, its not a lot of pages.
        
         | sumedh wrote:
         | Take a screenshot of the pdf page and give that to the LLM and
         | see if it can be processed.
         | 
         | Your PDF might have some quirks inside which the LLM cannot
         | process.
        
       | rudolph9 wrote:
       | We parse millions of PDFs using Apache Tika and process about
       | 30,000 per dollar of compute cost. However, the structured output
       | leaves something to be desired, and there are a significant
       | number of pages that Tika is unable to parse.
       | 
       | https://tika.apache.org/
        
         | rudolph9 wrote:
          | Under the hood Tika uses Tesseract for OCR parsing. For clarity,
          | this all works surprisingly well generally speaking, and it's
          | pretty easy to run yourself and an order of magnitude cheaper
          | than most services out there.
         | 
         | https://tesseract-ocr.github.io/tessdoc/
        
       | nottorp wrote:
       | Will 2.0.1 also change everything?
       | 
       | How about 2.0.2?
       | 
       | How about Llama 13.4.0.1?
       | 
       | This is tiring. It's always the end of the world when they
       | release a new version of some LLM.
        
       | llm_trw wrote:
       | This is using exactly the wrong tools at every stage of the OCR
       | pipeline, and the cost is astronomical as a result.
       | 
       | You don't use multimodal models to extract a wall of text from an
       | image. They hallucinate constantly the second you get past
       | perfect 100% high-fidelity images.
       | 
       | You use an object detection model trained on documents to find
       | the bounding boxes of each document section as _images_; each
       | bounding box comes with a confidence score for free.
       | 
        | You then feed each box of text to a regular OCR model, which also
        | gives you a confidence score along with each prediction it makes.
       | 
       | You feed each image box into a multimodal model to describe what
       | the image is about.
       | 
       | For tables, use a specialist model that does nothing but extract
       | tables--models like GridFormer that aren't hyped to hell and
       | back.
       | 
       | You then stitch everything together in an XML file because
       | Markdown is for human consumption.
       | 
       | You now have everything extracted with flat XML markup for each
       | category the object detection model knows about, along with
       | multiple types of probability metadata for each bounding box,
       | each letter, and each table cell.
       | 
       | You can now start feeding this data programmatically into an LLM
       | to do _text_ processing, where you use the XML to control what
       | parts of the document you send to the LLM.
       | 
       | You then get chunking with location data and confidence scores of
       | every part of the document to put as meta data into the RAG
       | store.
       | 
       | I've built a system that reads 500k pages _per day_ using the
       | above, completely locally, on a machine that cost $20k.
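       |
       | Roughly, the stages hang together like this; the detector
       | weights, class names and table/caption helpers below are
       | placeholders for illustration, not the actual system:
       |
       |   # layout detection -> per-box OCR / captioning -> flat XML
       |   from xml.etree import ElementTree as ET
       |   from PIL import Image
       |   from ultralytics import YOLO   # object detection on page images
       |   import pytesseract             # plain OCR, needs tesseract binary
       |
       |   detector = YOLO("doc-layout.pt")   # hypothetical custom weights
       |
       |   def extract_table(crop):
       |       # stand-in for a specialist table model (GridFormer etc.)
       |       return pytesseract.image_to_string(crop)
       |
       |   def describe_image(crop):
       |       # stand-in for a multimodal captioning call
       |       return "figure"
       |
       |   def parse_page(path):
       |       page = Image.open(path)
       |       root = ET.Element("page", src=path)
       |       for box in detector(path)[0].boxes:   # bbox + confidence
       |           x1, y1, x2, y2 = map(int, box.xyxy[0].tolist())
       |           crop = page.crop((x1, y1, x2, y2))
       |           label = detector.names[int(box.cls)]
       |           el = ET.SubElement(root, label,
       |                              conf=f"{float(box.conf):.2f}",
       |                              bbox=f"{x1},{y1},{x2},{y2}")
       |           if label == "table":
       |               el.text = extract_table(crop)
       |           elif label == "figure":
       |               el.text = describe_image(crop)
       |           else:
       |               el.text = pytesseract.image_to_string(crop)
       |       return root   # flat XML, chunked and fed to the LLM later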
        
         | ck_one wrote:
         | What object detection model do you use?
        
           | a3w wrote:
           | Is tesseract even ML based? Oh, this piece of software is
           | more than 19 years old, perhaps there are other ways to do
           | good, cheap OCR now. Does Gemini have an OCR library,
           | internally? For other LLMs, I had the feeling that the LLM
           | scripts a few lines of python to do the actual heavy lifting
           | with a common OCR framework.
        
           | llm_trw wrote:
           | Custom trained yolo v8. I've moved on since then and the work
           | was done in 2023. You'd get better results for much less
           | today.
        
         | ajcp wrote:
          | Not sure what service you're basing your calculation on, but
          | with Gemini I've processed 10,000,000+ shipping documents
          | (PDFs and PNGs) of every conceivable layout in one month at
          | under $1000, with an accuracy rate of 80-82% (humans were at
          | 66%).
         | 
         | The longest part of the development timeline was establishing
         | the accuracy rate and the ingestion pipeline, which itself is
         | massively less complex than what your workflow sounds like: PDF
         | -> Storage Bucket -> Gemini -> JSON response -> Database
         | 
          | Just to get sick with it we actually added some recursion to the
         | Gemini step to have it rate how well it extracted, and if it
         | was below a certain rating to rewrite its own instructions on
         | how to extract the information and then feed it back into
         | itself. We didn't see any improvement in accuracy, but it was
         | still fun to do.
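          |
          | For anyone curious, a minimal sketch of what that kind of
          | Gemini call can look like with the google-generativeai SDK;
          | the prompt, model name and fields here are made up for
          | illustration:
          |
          |   import json
          |   import google.generativeai as genai
          |
          |   genai.configure(api_key="...")
          |   model = genai.GenerativeModel("gemini-2.0-flash")
          |
          |   doc = genai.upload_file("shipping_doc.pdf")  # PDF or PNG
          |   resp = model.generate_content(
          |       [doc, "Extract shipper, consignee, line items as JSON."],
          |       generation_config={"response_mime_type": "application/json"},
          |   )
          |   record = json.loads(resp.text)  # -> row in the database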
        
           | cpursley wrote:
           | Very cool! How are you storing it to a database - vectors?
           | What do you do with the extracted data (in terms of being
           | able to pull it up via some query system)?
        
             | ajcp wrote:
              | In this use-case the customer just wanted data that
              | wasn't currently captured in the warehouse inventory
              | management system, so here we converted the JSON response
              | to a classic table row schema (where 1 row = 1 document)
              | and now, boom, shipping data!
             | 
              | However, we do very much recommend storing the raw model
              | responses for audit, and at least as vector embeddings as
              | well, on the expectation that the data will eventually be
              | needed for vector search and RAG. Kind of like "while
              | we're here, why don't we do what you're going to want to
              | do at some point, even if it's not your use-case now..."
        
         | polote wrote:
          | Do you know of a model other than GridFormer for detecting
          | tables that has an implementation available somewhere?
        
         | dr_kiszonka wrote:
         | Impressive. Can you share anything more about this project?
         | 500k pages a day is massive and I can imagine why one would
         | require that much throughput.
        
       | oedemis wrote:
       | There is also https://ds4sd.github.io/docling/ from IBM Research,
       | which is MIT-licensed and tracks bounding boxes in a rich JSON
       | format.
        
         | pbronez wrote:
         | Docling has worked well for me. It handles scenarios that
         | crashed ChatGPT Pro. Only problem is it's super annoying to
         | install. When I have a minute I might package it for homebrew.
        
       | sergiotapia wrote:
       | The article mentions OCR, but you're sending a PDF, so how is
       | that OCR? Or is this a mistake? What if you send photos of the
       | pages, which would be true OCR - does the performance and price
       | remain the same?
       | 
       | If so this unlocks a massive workflow for us.
        
       | xnx wrote:
       | Glad Gemini is getting some attention. Using it is like a
       | superpower. There are so many discussions about ChatGPT, Claude,
       | DeepSeek, Llama, etc. that don't even mention Gemini.
        
         | throwaway314155 wrote:
          | Google had a pretty rough start compared to ChatGPT and
          | Claude. I suspect that left a bad taste in many people's
          | mouths, in particular because evaluating so many LLMs is a
          | lot of effort on its own.
         | 
         | Llama and DeepSeek are no-brainers; the weights are public.
        
           | beastman82 wrote:
           | No brainer if you're sitting on a >$100k inference server.
        
             | throwaway314155 wrote:
             | Sure, that's fair. If you're aiming for state of the art
             | performance. Otherwise, you can get close and do it on
             | reasonably priced hardware by using smaller distilled
             | and/or quantized variants of llama/r1.
             | 
             | Really though I just meant "it's a no-brainer that they are
             | popular here on HN".
        
         | sumedh wrote:
          | Google was not serious about LLMs; they could not even figure
          | out what to call theirs. There is always a risk that they
          | will get bored and just kill the whole thing.
        
         | Workaccount2 wrote:
          | Before the 2.0 models their offerings were pretty
          | underwhelming, but now they can certainly hold their own. I
          | think Gemini will ultimately be the LLM that eats the world:
          | Google has the talent and, most importantly, their own custom
          | hardware (hence why their prices are dirt cheap and the
          | context is huge).
        
       | roywashere wrote:
       | I think it is very ironic that we chose to use PDF in many fields
       | to archive data because it is a standard and because we would be
       | able to open our pdf documents in 50 or 100 years time. So here
       | we are just a couple of years later facing the challenge of
       | getting the data out of our stupid PDF documents already!
        
         | esafak wrote:
          | It's not ironic. PDFs are a _container_, which can hold
          | scanned documents as well as text. Scanned documents need to
          | be OCRed and analyzed for their layout. This is not a failing
          | of the PDF format, but a problem inherent to working with
          | print scans.
        
           | scotty79 wrote:
            | PDF is a horrible format. Even when it contains plain text,
            | it has no concept of something as simple as paragraphs.
        
       | gapeslape wrote:
       | In my mind, Gemini 2.0 changes everything because of the
       | incredibly long context (2M tokens on some models), while having
       | strong reasoning capabilities.
       | 
       | We are working on a compliance solution (https://fx-lex.com) and
       | RAG just doesn't cut it for our use case. Legislation cannot be
       | chunked if you want the model to reason well about it.
       | 
       | It's magical to be able to just throw everything into the model.
       | And the best thing is that we automatically benefit from future
       | model improvements along all performance axes.
        
       | mateuszbuda wrote:
       | There's AWS Bedrock Knowledge Base (Amazon's proprietary RAG
       | solution), which can digest PDFs and, as far as I've tested it
       | on real-world documents, works pretty well and is cost-effective.
        
       | erulabs wrote:
       | Hrm, I've been using a combo of Textract (for bounding boxes)
       | and AI for understanding the contents of the document. Textract
       | is excellent at bounding boxes and exact-text capture, but LLMs
       | are excellent at understanding when a messy/ugly bit of a form
       | is actually one question, or if there are duplicate questions,
       | etc.
       | 
       | Correlating the two outputs (Textract <-> AI) is difficult, but
       | another round of AI is usually good at that. Combined with some
       | text-difference scoring and logic, I can get pretty good full-
       | document understanding of questions and answer locations. I've
       | spent a pretty absurd amount of time on this and have not yet
       | launched a product with it, but if anyone is interested I'd love
       | to chat about the pipeline!
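       |
       | For reference, a minimal sketch of the Textract half of that
       | combo using boto3 (the file name is illustrative; the LLM
       | correlation pass, where the real work is, isn't shown):
       |
       |   import boto3  # AWS SDK; assumes credentials are configured
       |
       |   textract = boto3.client("textract")
       |   with open("form.png", "rb") as f:
       |       resp = textract.detect_document_text(
       |           Document={"Bytes": f.read()})
       |
       |   # exact text plus bounding boxes, ready to be matched against
       |   # the LLM's looser "this region is one question" reading
       |   lines = [(b["Text"], b["Geometry"]["BoundingBox"])
       |            for b in resp["Blocks"] if b["BlockType"] == "LINE"]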
        
       | pbronez wrote:
       | Wonder how this compares to Docling. So far that's been the only
       | tool that really unlocked PDFs for me. It's solid but really
       | annoying to install.
       | 
       | https://ds4sd.github.io/docling/
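       |
       | For anyone evaluating it, the basic call is only a few lines; a
       | minimal sketch assuming Docling's DocumentConverter quickstart
       | API (the file name is illustrative):
       |
       |   from docling.document_converter import DocumentConverter
       |
       |   converter = DocumentConverter()           # pip install docling
       |   result = converter.convert("report.pdf")  # runs models locally
       |   markdown = result.document.export_to_markdown()
       |   # a JSON export with layout/bounding boxes is also available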
        
       ___________________________________________________________________
       (page generated 2025-02-05 23:00 UTC)