[HN Gopher] Ingesting PDFs and why Gemini 2.0 changes everything
___________________________________________________________________
Ingesting PDFs and why Gemini 2.0 changes everything
Author : serjester
Score : 459 points
Date : 2025-02-05 18:05 UTC (4 hours ago)
(HTM) web link (www.sergey.fyi)
(TXT) w3m dump (www.sergey.fyi)
| cedws wrote:
| 90% accuracy +/- 10%? What could that be useful for, that's
| awfully low.
| lvzw wrote:
| > accuracy is measured with the Needleman-Wunsch algorithm
|
| > Crucially, we've seen very few instances where specific
| numerical values are actually misread. This suggests that most
| of Gemini's "errors" are superficial formatting choices rather
| than substantive inaccuracies. We attach examples of these
| failure cases below [1].
|
| > Beyond table parsing, Gemini consistently delivers near-
| perfect accuracy across all other facets of PDF-to-markdown
| conversion.
|
| That seems fairly useful to me, no? Maybe not for mission
| critical applications, but for a lot of use cases, this seems
| to be good enough. I'm excited to try these prompts on my own
| later.
| schainks wrote:
| This is "good enough" for Banks to use when doing due
| diligence. You'd be surprised how much noise is in the system
| with the current state of the art: algorithms/web scrapers and
| entire buildings of humans in places like India.
| ai-christianson wrote:
| It's certainly pretty useful for discovery/information
| filtering purposes. I.e. searching for signal in the noise if
| you have a large dataset.
| jjtheblunt wrote:
| due diligence of this sort?
|
| https://en.wikipedia.org/wiki/Know_your_customer
| MattDaEskimo wrote:
| Switching from manual data entry to approval
| summerlight wrote:
| I guess 90% is for "benchmark", which is typically tailored to
| be challenging to parse.
| serjester wrote:
| Author here -- measuring accuracy in table parsing is
| surprisingly challenging. Subtle, almost imperceptible
| differences in how a table is parsed may not affect the
| reader's understanding but can significantly impact benchmark
| performance. For all practical purposes, I'd say it's near
| perfect (also keep in mind the benchmark is on _very_
| challenging tables).
| raunakchowdhuri wrote:
| would encourage you to take a look at some of the real data
| here! https://huggingface.co/spaces/reducto/rd_table_bench
|
| you'll find that most of the errors here are structural issues
| with the table or inability to parse some special characters.
| tables can get crazy!
| mattnewton wrote:
| having seen some of these tables, I would guess that's probably
| above a layperson's score. Some are very complicated or just
| misleadingly structured.
| Havoc wrote:
| Been toying with the flash model. Not the top model, but I think
| it'll see plenty of use thanks to the details. It wins on things
| other than topping the benchmark charts:
|
| * Generous free tier
|
| * Huge context window
|
| * Lite version feels basically instant
|
| However
|
| * Lite model seems more prone to repeating itself / looping
|
| * Very confusing naming e.g. {model}-latest worked for 1.5 but
| now it's {model}-001? The lite has a date appended, the non-lite
| does not. Then there is exp and thinking exp...which has a date.
| wut?
| ai-christianson wrote:
| > * Huge context window
|
| But how well does it actually handle that context window? E.g.
| a lot of models support 200K context, but the LLM can only
| really work with ~80K or so of it before it starts to get
| confused.
| summerlight wrote:
| My experience is that Gemini works relatively well on larger
| contexts. Not perfect, but more reliable.
| Havoc wrote:
| I'm sure someone will do a haystack test, but from my casual
| testing it seems pretty good
| asadm wrote:
| it works REALLY well. I have used it to dump in many reference
| code files and then help me write new modules, etc. I have gone
| up to 200k tokens I think with no problems in recall.
| ai-christianson wrote:
| Awesome. Models that can usefully leverage such large
| context windows are rare at this point.
|
| Something like this opens up a lot of use cases.
| f38zf5vdt wrote:
| It works okay out to roughly 20-40k tokens. Once the window
| gets larger than that, it degrades significantly. You can
| needle in the haystack out to that distance, but asking it
| for multiple things from the document leads to hallucinations
| for me.
|
| Ironic, but GPT4o works better for me at longer contexts
| <128k than Gemini 2.0 flash. And out to 1m is just hopeless,
| even though you can do it.
| llm_nerd wrote:
| There is the needle in the haystack measure which is, as you
| probably guessed, hiding a small fact in a massive set of
| tokens and asking it to recall it.
|
| Recent Gemini models actually do extraordinarily well.
|
| https://cloud.google.com/blog/products/ai-machine-
| learning/t...
| daemonologist wrote:
| I wonder how this compares to open source models (which might be
| less accurate but even cheaper if self-hosted?), e.g. Llama 3.2.
| I'll see if I can run the benchmark.
|
| Also regarding the failure case in the footnote, I think Gemini
| actually got that right (or at least outperformed Reducto) - the
| original document seems to have what I call a "3D" table where
| the third axis is rows _within_ each cell, and having multiple
| headers is probably the best approximation in Markdown.
| mediaman wrote:
| Everything I tried previously had very disappointing results. I
| was trying to get rid of Azure's DocumentIntelligence, which is
| kind of expensive at scale. The models could often output a
| portion of a table, but it was nearly impossible to get them to
| produce a structured output of a large table on a single page;
| they'd often insert "...rest of table follows" and similar
| terminations, regardless of different kinds of prompting.
|
| Maybe incremental processing of chunks of the table would have
| worked, with subsequent stitching, but if Gemini can just
| process it that would be pretty good.
| fecal_henge wrote:
| Is there an AI platform where I can paste a snip of a graph and
| it will generate an n-th order polynomial regression of the trace
| for me?
| CamperBob2 wrote:
| Either ChatGPT o4 or one of the newer Google models should
| handle that, since it's a pretty common task. Actually there
| have been online curve fitters for several years that work
| pretty well without AI, such as https://curve.fit/ and
| https://www.standardsapplied.com/nonlinear-curve-fitting-cal...
| .
|
| I'd probably try those first, since otherwise you're depending
| on the language model to do the right thing automagically.
| potatoman22 wrote:
| I've had decent luck using some of the reasoning models for
| this. It helps if you task them with identifying where the
| points on the graph are first.
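|
| Once you have approximate points, the fit itself is easy to do
| locally (a minimal sketch; the points here are made up):
|
|     import numpy as np
|
|     # points read off the trace, by eye or by a vision model
|     x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
|     y = np.array([1.1, 2.0, 4.2, 8.1, 16.3])
|
|     coeffs = np.polyfit(x, y, deg=3)   # 3rd-order polynomial fit
|     print(np.poly1d(coeffs))           # readable fitted polynomial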
| bt3 wrote:
| One major takeaway that matches my own investigation is that
| Gemini 2.0 still materially struggles with bounding boxes on
| digital content. Google has published[1] some great material on
| spatial understanding and bounding boxes on photography, but
| identifying sections of text or digital graphics like icons in a
| presentation is still very hit and miss.
|
| --
|
| [1]: https://github.com/google-
| gemini/cookbook/blob/a916686f95f43...
| maeil wrote:
| Have you seen any models that perform better at this? I last
| looked into this a year ago but at the time they were indeed
| quite bad at it across the board.
| scottydelta wrote:
| This is what I am trying to figure out how to solve.
|
| My problem statement is:
|
| - Ingest PDFs, summarize, and extract important information.
|
| - Have some way to overlay the extracted information on the pdf
| in the UI.
|
| - User can provide feedback on the overlaid info by accepting or
| rejecting the highlights as useful or not.
|
| - This info goes back into the model for reinforcement learning.
|
| Hoping to find something that can make this more manageable.
| baxtr wrote:
| Have you tried cursor or replit for this?
| cccybernetic wrote:
| Most PDF parsers give you coordinate data (bounding boxes) for
| extracted text. Use these to draw highlights over your PDF
| viewer - users can then click the highlights to verify if the
| extraction was correct.
|
| The tricky part is maintaining a mapping between your LLM
| extractions and these coordinates.
|
| One way to do it would be with two LLM passes:
| 1. First pass: Extract all important information from the PDF
| 2. Second pass: "Hey LLM, find where each extraction appears in
| these bounded text chunks"
|
| Not the cheapest approach since you're hitting the API twice,
| but it's straightforward!
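|
| A rough sketch of what that might look like (the llm() helper and
| the chunk format here are hypothetical, just to show the shape of
| the two passes):
|
|     # pass 1: extract fields from the full text
|     fields = llm("Extract the key fields as JSON:\n" + full_text)
|
|     # chunks from your PDF parser: text plus its bounding box
|     chunks = [
|         {"id": 0, "text": "Invoice #4821", "box": [72, 90, 240, 110]},
|         {"id": 1, "text": "Total due: $1,250", "box": [72, 640, 300, 660]},
|     ]
|
|     # pass 2: ask the LLM which chunk each extraction came from
|     mapping = llm(
|         "For each extracted value, return the id of the chunk it "
|         "appears in.\nExtractions: " + fields + "\nChunks: " + str(chunks)
|     )
|     # then draw highlights in the viewer using chunks[id]['box']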
| Jimmc414 wrote:
| Here's a PR that's not accepted yet for some reason, which seems
| to be having some success with the bounding boxes
|
| https://github.com/getomni-ai/zerox/pull/44
|
| Related to
|
| https://github.com/getomni-ai/zerox/issues/7
| exabrial wrote:
| You know what'd be fucking nice? The ability to turn Gemini off.
| fngjdflmdflg wrote:
| >Unfortunately Gemini really seems to struggle on this, and no
| matter how we tried prompting it, it would generate wildly
| inaccurate bounding boxes
|
| This is what I have found as well. From what I've read, LLMs do
| not work well with images for specific details due to image
| encoders which are too lossy. (No idea if this is actually
| correct.) For now I guess you can use regular OCR to get bounding
| boxes.
| minimaxir wrote:
| Modern multimodal encoders for LLMs are fine/not lossy since
| they do not resize to a small size and can handle arbitrary
| sizes, although some sizes are obviously better represented in
| the training set. A 8.5" x 11" paper would be common.
|
| I suspect the issue is prompt engineering related.
|
| > Please provide me strict bounding boxes that encompasses the
| following text in the attached image? I'm trying to draw a
| rectangle around the text.
|
| > - Use the top-left coordinate system
|
| > - Values should be percentages of the image width and height
| (0 to 1)
|
| LLMs have enough trouble with integers (since token-wise
| integers and text representation of integers are the same),
| high-precision decimals will be even worse. It might be better
| to reframe the problem as "this input document is 850 px x 1100
| px, return the bounding boxes as integers" then parse and
| calculate the decimals later.
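|
| A tiny sketch of the "integers in, fractions out" idea (the
| numbers are illustrative):
|
|     # tell the model the page is 850 x 1100 px and ask for integer
|     # pixel boxes, then convert back to 0-1 fractions yourself
|     W, H = 850, 1100
|
|     def to_fractions(box_px):
|         x1, y1, x2, y2 = box_px    # integers returned by the model
|         return [x1 / W, y1 / H, x2 / W, y2 / H]
|
|     print(to_fractions([120, 430, 410, 465]))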
| fngjdflmdflg wrote:
| Just tried this and it did not appear to work for me. Prompt:
|
| >Please provide me strict bounding boxes that encompasses the
| following text in the attached image? I'm trying to draw a
| rectangle around the text.
|
| > - Use the top-left coordinate system
|
| >this input document is 1080 x 1236 px. return the bounding
| boxes as integers
| minimaxir wrote:
| "Might" being the operative word, particularly with models
| that have less prompt adherence. There's a few other prompt
| massaging tricks beyond the scope of a HN comment, the
| decimal issue is just one optimization.
| BoorishBears wrote:
| https://github.com/google-
| gemini/cookbook/blob/a916686f95f43...
|
| They say there's no magic prompt but I'd start with their
| default since there is usually _some_ format used in
| post-training to improve performance on tasks like this
| coderstartup wrote:
| Following this post
| cubefox wrote:
| Why is Gemini Flash so much cheaper than other models here?
| mattnewton wrote:
| probably a mix of economies of scale (google workspace and
| search are already massive customers of these models meaning
| the build out is already there), and some efficiency dividends
| from hardware r&d (google has developed the model and the TPU
| hardware purpose built to run it almost in parallel)
| lazypenguin wrote:
| I work in fintech and we replaced an OCR vendor with Gemini at
| work for ingesting some PDFs. After trial and error with
| different models, Gemini won because it was so darn easy to use
| and it worked with minimal effort. I think one shouldn't
| underestimate how much a multi-modal, large-context-window model
| helps in terms of ease of use. Ironically this vendor is the best
| known and most successful vendor for OCR'ing this specific type
| of PDF, but many of our requests failed over to their
| human-in-the-loop process. Despite it not being their
| specialization, switching to Gemini was a no-brainer after our
| testing. Processing time went from something like 12 minutes on
| average to 6s on average, accuracy was like 96% of that of the
| vendor, and the price was significantly cheaper. As for the 4%
| inaccuracies, a lot of them are things like handwritten "LLC"
| getting OCR'd as "IIC", which I would say is somewhat "fair". We
| could probably improve our prompt to clean up this data even
| further. Our prompt is currently very simple: "OCR this PDF into
| this format as specified by this json schema" and didn't require
| some fancy "prompt engineering" to contort out a result.
|
| Gemini developer experience was stupidly easy. Easy to add a file
| "part" to a prompt. Easy to focus on the main problem with
| weirdly high context window. Multi-modal so it handles a lot of
| issues for you (PDF image vs. PDF with data), etc. I can
| recommend it for the use case presented in this blog (ignoring
| the bounding boxes part)!
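|
| For anyone curious, a minimal sketch of this kind of call with the
| google-generativeai Python SDK (model name, file and schema are
| placeholders, not our exact setup):
|
|     import google.generativeai as genai
|
|     genai.configure(api_key="...")
|     model = genai.GenerativeModel("gemini-2.0-flash")
|     pdf = genai.upload_file("statement.pdf")   # file "part"
|
|     prompt = ("OCR this PDF into JSON matching this schema: "
|               '{"vendor": str, "invoice_date": str, "line_items": [...]}')
|     resp = model.generate_content(
|         [pdf, prompt],
|         generation_config={"response_mime_type": "application/json"},
|     )
|     print(resp.text)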
| cess11 wrote:
| What hardware are you using to run it?
| kccqzy wrote:
| The Gemini model isn't open so it does not matter what
| hardware you have. You might have confused Gemini with Gemma.
| cess11 wrote:
| OK, I see, pity. I'm interested in similar applications but
| in contexts where the material is proprietary and might
| contain PII.
| panarky wrote:
| This is a big aha moment for me.
|
| If Gemini can do semantic chunking at the same time as
| extraction, all for so cheap and with nearly perfect accuracy,
| and without brittle prompting incantation magic, this is huge.
| potatoman22 wrote:
| Small point but is it doing semantic chunking, or loading the
| entire pdf into context? I've heard mixed results on semantic
| chunking.
| panarky wrote:
| It loads the entire PDF into context, but then it would be
| my job to chunk the output for RAG, and just doing
| arbitrary fixed-size blocks, or breaking on sentences or
| paragraphs is not ideal.
|
| So I can ask Gemini to return chunks of variable size,
| where each chunk is a one complete idea or concept, without
| arbitrarily chopping a logical semantic segment into
| multiple chunks.
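|
| A sketch of what asking for idea-level chunks could look like (the
| prompt wording and model name are illustrative, not the article's
| actual CHUNKING_PROMPT):
|
|     import google.generativeai as genai
|
|     genai.configure(api_key="...")
|     model = genai.GenerativeModel("gemini-2.0-flash")
|     pdf = genai.upload_file("filing.pdf")
|
|     CHUNK_PROMPT = (
|         "Split this document into retrieval chunks. Each chunk must "
|         "contain exactly one complete idea, section, or table; never "
|         "split a sentence, list, or table row across chunks. Return "
|         'JSON: [{"chunk_id": int, "text": str}, ...]'
|     )
|     resp = model.generate_content(
|         [pdf, CHUNK_PROMPT],
|         generation_config={"response_mime_type": "application/json"},
|     )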
| thelittleone wrote:
| Fixed size chunks is holding back a bunch of RAG projects
| on my backlog. Will be extremely pleased if this semantic
| chunking solves the issue. Currently we're getting around
| 78-82% success on fixed-size chunked RAG which is far
| too low. Users assume zero results on a RAG search
| equates to zero results in the source data.
| refulgentis wrote:
| FWIW, you might be doing it / ruled it out already:
|
| - BM25 to eliminate the 0 results in source data problem (quick
| sketch after this list)
|
| - Longer term, a peek at Gwern's recent hierarchical
| embedding article. Got decent early returns even with
| fixed size chunks
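|
| On the BM25 point, a minimal sketch with the rank_bm25 package
| (naive whitespace tokenization just for illustration):
|
|     from rank_bm25 import BM25Okapi
|
|     chunks = ["refund policy for enterprise plans",
|               "how to rotate API keys",
|               "2023 annual report summary"]
|     bm25 = BM25Okapi([c.lower().split() for c in chunks])
|
|     query = "rotate api keys"
|     scores = bm25.get_scores(query.lower().split())
|     best = max(range(len(chunks)), key=lambda i: scores[i])
|     print(chunks[best])   # lexical hit even if embeddings miss it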
| thelittleone wrote:
| Much appreciated.
|
| For others interested in BM25 for the use case above, I
| found this thread informative.
|
| https://news.ycombinator.com/item?id=41034297
| mediaman wrote:
| Agree, BM25 honestly does an amazing job on its own
| sometimes, especially if content is technical.
|
| We use it in combination with semantic but sometimes turn
| off the semantic part to see what happens and are
| surprised with the robustness of the results.
|
| This would work less well for cross-language or less
| technical content, however. It's great for acronyms,
| company or industry specific terms, project names,
| people, technical phrases, and so on.
| Tostino wrote:
| I wish we had a local model for semantic chunking. I've
| been wanting one for ages, but haven't had the time to
| make a dataset and finetune that task =/.
| fallinditch wrote:
| If I used Gemini 2.0 for extraction and chunking to feed into
| a RAG that I maintain on my local network, then what sort of
| locally-hosted LLM would I need to gain meaningful insights
| from my knowledge base? Would a 13B parameter model be
| sufficient?
| jhoechtl wrote:
| Your local model has little more to do than stitch the already
| meaningful pieces together.
|
| The pre-step, chunking and semantic understanding is all
| that counts.
| faxmeyourcode wrote:
| I've been fighting trying to chunk SEC filings properly,
| specifically surrounding the strange and inconsistent tabular
| formats present in company filings.
|
| This is giving me hope that it's possible.
| otoburb wrote:
| >> _I've been fighting trying to chunk SEC filings properly,
| specifically surrounding the strange and inconsistent tabular
| formats present in company filings._
|
| For this specific use case you can also try edgartools[1]
| which is a library that was relatively recently released that
| ingests SEC submissions and filings. They don't use OCR but
| (from what I can tell) directly parse the XBRL documents
| submitted by companies and stored in EDGAR, if they exist.
|
| [1] https://github.com/dgunning/edgartools
| barrenko wrote:
| If you'd kindly tl;dr the chunking strategies you have tried
| and what works best, I'd love to hear.
| anirudhb99 wrote:
| (from the gemini team) we're working on it! semantic chunking
| & extraction will definitely be possible in the coming
| months.
| jgalt212 wrote:
| isn't everyone on iXBRL now? Or are you struggling with
| historical filings?
| faxmeyourcode wrote:
| XBRL is what I'm using currently, but it's still kind of a
| mess (maybe I'm just bad at it) for some of the non-
| standard information that isn't properly tagged.
| yzydserd wrote:
| How do today's LLMs like Gemini compare with the Document
| Understanding services google/aws/azure have offered for a few
| years, particularly when dealing with known forms? I think
| Google's is Document AI.
| zacmps wrote:
| I've found the highest accuracy solution is to OCR with one
| of the dedicated models then feed that text and the original
| image into an LLM with a prompt like:
|
| "Correct errors in this OCR transcription".
| bradfox2 wrote:
| This is what we do today. Have you tried it against Gemini
| 2.0?
| therein wrote:
| How does it behave if the body of text is offensive or what
| if it is talking about a recipe to purify UF-6 gas at home?
| Will it stop doing what it is doing and enter lecturing
| mode?
|
| I am asking not to be cynical, but because in my limited
| experience, using LLMs for any task that may operate on
| offensive or unknown input seems to get them triggered by all
| sorts of unpredictable moral judgements and dragged into
| generating not the output I wanted, at all.
|
| If I ask this black box to give me a JSON output containing
| keywords for a certain text and the text happens to be
| offensive, it refuses to do that.
|
| How does one tackle with that?
| zacmps wrote:
| It's not something I've needed to deal with personally.
|
| We have run into added content filters in Azure OpenAI on
| a different application, but we just put in a request to
| tune them down for us.
| xnx wrote:
| There are many settings for changing the safety level in
| Gemini API calls: https://ai.google.dev/gemini-
| api/docs/safety-settings
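|
| A sketch of what that looks like with the Python SDK (enum names
| as I recall them from the docs; adjust to taste):
|
|     import google.generativeai as genai
|     from google.generativeai.types import (HarmCategory,
|                                             HarmBlockThreshold)
|
|     model = genai.GenerativeModel(
|         "gemini-2.0-flash",
|         safety_settings={
|             HarmCategory.HARM_CATEGORY_HARASSMENT:
|                 HarmBlockThreshold.BLOCK_NONE,
|             HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT:
|                 HarmBlockThreshold.BLOCK_NONE,
|         },
|     )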
| sumedh wrote:
| Try setting the safety params to none and see if that
| makes any difference.
| ajcp wrote:
| GCP's Document AI service is now literally just a UI layer
| specific to document parsing use-cases backed by Gemini models.
| When we realized that we dumped it and just use Gemini
| directly.
| depr wrote:
| So are you mostly processing PDFs with data? Or PDFs with just
| text, or images, graphs?
| thelittleone wrote:
| Not the parent, but we process PDFs with text, tables,
| diagrams. Works well if the schema is properly defined.
| sensecall wrote:
| Out of interest, did you parse into any sort of defined
| schema/structure?
| gnat wrote:
| Parent literally said so ...
|
| > Our prompt is currently very simple: "OCR this PDF into
| this format as specified by this json schema" and didn't
| require some fancy "prompt engineering" to contort out a
| result.
| bionhoward wrote:
| The Gemini API has a customer noncompete, so it's not an option
| for AI work. What are you working on that doesn't compete with AI?
| B-Con wrote:
| You do realize most people aren't working on AI, right?
|
| Also, OP mentioned fintech at the outset.
| novaleaf wrote:
| what doesn't compete with ai?
| xnx wrote:
| Your OCR vendor would be smart to replace their own system with
| Gemini.
| _the_inflator wrote:
| With "Next generation, extremely sophisticated AI" to be
| precise, I wait say. ;)
|
| Marketing joke aside, maybe a hybrid approach could serve the
| vendor well. Best of both worlds if it reaps benefits or even
| have a look at hugging face for even more specialized aka
| better LLMs.
| itissid wrote:
| Wait, isn't there at least a two-step process here: semantic
| segmentation followed by a method like Textract for text, to
| avoid hallucinations?
|
| One cannot possibly say that "Text extracted by a multimodal
| model cannot hallucinate"?
|
| > accuracy was like 96% of that of the vendor and price was
| significantly cheaper.
|
| I would like to know how this 96% was tested. If you use a
| human to do random-sample-based testing, how do you adjust the
| sample for variation in the distribution of errors? E.g. a small
| set of documents could contain 90% of the errors and yet make up
| only 1% of the docs.
| itissid wrote:
| For an OCR company I imagine it is unconscionable to do this,
| because if you were to, say, OCR an oral history project for a
| library and you made hallucination errors, well, you've replaced
| facts with fiction. Rewriting history? What the actual F.
| themanmaran wrote:
| One thing people always forget about traditional OCR
| providers (azure, tesseract, aws textract, etc.) is that
| they're ~85% accurate.
|
| They are all probabilistic. You literally get back characters
| + confidence intervals. So when textract gives you back
| incorrect characters, is that a hallucination?
| somebehemoth wrote:
| I know nothing about OCR providers. It seems like OCR
| failure would result in gibberish or awkward wording that
| might be easy to spot. Doesn't the LLM failure mode assert
| made up truths eloquently that are more difficult to spot?
| anon373839 wrote:
| It's a question of scale. When a traditional OCR system
| makes an error, it's confined to a relatively small part of
| the overall text. (Think of "Plastics" becoming
| "PIastics".) When a LLM hallucinates, there is no limit to
| how much text can be made up. Entire sentences can be
| rewritten because the model thinks they're more plausible
| than the sentences that were actually printed. And because
| the bias is always toward plausibility, it's an especially
| insidious problem.
| kapitalx wrote:
| I'm the founder of https://doctly.ai, also pdf extraction.
|
| The hallucination in LLM extraction is much more subtle as
| it will rewrite full sentences sometimes. It is much harder
| to spot when reading the document and sounds very
| plausible.
|
| We're currently working on a version where we send the
| document to two different LLMs, and use a 3rd to increase
| confidence. That way you have the option of trading compute
| and cost for accuracy.
| Scoundreller wrote:
| > You literally get back characters + confidence intervals.
|
| Oh god, I wish speech to text engines would colour code the
| whole thing like a heat map to focus your attention to
| review where it may have over-enthusiastically guessed at
| what was said.
|
| You no knot.
| basch wrote:
| Wouldn't the temperature on something like OCR be very low? You
| want the same result every time. Isn't some part of hallucination
| due to the randomness of temperature?
| ein0p wrote:
| > Why Gemini 2.0 Changes Everything
|
| Clickbait. It doesn't change "everything". It makes ingestion for
| RAG much less expensive (and therefore feasible in a lot more
| scenarios), at the expense of ~7% reduction in accuracy. Accuracy
| is already rather poor even before this, however, with the top
| alternative clocking in at 0.9. Gemini 2.0 is 0.84, although the
| author seems to suggest that the failure modes are mostly around
| formatting rather than e.g. mis-recognition or hallucinations.
|
| TL;DR: is this exciting? If you do RAG, yes. Does it "change
| everything" nope. There's still a very long way to go. Protip for
| model designers: accuracy is always in greater demand than
| performance. A slow model that solves the problem is invariably
| better than a fast one that fucks everything up.
| rvz wrote:
| In this use-case, accuracy is non-negotiable with zero room for
| any hallucination.
|
| Overall it changes nothing.
| ChrisArchitect wrote:
| Related:
|
| _Gemini 2.0 is now available to everyone_
|
| https://news.ycombinator.com/item?id=42950454
| cccybernetic wrote:
| Shameless plug: I'm working on a startup in this space.
|
| But the bounding box problem hits close to home. We've found
| Unstructured's API gives pretty accurate box coordinates, and
| with some tweaks you can make them even better. The tricky part
| is implementing those tweaks without burning a hole in your
| wallet.
| sho_hn wrote:
| Remember all the hyperbole a year ago about how Google was
| failing and it was all over for them?
| latexr wrote:
| Anyone who cries "<service> is dead" after some new technology
| is introduced is someone you can safely ignore. For ever.
| They're hyperbolic clout chasers who will only ever be right by
| mistake.
|
| As if, when ChatGPT was introduced, Google would just stay
| still, cross their arms, and say "well, this is based on our
| research paper but there's nothing we can do, going to just
| roll over and wait for billions of dollars to run out, we're
| truly doomed". So unbelievably stupid.
| pockmarked19 wrote:
| Now, I could look at this relatively popular post about Google
| and revise my opinion of HN as an echo chamber, but I'm afraid
| it's just that the downvote loving HNers weren't able to make the
| cognitive leap from Gemini to Google.
| resource_waste wrote:
| Google's models have historically been total disappointments
| compared to ChatGPT-4. Worse quality, and they won't answer
| medical questions either.
|
| I suppose I'll try it again, for the 4th or 5th time.
|
| This time I'm not excited. I'm expecting it to be a letdown.
| beklein wrote:
| Great article, I couldn't find any details about the prompt...
| only the snippets of the `CHUNKING_PROMPT` and the
| `GET_NODE_BOUNDING_BOXES_PROMPT`.
|
| Is there any code example with a full prompt available from
| OP, or are there any references (such as similar GitHub repos)
| for those looking to get started within this topic?
|
| Your insights would be highly appreciated.
| nickandbro wrote:
| I think very soon a new model will destroy whatever startups and
| services are built around document ingestion. As in a model that
| can take in a pdf page as an image and transcribe it to text with
| near perfect accuracy.
| depr wrote:
| I think Azure Document Intelligence, Google Document AI and
| Amazon Textract are among the best, if not the best, services
| though, and they offer these models.
| layer8 wrote:
| Extracting plain text isn't that much of a problem, relatively
| speaking. It's interpreting more complex elements like nested
| lists, tables, side bars, footnotes/endnotes, cross-references,
| images and diagrams where things get challenging.
| matthest wrote:
| This is completely tangential, but does anyone know if AI is
| creating any new jobs?
|
| Thinking of the OCR vendors who get replaced. Where might they
| go?
|
| One thing I can think of is that AI could help the space industry
| take off. But wondering if there are any concrete examples of new
| jobs being created.
| rjurney wrote:
| I've been using NotebookLM powered by Gemini 2.0 for three
| projects and it is _really powerful_ for comprehending large
| corpuses you can't possibly read and thinking informed by all
| your sources. It has solid Q&A. When you ask a question or get a
| summary you like [which often happens] you can save it as a new
| note, putting it into the corpus for analysis. In this way your
| conclusions snowball. Yes, this experience actually happens and
| it is beautiful.
|
| I've tried Adobe Acrobat AI for this and it doesn't work yet.
| NotebookLM is it. The grounding is the reason it works - you can
| easily click on anything and it will take you to the source to
| verify it. My only gripe is that the visual display of the source
| material is _dogshit ugly_, like exceptionally so. Big blog pink
| background letters in lines of 24 characters! :) It has trouble
| displaying PDF columns, but at least it parses them. The ugly
| will change I'm sure :)
|
| My projects are setup to let me bridge the gaps between the
| various sources and synthesize something more. It helps to have a
| goal and organize your sources around that. If you aren't
| focused, it gets confused. You lay the groundwork in sources and
| it helps you reason. It works so well I feel _tender_ towards it
| :) Survey papers provide background then you add specific sources
| in your area of focus. You can write a profile for how you would
| like NotebookLM to think - which REALLY helps out.
|
| They are:
|
| * The Stratigrapher - A Lovecraftian short story about the
| world's first city. Sources: all of Seton Lloyd/Faud Safar's work
| on Eridu; various sources on Sumerian culture and religion; all
| of Lovecraft's work and letters; various sources about opium;
| some articles about nonlinear geometries.
|
| * FPGA Accelerated Graph Analytics. Sources: an introduction to
| Verilog; papers on FPGAs and graph analytics; papers on Apache
| Spark architecture; papers on GraphFrames and a related rant I
| created about it and graph DBs; a source on Spark-RAPIDS; papers
| on subgraph matching, graphlets, network motifs; papers on random
| graph models.
|
| * Graph machine learning - a notebook without a specific goal,
| which has been less successful. It helps to have a goal for the
| project. It got confused by how broad my sources were.
|
| I would LOVE to share my projects with you all, but you can only
| share within a Google Workspaces domain. It will be AWESOME when
| they open this thing up :)
| ratedgene wrote:
| Is this something we can run locally? if so what's the license?
| xnx wrote:
| Gemini are Google cloud/service models. Gemma are the Google
| local models.
| __jl__ wrote:
| The numbers in the blog post seem VERY inaccurate.
|
| Quick calculation. Input pricing: image input in 2.0 Flash is
| $0.0001935 (let's ignore the prompt). Output pricing: let's assume
| 500 tokens per page, which is $0.0003.
|
| Cost per page: $0.0004935
|
| That means 2,026 pages per dollar. Not 6,000!
|
| Might still be cheaper than many solutions but I don't see where
| these numbers are coming from.
|
| By the way, image input is much more expensive in Gemini 2.0 even
| for 2.0 Flash Lite.
|
| Edit: The post says batch pricing, which would be 4k pages based
| on my calculation. Using batch pricing is pretty different
| though. Great if feasible but not practical in many contexts.
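|
| (Worked out, with the batch discount assumed to be 50%, which is
| where my ~4k figure comes from:)
|
|     image_in = 0.0001935        # $ per page image, 2.0 Flash input
|     output   = 0.0003           # $ for ~500 output tokens per page
|     per_page = image_in + output
|     print(1 / per_page)         # ~2,026 pages per dollar
|     print(1 / (per_page / 2))   # ~4,053 pages per dollar with batch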
| serjester wrote:
| Correct, it's with batching Vertex pricing with slightly lower
| output tokens per page since a lot of pages are somewhat empty
| in real world docs - I wanted a fair comparison to providers
| that charge per page.
|
| Regardless of what assumptions you use, it's still an order-of-
| magnitude-plus improvement over anything else.
| nothrowaways wrote:
| Cool
| anirudhb99 wrote:
| thanks a ton for all the amazing feedback on this thread! if
|
| (a) you have document understanding use cases that you'd like to
| use gemini for (the more aspirational the better) and/or
|
| (b) there are loss cases for which gemini doesn't work well
| today,
|
| please feel free to email anirudhbaddepu@google.com and we'd love
| to help get your use case working & improve quality for our next
| series of model updates!
| raunakchowdhuri wrote:
| CTO of Reducto here. Love this writeup!
|
| We've generally found that Gemini 2.0 is a great model and have
| tested this (and nearly every VLM) very extensively.
|
| A big part of our research focus is incorporating the best of
| what new VLMs offer without losing the benefits and reliability
| of traditional CV models. A simple example of this is we've found
| bounding box based attribution to be a non-negotiable for many of
| our current customers. Citing the specific region in a document
| where an answer came from becomes (in our opinion) even MORE
| important when using large vision models in the loop, as there is
| a continued risk of hallucination.
|
| Whether that matters in your product is ultimately use case
| dependent, but the more important challenge for us has been
| reliability in outputs. RD-TableBench currently uses a single
| table image on a page, but when testing with real world dense
| pages we find that VLMs deviate more. Sometimes that involves
| minor edits (summarizing a sentence but preserving meaning), but
| sometimes it's a more serious case such as hallucinating large
| sets of content.
|
| The more extreme case is that internally we fine tuned a version
| of Gemini 1.5 along with base Gemini 2.0, specifically for
| checkbox extraction. We found that even with a broad distribution
| of checkbox data we couldn't prevent frequent checkbox
| hallucination on both the flash (+17% error rate) and pro model
| (+8% error rate). Our customers in industries like healthcare
| expect us to get it right, out of the box, deterministically, and
| our team's directive is to get as close as we can to that ideal
| state.
|
| We think that the ideal state involves a combination of the two.
| The flexibility that VLMs provide, for example with cases like
| handwriting, is what I think will make it possible to go from 80
| or 90 percent accuracy to some number very close to 99%. I should
| note that the Reducto performance for table extraction is with
| our pre-VLM table parsing pipeline, and we'll have more to share
| in terms of updates there soon. For now, our focus is entirely on
| the performance frontier (though we do scale costs down with
| volume). In the longer term as inference becomes more efficient
| we want to move the needle on cost as well.
|
| Overall though, I'm very excited about the progress here.
|
| --- One small comment on your footnote, the evaluation script
| with the Needleman-Wunsch algorithm doesn't actually consider the
| headers outputted by the models and looks only at the table
| structure itself.
| noja wrote:
| > deterministically
|
| How are you planning to do this?
| jbarrow wrote:
| > Unfortunately Gemini really seems to struggle on this, and no
| matter how we tried prompting it, it would generate wildly
| inaccurate bounding boxes
|
| Qwen2.5 VL was trained on a special HTML format for doing OCR
| with bounding boxes. [1] The resulting boxes aren't quite as
| accurate as something like Textract/Surya, but I've found they're
| much more accurate than Gemini or any other LLM.
|
| [1] https://qwenlm.github.io/blog/qwen2.5-vl/
| xena wrote:
| I really wish that Google made an endpoint that's compatible with
| the OpenAI API. That'd make trying Gemini in existing flows so
| much easier.
| kurtoid wrote:
| Is that not this? https://ai.google.dev/api/compatibility
| myko wrote:
| I believe this is already the case, at least the Python
| libraries are compatible, if not recommended for more than just
| trying things out:
|
| https://ai.google.dev/gemini-api/docs/openai
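|
| A minimal sketch (the base URL is the one from that doc page, from
| memory, so double-check it there):
|
|     from openai import OpenAI
|
|     client = OpenAI(
|         api_key="GEMINI_API_KEY",
|         base_url="https://generativelanguage.googleapis.com/v1beta/openai/",
|     )
|     resp = client.chat.completions.create(
|         model="gemini-2.0-flash",
|         messages=[{"role": "user", "content": "Say hello"}],
|     )
|     print(resp.choices[0].message.content)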
| msp26 wrote:
| How well do they work when you want to do things like
| grounding with search?
| devmor wrote:
| I think this is one of the few functional applications of LLMs
| that is really undeniably useful.
|
| OCR has always been "untrustworthy" (as in you cannot expect it
| to be 100% correct and know you must account for that) and we
| have long used ML algorithms for the process.
| bambax wrote:
| I'm building a system that does regular OCR and outputs layout-
| following ASCII; in my admittedly limited tests it works better
| than most existing offerings.
|
| It will be ready for beta testing this week or the next, and I
| will be looking for beta testers; if interested please contact
| me!
| siquick wrote:
| Strange that LlamaParse is mentioned in the pricing table but not
| the results. We've used them to process a lot of pages and it's
| been excellent each time.
| zoogeny wrote:
| Orthogonal to this post, but this just highlights the need for a
| more machine readable PDF alternative.
|
| I get the inertia of the whole world being on PDF. And perhaps we
| can just eat the cost and let LLMs suffer the burden going
| forwards. But why not use that LLM coding brain power to create a
| better overall format?
|
| I mean, do we really see printing things out onto paper as
| something we need to worry about for the next 100 years? It
| reminds me of
| the TTY interface at the heart of Linux. There was a time it all
| made sense, but can we just deprecate it all now?
| layer8 wrote:
| PDF does support incorporating information about the logical
| document structure, aka Tagged PDF. It's optional, but
| recommended for accessibility (e.g. PDF/UA). See chapters
| 14.7-14.8 in [1]. Processing PDF files as rendered images, as
| suggested elsewhere in this thread, can actually dramatically
| lose information present in the PDF.
|
| Alternatively, XML document formats and the like do exist.
| Indeed, HTML was supposed to be a document format. That's not
| the problem. The problem is having people and systems actually
| author documents in that way in an unambiguous fashion, and
| having a uniform visual presentation for it that would be
| durable in the long term (decades at least).
|
| PDF as a format persists because it supports virtually every
| feature under the sun (if authors care to use them), while
| largely guaranteeing a precisely defined visual presentation,
| and being one of the most stable formats.
|
| [1] https://opensource.adobe.com/dc-acrobat-sdk-
| docs/pdfstandard...
| zoogeny wrote:
| I'm not suggesting we re-invent RDF or any other kind of
| semantic web idea. And the fact that semantic data can be
| stored in a PDF isn't really the problem being solved by
| tools such as these. In many cases, PDF is used for things
| like scanned documents where adding that kind of metadata
| can't really be done manually - in fact the kinds of tools
| suggested in the post would be useful for adding that
| metadata to the PDF after scanning (for example).
|
| Imagine you went to a government office looking for some
| document from 1930s, like an ancestors marriage or death
| certificate. You might want to digitize a facsimile of that
| using a camera or a scanner. You have a lot of options to
| store that, JPG, PNG, PDF. You have even more options to
| store the metadata (XML, RDF, TXT, SQLite, etc.). You could
| even get fancy and zip up an HTML doc alongside a directory
| of images/resources that stitched them all together. But
| there isn't really a good standard format to do that.
|
| It is the second part of your post that stands out - the
| kitchen sink nature of PDFs that makes them so terrible. If
| they were just wrappers for image data, formatted in a way
| that made printing them easy, I probably wouldn't dislike
| them.
| jibuai wrote:
| I've been working on something similar the past couple months. A
| few thoughts:
|
| - A lot of natural chunk boundaries span multiple pages, so you
| need some 'sliding window' mechanism for the best accuracy (rough
| sketch after this list).
|
| - Passing the entire document hurts throughput too much due to
| the quadratic complexity of attention. Outputs are also much
| worse when you use too much context.
|
| - Bounding boxes can be solved by first generating boxes using
| traditional OCR / layout recognition, then passing that data to
| the LLM. The LLM can then link its outputs to the boxes.
| Unfortunately getting this reliable required a custom sampler so
| proprietary models like Gemini are out of the question.
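|
| For the sliding window, a minimal sketch of the page-grouping side
| (window/overlap sizes are arbitrary here):
|
|     def windows(pages, size=8, overlap=2):
|         # yield overlapping page groups so chunks that straddle a
|         # page break are seen whole at least once
|         step = size - overlap
|         for start in range(0, max(len(pages) - overlap, 1), step):
|             yield start, pages[start:start + size]
|
|     pages = ["page 1 text", "page 2 text", "page 3 text"]
|     for start, group in windows(pages, size=2, overlap=1):
|         text = "\n".join(group)   # send this to the model, then
|         print(start, len(group))  # de-duplicate chunks in overlaps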
| dasl wrote:
| How does the Gemini OCR perform against non-English language
| text?
| an_aparallel wrote:
| Has anyone in the AEC industry who's reading this worked out a
| good way to get Bluebeam MEP, electrical layouts into Revit (LOD
| 200-300).
|
| Have seen MarkupX as a paid option, but it seems some AI in the
| loop can greatly speed up exception handling, encode family
| placement to certain elevations based on building code docs....
| sensecall wrote:
| This is super interesting.
|
| Would this be suitable for ingesting and parsing wildly variable
| unstructured data into a structured schema?
| kbyatnal wrote:
| It's clear that OCR & document parsing are going to be swallowed
| up by these multimodal models. The best representation of a
| document at the end of the day is an image.
|
| I founded a doc processing company [1] and in our experience, a
| lot of the difficulty w/ deploying document processing into
| production is when accuracy requirements are high (> 97%). This
| is because OCR and parsing is only one part of the problem, and
| real world use cases need to bridge the gap between raw outputs
| and production-ready data.
|
| This requires things like:
|
| - state-of-the-art parsing powered by VLMs and OCR
|
| - multi-step extraction powered by semantic chunking, bounding
| boxes, and citations
|
| - processing modes for document parsing, classification,
| extraction, and splitting (e.g. long documents, or multi-document
| packages)
|
| - tooling that lets nontechnical members quickly iterate, review
| results, and improve accuracy
|
| - evaluation and benchmarking tools
|
| - fine-tuning pipelines that turn reviewed corrections --> custom
| models
|
| Very excited to test and benchmark Gemini 2.0 in our product, and
| very excited about the progress here.
|
| [1] https://extend.app/
| anon373839 wrote:
| > It's clear that OCR & document parsing are going to be
| swallowed up by these multimodal models.
|
| I don't think this is clear at all. A multimodal LLM can and
| will hallucinate data at arbitrary scale (phrases, sentences,
| etc.). Since OCR is the part of the system that extracts the
| "ground truth" out of your source documents, this is an
| unacceptable risk IMO.
| ThinkBeat wrote:
| Hmm, I have been doing a bit of this manually lately for a
| personal project. I am working on some old books that are far
| past any copyright, but they are not available anywhere on the
| net. (Being in Norwegian makes a book a lot more obscure.) So I
| have been working on creating ebooks out of them.
|
| I have a scanner, and some OCR processes I run things through. I
| am close to 85% from my automatic process.
|
| The pain of going from 85% to 99% though is considerable. (and in
| my case manual) (well Perl helps)
|
| I went to try this AI on one of the short poem manuscripts I
| have.
|
| I told the prompt I wanted PDF to Markdown, and it says sure, go
| ahead, give me the pdf. I went to upload it. It spent a long time
| spinning, then a quick message comes up, something like
|
| "Failed to count tokens"
|
| but it just flashes and goes away.
|
| I guess the PDF is too big? Weird though, it's not a lot of pages.
| sumedh wrote:
| Take a screenshot of the pdf page and give that to the LLM and
| see if it can be processed.
|
| Your PDF might have some quirks inside which the LLM cannot
| process.
| rudolph9 wrote:
| We parse millions of PDFs using Apache Tika and process about
| 30,000 per dollar of compute cost. However, the structured output
| leaves something to be desired, and there are a significant
| number of pages that Tika is unable to parse.
|
| https://tika.apache.org/
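|
| For anyone who hasn't used it, the Python wrapper is about this
| simple (needs Java; it spins up a local Tika server on first use):
|
|     from tika import parser   # pip install tika
|
|     parsed = parser.from_file("report.pdf")
|     text = parsed.get("content") or ""
|     meta = parsed.get("metadata", {})
|     print(len(text), meta.get("Content-Type"))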
| rudolph9 wrote:
| Under the hood tika uses tesseract for ocr parsing. For clarity
| this all works surprisingly well generally speaking, and it's
| pretty easy to run yourself and an order of magnitude cheaper
| than most services out there.
|
| https://tesseract-ocr.github.io/tessdoc/
| nottorp wrote:
| Will 2.0.1 also change everything?
|
| How about 2.0.2?
|
| How about Llama 13.4.0.1?
|
| This is tiring. It's always the end of the world when they
| release a new version of some LLM.
| llm_trw wrote:
| This is using exactly the wrong tools at every stage of the OCR
| pipeline, and the cost is astronomical as a result.
|
| You don't use multimodal models to extract a wall of text from an
| image. They hallucinate constantly the second you get past
| perfect 100% high-fidelity images.
|
| You use an object detection model trained on documents to find
| the bounding boxes of each document section as _images_; each
| bounding box comes with a confidence score for free.
|
| You then feed each box of text to a regular OCR model, which also
| gives you a confidence score along with each prediction it makes.
|
| You feed each image box into a multimodal model to describe what
| the image is about.
|
| For tables, use a specialist model that does nothing but extract
| tables--models like GridFormer that aren't hyped to hell and
| back.
|
| You then stitch everything together in an XML file because
| Markdown is for human consumption.
|
| You now have everything extracted with flat XML markup for each
| category the object detection model knows about, along with
| multiple types of probability metadata for each bounding box,
| each letter, and each table cell.
|
| You can now start feeding this data programmatically into an LLM
| to do _text_ processing, where you use the XML to control what
| parts of the document you send to the LLM.
|
| You then get chunking with location data and confidence scores of
| every part of the document to put as meta data into the RAG
| store.
|
| I've built a system that reads 500k pages _per day_ using the
| above, completely locally, on a machine that cost $20k.
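|
| A sketch of the layout-detection and OCR legs (the YOLO weights
| here are hypothetical; you'd train your own on document layouts as
| described above):
|
|     import pytesseract
|     from PIL import Image
|     from ultralytics import YOLO
|
|     layout = YOLO("doc_layout_yolov8.pt")   # custom-trained model
|     page = Image.open("page_0001.png")
|
|     for det in layout(page)[0].boxes:
|         x1, y1, x2, y2 = map(int, det.xyxy[0].tolist())
|         region = page.crop((x1, y1, x2, y2))
|         ocr = pytesseract.image_to_data(
|             region, output_type=pytesseract.Output.DICT)
|         words = [(w, float(c)) for w, c in
|                  zip(ocr["text"], ocr["conf"]) if w.strip()]
|         # det.conf is the layout-box confidence; each word carries
|         # its own OCR confidence -- both become metadata in the XML
|         print(float(det.conf), words[:5])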
| ck_one wrote:
| What object detection model do you use?
| a3w wrote:
| Is tesseract even ML based? Oh, this piece of software is
| more than 19 years old, perhaps there are other ways to do
| good, cheap OCR now. Does Gemini have an OCR library,
| internally? For other LLMs, I had the feeling that the LLM
| scripts a few lines of python to do the actual heavy lifting
| with a common OCR framework.
| llm_trw wrote:
| Custom trained yolo v8. I've moved on since then and the work
| was done in 2023. You'd get better results for much less
| today.
| ajcp wrote:
| Not sure what service you're basing your calculation on but
| with Gemini I've processed 10,000,000+ shipping documents (PDFs
| and PNGs) of every conceivable layout in one month at under
| $1000 and an accuracy rate of between 80-82% (humans were at
| 66%).
|
| The longest part of the development timeline was establishing
| the accuracy rate and the ingestion pipeline, which itself is
| massively less complex than what your workflow sounds like: PDF
| -> Storage Bucket -> Gemini -> JSON response -> Database
|
| Just to get sick with it we actually added some recursion to the
| Gemini step to have it rate how well it extracted, and if it
| was below a certain rating to rewrite its own instructions on
| how to extract the information and then feed it back into
| itself. We didn't see any improvement in accuracy, but it was
| still fun to do.
| cpursley wrote:
| Very cool! How are you storing it to a database - vectors?
| What do you do with the extracted data (in terms of being
| able to pull it up via some query system)?
| ajcp wrote:
| In this use-case the customer just wanted data not
| currently in the warehouse inventory management system
| captured, so here we converted a JSON response to a
| classic table row schema (where 1 row = 1 document) and now
| boom, shipping data!
|
| However we do very much recommend storing the raw model
| responses for audit and then at least as vector embeddings
| to orient the expectation that the data will need to be
| utilized for vector search and RAG. Kind of like "while
| we're here why don't we do what you're going to want to do
| at some point, even if it's not your use-case now..."
| polote wrote:
| Do you know another model than gridformer to detect table that
| has an available implementation somewhere ?
| dr_kiszonka wrote:
| Impressive. Can you share anything more about this project?
| 500k pages a day is massive and I can imagine why one would
| require that much throughput.
| oedemis wrote:
| there is also https://ds4sd.github.io/docling/ from IBM Research,
| which is MIT-licensed and tracks bounding boxes in a rich JSON
| format
| pbronez wrote:
| Docling has worked well for me. It handles scenarios that
| crashed ChatGPT Pro. Only problem is it's super annoying to
| install. When I have a minute I might package it for homebrew.
| sergiotapia wrote:
| The article mentions OCR, but you're sending a PDF, so how is
| that OCR? Or is this a mistake? What if you send photos of the
| pages, which would be true OCR - does the performance and price
| remain the same?
|
| If so this unlocks a massive workflow for us.
| xnx wrote:
| Glad Gemini is getting some attention. Using it is like a
| superpower. There are so many discussions about ChatGPT, Claude,
| DeepSeek, Llama, etc. that don't even mention Gemini.
| throwaway314155 wrote:
| Google had a pretty rough start compared to ChatGPT, Claude. I
| suspect that left a bad taste in many people's mouths. In
| particular because evaluating so many LLM's is a lot of effort
| on its own.
|
| Llama and DeepSeek are no-brainers; the weights are public.
| beastman82 wrote:
| No brainer if you're sitting on a >$100k inference server.
| throwaway314155 wrote:
| Sure, that's fair. If you're aiming for state of the art
| performance. Otherwise, you can get close and do it on
| reasonably priced hardware by using smaller distilled
| and/or quantized variants of llama/r1.
|
| Really though I just meant "it's a no-brainer that they are
| popular here on HN".
| sumedh wrote:
| Google was not serious about LLMs; they could not even figure
| out what to call them. There is always a risk that they will get
| bored and just kill the whole thing.
| Workaccount2 wrote:
| Before 2.0 models their offerings were pretty underwhelming,
| but now they can certainly hold their own. I think Gemini will
| ultimately be the LLM that eats the world, Google has the
| talent and most importantly has their own custom hardware
| (hence why their prices are dirt cheap and context is huge).
| roywashere wrote:
| I think it is very ironic that we chose to use PDF in many fields
| to archive data because it is a standard and because we would be
| able to open our pdf documents in 50 or 100 years time. So here
| we are just a couple of years later facing the challenge of
| getting the data out of our stupid PDF documents already!
| esafak wrote:
| It's not ironic. PDFs are a _container_ , which can hold
| scanned documents as well as text. Scanned documents need to be
| OCR'd and analyzed for their layout. This is not a failing of the PDF
| format, but a problem inherent to working with print scans.
| scotty79 wrote:
| Pdf is a horrible format. Even if it contains plain text it
| has no concept of something as simple as paragraphs.
| gapeslape wrote:
| In my mind, Gemini 2.0 changes everything because of the
| incredibly long context (2M tokens on some models), while having
| strong reasoning capabilities.
|
| We are working on compliance solution (https://fx-lex.com) and
| RAG just doesn't cut it for our use case. Legislation cannot be
| chunked if you want the model to reason well about it.
|
| It's magical to be able to just throw everything into the model.
| And the best thing is that we automatically benefit from future
| model improvements along all performance axes.
| mateuszbuda wrote:
| There's AWS Bedrock Knowledge Base (Amazon proprietary RAG
| solution) which can digest PDFs and, as far as I tested it on
| real world documents, it works pretty well and is cost effective.
| erulabs wrote:
| Hrm I've been using a combo of Textract (for bounding boxes) and
| AI for understanding the contents of the document. Textract is
| excellent at bounding boxes and exact-text capture, but LLMs are
| excellent at understanding when a messy/ugly bit of a form is
| actually one question, or if there are duplicate questions etc.
|
| Correlating the two (Textract <-> AI) output is difficult, but
| another round of AI is usually good at that. Combined with some
| text-difference scoring and logic, I can get pretty good full-
| document understanding of questions and answer locations. I've
| spent a pretty absurd amount of time on this and as of yet have
| not launched a product with it, but if anyone is interested I'd
| love to chat about the pipeline!
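|
| If it helps anyone, the text-diff scoring side can be as simple as
| this sketch (the Textract lines/boxes here are made up; low scores
| are the cases worth sending to that second AI round):
|
|     from difflib import SequenceMatcher
|
|     textract_lines = [  # (text, geometry) pairs from Textract blocks
|         ("Applicant name: Jane Doe",
|          {"Left": 0.08, "Top": 0.21, "Width": 0.40, "Height": 0.02}),
|         ("Date of birth: 1990-01-01",
|          {"Left": 0.08, "Top": 0.25, "Width": 0.35, "Height": 0.02}),
|     ]
|
|     def locate(extracted):
|         # best-matching Textract line for an LLM-extracted answer
|         return max(textract_lines,
|                    key=lambda t: SequenceMatcher(
|                        None, extracted.lower(), t[0].lower()).ratio())
|
|     print(locate("Jane Doe"))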
| pbronez wrote:
| Wonder how this compares to Docling. So far that's been the only
| tool that really unlocked PDFs for me. It's solid but really
| annoying to install.
|
| https://ds4sd.github.io/docling/
___________________________________________________________________
(page generated 2025-02-05 23:00 UTC)