[HN Gopher] Nvidia-Ingest: Multi-modal data extraction
___________________________________________________________________
Nvidia-Ingest: Multi-modal data extraction
Author : mihaid150
Score : 125 points
Date : 2025-01-10 09:17 UTC (13 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| vardump wrote:
| Sounds pretty useful. What are the system requirements?
|          Prerequisites (Hardware):
|            GPU    Family        Memory   # of GPUs (min.)
|            H100   SXM or PCIe   80GB     2
|            A100   SXM or PCIe   80GB     2
|
| Hmm, perhaps this is not for me.
| neuroelectron wrote:
| Seems pretty ridiculous to me just for parsing some PDFs.
| Almost like they made this as bloated as possible to justify
| buying $5,000+ GPUs for an office.
| vardump wrote:
| I think those GPUs cost between $25k and $40k each.
| latchkey wrote:
| Why even buy them at this point... just rent from a neocloud
| for $1-2/hr... even at $2/hr, that's over a year of rental
| for $25k... by then you'd have made your money off the
| implementation.
| vardump wrote:
| Not sure whether I'd like to send potentially sensitive
| documents to a lesser known provider. Or even to a well
| known.
| latchkey wrote:
| Even at $3/hour (which is above the current market rate),
| that's roughly a year.
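|
| As a quick sketch of that arithmetic (the purchase price and
| hourly rate are the numbers quoted in this thread, not market
| data):
|
|     # Breakeven for buying a ~$25k GPU vs. renting by the hour.
|     PURCHASE_PRICE = 25_000   # USD, low end of the quoted range
|     HOURLY_RATE = 3.0         # USD/hr, above market per the thread
|
|     hours = PURCHASE_PRICE / HOURLY_RATE
|     print(f"{hours:,.0f} h = {hours / 24:,.0f} days of 24/7 rental")
|     # -> 8,333 h = 347 days: "roughly a year"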
|
| I genuinely appreciate your perspective, but as a
| smaller, lesser-known provider, I'd like to understand
| your concerns better.
|
| Are you worried that I might misuse your data and
| compromise my entire business, by selling it to the
| highest bidder? Do you feel uncertain about the security
| of my systems? Or is it a belief that owning and managing
| the hardware yourself gives you greater control over
| security?
|
| What kind of validation or reassurance would help address
| these concerns?
| greatgib wrote:
| I have a hard time understanding what they mean by "early
| access micro services"...
|
| Does it mean that it is yet another wrapper library to call
| their proprietary cloud API?
|
| Or that, when you have the specific access rights, you can
| retrieve a proprietary Docker image with secret proprietary
| binary stuff inside, which will be the server used by the
| library available on GitHub?
| theossuary wrote:
| The latter. NIM is Nvidia's umbrella branding for proprietary
| containerized AI models, and it's being pushed hard by Jensen.
| They build models and containers, then push them to
| ngc.nvidia.com. They then provide reference architectures
| which rely on them. In this case the images are in an
| invite-only org, so to use the Helm chart you have to sign
| up, request access, then use an API key to pull the image.
|
| You can imagine how fun it is to debug.
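|
| As a minimal sketch of that pull flow (the image path is the
| one quoted later in the thread; the key value is a
| placeholder, and the literal "$oauthtoken" username is NGC's
| documented convention):
|
|     import subprocess
|
|     NGC_API_KEY = "nvapi-..."  # placeholder: your early-access key
|
|     # nvcr.io expects the literal username "$oauthtoken"; the
|     # NGC API key is the password.
|     subprocess.run(
|         ["docker", "login", "nvcr.io",
|          "--username", "$oauthtoken", "--password-stdin"],
|         input=NGC_API_KEY.encode(), check=True)
|
|     # Pull an image from the invite-only early-access org.
|     subprocess.run(
|         ["docker", "pull",
|          "nvcr.io/ohlfw0olaadg/ea-participants/nv-ingest:24.10"],
|         check=True)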
| joaquincabezas wrote:
| lol, while checking which OCR it uses (PaddleOCR) I found a
| line with the text "TODO(Devin)" and was pretty excited,
| thinking they were already using Devin AI...
|
| "Devin Robison" is the author of the package!! Funny, I guess
| it will be similar for people named Alexa.
| shutty wrote:
| Wow, I perhaps need a kubernetes cluster just for a demo:
|         CONTAINER ID   IMAGE
|         0f2f86615ea5   nvcr.io/ohlfw0olaadg/ea-participants/nv-ingest:24.10
|         de44122c6ddc   otel/opentelemetry-collector-contrib:0.91.0
|         02c9ab8c6901   nvcr.io/ohlfw0olaadg/ea-participants/cached:0.2.0
|         d49369334398   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.0
|         508715a24998   nvcr.io/ohlfw0olaadg/ea-participants/nv-yolox-structured-images-v1:0.2.0
|         5b7a174a0a85   nvcr.io/ohlfw0olaadg/ea-participants/deplot:1.0.0
|         430045f98c02   nvcr.io/ohlfw0olaadg/ea-participants/paddleocr:0.2.0
|         8e587b45821b   grafana/grafana
|         aa2c0ec387e2   redis/redis-stack
|         bda9a2a9c8b5   openzipkin/zipkin
|         ac27e5297d57   prom/prometheus:latest
| threeseed wrote:
| You can just use k3s/rke2 and run everything on the same node.
| verdverm wrote:
| You can run vanilla k8s on a single node too
| fsniper wrote:
| It may be the least of your worries, considering it requires
| 2x A100/H100 with 80GB VRAM.
| mdaniel wrote:
| Also, they're rolling the dice continuing to use Redis
| https://github.com/redis/redis/blob/21aee83abdbfe8878d8b870b...
| mirekrusin wrote:
| You think there is a risk of them pivoting from this project
| to providing redis as a service?
| ixaxaar wrote:
| Ah, so NIM is a set of microservices on top of various
| models, and this is another set of microservices using NIM
| microservices to do large-scale OCR?
|
| And that, too, integrated with Prometheus, a 160GB VRAM
| requirement and so on?
|
| Looks like this is targeted at enterprises, or maybe
| governments etc., trying to digitize at scale.
| jappgar wrote:
| Nvidia getting in on the lucrative gpt-wrapper market.
| dragonwriter wrote:
| If it were a GPT wrapper, it wouldn't require an A100/H100
| GPU; the container has a model wrapper, sure, but it also has
| the wrapped, standalone model; it's not calling OpenAI's
| model.
| hammersbald wrote:
| Is there an OCR toolkit or an ML model which is able to
| reliably extract tables from invoices?
| benpacker wrote:
| All frontier multi-modal LLMs can do this - there's likely
| something lighter weight as well.
|
| In my experience, the latest Gemini is best at vision and OCR.
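|
| For instance, a minimal sketch using the google-generativeai
| SDK (model choice, file name and prompt are illustrative):
|
|     import google.generativeai as genai
|     from PIL import Image
|
|     genai.configure(api_key="...")  # your Gemini API key
|     model = genai.GenerativeModel("gemini-1.5-flash")
|
|     invoice = Image.open("invoice.png")
|     response = model.generate_content([
|         "Extract every line-item table from this invoice as "
|         "CSV with a header row. Output only the CSV.",
|         invoice,
|     ])
|     print(response.text)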
| michaelt wrote:
| _> All frontier multi modal LLMs can do this_
|
| There's reliable, and there's reliable. For example [1] is a
| conversation where I ask ChatGPT 4o questions about a seven-
| page tabular PDF from [2] which contains a list of election
| polling stations.
|
| The results are simultaneously impressive and unimpressive.
| The document contains some repeated addresses, and the LLM
| correctly identifies all 11 of them... then says it found
| ten.
|
| It gracefully deals with the PDF table, and converts the all-
| caps input data into Title Case.
|
| The table is split across multiple pages, and the title row
| repeats each time. It deals with that easily.
|
| It correctly finds all five schools mentioned.
|
| When asked to extract an address that isn't in the document
| it correctly refuses, instead of hallucinating an answer.
|
| When asked to count churches, "Bunyan Baptist Church" gets
| missed out. Of two church halls, only one gets counted.
|
| The "Friends Meeting House" also doesn't get counted, but
| arguably that's not a church even if it is a place of
| worship.
|
| Longmeadow Evangelical Church has one address, three rows and
| two polling station numbers. When asked how many polling
| stations are in the table, the LLM counts that as two. A
| reasonable person might have expected one, two, three, or a
| warning. If I was writing an invoice parser, I would want
| this to be very predictable.
|
| So, it's a mixed bag. I've certainly seen worse attempts at
| parsing a PDF.
|
| [1] https://chatgpt.com/share/67812ad9-f2bc-8011-96be-
| faea40e48d... [2]
| https://www.stevenage.gov.uk/documents/elections/2024-pcc-
| el...
| philomath_mn wrote:
| I wonder if performance would improve if you asked it to
| create csvs from the tables first, then fed the CSVs in to
| a new chat?
| NeedMoreTime4Me wrote:
| Do I understand correctly that nearly all of the issues were
| related to counting (i.e. numerical operations)? That still
| makes it impressive, because you can do the counting
| client-side with the structured data.
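|
| i.e. once the model has emitted the table as structured
| text, the counting is exact and local. A minimal sketch with
| a hypothetical CSV excerpt:
|
|     import csv, io
|
|     # Hypothetical excerpt of what the model might return.
|     extracted = ("station,venue\n"
|                  "1,Bunyan Baptist Church\n"
|                  "2,St Nicholas Church Hall\n"
|                  "3,Longmeadow Evangelical Church\n")
|
|     rows = list(csv.DictReader(io.StringIO(extracted)))
|     churches = [r for r in rows
|                 if "church" in r["venue"].lower()]
|     print(len(churches))  # deterministic: 3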
| michaelt wrote:
| Some would say the numerical information is among the
| most important parts of an invoice.
| dragonwriter wrote:
| > There's reliable, and there's reliable. For example [1]
| is a conversation where I ask ChatGPT 4o questions about a
| seven-page tabular PDF from [2] which contains a list of
| election polling stations.
|
| From your description, it does perfectly at the task asked
| about upthread (extraction), and has mixed results on other
| question-answering tasks that weren't the subject.
| michaelt wrote:
| _> From your description, it does perfectly at the task
| asked about upthread (extraction) and has mixed results
| on other, question-answering, tasks, that weren't the
| subject._
|
| ¯\_(ツ)_/¯
|
| Which do you think was which?
| numba888 wrote:
| You can try asking it to list all the churches and assign
| them incremental numbers starting at 1, then print the last
| number. It's a variation of counting the 'r's in 'raspberry',
| which works better than a simple direct question.
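|
| Something like this (illustrative prompt only):
|
|     prompt = (
|         "List every church in the document, numbering them "
|         "1, 2, 3, ... one per line. After the list, output "
|         "only the final number."
|     )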
| CharlieDigital wrote:
| By far the best one I've come across is Microsoft Azure
| Document Intelligence with the Layout Model[0].
|
| It's really, really good at tables.
|
| You have to use the Layout Model and not just the base Document
| Intelligence.
|
| A bit pricey, but if you're processing content one time and
| it's high value (my use case is clinical trial protocol
| documents, and the trial will run anywhere from 6-24 months),
| then it's worth it, IMO.
|
| [0] https://learn.microsoft.com/en-us/azure/ai-
| services/document...
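|
| A minimal sketch against the azure-ai-formrecognizer SDK
| (endpoint, key and file name are placeholders):
|
|     from azure.ai.formrecognizer import DocumentAnalysisClient
|     from azure.core.credentials import AzureKeyCredential
|
|     client = DocumentAnalysisClient(
|         endpoint="https://<resource>.cognitiveservices.azure.com/",
|         credential=AzureKeyCredential("<key>"))
|
|     with open("protocol.pdf", "rb") as f:
|         poller = client.begin_analyze_document(
|             "prebuilt-layout", document=f)  # the Layout model
|     result = poller.result()
|
|     # Tables come back as cells with explicit row/column indices.
|     for table in result.tables:
|         for cell in table.cells:
|             print(cell.row_index, cell.column_index, cell.content)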
| ttt3ts wrote:
| https://github.com/microsoft/table-transformer
|
| This is much lighter weight and more reliable than a VLM.
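|
| A minimal detection sketch using the transformers checkpoint
| (the image path is a placeholder):
|
|     import torch
|     from PIL import Image
|     from transformers import (AutoImageProcessor,
|                               TableTransformerForObjectDetection)
|
|     ckpt = "microsoft/table-transformer-detection"
|     processor = AutoImageProcessor.from_pretrained(ckpt)
|     model = TableTransformerForObjectDetection.from_pretrained(ckpt)
|
|     image = Image.open("invoice.png").convert("RGB")
|     inputs = processor(images=image, return_tensors="pt")
|     with torch.no_grad():
|         outputs = model(**inputs)
|
|     # Keep table detections above a confidence threshold.
|     sizes = torch.tensor([image.size[::-1]])  # (height, width)
|     dets = processor.post_process_object_detection(
|         outputs, threshold=0.9, target_sizes=sizes)[0]
|     for score, label, box in zip(
|             dets["scores"], dets["labels"], dets["boxes"]):
|         print(model.config.id2label[label.item()],
|               round(score.item(), 3), box.tolist())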
| serjester wrote:
| As someone who spent quite a bit of time with
| table-transformers, I would definitely not recommend it. It
| was one of the first libraries we added for parsing tables in
| our chunking library [1], and the results were very
| underwhelming. That was a while back, and at this point it's
| just so much easier to use an LLM end to end for parsing docs
| (Gemini Flash can parse 20k pages per dollar), and I'm wary
| of any approach that stitches together different models.
|
| [1] https://github.com/Filimoa/open-parse/
| jonathan-adly wrote:
| I would like to throw our project in the ring. We use
| ColQwen2 over a ColPali implementation. Basically, a search &
| extract pipeline: https://docs.colivara.com/guide/markdown
| lyime wrote:
| So who is going to deploy this and turn this into a service/API?
| wiradikusuma wrote:
| Is this like an Nvidia version of MCP?
| (https://modelcontextprotocol.io/introduction)
| OutOfHere wrote:
| No relation.
| OutOfHere wrote:
| This requires Nvidia GPUs to run.
|
| The open question is whether to use rule-based parsing with
| simpler software, or model-based parsing with this software.
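|
| For contrast, a minimal sketch of the rule-based end of the
| spectrum, e.g. with pdfplumber (the file name is a
| placeholder): no GPU needed, and it works well when the PDF
| has a real text layer and a consistent layout:
|
|     import pdfplumber
|
|     with pdfplumber.open("invoice.pdf") as pdf:
|         for page in pdf.pages:
|             for table in page.extract_tables():
|                 for row in table:  # row is a list of cell strings
|                     print(row)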
| PeterStuer wrote:
| Before you get too excited, this needs 2 A100s or H100s
| minimum.
| alecco wrote:
| GH200 $1.49 / GPU / hr
|
| https://lambdalabs.com/nvidia-gh200
___________________________________________________________________
(page generated 2025-01-10 23:01 UTC)