[HN Gopher] Nvidia-Ingest: Multi-modal data extraction
       ___________________________________________________________________
        
       Nvidia-Ingest: Multi-modal data extraction
        
       Author : mihaid150
       Score  : 125 points
       Date   : 2025-01-10 09:17 UTC (13 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | vardump wrote:
       | Sounds pretty useful. What are the system requirements?
        | Prerequisites (hardware):
        |   GPU    Family        Memory   # of GPUs (min.)
        |   H100   SXM or PCIe   80GB     2
        |   A100   SXM or PCIe   80GB     2
       | 
       | Hmm, perhaps this is not for me.
        
         | neuroelectron wrote:
          | Seems pretty ridiculous to me just to parse some PDFs. Almost
          | like they made this as bloated as possible to justify buying
          | $5,000+ GPUs for an office.
        
           | vardump wrote:
            | I think those GPUs cost $25-40k each.
        
             | latchkey wrote:
              | Why even buy them at this point... just rent from a
              | neocloud for $1-2/hr... even at $2/hr, that's over a year
              | of rental for $25k... by then you'd have made your money
              | back on the implementation.
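The rent-vs-buy arithmetic in that comment can be checked quickly. A minimal sketch, using the purchase price and hourly rates quoted in this thread (not authoritative pricing), and assuming continuous utilization:

```python
HOURS_PER_YEAR = 24 * 365  # 8760 hours, assuming the GPU runs nonstop

def breakeven_years(purchase_price: float, hourly_rate: float) -> float:
    """Years of continuous rental that add up to the purchase price."""
    return purchase_price / (hourly_rate * HOURS_PER_YEAR)

print(f"{breakeven_years(25_000, 2.0):.2f}")  # 1.43: over a year at $2/hr
print(f"{breakeven_years(25_000, 3.0):.2f}")  # 0.95: roughly a year at $3/hr
```

At lower utilization the break-even point stretches out proportionally, which is the usual argument for renting bursty workloads.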
        
               | vardump wrote:
                | Not sure whether I'd like to send potentially sensitive
                | documents to a lesser-known provider. Or even to a
                | well-known one.
        
               | latchkey wrote:
               | Even at $3/hour (which is above the current market rate),
               | that's roughly a year.
               | 
               | I genuinely appreciate your perspective, but as a
               | smaller, lesser-known provider, I'd like to understand
               | your concerns better.
               | 
               | Are you worried that I might misuse your data and
               | compromise my entire business, by selling it to the
               | highest bidder? Do you feel uncertain about the security
               | of my systems? Or is it a belief that owning and managing
               | the hardware yourself gives you greater control over
               | security?
               | 
               | What kind of validation or reassurance would help address
               | these concerns?
        
       | greatgib wrote:
        | I have a hard time understanding what they mean by "early
        | access microservices"...
        | 
        | Does it mean that it is yet another wrapper library that calls
        | their proprietary cloud API?
        | 
        | Or that, when you have the right access, you can retrieve a
        | proprietary Docker image with secret proprietary binaries
        | inside, which will be the server used by the library available
        | on GitHub?
        
         | theossuary wrote:
          | The latter. NIM is Nvidia's umbrella branding for proprietary
          | containerized AI models, which is being pushed hard by Jensen.
         | They build models and containers, then push them to
         | ngc.nvidia.com. They then provide reference architectures which
         | rely on them. In this case the images are in an invite only
         | org, so to use the helm chart you have to sign up, request
         | access, then use an API key to pull the image.
         | 
         | You can imagine how fun it is to debug.
        
       | joaquincabezas wrote:
        | lol, while checking which OCR it uses (PaddleOCR) I found a
        | line with the text "TODO(Devin)" and got pretty excited
        | thinking they were already using Devin AI...
        | 
        | "Devin Robison" is the author of the package!! Funny, I guess
        | it will be similar with the name Alexa.
        
       | shutty wrote:
        | Wow, perhaps I need a Kubernetes cluster just for a demo:
        | CONTAINER ID   IMAGE
        | 0f2f86615ea5   nvcr.io/ohlfw0olaadg/ea-participants/nv-ingest:24.10
        | de44122c6ddc   otel/opentelemetry-collector-contrib:0.91.0
        | 02c9ab8c6901   nvcr.io/ohlfw0olaadg/ea-participants/cached:0.2.0
        | d49369334398   nvcr.io/nim/nvidia/nv-embedqa-e5-v5:1.1.0
        | 508715a24998   nvcr.io/ohlfw0olaadg/ea-participants/nv-yolox-structured-images-v1:0.2.0
        | 5b7a174a0a85   nvcr.io/ohlfw0olaadg/ea-participants/deplot:1.0.0
        | 430045f98c02   nvcr.io/ohlfw0olaadg/ea-participants/paddleocr:0.2.0
        | 8e587b45821b   grafana/grafana
        | aa2c0ec387e2   redis/redis-stack
        | bda9a2a9c8b5   openzipkin/zipkin
        | ac27e5297d57   prom/prometheus:latest
        
         | threeseed wrote:
         | You can just use k3s/rke2 and run everything on the same node.
        
           | verdverm wrote:
           | You can run vanilla k8s on a single node too
        
         | fsniper wrote:
          | It may be the least of your worries, considering it requires
          | 2x [A/H]100 with 80GB RAM each.
        
         | mdaniel wrote:
         | Also, they're rolling the dice continuing to use Redis
         | https://github.com/redis/redis/blob/21aee83abdbfe8878d8b870b...
        
           | mirekrusin wrote:
           | You think there is a risk of them pivoting from this project
           | to providing redis as a service?
        
       | ixaxaar wrote:
        | Ah, so NIM is a set of microservices on top of various models,
        | and this is another set of microservices using NIM
        | microservices to do large-scale OCR?
        | 
        | And all that integrated with Prometheus, a 160GB VRAM
        | requirement, and so on?
        | 
        | Looks like this is targeted at enterprises, or maybe
        | governments, trying to digitize at scale.
        
       | jappgar wrote:
       | Nvidia getting in on the lucrative gpt-wrapper market.
        
         | dragonwriter wrote:
          | If it were a GPT wrapper, it wouldn't require an A100/H100
          | GPU; the container has a model wrapper, sure, but it also has
          | the wrapped, standalone model; it's not calling OpenAI's
          | model.
        
       | hammersbald wrote:
        | Is there an OCR toolkit or ML model that can reliably extract
        | tables from invoices?
        
         | benpacker wrote:
          | All frontier multimodal LLMs can do this; there's likely
          | something lighter-weight as well.
          | 
          | In my experience, the latest Gemini is best at vision and OCR.
        
           | michaelt wrote:
           | _> All frontier multi modal LLMs can do this_
           | 
           | There's reliable, and there's reliable. For example [1] is a
           | conversation where I ask ChatGPT 4o questions about a seven-
           | page tabular PDF from [2] which contains a list of election
           | polling stations.
           | 
           | The results are simultaneously impressive and unimpressive.
           | The document contains some repeated addresses, and the LLM
           | correctly identifies all 11 of them... then says it found
           | ten.
           | 
           | It gracefully deals with the PDF table, and converts the all-
           | caps input data into Title Case.
           | 
           | The table is split across multiple pages, and the title row
           | repeats each time. It deals with that easily.
           | 
           | It correctly finds all five schools mentioned.
           | 
           | When asked to extract an address that isn't in the document
           | it correctly refuses, instead of hallucinating an answer.
           | 
           | When asked to count churches, "Bunyan Baptist Church" gets
           | missed out. Of two church halls, only one gets counted.
           | 
           | The "Friends Meeting House" also doesn't get counted, but
           | arguably that's not a church even if it is a place of
           | worship.
           | 
           | Longmeadow Evangelical Church has one address, three rows and
           | two polling station numbers. When asked how many polling
           | stations are in the table, the LLM counts that as two. A
           | reasonable person might have expected one, two, three, or a
           | warning. If I was writing an invoice parser, I would want
           | this to be very predictable.
           | 
           | So, it's a mixed bag. I've certainly seen worse attempts at
           | parsing a PDF.
           | 
           | [1] https://chatgpt.com/share/67812ad9-f2bc-8011-96be-
           | faea40e48d... [2]
           | https://www.stevenage.gov.uk/documents/elections/2024-pcc-
           | el...
        
             | philomath_mn wrote:
             | I wonder if performance would improve if you asked it to
             | create csvs from the tables first, then fed the CSVs in to
             | a new chat?
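That two-step approach is easy to try: have the model emit table rows on the first pass, serialize them to CSV locally, then start a fresh chat with only the CSV. A minimal sketch with Python's standard csv module; the rows below are invented for illustration, not taken from the actual polling-station PDF:

```python
import csv
import io

# Hypothetical rows from the model's first extraction pass over the PDF.
rows = [
    ["Station", "Venue", "Address"],
    ["1", "Bunyan Baptist Church", "Southend Road"],
    ["2", "Longmeadow Evangelical Church", "Longmeadow"],
]

# Serialize to CSV text client-side, with proper quoting handled for us.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()  # paste this into a new chat for Q&A

print(csv_text.splitlines()[0])  # Station,Venue,Address
```

Feeding back a clean, machine-produced CSV removes the page breaks and repeated header rows that the model otherwise has to re-parse on every question.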
        
             | NeedMoreTime4Me wrote:
              | Do I understand correctly that nearly all the issues were
              | related to counting (i.e. numerical operations)? That
              | still makes it impressive, because you can do the counting
              | client-side with the structured data.
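That split is the standard recommendation: let the model do the extraction, then count deterministically in code. A sketch of the client-side half, with invented (venue, polling-station numbers) pairs rather than the document's actual data:

```python
# Hypothetical extraction output: (venue, polling-station numbers) pairs.
extracted = [
    ("Bunyan Baptist Church", [12]),
    ("Longmeadow Evangelical Church", [3, 4]),
    ("St Nicholas Church Hall", [7]),
    ("Friends Meeting House", [9]),
]

# Counting happens client-side, so it is exact rather than model-estimated.
churches = [venue for venue, _ in extracted if "church" in venue.lower()]
stations = {n for _, nums in extracted for n in nums}

print(len(churches))  # 3
print(len(stations))  # 5
```

Note that the "is it a church?" judgment still lives in the matching rule (here a naive substring test), which is exactly the ambiguity the parent comments ran into with church halls and meeting houses.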
        
               | michaelt wrote:
               | Some would say the numerical information is among the
               | most important parts of an invoice.
        
             | dragonwriter wrote:
             | > There's reliable, and there's reliable. For example [1]
             | is a conversation where I ask ChatGPT 4o questions about a
             | seven-page tabular PDF from [2] which contains a list of
             | election polling stations.
             | 
             | From your description, it does perfectly at the task asked
             | about upthread (extraction) and has mixed results on other,
             | question-answering, tasks, that weren't the subject.
        
               | michaelt wrote:
               | _> From your description, it does perfectly at the task
               | asked about upthread (extraction) and has mixed results
                | on other, question-answering, tasks, that weren't the
                | subject._
               | 
                | ¯\\_(ツ)_/¯
               | 
               | Which do you think was which?
        
             | numba888 wrote:
              | You can try asking it to list all the churches and assign
              | each an incremental number starting with 1, then print
              | the last number. It's a variation of counting the 'r's in
              | 'raspberry', which works better than a simple direct
              | question.
        
         | CharlieDigital wrote:
         | By far the best one I've come across is Microsoft Azure
         | Document Intelligence with the Layout Model[0].
         | 
         | It's really, really good at tables.
         | 
         | You have to use the Layout Model and not just the base Document
         | Intelligence.
         | 
          | A bit pricey, but if you're processing content one time and
          | it's high value (my use case is clinical trial protocol
          | documents, and the trial will run anywhere from 6-24 months),
          | then it's worth it, IMO.
         | 
         | [0] https://learn.microsoft.com/en-us/azure/ai-
         | services/document...
        
         | ttt3ts wrote:
         | https://github.com/microsoft/table-transformer
         | 
         | This is much lighter weight and more reliable than vllm
        
           | serjester wrote:
            | As someone who spent quite a bit of time with table-
            | transformers, I would definitely not recommend it. It was
            | one of the first libraries we added for parsing tables into
            | our chunking library [1], and the results were very
            | underwhelming.
           | This was a while back and at this point, it's just so much
           | easier to use an LLM end to end for parsing docs (Gemini
           | Flash can parse 20k pages per dollar) and I'm wary of any
           | approach that stitches together different models.
           | 
           | [1] https://github.com/Filimoa/open-parse/
        
         | jonathan-adly wrote:
          | I would like to throw our project in the ring. We use
          | ColQwen2 over a ColPali implementation. Basically, a search &
          | extract pipeline: https://docs.colivara.com/guide/markdown
        
       | lyime wrote:
       | So who is going to deploy this and turn this into a service/API?
        
       | wiradikusuma wrote:
        | Is this like an Nvidia version of MCP?
       | (https://modelcontextprotocol.io/introduction)
        
         | OutOfHere wrote:
         | No relation.
        
       | OutOfHere wrote:
       | This requires Nvidia GPUs to run.
       | 
       | The open question is whether to use rule-based parsing using
       | simpler software or model-based parsing using this software.
        
       | PeterStuer wrote:
        | Before you get too excited: this needs a minimum of two A100s
        | or H100s.
        
         | alecco wrote:
         | GH200 $1.49 / GPU / hr
         | 
         | https://lambdalabs.com/nvidia-gh200
        
       ___________________________________________________________________
       (page generated 2025-01-10 23:01 UTC)