[HN Gopher] Show HN: Documind - Open-source AI tool to turn docu...
       ___________________________________________________________________
        
       Show HN: Documind - Open-source AI tool to turn documents into
       structured data
        
       Documind is an open-source tool that turns documents into
       structured data using AI.  What it does:  - Extracts specific data
       from PDFs based on your custom schema - Returns clean, structured
       JSON that's ready to use - Works with just a PDF link + your schema
       definition  Just run npm install documind to get started.
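       A small sketch of the schema-driven flow the README describes. The
       field names, types, and result below are hypothetical illustrations,
       not Documind's actual schema format:

```javascript
// Hypothetical invoice schema and the kind of structured JSON the README
// promises. Field names and the type-as-string convention are assumptions.
const schema = {
  invoice_number: "string",
  issue_date: "string",
  total_amount: "number",
};

// Example of a clean, structured result for that schema:
const result = {
  invoice_number: "INV-2024-0042",
  issue_date: "2024-11-01",
  total_amount: 1299.5,
};

// Every schema field should appear in the result with the declared type.
const valid = Object.entries(schema).every(
  ([key, type]) => typeof result[key] === type
);
console.log(valid); // true
```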
        
       Author : Tammilore
       Score  : 139 points
       Date   : 2024-11-18 10:51 UTC (12 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | bob778 wrote:
       | From just reading the README, the example is not valid JSON. Is
       | that intentional?
       | 
       | Otherwise it seems like a prompt building tool, or am I missing
       | something here?
        
         | assanineass wrote:
         | Oof you're right LOL
        
         | Tammilore wrote:
         | Thanks for pointing this out. This was an error on my part.
         | 
         | I see someone opened an issue for it so will fix now.
        
       | rkuodys wrote:
        | Just this weekend I was solving a similar problem.
       | 
        | What I've noticed is that on scanned documents, where stamp
        | text and handwriting are just as important as printed text,
        | Gemini was way better than ChatGPT.
        | 
        | Of course, my prompts might have been an issue, but Gemini
        | produced significantly better results with very brief and
        | generic queries.
        
       | inexcf wrote:
       | Got excited about an open-source tool doing this.
       | 
        | Alas, I am let down. It is an open-source tool that builds the
        | prompt for the OpenAI API, and I can't go and send customer
        | data to them.
       | 
       | I'm aware of https://github.com/clovaai/donut so i hoped this
       | would be more like that.
        
         | _joel wrote:
          | You can self-host OpenAI-compatible models with LM Studio
          | and the like. I've used it with https://anythingllm.com/
        
         | turblety wrote:
          | You might be able to use Ollama, which has an
          | OpenAI-compatible API.
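          A minimal sketch of that swap, assuming a default local Ollama
          install. The model name is an assumption, and (as noted below)
          Documind itself would still need a code change to accept a
          custom base URL:

```javascript
// Ollama exposes an OpenAI-compatible API on its default local port, so an
// OpenAI-style client only needs its base URL redirected. The model name is
// a placeholder for whatever vision-capable model you have pulled locally.
const clientConfig = {
  baseURL: "http://localhost:11434/v1", // Ollama's OpenAI-compatible endpoint
  apiKey: "ollama", // Ollama ignores the key, but client libraries require one
};
const model = "llama3.2-vision"; // hypothetical local model name
console.log(clientConfig.baseURL);
```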
        
           | Zambyte wrote:
            | Not without changing the code (should be easy though)
           | 
           | https://github.com/DocumindHQ/documind/blob/d91121739df03867.
           | ..
        
         | Tammilore wrote:
         | Hi. I totally get the concern about sending data to OpenAI.
         | Right now, Documind uses OpenAI's API just so people could
         | quickly get started and see what it is like, but I'm open to
         | adding options and contributions that would be better for
         | privacy.
        
       | danbruc wrote:
       | With such a system, how do you ensure that the extracted data
       | matches the data in the source document? Run the process several
       | times and check that the results are identical? Can it reject
       | inputs for manual processing? Or is it intended to be always
       | checked manually? How good is it, how many errors does it make,
       | say per million extracted values?
        
         | glorpsicle wrote:
         | Perhaps there's still value in the documents being transformed
         | by this tool and someone reviewing them manually, but obviously
         | the real value would be in reducing manual review. I don't
          | think there's a world--for now--in which this manual review can
         | be completely eliminated.
         | 
         | However, if you process, say, 1 million documents, you could
         | sample and review a small percentage of them manually (a power
         | calculation would help here). Assuming your random sample
         | models the "distribution" (which may be tough to
         | define/summarize) of the 1 million documents, you could then
         | extrapolate your accuracy onto the larger set of documents
         | without having to review each and every one.
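            The sampling idea above can be sketched with a binomial
            confidence interval; a Wilson score interval is one common
            choice for bounding a proportion, and the numbers here are
            made up:

```javascript
// Review a random sample of extractions by hand, then bound the error rate
// for the full batch with a Wilson score interval (a standard binomial
// proportion interval; z = 1.96 gives roughly 95% coverage).
function wilsonInterval(errors, n, z = 1.96) {
  const p = errors / n;
  const denom = 1 + (z * z) / n;
  const center = (p + (z * z) / (2 * n)) / denom;
  const margin =
    (z * Math.sqrt((p * (1 - p)) / n + (z * z) / (4 * n * n))) / denom;
  return [center - margin, center + margin];
}

// E.g. 12 errors found in a manually reviewed sample of 1,000 documents:
const [lo, hi] = wilsonInterval(12, 1000);
console.log(lo.toFixed(4), hi.toFixed(4)); // roughly 0.0069 0.0209
```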
        
           | danbruc wrote:
           | You can sample the result to determine the error rate, but if
           | you find an unacceptable level of errors, then you still have
           | to review everything manually. On the other hand, if you use
           | traditional techniques, pattern matching with regular
           | expressions and things like that, then you can probably get
           | pretty close to perfection for those cases where your
           | patterns match and you can just reject the rest for manual
           | processing. Maybe you could ask a language model to compare
           | the source document and the extracted data and to indicate
           | whether there are errors, but I am not sure if that would
           | help, maybe what tripped up the extraction would also trip up
           | the result evaluation.
        
       | khaki54 wrote:
       | Not sure I would want something non-deterministic in my data
       | pipeline. Maybe if it used GenAI to _develop a ruleset_ that
       | could then be deployed, it would be more practical.
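        A minimal sketch of that "deploy a ruleset" idea: the rules run
        deterministically at extraction time (here hand-written, though
        they could be LLM-generated once and then frozen). The patterns
        and field names are illustrative only:

```javascript
// Deterministic ruleset: each field maps to a regex with one capture group.
// Rules like these could be authored by an LLM offline, reviewed, and then
// deployed as a fixed, reproducible pipeline stage.
const rules = {
  invoice_number: /Invoice\s*#?\s*([A-Z0-9-]+)/i,
  total: /Total\s*:?\s*\$?([\d,]+\.\d{2})/i,
};

function applyRules(text, ruleset) {
  const out = {};
  for (const [field, pattern] of Object.entries(ruleset)) {
    const m = text.match(pattern);
    out[field] = m ? m[1] : null; // null => reject for manual review
  }
  return out;
}

const sample = "Invoice #INV-0042 ... Total: $1,299.50";
console.log(applyRules(sample, rules));
// { invoice_number: 'INV-0042', total: '1,299.50' }
```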
        
       | fredtalty5 wrote:
       | Documind: Open-Source AI for Document Data Extraction
       | 
       | If you're dealing with unstructured data trapped in PDFs,
       | Documind might be the tool you've been waiting for. It's an open-
       | source solution that simplifies the process of turning documents
       | into clean, structured JSON data with the power of AI.
       | 
        | Key Features:
        | 
        | 1. Customizable Data Extraction: Define your own schema to
        | extract exactly the information you need from PDFs--no
        | unnecessary clutter.
        | 
        | 2. Simple Input, Clean Output: Just provide a PDF link and
        | your schema definition, and it returns structured JSON data,
        | ready to integrate into your workflows.
        | 
        | 3. Developer-Friendly: With a simple setup (`npm install
        | documind`), you can get started right away and start
        | automating tedious document processing tasks.
       | 
       | Whether you're automating invoice processing, handling contracts,
       | or working with any document-heavy workflows, Documind offers a
       | lightweight, accessible solution. And since it's open-source, you
       | can customize it further to suit your specific needs.
       | 
       | Would love to hear if others in the community have tried it--how
       | does it stack up for your use cases?
        
       | avereveard wrote:
       | > an interesting open source project
       | 
       | enthusiastically setting up a lounge chair
       | 
       | > OPENAI_API_KEY=your_openai_api_key
       | 
       | carrying it back apathetically
        
         | Tammilore wrote:
          | Thanks for the laugh and your feedback! I know that depending
          | on OpenAI's API isn't ideal for everyone. I'm considering
          | ways to make it more self-contained in the future, so it's
          | great to hear what users are looking for.
        
           | avereveard wrote:
            | litellm would be a start: you just pass in a model string
            | that includes the provider and can default to OpenAI
            | GPTs. That removes most of the effort in adapting stuff,
            | both for you and for other users.
        
       | gibsonf1 wrote:
       | I'm not sure having statistics with fabrication try to extract
        | text from PDFs would result in any mission-critical reliable
       | data?
        
       | eichi wrote:
        |     const systemPrompt = `
        |       Convert the following PDF page to markdown.
        |       Return only the markdown with no explanation text.
        |       Do not include deliminators like '''markdown.
        |       You must include all information on the page.
        |       Do not exclude headers, footers, or subtext.
        |     `;
        
       | thor-rodrigues wrote:
       | Very nice tool! Just last week, I was working on extracting
       | information from PDFs for an automation flow I'm building. I used
       | Unstructured (https://unstructured.io/), which supports multiple
       | file types, not just PDFs.
       | 
       | However, my main issue is that I need to work with confidential
       | client data that cannot be uploaded to a third party. Setting up
       | the open-source, locally hosted version of Unstructured was quite
       | cumbersome due to the numerous additional packages and
       | installation steps required.
       | 
       | While I'm open to the idea of parsing content with an LLM that
       | has vision capabilities, data safety and confidentiality are
        | critical for many applications. I think your project would go
        | from good to great if it were possible to connect to Ollama
        | and run locally.
       | 
        | That said, this is an excellent application! I can definitely
        | see myself using it in other projects that don't demand such
        | stringent data confidentiality.
        
         | Tammilore wrote:
         | Thank you, I appreciate the feedback! I understand people
         | wanting data confidentiality and I'm considering connecting
         | Ollama for future updates!
        
       | ajith-joseph wrote:
       | This looks like a promising tool for working with unstructured
       | documents! A few questions come to mind:
       | 
       | 1) Data Accuracy: How do you ensure the extracted data aligns
       | perfectly with the source? Are there specific safeguards or
       | confidence scoring mechanisms in place to flag potentially
       | inaccurate extractions, or is this left entirely to manual
       | review?
       | 
       | 2) Customization and Flexibility: Many real-world scenarios
       | involve highly specific schemas or even multi-step extraction
       | workflows. Does Documind allow for layered or conditional parsing
       | where fields depend on the values of others?
       | 
       | 3) Local Hosting for Confidential Data: Data confidentiality is a
       | big concern for many businesses (e.g., legal or financial
       | industries). While it's great that Documind is open source, do
       | you have any built-in provisions or guides for secure local
       | hosting, especially in resource-constrained environments?
       | 
       | Looking forward to seeing how this evolves--seems like a tool
       | with great potential for streamlining document processing!
        
       | asjfkdlf wrote:
        | I am looking for a similar service that turns any document
        | (PNG, PDF, DOCX) into JSON (preserving field relationships). I
       | tried with ChatGPT, but hallucinations are common. Does anything
       | exist?
        
         | omk wrote:
         | This is also using OpenAI's GPT model. So the same
         | hallucinations are probable here for PDFs.
        
         | cccybernetic wrote:
         | I built a drag-and-drop document converter that extracts text
         | into custom columns (for CSV) or keys (for JSON). You can
         | schedule it to run at certain times and update a database as
         | well.
         | 
         | I haven't had issues with hallucinations. If you're interested,
         | my email is in my bio.
        
       | hirezeeshan wrote:
        | That's a valid problem you are solving. I had a similar use
        | case that I solved using PDF[dot]co.
        
       | azinman2 wrote:
       | Looking at the source it seems this is just a thin wrapper over
       | OpenAI. Am I missing something?
        
       | emmanueloga_ wrote:
       | From the source, Documind appears to:
       | 
       | 1) Install tools like Ghostscript, GraphicsMagick, and
       | LibreOffice with a JS script. 2) Convert document pages to Base64
       | PNGs and send them to OpenAI for data extraction. 3) Use Supabase
       | for unclear reasons.
       | 
       | Some issues with this approach:
       | 
       | * OpenAI may retain and use your data for training, raising
       | privacy concerns [1].
       | 
       | * Dependencies should be managed with Docker or package managers
       | like Nix or Pixi, which are more robust. Example: a tool like
       | Parsr [2] provides a Dockerized pdf-to-json solution, complete
        | with OCR support and an HTTP API.
       | 
       | * GPT-4 vision seems like a costly, error-prone, and unreliable
       | solution, not really suited for extracting data from sensitive
       | docs like invoices, without review.
       | 
       | * Traditional methods (PDF parsers with OCR support) are cheaper,
       | more reliable, and avoid retention risks for this particular use
       | case. Although these tools do require some plumbing... probably
       | LLMs can really help with that!
       | 
       | While there are plenty of tools for structured data extraction, I
       | think there's still room for a streamlined, all-in-one solution.
       | This gap likely explains the abundance of closed-source
       | commercial options tackling this very challenge.
       | 
       | ---
       | 
       | 1: https://platform.openai.com/docs/models#how-we-use-your-data
       | 
       | 2: https://github.com/axa-group/Parsr
        
         | groby_b wrote:
         | That's not what [1] says, though? Quoth: "As of March 1, 2023,
         | data sent to the OpenAI API will not be used to train or
         | improve OpenAI models (unless you explicitly opt-in to share
         | data with us, such as by providing feedback in the Playground).
         | "
         | 
         | "Traditional methods (PDF parsers with OCR support) are
         | cheaper, more reliable"
         | 
         | Not sure on the reliability - the ones I'm using all fail at
         | structured data. You want a table extracted from a PDF, LLMs
         | are your friend. (Recommendations welcome)
        
           | niklasd wrote:
           | We found that for extracting tables, OpenAIs LLMs aren't
           | great. What is working well for us is Docling
           | (https://github.com/DS4SD/docling/)
        
         | brianjking wrote:
          | OpenAI isn't retaining data sent via the API for training.
          | Stop.
        
       | infecto wrote:
        | Multimodal LLMs are not the way to do this for a business
        | workflow yet.
        | 
        | In my experience you're much better off starting with Azure
        | Doc Intelligence or AWS Textract to first get the structure of
        | the document (PDF). These tools are incredibly robust and do a
        | great job with most of the common cases you can throw at them.
        | From there you can use an LLM to interrogate and structure the
        | data to your heart's delight.
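        A hedged sketch of that two-stage flow: the layout service
        returns structured blocks, and flattening them into ordered lines
        gives the LLM a cleaner prompt than raw PDF bytes. The block
        shape here is deliberately simplified, not the real Textract or
        Doc Intelligence response schema:

```javascript
// Stage 1 output (simplified): a layout service emits typed blocks with
// positions. Stage 2 would hand the flattened text to an LLM for structuring.
const blocks = [
  { type: "LINE", text: "Invoice #INV-0042", top: 0.1 },
  { type: "LINE", text: "Total: $1,299.50", top: 0.8 },
  { type: "WORD", text: "Invoice", top: 0.1 }, // word-level duplicates are skipped
];

const pageText = blocks
  .filter((b) => b.type === "LINE") // keep line-level blocks only
  .sort((a, b) => a.top - b.top)    // restore top-to-bottom reading order
  .map((b) => b.text)
  .join("\n");

console.log(pageText);
// Invoice #INV-0042
// Total: $1,299.50
```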
        
         | IndieCoder wrote:
          | Plus one, I'm using the exact same setup at scale. If Azure
          | Doc Intelligence gets too expensive, VLMs also work great.
        
           | vinothgopi wrote:
           | What is a VLM?
        
             | saharhash wrote:
             | Vision Language Model like Qwen VL
              | https://github.com/QwenLM/Qwen2-VL or ColPali
             | https://huggingface.co/blog/manu/colpali
        
         | disgruntledphd2 wrote:
         | > AWS Textract to first get the structure of the document
         | (PDF). These tools are incredibly robust and do a great job
         | with most of the common cases you can throw at it.
         | 
         | Do they work for Bills of Lading yet? When I tested a sample of
         | these bills a few years back (2022 I think), the results were
         | not good at all. But I honestly wouldn't be surprised if they'd
         | massively improved lately.
        
       | constantinum wrote:
        | Reading through the comments, some of the common requirements
        | for document extraction are:
       | 
       | * Run locally or on premise for security/privacy reasons
       | 
       | * Support multiple LLMs and vector DBs - plug and play
       | 
       | * Support customisable schemas
       | 
       | * Method to check/confirm accuracy with source
       | 
       | * Cron jobs for automation
       | 
       | There is Unstract that solves the above requirements.
       | 
       | https://github.com/Zipstack/unstract
        
       | vr46 wrote:
       | I'll have to test this against my local Python pipeline which
       | does all this without an LLM in attendance. There are a ton of
       | existing Python libraries which have been doing this for a long
       | time, so let's take a look..
        
         | thegabriele wrote:
         | Care to share the best ones for some use cases? Thanks
        
           | vr46 wrote:
           | MinerU
           | 
           | PDFQuery
           | 
           | PyMuPDF (having more success with older versions, right now)
        
       ___________________________________________________________________
       (page generated 2024-11-18 23:01 UTC)