[HN Gopher] Show HN: Documind - Open-source AI tool to turn docu...
___________________________________________________________________
Show HN: Documind - Open-source AI tool to turn documents into
structured data
Documind is an open-source tool that turns documents into
structured data using AI.

What it does:

- Extracts specific data from PDFs based on your custom schema
- Returns clean, structured JSON that's ready to use
- Works with just a PDF link + your schema definition

Just run npm install documind to get started.
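
A minimal usage sketch (hypothetical: the function name, option
names, and schema shape below are assumptions for illustration,
not confirmed against the README):

    import { extract } from "documind";

    // Hypothetical schema; real field definitions may differ.
    const schema = [
      { name: "invoice_number", type: "string" },
      { name: "total_amount", type: "number" },
    ];

    const result = await extract({
      file: "https://example.com/invoice.pdf", // PDF link
      schema,
    });

    console.log(result); // structured JSON matching the schema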
Author : Tammilore
Score : 139 points
Date : 2024-11-18 10:51 UTC (12 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| bob778 wrote:
| From just reading the README, the example is not valid JSON. Is
| that intentional?
|
| Otherwise it seems like a prompt building tool, or am I missing
| something here?
| assanineass wrote:
| Oof you're right LOL
| Tammilore wrote:
| Thanks for pointing this out. This was an error on my part.
|
| I see someone opened an issue for it, so I will fix it now.
| rkuodys wrote:
| Just this weekend I was solving a similar problem.
|
| What I noticed is that on scanned documents, where stamped text
| and handwriting are just as important as the printed text,
| Gemini was way better than ChatGPT.
|
| Of course, my prompts might have been an issue, but Gemini
| produced significantly better results from very brief and
| generic queries.
| inexcf wrote:
| Got excited about an open-source tool doing this.
|
| Alas, I am let down. It is an open-source tool creating the
| prompt for the OpenAI API, and I can't go and send customer
| data to them.
|
| I'm aware of https://github.com/clovaai/donut so I hoped this
| would be more like that.
| _joel wrote:
| You can self-host OpenAI-compatible models with LM Studio and
| the like. I've used it with https://anythingllm.com/
| turblety wrote:
| You might be able to use Ollama, which has an OpenAI-compatible
| API.
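|
| Untested sketch of what that swap could look like with the
| OpenAI Node SDK pointed at Ollama's OpenAI-compatible endpoint
| (model name is just an example):
|
|     import OpenAI from "openai";
|
|     // Point the stock OpenAI client at a local Ollama server.
|     const client = new OpenAI({
|       baseURL: "http://localhost:11434/v1", // Ollama default
|       apiKey: "ollama", // required by SDK, ignored by Ollama
|     });
|
|     const res = await client.chat.completions.create({
|       model: "llama3.2-vision", // any locally pulled model
|       messages: [{ role: "user", content: "..." }],
|     });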
| Zambyte wrote:
| Not without changing the code (should be easy, though)
|
| https://github.com/DocumindHQ/documind/blob/d91121739df03867.
| ..
| Tammilore wrote:
| Hi. I totally get the concern about sending data to OpenAI.
| Right now, Documind uses OpenAI's API just so people can
| quickly get started and see what it is like, but I'm open to
| adding options and contributions that would be better for
| privacy.
| danbruc wrote:
| With such a system, how do you ensure that the extracted data
| matches the data in the source document? Run the process several
| times and check that the results are identical? Can it reject
| inputs for manual processing? Or is it intended to be always
| checked manually? How good is it, how many errors does it make,
| say per million extracted values?
| glorpsicle wrote:
| Perhaps there's still value in the documents being transformed
| by this tool and someone reviewing them manually, but obviously
| the real value would be in reducing manual review. I don't
| think there's a world, for now, in which this manual review
| can be completely eliminated.
|
| However, if you process, say, 1 million documents, you could
| sample and review a small percentage of them manually (a power
| calculation would help here). Assuming your random sample
| models the "distribution" (which may be tough to
| define/summarize) of the 1 million documents, you could then
| extrapolate your accuracy onto the larger set of documents
| without having to review each and every one.
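|
| A back-of-the-envelope version of that sampling math (a plain
| proportion estimate at 95% confidence, not a full power
| calculation; the numbers are illustrative):
|
|     // n = z^2 * p * (1 - p) / e^2
|     function sampleSize(p, e, z = 1.96) {
|       return Math.ceil((z * z * p * (1 - p)) / (e * e));
|     }
|
|     // Expecting ~2% error rate, tolerating a +/-0.5% margin:
|     sampleSize(0.02, 0.005); // => 3012 documents to review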
| danbruc wrote:
| You can sample the result to determine the error rate, but if
| you find an unacceptable level of errors, then you still have
| to review everything manually. On the other hand, if you use
| traditional techniques, pattern matching with regular
| expressions and things like that, then you can probably get
| pretty close to perfection for those cases where your patterns
| match, and you can just reject the rest for manual processing.
|
| Maybe you could ask a language model to compare the source
| document and the extracted data and to indicate whether there
| are errors, but I am not sure if that would help; maybe what
| tripped up the extraction would also trip up the result
| evaluation.
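|
| Sketch of the match-or-reject idea (the pattern is
| illustrative, not from any real pipeline):
|
|     // Deterministic extraction: anything the pattern doesn't
|     // match goes to a manual-review queue instead of guessing.
|     const TOTAL_RE = /^Total:\s*\$([\d,]+\.\d{2})\s*$/m;
|
|     function extractTotal(text) {
|       const m = text.match(TOTAL_RE);
|       return m
|         ? { status: "ok", total: m[1] }
|         : { status: "manual_review" };
|     }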
| khaki54 wrote:
| Not sure I would want something non-deterministic in my data
| pipeline. Maybe if it used GenAI to _develop a ruleset_ that
| could then be deployed, it would be more practical.
| fredtalty5 wrote:
| Documind: Open-Source AI for Document Data Extraction
|
| If you're dealing with unstructured data trapped in PDFs,
| Documind might be the tool you've been waiting for. It's an open-
| source solution that simplifies the process of turning documents
| into clean, structured JSON data with the power of AI.
|
| Key Features:
|
| 1. Customizable Data Extraction: Define your own schema to
| extract exactly the information you need from PDFs--no
| unnecessary clutter.
|
| 2. Simple Input, Clean Output: Just provide a PDF link and
| your schema definition, and it returns structured JSON data,
| ready to integrate into your workflows.
|
| 3. Developer-Friendly: With a simple setup (`npm install
| documind`), you can get started right away and start
| automating tedious document processing tasks.
|
| Whether you're automating invoice processing, handling contracts,
| or working with any document-heavy workflows, Documind offers a
| lightweight, accessible solution. And since it's open-source, you
| can customize it further to suit your specific needs.
|
| Would love to hear if others in the community have tried it--how
| does it stack up for your use cases?
| avereveard wrote:
| > an interesting open source project
|
| enthusiastically setting up a lounge chair
|
| > OPENAI_API_KEY=your_openai_api_key
|
| carrying it back apathetically
| Tammilore wrote:
| Thanks for the laugh and your feedback! I know that depending
| on OpenAI isn't ideal for everyone. I'm considering ways to
| make it more self-contained in the future, so it's great to
| hear what users are looking for.
| avereveard wrote:
| litellm would be a start: you just pass in a model string
| that includes the provider and can default to OpenAI GPTs.
| That removes most of the effort in adapting stuff, both for
| you and for other users.
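|
| For the JS side, one way that could look (untested; assumes a
| locally running LiteLLM proxy, which speaks the OpenAI wire
| format and routes provider-prefixed model strings):
|
|     import OpenAI from "openai";
|
|     const client = new OpenAI({
|       baseURL: "http://localhost:4000", // LiteLLM proxy
|       apiKey: "anything", // proxy holds real provider keys
|     });
|
|     await client.chat.completions.create({
|       model: "ollama/llama3", // or "gpt-4o" to stay on OpenAI
|       messages: [{ role: "user", content: "..." }],
|     });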
| gibsonf1 wrote:
| I'm not sure having statistics-with-fabrication try to extract
| text from PDFs would result in any mission-critical, reliable
| data?
| eichi wrote:
|     const systemPrompt = `
|       Convert the following PDF page to markdown.
|       Return only the markdown with no explanation text.
|       Do not include deliminators like '''markdown.
|       You must include all information on the page.
|       Do not exclude headers, footers, or subtext.
|     `;
| thor-rodrigues wrote:
| Very nice tool! Just last week, I was working on extracting
| information from PDFs for an automation flow I'm building. I used
| Unstructured (https://unstructured.io/), which supports multiple
| file types, not just PDFs.
|
| However, my main issue is that I need to work with confidential
| client data that cannot be uploaded to a third party. Setting up
| the open-source, locally hosted version of Unstructured was quite
| cumbersome due to the numerous additional packages and
| installation steps required.
|
| While I'm open to the idea of parsing content with an LLM that
| has vision capabilities, data safety and confidentiality are
| critical for many applications. I think your project would go
| from good to great if it were possible to connect to Ollama
| and run locally.
|
| That said, this is an excellent application! I can definitely
| see myself using it in other projects that don't demand such
| stringent data confidentiality.
| Tammilore wrote:
| Thank you, I appreciate the feedback! I understand people
| wanting data confidentiality, and I'm considering an Ollama
| connection for future updates!
| ajith-joseph wrote:
| This looks like a promising tool for working with unstructured
| documents! A few questions come to mind:
|
| 1) Data Accuracy: How do you ensure the extracted data aligns
| perfectly with the source? Are there specific safeguards or
| confidence scoring mechanisms in place to flag potentially
| inaccurate extractions, or is this left entirely to manual
| review?
|
| 2) Customization and Flexibility: Many real-world scenarios
| involve highly specific schemas or even multi-step extraction
| workflows. Does Documind allow for layered or conditional parsing
| where fields depend on the values of others?
|
| 3) Local Hosting for Confidential Data: Data confidentiality is a
| big concern for many businesses (e.g., legal or financial
| industries). While it's great that Documind is open source, do
| you have any built-in provisions or guides for secure local
| hosting, especially in resource-constrained environments?
|
| Looking forward to seeing how this evolves--seems like a tool
| with great potential for streamlining document processing!
| asjfkdlf wrote:
| I am looking for a similar service that turns any document
| (PNG, PDF, DOCX) into JSON (preserving the field
| relationships). I tried with ChatGPT, but hallucinations are
| common. Does anything exist?
| omk wrote:
| This also uses OpenAI's GPT models, so the same hallucinations
| are likely here for PDFs.
| cccybernetic wrote:
| I built a drag-and-drop document converter that extracts text
| into custom columns (for CSV) or keys (for JSON). You can
| schedule it to run at certain times and update a database as
| well.
|
| I haven't had issues with hallucinations. If you're interested,
| my email is in my bio.
| hirezeeshan wrote:
| That's a valid problem you are solving. I had a similar use
| case that I solved using PDF[dot]co
| azinman2 wrote:
| Looking at the source it seems this is just a thin wrapper over
| OpenAI. Am I missing something?
| emmanueloga_ wrote:
| From the source, Documind appears to:
|
| 1) Install tools like Ghostscript, GraphicsMagick, and
| LibreOffice with a JS script.
|
| 2) Convert document pages to Base64 PNGs and send them to
| OpenAI for data extraction (a sketch of this step follows
| below).
|
| 3) Use Supabase for unclear reasons.
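|
| Step 2 in isolation looks roughly like this (illustrative
| reconstruction, not documind's actual code):
|
|     import OpenAI from "openai";
|     import { readFileSync } from "node:fs";
|
|     const client = new OpenAI();
|     const png = readFileSync("page-1.png").toString("base64");
|
|     await client.chat.completions.create({
|       model: "gpt-4o",
|       messages: [{
|         role: "user",
|         content: [
|           { type: "text",
|             text: "Extract { vendor, total } as JSON." },
|           { type: "image_url",
|             image_url: {
|               url: `data:image/png;base64,${png}` } },
|         ],
|       }],
|     });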
|
| Some issues with this approach:
|
| * OpenAI may retain and use your data for training, raising
| privacy concerns [1].
|
| * Dependencies should be managed with Docker or package managers
| like Nix or Pixi, which are more robust. Example: a tool like
| Parsr [2] provides a Dockerized pdf-to-json solution, complete
| with OCR support and an HTTP API.
|
| * GPT-4 vision seems like a costly, error-prone, and
| unreliable solution, not really suited for extracting data
| from sensitive docs like invoices without review.
|
| * Traditional methods (PDF parsers with OCR support) are cheaper,
| more reliable, and avoid retention risks for this particular use
| case. Although these tools do require some plumbing... probably
| LLMs can really help with that!
|
| While there are plenty of tools for structured data extraction, I
| think there's still room for a streamlined, all-in-one solution.
| This gap likely explains the abundance of closed-source
| commercial options tackling this very challenge.
|
| ---
|
| 1: https://platform.openai.com/docs/models#how-we-use-your-data
|
| 2: https://github.com/axa-group/Parsr
| groby_b wrote:
| That's not what [1] says, though? Quoth: "As of March 1,
| 2023, data sent to the OpenAI API will not be used to train
| or improve OpenAI models (unless you explicitly opt-in to
| share data with us, such as by providing feedback in the
| Playground)."
|
| "Traditional methods (PDF parsers with OCR support) are
| cheaper, more reliable"
|
| Not sure on the reliability - the ones I'm using all fail at
| structured data. You want a table extracted from a PDF, LLMs
| are your friend. (Recommendations welcome)
| niklasd wrote:
| We found that for extracting tables, OpenAI's LLMs aren't
| great. What is working well for us is Docling
| (https://github.com/DS4SD/docling/)
| brianjking wrote:
| OpenAI isn't retaining data you send via the API for training
| purposes. Stop.
| infecto wrote:
| Multimodal LLMs are not the way to do this for a business
| workflow yet.
|
| In my experience you're much better off starting with Azure
| Doc Intelligence or AWS Textract to first get the structure
| of the document (PDF). These tools are incredibly robust and
| do a great job with most of the common cases you can throw at
| them. From there you can use an LLM to interrogate and
| structure the data to your heart's delight.
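|
| Rough sketch of that two-step flow with the AWS SDK for JS
| (Textract call only; the region, file, and target schema are
| placeholders):
|
|     import { readFileSync } from "node:fs";
|     import {
|       TextractClient,
|       AnalyzeDocumentCommand,
|     } from "@aws-sdk/client-textract";
|
|     const textract = new TextractClient({
|       region: "us-east-1",
|     });
|
|     // Step 1: recover layout, tables, and key-value pairs.
|     const out = await textract.send(
|       new AnalyzeDocumentCommand({
|         Document: { Bytes: readFileSync("invoice-page-1.png") },
|         FeatureTypes: ["TABLES", "FORMS"],
|       })
|     );
|
|     // Step 2: render out.Blocks compactly and hand that text
|     // to an LLM to map onto your schema.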
| IndieCoder wrote:
| Plus one, I'm using the exact same setup to make it scale. If
| Azure Doc Intelligence gets too expensive, VLMs also work
| great.
| vinothgopi wrote:
| What is a VLM?
| saharhash wrote:
| Vision Language Model, like Qwen VL
| https://github.com/QwenLM/Qwen2-VL or ColPali
| https://huggingface.co/blog/manu/colpali
| disgruntledphd2 wrote:
| > AWS Textract to first get the structure of the document
| (PDF). These tools are incredibly robust and do a great job
| with most of the common cases you can throw at it.
|
| Do they work for Bills of Lading yet? When I tested a sample of
| these bills a few years back (2022 I think), the results were
| not good at all. But I honestly wouldn't be surprised if they'd
| massively improved lately.
| constantinum wrote:
| Reading from the comments, some of the common questions regarding
| document extraction are:
|
| * Run locally or on premise for security/privacy reasons
|
| * Support multiple LLMs and vector DBs - plug and play
|
| * Support customisable schemas
|
| * Method to check/confirm accuracy with source
|
| * Cron jobs for automation
|
| There is Unstract, which meets the above requirements:
|
| https://github.com/Zipstack/unstract
| vr46 wrote:
| I'll have to test this against my local Python pipeline which
| does all this without an LLM in attendance. There are a ton of
| existing Python libraries which have been doing this for a long
| time, so let's take a look...
| thegabriele wrote:
| Care to share the best ones for some use cases? Thanks
| vr46 wrote:
| MinerU
|
| PDFQuery
|
| PyMuPDF (having more success with older versions, right now)
___________________________________________________________________
(page generated 2024-11-18 23:01 UTC)