https://github.com/DocumindHQ/documind
DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI. documind.xyz

Documind

Documind is an advanced document processing tool that leverages AI to extract structured data from PDFs. It is built to handle PDF conversions, extract relevant information, and format results as specified by customizable schemas.

Features

* Converts PDFs to images for detailed AI processing.
* Uses OpenAI's API to extract and structure information.
* Allows users to specify extraction schemas for various document formats.
* Designed for flexible deployment on local or cloud environments.

Try the Hosted Version

A demo of the documind hosted version will be available soon for you to try out! The hosted version provides a seamless experience with fully managed APIs, so you can skip the setup and start extracting data right away. For full access to the hosted service, please request access and we'll get you set up.

Requirements

Before using documind, ensure the following software dependencies are installed:

System Dependencies

* Ghostscript: documind relies on Ghostscript for handling certain PDF operations.
* GraphicsMagick: Required for image processing within document conversions.

Install both on your system before proceeding:

    # On macOS
    brew install ghostscript graphicsmagick

    # On Debian/Ubuntu
    sudo apt-get update
    sudo apt-get install -y ghostscript graphicsmagick

Node.js & NPM

Ensure Node.js (v18+) and NPM are installed on your system.

Installation

You can install documind via npm:

    npm install documind

Environment Setup

documind requires an .env file to store sensitive information like API keys and Supabase configurations. Create an .env file in your project directory and add the following:

    OPENAI_API_KEY=your_openai_api_key
    SUPABASE_URL=your_supabase_url
    SUPABASE_KEY=your_supabase_key
    SUPABASE_BUCKET=your_supabase_bucket_name

Usage

Basic Example

First, import documind and define your schema. The schema outlines what information documind should look for in each document. Here's a quick setup to get started.

1. Define a Schema

The schema is an array of objects where each object defines:

* name: Field name to extract.
* type: Data type (e.g., "string", "number", "array", "object").
* description: Description of the field.
* children (optional): For arrays and objects, define nested fields.

Example schema for a bank statement:

    const schema = [
      {
        name: "accountNumber",
        type: "string",
        description: "The account number of the bank statement."
      },
      {
        name: "openingBalance",
        type: "number",
        description: "The opening balance of the account."
      },
      {
        name: "transactions",
        type: "array",
        description: "List of transactions in the account.",
        children: [
          {
            name: "date",
            type: "string",
            description: "Transaction date."
          },
          {
            name: "creditAmount",
            type: "number",
            description: "Credit Amount of the transaction."
          },
          {
            name: "debitAmount",
            type: "number",
            description: "Debit Amount of the transaction."
          },
          {
            name: "description",
            type: "string",
            description: "Transaction description."
          }
        ]
      },
      {
        name: "closingBalance",
        type: "number",
        description: "The closing balance of the account."
      }
    ];

2.
Run documind

Use documind to process a PDF by passing the file URL and the schema.

    import { extract } from 'documind';

    // `schema` is the array defined in step 1.
    const runExtraction = async () => {
      const result = await extract({
        file: 'https://bank_statement.pdf',
        schema
      });
      console.log("Extracted Data:", result);
    };

    runExtraction();

Example Output

Here's an example of what the extracted result might look like:

    {
      "success": true,
      "pages": 1,
      "data": {
        "accountNumber": "100002345",
        "openingBalance": 3200,
        "transactions": [
          {
            "date": "2021-05-12",
            "creditAmount": null,
            "debitAmount": 100,
            "description": "transfer to Tom"
          },
          {
            "date": "2021-05-12",
            "creditAmount": 50,
            "debitAmount": null,
            "description": "For lunch the other day"
          },
          {
            "date": "2021-05-13",
            "creditAmount": 20,
            "debitAmount": null,
            "description": "Refund for voucher"
          },
          {
            "date": "2021-05-13",
            "creditAmount": null,
            "debitAmount": 750,
            "description": "May's rent"
          }
        ],
        "closingBalance": 2420
      },
      "fileName": "bank_statement.pdf"
    }

Contributing

Contributions are welcome! Please submit a pull request with any improvements or features.

License

This project is licensed under the AGPL v3.0 License.

Releases: v1.0.7 (latest), Nov 17, 2024
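
The field rules listed under "Define a Schema" (each field has a name, type, and description, with optional children for arrays and objects) can be checked before a schema is passed to extract. The sketch below is a hypothetical helper, not part of documind: validateSchema and ASSUMED_TYPES are names invented here. It assumes that only the four types named in the README are valid and that name, type, and description are all required, which the library itself may not enforce.

```javascript
// Hypothetical pre-flight check, NOT part of the documind API.
// Assumption: the four types named in the README are the only valid ones.
const ASSUMED_TYPES = ["string", "number", "array", "object"];

// Returns a list of human-readable problems; an empty list means the
// schema matches the structure described in "Define a Schema".
function validateSchema(schema, path = "schema", allowedTypes = ASSUMED_TYPES) {
  if (!Array.isArray(schema)) {
    return [`${path} must be an array of field objects`];
  }
  const errors = [];
  schema.forEach((field, i) => {
    const here = `${path}[${i}]`;
    if (field === null || typeof field !== "object") {
      errors.push(`${here} must be an object`);
      return;
    }
    // Assumption: name, type, and description are all required strings.
    for (const key of ["name", "type", "description"]) {
      if (typeof field[key] !== "string" || field[key].trim() === "") {
        errors.push(`${here}.${key} must be a non-empty string`);
      }
    }
    if (typeof field.type === "string" && !allowedTypes.includes(field.type)) {
      errors.push(`${here}.type "${field.type}" is not one of: ${allowedTypes.join(", ")}`);
    }
    // children is optional and only described for arrays and objects.
    if (field.children !== undefined) {
      if (field.type !== "array" && field.type !== "object") {
        errors.push(`${here}.children is only expected on "array" or "object" fields`);
      }
      errors.push(...validateSchema(field.children, `${here}.children`, allowedTypes));
    }
  });
  return errors;
}

// Usage: a shortened version of the bank statement schema.
const sampleSchema = [
  { name: "accountNumber", type: "string", description: "The account number of the bank statement." },
  {
    name: "transactions",
    type: "array",
    description: "List of transactions in the account.",
    children: [
      { name: "date", type: "string", description: "Transaction date." }
    ]
  }
];

const problems = validateSchema(sampleSchema);
console.log(problems.length === 0 ? "schema looks well-formed" : problems.join("\n"));
```

Running a check like this first can surface a misspelled type or a missing description locally, before any PDF is sent for processing.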