https://github.com/raghavan/PdfGptIndexer

raghavan / PdfGptIndexer (Public; 97 stars, 1 fork)

An efficient tool for indexing and searching PDF text data using OpenAI's GPT-2 model and FAISS (Facebook AI Similarity Search) index, designed for rapid information retrieval and superior search accuracy.
License: MIT license. Latest commit: e8a1076, "Update README.md", by raghavan, Jul 8, 2023. 6 commits, 1 branch, 0 tags.

Files (name, latest commit message, commit time):

* pdf: Add pdf gpt indexer and query engine, July 7, 2023 15:05
* text: Add pdf gpt indexer and query engine, July 7, 2023 15:05
* LICENSE: Create LICENSE, July 8, 2023 16:49
* README.md: Update README.md, July 8, 2023 17:05
* pdf_gpt_indexer.py: Add pdf gpt indexer and query engine, July 7, 2023 15:05

README.md

PdfGptIndexer

Description

PdfGptIndexer is an efficient tool for indexing and searching PDF text data using OpenAI's GPT-2 model and FAISS (Facebook AI Similarity Search). This software is designed for rapid information retrieval and superior search accuracy.

Libraries Used

1. Textract - A Python library for extracting text from any document.
2. Transformers - A library by Hugging Face providing state-of-the-art general-purpose architectures for Natural Language Understanding (NLU) and Natural Language Generation (NLG).
3. LangChain - A framework for building applications on top of language models, used here for text processing and embeddings.
4. FAISS (Facebook AI Similarity Search) - A library for efficient similarity search and clustering of dense vectors.

Installing Dependencies

You can install all dependencies by running the following command:

    pip install langchain openai textract transformers faiss-cpu

How It Works

The PdfGptIndexer operates in several stages:

1. It first processes a specified folder of PDF documents, extracting the text and splitting it into manageable chunks using a GPT-2 tokenizer from the Transformers library.
2. Each text chunk is then embedded using the OpenAI GPT-2 model through the LangChain library.
3. These embeddings are stored in a FAISS index, providing a compact and efficient storage method.
4. Finally, a query interface allows you to retrieve relevant information from the indexed data by asking questions.
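
The stages listed under "How It Works" can be sketched end to end in plain Python. This is only an illustration under loud assumptions, not the code in pdf_gpt_indexer.py: the text is assumed to be already extracted, chunks are split by whitespace word count rather than by GPT-2 tokens, `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, and a brute-force cosine-similarity scan stands in for the FAISS index. Every function name below is hypothetical.

```python
import hashlib
import math
import re

DIM = 256  # size of the toy embedding vectors (arbitrary choice)


def split_into_chunks(text, max_words=40, overlap=8):
    """Split text into overlapping word windows (stand-in for token-based splitting)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the current window already reached the end of the text
    return chunks


def toy_embed(text):
    """Hashed bag-of-words vector, L2-normalised. NOT a real language-model embedding."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(chunks):
    """Embed every chunk once and keep (vector, chunk) pairs (stand-in for a FAISS index)."""
    return [(toy_embed(chunk), chunk) for chunk in chunks]


def search(index, question, k=1):
    """Return the k chunks whose vectors have the highest cosine similarity to the question."""
    q = toy_embed(question)
    scored = [(sum(a * b for a, b in zip(q, vec)), chunk) for vec, chunk in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]


if __name__ == "__main__":
    index = build_index([
        "FAISS stores dense vectors for similarity search",
        "Textract extracts text from PDF documents",
        "The query loop exits when the user types exit",
    ])
    print(search(index, "How is text extracted from PDF files?"))
```

The index is built once and then only queried, which mirrors the split between the indexing stages and the query interface; swapping `toy_embed` for a real embedding model and the list scan for a vector index would not change that overall shape.
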
The application fetches and displays the most relevant text chunk.

[Image: Untitled-2023-06-16-1537]

Advantages of Storing Embeddings Locally

Storing embeddings locally provides several advantages:

1. Speed: Once the embeddings are stored, retrieval of data is significantly faster as there's no need to compute embeddings in real-time.
2. Offline access: After the initial embedding creation, the data can be accessed offline.
3. Compute Savings: You only need to compute the embeddings once and reuse them, saving computational resources.
4. Scalability: This makes it feasible to work with large datasets that would be otherwise difficult to process in real-time.

Running the Program

To run the program, you should:

1. Make sure you have installed all dependencies.
2. Clone the repository to your local machine.
3. Navigate to the directory containing the Python script.
4. Replace "" with your actual OpenAI API key in the script.
5. Finally, run the script with Python.

    python3 pdf_gpt_indexer.py

Please ensure that the folders specified in the script for PDF documents and the output text files exist and are accessible. The query interface will start after the embeddings are computed and stored. You can exit the query interface by typing 'exit'.

Exploring Custom Data with ChatGPT

Check out the post here for a comprehensive guide on how to utilize ChatGPT with your own custom data.

About

Repository details: 2 watchers; no releases or packages published; languages: Python 100.0%.
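
The compute-once, reuse-later idea from "Advantages of Storing Embeddings Locally" can be sketched as a small on-disk cache. This is a minimal illustration under stated assumptions, not how the script persists its index: a JSON file stands in for a saved FAISS index, the `embed` callable is supplied by the caller, and `load_or_build_embeddings` is a hypothetical name.

```python
import json
import os
import tempfile


def load_or_build_embeddings(chunks, embed, cache_path):
    """Return {chunk: vector}, calling embed() only for chunks missing from the cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as fh:
            cache = json.load(fh)

    dirty = False
    for chunk in chunks:
        if chunk not in cache:
            cache[chunk] = embed(chunk)  # the expensive step, done once per chunk
            dirty = True

    if dirty:
        with open(cache_path, "w", encoding="utf-8") as fh:
            json.dump(cache, fh)

    return {chunk: cache[chunk] for chunk in chunks}


if __name__ == "__main__":
    calls = []

    def fake_embed(text):
        # Placeholder embedding that records how often it is invoked.
        calls.append(text)
        return [float(len(text)), float(text.count(" "))]

    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "embeddings.json")
        chunks = ["first chunk", "second chunk"]
        load_or_build_embeddings(chunks, fake_embed, path)
        load_or_build_embeddings(chunks, fake_embed, path)
        print(len(calls))  # prints 2: the second call is served from the cache
```

Keying the cache by chunk text means a second run over unchanged documents makes no embedding calls at all, which is where the speed and compute savings described in that section come from.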