https://github.com/raghavan/PdfGptIndexer

raghavan / PdfGptIndexer Public


An efficient tool for indexing and searching PDF text data using
OpenAI's GPT-2 model and FAISS (Facebook AI Similarity Search) index,
designed for rapid information retrieval and superior search
accuracy.

License

MIT license
Latest commit: e8a1076, "Update README.md", by raghavan on Jul 8, 2023

Git stats

  * 6 commits

Files

Name                 Latest commit message                   Commit time
pdf                  Add pdf gpt indexer and query engine    July 7, 2023 15:05
text                 Add pdf gpt indexer and query engine    July 7, 2023 15:05
LICENSE              Create LICENSE                          July 8, 2023 16:49
README.md            Update README.md                        July 8, 2023 17:05
pdf_gpt_indexer.py   Add pdf gpt indexer and query engine    July 7, 2023 15:05

README.md

 PdfGptIndexer

 Description

PdfGptIndexer indexes and searches text extracted from PDF files using
OpenAI's GPT-2 model and FAISS (Facebook AI Similarity Search). It is
designed to quickly retrieve the passages most relevant to a query.

 Libraries Used

 1. Textract - A Python library for extracting text from many
    document formats, including PDF.
 2. Transformers - A library by Hugging Face providing
    state-of-the-art general-purpose architectures for Natural
    Language Understanding (NLU) and Natural Language Generation
    (NLG). It supplies the GPT-2 tokenizer used for chunking.
 3. LangChain - A framework for building applications on top of
    language models, used here for text processing and embeddings.
 4. FAISS (Facebook AI Similarity Search) - A library for efficient
    similarity search and clustering of dense vectors.

 Installing Dependencies

You can install all dependencies by running the following command:

pip install langchain openai textract transformers faiss-cpu
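
To confirm the install worked before running the script, a quick import check can help. This is a minimal sketch and not part of the repository: `missing_modules` is a hypothetical helper, and the module list assumes the packages are imported under their usual names (the `faiss-cpu` package is imported as `faiss`).

```python
import importlib.util

# Import names for the packages installed above.
# Note: the pip package "faiss-cpu" is imported as "faiss".
REQUIRED = ["langchain", "openai", "textract", "transformers", "faiss"]


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

If anything is reported missing, re-run the pip command in the same Python environment you use to run the script.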

 How It Works

The PdfGptIndexer operates in several stages:

 1. It first processes a specified folder of PDF documents,
    extracting the text and splitting it into manageable chunks using
    a GPT-2 tokenizer from the Transformers library.
 2. Each text chunk is then embedded using the OpenAI GPT-2 model
    through the LangChain library.
 3. These embeddings are stored in a FAISS index, providing a compact
    and efficient storage method.
 4. Finally, a query interface allows you to retrieve relevant
    information from the indexed data by asking questions. The
    application fetches and displays the most relevant text chunk.
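
The four stages above can be sketched end to end with the standard library alone. This is an illustration of the general technique, not the repository's code, and every component is a labeled stand-in: whitespace-separated words replace GPT-2 tokens, a normalized bag-of-words vector replaces the model embedding, and a brute-force scan (`ToyIndex`, a hypothetical class) replaces FAISS. It runs without an API key.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, max_tokens=8, overlap=2):
    """Split text into chunks of at most max_tokens tokens, with overlap.

    Stand-in: tokens are whitespace-separated words, not GPT-2 tokens.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks


def embed(text):
    """Stand-in embedding: a unit-normalized bag-of-words vector (sparse dict)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def similarity(a, b):
    """Cosine similarity of two unit-normalized sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class ToyIndex:
    """Brute-force nearest-neighbor index, standing in for FAISS."""

    def __init__(self):
        self.entries = []  # list of (embedding, chunk) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.entries.append((embed(chunk), chunk))

    def search(self, query, k=1):
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: similarity(q, e[0]), reverse=True)
        return [chunk for _, chunk in ranked[:k]]


text = ("FAISS stores dense vectors for similarity search. "
        "Textract extracts plain text from PDF files. "
        "The tokenizer splits long text into small chunks.")
index = ToyIndex()
index.add(split_into_chunks(text, max_tokens=8, overlap=2))
print(index.search("How is text extracted from a PDF?"))
```

The overlap between consecutive chunks is a common design choice: it reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.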


 Advantages of Storing Embeddings Locally

Storing embeddings locally provides several advantages:

 1. Speed: Once the embeddings are stored, retrieval of data is
    significantly faster as there's no need to compute embeddings in
    real-time.
 2. Offline access: After the initial embedding creation, the data
    can be accessed offline.
 3. Compute Savings: You only need to compute the embeddings once and
    reuse them, saving computational resources.
 4. Scalability: This makes it feasible to work with large datasets
    that would be otherwise difficult to process in real-time.
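
The compute-once, reuse-later idea can be sketched as a small on-disk cache. This is a minimal sketch under stated assumptions: a JSON file stands in for the FAISS index the tool actually uses, and `load_or_compute_embeddings` and `fake_embed` are hypothetical names, with `fake_embed` standing in for an expensive embedding call.

```python
import json
import os
import tempfile


def load_or_compute_embeddings(chunks, embed_fn, cache_path):
    """Return {chunk: vector}, computing only chunks missing from the cache file."""
    cache = {}
    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)
    changed = False
    for chunk in chunks:
        if chunk not in cache:
            cache[chunk] = embed_fn(chunk)
            changed = True
    if changed:
        # Persist the updated cache so later runs can skip the computation.
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)
    return {chunk: cache[chunk] for chunk in chunks}


calls = []


def fake_embed(text):
    """Placeholder for an expensive embedding call; records each invocation."""
    calls.append(text)
    return [float(len(text)), float(text.count(" "))]


cache_file = os.path.join(tempfile.mkdtemp(), "embeddings.json")
chunks = ["first chunk", "second chunk"]
load_or_compute_embeddings(chunks, fake_embed, cache_file)  # computes both
load_or_compute_embeddings(chunks, fake_embed, cache_file)  # served from disk
print(len(calls))  # 2: the second call recomputed nothing
```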

 Running the Program

To run the program, you should:

 1. Make sure you have installed all dependencies.
 2. Clone the repository to your local machine.
 3. Navigate to the directory containing the Python script.
 4. Replace "<OPENAI_API_KEY>" with your actual OpenAI API key in the
    script.
 5. Finally, run the script with Python.

python3 pdf_gpt_indexer.py
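
As an optional alternative to pasting the key into the script in step 4, the key can be read from the environment. This is a sketch, not what the repository does: `get_api_key` is a hypothetical helper, and it assumes you export the key as `OPENAI_API_KEY`, the variable name the `openai` package looks for by default.

```python
import os


def get_api_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise a clear error."""
    key = env.get("OPENAI_API_KEY", "").strip()
    # Reject an unset variable as well as the unreplaced placeholder.
    if not key or key == "<OPENAI_API_KEY>":
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return key
```

Keeping the key out of the source file makes it harder to commit it to version control by accident.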

Please ensure that the folders specified in the script for PDF
documents and the output text files exist and are accessible. The
query interface will start after the embeddings are computed and
stored. You can exit the query interface by typing 'exit'.
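
The folder check and the exit-on-'exit' loop described above might look like the following. This is a sketch of the pattern only: `ensure_folders` and `query_loop` are hypothetical helpers, the folder names `pdf` and `text` are assumed from the repository listing, and `answer_fn` stands in for whatever function looks up the answer in the index.

```python
import os
import tempfile


def ensure_folders(*paths):
    """Create each folder if it does not already exist."""
    for path in paths:
        os.makedirs(path, exist_ok=True)


def query_loop(answer_fn, read_line=input, write_line=print):
    """Answer questions until the user types 'exit'; return how many were answered."""
    answered = 0
    while True:
        question = read_line("Ask a question (or type 'exit'): ").strip()
        if question.lower() == "exit":
            break
        if not question:
            continue  # ignore empty input and prompt again
        write_line(answer_fn(question))
        answered += 1
    return answered


# Demo in a temporary directory with scripted input instead of a live prompt.
base = tempfile.mkdtemp()
ensure_folders(os.path.join(base, "pdf"), os.path.join(base, "text"))
scripted = iter(["What is FAISS?", "exit"])
query_loop(lambda q: f"(answer to: {q})", read_line=lambda prompt: next(scripted))
```

Passing `read_line` and `write_line` as parameters keeps the loop testable; in interactive use the defaults `input` and `print` apply.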

 Exploring Custom Data with ChatGPT

Check out the post here for a comprehensive guide on how to utilize
ChatGPT with your own custom data.
