[HN Gopher] Ask HN: I have many PDFs - what is the best local wa...
___________________________________________________________________
Ask HN: I have many PDFs - what is the best local way to leverage
AI for search?
As the title says, I have many PDFs - mostly scans via Scansnap -
but also non-scans. These are sensitive in nature, e.g. bills,
documents, etc. I would like a local-first AI solution that allows
me to say things like: "show me all tax documents for August 2023"
or "show my home title". Ideally it is Mac software that can access
iCloud too, since that's where I store it all. I would prefer not to
do any tagging. I would like to optimize for recall over precision,
so false positives in the search results are OK. What are modern
approaches to doing this, without hacking one up on my own?
Author : phodo
Score : 39 points
Date : 2024-05-30 20:24 UTC (2 hours ago)
| adyashakti wrote:
| getcody.ai
| borg16 wrote:
| the op wanted a local method, and this does not seem to be
| local
| dudus wrote:
| I tried Google's NotebookLM for this use case and was very
| pleased with the experience.
|
| If you trust Google that is.
| hobo_mark wrote:
| NotebookLM is currently US only, limited to 20 documents
| (sorry, 'sources') per notebook, and only works with Google
| Drive.
| bendsawyer wrote:
| Not offline. There is some data I cannot trust anyone else with,
| because I have contractually promised not to share it.
| gibsonf1 wrote:
| https://graphmetrix.com/trinpod-server
| finack wrote:
| OCR and pattern matching on text are computationally cheap and
| incredibly easy to do. For example, tax documents often bear the
| name of your government's tax authority, which presumably you are
| familiar with and can search for. They also tend to have years on
| them.
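| A rough sketch of that approach in Python (paths and keywords here
| are placeholders; it assumes the scans already have a text layer,
| e.g. from ocrmypdf, and uses pypdf to read it):
|
|     import re
|     from pathlib import Path
|
|     from pypdf import PdfReader  # pip install pypdf
|
|     # Example patterns only - substitute your own tax authority.
|     TAX = re.compile(r"internal revenue service|form 1040", re.I)
|     YEAR = re.compile(r"\b20\d{2}\b")
|
|     for pdf in Path("~/Documents/scans").expanduser().glob("*.pdf"):
|         pages = PdfReader(str(pdf)).pages
|         text = "\n".join(p.extract_text() or "" for p in pages)
|         if TAX.search(text):
|             years = sorted(set(YEAR.findall(text)))
|             print(f"{pdf.name}: probable tax document, years {years}")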
| m0shen wrote:
| Paperless supports OCR + full text indexing:
| https://docs.paperless-ngx.com/
|
| As far as AI goes, not sure.
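| A minimal sketch of hitting the Paperless-ngx full-text index over
| its REST API (host, token, and the exact endpoint/parameter names
| are assumptions - check the docs linked above):
|
|     import requests  # pip install requests
|
|     BASE = "http://localhost:8000"        # assumed local instance
|     HEADERS = {"Authorization": "Token YOUR_API_TOKEN"}
|
|     resp = requests.get(f"{BASE}/api/documents/",
|                         params={"query": "tax 2023"}, headers=HEADERS)
|     resp.raise_for_status()
|     for doc in resp.json()["results"]:
|         print(doc["id"], doc["title"])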
| elrostelperien wrote:
| For macOS, there's this: https://pdfsearch.app/
|
| Without AI, but searching the PDF content, I use Recoll
| (https://www.recoll.org/) or ripgrep-all
| (https://github.com/phiresky/ripgrep-all)
| 1123581321 wrote:
| Devonthink would do this with a tiny model to translate your
| natural-language search prompts into its syntax and your folder/tag
| tree.
|
| If you're okay with some false positives, Devonthink would work
| as is, actually.
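| A minimal sketch of that "small model rewrites the prompt into
| search syntax" idea (generic boolean syntax rather than
| DEVONthink's exact operators; assumes a local Ollama daemon with a
| small model pulled):
|
|     import ollama  # pip install ollama
|
|     SYSTEM = ("Rewrite the user's request as a boolean full-text "
|               "search query using AND, OR and quoted phrases. "
|               "Return only the query.")
|
|     def to_search_query(prompt: str) -> str:
|         resp = ollama.chat(model="phi3:mini", messages=[
|             {"role": "system", "content": SYSTEM},
|             {"role": "user", "content": prompt},
|         ])
|         return resp["message"]["content"].strip()
|
|     print(to_search_query("show me all tax documents for August 2023"))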
| bendsawyer wrote:
| I used to use this, but the LLM approach allows for much deeper
| interactions. Not "find all times I've typed X" but
|
| "act as an expert in Y, looking across all times I've typed X,
| summarize my changing position over the years, and suggest
| other terms that have a similar pattern of change, in a list."
|
| The kind of thing I used to give to an intern over a month,
| with results that are not far off what that intern produced...
| bendsawyer wrote:
| I looked into this for sensitive material recently. In the end I
| got a purpose-built local system built and am having it remotely
| maintained. Cost: around 5k a year. I used
| http://www.skunkwerx.ai, who are US based.
|
| The result is a huge step up from 'full text search' solutions,
| for my use case. I can have conversations with decades of
| documents, and it's incredibly helpful. The support scheme keeps
| my original documents unconnected from the machine, which I own,
| while updates are done over a remote link. It's great, and I feel
| safe.
|
| Things change so fast in this space that there did not seem to be
| a cheap, stable, local alternative. I honestly doubt one is
| coming. This is not a one-size-fits-all problem.
| Kikawala wrote:
| Quivr: https://github.com/QuivrHQ/quivr
|
| SecureAI-Tools: https://github.com/SecureAI-Tools/SecureAI-Tools
| pierre wrote:
| The RAG CLI from LlamaIndex allows you to do it 100% locally when
| used with Ollama or llama.cpp instead of OpenAI.
|
| https://docs.llamaindex.ai/en/stable/getting_started/starter...
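| For reference, a minimal local setup along those lines (assumes
| llama-index >= 0.10 with the Ollama and HuggingFace extras
| installed; model names and the folder path are placeholders):
|
|     from llama_index.core import (Settings, SimpleDirectoryReader,
|                                   VectorStoreIndex)
|     from llama_index.embeddings.huggingface import HuggingFaceEmbedding
|     from llama_index.llms.ollama import Ollama
|
|     # Local LLM via Ollama plus a local embedding model - no OpenAI.
|     Settings.llm = Ollama(model="llama3", request_timeout=120.0)
|     Settings.embed_model = HuggingFaceEmbedding(
|         model_name="BAAI/bge-small-en-v1.5")
|
|     docs = SimpleDirectoryReader("pdfs").load_data()  # folder of PDFs
|     index = VectorStoreIndex.from_documents(docs)
|     engine = index.as_query_engine()
|     print(engine.query("show me all tax documents for August 2023"))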
| homarp wrote:
| and at some point
| (https://github.com/ggerganov/llama.cpp/issues/7444) you will
| be able to use Phi-3-vision
| https://huggingface.co/microsoft/Phi-3-vision-128k-instruct
|
| but for now you will have to use python.
|
| You can try it here
| https://ai.azure.com/explore/models/Phi-3-vision-128k-instru...
| to get an idea of its OCR + QA abilities
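| For reference, the transformers-based usage looks roughly like
| this (adapted from the pattern on the model card; the image path
| is a placeholder, a GPU is effectively required, and argument
| details may vary by transformers version):
|
|     from PIL import Image
|     from transformers import AutoModelForCausalLM, AutoProcessor
|
|     model_id = "microsoft/Phi-3-vision-128k-instruct"
|     model = AutoModelForCausalLM.from_pretrained(
|         model_id, device_map="auto", torch_dtype="auto",
|         trust_remote_code=True)
|     processor = AutoProcessor.from_pretrained(model_id,
|                                               trust_remote_code=True)
|
|     messages = [{"role": "user",
|                  "content": "<|image_1|>\nWhat kind of document is "
|                             "this, and what date does it show?"}]
|     prompt = processor.tokenizer.apply_chat_template(
|         messages, tokenize=False, add_generation_prompt=True)
|     image = Image.open("scan_page_1.png")  # placeholder scan
|     inputs = processor(prompt, [image],
|                        return_tensors="pt").to(model.device)
|
|     out = model.generate(**inputs, max_new_tokens=200,
|                          eos_token_id=processor.tokenizer.eos_token_id)
|     out = out[:, inputs["input_ids"].shape[1]:]
|     print(processor.batch_decode(out, skip_special_tokens=True)[0])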
| yousnail wrote:
| PrivateGPT is a great starting point for using a local model and
| RAG. Text-generation-webui (oobabooga) with the superbooga V2
| extension is very nice and more customizable.
|
| I've used both for sensitive internal SOPs, and both work quite
| well. PrivateGPT excels at ingesting many separate documents;
| the other excels at customization. Both are totally offline, and
| can use mostly whatever models you want.
| edgyquant wrote:
| Use Python to dump the PDFs to text, then use Llama 3 (8B) to
| parse them.
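| A bare-bones sketch of that pipeline (library choices are
| assumptions - pypdf for the text layer and the ollama client for
| Llama 3 8B; scans with no text layer would still need OCR first):
|
|     from pathlib import Path
|
|     import ollama                # pip install ollama
|     from pypdf import PdfReader  # pip install pypdf
|
|     def pdf_text(path: Path) -> str:
|         reader = PdfReader(str(path))
|         return "\n".join(p.extract_text() or "" for p in reader.pages)
|
|     def label(text: str) -> str:
|         prompt = ("Label this personal document with a short "
|                   "category (e.g. tax, bill, home title) and any "
|                   "date you find:\n\n" + text[:4000])
|         resp = ollama.chat(model="llama3:8b", messages=[
|             {"role": "user", "content": prompt}])
|         return resp["message"]["content"]
|
|     for pdf in Path("~/Documents/scans").expanduser().glob("*.pdf"):
|         print(pdf.name, "->", label(pdf_text(pdf)))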
___________________________________________________________________
(page generated 2024-05-30 23:00 UTC)