[HN Gopher] Show HN: RAGstack - private ChatGPT for enterprise V...
___________________________________________________________________
Show HN: RAGstack - private ChatGPT for enterprise VPCs, built with
Llama 2
Hey Hacker News! We're the cofounders at Psychic.dev
(http://psychic.dev), where we help companies connect LLMs to
private data. With the launch of Llama 2, we think it's finally
viable to self-host an internal application that's on par with
ChatGPT, so we did exactly that and made it an open source project.
We also included a vector DB and API server so you can upload files
and connect Llama 2 to your own data.

The RAG in RAGstack stands for Retrieval Augmented Generation, a
technique where the capabilities of a large language model (LLM) are
augmented by retrieving information from other systems and inserting
it into the LLM's context window via a prompt. This gives LLMs
information beyond what was provided in their training data, which
is necessary for almost every enterprise application. Examples
include data from current web pages, data from SaaS apps like
Confluence or Salesforce, and data from documents like sales
contracts and PDFs.
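
Concretely, the retrieval half of RAG is only a few lines of code. A
minimal sketch, not RAGstack's actual implementation - the collection
name and payload fields are illustrative, and `llm` stands in for any
completion-style callable:

    # Minimal RAG loop: embed the question, retrieve similar chunks
    # from the vector DB, and stuff them into the prompt as context.
    from qdrant_client import QdrantClient
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    qdrant = QdrantClient("localhost", port=6333)

    def answer(question: str, llm) -> str:
        hits = qdrant.search(
            collection_name="documents",  # illustrative name
            query_vector=embedder.encode(question).tolist(),
            limit=4,
        )
        context = "\n\n".join(hit.payload["text"] for hit in hits)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
        )
        return llm(prompt)
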
RAG works better than fine-tuning the model because it's cheaper,
it's faster, and it's more reliable, since the provenance of
information is attached to each response.

While there are quite a few "chat with your data" apps at this
point, most have external dependencies on APIs like OpenAI or
Pinecone. RAGstack, on the other hand, only has open-source
dependencies and lets you run the entire stack locally or on your
cloud provider. This includes (a rough sketch of the ingestion path
follows this list):

- Containerizing LLMs like Falcon, Llama 2, and GPT4All with Truss
- Vector search with Qdrant
- File parsing and ingestion with LangChain, PyMuPDF, and
  Unstructured.io
- Cloud deployment with Terraform
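
Here is roughly what the ingestion side of that stack could look
like (a hedged sketch: the file name, collection name, and chunk
sizes are illustrative, and the Unstructured.io and Truss pieces are
omitted for brevity):

    # Illustrative pipeline: PDF -> text -> chunks -> vectors -> Qdrant.
    import fitz  # PyMuPDF
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    qdrant = QdrantClient("localhost", port=6333)
    qdrant.recreate_collection(
        collection_name="documents",  # illustrative name
        vectors_config=VectorParams(size=384, distance=Distance.COSINE),
    )

    # Extract text with PyMuPDF, then chunk it with LangChain.
    text = "".join(page.get_text() for page in fitz.open("contract.pdf"))
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=1000, chunk_overlap=100
    ).split_text(text)

    # Embed each chunk and index it for similarity search.
    qdrant.upsert(
        collection_name="documents",
        points=[
            PointStruct(id=i, vector=embedder.encode(c).tolist(),
                        payload={"text": c})
            for i, c in enumerate(chunks)
        ],
    )
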
If you want to dive into it yourself, we also published a couple of
tutorials on how to deploy open source LLMs for your organization,
and optionally give them access to internal documents without any
data ever leaving your VPC:

- How to deploy Llama 2 to Google Cloud (GCP):
  https://www.psychic.dev/post/how-to-deploy-llama-2-to-google...
- How to connect Llama 2 to your own data using RAGstack:
  https://www.psychic.dev/post/how-to-self-host-llama-2-and-co...

Let a thousand private corporate oracles bloom!
Author : ayanb9440
Score : 39 points
Date : 2023-07-20 17:11 UTC (5 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| rozap wrote:
| So this dumps the documents returned from the vector store into a
| prompt to the LLM. How does it work when there are many documents
| returned? What's the upper limit there?
| jasonwcfan wrote:
| Yep. We use LangChain's basic text splitter to chunk the
| documents and the QA chain to stuff them into the prompt. But
| AFAIK it doesn't check for context length, so that's a piece
| that's still missing.
|
| The upper limit depends on the model; Llama 2's context window
| is 4k tokens, including the prompt.
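|
| A minimal sketch of that missing check - trim retrieved chunks
| to a token budget before stuffing them into the prompt (the
| tokenizer choice and numbers here are illustrative, not what
| RAGstack ships):
|
|     # Keep adding chunks (assumed sorted by relevance) until
|     # the context budget is exhausted.
|     from transformers import AutoTokenizer
|
|     tokenizer = AutoTokenizer.from_pretrained(
|         "meta-llama/Llama-2-7b-hf")
|     MAX_CONTEXT = 4096  # Llama 2's context window, in tokens
|     RESERVED = 512      # headroom for the question + answer
|
|     def fit_chunks(chunks):
|         budget = MAX_CONTEXT - RESERVED
|         kept = []
|         for chunk in chunks:
|             n = len(tokenizer.encode(chunk))
|             if n > budget:
|                 break
|             kept.append(chunk)
|             budget -= n
|         return kept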
| neilv wrote:
| > _only has open-source dependencies and lets you run the entire
| stack locally or_
|
| Open source and on-prem are two different things. Llama 2 doesn't
| seem to be open source.
| mmastrac wrote:
| I don't think we've collectively figured out how to describe
| what "weights openly available" means, so open-source is
| probably a reasonable descriptor.
| wmf wrote:
| Maybe we could call it... Open Weights(tm).
| e12e wrote:
| More like Weights Available in the case of llama2 (and
| Bring Your Own Pirate Treasure in case of llama1?).
| neilv wrote:
| I disagree. Open source involves the "source" being
| available, not just the "compiled".
| jasonwcfan wrote:
| The concept of "source" is nebulous for ML models. If you
| have the weights you can recreate a model without access to
| the source code originally used to train it, and similarly
| just having the source code without the training data won't
| allow you to recreate the model.
|
| While it would be nice to have the data set Meta used, I
| think open-sourcing the weights is good enough.
| Q6T46nT668w6i3m wrote:
| No. The weights encode recorded parameters; they don't
| encode essential components like hyperparameters or
| modules without recorded parameters.
| jasonwcfan wrote:
| You're right. Either way, it's impossible to recreate
| Llama 2 without the data set, so perhaps "free to use
| model" is a better description than "open source model".
| neilv wrote:
| I think some marketers are using the term "open source" to
| try to ride on the goodwill and perceived benefits of open
| source, without actually doing it.
|
| Also, people who just want to be able to run something on
| their computer without paying money for it shouldn't call
| it "open source", unless it actually is.
|
| These distinctions have been going on for decades, for
| very good reasons. No need to throw away that progress
| now.
| e12e wrote:
| This looks like a great project. Given the costs, I imagine many
| might want to run on dedicated hardware with GPU - yet:
|
| > GPT4All: When you run locally, RAGstack will download and
| deploy Nomic AI's gpt4all model, which runs on consumer CPUs.
|
| > Falcon-7b: On the cloud, RAGstack deploys Technology Innovation
| Institute's falcon-7b model onto a GPU-enabled GKE cluster.
|
| > Llama 2: On the cloud, RAGstack can also deploy the 7B
| parameter version of Meta's Llama 2 model onto a GPU-enabled
| GKE cluster.
|
| Why not llama2 on dedicated/local hardware? Memory and download
| size requirements?
|
| Edit: After reading the linked tutorial, it looks like the
| built docker container will run fine on local/dedicated
| hardware?
|
| https://www.psychic.dev/post/how-to-deploy-llama-2-to-google...
| jasonwcfan wrote:
| Yep, the docker containers should run fine on local hardware,
| but the terraform config only supports GCP right now.
|
| In terms of cost - I just ran our deployed cluster through
| GCP's pricing calculator and it's about $300 USD per month.
| Definitely not cheap for individual use, but pretty affordable
| for enterprise use. Running the 40B parameter version will
| cost significantly more.
| shekhar101 wrote:
| - Do you have plans to support other connectors, specifically
|   OneDrive?
| - Do you have a demo somewhere? From the website and
|   screenshots, it's not clear what functionality you offer. A
|   few-minute-long screencast would help.
| - How do you differentiate yourselves from Quivr? It seems
|   like another open source alternative and has some nice
|   features.
|
| Thanks for this. I will try to use it and see how well it
| works for my use case.
| jasonwcfan wrote:
| We have about 10 other connectors in a separate project at
| https://github.com/psychic-api/psychic
|
| Thanks for the feedback! We'll include a demo soon.
| vertis wrote:
| Trying to run this locally: once I get past a few gotchas
| (local.env needing to be renamed to .env, and needing to `pip3
| install poetry`), I start getting back responses like
|
| "D<D,8H8,H<,,DH8DHH,,<<,DH<,<DHD<<,<<D,D,HD88<<H8<<D8D88,,8D,DH<,
| 8,D<D,D,D8,D8<D8H,DHH8,D8H<,8D,,H8DHD88DD8H8<,8,HD<8D<,8D,<<888D<
| H,8<HD<HHD<8<<D8DD<DD<HHHH,,DDD<<DHDH,88HDH8,8DHD<<,D8,<8<H8<8H<,
| ,<,,,D,88,<,<<8D,8<8,,H8,,D888D8<HD8<D,D8,<8<<H8D,,D<D,8<DD,<8"
|
| I'm sure I'm doing something wrong :)
| jasonwcfan wrote:
| Thanks for the callout! We'll add the local.env instructions to
| the readme.
|
| Are you using it with input docs or without? Locally it uses
| GPT4All, which isn't nearly as good as Llama or Falcon. I saw
| a project that packages Llama 2 in Docker, so we might use
| that instead!
| vertis wrote:
| I tried both. I'll certainly try it remotely as well. Was
| just pottering through HN before bed.
| Jayakumark wrote:
| Does it use OpenAI embeddings or other free ones?
| jasonwcfan wrote:
| It uses all-MiniLM-L6-v2 from Hugging Face by default
|
| https://huggingface.co/sentence-transformers/all-MiniLM-L6-v...
|
| You can also specify a different embeddings model from
| SentenceTransformers to use in /server/.env
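|
| For reference, swapping the embeddings model is a one-liner
| with SentenceTransformers (the model name below is an example,
| not RAGstack's default):
|
|     from sentence_transformers import SentenceTransformer
|
|     # Any SentenceTransformers model name can be dropped in here.
|     model = SentenceTransformer("all-mpnet-base-v2")
|     vectors = model.encode(["some chunk", "another chunk"])
|     print(vectors.shape)  # (2, 768); all-mpnet-base-v2 is 768-dim
|
| One caveat: the vector size changes with the model, so the
| vector DB collection has to be configured to match.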
| generalizations wrote:
| Is there a version of this set up to be cpu-only, as in something
| that can use ggml tech? I'd love to deploy this on some servers
| with lots of ram and cpu horsepower, but no gpus.
| jasonwcfan wrote:
| Not yet, but we can definitely add it. Created an issue:
| https://github.com/psychic-api/rag-stack/issues/2
|
| In the meantime it uses GPT4All when running locally, so you
| can technically deploy that as well, but it's not very good.
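|
| For anyone who wants CPU-only inference in the meantime,
| ggml-format models run through llama.cpp's Python bindings - a
| hedged sketch, with a hypothetical local model path:
|
|     from llama_cpp import Llama  # pip install llama-cpp-python
|
|     # The path is hypothetical; any ggml Llama 2 file works.
|     llm = Llama(model_path="./llama-2-7b-chat.ggmlv3.q4_0.bin",
|                 n_ctx=4096)
|     out = llm("Q: What does RAG stand for? A:", max_tokens=64)
|     print(out["choices"][0]["text"])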
| WillPostForFood wrote:
| Approximately, what would the hourly cost of running this be on
| Google Cloud?
|
| >In the default-pool > Nodes tab, set:
|
| >Machine Configuration from General Purpose to GPU
|
| >GPU type: Nvidia T4
|
| >Number of GPUs: 1
|
| >Enable GPU time sharing
|
| >Max shared clients per GPU: 8
|
| >Machine type: n1-standard-4
|
| >Boot disk size: 50 GB
|
| >Enable nodes on spot VMs
|
| Not familiar with GCP, but I see n1-standard-4's are an
| instance type that costs $0.19/hr. Are there any other
| significant costs to take into account?
| jasonwcfan wrote:
| Just ran our deployed cluster through GCP's pricing calculator
| and it's about $300 USD per month with Llama 2.
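|
| As a rough sanity check (GCP list prices are approximate and
| vary by region): an n1-standard-4 at ~$0.19/hr plus an Nvidia
| T4 at roughly $0.35/hr on-demand comes to about $0.54/hr, or
| ~$395 for a 730-hour month, before disk and the GKE cluster
| fee. Running the node pool on spot VMs is what pulls the total
| down toward that ~$300/month figure.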
___________________________________________________________________
(page generated 2023-07-20 23:02 UTC)