[HN Gopher] A Comprehensive Guide for Building Rag-Based LLM App...
       ___________________________________________________________________
        
       A Comprehensive Guide for Building Rag-Based LLM Applications
        
       Author : robertnishihara
       Score  : 160 points
       Date   : 2023-09-14 06:33 UTC (16 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | version_five wrote:
       | FWIW, having written a simple RAG system from "scratch" (meaning
       | not using frameworks or api calls), it's not more complicated
       | than doing it this way with langchain etc.
       | 
       | This post is mostly about plumbing. It's probably the right way
       | to do it if it needs to be scaled. But for learning, it obscures
       | what is essentially simple stuff going on behind the scenes.
        
         | jamesblonde wrote:
         | I tend to agree - i haven't seen the value in existing
         | "retriever" components in langchain and others.
        
           | haxton wrote:
           | My favorite example is the asana loader[0] for llama-index.
           | It's literally just the most basic wrapper around the Asana
           | SDK to concatenate some strings.
           | 
           | [0] - https://github.com/emptycrown/llama-
           | hub/blob/main/llama_hub/...
        
         | clharman wrote:
         | For serious implementations, frameworks are not very helpful,
         | even LangChain. All the components provide good SDKs/APIs,
         | having a bunch of "integrations" doesn't add any real value.
         | 
         | If you know what you want to build, building from scratch is
         | easier than you think. If you're tinkering on the weekend, then
         | maybe the frameworks are helpful.
        
           | lmeyerov wrote:
           | Yeah as soon as we write the word 'thread' or thinking about
           | LLM API concurrency control across many user requests, all
           | frameworks we tried are basically a wall instead of an
           | accelerator. For a single user demo video on Twitter or a
           | low-traffic streamlit POC to get a repo with lots of star
           | gazers, they work quite well, and that's not far from what
           | someone needs for an internal project with a small userbase.
           | Just once this is supposed to be infra for production-grade
           | software, the tools we have tried so far are still
           | prioritizing features over being a foundation.
        
         | iampims wrote:
         | Opportunity for you to write a blogpost about your approach :)
        
           | ajhai wrote:
           | We did write one just yesterday that talks about rags and
           | some techniques to improve their performance in production at
           | https://llmstack.ai/blog/retrieval-augmented-generation
        
       | bguberfain wrote:
       | What brings to my attention in this article is the section named
       | "Cold Start", where it generates questions based on a provided
       | context. I think it is a good way to cheaply generate an Q&A
       | dataset that can later be used to finetune a model. But the
       | problem is that it generates some questions and answers of bad
       | quality. All generated examples have issues: - "What is the
       | context discussing about?" - which context? - "The context does
       | not provide information on what Ray Tune is." - Not an answer -
       | "The context does not provide information on what external
       | library integrations are." - same as before I could only think of
       | manual review to remove these noise questions. Any ideas on how
       | to improve this QA generation? I've tried it before, but with
       | paltry results.
        
         | maxrmk wrote:
         | I recently quit my job to build specialized tooling in this
         | space. We're broadly focusing on eval in general, but are
         | starting with high quality question and answer generation for
         | testing these kinds of RAG pipelines. It's surprisingly hard!
        
           | resiros wrote:
           | Sounds very interesting. I am building an open-source LLM
           | building platform (agenta.ai) and looking for eval approaches
           | to integrate for our users. Do you have already a product/api
           | that we could use?
        
       | deanmoriarty wrote:
       | My question is: if I want to use LLM to help me sift through a
       | large amount of structured data, say for example all the logs for
       | a bunch of different applications from a certain cloud
       | environment, each with their own idiosyncrasies and specific
       | formats (many GBs of data), can the RAG pattern be useful here?
       | 
       | Some of my concerns:
       | 
       | 1) Is sentence embedding using an off-the-shelf embedding model
       | going to capture the "meaning" of my logs? My answer is "probably
       | not". For example, if a portion of my logs is in this format
       | timestamp_start,ClassName,FunctionName,timestamp_end
       | 
       | Will I be able to get meaningful embeddings that satisfy a query
       | such as "what components in my system exhibited an anomalously
       | high latency lately?" (this is just an example among many
       | different queries I'd have)
       | 
       | Based on the little I know, it seems to me off-the-shelf
       | embeddings wouldn't be able to match the embedding of my query
       | with the embeddings for the relevant log lines, given the
       | complexity of this task.
       | 
       | 2) Is it going to be even feasible (cost/performance-wise) to use
       | embeddings when one has a firehose of data coming through, or is
       | it better suited for a mostly-static corpus of data (e.g. your
       | typical corporate documentation or product catalog)?
       | 
       | I know that I can achieve something similar with a Code
       | Interpreter-like approach, so in theory I could build a multi-
       | step reasoning agent that starting from my query and the data
       | would try to (1) discover the schema and then (2) crunch the data
       | to try to get to my answer, but I don't know how scalable this
       | approach would effectively be.
        
         | thewataccount wrote:
         | Just to clarify - are you wanting the LLM itself to identify
         | what a "anomalous latency" would be based on the data itself?
         | If so then I don't think this will help you at all until we can
         | actually fit the log into the context.
         | 
         | What RAG here is doing is using embeddings and a vector store
         | to identify close pieces of information, for example "in this
         | django project add a textfield" will be very close to
         | documentation in the django docs that say "textfield", and it
         | will then add that to the prompt so the LLM has the relevant
         | docs in its context.
         | 
         | The problem is that you'll need a heuristic to identify at
         | least "potentially anomalous" and even then you'll still have
         | to make sure there's enough context for it to know "is this a
         | normal daily fluctuation".
         | 
         | A multi-step agent is definitely what you want, you could have
         | it build an SQL query itself, for example "was there any high
         | latency requests yesterday?" it may identify it should filter
         | the time, possibly design the query to determine what is
         | "high".
         | 
         | ---
         | 
         | At the moment I don't think it's well suited to identifying
         | when the "latency is abnormally high". However if you have some
         | other system/human identify heuristics to feed to the LLM, it
         | may then be able to do at least answer the query.
        
           | deanmoriarty wrote:
           | Yes, this clarifies well what is possible vs not.
           | 
           | I was trying to understand if there is an opportunity to
           | introduce some of this technology to solve "anomaly
           | detection" on large amount of structured data, where anomaly
           | might be an incredibly overloaded term (it might imply a
           | performance regression, a security issue, etc). That is a
           | business need I have today.
           | 
           | It seems that what is possible today is an assistant that can
           | aid a user to get to these answers faster (by, for instance,
           | suggesting a SQL query based on the schema, etc). Again,
           | roughly the equivalent of what Code Interpreter does, just
           | without the local environment limitations.
        
         | IKantRead wrote:
         | > can the RAG pattern be useful here?
         | 
         | From your questions it looks like you are only interesting in
         | the R part. RAG implies the retrieval step is then used to
         | augment a user prompt.
         | 
         | To answer 1, a good heuristic would be "can a human reasonably
         | familiar with the terminology answer questions about the
         | meaning?" If a human would need extra info to make sense of
         | your data then so would an LLM.
         | 
         | This is where RAG typically comes in. For example if you had
         | documentation about ClassName and FunctionName, a retrieval
         | model might be able to find the most likely candidates based on
         | a file containing full definitions of these classes and
         | function, then pass that info into the LLM appended to your
         | query.
         | 
         | For 2: It depends if the fire house is the query or the data.
         | If you have queries coming in very quickly, then you might be
         | able to if your firehose doesn't have too much volume since you
         | can batch requests and get responses fairly quickly.
         | 
         | If the fire hose is the data going into the vector DB then you
         | might have some difficultly inserting and indexing the data
         | fast enough.
        
         | warkdarrior wrote:
         | For this kind of structured data and this kind of structured
         | queries, it may be more useful to stick to a data query
         | language (SQL, or some analytics engine).
        
           | deanmoriarty wrote:
           | Thanks. I wonder if a reasonable approach could then be to
           | first insert the data in a datawarehouse-like database
           | suitable for analytics, and then use an LLM application to
           | (1) generate SQL queries that could answer my question,
           | reasoning about the schema (2) potentially summarize the
           | output result set. It could still result in a significant
           | boost of productivity.
        
             | warkdarrior wrote:
             | Indeed, that is a promising path. Fundamentally you still
             | want to reply on a human to figure out what analytics are
             | interesting to consider, then having the LLM act as a
             | helper that generates queries corresponding to the
             | analytics.
        
       | zackproser wrote:
       | While you don't strictly "need" a vector db to do RAG, as others
       | have pointed out, vector databases excel when you're dealing with
       | natural language - which is ambiguous.
       | 
       | This will be the case when you're exposing an interface to end
       | users that they can submit arbitrary queries to - such as "how do
       | I turn off reverse breaking".
       | 
       | By converting the user's query to vectors before sending it to
       | your vector store, you're getting at the user's actual intent
       | behind their words - which can help you retrieve more accurate
       | context to feed to your LLM when asking it to perform a chat
       | completion, for example.
       | 
       | This is also important if you're dealing with proprietary or non-
       | public data that a search engine can't see. Context-specific
       | natural language queries are well suited to vector databases.
       | 
       | We wrote up a guide with examples here:
       | https://www.pinecone.io/learn/retrieval-augmented-generation...
       | 
       | And we've got several example notebooks you can run end to end
       | using our free-tier here: https://docs.pinecone.io/page/examples
        
         | phillipcarter wrote:
         | Ehhh, I don't think you're telling the whole story here.
         | Vectors aren't really a complete solution here either. Consider
         | a use case like ours where we need to support extremely vague
         | inputs (since users give us extremely vague inputs):
         | https://twitter.com/_cartermp/status/1700586154599559464/
         | 
         | Cosine similarity across vectors isn't enough here, but when
         | combined with an LLM we get the right behavior. As you mention,
         | without the vector store reducing the size of data we pass to
         | the LLM, hallucinations happen more often. It's a balancing
         | act.
         | 
         | The other nasty one to consider is when people write "how do I
         | not turn off reverse breaking". Again, a comparison will show
         | that as very similar to your input, but it's really the
         | opposite. And so if implementers aren't careful to account for
         | that, they've now got a nasty subtle bug on their hands.
        
           | danielbln wrote:
           | A neat way of dealing with sparse input is to take the entire
           | chat history (if any) into account and ask the LLM to expand
           | the query so that the semantic search has more to work with.
           | Generally, using the LLM to add more data to the user query
           | based on context, previous conversation, or just having it
           | produce a fake document all together based on the sparse
           | query can work well to improve the vectors you use in the
           | similarity search. A concern with this strategy is latency,
           | as you need to add another generation hop before you can
           | query the vector db.
        
             | brandall10 wrote:
             | Interesting. Do you have specific examples or a link to a
             | post detailing this?
        
               | danielbln wrote:
               | The approach is based on hypothetical document embeddings
               | (HyDE). Here is a good description of it in the context
               | of langchain: https://python.langchain.com/docs/use_cases
               | /question_answeri...
               | 
               | The original paper proposing this technique can be found
               | here: https://arxiv.org/pdf/2212.10496.pdf
        
       | tshrjn007 wrote:
       | What do you use to generate the diagrams in the post? Super Neat.
        
       | pplonski86 wrote:
       | Can it be easier to do RAG? Do we always need to have Vector DB?
       | Why LLM can't do search through the context by itself?
        
         | clharman wrote:
         | You need a vector db because all the vector db companies need
         | customers...
         | 
         | You definitely do need information retrieval. It just shouldn't
         | be limited to vector dbs. Unfortunately vector db companies and
         | the VCs that back them have flooded the internet with
         | propaganda suggesting vector db is the only choice.
         | https://colinharman.substack.com/p/beware-tunnel-vision-in-a...
         | 
         | For most serious use cases, you'll have far too much data to
         | fit into 1 (or several) inference contexts.
        
         | petesergeant wrote:
         | Petroni 2020 got pretty far with TFDIF iirc, for a related but
         | slightly different task, still I've got to believe the semantic
         | search element provided by vector DBS is going to add a lot
        
         | chandureddyvari wrote:
         | The context length is limited, for gpt-3.5 it's 4k tokens,
         | there are other offerings which offer upto 100k(claude). 100k
         | tokens is ~1 book., but it priced steeply for each call. It's
         | often wiser, cheaper to Retrieve the context from your text &
         | Augment your query to the LLM to Generate more contextual
         | answers. That's the reason for the name Retrieval Augmented
         | Generation (RAG). For Retrieving - you'd need a vector database
         | (for similarity comparison you can use semantic or vector
         | embedding based similarity search)
        
           | halflings wrote:
           | Minor note: you only need a vector database if you have so
           | many possible inputs that linear retrieval is too slow.
           | 
           | Arguably, for many use cases (e.g. searching through a
           | document with ~200 passages), loading embeddings in memory
           | and running a simple linear search would be fast enough.
        
             | chandureddyvari wrote:
             | Yeah what you mentioned might be true. Currently our
             | understanding on how LLMs really work behind the screens is
             | limited. For example, there was a recent research[1] where
             | LLM's accuracy is better if the context is added at the
             | beginning when compared to the end of the prompt. So it's
             | mostly by trial & error to figure out what works out best
             | for you. You can use FAISS or similar to have the
             | embeddings in-memory instead of a full fledged vector DB.
             | But pg vector is convenient plugin if you already have
             | postgres instance running
             | 
             | [1]- https://towardsdatascience.com/in-context-learning-
             | approache...
        
               | halflings wrote:
               | What I mentioned doesn't depend on how LLMs work, the end
               | result is the same (retrieving useful inputs to pass to
               | your LLM). Just meant that a lot of people can just do
               | this in-memory or in ad-hoc ways if they're not too
               | latency constrained.
        
             | azmodeus wrote:
             | I think unless you need a vector db definitely don't use
             | it.
             | 
             | A vector storage could help in reduce the time it takes to
             | retrieve the most similar hit. I used faiss as a local
             | vector store quite a bit to retrieve vectors fast. Though I
             | had 1.5 million vectors to work through.
        
               | chandureddyvari wrote:
               | Interesting. I thought anything >1million would need a
               | vector db to scale on production. What was your machine
               | config for running faiss? Also did you plan for
               | redundancy or was it just faiss as a service VM?
        
               | Tostino wrote:
               | People seem to underestimate the scale you can get to on
               | a single machine, and overestimate how easy it will be to
               | go up from there.
               | 
               | An in memory index is about as good as it gets for a
               | single node performance, and fitting that many vectors
               | into memory on a single machine is easy.
        
         | phillipcarter wrote:
         | Others have chimed in as well, but I'll mention that we've been
         | live with our product, for all users, for several months now
         | doing RAG with OpenAI vector embeddings stored in Redis.
         | 
         | We then just fetch up to the vectors related to a customer's
         | schema in memory (largest is ~200MB) and run cosine similarity
         | in a few ms in Go (handwritten, ~25 lines of code), and then
         | we've got out top N things to place in our prompt.
         | 
         | Primitive? You betcha. Works extremely well for our entire
         | customer base? Yup. You definitely don't need a Vector DB
         | unless you have an enormous amount of vectors. For us it means
         | having to run our own Redis clusters, but we know how to do
         | that, and so we don't need to involve another vendor.
        
           | gsuuon wrote:
           | For local stuff with a handful of documents, you can even
           | just throw it into a json and call it a day. The similarity
           | search is as simple as an np.dot: https://github.com/gsuuon/l
           | lm.nvim/blob/main/python3/store.p...
        
         | potatoman22 wrote:
         | You can hook up any search engine to an LLM. Vector databases
         | are just an easy* way to make a decent search engine.
        
         | simonw wrote:
         | No, you don't need a vector database. You can get OK results by
         | prompting "give me ten search terms that are relevant to this
         | question", then running those searches against a regular full-
         | text search engine and pasting those results back into the LLM
         | as context along with the original question.
         | 
         | You're likely to get better results from vector-based semantic
         | search though, just because it takes you beyond needing exact
         | matches on search terms.
        
           | clharman wrote:
           | Vector is better for some use cases (open-domain, more
           | conversational data) and term-based search is better for
           | others (closed-domain, more keyword-based).
           | 
           | I've found that internal enterprise projects tend to be very
           | keyword based, and vector search often produces weird, head-
           | scratcher results that users hate - whereas term-based search
           | does a better job of capturing the right terms, if you do the
           | proper synonym/abbreviation expansions.
           | 
           | That said, I use them both, usually with vector search as a
           | fallback after the initial keyword-based RAG pass
        
         | ofermend wrote:
         | RAG is a very useful flow but I agree the complexity is often
         | overwhelming, esp as you move from a toy example to a real
         | production deployment. It's not just choosing a vector DB (last
         | time I checked there were about 50), managing it, deciding on
         | how to chunk data, etc. You also need to ensure your retrieval
         | pipeline is accurate and fast, ensuring data is secure and
         | private, and manage the whole thing as it scales. That's one of
         | the main benefits of using Vectara (https://vectara.com; FD: I
         | work there) - it's a GenAI platform that abstracts all this
         | complexity away, and you can focus on building your
         | application.
        
       | gsuuon wrote:
       | Wow this was indeed super comprehensive. A few things I noticed:
       | 
       | - In the cold start section, a couple of the synthetic_data
       | responses say 'context does not provide info..'
       | 
       | - It's strange that retrieval_score would decrease while
       | quality_score increases at the higher chunk sizes. Could this
       | just be that the retrieved chunk is starting to be larger than
       | the reference?
       | 
       | - Gpt 3.5 pricing looks out of date, it's currently $0.0015 for
       | input for the 4k model
       | 
       | - Interesting that pricing needs to be shown on a log scale.
       | Gpt-4 is 46x more expensive than llama 2 70B for ~.3 score
       | increase. Training a simple classifier seems like a great way to
       | handle this.
       | 
       | - I wonder how stable the quality_score assessment is given the
       | exact same configuration. I guess the score differences between
       | falcon-180b, llama-2-70b and gpt-3.5 are insignificant?
       | 
       | Is there a similarly comprehensive deep dive into chunking
       | methods anywhere? Especially for queries that require multiple
       | chunks to answer at all - producing more relevant chunks would
       | have a massive impact on response quality I imagine.
        
       | ajhai wrote:
       | Kudos to the team for a very detailed notebook going into things
       | like pipeline evaluation wrt performance and costs etc. Even if
       | we ignore the framework specific bits, it is a great guide to
       | follow when building RAG systems in production.
       | 
       | We have been building RAG systems in production for a few months
       | and have been tinkering with different strategies to get the most
       | performance out of these pipelines. As others have pointed out,
       | vector database may not be the right strategy for every problem.
       | Similarly there are things like lost in the middle problems
       | (https://arxiv.org/abs/2307.03172) that one may have to deal
       | with. We put together our learnings building and optimizing these
       | pipelines in a post at https://llmstack.ai/blog/retrieval-
       | augmented-generation.
       | 
       | https://github.com/trypromptly/LLMStack is a low-code platform we
       | open-sourced recently that ships these RAG pipelines out of the
       | box with some app templates if anyone wants to try them out.
        
       | yujian wrote:
       | Anyscale consistently posts great projects. Very cool to see the
       | cost comparison and quality comparison. Not surprising to see
       | that OSS is less expensive, but also rated as slightly lower
       | quality than gpt-3.5-turbo.
       | 
       | I do wonder, is there some bias in quality measures? Using GPT 4
       | to evaluate GPT 4's output?
       | https://www.linkedin.com/feed/update/urn:li:activity:7103398...
        
       ___________________________________________________________________
       (page generated 2023-09-14 23:02 UTC)