[HN Gopher] A Comprehensive Guide for Building Rag-Based LLM App...
___________________________________________________________________
A Comprehensive Guide for Building Rag-Based LLM Applications
Author : robertnishihara
Score : 160 points
Date : 2023-09-14 06:33 UTC (16 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| version_five wrote:
| FWIW, having written a simple RAG system from "scratch" (meaning
| not using frameworks or api calls), it's not more complicated
| than doing it this way with langchain etc.
|
| This post is mostly about plumbing. It's probably the right way
| to do it if it needs to be scaled. But for learning, it obscures
| what is essentially simple stuff going on behind the scenes.
| jamesblonde wrote:
| I tend to agree - i haven't seen the value in existing
| "retriever" components in langchain and others.
| haxton wrote:
| My favorite example is the asana loader[0] for llama-index.
| It's literally just the most basic wrapper around the Asana
| SDK to concatenate some strings.
|
| [0] - https://github.com/emptycrown/llama-
| hub/blob/main/llama_hub/...
| clharman wrote:
| For serious implementations, frameworks are not very helpful,
| even LangChain. All the components provide good SDKs/APIs,
| having a bunch of "integrations" doesn't add any real value.
|
| If you know what you want to build, building from scratch is
| easier than you think. If you're tinkering on the weekend, then
| maybe the frameworks are helpful.
| lmeyerov wrote:
| Yeah as soon as we write the word 'thread' or thinking about
| LLM API concurrency control across many user requests, all
| frameworks we tried are basically a wall instead of an
| accelerator. For a single user demo video on Twitter or a
| low-traffic streamlit POC to get a repo with lots of star
| gazers, they work quite well, and that's not far from what
| someone needs for an internal project with a small userbase.
| Just once this is supposed to be infra for production-grade
| software, the tools we have tried so far are still
| prioritizing features over being a foundation.
| iampims wrote:
| Opportunity for you to write a blogpost about your approach :)
| ajhai wrote:
| We did write one just yesterday that talks about rags and
| some techniques to improve their performance in production at
| https://llmstack.ai/blog/retrieval-augmented-generation
| bguberfain wrote:
| What brings to my attention in this article is the section named
| "Cold Start", where it generates questions based on a provided
| context. I think it is a good way to cheaply generate an Q&A
| dataset that can later be used to finetune a model. But the
| problem is that it generates some questions and answers of bad
| quality. All generated examples have issues: - "What is the
| context discussing about?" - which context? - "The context does
| not provide information on what Ray Tune is." - Not an answer -
| "The context does not provide information on what external
| library integrations are." - same as before I could only think of
| manual review to remove these noise questions. Any ideas on how
| to improve this QA generation? I've tried it before, but with
| paltry results.
| maxrmk wrote:
| I recently quit my job to build specialized tooling in this
| space. We're broadly focusing on eval in general, but are
| starting with high quality question and answer generation for
| testing these kinds of RAG pipelines. It's surprisingly hard!
| resiros wrote:
| Sounds very interesting. I am building an open-source LLM
| building platform (agenta.ai) and looking for eval approaches
| to integrate for our users. Do you have already a product/api
| that we could use?
| deanmoriarty wrote:
| My question is: if I want to use LLM to help me sift through a
| large amount of structured data, say for example all the logs for
| a bunch of different applications from a certain cloud
| environment, each with their own idiosyncrasies and specific
| formats (many GBs of data), can the RAG pattern be useful here?
|
| Some of my concerns:
|
| 1) Is sentence embedding using an off-the-shelf embedding model
| going to capture the "meaning" of my logs? My answer is "probably
| not". For example, if a portion of my logs is in this format
| timestamp_start,ClassName,FunctionName,timestamp_end
|
| Will I be able to get meaningful embeddings that satisfy a query
| such as "what components in my system exhibited an anomalously
| high latency lately?" (this is just an example among many
| different queries I'd have)
|
| Based on the little I know, it seems to me off-the-shelf
| embeddings wouldn't be able to match the embedding of my query
| with the embeddings for the relevant log lines, given the
| complexity of this task.
|
| 2) Is it going to be even feasible (cost/performance-wise) to use
| embeddings when one has a firehose of data coming through, or is
| it better suited for a mostly-static corpus of data (e.g. your
| typical corporate documentation or product catalog)?
|
| I know that I can achieve something similar with a Code
| Interpreter-like approach, so in theory I could build a multi-
| step reasoning agent that starting from my query and the data
| would try to (1) discover the schema and then (2) crunch the data
| to try to get to my answer, but I don't know how scalable this
| approach would effectively be.
| thewataccount wrote:
| Just to clarify - are you wanting the LLM itself to identify
| what a "anomalous latency" would be based on the data itself?
| If so then I don't think this will help you at all until we can
| actually fit the log into the context.
|
| What RAG here is doing is using embeddings and a vector store
| to identify close pieces of information, for example "in this
| django project add a textfield" will be very close to
| documentation in the django docs that say "textfield", and it
| will then add that to the prompt so the LLM has the relevant
| docs in its context.
|
| The problem is that you'll need a heuristic to identify at
| least "potentially anomalous" and even then you'll still have
| to make sure there's enough context for it to know "is this a
| normal daily fluctuation".
|
| A multi-step agent is definitely what you want, you could have
| it build an SQL query itself, for example "was there any high
| latency requests yesterday?" it may identify it should filter
| the time, possibly design the query to determine what is
| "high".
|
| ---
|
| At the moment I don't think it's well suited to identifying
| when the "latency is abnormally high". However if you have some
| other system/human identify heuristics to feed to the LLM, it
| may then be able to do at least answer the query.
| deanmoriarty wrote:
| Yes, this clarifies well what is possible vs not.
|
| I was trying to understand if there is an opportunity to
| introduce some of this technology to solve "anomaly
| detection" on large amount of structured data, where anomaly
| might be an incredibly overloaded term (it might imply a
| performance regression, a security issue, etc). That is a
| business need I have today.
|
| It seems that what is possible today is an assistant that can
| aid a user to get to these answers faster (by, for instance,
| suggesting a SQL query based on the schema, etc). Again,
| roughly the equivalent of what Code Interpreter does, just
| without the local environment limitations.
| IKantRead wrote:
| > can the RAG pattern be useful here?
|
| From your questions it looks like you are only interesting in
| the R part. RAG implies the retrieval step is then used to
| augment a user prompt.
|
| To answer 1, a good heuristic would be "can a human reasonably
| familiar with the terminology answer questions about the
| meaning?" If a human would need extra info to make sense of
| your data then so would an LLM.
|
| This is where RAG typically comes in. For example if you had
| documentation about ClassName and FunctionName, a retrieval
| model might be able to find the most likely candidates based on
| a file containing full definitions of these classes and
| function, then pass that info into the LLM appended to your
| query.
|
| For 2: It depends if the fire house is the query or the data.
| If you have queries coming in very quickly, then you might be
| able to if your firehose doesn't have too much volume since you
| can batch requests and get responses fairly quickly.
|
| If the fire hose is the data going into the vector DB then you
| might have some difficultly inserting and indexing the data
| fast enough.
| warkdarrior wrote:
| For this kind of structured data and this kind of structured
| queries, it may be more useful to stick to a data query
| language (SQL, or some analytics engine).
| deanmoriarty wrote:
| Thanks. I wonder if a reasonable approach could then be to
| first insert the data in a datawarehouse-like database
| suitable for analytics, and then use an LLM application to
| (1) generate SQL queries that could answer my question,
| reasoning about the schema (2) potentially summarize the
| output result set. It could still result in a significant
| boost of productivity.
| warkdarrior wrote:
| Indeed, that is a promising path. Fundamentally you still
| want to reply on a human to figure out what analytics are
| interesting to consider, then having the LLM act as a
| helper that generates queries corresponding to the
| analytics.
| zackproser wrote:
| While you don't strictly "need" a vector db to do RAG, as others
| have pointed out, vector databases excel when you're dealing with
| natural language - which is ambiguous.
|
| This will be the case when you're exposing an interface to end
| users that they can submit arbitrary queries to - such as "how do
| I turn off reverse breaking".
|
| By converting the user's query to vectors before sending it to
| your vector store, you're getting at the user's actual intent
| behind their words - which can help you retrieve more accurate
| context to feed to your LLM when asking it to perform a chat
| completion, for example.
|
| This is also important if you're dealing with proprietary or non-
| public data that a search engine can't see. Context-specific
| natural language queries are well suited to vector databases.
|
| We wrote up a guide with examples here:
| https://www.pinecone.io/learn/retrieval-augmented-generation...
|
| And we've got several example notebooks you can run end to end
| using our free-tier here: https://docs.pinecone.io/page/examples
| phillipcarter wrote:
| Ehhh, I don't think you're telling the whole story here.
| Vectors aren't really a complete solution here either. Consider
| a use case like ours where we need to support extremely vague
| inputs (since users give us extremely vague inputs):
| https://twitter.com/_cartermp/status/1700586154599559464/
|
| Cosine similarity across vectors isn't enough here, but when
| combined with an LLM we get the right behavior. As you mention,
| without the vector store reducing the size of data we pass to
| the LLM, hallucinations happen more often. It's a balancing
| act.
|
| The other nasty one to consider is when people write "how do I
| not turn off reverse breaking". Again, a comparison will show
| that as very similar to your input, but it's really the
| opposite. And so if implementers aren't careful to account for
| that, they've now got a nasty subtle bug on their hands.
| danielbln wrote:
| A neat way of dealing with sparse input is to take the entire
| chat history (if any) into account and ask the LLM to expand
| the query so that the semantic search has more to work with.
| Generally, using the LLM to add more data to the user query
| based on context, previous conversation, or just having it
| produce a fake document all together based on the sparse
| query can work well to improve the vectors you use in the
| similarity search. A concern with this strategy is latency,
| as you need to add another generation hop before you can
| query the vector db.
| brandall10 wrote:
| Interesting. Do you have specific examples or a link to a
| post detailing this?
| danielbln wrote:
| The approach is based on hypothetical document embeddings
| (HyDE). Here is a good description of it in the context
| of langchain: https://python.langchain.com/docs/use_cases
| /question_answeri...
|
| The original paper proposing this technique can be found
| here: https://arxiv.org/pdf/2212.10496.pdf
| tshrjn007 wrote:
| What do you use to generate the diagrams in the post? Super Neat.
| pplonski86 wrote:
| Can it be easier to do RAG? Do we always need to have Vector DB?
| Why LLM can't do search through the context by itself?
| clharman wrote:
| You need a vector db because all the vector db companies need
| customers...
|
| You definitely do need information retrieval. It just shouldn't
| be limited to vector dbs. Unfortunately vector db companies and
| the VCs that back them have flooded the internet with
| propaganda suggesting vector db is the only choice.
| https://colinharman.substack.com/p/beware-tunnel-vision-in-a...
|
| For most serious use cases, you'll have far too much data to
| fit into 1 (or several) inference contexts.
| petesergeant wrote:
| Petroni 2020 got pretty far with TFDIF iirc, for a related but
| slightly different task, still I've got to believe the semantic
| search element provided by vector DBS is going to add a lot
| chandureddyvari wrote:
| The context length is limited, for gpt-3.5 it's 4k tokens,
| there are other offerings which offer upto 100k(claude). 100k
| tokens is ~1 book., but it priced steeply for each call. It's
| often wiser, cheaper to Retrieve the context from your text &
| Augment your query to the LLM to Generate more contextual
| answers. That's the reason for the name Retrieval Augmented
| Generation (RAG). For Retrieving - you'd need a vector database
| (for similarity comparison you can use semantic or vector
| embedding based similarity search)
| halflings wrote:
| Minor note: you only need a vector database if you have so
| many possible inputs that linear retrieval is too slow.
|
| Arguably, for many use cases (e.g. searching through a
| document with ~200 passages), loading embeddings in memory
| and running a simple linear search would be fast enough.
| chandureddyvari wrote:
| Yeah what you mentioned might be true. Currently our
| understanding on how LLMs really work behind the screens is
| limited. For example, there was a recent research[1] where
| LLM's accuracy is better if the context is added at the
| beginning when compared to the end of the prompt. So it's
| mostly by trial & error to figure out what works out best
| for you. You can use FAISS or similar to have the
| embeddings in-memory instead of a full fledged vector DB.
| But pg vector is convenient plugin if you already have
| postgres instance running
|
| [1]- https://towardsdatascience.com/in-context-learning-
| approache...
| halflings wrote:
| What I mentioned doesn't depend on how LLMs work, the end
| result is the same (retrieving useful inputs to pass to
| your LLM). Just meant that a lot of people can just do
| this in-memory or in ad-hoc ways if they're not too
| latency constrained.
| azmodeus wrote:
| I think unless you need a vector db definitely don't use
| it.
|
| A vector storage could help in reduce the time it takes to
| retrieve the most similar hit. I used faiss as a local
| vector store quite a bit to retrieve vectors fast. Though I
| had 1.5 million vectors to work through.
| chandureddyvari wrote:
| Interesting. I thought anything >1million would need a
| vector db to scale on production. What was your machine
| config for running faiss? Also did you plan for
| redundancy or was it just faiss as a service VM?
| Tostino wrote:
| People seem to underestimate the scale you can get to on
| a single machine, and overestimate how easy it will be to
| go up from there.
|
| An in memory index is about as good as it gets for a
| single node performance, and fitting that many vectors
| into memory on a single machine is easy.
| phillipcarter wrote:
| Others have chimed in as well, but I'll mention that we've been
| live with our product, for all users, for several months now
| doing RAG with OpenAI vector embeddings stored in Redis.
|
| We then just fetch up to the vectors related to a customer's
| schema in memory (largest is ~200MB) and run cosine similarity
| in a few ms in Go (handwritten, ~25 lines of code), and then
| we've got out top N things to place in our prompt.
|
| Primitive? You betcha. Works extremely well for our entire
| customer base? Yup. You definitely don't need a Vector DB
| unless you have an enormous amount of vectors. For us it means
| having to run our own Redis clusters, but we know how to do
| that, and so we don't need to involve another vendor.
| gsuuon wrote:
| For local stuff with a handful of documents, you can even
| just throw it into a json and call it a day. The similarity
| search is as simple as an np.dot: https://github.com/gsuuon/l
| lm.nvim/blob/main/python3/store.p...
| potatoman22 wrote:
| You can hook up any search engine to an LLM. Vector databases
| are just an easy* way to make a decent search engine.
| simonw wrote:
| No, you don't need a vector database. You can get OK results by
| prompting "give me ten search terms that are relevant to this
| question", then running those searches against a regular full-
| text search engine and pasting those results back into the LLM
| as context along with the original question.
|
| You're likely to get better results from vector-based semantic
| search though, just because it takes you beyond needing exact
| matches on search terms.
| clharman wrote:
| Vector is better for some use cases (open-domain, more
| conversational data) and term-based search is better for
| others (closed-domain, more keyword-based).
|
| I've found that internal enterprise projects tend to be very
| keyword based, and vector search often produces weird, head-
| scratcher results that users hate - whereas term-based search
| does a better job of capturing the right terms, if you do the
| proper synonym/abbreviation expansions.
|
| That said, I use them both, usually with vector search as a
| fallback after the initial keyword-based RAG pass
| ofermend wrote:
| RAG is a very useful flow but I agree the complexity is often
| overwhelming, esp as you move from a toy example to a real
| production deployment. It's not just choosing a vector DB (last
| time I checked there were about 50), managing it, deciding on
| how to chunk data, etc. You also need to ensure your retrieval
| pipeline is accurate and fast, ensuring data is secure and
| private, and manage the whole thing as it scales. That's one of
| the main benefits of using Vectara (https://vectara.com; FD: I
| work there) - it's a GenAI platform that abstracts all this
| complexity away, and you can focus on building your
| application.
| gsuuon wrote:
| Wow this was indeed super comprehensive. A few things I noticed:
|
| - In the cold start section, a couple of the synthetic_data
| responses say 'context does not provide info..'
|
| - It's strange that retrieval_score would decrease while
| quality_score increases at the higher chunk sizes. Could this
| just be that the retrieved chunk is starting to be larger than
| the reference?
|
| - Gpt 3.5 pricing looks out of date, it's currently $0.0015 for
| input for the 4k model
|
| - Interesting that pricing needs to be shown on a log scale.
| Gpt-4 is 46x more expensive than llama 2 70B for ~.3 score
| increase. Training a simple classifier seems like a great way to
| handle this.
|
| - I wonder how stable the quality_score assessment is given the
| exact same configuration. I guess the score differences between
| falcon-180b, llama-2-70b and gpt-3.5 are insignificant?
|
| Is there a similarly comprehensive deep dive into chunking
| methods anywhere? Especially for queries that require multiple
| chunks to answer at all - producing more relevant chunks would
| have a massive impact on response quality I imagine.
| ajhai wrote:
| Kudos to the team for a very detailed notebook going into things
| like pipeline evaluation wrt performance and costs etc. Even if
| we ignore the framework specific bits, it is a great guide to
| follow when building RAG systems in production.
|
| We have been building RAG systems in production for a few months
| and have been tinkering with different strategies to get the most
| performance out of these pipelines. As others have pointed out,
| vector database may not be the right strategy for every problem.
| Similarly there are things like lost in the middle problems
| (https://arxiv.org/abs/2307.03172) that one may have to deal
| with. We put together our learnings building and optimizing these
| pipelines in a post at https://llmstack.ai/blog/retrieval-
| augmented-generation.
|
| https://github.com/trypromptly/LLMStack is a low-code platform we
| open-sourced recently that ships these RAG pipelines out of the
| box with some app templates if anyone wants to try them out.
| yujian wrote:
| Anyscale consistently posts great projects. Very cool to see the
| cost comparison and quality comparison. Not surprising to see
| that OSS is less expensive, but also rated as slightly lower
| quality than gpt-3.5-turbo.
|
| I do wonder, is there some bias in quality measures? Using GPT 4
| to evaluate GPT 4's output?
| https://www.linkedin.com/feed/update/urn:li:activity:7103398...
___________________________________________________________________
(page generated 2023-09-14 23:02 UTC)