[HN Gopher] Solving the out-of-context chunk problem for RAG
       ___________________________________________________________________
        
       Solving the out-of-context chunk problem for RAG
        
       Author : zmccormick7
       Score  : 185 points
       Date   : 2024-07-22 13:44 UTC (2 days ago)
        
 (HTM) web link (d-star.ai)
 (TXT) w3m dump (d-star.ai)
        
       | Satam wrote:
       | RAG feels hacky to me. We're coming up with these pseudo-
       | technical solutions to help but really they should be solved at
       | the level of the model by researchers. Until this is solved
       | natively, the attempts will be hacky duct-taped solutions.
        
         | repeekad wrote:
         | What about fresh data like an extremely relevant news headline
         | that was published 10 minutes ago? Private data that I don't
          | want stored offsite but am okay trusting an enterprise no-log
          | API? Providing realtime context to LLMs isn't "hacky"; model
         | intelligence and RAG can complement each other and make
         | advancements in tandem
        
           | jstummbillig wrote:
            | I don't think the parent's idea was to bake all information
            | into the model, just that current RAG feels cumbersome to use
            | (but then again, so do most things AI right now) and
            | information access should be an intrinsic part of the model.
        
             | randomdata wrote:
             | Is there a specific shortcoming of the model that could be
             | improved, or are we simply seeking better APIs?
        
           | PaulHoule wrote:
           | One of my favorite cases is sports chat. I'd expect ChatGPT
           | to be able to talk about sports legends but not be able to
           | talk about a game that happened last weekend. Copilot usually
           | does a good job because it can look up the game on Bing and
            | then summarize, but the other day I asked it "What happened
           | last week in the NFL" and it told me about a Buffalo Bills
           | game from last year (did it know I was in the Bills
           | geography?)
           | 
           | Some kind of incremental fine tuning is probably necessary to
           | keep a model like ChatGPT up to date but I can't picture it
           | happening each time something happens in the news.
        
             | ec109685 wrote:
             | For the current game, it seems solvable by providing it the
             | Boxscore and the radio commentary as context, perhaps with
             | some additional data derived from recent games and news.
             | 
             | I think you'd get a close approximation of speaking with
             | someone who was watching the game with you.
        
         | viraptor wrote:
         | That's so vague I can't tell what you're suggesting. What
         | specifically do you think needs solving at the model level?
         | What should work differently?
        
           | Satam wrote:
            | There's probably a lack of capabilities on multiple fronts.
            | RAG might have the right general idea, but currently the
            | retrieval seems to be too separated from the model itself. I
            | don't know how our brains do it, but retrieval looks to be
            | more integrated there.
           | 
           | Models currently also have no way to update themselves with
           | new info besides us putting data into their context window.
           | They don't learn after the initial training. It seems if they
           | could, say, read documentation and internalize it, the need
           | for RAG or even large context windows would decrease. Humans
           | somehow are able to build understanding of extensive topics
           | with what feels to be a much shorter context-window.
        
             | simonw wrote:
             | Don't forget the importance of data privacy. Updating a
             | model with fresh information makes that information
             | available to ALL users of that model. This often isn't what
             | you want - you can run RAG against a user's private email
             | to answer just their queries, without making that email
             | "baked in" to the model.
        
               | viraptor wrote:
                | You don't need to update the whole model for everyone.
                | Fine-tuning exists and is even available as a service
                | from OpenAI. The updates are only visible in the specific
                | models you use.
        
               | simonw wrote:
               | Maintaining a fine-tuned model for every one of your
               | users - even with techniques like LoRA - sounds
               | complicated and expensive to me!
        
               | lowdest wrote:
                | It is, but it's also not that bad. A copy of the weights
                | is X GB of cloud storage (which can be stored as a diff
                | if it helps), plus added compute time for loading a
                | custom model and unloading it for the next customer. It's
                | not free, but it's an approachable cost for a premium
                | service.
        
             | PaulHoule wrote:
             | I can answer questions off the cuff based on the weights of
             | the neural network in my head. If I really wanted to get
             | the right answers I would do "RAG" in the sense of looking
             | up answers on the web or at the library and summarizing
             | them.
             | 
             | For instance I have a policy that I try hard not to say
             | anything like "most people think that..." without providing
             | links because I work at an archive of public opinion data
             | and if it gets out that one of our people was spouting
             | false information about our domain, even if we weren't
             | advertising the affiliation, that would look bad.
        
             | michalwarda wrote:
              | I guess it's because people are not using tools enough yet.
              | In my tests, giving the LLM access to tools for retrieval
              | works much better than trying to guess what the RAG would
              | need to answer, i.e. the LLM decides if it has all of the
              | necessary information to answer the question. If not, let
              | it search for it. If it still fails, then let it search
              | more :D
        
               | zmccormick7 wrote:
               | Agreed. Retrieval performance is very dependent on the
               | quality of the search queries. Letting the LLM generate
               | the search queries is much more reliable than just
               | embedding the user input. Also, no retrieval system is
               | going to return everything needed on the first try, so
               | using a multi-step agent approach to retrieving
               | information is the only way I've found to get extremely
               | high accuracy.
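                | 
                | A minimal sketch of the multi-step loop I mean (the
                | `llm` and `search` callables here are hypothetical
                | placeholders, not any particular library):
                | 
                |   # Hypothetical agentic retrieval loop: the LLM writes
                |   # its own search queries and decides when the gathered
                |   # context is sufficient.
                |   def agentic_answer(question, llm, search, max_steps=3):
                |       context = []
                |       for _ in range(max_steps):
                |           prompt = (f"Question: {question}\n"
                |                     f"Context so far: {context}\n"
                |                     "Write the next search query, or "
                |                     "reply DONE if context suffices.")
                |           query = llm(prompt)
                |           if query.strip().upper() == "DONE":
                |               break
                |           context.extend(search(query, top_k=5))
                |       return llm(f"Answer using only this context:\n"
                |                  f"{context}\nQuestion: {question}")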
        
             | mewpmewp2 wrote:
              | Our brains don't do it that way either. We can't memorise
              | everything in the world. For us, a library or Google Search
              | is what RAG is for an LLM.
        
           | emrah wrote:
            | I think he is saying we should be making fine-tuning or
            | similar model-altering methods easier rather than messing
            | with bolt-on solutions like RAG.
            | 
            | Those are being worked on, and RAG is the duct-tape solution
            | until they become available.
        
         | ac1spkrbox wrote:
         | The set of techniques for retrieval is immature, but it's
         | important to note that just relying on model context or few-
         | shot prompting has many drawbacks. Perhaps the most important
         | is that retrieval as a task should not rely on generative
         | outputs.
        
           | danielbln wrote:
           | It's also subject to significantly more hallucination when
           | the knowledge is baked into the model, vs being injected into
           | the context at runtime.
        
         | l72 wrote:
         | I've described it this way to my colleagues:
         | 
         | RAG is a bit like having a pretty smart person take an open
         | book test on a subject they are not an expert in. If your book
         | has a good chapter layout and index, you probably do an ok job
         | trying to find relevant information, quickly read it, and try
          | to come up with an answer. But you're not going to be able to
         | test for a deep understanding of the material. This person is
         | going to struggle if each chapter/concept builds on the
         | previous concept, as you can't just look up something in
         | Chapter 10 and be able to understand it without understanding
          | Chapters 1-9.
         | 
         | Fine-tuning is a bit more like having someone go off and do a
          | PhD and specialize in a specific area. They get a much deeper
          | understanding of the problem space and can conceptualize at a
         | different level.
        
         | williamtrask wrote:
         | Fwiw, I used to think this way too but LLMs are more RAG-like
         | internally than we initially realised. Attention is all you
          | need ~= RAG is a big attention mechanism. Models have the
          | reversal curse, memorisation issues, etc. I personally think of
          | LLMs as a
         | kind of decomposed RAG. Check out DeepMind's RETRO paper for an
         | even closer integration.
        
         | jejeyyy77 wrote:
         | The biggest problem with RAG is that the bottleneck for your
          | product is now the RAG (i.e., results are only as good as what
         | your vector store sends to the LLM). This is a step backwards.
         | 
          | Source: built a few RAG+LLM products.
        
         | zby wrote:
         | I guess you can imagine an LLM that contains all information
         | there is - but it would have to be at least as big as all
          | information there is, or it would have to hallucinate. Not to
          | mention that you would also require it to learn everything
          | immediately. I don't see any realistic way
         | to reach that goal.
         | 
         | To reach their potential LLMs need to know how to use external
         | sources.
         | 
         | Update: After some more thinking - if you required it to know
         | information about itself - then this would lead to some paradox
         | - I am sure.
        
       | CharlieDigital wrote:
       | The easiest solution to this is to stuff the heading into the
       | chunk. The heading is hierarchical navigation within the sections
       | of the document.
       | 
       | I found Azure Document Intelligence specifically with the Layout
       | Model to be fantastic for this because it can identify headers.
       | All the better if you write a parser for the output JSON to track
       | depth and stuff multiple headers from the path into the chunk.
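        | 
        | Roughly what that parser ends up doing, as a sketch (the shape
        | of the parsed elements here is a simplified assumption, not the
        | exact Azure Document Intelligence output schema):
        | 
        |   # Sketch: track the header path and prepend it to each chunk.
        |   # `elements` is assumed to be a flat list of dicts with
        |   # "role", "level" and "text" keys from your layout parser.
        |   def build_chunks(elements):
        |       header_path, chunks = [], []
        |       for el in elements:
        |           if el["role"] == "sectionHeading":
        |               # Drop headers deeper than this one, then push it.
        |               header_path = header_path[: el["level"] - 1]
        |               header_path.append(el["text"])
        |           else:
        |               prefix = " > ".join(header_path)
        |               chunks.append(f"[{prefix}]\n{el['text']}")
        |       return chunks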
        
         | williamcotton wrote:
         | _Contextual chunk headers
         | 
         | The idea here is to add in higher-level context to the chunk by
         | prepending a chunk header. This chunk header could be as simple
         | as just the document title, or it could use a combination of
         | document title, a concise document summary, and the full
         | hierarchy of section and sub-section titles._
         | 
         | That is from the article. Is this different from your suggested
         | approach?
        
           | CharlieDigital wrote:
           | No, but this is also not really a novel solution.
        
         | lmeyerov wrote:
          | So subtle! The article is about doing that, which is something
          | we are doing a lot of right now... though it seems to snatch
         | defeat from the jaws of victory:
         | 
         | If we think about what this is about, it is basically entity
         | augmentation & lexical linking / citations.
         | 
         | Ex: A patient document may be all about patient id 123. That
         | won't be spelled out in every paragraph, but by carrying along
         | the patient ID (semantic entity) and the document (citation),
         | the combined model gets access to them. A naive one-shot
          | retrieval over a naive chunked vector index would want it in
          | the text/embedding, while a smarter one would also put it in
          | the entry metadata. And as others write, this helps move
          | reasoning from the symbolic domain to the semantic domain, so
          | it's less of a hack.
         | 
         | We are working on some fun 'pure-vector' graph RAG work here to
         | tackle production problems around scale, quality, & always-on
         | scenarios like alerting - happy to chat!
        
           | CharlieDigital wrote:
            | Also working with GRAG (via Neo4j), and I'm somewhat
            | skeptical that, for most cases where a natural hierarchical
            | structure already exists, graph RAG will significantly
            | outperform RAG that uses the hierarchical structure.
           | 
            | A better solution I had thought about is "local RAG". I came
           | across this while processing embeddings from chunks parsed
           | from Azure Document Intelligence JSON. The realization is
           | that relevant topics are often localized within a document.
           | Even across a corpus of documents, relevant passages are
           | localized.
           | 
            | Because the chunks are processed sequentially, one needs only
            | to keep track of the sequence number of the chunk. Assume
            | that the embedding matches a chunk _n_; then it would follow
            | that the most important context is the chunks localized at
            | _n - m_ and _n + p_. So find the top _x_ chunks via hybrid
            | embedding + full-text match and expand outwards from each of
            | the chunks to grab the chunks around it.
           | 
           | While a chunk may represent just a few sentences of a larger
           | block of text, this strategy will grab possibly the whole
           | section or page of text localized around the chunk with the
           | highest match.
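            | 
            | A rough sketch of that expansion step (the `search` helper
            | here stands in for the hybrid embedding + full-text match,
            | and the window sizes are placeholders):
            | 
            |   # Sketch: take the top-x matched chunks and pull in their
            |   # positional neighbours; chunks are assumed to be stored
            |   # in document order, so the list index is the sequence
            |   # number.
            |   def local_expand(query, chunks, search, x=5, m=2, p=2):
            |       hits = search(query, top_k=x)  # sequence numbers
            |       keep = set()
            |       for n in hits:
            |           start = max(0, n - m)
            |           end = min(len(chunks), n + p + 1)
            |           keep.update(range(start, end))
            |       return [chunks[i] for i in sorted(keep)]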
        
             | michalwarda wrote:
              | This works as long as the relevant information is
              | colocated. Sometimes though, for example in financial
              | documents, important parts reference each other through
              | keywords etc. That's why you can always try and retrieve
              | not only positionally related chunks but also semantically
              | related ones.
             | 
              | Go for chunk n, n - m, n + p and n', where n' are the
              | chunks semantically closest to n.
              | 
              | Moreover, you can expose this traversal to your LLM as a
              | tool to use whenever it is missing crucial information to
              | answer the question. Thanks to that, you don't always
              | retrieve thousands of tokens when they aren't needed.
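              | 
              | As a sketch of the combined expansion (the `neighbors`
              | helper is a placeholder for a vector-store lookup that
              | returns chunk indices):
              | 
              |   # Sketch: merge the positional window around a hit n
              |   # with the indices of its semantically closest chunks n'.
              |   def expand_hit(n, chunks, neighbors, m=2, p=2, k=3):
              |       start = max(0, n - m)
              |       end = min(len(chunks), n + p + 1)
              |       related = neighbors(chunks[n], top_k=k)  # the n'
              |       return sorted(set(range(start, end)) | set(related))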
        
               | CharlieDigital wrote:
               | > positionally related chunks but also semantically
               | related ones
               | 
               | That's why the entry point would still be an embedding
               | search; it's just that instead of using the first 20
               | embedding hits, you take the first 5 and if the reference
               | is "semantically adjacent" to the entry concept, we would
               | expect that some of the first few chunks would capture it
               | in most cases.
               | 
               | I think where GRAG yields more relevancy is when the
               | referenced content is _not_ semantically similar nor even
               | semantically adjacent to the entry concept but is
               | semantically similar to some sub fragment of a matched
               | chunk. Depending on the corpus, this can either be common
                | (no familiarity with financial documents) or rare. I've
               | primarily worked with clinical trial protocols and at
               | least in that space, the concepts are what I would
               | consider "snowflake-shaped" in that it branches out
               | pretty cleanly and rarely cross-references (because it is
               | more common that it repeats the relevant reference).
               | 
               | All that said, I think that as a matter of practicality,
               | most teams will probably get much bigger yield with much
               | less effort doing local expansion based on matching for
               | semantic similarity first since it addresses two core
               | problems with embeddings (text chunk size vs embedding
                | accuracy, relevancy of embeddings matched below a given
               | threshold). Experiment with GRAG depending on the type of
               | questions you're trying to answer and the nature of the
               | underlying content. Don't get me wrong; I'm not saying
               | GRAG has _no_ benefit, but that most teams can explore
               | other ways of using RAG before trying GRAG.
        
               | lmeyerov wrote:
               | Neo4j graph rag is typically not graph rag in the AI
               | sense / MSR Graph RAG paper sense, but KG or lexical
               | extraction & embedding, and some retrieval time hope of
               | the neighborhood being ok
               | 
               | GRAG in the direction of the MSR paper adds some
               | important areas:
               | 
               | - summary indexes that can be lexical (document
               | hierarchy) or not (topic, patient ID, etc), esp via
               | careful entity extraction & linking
               | 
               | - domain-optimized summarization templates, both
               | automated & manual
               | 
               | - + as mentioned, wider context around these at retrieval
               | 
               | - introducing a more generalized framework for handling
               | different kinds of concept relations, summary indexing,
               | and retrieval around these
               | 
                | Ex: The same patient over time & docs, and separately,
                | similar kinds of patients across documents.
               | 
               | Note that I'm not actually a big fan of how the MSR paper
               | indirects the work through KG extraction, as that exits
               | the semantic domain, and we don't do it that way
               | 
                | Fundamentally, that both moves away from paltry retrieval
                | result sets that are small, have gaps, etc., and enables
                | cleaner input to the runtime query.
               | 
               | I agree it is a quick win if quality can be low and you
               | have low budget/time. Like combine a few out of the box
               | index types and do rank retrieval. But a lot of the power
               | gets lost. We are working on infra (+ OSSing it) because
               | that is an unfortunate and unnecessary state of affairs.
               | Right now llamaindex/langchain and raw vector DBs feel
                | like ad hoc and unprincipled ways to build these pipelines
                | from a software engineering and AI perspective, so from an
               | investment side, moving away from hacks and to more
               | semantic, composable, & scalable pipelines is important
               | IMO.
        
               | CharlieDigital wrote:
               | > Neo4j graph rag is typically not graph rag
               | 
               | I would mildly disagree with this; Neo4j just serves as
               | an underlying storage mechanism much like
               | Postgres+pgvector could be the underlying storage
               | mechanism for embedding-only RAG. How one extracts
               | entities and connects them in the graph happens a layer
               | above the storage layer of Neo4j (though Neo4j can also
               | do this internally). Neo4j is not magic; the application
                | layer and data modelling still have to define which
                | entities exist and how they are connected.
               | 
               | But why Neo4j? Neo4j has some nice amenities for building
               | GRAG on top of. In particular, it has packages to support
               | community partitioning including Leiden[0] (also used by
               | Microsoft's GraphRAG[1]) and Louvain[2] as well as
               | several other community detection algorithms. The built-
               | in support for node embeddings[3] as well as external AI
               | APIs[4] make the DX -- in so far as building the
               | underlying storage for complex retrieval -- quite good,
               | IMO.
               | 
               | The approach that we are taking is that we are importing
               | a corpus of information into Neo4j and performing ETL on
               | the way in to create additional relationships;
               | effectively connecting individual chunks by some related
               | "facet". Then we plan to run community detection over it
               | to identify communities of interest and use a combination
               | of communities, locality, and embedding match to retrieve
               | chunks.
               | 
               | I just started exploring it over the past week and I
               | would say that if your team is going to end up doing some
               | more complex GRAG, then Neo4j feels like it has the right
               | tooling to be the underlying storage layer and you
               | _could_ even feasibly implement other parts of your
               | workflow in there as well, but entity extraction and such
               | feels like it belongs one layer up in the application
               | layer. Most notably, having direct query access to the
               | underlying graph with a graph query language (Cypher)
               | means that you will have more control and different ways
                | to experiment with retrieval. However, as I mentioned, I
               | would encourage most teams to be more clever with
               | embedding RAG before adding more infrastructure like
               | Neo4j.
               | 
               | [0] https://neo4j.com/docs/graph-data-
               | science/current/algorithms...
               | 
               | [1] https://microsoft.github.io/graphrag/
               | 
               | [2] https://neo4j.com/docs/graph-data-
               | science/current/algorithms...
               | 
               | [3] https://neo4j.com/docs/graph-data-
               | science/current/machine-le...
               | 
               | [4] https://neo4j.com/labs/apoc/5/ml/openai/
        
               | lmeyerov wrote:
               | We generally stick with using neo4j/neptune/etc for more
               | operational OLTP graph queries, basically large-scale
               | managed storage for small neighborhood lookups. As soon
               | as the task becomes more compute-tier AI workloads, like
               | LLM summary indexing of 1M tweets or 10K documents, we
               | prefer to use GPU-based compute stacks & external APIs
               | with more fidelity. Think pipelines combining bulk
               | embeddings, rich enrichment & wrangling, GNNs, community
               | detection, etc. We only dump into DBs at the end.
               | Speedups are generally in the 2-100X territory with even
               | cheapo GPUs, so this ends up a big deal for both
               | development + production. Likewise, continuous update
               | flows end up being awkward in these environments vs full
               | compute-tier ones, even ignoring the GPU aspect.
               | 
               | Separately, we're still unsure about vector search inside
               | vs outside the graph DB during retrieval, both in the
               | graph RAG scenario and the more general intelligence work
                | domains. I'm more optimistic there about keeping it in
                | the graph DB, especially for the small cases (< 10M
                | nodes+edges) we do in notebooks.
               | 
               | And agreed, it's unfortunate neo4j uses graph RAG to
               | market a variety of mostly bad quality solutions and
               | conflate it with graph db storage, and the MSR
               | researchers used it for a more specific and more notable
               | technique (in AI circles) that doesn't need a graph DB
               | and IMO, fundamentally, not even a KG. It's especially
               | confusing that both groups are 'winning' on the term...
               | in different circles.
        
         | ec109685 wrote:
         | Would it be better to go all the way and completely rewrite the
         | source material in a way more suitable for retrieval? To some
         | extent these headers are a step in that direction, but you're
         | still at the mercy of the chunk of text being suitable to
         | answer the question.
         | 
         | Instead, completely transforming the text into a dense set of
         | denormalized "notes" that cover every concept present in the
         | text seems like it would be easier to mine for answers to user
         | questions.
         | 
         | Essentially, it would be like taking comprehensive notes from a
         | book and handing them to a friend who didn't take the class for
         | a test. What would they need to be effective?
         | 
         | Longer term, the sequence would likely be "get question", hand
         | it to research assistant who has full access to source material
         | and can run a variety of AI / retrieval strategies to customize
         | the notes, and then hand those notes back for answers. By
         | spending more time on the note gathering step, it will be more
         | likely the LLM will be able to answer the question.
        
           | CharlieDigital wrote:
           | For a large corpus, this would be quite expensive in terms of
           | time and storage space. My experience is that embeddings work
           | pretty well around 144-160 tokens (pure trial and error) with
           | clinical trial protocols. I am certain that this value will
           | be different by domain and document types.
           | 
           | If you generate and then "stuff" more text into this, my
           | hunch is that accuracy drops off as the token count increases
           | and it becomes "muddy". GRAG or even normal RAG can solve
           | this to an extent because -- as you propose -- you can
           | generate a congruent "note" and then embed that and link them
           | together.
           | 
            | I'd propose something more flexible: expand on the input
            | query instead, basically multiplexing it to the related
            | topics and ideas, and perform a cheap embedding search
            | using more than 1 input vector.
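            | 
            | A rough sketch of that multiplexing idea (the `llm` and
            | `embed` callables and the `store.search` interface are
            | placeholders, not a specific library):
            | 
            |   # Sketch: expand the query into related queries, search
            |   # with one vector per expansion, and merge the hits,
            |   # keeping each chunk's best score.
            |   def multiplex(question, llm, embed, store, top_k=5):
            |       expansions = llm(
            |           "List 3 short rephrasings or related "
            |           "sub-topics for: " + question).splitlines()
            |       hits = {}
            |       for q in [question] + expansions:
            |           for cid, score in store.search(embed(q), top_k):
            |               hits[cid] = max(score, hits.get(cid, 0.0))
            |       return sorted(hits, key=hits.get, reverse=True)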
        
       | aster0id wrote:
        | I'd like to see more evaluation data. There are hundreds of RAG
        | strategies, and most of them only work on specific types of
        | queries.
        
         | gillesjacobs wrote:
          | Yeah exactly, the existing benchmark datasets are
          | underutilized (e.g. KILT, Natural Questions, etc.).
          | 
          | But it is only natural that different QA use cases require
          | different strategies. I've built 3 production RAG systems /
          | virtual assistants now, and 4 that didn't make it past PoC,
          | and which advanced techniques work really depends on document
          | type, text content and genre, use case, source knowledge-base
          | structure and metadata to exploit, etc.
         | 
          | Current go-to is semantic similarity chunking (with overlap) +
          | title or question generation > retriever with fusion of
          | bi-encoder vector similarity + classic BM25 + a condensed-
          | question reformulating QA agent. If you don't get some decent
          | results with that setup there is no hope.
         | 
         | For every project we start the creation of a use-case eval set
         | immediately in parallel with the actual RAG agent, but
          | sometimes the client doesn't think this is a priority. We
         | convinced them all it's highly important though, because it is.
         | 
         | Having an evaluation set is doubly important in GenAI projects:
         | a generative system will do unexpected things and an objective
         | measure is needed. Your client will run into weird behaviour
         | when testing and they will get hung up on a 1-in-100
         | undesirable generation.
        
           | drittich wrote:
           | How do you weight results between vector search and bm25? Do
           | you fall back to bm25 when vector similarity is below a
           | threshold, or maybe you tweak the weights by hand for each
           | data set?
        
             | gillesjacobs wrote:
             | The algorithm I use to get a final ranking from multiple
              | rankings is called "reciprocal rank fusion". I use the
             | implementation described here: https://docs.llamaindex.ai/e
             | n/stable/examples/low_level/fusi...
             | 
             | Which is the implementation from the original paper.
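              | 
              | For reference, the core of RRF is tiny; a minimal sketch
              | (k=60 is the constant used in the original paper):
              | 
              |   # Sketch of reciprocal rank fusion: each input ranking
              |   # contributes 1 / (k + rank) to a document's fused score.
              |   def rrf(rankings, k=60):
              |       scores = {}
              |       for ranking in rankings:  # each is a list of doc ids
              |           for rank, doc in enumerate(ranking, start=1):
              |               score = scores.get(doc, 0.0)
              |               scores[doc] = score + 1.0 / (k + rank)
              |       return sorted(scores, key=scores.get, reverse=True)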
        
               | drittich wrote:
               | Thanks, much appreciated!
        
         | SkyPuncher wrote:
         | RAG is akin to "search engine".
         | 
         | It's such a broad term that it's essentially useless. Nearly
         | anyone doing anything interesting with LLMs is doing RAG.
        
           | simonw wrote:
           | The definition for RAG that works for me is that you perform
           | some form of "retrieval" (could be full-text search, could be
           | vector search, could be some combination of the two or even
           | another technique like a regular expression search) and you
           | then include the results of that retrieval in the context.
           | 
           | I think it's a useful term.
        
       | siquick wrote:
       | I can't imagine any serious RAG application is not doing this -
       | adding a contextual title, summary, keywords, and questions to
       | the metadata of each chunk is a pretty low effort/high return
       | implementation.
        
         | visarga wrote:
         | Text embeds don't capture inferred data, like "second letter of
         | this text" does not embed close to "e". LLM chain of thought is
         | required to deduce the meaning more completely.
        
           | derefr wrote:
           | Given current SOTA, no, they don't.
           | 
           | But there's no reason why they _couldn't_ -- just capture the
           | vectors of some of the earlier hidden layers during the RAG
           | encoder's inference run, and append these intermediate
           | vectors to the final embedding vector of the output layer to
           | become the vectors you throw into your vector DB. (And then
           | do the same at runtime for embedding your query prompts.)
           | 
           | Probably you'd want to bias those internal-layer vectors,
           | giving them an increasingly-high "artificial distance"
           | coefficient for increasingly-early layers -- so that a
           | document closely matching in token space or word space or
            | syntax-node space improves its retrieval rank a _bit_, but
           | not nearly as much as if the document were a close match in
           | concept space. (But maybe do something nonlinear instead of
           | multiplication here -- you might want _near-identical_ token-
           | wise or syntax-wise matches to show up despite different
           | meanings, depending on your use-case.)
           | 
           | Come to think, you could probably build a pretty good source-
           | code search RAG off of this approach.
           | 
           | (Also, it should hopefully be obvious here that if you fine-
           | tuned an encoder-decoder LLM to label matches based on
           | criteria where some of those criteria are only available in
           | earlier layers, then you'd be training pass-through vector
           | dimensions into the intermediate layers of the encoder --
           | such that using such an encoder on its own for RAG embedding
           | should produce the same effect as capturing + weighting the
           | intermediate layers of a non-fine-tuned LLM.)
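            | 
            | A hedged sketch of the capture step with Hugging Face
            | transformers (the model name, mean pooling, and the per-
            | layer weights are illustrative assumptions, not a tested
            | recipe):
            | 
            |   # Sketch: concatenate down-weighted mean-pooled hidden
            |   # states from a few earlier layers with the final layer.
            |   import torch
            |   from transformers import AutoModel, AutoTokenizer
            | 
            |   name = "sentence-transformers/all-MiniLM-L6-v2"
            |   tok = AutoTokenizer.from_pretrained(name)
            |   model = AutoModel.from_pretrained(name)
            | 
            |   def layered_embedding(text, layers=(2, 4), w=(0.25, 0.5)):
            |       inputs = tok(text, return_tensors="pt", truncation=True)
            |       with torch.no_grad():
            |           out = model(**inputs, output_hidden_states=True)
            |       pooled = [h.mean(dim=1).squeeze(0)
            |                 for h in out.hidden_states]
            |       parts = [wi * pooled[i] for i, wi in zip(layers, w)]
            |       parts.append(pooled[-1])  # final layer, full weight
            |       return torch.cat(parts)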
        
         | samx18 wrote:
         | I agree, most production RAG systems have been doing this since
         | last year
        
         | J_Shelby_J wrote:
         | How do you generate keywords in a low effort way for each
         | chunk?
         | 
          | Asking an LLM is low effort to do, but it's not efficient nor
         | guaranteed to be correct.
        
           | kkzz99 wrote:
            | If the economics justify it, you can use a cheap or lower-
            | end model to generate the meta information. Considering how
            | cheap gpt-4o-mini is, that seems pretty plausible to do.
           | 
           | At my startup we also got pretty good results using 7B/8B
           | models to generate meta information about chunks/parts of
           | text.
        
         | derefr wrote:
         | > adding a contextual title, summary, keywords, and questions
         | 
         | That's interesting; do you then transform the question-as-
         | prompt before embedding it at runtime, so that it "asks for"
         | that metadata to be in the response? Because otherwise, it
         | would seem to me that you're just making it harder for the
         | prompt vector and the document vectors to match.
         | 
         | (I guess, if it's _equally_ harder in all cases, then that
         | might be fine. But if some of your documents have few tags or
         | no title or something, they might be unfairly advantaged in a
         | vector-distance-ranked search, because the formats of the
         | documents more closely resemble the response format the
         | question was expecting...)
        
       | gillesjacobs wrote:
       | I really want to see some evaluation benchmark comparisons on in-
       | chunk augmentation approaches like this (and question, title,
       | header-generation) and the hybrid retrieval approach where you
       | match at multiple levels: first retrieve/filter on a higher-level
       | summary, title or header, then match the related chunks.
       | 
       | The pure vector approach of in-chunk text augmentation is much
       | simpler of course, but my hypothesis is that the resulting vector
        | will cause too many false positives in retrieval.
       | 
        | In my experience, with vector similarity, retrieval precision is
        | most commonly the problem, not recall. This method will indeed
       | improve recall for out-of-context chunks, but for me recall has
       | not been a problem very often.
        
       | bob1029 wrote:
       | I've found the best approach is to start with traditional full
       | text search. Get it to a point where manual human searches are
        | useful - especially for users who don't have a stake in the
        | development of an AI solution. _Then_, look at building a RAG-
       | style solution around the FTS.
       | 
       | I never could get much beyond the basic search piece. I don't see
       | how mixing in a black box AI model with probabilistic outcomes
       | could add any value without having this working first.
        
         | k__ wrote:
         | I always wondered why a RAG index has to be a vector DB.
         | 
         | If the model understands text/code and can generate text/code
         | it should be able to talk to OpenSearch no problem.
        
           | te_chris wrote:
           | Honestly you clocked the secret: it doesn't.
           | 
            | It makes sense for the hype, though. As we got LLMs we also
           | got wayyyy better embedding models, but they're not
           | dependencies.
        
           | simonw wrote:
           | It doesn't have to be a vector DB - and in fact I'm seeing
           | increasing skepticism that embedding vector DBs are the best
           | way to implement RAG.
           | 
           | A full-text search index using BM25 or similar may actually
           | work a lot better for many RAG applications.
           | 
           | I wrote up some notes on building FTS-based RAG here:
           | https://simonwillison.net/2024/Jun/21/search-based-rag/
        
             | rcarmo wrote:
             | I've been using SQLite FTS (which is essentially BM25) and
             | it works so well I haven't really bothered with vector
             | databases, or Postgres, or anything else yet. Maybe when my
             | corpus exceeds 2GB...
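              | 
              | For anyone curious, a minimal sketch of that kind of setup
              | (table and column names here are just illustrative):
              | 
              |   # Sketch: SQLite FTS5 with BM25 ranking as the
              |   # retrieval layer for RAG.
              |   import sqlite3
              | 
              |   db = sqlite3.connect("docs.db")
              |   db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
              |              "USING fts5(title, body)")
              |   db.execute("INSERT INTO chunks VALUES (?, ?)",
              |              ("Intro", "Retrieval-augmented generation..."))
              | 
              |   rows = db.execute(
              |       "SELECT title, body FROM chunks "
              |       "WHERE chunks MATCH ? ORDER BY bm25(chunks) LIMIT 5",
              |       ("retrieval",)).fetchall()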
        
             | ianbutler wrote:
             | In 2019 I was using vector search to narrow the search
             | space within 100s of millions of documents and then do full
             | text search on the top 10k or so docs.
             | 
             | That seems like a better stacking of the technologies even
             | now
        
             | niam wrote:
             | What are the arguments for embedded vector DBs being
             | suboptimal in RAG, out of curiosity?
        
               | simonw wrote:
               | The biggest one is that it's hard to get "zero matches"
               | from an embeddings database. You get back all results
               | ordered by distance from the user's query, but it will
               | really scrape the bottom of the barrel if there aren't
               | any great matches - which can lead to bugs like this one:
               | https://simonwillison.net/2024/Jun/6/accidental-prompt-
               | injec...
               | 
               | The other problem is that embeddings search can miss
               | things that a direct keyword match would have caught. If
               | you have key terms that are specific to your corpus -
               | product names for example - there's a risk that a vector
               | match might not score those as highly as BM25 would have
               | so you may miss the most relevant documents.
               | 
               | Finally, embeddings are much more black box and hard to
               | debug and reason about. We have decades of experience
               | tweaking and debugging and improving BM25-style FTS
               | search - the whole field of "Information Retrieval".
               | Throwing that all away in favour of weird new embedding
               | vectors is suboptimal.
        
           | verdverm wrote:
           | You can view RAG as a bigger word2vec. The canonical example
           | being "king - man + woman = queen". Words, or now chunks,
           | have geometric distribution, cluster, and relationships... on
           | semantic levels
           | 
           | What is happening is that text is being embedded into a
           | different space, and that format is an array of floats (a
           | point in the embedding space). When we do retrieval, we embed
           | the query and then find other points close to that query. The
           | reason for Vector DB is (1) to optimize for this use-case, we
           | have many specialized data stores / indexes (redis, elastic,
           | dolt, RDBMS) (2) often to be memory based for faster
           | retrieval. PgVector will be interesting to watch. I
           | personally use Qdrant
           | 
           | Full-text search will never be able to do some of the things
           | that are possible in the embedding space. The most capable
           | systems will use both techniques
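            | 
            | The retrieval step itself is just nearest-neighbour search
            | in that space; a toy sketch with numpy (the chunk vectors
            | are assumed to come from whatever embedding model you use):
            | 
            |   # Sketch: rank stored chunk vectors by cosine similarity
            |   # to the query vector -- the core of what a vector DB
            |   # optimizes and indexes at scale.
            |   import numpy as np
            | 
            |   def top_k(query_vec, chunk_vecs, k=5):
            |       q = query_vec / np.linalg.norm(query_vec)
            |       m = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1,
            |                                       keepdims=True)
            |       sims = m @ q
            |       return np.argsort(-sims)[:k]  # closest chunk indices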
        
         | petercooper wrote:
         | You're right, and it's also possible to still use LLMs and
         | vector search in such a system, but instead you use them to
         | enrich the _queries_ made to traditional, pre-existing
         | knowledge bases and search systems. Arguably you could call
         | this  "generative assisted retrieval" or GAR.. sadly I didn't
         | coin the term, there's a paper about it ;-)
         | https://aclanthology.org/2021.acl-long.316/
        
         | alexmolas wrote:
         | But with FTS you don't solve the "out-of-context chunk
         | problem". You'll still miss relevant chunks with FTS. You still
         | can apply the approach proposed in the post to FTS, but instead
         | of using similarity you could use BM25.
        
         | zby wrote:
         | Traditional FTS returns the whole document - people take over
         | from that point and locate the interesting content there. The
         | problem with RAG is that it does not follow that procedure - it
          | tries to find the interesting chunk in one step, even though
          | since ReAct we know that LLMs could follow the same procedure
          | as humans.
         | 
         | But we need an iterative RAG anyway:
         | https://zzbbyy.substack.com/p/why-iterative-thinking-is-cruc...
        
       | ankit219 wrote:
       | As is typical with any RAG strategy/algorithm, the implicit thing
       | is it works on a specific dataset. Then, it solves a very
       | specific use case. The thing is, if you have a dataset and a use
       | case, you can have a very custom algorithm which would work
       | wonders in terms of output you need. There need not be anything
       | generic.
       | 
       | My instinct at this point is, these algos look attractive because
       | we are constrained to giving a user a wow moment where they
       | upload something and get to chat with the doc/dataset within
       | minutes. As attractive as that is, it is a distinct second
        | priority to building a system that works 99% of the time, even
        | if it takes a day or two to set up. You get a feel for the data
        | and the type of questions that may be asked, and create an algo
       | that works for a specific type of dataset-usecase combo (assuming
       | any more data you add in this system would be similar and work
       | pretty well). There is no silver bullet that we seem to be
       | searching for.
        
         | cl42 wrote:
         | 100% agree with you. I've built a # of RAG systems and find
         | that simple Q&A-style use cases actually do fine with
         | traditional chunking approaches.
         | 
         | ... and then you have situations where people ask complex
         | questions with multiple logical steps, or knowledge gathering
         | requirements, and using some sort of hierarchical RAG strategy
         | works better.
         | 
         | I think a lot of solutions (including this post) abstract to
         | building knowledge graphs of some sort... But knowledge graphs
          | still require an ontology associated with the problem you're
         | solving and will fail outside of those domains.
        
       | oshams wrote:
       | Have you considered this approach? Worked well for us:
       | https://news.ycombinator.com/item?id=40998497
        
       | Sharlin wrote:
       | "An Outside Context Problem was the sort of thing most
       | civilisations encountered just once, and which they tended to
       | encounter rather in the same way a sentence encountered a full
       | stop."
       | 
       | https://www.goodreads.com/quotes/9605621-an-outside-context-...
       | 
       | (Sorry, I just _had_ to post this quote because it was the first
        | thing that came to my mind when I saw the title, and I've been
       | re-reading Banks lately.)
        
       | unixhero wrote:
        | I have identified a pain point where my RAGs are insufficiently
        | answering what I already had with a long-tail DAG in production.
        
       ___________________________________________________________________
       (page generated 2024-07-24 23:03 UTC)