[HN Gopher] Solving the out-of-context chunk problem for RAG
___________________________________________________________________
Solving the out-of-context chunk problem for RAG
Author : zmccormick7
Score : 185 points
Date : 2024-07-22 13:44 UTC (2 days ago)
(HTM) web link (d-star.ai)
(TXT) w3m dump (d-star.ai)
| Satam wrote:
| RAG feels hacky to me. We're coming up with these pseudo-
| technical solutions to help, but really these problems should be
| solved at the level of the model by researchers. Until this is
| solved natively, the attempts will be hacky duct-taped solutions.
| repeekad wrote:
| What about fresh data, like an extremely relevant news headline
| that was published 10 minutes ago? Or private data that I don't
| want stored offsite but am okay trusting to an enterprise no-log
| API? Providing realtime context to LLMs isn't "hacky"; model
| intelligence and RAG can complement each other and make
| advancements in tandem.
| jstummbillig wrote:
| I don't think the parent's idea was to bake all information
| into the model, just that current RAG feels cumbersome to use
| (but then again, so do most things AI right now) and that
| information access should be an intrinsic part of the model.
| randomdata wrote:
| Is there a specific shortcoming of the model that could be
| improved, or are we simply seeking better APIs?
| PaulHoule wrote:
| One of my favorite cases is sports chat. I'd expect ChatGPT
| to be able to talk about sports legends but not be able to
| talk about a game that happened last weekend. Copilot usually
| does a good job because it can look up the game on Bing and
| them summarize but the other day i asked it "What happened
| last week in the NFL" and it told me about a Buffalo Bills
| game from last year (did it know I was in the Bills
| geography?)
|
| Some kind of incremental fine tuning is probably necessary to
| keep a model like ChatGPT up to date but I can't picture it
| happening each time something happens in the news.
| ec109685 wrote:
| For the current game, it seems solvable by providing it the
| box score and the radio commentary as context, perhaps with
| some additional data derived from recent games and news.
|
| I think you'd get a close approximation of speaking with
| someone who was watching the game with you.
| viraptor wrote:
| That's so vague I can't tell what you're suggesting. What
| specifically do you think needs solving at the model level?
| What should work differently?
| Satam wrote:
| There's probably a lack of capabilities on multiple fronts. RAG
| might have the right general idea, but currently the retrieval
| seems to be too separated from the model itself. I don't know
| how our brains do it, but retrieval looks to be more
| integrated there.
|
| Models currently also have no way to update themselves with
| new info besides us putting data into their context window.
| They don't learn after the initial training. It seems if they
| could, say, read documentation and internalize it, the need
| for RAG or even large context windows would decrease. Humans
| somehow are able to build understanding of extensive topics
| with what feels to be a much shorter context-window.
| simonw wrote:
| Don't forget the importance of data privacy. Updating a
| model with fresh information makes that information
| available to ALL users of that model. This often isn't what
| you want - you can run RAG against a user's private email
| to answer just their queries, without making that email
| "baked in" to the model.
| viraptor wrote:
| You don't need to update the whole model for everyone.
| Fine-tuning exists and is even available as a service from
| OpenAI. The updates are only visible in the specific
| models you see.
| simonw wrote:
| Maintaining a fine-tuned model for every one of your
| users - even with techniques like LoRA - sounds
| complicated and expensive to me!
| lowdest wrote:
| It is, but it's also not that bad. A copy of the weights
| is X GB of cloud storage, which can be stored as a diff
| if that helps, plus added compute time for loading a custom
| model and unloading it for the next customer. It's not free,
| but it's an approachable cost for a premium service.
| PaulHoule wrote:
| I can answer questions off the cuff based on the weights of
| the neural network in my head. If I really wanted to get
| the right answers I would do "RAG" in the sense of looking
| up answers on the web or at the library and summarizing
| them.
|
| For instance, I have a policy that I try hard not to say
| anything like "most people think that..." without providing
| links, because I work at an archive of public opinion data,
| and if it gets out that one of our people was spouting
| false information about our domain, even if we weren't
| advertising the affiliation, that would look bad.
| michalwarda wrote:
| I guess it's because people are not using tools enough yet.
| In my tests, giving the LLM access to tools for retrieval works
| much better than trying to guess what the RAG pipeline would
| need to answer, i.e. the LLM decides whether it has all of the
| necessary information to answer the question. If not, let it
| search for it. If it still fails, then let it search more :D
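|
| A minimal sketch of such a loop (the search() helper over your
| index is hypothetical, and the OpenAI tool-calling API is used
| purely as an illustration, not a statement of anyone's setup):
|
|   import json
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def search(query: str) -> str:
|       """Hypothetical retrieval helper: query your index, return text."""
|       raise NotImplementedError
|
|   tools = [{
|       "type": "function",
|       "function": {
|           "name": "search",
|           "description": "Search the knowledge base for relevant passages.",
|           "parameters": {
|               "type": "object",
|               "properties": {"query": {"type": "string"}},
|               "required": ["query"],
|           },
|       },
|   }]
|
|   messages = [{"role": "user", "content": "What changed in policy X?"}]
|
|   # The model decides whether it needs to search; loop until it answers.
|   for _ in range(5):  # cap the number of retrieval rounds
|       resp = client.chat.completions.create(
|           model="gpt-4o-mini", messages=messages, tools=tools)
|       msg = resp.choices[0].message
|       if not msg.tool_calls:
|           print(msg.content)
|           break
|       messages.append(msg)
|       for call in msg.tool_calls:
|           query = json.loads(call.function.arguments)["query"]
|           messages.append({"role": "tool", "tool_call_id": call.id,
|                            "content": search(query)})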
| zmccormick7 wrote:
| Agreed. Retrieval performance is very dependent on the
| quality of the search queries. Letting the LLM generate
| the search queries is much more reliable than just
| embedding the user input. Also, no retrieval system is
| going to return everything needed on the first try, so
| using a multi-step agent approach to retrieving
| information is the only way I've found to get extremely
| high accuracy.
| mewpmewp2 wrote:
| Our brains don't do this either. We can't memorise all
| the things in the world. For us, a library or Google Search is
| what RAG is for an LLM.
| emrah wrote:
| I think he is saying we should be making fine-tuning, or
| other similar model-altering methods, easier rather than
| messing with bolt-on solutions like RAG.
|
| Those are being worked on, and RAG is the duct-tape solution
| until they become available.
| ac1spkrbox wrote:
| The set of techniques for retrieval is immature, but it's
| important to note that just relying on model context or few-
| shot prompting has many drawbacks. Perhaps the most important
| is that retrieval as a task should not rely on generative
| outputs.
| danielbln wrote:
| It's also subject to significantly more hallucination when
| the knowledge is baked into the model, vs being injected into
| the context at runtime.
| l72 wrote:
| I've described it this way to my colleagues:
|
| RAG is a bit like having a pretty smart person take an open
| book test on a subject they are not an expert in. If your book
| has a good chapter layout and index, you probably do an ok job
| trying to find relevant information, quickly read it, and try
| to come up with an answer. But you're not going to be able to
| test for a deep understanding of the material. This person is
| going to struggle if each chapter/concept builds on the
| previous concept, as you can't just look up something in
| Chapter 10 and be able to understand it without understanding
| Chapters 1-9.
|
| Fine-tuning is a bit more like having someone go off and do a
| PhD and specialize in a specific area. They get a much deeper
| understanding of the problem space and can conceptualize at a
| different level.
| williamtrask wrote:
| Fwiw, I used to think this way too but LLMs are more RAG-like
| internally than we initially realised. Attention is all you
| need ~= RAG is a big attention mechanism. Models have the
| reversal curse, memorisation issues, etc. I personally think of
| LLMs as a
| kind of decomposed RAG. Check out DeepMind's RETRO paper for an
| even closer integration.
| jejeyyy77 wrote:
| The biggest problem with RAG is that the bottleneck for your
| product is now the RAG (i.e, results are only as good as what
| your vector store sends to the LLM). This is a step backwards.
|
| Source: built a few products using RAG+LLM.
| zby wrote:
| I guess you can imagine an LLM that contains all information
| there is - but it would have to be at least as big as all
| information there is, or it would have to hallucinate. Not to
| mention that it seems you would also require it to learn
| everything immediately. I don't see any realistic way to reach
| that goal.
|
| To reach their potential LLMs need to know how to use external
| sources.
|
| Update: After some more thinking - if you required it to know
| information about itself - then this would lead to some paradox
| - I am sure.
| CharlieDigital wrote:
| The easiest solution to this is to stuff the heading into the
| chunk. The heading is hierarchical navigation within the sections
| of the document.
|
| I found Azure Document Intelligence specifically with the Layout
| Model to be fantastic for this because it can identify headers.
| All the better if you write a parser for the output JSON to track
| depth and stuff multiple headers from the path into the chunk.
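|
| A rough sketch of the stuffing step (assuming the parser has
| already produced the document title and the list of headers on
| the path to the chunk; the names here are illustrative):
|
|   def build_chunk_text(doc_title: str, heading_path: list[str],
|                        chunk: str) -> str:
|       """Prepend the document title and heading hierarchy to the chunk
|       before embedding, so the vector carries surrounding context."""
|       header = " > ".join([doc_title, *heading_path])
|       return f"{header}\n\n{chunk}"
|
|   # build_chunk_text("Trial Protocol 123", ["Results", "Adverse Events"],
|   #                  "Three participants reported mild headaches...")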
| williamcotton wrote:
| _Contextual chunk headers
|
| The idea here is to add in higher-level context to the chunk by
| prepending a chunk header. This chunk header could be as simple
| as just the document title, or it could use a combination of
| document title, a concise document summary, and the full
| hierarchy of section and sub-section titles._
|
| That is from the article. Is this different from your suggested
| approach?
| CharlieDigital wrote:
| No, but this is also not really a novel solution.
| lmeyerov wrote:
| So subtle! The article is about doing exactly that, which is
| something we are doing a lot of right now... though it seems to
| snatch defeat from the jaws of victory:
|
| If we think about what this is about, it is basically entity
| augmentation & lexical linking / citations.
|
| Ex: A patient document may be all about patient id 123. That
| won't be spelled out in every paragraph, but by carrying along
| the patient ID (semantic entity) and the document (citation),
| the combined model gets access to them. A naive one-shot
| retrieval over a naive chunked vector index would want it in
| the text/embedding, while a smarter one would also put it in the
| entry metadata. And as others write, this helps move reasoning from
| the symbolic domain to the semantic domain, so less of a hack.
|
| We are working on some fun 'pure-vector' graph RAG work here to
| tackle production problems around scale, quality, & always-on
| scenarios like alerting - happy to chat!
| CharlieDigital wrote:
| Also working with GRAG (via Neo4j), and I'm somewhat skeptical
| that, for most cases where a natural hierarchical structure
| already exists, graph RAG will significantly exceed RAG with
| the hierarchical structure.
|
| A better solution I had thought about is "local RAG". I came
| across this while processing embeddings from chunks parsed
| from Azure Document Intelligence JSON. The realization is
| that relevant topics are often localized within a document.
| Even across a corpus of documents, relevant passages are
| localized.
|
| Because the chunks are processed sequentially, one needs only
| to keep track of the sequence number of the chunk. Assume that
| the embedding matches with a chunk _n_; then it would follow
| that the most important context is the chunks localized at
| _n - m_ and _n + p_. So find the top _x_ chunks via hybrid
| embedding + full text match and expand outwards from each of
| the chunks to grab the chunks around it.
|
| While a chunk may represent just a few sentences of a larger
| block of text, this strategy will grab possibly the whole
| section or page of text localized around the chunk with the
| highest match.
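|
| A minimal sketch of that expansion step (assuming chunks are
| stored in document order and a hybrid top_chunks() search
| already returns matching indices; both names are illustrative):
|
|   def expand_local_context(chunks: list[str], hits: list[int],
|                            before: int = 2, after: int = 2) -> list[str]:
|       """For each matched index n, pull in neighbours n-before..n+after,
|       deduplicating and preserving document order."""
|       keep = set()
|       for n in hits:
|           keep.update(range(max(0, n - before),
|                             min(len(chunks), n + after + 1)))
|       return [chunks[i] for i in sorted(keep)]
|
|   # hits = top_chunks(query, k=5)   # hypothetical hybrid vector + FTS search
|   # context = expand_local_context(all_chunks, hits)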
| michalwarda wrote:
| This works as long as relevant information is colocated.
| Sometimes though, for example in financial documents,
| important parts reference each other through keywords etc.
| That's why you can always try and retrieve not only
| positionally related chunks but also semantically related
| ones.
|
| Go for chunk n, n - m, n + p and n' where n' are closest
| chunks to n semantically.
|
| Moreover you can give this traversal possibility to your
| LLM to use itself as a tool or w/e whenever it is missing
| crucial information to answer the question. Thanks to that
| you don't always retrieve thousands of tokens even when not
| needed.
| CharlieDigital wrote:
| > positionally related chunks but also semantically
| related ones
|
| That's why the entry point would still be an embedding
| search; it's just that instead of using the first 20
| embedding hits, you take the first 5 and if the reference
| is "semantically adjacent" to the entry concept, we would
| expect that some of the first few chunks would capture it
| in most cases.
|
| I think where GRAG yields more relevancy is when the
| referenced content is _not_ semantically similar nor even
| semantically adjacent to the entry concept but is
| semantically similar to some sub fragment of a matched
| chunk. Depending on the corpus, this can either be common
| (no familiarity with financial documents) or rare. I've
| primarily worked with clinical trial protocols and at
| least in that space, the concepts are what I would
| consider "snowflake-shaped" in that it branches out
| pretty cleanly and rarely cross-references (because it is
| more common that it repeats the relevant reference).
|
| All that said, I think that as a matter of practicality,
| most teams will probably get much bigger yield with much
| less effort doing local expansion based on matching for
| semantic similarity first since it addresses two core
| problems with embeddings (text chunk size vs embedding
| accuracy, relevancy of embeddings matched below a given
| threshold). Experiment with GRAG depending on the type of
| questions you're trying to answer and the nature of the
| underlying content. Don't get me wrong; I'm not saying
| GRAG has _no_ benefit, but that most teams can explore
| other ways of using RAG before trying GRAG.
| lmeyerov wrote:
| Neo4j graph rag is typically not graph rag in the AI
| sense / MSR Graph RAG paper sense, but KG or lexical
| extraction & embedding, and some retrieval time hope of
| the neighborhood being ok
|
| GRAG in the direction of the MSR paper adds some
| important areas:
|
| - summary indexes that can be lexical (document
| hierarchy) or not (topic, patient ID, etc), esp via
| careful entity extraction & linking
|
| - domain-optimized summarization templates, both
| automated & manual
|
| - + as mentioned, wider context around these at retrieval
|
| - introducing a more generalized framework for handling
| different kinds of concept relations, summary indexing,
| and retrieval around these
|
| Ex: The same patient over time & docs, and separately,
| similar kinds of patients across documents
|
| Note that I'm not actually a big fan of how the MSR paper
| indirects the work through KG extraction, as that exits
| the semantic domain, and we don't do it that way
|
| Fundamentally, that both moves away from paltry retrieval
| result sets that are small, have gaps, etc., and enables
| cleaner input to the runtime query
|
| I agree it is a quick win if quality can be low and you
| have low budget/time. Like combine a few out of the box
| index types and do rank retrieval. But a lot of the power
| gets lost. We are working on infra (+ OSSing it) because
| that is an unfortunate and unnecessary state of affairs.
| Right now llamaindex/langchain and raw vector DBs feel
| like ad hoc and unprincipled ways to build these pipelines
| from a software engineering and AI perspective, so from an
| investment side, moving away from hacks and to more
| semantic, composable, & scalable pipelines is important
| IMO.
| CharlieDigital wrote:
| > Neo4j graph rag is typically not graph rag
|
| I would mildly disagree with this; Neo4j just serves as
| an underlying storage mechanism much like
| Postgres+pgvector could be the underlying storage
| mechanism for embedding-only RAG. How one extracts
| entities and connects them in the graph happens a layer
| above the storage layer of Neo4j (though Neo4j can also
| do this internally). Neo4j is not magic; the application
| layer and data modelling still have to define which
| entities exist and how they are connected.
|
| But why Neo4j? Neo4j has some nice amenities for building
| GRAG on top of. In particular, it has packages to support
| community partitioning including Leiden[0] (also used by
| Microsoft's GraphRAG[1]) and Louvain[2] as well as
| several other community detection algorithms. The built-
| in support for node embeddings[3] as well as external AI
| APIs[4] make the DX -- in so far as building the
| underlying storage for complex retrieval -- quite good,
| IMO.
|
| The approach that we are taking is that we are importing
| a corpus of information into Neo4j and performing ETL on
| the way in to create additional relationships;
| effectively connecting individual chunks by some related
| "facet". Then we plan to run community detection over it
| to identify communities of interest and use a combination
| of communities, locality, and embedding match to retrieve
| chunks.
|
| I just started exploring it over the past week and I
| would say that if your team is going to end up doing some
| more complex GRAG, then Neo4j feels like it has the right
| tooling to be the underlying storage layer and you
| _could_ even feasibly implement other parts of your
| workflow in there as well, but entity extraction and such
| feels like it belongs one layer up in the application
| layer. Most notably, having direct query access to the
| underlying graph with a graph query language (Cypher)
| means that you will have more control and different ways
| to experiment with retrieval. However; as I mentioned, I
| would encourage most teams to be more clever with
| embedding RAG before adding more infrastructure like
| Neo4j.
|
| [0] https://neo4j.com/docs/graph-data-
| science/current/algorithms...
|
| [1] https://microsoft.github.io/graphrag/
|
| [2] https://neo4j.com/docs/graph-data-
| science/current/algorithms...
|
| [3] https://neo4j.com/docs/graph-data-
| science/current/machine-le...
|
| [4] https://neo4j.com/labs/apoc/5/ml/openai/
| lmeyerov wrote:
| We generally stick with using neo4j/neptune/etc for more
| operational OLTP graph queries, basically large-scale
| managed storage for small neighborhood lookups. As soon
| as the task becomes more compute-tier AI workloads, like
| LLM summary indexing of 1M tweets or 10K documents, we
| prefer to use GPU-based compute stacks & external APIs
| with more fidelity. Think pipelines combining bulk
| embeddings, rich enrichment & wrangling, GNNs, community
| detection, etc. We only dump into DBs at the end.
| Speedups are generally in the 2-100X territory with even
| cheapo GPUs, so this ends up a big deal for both
| development + production. Likewise, continuous update
| flows end up being awkward in these environments vs full
| compute-tier ones, even ignoring the GPU aspect.
|
| Separately, we're still unsure about vector search inside
| vs outside the graph DB during retrieval, both in the
| graph RAG scenario and the more general intelligence work
| domains. I'm more optimistic there about keeping it in the
| graph DB, especially for small cases (< 10M nodes+edges)
| we do in notebooks.
|
| And agreed, it's unfortunate neo4j uses graph RAG to
| market a variety of mostly bad quality solutions and
| conflate it with graph db storage, and the MSR
| researchers used it for a more specific and more notable
| technique (in AI circles) that doesn't need a graph DB
| and IMO, fundamentally, not even a KG. It's especially
| confusing that both groups are 'winning' on the term...
| in different circles.
| ec109685 wrote:
| Would it be better to go all the way and completely rewrite the
| source material in a way more suitable for retrieval? To some
| extent these headers are a step in that direction, but you're
| still at the mercy of the chunk of text being suitable to
| answer the question.
|
| Instead, completely transforming the text into a dense set of
| denormalized "notes" that cover every concept present in the
| text seems like it would be easier to mine for answers to user
| questions.
|
| Essentially, it would be like taking comprehensive notes from a
| book and handing them to a friend who didn't take the class for
| a test. What would they need to be effective?
|
| Longer term, the sequence would likely be "get question", hand
| it to research assistant who has full access to source material
| and can run a variety of AI / retrieval strategies to customize
| the notes, and then hand those notes back for answers. By
| spending more time on the note gathering step, it will be more
| likely the LLM will be able to answer the question.
| CharlieDigital wrote:
| For a large corpus, this would be quite expensive in terms of
| time and storage space. My experience is that embeddings work
| pretty well around 144-160 tokens (pure trial and error) with
| clinical trial protocols. I am certain that this value will
| be different by domain and document types.
|
| If you generate and then "stuff" more text into this, my
| hunch is that accuracy drops off as the token count increases
| and it becomes "muddy". GRAG or even normal RAG can solve
| this to an extent because -- as you propose -- you can
| generate a congruent "note" and then embed that and link them
| together.
|
| I'd propose something more flexible: expand on the input
| query instead, basically multiplexing it to the related
| topics and ideas, and perform a cheap embedding search
| using more than one input vector.
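|
| A sketch of that multiplexing idea (embed() and the LLM-generated
| expansions are assumptions, not a specific library):
|
|   import numpy as np
|
|   def multiplex_search(query: str, expansions: list[str],
|                        chunk_vecs: np.ndarray, top_k: int = 20) -> list[int]:
|       """Embed the original query plus its expansions and keep the best
|       cosine score per chunk across all of the query vectors."""
|       q = np.array([embed(t) for t in [query, *expansions]])  # hypothetical embed()
|       q = q / np.linalg.norm(q, axis=1, keepdims=True)
|       c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
|       scores = (c @ q.T).max(axis=1)  # best match over the query vectors
|       return list(np.argsort(-scores)[:top_k])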
| aster0id wrote:
| I'd like to see more evaluation data. There are 100s of RAG
| strategies, and most of them only work on specific types of
| queries.
| gillesjacobs wrote:
| Yeah exactly, existing benchmark datasets available are
| underutilized (eg KILT, Natural questions, etc.).
|
| But it is only natural that different QA use cases require
| different strategies. I've built 3 production RAG systems /
| virtual assistants now, and 4 that didn't make it past PoC, and
| which advanced techniques work really depends on document type,
| text content and genre, use case, source knowledge-base
| structure, metadata to exploit, etc.
|
| Current go-to is semantic similarity chunking (with overlap) +
| title or question generation > retriever with fusion of
| bi-encoder vector similarity + classic BM25 + a condensed-
| question reformulation QA agent. If you don't get some decent
| results with that setup
| there is no hope.
|
| For every project we start the creation of a use-case eval set
| immediately in parallel with the actual RAG agent, but
| sometimes the client doesn't think this is a priority. We
| convinced them all it's highly important though, because it is.
|
| Having an evaluation set is doubly important in GenAI projects:
| a generative system will do unexpected things and an objective
| measure is needed. Your client will run into weird behaviour
| when testing and they will get hung up on a 1-in-100
| undesirable generation.
| drittich wrote:
| How do you weight results between vector search and bm25? Do
| you fall back to bm25 when vector similarity is below a
| threshold, or maybe you tweak the weights by hand for each
| data set?
| gillesjacobs wrote:
| The algorithm I use to get a final ranking from multiple
| rankings is called "reciprocal ranked fusion". I use the
| implementation described here: https://docs.llamaindex.ai/e
| n/stable/examples/low_level/fusi...
|
| Which is the implementation from the original paper.
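|
| For reference, the core of the algorithm is tiny; a sketch with
| the paper's usual k=60:
|
|   from collections import defaultdict
|
|   def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
|       """score(d) = sum over ranked lists of 1 / (k + rank of d in that list)."""
|       scores = defaultdict(float)
|       for ranking in rankings:
|           for rank, doc_id in enumerate(ranking, start=1):
|               scores[doc_id] += 1.0 / (k + rank)
|       return sorted(scores, key=scores.get, reverse=True)
|
|   # fused = reciprocal_rank_fusion([bm25_ids, vector_ids])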
| drittich wrote:
| Thanks, much appreciated!
| SkyPuncher wrote:
| RAG is akin to "search engine".
|
| It's such a broad term that it's essentially useless. Nearly
| anyone doing anything interesting with LLMs is doing RAG.
| simonw wrote:
| The definition for RAG that works for me is that you perform
| some form of "retrieval" (could be full-text search, could be
| vector search, could be some combination of the two or even
| another technique like a regular expression search) and you
| then include the results of that retrieval in the context.
|
| I think it's a useful term.
| siquick wrote:
| I can't imagine any serious RAG application is not doing this -
| adding a contextual title, summary, keywords, and questions to
| the metadata of each chunk is a pretty low effort/high return
| implementation.
| visarga wrote:
| Text embeddings don't capture inferred data: for example, "second
| letter of this text" does not embed close to "e". LLM chain of
| thought is required to deduce the meaning more completely.
| derefr wrote:
| Given current SOTA, no, they don't.
|
| But there's no reason why they _couldn't_ -- just capture the
| vectors of some of the earlier hidden layers during the RAG
| encoder's inference run, and append these intermediate
| vectors to the final embedding vector of the output layer to
| become the vectors you throw into your vector DB. (And then
| do the same at runtime for embedding your query prompts.)
|
| Probably you'd want to bias those internal-layer vectors,
| giving them an increasingly-high "artificial distance"
| coefficient for increasingly-early layers -- so that a
| document closely matching in token space or word space or
| syntax-node space improves its retrieval rank a _bit_, but
| not nearly as much as if the document were a close match in
| concept space. (But maybe do something nonlinear instead of
| multiplication here -- you might want _near-identical_ token-
| wise or syntax-wise matches to show up despite different
| meanings, depending on your use-case.)
|
| Come to think, you could probably build a pretty good source-
| code search RAG off of this approach.
|
| (Also, it should hopefully be obvious here that if you fine-
| tuned an encoder-decoder LLM to label matches based on
| criteria where some of those criteria are only available in
| earlier layers, then you'd be training pass-through vector
| dimensions into the intermediate layers of the encoder --
| such that using such an encoder on its own for RAG embedding
| should produce the same effect as capturing + weighting the
| intermediate layers of a non-fine-tuned LLM.)
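|
| A rough sketch of that hidden-layer capture with Hugging Face
| transformers (the model choice, layer picks, and weights are all
| assumptions; this is an illustration of the idea, not a tested
| recipe):
|
|   import torch
|   from transformers import AutoModel, AutoTokenizer
|
|   name = "sentence-transformers/all-MiniLM-L6-v2"
|   tok = AutoTokenizer.from_pretrained(name)
|   model = AutoModel.from_pretrained(name, output_hidden_states=True)
|
|   def layered_embedding(text, layers=(2, 4, -1), weights=(0.1, 0.3, 1.0)):
|       """Mean-pool selected hidden layers and concatenate them, scaling
|       earlier layers down so the final 'concept' layer dominates distance."""
|       with torch.no_grad():
|           out = model(**tok(text, return_tensors="pt", truncation=True))
|       parts = []
|       for layer, w in zip(layers, weights):
|           pooled = out.hidden_states[layer].mean(dim=1).squeeze(0)
|           parts.append(w * pooled / pooled.norm())
|       return torch.cat(parts)  # the vector you'd store in the vector DB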
| samx18 wrote:
| I agree, most production RAG systems have been doing this since
| last year
| J_Shelby_J wrote:
| How do you generate keywords in a low effort way for each
| chunk?
|
| Asking an LLM is low effort to do, but it's not efficient nor
| guaranteed to be correct.
| kkzz99 wrote:
| If the economic case justifies it, you can use a cheap or
| lower-end model to generate the meta information. Considering
| how cheap gpt-4o-mini is, it seems pretty plausible to do that.
|
| At my startup we also got pretty good results using 7B/8B
| models to generate meta information about chunks/parts of
| text.
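|
| A sketch of what that can look like (the prompt and JSON fields
| are illustrative, not a recommendation):
|
|   import json
|   from openai import OpenAI
|
|   client = OpenAI()
|
|   def chunk_metadata(chunk: str) -> dict:
|       """Ask a cheap model for a title, keywords and likely questions
|       to store alongside (or prepend to) the chunk."""
|       resp = client.chat.completions.create(
|           model="gpt-4o-mini",
|           response_format={"type": "json_object"},
|           messages=[{"role": "user", "content":
|                      "Return JSON with keys 'title', 'keywords' and "
|                      "'questions' describing this passage:\n\n" + chunk}],
|       )
|       return json.loads(resp.choices[0].message.content)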
| derefr wrote:
| > adding a contextual title, summary, keywords, and questions
|
| That's interesting; do you then transform the question-as-
| prompt before embedding it at runtime, so that it "asks for"
| that metadata to be in the response? Because otherwise, it
| would seem to me that you're just making it harder for the
| prompt vector and the document vectors to match.
|
| (I guess, if it's _equally_ harder in all cases, then that
| might be fine. But if some of your documents have few tags or
| no title or something, they might be unfairly advantaged in a
| vector-distance-ranked search, because the formats of the
| documents more closely resemble the response format the
| question was expecting...)
| gillesjacobs wrote:
| I really want to see some evaluation benchmark comparisons on in-
| chunk augmentation approaches like this (and question, title,
| header-generation) and the hybrid retrieval approach where you
| match at multiple levels: first retrieve/filter on a higher-level
| summary, title or header, then match the related chunks.
|
| The pure vector approach of in-chunk text augmentation is much
| simpler of course, but my hypothesis is that the resulting vector
| will cause too many false positives in retrieval.
|
| In my experience, retrieval precision is most commonly the
| problem with vector similarity, not recall. This method will indeed
| improve recall for out-of-context chunks, but for me recall has
| not been a problem very often.
| bob1029 wrote:
| I've found the best approach is to start with traditional full
| text search. Get it to a point where manual human searches are
| useful - Especially for users who don't have a stake in the
| development of an AI solution. _Then_, look at building a RAG-
| style solution around the FTS.
|
| I never could get much beyond the basic search piece. I don't see
| how mixing in a black box AI model with probabilistic outcomes
| could add any value without having this working first.
| k__ wrote:
| I always wondered why a RAG index has to be a vector DB.
|
| If the model understands text/code and can generate text/code
| it should be able to talk to OpenSearch no problem.
| te_chris wrote:
| Honestly you clocked the secret: it doesn't.
|
| It makes sense for the hype, though. As we got LLMs we also
| got wayyyy better embedding models, but they're not
| dependencies.
| simonw wrote:
| It doesn't have to be a vector DB - and in fact I'm seeing
| increasing skepticism that embedding vector DBs are the best
| way to implement RAG.
|
| A full-text search index using BM25 or similar may actually
| work a lot better for many RAG applications.
|
| I wrote up some notes on building FTS-based RAG here:
| https://simonwillison.net/2024/Jun/21/search-based-rag/
| rcarmo wrote:
| I've been using SQLite FTS (which is essentially BM25) and
| it works so well I haven't really bothered with vector
| databases, or Postgres, or anything else yet. Maybe when my
| corpus exceeds 2GB...
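|
| For anyone curious, the FTS5 flavour of this is roughly (schema
| and snippet settings are illustrative):
|
|   import sqlite3
|
|   db = sqlite3.connect("corpus.db")
|   db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS chunks "
|              "USING fts5(doc_title, body)")
|
|   def search(query: str, k: int = 10):
|       # FTS5's built-in rank is BM25-based; ORDER BY rank returns
|       # the best matches first.
|       return db.execute(
|           "SELECT doc_title, snippet(chunks, 1, '[', ']', '...', 10) "
|           "FROM chunks WHERE chunks MATCH ? ORDER BY rank LIMIT ?",
|           (query, k)).fetchall()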
| ianbutler wrote:
| In 2019 I was using vector search to narrow the search
| space within 100s of millions of documents and then do full
| text search on the top 10k or so docs.
|
| That seems like a better stacking of the technologies even
| now
| niam wrote:
| What are the arguments for embedded vector DBs being
| suboptimal in RAG, out of curiosity?
| simonw wrote:
| The biggest one is that it's hard to get "zero matches"
| from an embeddings database. You get back all results
| ordered by distance from the user's query, but it will
| really scrape the bottom of the barrel if there aren't
| any great matches - which can lead to bugs like this one:
| https://simonwillison.net/2024/Jun/6/accidental-prompt-
| injec...
|
| The other problem is that embeddings search can miss
| things that a direct keyword match would have caught. If
| you have key terms that are specific to your corpus -
| product names for example - there's a risk that a vector
| match might not score those as highly as BM25 would have
| so you may miss the most relevant documents.
|
| Finally, embeddings are much more black box and hard to
| debug and reason about. We have decades of experience
| tweaking and debugging and improving BM25-style FTS
| search - the whole field of "Information Retrieval".
| Throwing that all away in favour of weird new embedding
| vectors is suboptimal.
| verdverm wrote:
| You can view RAG as a bigger word2vec. The canonical example
| being "king - man + woman = queen". Words, or now chunks,
| have geometric distribution, cluster, and relationships... on
| semantic levels
|
| What is happening is that text is being embedded into a
| different space, and that format is an array of floats (a
| point in the embedding space). When we do retrieval, we embed
| the query and then find other points close to that query. The
| reason for a vector DB is (1) to optimize for this use case,
| just as we have many specialized data stores / indexes (Redis,
| Elastic, Dolt, RDBMS), and (2) often to be memory-based for
| faster retrieval. PgVector will be interesting to watch. I
| personally use Qdrant
|
| Full-text search will never be able to do some of the things
| that are possible in the embedding space. The most capable
| systems will use both techniques
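|
| The analogy arithmetic is easy to play with directly; a toy
| sketch (with real word2vec/GloVe vectors, the nearest neighbour
| of king - man + woman is famously "queen"):
|
|   import numpy as np
|
|   def analogy(vecs: dict[str, np.ndarray], a: str, b: str, c: str) -> str:
|       """Return the word whose vector is closest (cosine) to
|       vec(a) - vec(b) + vec(c), excluding the inputs themselves."""
|       target = vecs[a] - vecs[b] + vecs[c]
|       target = target / np.linalg.norm(target)
|       return max((w for w in vecs if w not in {a, b, c}),
|                  key=lambda w: float(vecs[w] @ target / np.linalg.norm(vecs[w])))
|
|   # analogy(word_vectors, "king", "man", "woman")  # -> "queen"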
| petercooper wrote:
| You're right, and it's also possible to still use LLMs and
| vector search in such a system, but instead you use them to
| enrich the _queries_ made to traditional, pre-existing
| knowledge bases and search systems. Arguably you could call
| this "generative assisted retrieval" or GAR.. sadly I didn't
| coin the term, there's a paper about it ;-)
| https://aclanthology.org/2021.acl-long.316/
| alexmolas wrote:
| But with FTS you don't solve the "out-of-context chunk
| problem". You'll still miss relevant chunks with FTS. You still
| can apply the approach proposed in the post to FTS, but instead
| of using similarity you could use BM25.
| zby wrote:
| Traditional FTS returns the whole document - people take over
| from that point and locate the interesting content there. The
| problem with RAG is that it does not follow that procedure - it
| tries to find the interesting chunk in one step, even though
| since ReAct we have known that LLMs could follow the same
| procedure as humans.
|
| But we need an iterative RAG anyway:
| https://zzbbyy.substack.com/p/why-iterative-thinking-is-cruc...
| ankit219 wrote:
| As is typical with any RAG strategy/algorithm, the implicit thing
| is it works on a specific dataset. Then, it solves a very
| specific use case. The thing is, if you have a dataset and a use
| case, you can have a very custom algorithm which would work
| wonders in terms of output you need. There need not be anything
| generic.
|
| My instinct at this point is, these algos look attractive because
| we are constrained to giving a user a wow moment where they
| upload something and get to chat with the doc/dataset within
| minutes. As attractive as that is, it is a distinct second
| priority to building a system that works 99% of the time, even
| if it takes a day or two to set up. You get a feel of the data,
| a feel for the type of questions that may be asked, and create
| an algo
| that works for a specific type of dataset-usecase combo (assuming
| any more data you add in this system would be similar and work
| pretty well). There is no silver bullet that we seem to be
| searching for.
| cl42 wrote:
| 100% agree with you. I've built a # of RAG systems and find
| that simple Q&A-style use cases actually do fine with
| traditional chunking approaches.
|
| ... and then you have situations where people ask complex
| questions with multiple logical steps, or knowledge gathering
| requirements, and using some sort of hierarchical RAG strategy
| works better.
|
| I think a lot of solutions (including this post) abstract to
| building knowledge graphs of some sort... But knowledge graphs
| still require an ontology associated with the problem you're
| solving and will fail outside of those domains.
| oshams wrote:
| Have you considered this approach? Worked well for us:
| https://news.ycombinator.com/item?id=40998497
| Sharlin wrote:
| "An Outside Context Problem was the sort of thing most
| civilisations encountered just once, and which they tended to
| encounter rather in the same way a sentence encountered a full
| stop."
|
| https://www.goodreads.com/quotes/9605621-an-outside-context-...
|
| (Sorry, I just _had_ to post this quote because it was the first
| thing that came to my mind when I saw the title, and I've been
| re-reading Banks lately.)
| unixhero wrote:
| I have identified a pain point where my RAGs are insufficiently
| answering what I already had with a long-tail DAG in production.
___________________________________________________________________
(page generated 2024-07-24 23:03 UTC)