[HN Gopher] Contextual Retrieval
___________________________________________________________________
Contextual Retrieval
Author : loganfrederick
Score : 237 points
Date : 2024-09-20 01:57 UTC (21 hours ago)
(HTM) web link (www.anthropic.com)
(TXT) w3m dump (www.anthropic.com)
| skybrian wrote:
| This sounds a lot like how we used to do research, by reading
| books and writing any interesting quotes on index cards, along
| with where they came from. I wonder if prompting for that would
| result in better chunks? It might make it easier to review if you
| wanted to do it manually.
| visarga wrote:
| The fundamental problem of both keyword and embedding based
| retrieval is that they only access surface level features. If
| your document contains 5+5 and you search "where is the result
| 10" you won't find the answer. That is why all texts need to be
| "digested" with LLM before indexing, to draw out implicit
| information and make it explicit. It's also what Anthropic
| proposes we do to improve RAG.
|
| "study your data before indexing it"
| skybrian wrote:
| Makes sense. It seems after retrieval, both would be useful -
| both the exact quote and a summary of its context.
| skeptrune wrote:
| I'm not a fan of this technique. I agree the scenario they lay
| out is a common problem, but the proposed solution feels odd.
|
| Vector embeddings have bag-of-words compression properties and
| can over-index on the first newline separated text block to the
| extent that certain indices in the resulting vector end up much
| closer to 0 than they otherwise would. With quantization, they
| can eventually become 0 and cause you to lose out on lots of
| precision with the dense vectors. IDF search overcomes this to
| some extent, but not enough.
|
| You can "semantically boost" embeddings such that they move
| closer to your document's title, summary, abstract, etc. and get
| the recall benefits of this "context" prepend without polluting
| the underlying vector. Implementation wise it's a weighted sum.
| During the augmentation step where you put things in the context
| window, you can always inject the summary chunk when the doc
| matches as well. Much cleaner solution imo.
|
| Description of "semantic boost" in the Trieve API[1]:
|
| >semantic_boost: Semantic boost is useful for moving the
| embedding vector of the chunk in the direction of the distance
| phrase. I.e. you can push a chunk with a chunk_html of "iphone"
| 25% closer to the term "flagship" by using the distance phrase
| "flagship" and a distance factor of 0.25. Conceptually it's
| drawing a line (euclidean/L2 distance) between the vector for the
| innerText of the chunk_html and distance_phrase then moving the
| vector of the chunk_html distance_factorL2Distance closer to or
| away from the distance_phrase point along the line between the
| two points.
|
| [1]:https://docs.trieve.ai/api-reference/chunk/create-or-
| upsert-...
| torginus wrote:
| Sorry random question - do vector dbs work across models? I'd
| guess no, since embeddings are models specific afaik, but that
| means that a vector db would lock you into using a single LLM
| and even within that, a single version, like Claude-3.5 Sonnet,
| and you couldn't move to 3.5 Haiku, Opus etc., never mind
| ChatGPT or Llama without reindexing.
| rvnx wrote:
| In short: no.
|
| The vector databases are here to store vectors and
| calculating distance between vectors.
|
| The embeddings model is the model that you pick to generate
| these vectors from a string or an image.
|
| You give "bart simpson" to an embeddings model and it becomes
| (43, -23, 2, 3, 4, 843, 34, 230, 324, 234, ...)
|
| You can imagine it like geometric points in space (well, it's
| a vector though), except that instead of being 2D, or
| 3D-space, they are typically in higher-number of dimensions
| (e.g: 768).
|
| When you want to find similar entries, you just generate a
| new vector "homer simpson" (64, -13, 2, 3, 4, 843, 34, 230,
| 324, 234, ...) and send it to the vector database and it will
| return you all the nearest neighbors (= the existing entries
| with the smallest distance).
|
| To generate these vectors, you can use any model that you
| want, however, you have to stay consistent.
|
| It means that once you are using one embedding model, you are
| "forever" stuck with it, as there is no practical way to
| project from one vector space to another.
| torginus wrote:
| that sucks :(. I wonder if there are other approaches to
| this, like simple word lookup, with storing a few synonyms,
| and prompting the LLM to always use the proper technical
| terms when performing a lookup.
| kordlessagain wrote:
| Back of the book index or inverted indexes can be stored
| in a set store and give decent results that compare to
| vector lookups. The issue with them is you have to do an
| extraction inference to get the keywords.
| passion__desire wrote:
| Embedding is a transformation which allows us to find
| semantically relevant chunks from a catalogue given a query.
| Through some nearness criteria, you would retrieve
| "semantically relevant" chunks which along with query would
| be fed to LLMs and ask them to synthesize the best answer.
| Vespa docs are very great if you are thinking of building in
| this space. Retrieval part is independent of synthesis, hence
| it has its separate leaderboard on huggingface.
|
| https://docs.vespa.ai/en/embedding.html
|
| https://huggingface.co/spaces/mteb/leaderboard
| simonw wrote:
| My favorite thing about this is the way it takes advantage of
| prompt caching.
|
| That's priced at around 1/10th of what the prompts would normally
| cost if they weren't cached, which means that tricks like this
| (running every single chunk against a full copy of the original
| document) become feasible where previously they wouldn't have
| financially made sense.
|
| I bet there are all sorts of other neat tricks like this which
| are opened up by caching cost savings.
|
| My notes on contextual retrieval:
| https://simonwillison.net/2024/Sep/20/introducing-contextual...
| and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-
| caching-with-cl...
| jillesvangurp wrote:
| You could do a lot of stuff with pre-calculating things for
| your embeddings. Why cache when you can pre-calculate. That
| brings into play a whole lot of things people commonly do as
| part of ETL.
|
| I come from a traditional search back ground. It's quite
| obvious to me that RAG is a bit of a naive strategy if you
| limit it to just using vector search with some off the shelf
| embedding model. Vector search simply isn't that good. You need
| additional information retrieval strategies if you want to
| improve the context you provide to the LLM. That is effectively
| what they are doing here.
|
| Microsoft published an interesting paper on graph RAG some time
| ago where they combine RAG with vector search based on a
| conceptual graph that they construct from the indexed data
| using entity extraction. This allows them to pull in
| contextually relevant information for matching chunks.
|
| I have a hunch that you could probably get quite far without
| doing any vector search at all. It would be a lot cheaper too.
| Simply use a traditional search engine and some tuned query.
| The trick is of course query tuning. Which may not work that
| well for general purpose use cases but it could work for more
| specialized use cases.
| TmpstsTrrctta wrote:
| I have experience in traditional search as well and I think
| this is doing some limiting of my imagination when it comes
| to vector search. In the post, I did like the introduction of
| the Contextual BM25 compared to other hybrid approaches then
| doing rrf.
|
| For question answering, vector/semantic search is clearly a
| better fit in my mind, and I can see how the contextual
| models can enable and bolster that. However, because I've
| implemented and used so many keyword based systems, that just
| doesn't seem to be how my brain works.
|
| An example I'm thinking of is finding a sushi restaurant near
| me with availability this weekend around dinner time. I'd
| love to be able to search for this as I've written it. How I
| would search for it would be search for sushi restaurant,
| sort by distance and hope the application does a proper job
| of surfacing time filtering.
|
| Conversely, this is mostly how I would build this system.
| Perhaps with a layer to determine user intention to pull out
| restaurant type, location sorting, and time filtering.
|
| I could see using semantic search for filtering down the
| restaurants to related to sushi, but do we then drop back
| into traditional search for filtering and sorting? Utilize
| function calling to have the LLM parameterize our search
| query?
|
| As stated, perhaps I'm not thinking of these the right way
| because of my experiences with existing systems, which I find
| seem to give me better results when well built
| ValentinA23 wrote:
| Another approach I saw is to build a conceptual graph using
| entity extraction and have the LLM suggest search paths
| through that graph to enhance the retrieval step. The LMM
| is fine-tuned on the conceptual graph for this specific
| task. Could work in your case, but you need to deal with an
| ontology that suits your use case, in other words it must
| already contain restaurant location, type of dishes served
| and opening hours.
| postalcoder wrote:
| Graph RAG is very cool and outstanding at filling some
| niches. IIRC, Perplexity's actual search is just BM25 (based
| a lex fridman interview of the founder).
| jillesvangurp wrote:
| Makes sense; perplexity is really responsive and fast
| usually.
|
| I need to check out that interview with Lex Fridman.
| _hfqa wrote:
| Do you have the link and the time in the video where he
| mentions it?
| rty32 wrote:
| https://youtu.be/e-gwvmhyU7A?t=2h5m41s
| visarga wrote:
| GraphRAG requires you define upfront the schema of entity and
| relation types. This works when you are in a known domain,
| but in general, when you want to just answer questions from a
| large reference, you don't know what you need to put in the
| graph.
| lmeyerov wrote:
| This was my exact question. Why do an LLM rewrite, when you
| can add a context vector to a chunk vector, and for plaintext
| indexing, add a context string (eg, tfidf)?
|
| The article claimed other context augmentation fails, and
| that you are better off paying anthropic to run an LLM on all
| your data, but it seems quite handwavy. What vector+text
| search nuance does a full document cache LLM rewrite catch
| that cheapo methods miss? Reminds me of "It is difficult to
| get a man to understand something when his salary depends on
| his not understanding it". (We process enough data that we
| try to limit LLMs to the retrieval step, and only embeddings
| & light LLMs to the indexing step, so it's a $$$ distinction
| for our customers.)
|
| The context caching is neat in general, so I have to wonder
| if this use case is more about paying for ease than quality,
| and its value for quality is elsewhere.
| thruway516 wrote:
| I follow your blog and read almost everything you write about
| Llms. Just curious (if you havent already written about it
| somewhere and I missed it), how much do you spend monthly,
| exploring all the various Llms and their features? (I think its
| a useful context for having a grasp of how much I would have to
| spend to keep up to date with the models out there and the
| latest features)
| simonw wrote:
| Most months I spend less than $10 total across the OpenAI,
| Anthropic and Google APIs - for the kind of stuff I do I'm
| just not racking up really high token counts.
|
| I spend $20/month on ChatGPT plus and $20/month on Claude
| Pro. I get GitHub Copilot for free as an open source
| maintainer.
| davedx wrote:
| Cost is one aspect, but what about ingest time? You're adding
| significant processing time to your pipeline with this method
| right?
| simonw wrote:
| I expect most implementations of RAG don't mind this too much
| - if you're dealing with only a few hundred more pages of
| documents a day the ingestion time from using fancy tricks
| like this is going to be measured in minutes.
| valstu wrote:
| We're doing something similar. We first chunk the documents based
| on h1,h2,h3 headings. Then we add headers in the beginning of the
| chunk as a context. As an imagenary example, instead of one chunk
| being: The usual dose for adults is one or two
| 200mg tablets or capsules 3 times a day.
|
| It is now something like: # Fever ##
| Treatment --- The usual dose for adults is one or two
| 200mg tablets or capsules 3 times a day.
|
| This seems to work pretty well, and doesn't require any LLMs when
| indexing documents.
|
| (Edited formatting)
| cabidaher wrote:
| Did you experiment with different ways to format those included
| headers? Asking because I am doing something similar to that as
| well.
| valstu wrote:
| Nope, not yet. We have sticked with markdownish syntax so
| far.
| visarga wrote:
| I am working on question answering based on long documents /
| bundles of documents, 100+ pages, and I took a similar
| approach. I first summarize each page, give it a title and
| extract a list of subsections. Then I put all the summaries
| together and I ask the model to provide a hierarchical index.
| It will organize the whole bundle into a tree. At querying time
| I combine the path in the tree as additional context.
| passion__desire wrote:
| I used to always wonder how do llms know whether a particular
| long article or audio transcript was written by say Alan Watts.
| Basically these kind of metadata annotation would be common
| while preparing training data for Llama models and so on. This
| could also be reason for the genesis for the argument that
| ChatGPT got slower in December. That "date" metadata would
| "inform" ChatGPT to be unhelpful.
| timwaagh wrote:
| I guess this does give some insights. Using a more space
| efficient language for your codebase will mean more functionality
| in the ais context window when working with Claude and code.
| postalcoder wrote:
| To add some context, this isn't that novel of an approach. A
| common approach to improve RAG results is to "expand" the
| underlying chunks using an llm, so as to increase the semantic
| surface area to match against. You can further improve your
| results by running query expansion using HyDE[1], though it's not
| always an improvement. I use it as a fallback.
|
| I'm not sure what Anthropic is introducing here. I looked at the
| cookbook code and it's just showing the process of producing said
| context, but there's no actual change to their API regarding
| "contextual retrieval".
|
| The one change is prompt caching, introduced a month back, which
| allows you to very cheaply add better context to individual
| chunks by providing the entire (long) document as context.
| Caching is an awesome feature to expose to developers and I don't
| want to take anything away from that.
|
| However, other than that, the only thing I see introduced is just
| a cookbook on how to do a particular rag workflow.
|
| As an aside, Cohere may be my favorite API to work with. (no
| affiliation) Their RAG API is a delight, and unlike anything else
| provided by other providers. I highly recommend it.
|
| 1: https://arxiv.org/abs/2212.10496
| resiros wrote:
| I think the innovation is using caching as so to make the cost
| of the approach manageable. The way they implemented it is that
| each time you create a chunk, you ask the llm to create an
| atomic chunk from the whole context. You need to do this for
| all tens of thousands of chunks in your data. This costs a lot.
| By caching the documents, you can spare costs
| skeptrune wrote:
| You could also just save the first outputted atomic chunk and
| store it then re-use it each time yourself. Easier and more
| consistent.
| postalcoder wrote:
| To be fair, that only works if you keep chunk windows
| static.
| IanCal wrote:
| I don't understand how that helps here. They're not
| regenerating each chunk every time, this is about caching
| the state after running a large doc through a model. You
| can only do this kind of thing if you have access to the
| model itself, or it's provided by the API you use.
| postalcoder wrote:
| Yup. Caching is very nice.. but the framing is weird.
| "Introducing" to me, connotes a product release, not a new
| tutorial.
| bayesianbot wrote:
| I was trying to do this using Prompt Caching like a month ago,
| but then noticed there's five minute maximum lifetime for the
| cached prompts - doesn't really work for my RAG needs (or
| probably most), where the queries would be ran during the next
| month or a year. I can't see any changes to that policy. Little
| surprised to see them talk about Prompt Caching relating to
| RAG.
| spott wrote:
| They aren't using the prompt caching on the query side, only
| on the embedding side... so you cache the document in the
| context window when ingesting it, but not during retrieval.
| KTibow wrote:
| It seems a little odd to make multiple requests instead of
| using one request to create all the context for all the
| chunks.
| _bramses wrote:
| The technique I find most useful is to implement a "linked list"
| strategy where a chunk has multiple pointers to the entry it is
| referenced by. This task is done manually, but the diversity of
| the ways you can reference a particular node go up dramatically.
|
| Another way to look at it, comments. Imagine every comment under
| this post is a pointer back to the original post. Some will be
| close in distance, and others will be farther, due to the
| perception of the authors of the comments themselves. But if you
| assign each comment a "parent_id", your access to the post
| multiplies.
|
| You can see an example of this technique here [1]. I don't
| attempt to mind read what the end user will query for, I simply
| let them tell me, and then index that as a pointer. There are
| only a finite number of options to represent a given object. But
| some representations are very, very, very far from the semantic
| meaning of the core object.
|
| [1] - https://x.com/yourcommonbase/status/1833262865194557505
| vendiddy wrote:
| I don't know anything about AI but I've always wished I could
| just upload a bunch of documents/books and the AI would perform
| some basic keyword searches to figure out what is relevant, then
| auto include that in the prompt.
| average_r_user wrote:
| It would help if you tried Notebooklm by Google. It does this,
| you can upload a document/PDF whatever, and ask questions. The
| model replies to you giving also a reference to your material
| mark_l_watson wrote:
| +1 Google's NotebookLM is amazing. In addition to the
| functionality you mention, I tried loading the PDF for my
| entire Practical AI Programming with Clojure book and had it
| generate an 8 minute podcast that was very nuanced - to be
| honest, it seriously blew my mind how well it works. Here is
| a link to the audio file it automatically generated
| https://markwatson.com/audio/AIClojureBook.wav
|
| NotebookLM is currently free to use and was so good I almost
| immediately started paying Google $20 a month to get access
| to their pro version of Gemini.
|
| I still think the Groq APIs for open weight models are the
| best value for the money, but the way OpenAI, Google,
| Anthropic, etc. are productizing LLMs is very impressive.
| underlines wrote:
| We build a corporate RAG for a government entity. What I've
| learned so far by applying an experimental A/B testing approach
| to RAG using RAGAS metrics:
|
| - Hybrid Retrieval (semantic + vector) and then LLM based
| Reranking made no significant change using synthetic eva-
| questions
|
| - HyDE decreased answer quality and retrieval quality severly
| when measured with RAGAS using synthetic eval-questions
|
| (we still have to do a RAGAS eval using expert and real user
| questions)
|
| So yes, hybrid retrieval is always good - that's no news to
| anyone building production ready or enterprise RAG solutions. But
| one method doesn't always win. We found semantic search of Azure
| AI Search being sufficient as a second method, next to vector
| similarity. Others might find BM25 great, or a fine tuned query
| post processing SLM. Depends on the use case. Test, test, test.
|
| Next things we're going to try:
|
| - RAPTOR
|
| - SelfRAG
|
| - Agentic RAG
|
| - Query Refinement (expansion and sub-queries)
|
| - GraphRAG
|
| Learning so far:
|
| - Always use a baseline and an experiment to try to refute your
| null hypothesis using measures like RAGAS or others.
|
| - Use three types of evaluation questions/answers: 1. Expert
| written q&a, 2. Real user questions (from logs), 3. Synthetic q&a
| generated from your source documents
| williamcotton wrote:
| Could you explain or link to explanations of all of the
| acronyms you've used in your comment?
| jiggawatts wrote:
| It makes me chuckle a bit to see this kind of request in a
| tech forum, particularly when discussing advanced LLM-related
| topics.
|
| This is akin to a HN comment asking someone to search the
| Internet for something on their behalf, while discussing
| search engine algorithms!
| williamcotton wrote:
| It adds useful context to the discussion and spurs further
| conversation.
| williamcotton wrote:
| HyDE: Hypothetical Document Embeddings [1]
|
| RAGAS: RAG Assessment [2]
|
| RAPTOR: Recursive Abstractive Processing for Tree-
| Organized Retrieval [3]
|
| Self-RAG: Self-Reflective Retrieval-Augmented Generation
| [4]
|
| Agentic RAG: Agentic Retrieval-Augmented Generation [5]
|
| GraphRAG: Graph Retrieval-Augmented Generation [6]
|
| [1] https://docs.haystack.deepset.ai/docs/hypothetical-
| document-...
|
| [2] https://docs.ragas.io/en/stable/
|
| [3] https://arxiv.org/html/2401.18059v1
|
| [4] https://selfrag.github.io
|
| [5] https://langchain-
| ai.github.io/langgraph/tutorials/rag/langg...
|
| [6] https://www.microsoft.com/en-
| us/research/blog/graphrag-unloc...
| _kb wrote:
| A lot of people here (myself included) work across
| different specialisations and are here to learn from
| discussion that is intentionally unfamiliar.
| jiggawatts wrote:
| Yes, but ChatGPT knows these things! Just ask it to
| expand the acronyms.
|
| This is the new "can you Google that for me?"
| _kb wrote:
| ymkn l ChatGPT 'yDan ltrjm@ mn l`rby@ l~ lnjlyzy@, wlkn
| sykwn mn lmz`j stkhdmh llmHdth@ fy hdh lsyq
| turing_complete wrote:
| What do you think of HippoRAG? Did you try it or plan to do?
| thelastparadise wrote:
| Can someone explain simply how these benchmarks work?
|
| What exactly is a "failure rate" and how is it computed?
| quantadev wrote:
| They simply ask the AI a question about a large document (or
| set of docs). It either gets the answer right or wrong. They
| count the number of hits and misses.
| regularfry wrote:
| I've been wondering for a while if having ElasticSearch as just
| another function to call might be interesting. If the LLM can
| just generate queries it's an easy deployment.
| mark_l_watson wrote:
| I just took the time to read through all source code and docs.
| Nice ideas. I like to experiment with LLMs running on my local
| computer so I will probably convert this example to use the light
| weight Python library Rank-BM25 instead of Elastic Search, and a
| long context model running on Ollama. I wouldn't have prompt
| caching though.
|
| This example is well written and documented, easy to understand.
| Well done.
| msp26 wrote:
| > If your knowledge base is smaller than 200,000 tokens (about
| 500 pages of material)
|
| I would prefer that anthropic just release their tokeniser so we
| don't have to make guesses.
| paxys wrote:
| Waiting for the day when the entire AI industry goes back full
| circle to TF-IDF.
| davedx wrote:
| Yeah it did make me chuckle. I'm guessing products like
| elasticsearch support all the classic text matching algos out
| of the box anyway?
| ValentinA23 wrote:
| Interesting. One problem I'm facing is using RAG to retrieve
| applicable rules instead of knowledge (chunks): only rules that
| may apply to the context should be injected into the context. I
| haven't done any experiment, but one approach that I think could
| work would be to train small classifiers to determine whether a
| specific rule _could_ apply. The main LLM would be tasked with
| determining whether the rule indeed applies or not for the
| current context.
|
| An example: let's suppose you're using an LLM to play a multi
| user dungeon. In the past your character has behaved badly with
| taxis so that the game has decided to create a rule that says
| that whenever you try to enter a taxi you're kicked out: "we know
| who you are, we refuse to have you as a client until you formally
| apologize to the taxi company director". Upon apologizing, the
| rule is removed. Note that the director of the taxi company could
| be another player and be the one who issued the rule in the first
| place, to be enforced by his NPC fleet of taxis.
|
| I'm wondering how well this could scale (with respect of number
| of active rules) and to which extent traditional RAG could be
| applied. It seems deciding whether a rule applies or not is a
| problem that is more abstract and difficult than deciding whether
| a chunk of knowledge is relevant or not.
|
| In particular the main problem I have identified that makes it
| more difficult is the following dependency loop that doesn't
| appear with knowledge retrieval: you need to retrieve a rule to
| identify whether it applies or not. Does anyone know how this
| problem could be solved ?
| davedx wrote:
| Even with prompt caching this adds a huge extra time to your
| vector database create/update, right? That may be okay for some
| use cases but I'm always wary of adding multiple LLM layers into
| these kinds of applications. It's nice for the cloud LLM
| providers of course.
|
| I wonder how it would work if you generated the contexts yourself
| algorithmically. Depending on how well structured your docs are
| this could be quite trivial (eg for an html doc insert the title
| > h1 > h2 > chunk).
| will-burner wrote:
| I wish they included the datasets they used for the evaluations.
| As far as I can tell, in appendix II they include some sample
| questions, answers, and golden chunks but they do not give the
| entire dataset or give an explicit information on exactly what
| the datasets are.
|
| Does anyone know if the datasets they used for the evaluation are
| publicly available or if they give more information on the
| datasets than what's in appendix II?
|
| There are standard publically available datasets for this type of
| evaluation, like MTEB (https://github.com/embeddings-
| benchmark/mteb). I wonder how this technique does on the MTEB
| dataset.
| justanotheratom wrote:
| Looking forward to some guidance on "chunking":
|
| "Chunk boundaries: Consider how you split your documents into
| chunks. The choice of chunk size, chunk boundary, and chunk
| overlap can affect retrieval performance1."
___________________________________________________________________
(page generated 2024-09-20 23:00 UTC)