[HN Gopher] Contextual Retrieval
       ___________________________________________________________________
        
       Contextual Retrieval
        
       Author : loganfrederick
       Score  : 237 points
       Date   : 2024-09-20 01:57 UTC (21 hours ago)
        
 (HTM) web link (www.anthropic.com)
 (TXT) w3m dump (www.anthropic.com)
        
       | skybrian wrote:
       | This sounds a lot like how we used to do research, by reading
       | books and writing any interesting quotes on index cards, along
       | with where they came from. I wonder if prompting for that would
       | result in better chunks? It might make it easier to review if you
       | wanted to do it manually.
        
         | visarga wrote:
          | The fundamental problem of both keyword- and embedding-based
          | retrieval is that they only access surface-level features. If
          | your document contains 5+5 and you search "where is the result
          | 10", you won't find the answer. That is why all texts need to be
          | "digested" with an LLM before indexing, to draw out implicit
          | information and make it explicit. It's also what Anthropic
          | proposes we do to improve RAG.
         | 
         | "study your data before indexing it"
        
           | skybrian wrote:
           | Makes sense. It seems after retrieval, both would be useful -
           | both the exact quote and a summary of its context.
        
       | skeptrune wrote:
       | I'm not a fan of this technique. I agree the scenario they lay
       | out is a common problem, but the proposed solution feels odd.
       | 
        | Vector embeddings have bag-of-words compression properties and
        | can over-index on the first newline-separated text block to the
        | extent that certain indices in the resulting vector end up much
        | closer to 0 than they otherwise would. With quantization, they
        | can eventually become 0, causing you to lose a lot of precision
        | in the dense vectors. IDF search overcomes this to some extent,
        | but not enough.
       | 
       | You can "semantically boost" embeddings such that they move
       | closer to your document's title, summary, abstract, etc. and get
       | the recall benefits of this "context" prepend without polluting
       | the underlying vector. Implementation wise it's a weighted sum.
       | During the augmentation step where you put things in the context
       | window, you can always inject the summary chunk when the doc
       | matches as well. Much cleaner solution imo.
       | 
       | Description of "semantic boost" in the Trieve API[1]:
       | 
       | >semantic_boost: Semantic boost is useful for moving the
       | embedding vector of the chunk in the direction of the distance
       | phrase. I.e. you can push a chunk with a chunk_html of "iphone"
       | 25% closer to the term "flagship" by using the distance phrase
       | "flagship" and a distance factor of 0.25. Conceptually it's
       | drawing a line (euclidean/L2 distance) between the vector for the
        | innerText of the chunk_html and the distance_phrase, then moving
        | the vector of the chunk_html distance_factor * L2Distance closer
        | to or away from the distance_phrase point along the line between
        | the two points.
       | 
       | [1]:https://docs.trieve.ai/api-reference/chunk/create-or-
       | upsert-...
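        | 
        | A minimal sketch of that weighted-sum idea (assuming numpy
        | vectors and a generic embed() helper; this is an illustration,
        | not Trieve's actual implementation):
        | 
        |     def semantic_boost(chunk_vec, phrase_vec, distance_factor=0.25):
        |         # chunk_vec and phrase_vec are numpy arrays from your embedding model.
        |         # Move the chunk vector distance_factor of the way toward the phrase
        |         # vector along the straight line between the two points.
        |         return chunk_vec + distance_factor * (phrase_vec - chunk_vec)
        | 
        |     # boosted = semantic_boost(embed("iphone"), embed("flagship"), 0.25)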
        
         | torginus wrote:
          | Sorry, random question - do vector DBs work across models? I'd
          | guess no, since embeddings are model-specific AFAIK, but that
          | means a vector DB would lock you into using a single LLM,
          | and even within that, a single version, like Claude 3.5 Sonnet,
          | and you couldn't move to 3.5 Haiku, Opus etc., never mind
          | ChatGPT or Llama, without reindexing.
        
           | rvnx wrote:
           | In short: no.
           | 
            | Vector databases are there to store vectors and to calculate
            | distances between vectors.
           | 
           | The embeddings model is the model that you pick to generate
           | these vectors from a string or an image.
           | 
           | You give "bart simpson" to an embeddings model and it becomes
           | (43, -23, 2, 3, 4, 843, 34, 230, 324, 234, ...)
           | 
            | You can imagine them as geometric points in space (well,
            | vectors, technically), except that instead of living in 2D or
            | 3D space, they typically have a much higher number of
            | dimensions (e.g. 768).
           | 
            | When you want to find similar entries, you just generate a
            | new vector for "homer simpson" (64, -13, 2, 3, 4, 843, 34,
            | 230, 324, 234, ...), send it to the vector database, and it
            | will return all the nearest neighbors (= the existing entries
            | with the smallest distance).
           | 
            | To generate these vectors, you can use any model you want;
            | however, you have to stay consistent.
           | 
           | It means that once you are using one embedding model, you are
           | "forever" stuck with it, as there is no practical way to
           | project from one vector space to another.
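            | 
            | A tiny end-to-end sketch of that flow (using the
            | sentence-transformers library as one example embedding
            | model; any model works, as long as you stay consistent):
            | 
            |     from sentence_transformers import SentenceTransformer
            |     import numpy as np
            | 
            |     model = SentenceTransformer("all-MiniLM-L6-v2")
            |     names = ["bart simpson", "homer simpson", "tokyo"]
            |     index = dict(zip(names, model.encode(names)))  # name -> vector
            | 
            |     def nearest(query, k=2):
            |         q = model.encode([query])[0]
            |         # brute-force L2 distance; a vector DB does this at scale
            |         dists = {n: np.linalg.norm(q - v) for n, v in index.items()}
            |         return sorted(dists, key=dists.get)[:k]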
        
             | torginus wrote:
              | That sucks :(. I wonder if there are other approaches to
              | this, like a simple word lookup that stores a few synonyms,
              | combined with prompting the LLM to always use the proper
              | technical terms when performing a lookup.
        
               | kordlessagain wrote:
                | A back-of-the-book index or an inverted index can be
                | stored in a set store and gives decent results comparable
                | to vector lookups. The issue is that you have to run an
                | extraction inference to get the keywords.
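                | 
                | A toy version of that inverted-index-in-a-set-store idea
                | (plain token splitting here; the extraction inference
                | mentioned above would replace the split):
                | 
                |     from collections import defaultdict
                | 
                |     docs = {1: "adult fever dosage", 2: "acme q2 revenue growth"}
                |     inverted = defaultdict(set)                  # keyword -> doc ids
                |     for doc_id, text in docs.items():
                |         for token in set(text.lower().split()):  # or LLM-extracted keywords
                |             inverted[token].add(doc_id)
                | 
                |     def lookup(query):
                |         tokens = [t for t in query.lower().split() if t in inverted]
                |         hits = [inverted[t] for t in tokens]
                |         return set.intersection(*hits) if hits else set()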
        
           | passion__desire wrote:
            | An embedding is a transformation that lets you find
            | semantically relevant chunks from a catalogue given a query.
            | Using some nearness criterion, you retrieve the "semantically
            | relevant" chunks, which are then fed to the LLM along with
            | the query so it can synthesize the best answer. The Vespa
            | docs are great if you are thinking of building in this space.
            | The retrieval part is independent of synthesis, hence it has
            | its own leaderboard on Hugging Face.
           | 
           | https://docs.vespa.ai/en/embedding.html
           | 
           | https://huggingface.co/spaces/mteb/leaderboard
        
       | simonw wrote:
       | My favorite thing about this is the way it takes advantage of
       | prompt caching.
       | 
       | That's priced at around 1/10th of what the prompts would normally
       | cost if they weren't cached, which means that tricks like this
       | (running every single chunk against a full copy of the original
        | document) become feasible where previously they wouldn't have
        | made financial sense.
       | 
       | I bet there are all sorts of other neat tricks like this which
       | are opened up by caching cost savings.
       | 
       | My notes on contextual retrieval:
       | https://simonwillison.net/2024/Sep/20/introducing-contextual...
       | and prompt caching: https://simonwillison.net/2024/Aug/14/prompt-
       | caching-with-cl...
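        | 
        | A rough way to see why the caching pencils out (illustrative
        | numbers only, ignoring the one-time cache-write cost):
        | 
        |     doc_tokens, chunk_tokens, n_chunks = 50_000, 200, 250
        |     price = 1.0          # normalized per-token input price
        |     cached_price = 0.1   # roughly 1/10th once the document is cached
        | 
        |     uncached = n_chunks * (doc_tokens + chunk_tokens) * price
        |     cached = n_chunks * (doc_tokens * cached_price + chunk_tokens * price)
        |     print(cached / uncached)  # ~0.10, so the repeated doc context is ~10x cheaper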
        
         | jillesvangurp wrote:
          | You could do a lot of stuff by pre-calculating things for
          | your embeddings. Why cache when you can pre-calculate? That
          | brings into play a whole lot of things people commonly do as
          | part of ETL.
         | 
          | I come from a traditional search background. It's quite
         | obvious to me that RAG is a bit of a naive strategy if you
         | limit it to just using vector search with some off the shelf
         | embedding model. Vector search simply isn't that good. You need
         | additional information retrieval strategies if you want to
         | improve the context you provide to the LLM. That is effectively
         | what they are doing here.
         | 
         | Microsoft published an interesting paper on graph RAG some time
         | ago where they combine RAG with vector search based on a
         | conceptual graph that they construct from the indexed data
         | using entity extraction. This allows them to pull in
         | contextually relevant information for matching chunks.
         | 
         | I have a hunch that you could probably get quite far without
         | doing any vector search at all. It would be a lot cheaper too.
          | Simply use a traditional search engine and some tuned queries.
          | The trick, of course, is query tuning, which may not work that
          | well for general-purpose use cases but could work for more
          | specialized ones.
        
           | TmpstsTrrctta wrote:
            | I have experience in traditional search as well, and I think
            | it limits my imagination when it comes to vector search. In
            | the post, I did like the introduction of Contextual BM25
            | compared to other hybrid approaches that then do RRF.
           | 
           | For question answering, vector/semantic search is clearly a
           | better fit in my mind, and I can see how the contextual
           | models can enable and bolster that. However, because I've
           | implemented and used so many keyword based systems, that just
           | doesn't seem to be how my brain works.
           | 
            | An example I'm thinking of is finding a sushi restaurant near
            | me with availability this weekend around dinner time. I'd
            | love to be able to search for this as I've written it. How I
            | would actually search for it is: search for "sushi
            | restaurant", sort by distance, and hope the application does
            | a proper job of surfacing time filtering.
           | 
           | Conversely, this is mostly how I would build this system.
           | Perhaps with a layer to determine user intention to pull out
           | restaurant type, location sorting, and time filtering.
           | 
            | I could see using semantic search to filter the restaurants
            | down to those related to sushi, but do we then drop back into
            | traditional search for filtering and sorting? Utilize
            | function calling to have the LLM parameterize our search
            | query?
           | 
            | As stated, perhaps I'm not thinking about these the right way
            | because of my experience with existing systems, which I find
            | tend to give me better results when well built.
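            | 
            | A sketch of what that LLM-parameterized option could look
            | like; the filters dict is hand-written here to stand in for a
            | hypothetical tool-calling response:
            | 
            |     query = "sushi near me with availability this weekend around dinner"
            | 
            |     # hypothetical structured output from the LLM for the query above
            |     filters = {"cuisine": "sushi", "sort": "distance",
            |                "available_between": ("2024-09-21T18:00",
            |                                      "2024-09-21T21:00")}
            | 
            |     def search(restaurants, filters):
            |         # traditional filtering and sorting does the rest
            |         hits = [r for r in restaurants
            |                 if filters["cuisine"] in r["tags"]
            |                 and r["has_slot"](*filters["available_between"])]
            |         return sorted(hits, key=lambda r: r["distance_km"])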
        
             | ValentinA23 wrote:
             | Another approach I saw is to build a conceptual graph using
             | entity extraction and have the LLM suggest search paths
              | through that graph to enhance the retrieval step. The LLM
              | is fine-tuned on the conceptual graph for this specific
              | task. It could work in your case, but you need an ontology
              | that suits your use case; in other words, it must already
              | contain restaurant locations, types of dishes served, and
              | opening hours.
        
           | postalcoder wrote:
           | Graph RAG is very cool and outstanding at filling some
            | niches. IIRC, Perplexity's actual search is just BM25 (based
            | on a Lex Fridman interview with the founder).
        
             | jillesvangurp wrote:
              | Makes sense; Perplexity is usually really responsive and
              | fast.
             | 
             | I need to check out that interview with Lex Fridman.
        
             | _hfqa wrote:
             | Do you have the link and the time in the video where he
             | mentions it?
        
               | rty32 wrote:
               | https://youtu.be/e-gwvmhyU7A?t=2h5m41s
        
           | visarga wrote:
            | GraphRAG requires you to define the schema of entity and
            | relation types upfront. This works when you are in a known
            | domain, but in general, when you just want to answer
            | questions from a large reference, you don't know what you
            | need to put in the graph.
        
           | lmeyerov wrote:
           | This was my exact question. Why do an LLM rewrite, when you
           | can add a context vector to a chunk vector, and for plaintext
           | indexing, add a context string (eg, tfidf)?
           | 
            | The article claims other context augmentation fails, and
            | that you are better off paying Anthropic to run an LLM on all
            | your data, but it seems quite handwavy. What vector+text
            | search nuance does a full-document cached LLM rewrite catch
            | that cheapo methods miss? Reminds me of "It is difficult to
           | get a man to understand something when his salary depends on
           | his not understanding it". (We process enough data that we
           | try to limit LLMs to the retrieval step, and only embeddings
           | & light LLMs to the indexing step, so it's a $$$ distinction
           | for our customers.)
           | 
           | The context caching is neat in general, so I have to wonder
           | if this use case is more about paying for ease than quality,
           | and its value for quality is elsewhere.
        
         | thruway516 wrote:
          | I follow your blog and read almost everything you write about
          | LLMs. Just curious (if you haven't already written about it
          | somewhere and I missed it): how much do you spend monthly
          | exploring all the various LLMs and their features? (I think
          | it's useful context for getting a grasp of how much I would
          | have to spend to keep up to date with the models out there and
          | the latest features.)
        
           | simonw wrote:
           | Most months I spend less than $10 total across the OpenAI,
           | Anthropic and Google APIs - for the kind of stuff I do I'm
           | just not racking up really high token counts.
           | 
           | I spend $20/month on ChatGPT plus and $20/month on Claude
           | Pro. I get GitHub Copilot for free as an open source
           | maintainer.
        
         | davedx wrote:
          | Cost is one aspect, but what about ingest time? You're adding
          | significant processing time to your pipeline with this method,
          | right?
        
           | simonw wrote:
           | I expect most implementations of RAG don't mind this too much
           | - if you're dealing with only a few hundred more pages of
           | documents a day the ingestion time from using fancy tricks
           | like this is going to be measured in minutes.
        
       | valstu wrote:
        | We're doing something similar. We first chunk the documents based
        | on h1, h2, h3 headings. Then we add the headers at the beginning
        | of the chunk as context. As an imaginary example, instead of one
        | chunk being:
        | 
        |     The usual dose for adults is one or two 200mg tablets or
        |     capsules 3 times a day.
        | 
        | it is now something like:
        | 
        |     # Fever
        |     ## Treatment
        |     ---
        |     The usual dose for adults is one or two 200mg tablets or
        |     capsules 3 times a day.
       | 
       | This seems to work pretty well, and doesn't require any LLMs when
       | indexing documents.
       | 
       | (Edited formatting)
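        | 
        | A minimal sketch of that heading-prefix chunker (regex-based,
        | markdown input assumed; real documents usually need more careful
        | boundary handling):
        | 
        |     import re
        | 
        |     def chunk_markdown(md):
        |         # Split on h1-h3 headings; prefix each chunk with its heading path.
        |         path, body, chunks = [], [], []
        | 
        |         def flush():
        |             if any(line.strip() for line in body):
        |                 chunks.append("\n".join(path + ["---"] + body))
        |             body.clear()
        | 
        |         for line in md.splitlines():
        |             m = re.match(r"^(#{1,3})\s+", line)
        |             if m:
        |                 flush()
        |                 path[:] = path[:len(m.group(1)) - 1] + [line]
        |             else:
        |                 body.append(line)
        |         flush()
        |         return chunks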
        
         | cabidaher wrote:
         | Did you experiment with different ways to format those included
         | headers? Asking because I am doing something similar to that as
         | well.
        
           | valstu wrote:
            | Nope, not yet. We have stuck with markdown-ish syntax so
            | far.
        
         | visarga wrote:
         | I am working on question answering based on long documents /
         | bundles of documents, 100+ pages, and I took a similar
         | approach. I first summarize each page, give it a title and
         | extract a list of subsections. Then I put all the summaries
         | together and I ask the model to provide a hierarchical index.
          | It will organize the whole bundle into a tree. At query time
          | I include the path in the tree as additional context.
        
         | passion__desire wrote:
          | I used to always wonder how LLMs know whether a particular
          | long article or audio transcript was written by, say, Alan
          | Watts. This kind of metadata annotation would be common when
          | preparing training data for Llama models and so on. It could
          | also be the genesis of the argument that ChatGPT got slower in
          | December: that "date" metadata would "inform" ChatGPT to be
          | unhelpful.
        
       | timwaagh wrote:
        | I guess this does give some insights. Using a more space-
        | efficient language for your codebase will mean more functionality
        | fits in the AI's context window when working with Claude and
        | code.
        
       | postalcoder wrote:
       | To add some context, this isn't that novel of an approach. A
       | common approach to improve RAG results is to "expand" the
       | underlying chunks using an llm, so as to increase the semantic
       | surface area to match against. You can further improve your
       | results by running query expansion using HyDE[1], though it's not
       | always an improvement. I use it as a fallback.
       | 
       | I'm not sure what Anthropic is introducing here. I looked at the
       | cookbook code and it's just showing the process of producing said
       | context, but there's no actual change to their API regarding
       | "contextual retrieval".
       | 
       | The one change is prompt caching, introduced a month back, which
       | allows you to very cheaply add better context to individual
       | chunks by providing the entire (long) document as context.
       | Caching is an awesome feature to expose to developers and I don't
       | want to take anything away from that.
       | 
       | However, other than that, the only thing I see introduced is just
       | a cookbook on how to do a particular rag workflow.
       | 
        | As an aside, Cohere may be my favorite API to work with (no
        | affiliation). Their RAG API is a delight, and unlike anything
        | else offered by other providers. I highly recommend it.
       | 
       | 1: https://arxiv.org/abs/2212.10496
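        | 
        | For reference, the HyDE part is small; a minimal sketch
        | (generate(), embed(), and index.nearest() are placeholders for
        | your own LLM call, embedding model, and vector store):
        | 
        |     def hyde_search(query, index, top_k=5):
        |         # Embed a hypothetical answer instead of the raw query,
        |         # then search with that richer vector.
        |         fake_doc = generate(f"Write a short passage answering: {query}")
        |         return index.nearest(embed(fake_doc), k=top_k)
        | 
        | Since it isn't always an improvement, keeping plain embed(query)
        | retrieval as the primary path and HyDE as the fallback matches
        | the usage described above.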
        
         | resiros wrote:
          | I think the innovation is using caching so as to make the cost
          | of the approach manageable. The way they implemented it is that
          | each time you create a chunk, you ask the LLM to create an
          | atomic chunk from the whole context. You need to do this for
          | all tens of thousands of chunks in your data, which costs a
          | lot. By caching the documents, you save on costs.
        
           | skeptrune wrote:
            | You could also just save the first output atomic chunk,
            | store it, and then re-use it each time yourself. Easier and
            | more consistent.
        
             | postalcoder wrote:
             | To be fair, that only works if you keep chunk windows
             | static.
        
             | IanCal wrote:
             | I don't understand how that helps here. They're not
             | regenerating each chunk every time, this is about caching
             | the state after running a large doc through a model. You
             | can only do this kind of thing if you have access to the
             | model itself, or it's provided by the API you use.
        
           | postalcoder wrote:
            | Yup. Caching is very nice, but the framing is weird.
            | "Introducing", to me, connotes a product release, not a new
            | tutorial.
        
         | bayesianbot wrote:
          | I was trying to do this using prompt caching about a month
          | ago, but then noticed there's a five-minute maximum lifetime
          | for cached prompts - that doesn't really work for my RAG needs
          | (or probably most), where the queries would be run over the
          | next month or year. I can't see any changes to that policy.
          | I'm a little surprised to see them talk about prompt caching
          | in relation to RAG.
        
           | spott wrote:
           | They aren't using the prompt caching on the query side, only
           | on the embedding side... so you cache the document in the
           | context window when ingesting it, but not during retrieval.
        
             | KTibow wrote:
             | It seems a little odd to make multiple requests instead of
             | using one request to create all the context for all the
             | chunks.
        
       | _bramses wrote:
        | The technique I find most useful is to implement a "linked list"
        | strategy where a chunk has multiple pointers to the entry it is
        | referenced by. This is done manually, but the diversity of the
        | ways you can reference a particular node goes up dramatically.
       | 
       | Another way to look at it, comments. Imagine every comment under
       | this post is a pointer back to the original post. Some will be
       | close in distance, and others will be farther, due to the
       | perception of the authors of the comments themselves. But if you
       | assign each comment a "parent_id", your access to the post
       | multiplies.
       | 
        | You can see an example of this technique here [1]. I don't
        | attempt to mind-read what the end user will query for; I simply
        | let them tell me, and then index that as a pointer. There are
        | only a finite number of ways to represent a given object, but
        | some representations are very, very, very far from the semantic
        | meaning of the core object.
       | 
       | [1] - https://x.com/yourcommonbase/status/1833262865194557505
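        | 
        | A rough sketch of the pointer idea (embed() stands in for
        | whatever embedding model you use; numpy for the distance):
        | 
        |     import numpy as np
        | 
        |     entries = {0: "original post text ..."}   # the "parent" objects
        |     pointers = []                              # (vector, parent_id)
        | 
        |     def add_pointer(phrasing, parent_id):
        |         # each user-supplied phrasing becomes its own indexed pointer
        |         pointers.append((embed(phrasing), parent_id))
        | 
        |     def retrieve(query):
        |         q = embed(query)
        |         best = min(pointers, key=lambda p: np.linalg.norm(q - p[0]))
        |         return entries[best[1]]   # resolve back to the parent entry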
        
       | vendiddy wrote:
        | I don't know anything about AI, but I've always wished I could
        | just upload a bunch of documents/books and the AI would perform
        | some basic keyword searches to figure out what is relevant, then
        | auto-include that in the prompt.
        
         | average_r_user wrote:
          | It would help if you tried NotebookLM by Google. It does this:
          | you can upload a document, PDF, whatever, and ask questions.
          | The model replies and also gives references to your material.
        
           | mark_l_watson wrote:
           | +1 Google's NotebookLM is amazing. In addition to the
           | functionality you mention, I tried loading the PDF for my
           | entire Practical AI Programming with Clojure book and had it
            | generate an 8-minute podcast that was very nuanced - to be
            | honest, it seriously blew my mind how well it works. Here is
            | a link to the audio file it automatically generated:
           | https://markwatson.com/audio/AIClojureBook.wav
           | 
           | NotebookLM is currently free to use and was so good I almost
           | immediately started paying Google $20 a month to get access
           | to their pro version of Gemini.
           | 
           | I still think the Groq APIs for open weight models are the
           | best value for the money, but the way OpenAI, Google,
           | Anthropic, etc. are productizing LLMs is very impressive.
        
       | underlines wrote:
        | We're building a corporate RAG for a government entity. What I've
        | learned so far by applying an experimental A/B testing approach
        | to RAG using RAGAS metrics:
        | 
        | - Hybrid retrieval (semantic + vector) and then LLM-based
        | reranking made no significant change when measured with synthetic
        | eval questions
       | 
        | - HyDE decreased answer quality and retrieval quality severely
        | when measured with RAGAS using synthetic eval questions
       | 
       | (we still have to do a RAGAS eval using expert and real user
       | questions)
       | 
        | So yes, hybrid retrieval is always good - that's no news to
        | anyone building production-ready or enterprise RAG solutions. But
        | one method doesn't always win. We found the semantic search of
        | Azure AI Search sufficient as a second method, next to vector
        | similarity. Others might find BM25 great, or a fine-tuned query
        | post-processing SLM. Depends on the use case. Test, test, test.
       | 
       | Next things we're going to try:
       | 
       | - RAPTOR
       | 
       | - SelfRAG
       | 
       | - Agentic RAG
       | 
       | - Query Refinement (expansion and sub-queries)
       | 
       | - GraphRAG
       | 
       | Learning so far:
       | 
        | - Always use a baseline and an experiment, and try to refute your
        | null hypothesis using measures like RAGAS or others.
        | 
        | - Use three types of evaluation questions/answers: 1. expert-
        | written Q&A, 2. real user questions (from logs), 3. synthetic Q&A
        | generated from your source documents
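        | 
        | For anyone wiring up the hybrid setups above, the fusion step
        | itself is tiny; a standard reciprocal rank fusion looks roughly
        | like this:
        | 
        |     def rrf(result_lists, k=60):
        |         # Merge ranked id lists from e.g. BM25 and vector search.
        |         scores = {}
        |         for results in result_lists:            # each list: best hit first
        |             for rank, doc_id in enumerate(results, start=1):
        |                 scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        |         return sorted(scores, key=scores.get, reverse=True)
        | 
        |     # fused = rrf([bm25_ids, vector_ids])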
        
         | williamcotton wrote:
         | Could you explain or link to explanations of all of the
         | acronyms you've used in your comment?
        
           | jiggawatts wrote:
           | It makes me chuckle a bit to see this kind of request in a
           | tech forum, particularly when discussing advanced LLM-related
           | topics.
           | 
            | This is akin to an HN comment asking someone to search the
            | Internet for something on their behalf, while discussing
            | search engine algorithms!
        
             | williamcotton wrote:
             | It adds useful context to the discussion and spurs further
             | conversation.
        
               | williamcotton wrote:
               | HyDE: Hypothetical Document Embeddings [1]
               | 
               | RAGAS: RAG Assessment [2]
               | 
               | RAPTOR: Recursive Abstractive Processing for Tree-
               | Organized Retrieval [3]
               | 
               | Self-RAG: Self-Reflective Retrieval-Augmented Generation
               | [4]
               | 
               | Agentic RAG: Agentic Retrieval-Augmented Generation [5]
               | 
               | GraphRAG: Graph Retrieval-Augmented Generation [6]
               | 
               | [1] https://docs.haystack.deepset.ai/docs/hypothetical-
               | document-...
               | 
               | [2] https://docs.ragas.io/en/stable/
               | 
               | [3] https://arxiv.org/html/2401.18059v1
               | 
               | [4] https://selfrag.github.io
               | 
               | [5] https://langchain-
               | ai.github.io/langgraph/tutorials/rag/langg...
               | 
               | [6] https://www.microsoft.com/en-
               | us/research/blog/graphrag-unloc...
        
             | _kb wrote:
             | A lot of people here (myself included) work across
             | different specialisations and are here to learn from
             | discussion that is intentionally unfamiliar.
        
               | jiggawatts wrote:
               | Yes, but ChatGPT knows these things! Just ask it to
               | expand the acronyms.
               | 
               | This is the new "can you Google that for me?"
        
               | _kb wrote:
                | ChatGPT can also translate from Arabic to English, but it
                | would be annoying to use it for conversation in this
                | context.
        
         | turing_complete wrote:
          | What do you think of HippoRAG? Did you try it, or plan to?
        
       | thelastparadise wrote:
       | Can someone explain simply how these benchmarks work?
       | 
       | What exactly is a "failure rate" and how is it computed?
        
         | quantadev wrote:
         | They simply ask the AI a question about a large document (or
         | set of docs). It either gets the answer right or wrong. They
         | count the number of hits and misses.
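          | 
          | Following that description, the failure rate is just the miss
          | fraction, e.g.:
          | 
          |     failures, total = 9, 200          # hypothetical counts
          |     failure_rate = failures / total   # 0.045, reported as 4.5%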
        
       | regularfry wrote:
        | I've been wondering for a while if having Elasticsearch as just
        | another function to call might be interesting. If the LLM can
        | just generate queries, it's an easy deployment.
        
       | mark_l_watson wrote:
       | I just took the time to read through all source code and docs.
       | Nice ideas. I like to experiment with LLMs running on my local
        | computer, so I will probably convert this example to use the
        | lightweight Python library Rank-BM25 instead of Elasticsearch,
        | and a long-context model running on Ollama. I wouldn't have
        | prompt caching, though.
       | 
       | This example is well written and documented, easy to understand.
       | Well done.
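        | 
        | For reference, the Rank-BM25 swap is only a few lines (BM25Okapi
        | over whitespace-tokenized chunks; real code would want better
        | tokenization):
        | 
        |     from rank_bm25 import BM25Okapi
        | 
        |     chunks = ["contextualized chunk one ...", "contextualized chunk two ..."]
        |     bm25 = BM25Okapi([c.lower().split() for c in chunks])
        | 
        |     query = "second quarter revenue growth".lower().split()
        |     top_chunks = bm25.get_top_n(query, chunks, n=2)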
        
       | msp26 wrote:
       | > If your knowledge base is smaller than 200,000 tokens (about
       | 500 pages of material)
       | 
        | I would prefer that Anthropic just release their tokeniser so we
       | don't have to make guesses.
        
       | paxys wrote:
       | Waiting for the day when the entire AI industry goes back full
       | circle to TF-IDF.
        
         | davedx wrote:
          | Yeah, it did make me chuckle. I'm guessing products like
          | Elasticsearch support all the classic text-matching algos out
          | of the box anyway?
        
       | ValentinA23 wrote:
       | Interesting. One problem I'm facing is using RAG to retrieve
       | applicable rules instead of knowledge (chunks): only rules that
       | may apply to the context should be injected into the context. I
       | haven't done any experiment, but one approach that I think could
       | work would be to train small classifiers to determine whether a
       | specific rule _could_ apply. The main LLM would be tasked with
       | determining whether the rule indeed applies or not for the
       | current context.
       | 
        | An example: let's suppose you're using an LLM to play a multi-
        | user dungeon. In the past your character has behaved badly with
        | taxis, so the game has created a rule that says that whenever
        | you try to enter a taxi you're kicked out: "we know who you are,
        | we refuse to have you as a client until you formally
       | apologize to the taxi company director". Upon apologizing, the
       | rule is removed. Note that the director of the taxi company could
       | be another player and be the one who issued the rule in the first
       | place, to be enforced by his NPC fleet of taxis.
       | 
        | I'm wondering how well this could scale (with respect to the
        | number of active rules) and to what extent traditional RAG could
        | be applied. It seems deciding whether a rule applies or not is a
       | problem that is more abstract and difficult than deciding whether
       | a chunk of knowledge is relevant or not.
       | 
       | In particular the main problem I have identified that makes it
       | more difficult is the following dependency loop that doesn't
       | appear with knowledge retrieval: you need to retrieve a rule to
        | identify whether it applies or not. Does anyone know how this
        | problem could be solved?
        
       | davedx wrote:
       | Even with prompt caching this adds a huge extra time to your
       | vector database create/update, right? That may be okay for some
       | use cases but I'm always wary of adding multiple LLM layers into
       | these kinds of applications. It's nice for the cloud LLM
       | providers of course.
       | 
        | I wonder how it would work if you generated the contexts yourself
        | algorithmically. Depending on how well structured your docs are,
        | this could be quite trivial (e.g. for an HTML doc, insert title >
        | h1 > h2 > chunk).
        
       | will-burner wrote:
       | I wish they included the datasets they used for the evaluations.
        | As far as I can tell, in Appendix II they include some sample
        | questions, answers, and golden chunks, but they do not give the
        | entire dataset or explicit information on exactly what the
        | datasets are.
       | 
       | Does anyone know if the datasets they used for the evaluation are
       | publicly available or if they give more information on the
       | datasets than what's in appendix II?
       | 
        | There are standard publicly available datasets for this type of
       | evaluation, like MTEB (https://github.com/embeddings-
       | benchmark/mteb). I wonder how this technique does on the MTEB
       | dataset.
        
       | justanotheratom wrote:
       | Looking forward to some guidance on "chunking":
       | 
       | "Chunk boundaries: Consider how you split your documents into
       | chunks. The choice of chunk size, chunk boundary, and chunk
        | overlap can affect retrieval performance."
        
       ___________________________________________________________________
       (page generated 2024-09-20 23:00 UTC)