[HN Gopher] Breaking up is hard to do: Chunking in RAG applications
       ___________________________________________________________________
        
       Breaking up is hard to do: Chunking in RAG applications
        
       Author : meysamazad
       Score  : 86 points
       Date   : 2024-06-08 08:14 UTC (14 hours ago)
        
 (HTM) web link (stackoverflow.blog)
 (TXT) w3m dump (stackoverflow.blog)
        
       | seabass wrote:
       | When they recommend using smaller chunks, how small in general
       | are we talking? One sentence? One paragraph? 100 tokens or 1000?
       | While I understand that the context of the data is important,
       | it's hard for me to ground vague statements like that without a
       | concrete realistic example. I'm curious what chunk sizes people
       | have found the most success with for various tasks in the wild.
        
         | homarp wrote:
         | think this way: you are going to return the chunk to your user
         | as 'proof'.
         | 
          | the size therefore depends on your content style.
         | 
          | E.g. for an HN discussion, I would go with a 'paragraph' for
          | each comment.
         | 
         | in a contract, each clause.
         | 
         | in a non-fiction book, maybe each section...
         | 
          | you can also do some kind of reverse adaptive tree: chunk at
          | the sentence level, then compare neighbouring chunks; if they
          | are 'close enough', merge them into a bigger chunk
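          | 
          | a minimal sketch of that merge-until-similar idea (model name
          | and threshold are arbitrary picks, just for illustration):
          | 
          |     # pip install sentence-transformers numpy
          |     from sentence_transformers import SentenceTransformer
          |     import numpy as np
          | 
          |     model = SentenceTransformer("all-MiniLM-L6-v2")
          | 
          |     def merge_similar(sentences, threshold=0.75):
          |         # greedily merge adjacent sentences with close embeddings
          |         embs = model.encode(sentences, normalize_embeddings=True)
          |         chunks, cur, cur_emb = [], sentences[0], embs[0]
          |         for sent, emb in zip(sentences[1:], embs[1:]):
          |             if float(np.dot(cur_emb, emb)) >= threshold:
          |                 cur += " " + sent            # close enough: grow
          |                 cur_emb = (cur_emb + emb) / 2
          |             else:
          |                 chunks.append(cur)
          |                 cur, cur_emb = sent, emb
          |         chunks.append(cur)
          |         return chunks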
        
           | seabass wrote:
           | Suppose you don't know the shape of the content ahead of
           | time. This would be the case for most apps that allow users
           | to upload their own sources.
           | 
           | It sounds like (at the expense of more computation and time)
           | the reverse adaptive tree approach you described would be
           | ideal for those scenarios.
        
             | homarp wrote:
              | that's why the article talks about using machine learning
              | to decide on the best strategy: to deal with unknown
              | content 'shape'
        
         | leobg wrote:
         | Why not overlapping sizes? (1), (1,2), (1,2,3) Sometimes the
         | match is in a single sentence. Sometimes in the full paragraph.
         | Sometimes across two paragraphs. If, in your top n results,
         | some of these items overlap, you use the greater unit. And you
         | slide these windows.
         | 
         | Also, I wouldn't necessarily use a "sentence" as the lower
         | bound, since that can be something like "Yes."
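          | 
          | Roughly like this (a pure-Python sketch of the windowing and the
          | "keep the greater unit" rule; nothing here is from a library):
          | 
          |     def windows(sentences, max_size=3):
          |         # overlapping windows of 1, 2 and 3 consecutive sentences
          |         out = []
          |         for size in range(1, max_size + 1):
          |             for i in range(len(sentences) - size + 1):
          |                 out.append({"start": i, "end": i + size,
          |                             "text": " ".join(sentences[i:i + size])})
          |         return out
          | 
          |     def keep_greater_unit(top_n):
          |         # among overlapping hits, keep only the largest window
          |         top_n = sorted(top_n, key=lambda h: h["end"] - h["start"],
          |                        reverse=True)
          |         kept = []
          |         for h in top_n:
          |             if not any(h["start"] < k["end"] and k["start"] < h["end"]
          |                        for k in kept):
          |                 kept.append(h)
          |         return kept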
        
         | CuriouslyC wrote:
          | When using vector search, you want chunks in the ~1k token
          | range. If you're using full text search, then chunks should
         | probably be smaller, say a few paragraphs at most. If you use
         | trigrams, you want chunks to be short, maybe even sentence
         | level.
        
         | laborcontract wrote:
          | I'll share my strategy. I usually keep chunks at the max of
          | 0.5% of a document or 270 tokens. Multiply that by three and
          | that's the size of the sliding windows that are then queried.
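          | 
          | In code the sizing rule is just this (reading "max of 0.5% or
          | 270 tokens" literally):
          | 
          |     def chunk_and_window_size(doc_tokens):
          |         chunk = max(int(0.005 * doc_tokens), 270)  # 0.5% or 270
          |         window = 3 * chunk                         # 3x the chunk
          |         return chunk, window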
        
       | bob1029 wrote:
       | Has anyone tried skipping the embedding path in favor of
       | combining a proper FTS engine with an LLM yet?
       | 
       | A tool like Lucene seems far more competent at the task of "find
       | most relevant text fragment" compared to what is realized in a
       | typical vector search application today. I'd also argue that you
       | get more inspectability and control this way. You could even
       | manage the preferred size of the fragments on a per-document
       | basis using an entirely separate heuristic at indexing time.
       | 
       | The semantic capabilities of vector search seem nice, but could
       | you not achieve a similar outcome by using the LLM to project
       | synonymous OR clauses into the FTS query based upon static
       | background material or prior search iteration(s)?
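        | 
        | Something like this is what I have in mind (ask_llm is a stand-in
        | for whatever LLM call you use; the output is a Lucene-style query
        | string):
        | 
        |     def expand_query(query: str, ask_llm) -> str:
        |         prompt = ("List up to five synonyms or closely related "
        |                   f"terms for this search query, comma separated: "
        |                   f"{query}")
        |         terms = query.split()
        |         terms += [t.strip() for t in ask_llm(prompt).split(",")
        |                   if t.strip()]
        |         # project the synonyms into OR clauses next to the originals
        |         return " OR ".join(f'"{t}"' for t in dict.fromkeys(terms))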
        
         | lmeyerov wrote:
         | Generally you do both and rank the combined results
         | 
         | This is also why generally you don't get a vectordb and instead
         | just add a vector index to the DB you are already using
         | 
          | Re: semantic vs synonym, that's really domain dependent. As
          | the number of hits goes up and queries get more interesting,
          | the more interesting vectors become. At the same time, vectors
          | are heavy, so there's also the question of pushing the
          | semantic aspect to the ranker vs the index, but I don't see
          | that discussed much (search & storage vs compute & latency)
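          | 
          | e.g. something as simple as reciprocal rank fusion over the two
          | ranked id lists (a sketch, not tied to any particular DB):
          | 
          |     def rrf(fts_ids, vector_ids, k=60):
          |         # reciprocal rank fusion: sum of 1 / (k + rank) per list
          |         scores = {}
          |         for ranked in (fts_ids, vector_ids):
          |             for rank, doc_id in enumerate(ranked, start=1):
          |                 scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
          |         return sorted(scores, key=scores.get, reverse=True)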
        
           | bob1029 wrote:
           | > There's also the question of pushing the semantic aspect to
           | the ranker vs the index
           | 
            | Could it make sense to perform dynamic vector lookup over
            | the best fragments from the FTS result set? This could save
            | a lot of money if you have a massive corpus to index,
            | because you'd only be paying to embed things that are being
            | searched for at runtime.
           | 
           | Focusing on just the best fragments could also improve the
           | SnR going into the final vector search phase, especially if
           | the fragment length is managed appropriately for each kind of
           | document. If we are dealing with a method from a codebase,
           | then we might prefer to have an unlimited fragment length.
           | For a 20 megabyte PDF, it could be closer to the size of a
           | tweet.
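            | 
            | Roughly what I'm picturing (fts_search and embed here are
            | placeholders for whatever FTS engine and embedding model you
            | already have):
            | 
            |     import numpy as np
            | 
            |     def rerank_fts_hits(query, fts_search, embed, k=50, top=5):
            |         fragments = fts_search(query, limit=k)  # cheap lexical recall
            |         vecs = embed([query] + fragments)       # embed only finalists
            |         q, frags = vecs[0], vecs[1:]
            |         sims = [float(np.dot(q, f)) for f in frags]  # normalized vecs
            |         order = np.argsort(sims)[::-1][:top]
            |         return [fragments[i] for i in order]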
        
         | woodson wrote:
         | Query expansion has been done since forever, long before even
         | word2vec and other semantic embedding models, e.g. using
         | WordNet (not a DNN, despite its name), see
         | https://lucene.apache.org/core/3_3_0/api/contrib-
         | wordnet/org....
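          | 
          | The same idea in Python with NLTK's WordNet, for anyone who
          | doesn't want the Lucene contrib (needs nltk.download("wordnet")):
          | 
          |     from nltk.corpus import wordnet as wn
          | 
          |     def expand(term):
          |         # collect lemma names from every synset of the term
          |         names = {lemma.name().replace("_", " ")
          |                  for syn in wn.synsets(term)
          |                  for lemma in syn.lemmas()}
          |         names.add(term)
          |         return " OR ".join(sorted(names))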
        
         | maxlamb wrote:
         | What's FTS?
        
           | shallmn wrote:
           | In this context "Full Text Search". In the context of rage-
           | quitting, something entirely different.
        
           | DandyDev wrote:
           | Thank you!
           | 
           | I really hate it when people throw around acronyms instead of
           | just hitting a few extra keys on their keyboard for clarity
        
         | prng2021 wrote:
         | You can use both search methods by using hybrid search. Here's
         | an implementation:
         | 
         | https://learn.microsoft.com/en-us/azure/search/hybrid-search...
        
           | manishsharan wrote:
            | I wonder how I could implement something similar using
            | Lucene and a vector db?
        
             | bob1029 wrote:
             | Build a service that talks to both at the same time, or
             | maybe some project like this:
             | 
             | https://github.com/JuniusLuo/VecLucene
        
             | esafak wrote:
             | All the vector stores are going hybrid; it's becoming a
             | table stakes feature.
        
       | diptanu wrote:
        | We use grid search to figure out the best chunking strategy. We
        | create a bunch of different strategies, such as recursive
        | chunking, semantic chunking, etc., parameterize them, and see
        | which one works best. The "best" chunking strategy depends on
        | the nature of the documents and the questions being asked.
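        | 
        | The loop itself is simple; the work is in the chunkers and the
        | eval set. A rough sketch (chunkers, build_index and score_fn are
        | placeholder names for your own pieces):
        | 
        |     from itertools import product
        | 
        |     def grid_search(chunkers, sizes, overlaps, questions,
        |                     build_index, score_fn):
        |         # try every (strategy, size, overlap) combo, keep the best
        |         best = None
        |         for name, size, overlap in product(chunkers, sizes, overlaps):
        |             index = build_index(chunkers[name](size, overlap))
        |             score = sum(score_fn(index, q) for q in questions)
        |             if best is None or score > best[0]:
        |                 best = (score, name, size, overlap)
        |         return best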
        
       | stoicjumbotron wrote:
        | What would be the chunking strategy for Q&A pairs? Right now I'm
        | embedding the complete question and answer, but the query
        | results are not good: the responses contain data not related to
        | the question at all.
        
         | harpastum wrote:
         | It depends on the specifics of your format, but we've had
         | success embedding the questions and answers separately. If
         | either match, you return the complete question and answer text.
         | Make sure to deduplicate before returning, in case both match.
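          | 
          | A sketch of that layout (add_to_index and search stand in for
          | your vector store):
          | 
          |     def index_pairs(pairs, add_to_index):
          |         # index question and answer separately, tagged with the pair
          |         for pair_id, (q, a) in enumerate(pairs):
          |             add_to_index(text=q, pair_id=pair_id)
          |             add_to_index(text=a, pair_id=pair_id)
          | 
          |     def retrieve(query, search, pairs, top=5):
          |         out, seen = [], set()
          |         for hit in search(query, limit=top * 2):
          |             pid = hit["pair_id"]
          |             if pid not in seen:               # Q and A may both match
          |                 seen.add(pid)
          |                 q, a = pairs[pid]
          |                 out.append(f"Q: {q}\nA: {a}")  # return the full pair
          |         return out[:top]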
        
         | laborcontract wrote:
          | Join the question and answer when vectorizing them. Questions
          | alone are too short to carry a lot of semantic richness.
          | 
          | When you get a query, you then run two semantic search
          | queries: one using the original question and one using a HyDE
          | version of the question. Take those results and run them
          | through Cohere's rerank.
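          | 
          | Roughly (ask_llm and vector_search are placeholders; the Cohere
          | rerank call is from memory, so check the current SDK):
          | 
          |     def answer_query(question, ask_llm, vector_search, co, top=5):
          |         # co is a cohere.Client instance
          |         hyde = ask_llm(f"Write a short answer to: {question}")
          |         hits = vector_search(question) + vector_search(hyde)
          |         docs = list(dict.fromkeys(hits))       # dedupe, keep order
          |         ranked = co.rerank(model="rerank-english-v3.0",
          |                            query=question, documents=docs, top_n=top)
          |         return [docs[r.index] for r in ranked.results]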
        
       | constantinum wrote:
        | With ever-increasing context sizes -- Claude 3 Sonnet has a
        | 204,800 token context, enough for a 500 page document --
        | sometimes not chunking at all, while still optimizing for cost
        | and latency, can be the better solution. Strategies like summary
        | extraction and single-pass extraction work well [1].
       | 
       | [1] - https://docs.unstract.com/editions/cloud_edition#summary-
       | ext...
        
         | J_Shelby_J wrote:
          | In theory. In practice, performance degrades as the context
          | grows.
        
       | J_Shelby_J wrote:
       | The problem with most chunking schemes I've seen is they're
       | naive. They don't care about content; only token count. That's
       | fine, but what would be best is to chunk by topic.
       | 
        | I'm currently trying to implement chunking by topic using an
        | LLM. It's much slower, but I hope it will be a huge win in
        | retrieval accuracy. The first step is to ask the LLM to identify
        | all of the topics in the document, then split the document into
        | sentences and feed each sentence to the LLM to assign it a
        | topic. I'm hoping the result will be the original text, split by
        | topic. From there, the chunks can be split further if needed. Of
        | course, it could be done in one shot by asking the LLM to
        | summarize each topic in the document, but the more the LLM is
        | relied on to write, the more distortion is introduced. Retaining
        | the original text is the goal; the LLM should just be used for
        | decision making.
       | 
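        | In Python-ish pseudocode the flow looks like this (llm and
        | split_sentences are placeholders; the actual implementation lives
        | in the Rust crate below):
        | 
        |     def chunk_by_topic(document, llm, split_sentences):
        |         # step 1: ask the LLM for the list of topics in the document
        |         raw = llm(f"List the distinct topics in this document:\n"
        |                   f"{document}")
        |         topics = [t.strip("- ").strip()
        |                   for t in raw.splitlines() if t.strip()]
        |         # step 2: route each sentence to a topic; the LLM only
        |         # decides, the original text is kept verbatim
        |         buckets = {t: [] for t in topics}
        |         for sentence in split_sentences(document):
        |             label = llm(f"Topics: {topics}\nSentence: {sentence}\n"
        |                         "Answer with the single best-matching topic.")
        |             buckets.setdefault(label.strip(), []).append(sentence)
        |         return {t: " ".join(s) for t, s in buckets.items()}
        | 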
       | Here is the crate I'm working out of. The chunking hasn't been
       | pushed, but you can see the decision making workflows.
       | https://github.com/ShelbyJenkins/llm_client
        
         | Swizec wrote:
         | > They don't care about content; only token count. That's fine,
         | but what would be best is to chunk by topic.
         | 
         | I've had a lot of success with chunking documents by subtitle.
         | Works especially well for web published documents because each
         | section tends to be fairly short.
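          | 
          | For markdown-ish web content that can be as simple as splitting
          | on the headings (a quick sketch, not what I actually run):
          | 
          |     import re
          | 
          |     def chunk_by_heading(markdown_text):
          |         # split on headings, keeping each heading with its section
          |         parts = re.split(r"(?m)^(#{1,6} .*)$", markdown_text)
          |         chunks, current = [], ""
          |         for part in parts:
          |             if re.match(r"#{1,6} ", part):
          |                 if current.strip():
          |                     chunks.append(current.strip())
          |                 current = part + "\n"
          |             else:
          |                 current += part
          |         if current.strip():
          |             chunks.append(current.strip())
          |         return chunks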
        
         | serjester wrote:
         | You're basically reinventing embeddings. Using an LLM for this
         | is overkill since capturing the meaning of a sentence can be
         | done with a model 1000x smaller.
         | 
          | I have a chunking library that does something similar to this,
          | and there are actually quite a few different libraries that
          | have implemented some variant of "semantic chunking".
         | 
         | [1] https://github.com/Filimoa/open-parse
        
           | nostrebored wrote:
           | Excellent, now show me a single one working in production
           | with various docs.
           | 
           | I've talked through this problem with dozens of startups
           | working on RAG, and everyone has the same problem.
           | 
           | LLMs are arguably not reinventing embeddings if you're using
           | them to infer the structure of a broad document.
           | Understanding the "external structure" of the document, the
           | internal structure of each segment, and the semantic
           | structure of each chunk is important.
        
             | serjester wrote:
              | See my other comment on this thread - larger and larger
              | context windows make perfect chunking irrelevant. We've
              | ingested millions of docs and have a product in
              | production; using LLMs for chunking would have 10Xed our
              | ingestion costs for a marginal performance improvement.
        
         | emporas wrote:
          | Why not use langchain-rust and make your own client? If you
          | don't know about langchain, I think you are missing out. I
          | took a look at other langchain implementations in JS and
          | Python; in each one people have done some serious work.
          | Langchain-rust also uses tree-sitter to chunk code, and it
          | works very well in some quick tests I tried.
         | 
         | >The problem with most chunking schemes I've seen is they're
         | naive. They don't care about content; only token count.
         | 
          | I think controlling different inputs depending on context is
          | used in agents. For the moment I haven't seen anything really
          | impressive coming out of agents. Maybe Perplexity-style web
          | search, but nothing more.
        
         | dcsan wrote:
         | isn't this just semantic chunking? There's been white papers
         | and already implementations in Langchain and llama index you
         | could look through.
        
       | serjester wrote:
       | As embedding models become more performant and context windows
       | increase, "ideal chunking" becomes less relevant.
       | 
       | Cost isn't as important to us, so we use small chunks and then
       | just pull in the page before and after. If you do this on 20+
       | matches (since you're decomposing the query multiple times),
        | you're very likely to find the content.
       | 
       | Queries can get more expensive but you're getting a corpus of
       | "great answers" to test against as you refine your approach.
        | Model costs are also plummeting, which makes brute-forcing it
        | more and more viable.
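        | 
        | The "pull in the neighbors" step is basically just this (pages
        | maps page number to text; search is your retriever over the small
        | chunks):
        | 
        |     def expand_hits(query, search, pages, top=20):
        |         hits = search(query, limit=top)       # small-chunk matches
        |         wanted = set()
        |         for hit in hits:
        |             p = hit["page"]
        |             wanted.update({p - 1, p, p + 1})  # page before and after
        |         return [pages[p] for p in sorted(wanted) if p in pages]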
        
         | emporas wrote:
          | Bigger context windows don't work as well as advertised. The
          | "needle in a haystack" problem is not solved yet.
          | 
          | Also, bigger context windows mean a lot more time waiting for
          | an answer. Given that attention cost grows quadratically with
          | context length, we are stuck feeding transformers smaller
          | chunks. Other architectures like Mamba may solve that, but
          | even then, increases in context window accuracy are not 1000x.
        
       ___________________________________________________________________
       (page generated 2024-06-08 23:01 UTC)