[HN Gopher] Breaking up is hard to do: Chunking in RAG applications
___________________________________________________________________
Breaking up is hard to do: Chunking in RAG applications
Author : meysamazad
Score : 86 points
Date : 2024-06-08 08:14 UTC (14 hours ago)
(HTM) web link (stackoverflow.blog)
(TXT) w3m dump (stackoverflow.blog)
| seabass wrote:
| When they recommend using smaller chunks, how small in general
| are we talking? One sentence? One paragraph? 100 tokens or 1000?
| While I understand that the context of the data is important,
| it's hard for me to ground vague statements like that without a
| concrete realistic example. I'm curious what chunk sizes people
| have found the most success with for various tasks in the wild.
| homarp wrote:
| think of it this way: you are going to return the chunk to your user
| as 'proof'.
|
| the size therefore depends on your content style.
|
| E.g. for an HN discussion, I would go with a 'paragraph' per
| comment.
|
| in a contract, each clause.
|
| in a non-fiction book, maybe each section...
|
| you can also decide to do some kind of reverse adaptive tree:
| you chunk at the sentence level, then compare, and if chunks are
| 'close enough', you merge them into a bigger chunk
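|
| A rough sketch of that bottom-up merge in Python. The embedding
| model and the similarity threshold are illustrative choices, not
| something from the comment:
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def merge_similar(sentences, threshold=0.75):
|         """Merge adjacent sentences whose embeddings are close."""
|         if not sentences:
|             return []
|         embs = model.encode(sentences, normalize_embeddings=True)
|         merged, merged_emb = [sentences[0]], [embs[0]]
|         for sent, emb in zip(sentences[1:], embs[1:]):
|             # cosine similarity; embeddings are L2-normalized
|             if float(np.dot(merged_emb[-1], emb)) >= threshold:
|                 merged[-1] += " " + sent   # grow the current chunk
|                 merged_emb[-1] = emb       # compare against newest sentence
|             else:
|                 merged.append(sent)
|                 merged_emb.append(emb)
|         return merged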
| seabass wrote:
| Suppose you don't know the shape of the content ahead of
| time. This would be the case for most apps that allow users
| to upload their own sources.
|
| It sounds like (at the expense of more computation and time)
| the reverse adaptive tree approach you described would be
| ideal for those scenarios.
| homarp wrote:
| that's why the article talks about using machine learning
| to decide on the best strategy: to deal with unknown content
| 'shape'
| leobg wrote:
| Why not overlapping sizes? (1), (1,2), (1,2,3) Sometimes the
| match is in a single sentence. Sometimes in the full paragraph.
| Sometimes across two paragraphs. If, in your top n results,
| some of these items overlap, you use the greater unit. And you
| slide these windows.
|
| Also, I wouldn't necessarily use a "sentence" as the lower
| bound, since that can be something like "Yes."
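|
| A rough sketch of those overlapping windows in Python (window sizes
| of 1-3 sentences, per the example above; the ranking/"use the
| greater unit" step is left out):
|
|     def sliding_windows(sentences, sizes=(1, 2, 3)):
|         """Emit every 1-, 2- and 3-sentence window over the text."""
|         windows = []
|         for size in sizes:
|             for i in range(len(sentences) - size + 1):
|                 windows.append({
|                     "text": " ".join(sentences[i:i + size]),
|                     "span": (i, i + size),  # lets overlapping hits be
|                 })                          # collapsed to the larger unit
|         return windows
|
|     windows = sliding_windows(["One.", "Two, longer.", "Three."])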
| CuriouslyC wrote:
| When using vector search, you want chunks in the ~1k tokens
| range. If you're using full text search then chunks should
| probably be smaller, say a few paragraphs at most. If you use
| trigrams, you want chunks to be short, maybe even sentence
| level.
| laborcontract wrote:
| I'll share my strategy. I usually keep chunks at the max of
| 0.5% of a document or 270 tokens. Multiply that by three and
| that's the size of the sliding windows that are then
| used.
| bob1029 wrote:
| Has anyone tried skipping the embedding path in favor of
| combining a proper FTS engine with an LLM yet?
|
| A tool like Lucene seems far more competent at the task of "find
| most relevant text fragment" compared to what is realized in a
| typical vector search application today. I'd also argue that you
| get more inspectability and control this way. You could even
| manage the preferred size of the fragments on a per-document
| basis using an entirely separate heuristic at indexing time.
|
| The semantic capabilities of vector search seem nice, but could
| you not achieve a similar outcome by using the LLM to project
| synonymous OR clauses into the FTS query based upon static
| background material or prior search iteration(s)?
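|
| A rough sketch of that idea in Python: ask the LLM for synonyms and
| fold them into the FTS query as OR clauses. ask_llm() is a
| placeholder for whatever completion API you use; the query syntax
| shown is Lucene-style:
|
|     def ask_llm(prompt: str) -> str:
|         raise NotImplementedError("call your LLM of choice here")
|
|     def expand_query(user_query: str, max_terms: int = 5) -> str:
|         raw = ask_llm(
|             "List up to %d short synonyms or rephrasings of this "
|             "search query, one per line, no commentary:\n\n%s"
|             % (max_terms, user_query))
|         alternatives = [line.strip() for line in raw.splitlines()
|                         if line.strip()]
|         clauses = ['"%s"' % q
|                    for q in [user_query] + alternatives[:max_terms]]
|         # e.g. "reset password" OR "forgot login" OR ...
|         return " OR ".join(clauses)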
| lmeyerov wrote:
| Generally you do both and rank the combined results
|
| This is also why generally you don't get a vectordb and instead
| just add a vector index to the DB you are already using
|
| Re: semantic vs synonym, that's really domain dependent. As the
| number of hits goes up, and queries get more interesting, vectors
| get more interesting too. At the same time, vectors are
| heavy, so there's also the question of pushing the semantic
| aspect to the ranker vs the index, but I don't see that
| discussed much (search & storage vs compute & latency)
| bob1029 wrote:
| > There's also the question of pushing the semantic aspect to
| the ranker vs the index
|
| Could it make sense to perform dynamic vector lookup over the
| best fragments in the FTS result set? This could save a lot of money
| if you have a massive corpus to index because you'd only be
| paying to embed things that are being searched for at
| runtime.
|
| Focusing on just the best fragments could also improve the
| signal-to-noise ratio going into the final vector search phase,
| especially if
| the fragment length is managed appropriately for each kind of
| document. If we are dealing with a method from a codebase,
| then we might prefer to have an unlimited fragment length.
| For a 20 megabyte PDF, it could be closer to the size of a
| tweet.
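|
| A rough sketch of that two-stage flow in Python: FTS narrows the
| corpus, then only the surviving fragments are embedded at query
| time. fts_search() stands in for Lucene or whatever FTS engine is
| used; the model name is an illustrative choice:
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def fts_search(query: str, k: int) -> list[str]:
|         raise NotImplementedError("query your full-text index here")
|
|     def search(query: str, fts_k: int = 50, top_k: int = 5):
|         fragments = fts_search(query, fts_k)   # cheap lexical prefilter
|         if not fragments:
|             return []
|         frag_embs = model.encode(fragments, normalize_embeddings=True)
|         q_emb = model.encode([query], normalize_embeddings=True)[0]
|         scores = frag_embs @ q_emb             # cosine; vectors normalized
|         order = np.argsort(-scores)[:top_k]
|         return [fragments[i] for i in order]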
| woodson wrote:
| Query expansion has been done since forever, long before even
| word2vec and other semantic embedding models, e.g. using
| WordNet (not a DNN, despite its name), see
| https://lucene.apache.org/core/3_3_0/api/contrib-wordnet/org....
| maxlamb wrote:
| What's FTS?
| shallmn wrote:
| In this context "Full Text Search". In the context of rage-
| quitting, something entirely different.
| DandyDev wrote:
| Thank you!
|
| I really hate it when people throw around acronyms instead of
| just hitting a few extra keys on their keyboard for clarity
| prng2021 wrote:
| You can use both search methods by using hybrid search. Here's
| an implementation:
|
| https://learn.microsoft.com/en-us/azure/search/hybrid-search...
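|
| Hybrid search implementations commonly fuse the keyword and vector
| result lists with reciprocal rank fusion; a generic sketch, not
| specific to Azure:
|
|     def rrf(result_lists, k=60):
|         """result_lists: lists of doc ids, each ordered best-first."""
|         scores = {}
|         for results in result_lists:
|             for rank, doc_id in enumerate(results):
|                 scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
|         return sorted(scores, key=scores.get, reverse=True)
|
|     fused = rrf([["d3", "d1", "d7"],   # keyword/BM25 ranking
|                  ["d1", "d9", "d3"]])  # vector ranking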
| manishsharan wrote:
| I wonder how I could implement something similar using Lucene
| and a vector db?
| bob1029 wrote:
| Build a service that talks to both at the same time, or
| maybe some project like this:
|
| https://github.com/JuniusLuo/VecLucene
| esafak wrote:
| All the vector stores are going hybrid; it's becoming a
| table stakes feature.
| diptanu wrote:
| We use grid search to figure out the best chunking strategy to
| use. Create a bunch of different strategies, such as recursive
| chunking, semantic chunking, etc., parameterize them, and see
| which one works best. The "best" chunking strategy depends on the
| nature of the documents and the questions being asked.
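|
| A rough sketch of that grid search in Python. build_index() and
| retrieve() are placeholders for your own pipeline, and the grid
| values are just examples:
|
|     from itertools import product
|
|     def build_index(strategy, chunk_size, overlap):
|         raise NotImplementedError("chunk the corpus and index it here")
|
|     def retrieve(index, query, k):
|         raise NotImplementedError("run your retriever here")
|
|     def score(strategy, chunk_size, overlap, eval_set):
|         index = build_index(strategy, chunk_size, overlap)
|         hits = sum(gold in retrieve(index, q, k=5) for q, gold in eval_set)
|         return hits / len(eval_set)            # recall@5 on the eval set
|
|     def best_config(eval_set):
|         grid = product(["recursive", "semantic", "fixed"],  # strategies
|                        [256, 512, 1024],         # chunk sizes (tokens)
|                        [0, 64, 128])             # overlap (tokens)
|         return max(grid, key=lambda cfg: score(*cfg, eval_set))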
| stoicjumbotron wrote:
| What would be the chunking strategy for Q&A pairs? Right now I'm
| embedding the complete question and answer, but the query results
| are not good, as the responses contain data not related to the
| question at all.
| harpastum wrote:
| It depends on the specifics of your format, but we've had
| success embedding the questions and answers separately. If
| either match, you return the complete question and answer text.
| Make sure to deduplicate before returning, in case both match.
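|
| A rough sketch of that layout in Python (names are made up):
|
|     def index_pairs(pairs):
|         """pairs: list of (question, answer). Embed each text
|         separately but point both back at the same pair id."""
|         texts, pair_ids = [], []
|         for pair_id, (q, a) in enumerate(pairs):
|             texts.extend([q, a])           # two embeddings per pair
|             pair_ids.extend([pair_id, pair_id])
|         return texts, pair_ids
|
|     def dedupe(hit_pair_ids, pairs):
|         """Return each full Q&A pair once, best hit first."""
|         seen, out = set(), []
|         for pid in hit_pair_ids:
|             if pid not in seen:
|                 seen.add(pid)
|                 out.append(pairs[pid])
|         return out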
| laborcontract wrote:
| Join q and a when vectorizing them. Questions alone are too
| short to carry a lot of semantic richness.
|
| When you get a query, you then run two semantic search queries:
| one using the original question and one using a HyDE version of
| the question. Take those results and run them through Cohere's
| rerank.
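|
| A rough sketch of that flow in Python. ask_llm() and vector_search()
| are placeholders; the Cohere rerank call follows their Python SDK,
| but check the model name and signature against your SDK version:
|
|     import cohere
|
|     def ask_llm(prompt: str) -> str:
|         raise NotImplementedError("generate a hypothetical answer here")
|
|     def vector_search(text: str, k: int) -> list[str]:
|         raise NotImplementedError("query your vector index here")
|
|     def retrieve(question: str, top_n: int = 5) -> list[str]:
|         hyde = ask_llm("Write a short passage answering: " + question)
|         candidates = list(dict.fromkeys(     # pool + dedupe, keep order
|             vector_search(question, 20) + vector_search(hyde, 20)))
|         co = cohere.Client("YOUR_API_KEY")   # placeholder key
|         reranked = co.rerank(model="rerank-english-v3.0", query=question,
|                              documents=candidates, top_n=top_n)
|         return [candidates[r.index] for r in reranked.results]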
| constantinum wrote:
| With ever-increasing context sizes -- Claude 3 Sonnet has a
| 204,800-token context, roughly a 500-page document -- sometimes not
| chunking at all, while still optimizing for cost and latency,
| might be the better solution. Strategies like summary extraction
| and single-pass extraction work well [1].
|
| [1] - https://docs.unstract.com/editions/cloud_edition#summary-ext...
| J_Shelby_J wrote:
| In theory. In practice, context size degrades performance.
| J_Shelby_J wrote:
| The problem with most chunking schemes I've seen is they're
| naive. They don't care about content; only token count. That's
| fine, but what would be best is to chunk by topic.
|
| I'm currently trying to implement chunking by topic using an LLM.
| It's much slower, but I hope it will be a huge win in retrieval
| accuracy. The first step is to extract topics from the document by
| asking the LLM to identify all topics; then split the text into
| sentences and feed each sentence to the LLM to identify its
| topic. I'm hoping the result will be the original text, split by
| topics. From there, they can be further chunked if needed. Of
| course, it could be done by just asking the LLM in one shot to
| summarize each topic in the document, but the more the LLM is
| relied on to write, the more distortion is introduced. Retaining
| the original text is the goal and the LLM should just be used for
| decision making.
|
| Here is the crate I'm working out of. The chunking hasn't been
| pushed, but you can see the decision making workflows.
| https://github.com/ShelbyJenkins/llm_client
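|
| A rough Python sketch of the workflow described above (the linked
| crate is Rust); llm() is a placeholder for the model call:
|
|     from itertools import groupby
|
|     def llm(prompt: str) -> str:
|         raise NotImplementedError("call your model here")
|
|     def chunk_by_topic(document: str, sentences: list[str]):
|         # one call to list the topics, then one call per sentence to
|         # label it, so the original text is never rewritten
|         topics = [t.strip() for t in llm(
|             "List the distinct topics in this document, one per "
|             "line:\n\n" + document).splitlines() if t.strip()]
|         labeled = [(llm("Which of these topics does the sentence "
|                         "belong to? Answer with the topic only.\n"
|                         "Topics: %s\nSentence: %s"
|                         % (", ".join(topics), s)).strip(), s)
|                    for s in sentences]
|         # merge consecutive sentences that share a topic into chunks
|         return [(topic, " ".join(s for _, s in group))
|                 for topic, group in groupby(labeled, key=lambda x: x[0])]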
| Swizec wrote:
| > They don't care about content; only token count. That's fine,
| but what would be best is to chunk by topic.
|
| I've had a lot of success with chunking documents by subtitle.
| Works especially well for web published documents because each
| section tends to be fairly short.
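|
| A rough sketch of that splitter for markdown-ish web content:
|
|     import re
|
|     def split_by_heading(markdown: str) -> list[str]:
|         # break before every heading line (#, ##, ### ...), so each
|         # chunk is one titled section
|         parts = re.split(r"(?m)^(?=#{1,6}\s)", markdown)
|         return [p.strip() for p in parts if p.strip()]
|
|     chunks = split_by_heading("# Intro\nHello.\n\n## Setup\nInstall.\n")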
| serjester wrote:
| You're basically reinventing embeddings. Using an LLM for this
| is overkill since capturing the meaning of a sentence can be
| done with a model 1000x smaller.
|
| I have a chunking library that does something similar to this,
| and there are actually quite a few different libraries that have
| implemented some variant of "semantic chunking".
|
| [1] https://github.com/Filimoa/open-parse
| nostrebored wrote:
| Excellent, now show me a single one working in production
| with various docs.
|
| I've talked through this problem with dozens of startups
| working on RAG, and everyone has the same problem.
|
| LLMs are arguably not reinventing embeddings if you're using
| them to infer the structure of a broad document.
| Understanding the "external structure" of the document, the
| internal structure of each segment, and the semantic
| structure of each chunk is important.
| serjester wrote:
| See my other comment on this thread - larger and larger
| context windows make perfect chunking irrelevant. We've
| ingested millions of docs and have a product in production;
| using LLMs for chunking would have 10xed our ingestion
| costs for a marginal performance improvement.
| emporas wrote:
| Why not use langchain-rust and make your own client? If you
| don't know about langchain, I think you are missing out. I took
| a look at other langchain implementations in JS and Python; in
| each one people have done some serious work. Langchain-rust
| also uses tree-sitter to chunk code, and it worked very well in
| some quick tests I ran.
|
| >The problem with most chunking schemes I've seen is they're
| naive. They don't care about content; only token count.
|
| I think controlling different inputs depending on context is
| used in agents. For the moment I haven't seen anything really
| impressive coming out of agents. Maybe Perplexity-style web
| search, but nothing more.
| dcsan wrote:
| Isn't this just semantic chunking? There have been white papers
| and implementations in LangChain and LlamaIndex already that you
| could look through.
| serjester wrote:
| As embedding models become more performant and context windows
| increase, "ideal chunking" becomes less relevant.
|
| Cost isn't as important to us, so we use small chunks and then
| just pull in the page before and after. If you do this on 20+
| matches (since you're decomposing the query multiple times),
| you're very likely finding the content.
|
| Queries can get more expensive but you're getting a corpus of
| "great answers" to test against as you refine your approach.
| Model costs are also plummeting which makes brute forcing it more
| and more viable.
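|
| A rough sketch of that expansion step in Python, assuming each small
| chunk was stored with its page number:
|
|     def expand_matches(matches, pages):
|         """matches: [{"page": 12, "text": ...}, ...], best first.
|         pages: {page number: full page text}."""
|         context, seen = [], set()
|         for m in matches:
|             for p in (m["page"] - 1, m["page"], m["page"] + 1):
|                 if p in pages and p not in seen:
|                     seen.add(p)
|                     context.append(pages[p])
|         return "\n\n".join(context)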
| emporas wrote:
| Bigger context windows don't work as well as advertised. The
| "hay in a haystack" problem is not solved yet.
|
| Also bigger context windows mean a lot more time waiting for an
| answer. Given the quadratic nature of context windows, we are
| stuck using transformers in smaller chunks. Other architectures
| like Mamba may solve that, but even then, increases in context
| window accuracy are not 1000x.
___________________________________________________________________
(page generated 2024-06-08 23:01 UTC)