[HN Gopher] Show HN: Chonkie - A Fast, Lightweight Text Chunking...
___________________________________________________________________
Show HN: Chonkie - A Fast, Lightweight Text Chunking Library for
RAG
I built Chonkie because I was tired of rewriting chunking code for
RAG applications. Existing libraries were either too bloated (80MB+)
or too basic, with no middle ground.

Core features:
- 21MB default install vs 80-171MB alternatives
- 33x faster token chunking than popular alternatives
- Supports multiple chunking strategies: token, word, sentence, and
  semantic
- Works with all major tokenizers (transformers, tokenizers, tiktoken)
- Zero external dependencies for basic functionality

Technical optimizations:
- Uses tiktoken with multi-threading for faster tokenization
- Implements aggressive caching and precomputation
- Running mean pooling for efficient semantic chunking (see the
  sketch below)
- Modular dependency system (install only what you need)

Benchmarks and code: https://github.com/bhavnicksm/chonkie

Looking for feedback on the architecture and performance
optimizations. What other chunking strategies would be useful for
RAG applications?
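
As an illustration of the running-mean-pooling idea listed above, here is
a minimal sketch of semantic chunking that keeps an incrementally updated
mean embedding for the current chunk. It is not Chonkie's actual
implementation; the sentence-transformers model name and the similarity
threshold are assumptions.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def semantic_chunks(sentences, model_name="all-MiniLM-L6-v2", threshold=0.7):
    """Group consecutive sentences while each new sentence stays similar
    to the running mean embedding of the current chunk."""
    if not sentences:
        return []
    model = SentenceTransformer(model_name)
    embs = model.encode(sentences, normalize_embeddings=True)

    chunks, current = [], [sentences[0]]
    mean = embs[0].copy()          # running mean of the current chunk
    n = 1
    for sent, emb in zip(sentences[1:], embs[1:]):
        sim = float(np.dot(emb, mean) / (np.linalg.norm(mean) + 1e-12))
        if sim < threshold:        # semantic break: start a new chunk
            chunks.append(" ".join(current))
            current, mean, n = [sent], emb.copy(), 1
        else:                      # extend chunk, update the mean in O(1)
            current.append(sent)
            mean = (mean * n + emb) / (n + 1)
            n += 1
    chunks.append(" ".join(current))
    return chunks
```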
Author : bhavnicksm
Score : 100 points
Date : 2024-11-10 15:58 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| mixeden wrote:
| > Token Chunking: 33x faster than the slowest alternative
|
| 1) what
| rkharsan64 wrote:
| There are only 3 competitors in that particular benchmark, and
| the speedup compared to the 2nd-fastest is only 1.06x.
|
| Edit: Also, from the same table, it seems that only this
| library was run after warming up, while the others were not.
| https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...
| bhavnicksm wrote:
| TokenChunking is limited mostly by the tokenizer and much less
| by the chunking algorithm. Tiktoken tokenizers seem to do better
| with warm-up, which Chonkie defaults to -- which is also what
| the 2nd-place library is using.
|
| Algorithmically, there's not much difference in TokenChunking
| between Chonkie, LangChain, or any other implementation you
| might use (except LlamaIndex; I don't know what they did to end
| up 33x slower).
|
| If all you want is TokenChunking (which I don't entirely
| recommend), you're better off writing your own for production
| than using Chonkie or LangChain :) At the very least, don't
| install an 80MiB package just for TokenChunking; Chonkie is 4x
| smaller than those.
|
| That's just my honest response... And these benchmarks are only
| the beginning: future optimizations to SemanticChunking should
| push its speed-up over the current 2nd-place library well past
| the 2.5x it is today.
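
For anyone taking the "write your own" suggestion above, a minimal
sketch of fixed-size token chunking built directly on tiktoken. The
encoding name, chunk size, and overlap are illustrative defaults, not
Chonkie's.

```python
import tiktoken

def token_chunks(text, chunk_size=512, overlap=64, encoding="cl100k_base"):
    """Encode once, slice the token stream into fixed-size windows,
    and decode each window back to text."""
    enc = tiktoken.get_encoding(encoding)
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), step)]
```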
| melony wrote:
| How does it compare with NLTK's chunking library? I have
| found that it works very well for sentence segmentation.
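
For reference on this point: NLTK's Punkt model handles the sentence
segmentation step only; assembling sentences into token-bounded chunks
is left to the caller. A small usage sketch (the sample text is
illustrative):

```python
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # newer NLTK releases use "punkt_tab"

text = "Dr. Smith went to Washington. He gave a talk. Questions followed."
print(sent_tokenize(text))
# ['Dr. Smith went to Washington.', 'He gave a talk.', 'Questions followed.']
```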
| samlinnfer wrote:
| How does it work for code? (Chunking code that is)
| nostrebored wrote:
| Poorly, just like it does for text.
|
| Chunking is easily where all of these problems die beyond PoC
| scale.
|
| I've talked to multiple code generation companies in the past
| week -- most are stuck with BM25 and taking in whole files.
| bhavnicksm wrote:
| Right now, we haven't added support for code -- things like
| comment markers (#, //) contain punctuation that adversely
| affects chunking, and indentation raises further issues.
|
| But it's on the roadmap, so please hold on!
| bravura wrote:
| One thing I've been looking for, and was a bit tricky
| implementing myself to be very fast, is this:
|
| I have a particular max token length in mind, and I have a
| tokenizer like tiktoken. I have a string and I want to quickly
| find the maximum length truncation of the string that is <=
| target max token length.
|
| Does chonkie handle this?
| bhavnicksm wrote:
| I'm not sure I fully understand what you mean by "maximum length
| truncation of the string", but if you mean splitting the text
| into chunks whose token counts stay below a pre-specified
| max_token length, then yes!
|
| Is that what you meant?
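
A minimal sketch of the truncation bravura describes, using tiktoken
directly (not Chonkie's API): encode once, keep the first max_tokens
tokens, and decode. This avoids searching over character lengths.

```python
import tiktoken

def truncate_to_tokens(text, max_tokens, encoding="cl100k_base"):
    """Longest prefix of `text` that fits within `max_tokens` tokens."""
    enc = tiktoken.get_encoding(encoding)
    tokens = enc.encode(text)
    if len(tokens) <= max_tokens:
        return text
    # Note: a token boundary can fall inside a multi-byte character;
    # tiktoken's decode substitutes U+FFFD in that case.
    return enc.decode(tokens[:max_tokens])
```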
| spullara wrote:
| 21MB? to split text? have you analyzed the footprint?
| bhavnicksm wrote:
| Just to clarify, the 21MB figure is the installed size of the
| package itself! The alternative packages are much larger.
|
| The memory footprint of the chunking itself varies widely with
| the dataset, and it's not something we have benchmarked... other
| libraries usually don't test it either, as long as it doesn't
| overwhelm the machine.
|
| If runtime memory usage is important for your application, let
| me know! I'd run some benchmarks for it...
|
| Thanks!
| ch1kkenm4ss4 wrote:
| Chonkie and lightweight? Good naming!
| bhavnicksm wrote:
| Haha~ thanks!
| opendang wrote:
| Wow, a whole repo of AI-generated slop, complete with an AI-
| generated logo. Anyone who runs this code unsandboxed is braver
| than I.
| xivusr wrote:
| IMO comments like this go against the spirit of HN - why not
| offer more constructive feedback? Implying defects without
| evidence or suggestions for improvement is low effort, the kind
| of thing I expect in a YouTube comment thread, not on HN.
| simonw wrote:
| Would it make sense for this to offer a chunking strategy that
| doesn't need a tokenizer at all? I love the goal to keep it
| small, but "tokenizers" is still a pretty huge dependency (and
| one that isn't currently compatible with Python 3.13).
|
| I've been hoping to find an ultra light-weight chunking library
| that can do things like very simple regex-based
| sentence/paragraph/markdown-aware chunking with minimal
| additional dependencies.
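
In the spirit of what simonw is asking for, a dependency-free sketch:
split on paragraph and sentence boundaries with regexes, then greedily
pack pieces up to a budget. The ~4 chars-per-token estimate stands in
for a real tokenizer; names and defaults are illustrative.

```python
import re

def light_chunks(text, max_tokens=400, chars_per_token=4):
    """Regex-based sentence/paragraph chunking with no tokenizer."""
    budget = max_tokens * chars_per_token
    pieces = []
    for para in re.split(r"\n\s*\n", text):                      # paragraphs
        pieces.extend(re.split(r"(?<=[.!?])\s+", para.strip()))  # sentences

    chunks, current = [], ""
    for piece in filter(None, pieces):
        if current and len(current) + len(piece) + 1 > budget:
            chunks.append(current)   # budget exceeded: start a new chunk
            current = piece
        else:
            current = f"{current} {piece}".strip()
    if current:
        chunks.append(current)
    return chunks
```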
| parhamn wrote:
| Across a broad enough dataset, (char count / 4) is very close to
| the actual token count in English -- we verified this across
| millions of queries. We had to switch to an actual tokenizer for
| Chinese and other Unicode-heavy languages, as that simple
| formula misses the mark for context stuffing.
|
| The more complicated part is the effective bin-packing problem
| that emerges depending on how many different contextual sources
| you have.
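
A quick way to sanity-check the chars/4 heuristic on your own corpus
against a real tokenizer (the tiktoken encoding name is an assumption):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def estimate_ratio(texts):
    """Ratio of the len/4 estimate to the real token count (~1.0 for
    typical English; noticeably off for CJK and other dense scripts)."""
    estimated = sum(len(t) / 4 for t in texts)
    actual = sum(len(enc.encode(t)) for t in texts)
    return estimated / actual
```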
| petesergeant wrote:
| > What other chunking strategies would be useful for RAG
| applications?
|
| I'm using o1-preview for chunking, creating summary subdocuments.
| bhavnicksm wrote:
| That's pretty cool! I believe a research paper called
| LumberChunker recently evaluated that to be pretty decent as
| well.
|
| Thanks for responding, I'll try to make it easier to use
| something like that in Chonkie in the future!
| petesergeant wrote:
| Ah, that's an interesting paper, and a slightly different
| approach to what I'm doing, but possibly a superior one.
| Thanks!
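
For readers curious what LLM-assisted chunking of the kind discussed
here can look like: a sketch that asks a model for a short summary of
each chunk and indexes the summary alongside the original text. This is
not petesergeant's actual pipeline; the model name, prompt, and OpenAI
client usage are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def summarize_chunk(chunk, model="o1-preview"):
    """Ask the model for a retrieval-oriented summary of one chunk."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": ("Summarize this passage in 2-3 sentences for "
                        "retrieval indexing:\n\n" + chunk),
        }],
    )
    return resp.choices[0].message.content

def summary_subdocuments(chunks):
    # Pair each chunk with its summary; embed/index both for retrieval.
    return [{"text": c, "summary": summarize_chunk(c)} for c in chunks]
```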
| bhavnicksm wrote:
| Thank you so much for giving Chonkie a chance! Just to note,
| Chonkie is still in beta (currently at v0.1.2), with a bunch of
| things planned for it. It's an initial working version that
| seemed promising enough to share.
|
| I hope that you will stick with Chonkie for the journey of making
| the 'perfect' chunking library!
|
| Thanks again!
| vlovich123 wrote:
| Out of curiosity where does the 21 MiB come from? The codebase
| clone is 1.2 MiB and the src folder is only 68 KiB.
| ekianjo wrote:
| Dependencies in the venv?
| mattmein wrote:
| Also check out https://github.com/D-Star-AI/dsRAG/ for a bit more
| involved chunking strategy.
| cadence- wrote:
| This looks pretty amazing. I will take it for a spin next week.
| I want to make a RAG that will answer questions related to my
| new car. The manual is huge and it is often hard to find
| answers in it, so I think this will be a big help to owners of
| the same car. I think your library can help me chunk that huge
| PDF easily.
| andai wrote:
| How many tokens is the manual?
___________________________________________________________________
(page generated 2024-11-10 23:00 UTC)