[HN Gopher] Show HN: Chonkie - A Fast, Lightweight Text Chunking...
       ___________________________________________________________________
        
       Show HN: Chonkie - A Fast, Lightweight Text Chunking Library for
       RAG
        
       I built Chonkie because I was tired of rewriting chunking code
       for RAG applications. Existing libraries were either too bloated
       (80MB+) or too basic, with no middle ground.

       Core features:

       - 21MB default install vs 80-171MB alternatives
       - 33x faster token chunking than popular alternatives
       - Supports multiple chunking strategies: token, word, sentence,
         and semantic
       - Works with all major tokenizers (transformers, tokenizers,
         tiktoken)
       - Zero external dependencies for basic functionality

       Technical optimizations:

       - Uses tiktoken with multi-threading for faster tokenization
       - Implements aggressive caching and precomputation
       - Running mean pooling for efficient semantic chunking
       - Modular dependency system (install only what you need)

       Benchmarks and code: https://github.com/bhavnicksm/chonkie

       Looking for feedback on the architecture and performance
       optimizations. What other chunking strategies would be useful
       for RAG applications?
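        For flavor, basic usage looks roughly like this (the
        TokenChunker constructor and chunk() method follow the README
        at the time of posting; treat this as a sketch, not a stable
        API):

          # Sketch of Chonkie v0.1.x usage, per its README
          from tokenizers import Tokenizer
          from chonkie import TokenChunker

          tokenizer = Tokenizer.from_pretrained("gpt2")
          chunker = TokenChunker(tokenizer, chunk_size=512, chunk_overlap=128)

          for chunk in chunker.chunk("Some long document text ..."):
              print(chunk.token_count, chunk.text[:50])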
        
       Author : bhavnicksm
       Score  : 100 points
       Date   : 2024-11-10 15:58 UTC (7 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | mixeden wrote:
       | > Token Chunking: 33x faster than the slowest alternative
       | 
       | 1) what
        
         | rkharsan64 wrote:
          | There are only 3 competitors in that particular benchmark,
          | and the speedup over the 2nd-fastest is only 1.06x.
          | 
          | Edit: Also, from the same table, it seems that only this
          | library was run after warming up, while the others were not.
         | https://github.com/bhavnicksm/chonkie/blob/main/benchmarks/R...
        
           | bhavnicksm wrote:
            | Token chunking is limited mostly by the tokenizer and much
            | less by the chunking algorithm. Tiktoken tokenizers seem
            | to do better with warm-up, which Chonkie defaults to --
            | and which is also what the 2nd-fastest library uses.
            | 
            | Algorithmically, there's not much difference in token
            | chunking between Chonkie, LangChain, or any other token
            | chunking implementation you might want to use (except
            | LlamaIndex; I don't know what they did to end up 33x
            | slower).
            | 
            | If all you want is token chunking (which I don't entirely
            | recommend), then rather than Chonkie or LangChain, just
            | write your own for production :) At the very least, don't
            | install 80MiB packages for token chunking; Chonkie is 4x
            | smaller than that.
            | 
            | That's just my honest response... And these benchmarks are
            | just the beginning: future optimizations to
            | SemanticChunking should push its speed-up over the current
            | 2nd-fastest (2.5x right now) even higher.
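            | To illustrate "write your own": a minimal token chunker is
            | only a few lines with tiktoken (a sketch, not Chonkie's
            | internals; the chunk_size/overlap knobs are illustrative):
            | 
            |   import tiktoken
            | 
            |   def token_chunks(text, chunk_size=512, overlap=64):
            |       # Encode once, then slide a fixed-size token window.
            |       enc = tiktoken.get_encoding("cl100k_base")
            |       tokens = enc.encode(text)
            |       step = chunk_size - overlap
            |       return [enc.decode(tokens[i:i + chunk_size])
            |               for i in range(0, len(tokens), step)]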
        
           | melony wrote:
           | How does it compare with NLTK's chunking library? I have
           | found that it works very well for sentence segmentation.
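            | (For concreteness, the NLTK piece meant here is
            | sent_tokenize; a minimal example, assuming the punkt
            | models are downloaded:)
            | 
            |   import nltk
            |   nltk.download("punkt")  # newer NLTK versions want "punkt_tab"
            |   from nltk.tokenize import sent_tokenize
            | 
            |   sent_tokenize("Dr. Smith arrived. He left at 5 p.m.")
            |   # -> ['Dr. Smith arrived.', 'He left at 5 p.m.']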
        
       | samlinnfer wrote:
        | How does it work for code? (Chunking code, that is.)
        
         | nostrebored wrote:
         | Poorly, just like it does for text.
         | 
         | Chunking is easily where all of these problems die beyond PoC
         | scale.
         | 
         | I've talked to multiple code generation companies in the past
         | week -- most are stuck with BM25 and taking in whole files.
        
         | bhavnicksm wrote:
          | Right now, we haven't worked on adding support for code --
          | things like comments (#, //) contain punctuation that
          | adversely affects chunking, along with indentation and other
          | issues.
          | 
          | But it's on the roadmap, so please hold on!
        
       | bravura wrote:
        | One thing I've been looking for, and found a bit tricky to
        | implement efficiently myself, is this:
        | 
        | I have a particular max token length in mind, and I have a
        | tokenizer like tiktoken. Given a string, I want to quickly
        | find the longest truncation of the string that is <= the
        | target max token length.
       | 
       | Does chonkie handle this?
        
         | bhavnicksm wrote:
          | I don't fully understand what you mean by "maximum length
          | truncation of the string"; but if you mean splitting the
          | text into 'chunks' whose token counts are below a
          | pre-specified max_tokens length, then yes!
          | 
          | Is that what you meant?
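          | If you instead mean truncating a single string, the direct
          | route with tiktoken is to encode once, slice, and decode (a
          | sketch, not a Chonkie API):
          | 
          |   import tiktoken
          | 
          |   def truncate_to_tokens(text, max_tokens):
          |       enc = tiktoken.get_encoding("cl100k_base")
          |       tokens = enc.encode(text)
          |       # Slicing can split a multi-byte character, so the
          |       # decoded tail may need a sanity check.
          |       return enc.decode(tokens[:max_tokens])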
        
       | spullara wrote:
        | 21MB? To split text? Have you analyzed the footprint?
        
         | bhavnicksm wrote:
          | Just to clarify, the 21MB is the size of the package itself!
          | Other packages are way larger.
          | 
          | The memory footprint of the chunking itself would vary
          | widely with the dataset, and it's not something we've
          | tested... other libraries usually don't test it either, as
          | long as it doesn't bring down the computer/server.
          | 
          | If saving memory at runtime is important for your
          | application, let me know! I'll run some benchmarks for it...
         | 
         | Thanks!
        
       | ch1kkenm4ss4 wrote:
       | Chonkie and lightweight? Good naming!
        
         | bhavnicksm wrote:
         | Haha~ thanks!
        
       | opendang wrote:
       | Wow, a whole repo of AI-generated slop, complete with an AI-
       | generated logo. Anyone who runs this code unsandboxed is braver
       | than I.
        
         | xivusr wrote:
          | IMO comments like this go against the spirit of HN -- why
          | not offer more constructive feedback? Implying defects
          | without suggesting improvements (or offering proof) is low
          | effort, and what I'd expect in a YouTube comment thread, not
          | on HN.
        
       | simonw wrote:
       | Would it make sense for this to offer a chunking strategy that
       | doesn't need a tokenizer at all? I love the goal to keep it
       | small, but "tokenizers" is still a pretty huge dependency (and
       | one that isn't currently compatible with Python 3.13).
       | 
       | I've been hoping to find an ultra light-weight chunking library
       | that can do things like very simple regex-based
       | sentence/paragraph/markdown-aware chunking with minimal
       | additional dependencies.
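        | Something like the following is what I have in mind (the
        | regexes and the character budget are arbitrary):
        | 
        |   import re
        | 
        |   def simple_chunks(text, max_chars=2000):
        |       # Split on blank lines, then on sentence-ending
        |       # punctuation; greedily pack pieces up to max_chars.
        |       pieces = []
        |       for para in re.split(r"\n\s*\n", text):
        |           pieces += re.split(r"(?<=[.!?])\s+", para.strip())
        |       chunks, cur = [], ""
        |       for p in pieces:
        |           if cur and len(cur) + len(p) + 1 > max_chars:
        |               chunks.append(cur)
        |               cur = p
        |           else:
        |               cur = (cur + " " + p).strip()
        |       if cur:
        |           chunks.append(cur)
        |       return chunks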
        
         | parhamn wrote:
          | Across a broad enough dataset, (char count / 4) is very
          | close to the actual token count in English -- we verified
          | this across millions of queries. We had to switch to an
          | actual tokenizer for Chinese and other non-Latin scripts,
          | since the simple formula misses the mark for context
          | stuffing.
          | 
          | The more complicated part is the effective bin-packing
          | problem that emerges depending on how many different
          | contextual sources you have.
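          | (For the record, the heuristic above in code -- English-only,
          | with the 4-chars-per-token constant being the empirical
          | part:)
          | 
          |   def approx_tokens(text: str) -> int:
          |       # ~4 characters per token holds for English at scale;
          |       # it misses badly for Chinese and other scripts.
          |       return len(text) // 4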
        
       | petesergeant wrote:
       | > What other chunking strategies would be useful for RAG
       | applications?
       | 
       | I'm using o1-preview for chunking, creating summary subdocuments.
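        | Roughly like this (the prompt and helper are illustrative,
        | not a library API):
        | 
        |   from openai import OpenAI
        | 
        |   client = OpenAI()
        | 
        |   def summary_subdoc(chunk_text):
        |       # o1-preview takes a single user message (no system role).
        |       resp = client.chat.completions.create(
        |           model="o1-preview",
        |           messages=[{"role": "user",
        |                      "content": "Summarize for retrieval:\n" + chunk_text}],
        |       )
        |       return resp.choices[0].message.content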
        
         | bhavnicksm wrote:
          | That's pretty cool! I believe a recent research paper,
          | LumberChunker, evaluated that approach and found it pretty
          | decent as well.
          | 
          | Thanks for responding -- I'll try to make it easier to use
          | something like that in Chonkie in the future!
        
           | petesergeant wrote:
           | Ah, that's an interesting paper, and a slightly different
           | approach to what I'm doing, but possibly a superior one.
           | Thanks!
        
       | bhavnicksm wrote:
        | Thank you so much for giving Chonkie a chance! Just to note,
        | Chonkie is still in beta (currently v0.1.2), with a bunch of
        | things planned for it. It's an initial working version that
        | seemed promising enough to present.
        | 
        | I hope you'll stick with Chonkie on the journey toward the
        | 'perfect' chunking library!
        | 
        | Thanks again!
        
       | vlovich123 wrote:
       | Out of curiosity where does the 21 MiB come from? The codebase
       | clone is 1.2 MiB and the src folder is only 68 KiB.
        
         | ekianjo wrote:
         | Dependencies in the venv?
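          | One way to check, assuming a fresh venv with only chonkie
          | installed (per-package sizes under site-packages):
          | 
          |   import os, sysconfig
          | 
          |   site = sysconfig.get_paths()["purelib"]
          |   for pkg in sorted(os.listdir(site)):
          |       path = os.path.join(site, pkg)
          |       if os.path.isdir(path):
          |           size = sum(os.path.getsize(os.path.join(dp, f))
          |                      for dp, _, fs in os.walk(path) for f in fs)
          |           print(f"{size / 2**20:8.1f} MiB  {pkg}")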
        
       | mattmein wrote:
        | Also check out https://github.com/D-Star-AI/dsRAG/ for a
        | somewhat more involved chunking strategy.
        
         | cadence- wrote:
         | This looks pretty amazing. I will take it for a spin next week.
         | I want to make a RAG that will answer questions related to my
         | new car. The manual is huge and it is often hard to find
         | answers in it, so I think this will be a big help to owners of
         | the same car. I think your library can help me chunk that huge
         | PDF easily.
        
           | andai wrote:
           | How many tokens is the manual?
        
       ___________________________________________________________________
       (page generated 2024-11-10 23:00 UTC)