[HN Gopher] Show HN: Chonky - a neural approach for text semanti...
       ___________________________________________________________________
        
       Show HN: Chonky - a neural approach for text semantic chunking
        
        TLDR: I've made a transformer model and a wrapper library that
        segments text into meaningful semantic chunks.  Current
        text-splitting approaches rely on heuristics (although one can
        use a neural embedder to group semantically related sentences).
        I propose a fully neural approach to semantic chunking.  I took
        the base DistilBERT model and trained it on BookCorpus to split
        concatenated text paragraphs back into the original paragraphs.
        Basically it's a token classification task. Model fine-tuning
        took a day and a half on 2x1080Ti GPUs.  The library could be
        used as a text-splitter module in a RAG system or for splitting
        transcripts, for example.  The usage pattern that I see is the
        following: strip all the markup tags to produce pure text and
        feed this text into the model.  The problem is that, although in
        theory this should improve overall RAG pipeline performance, I
        haven't managed to measure it properly. Other limitations: the
        model only supports English for now and the output text is
        lowercased.  Please give it a try. I'd appreciate any feedback.
        The Python library: https://github.com/mirth/chonky  The
        transformer model:
        https://huggingface.co/mirth/chonky_distilbert_base_uncased_...
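         
        A minimal usage sketch (the exact class name may differ between
        versions; see the README for the current API):
         
            from chonky import ParagraphSplitter  # class name may differ
         
            # Downloads the fine-tuned DistilBERT checkpoint on first run.
            splitter = ParagraphSplitter(device="cpu")
         
            text = "..."  # plain text, markup already stripped
            for chunk in splitter(text):
                print(chunk)
                print("--")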
        
       Author : hessdalenlight
       Score  : 146 points
       Date   : 2025-04-11 12:18 UTC (2 days ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | jaggirs wrote:
       | Did you evaluate it on a RAG benchmark?
        
         | hessdalenlight wrote:
          | No, I haven't yet. I would be grateful if you could suggest
          | such a benchmark.
        
           | jaggirs wrote:
            | Not sure, haven't done so myself, but I think you could
            | maybe use MTEB. Or otherwise an LLM benchmark on large
            | inputs (and compare your chunking with naive chunking).
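            | 
            | By "naive chunking" I mean something like fixed-size
            | splitting with overlap, e.g. (rough sketch, purely
            | illustrative):
            | 
            |     def naive_chunks(text, size=1000, overlap=200):
            |         """Baseline: fixed-size character chunks with overlap."""
            |         step = size - overlap
            |         return [text[i:i + size]
            |                 for i in range(0, max(len(text) - overlap, 1), step)]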
        
       | suddenlybananas wrote:
       | I feel you could improve your README.md considerably just by
       | showing the actual output of the little snippet you show.
        
         | HeavyStorm wrote:
         | Came here to write exactly that. The author includes a large
         | sentence in the sample, so it should show us the output.
        
         | hessdalenlight wrote:
         | Just fixed it.
        
       | mentalgear wrote:
        | I applaud the FOSS initiative, but as with anything ML:
        | benchmarks, please, so we can see what test cases are covered
        | and how well they align with a project's needs.
        
       | petesergeant wrote:
       | Love that people are trying to improve chunkers, but just some
       | examples of how it chunked some input text in the README would go
       | a long way here!
        
       | mathis-l wrote:
       | You might want to take a look at https://github.com/segment-any-
       | text/wtpsplit
       | 
       | It uses a similar approach but the focus is on sentence/paragraph
       | segmentation generally and not specifically focused on RAG. It
       | also has some benchmarks. Might be a good source of inspiration
       | for where to take chonky next.
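        | 
        | From memory the wtpsplit API looks roughly like this (model
        | name and keyword arguments may be off, so treat it as a
        | sketch):
        | 
        |     from wtpsplit import SaT  # pip install wtpsplit
        | 
        |     sat = SaT("sat-3l-sm")  # small segment-any-text model
        |     sentences = sat.split("this is a test this is another test")
        |     # newer versions also expose paragraph segmentation, e.g.
        |     # sat.split(long_text, do_paragraph_segmentation=True)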
        
         | vunderba wrote:
         | This is the library that I use, mainly around very noisy IRC
         | chat transcripts and it works pretty well. OP I'd love to see a
         | paragraph matching comparison benchmark against wtpsplit to see
         | how well Chonky stacks up.
        
       | oezi wrote:
       | Just to understand: The model is trained to put paragraph breaks
       | into text. The training dataset is books (in contrast for
       | instance to scientific articles or advertising flyers).
       | 
       | It shouldn't break sentences at commas, right?
        
       | sushidev wrote:
        | So I could use this to index, e.g., a fiction book in a vector
        | DB, right? And the semantic chunking will possibly provide
        | better results at query time for RAG, did I understand that
        | correctly?
        
         | hessdalenlight wrote:
         | Yes and yes you are correct!
        
       | acstorage wrote:
        | You mention that fine-tuning took half a day; have you thought
        | about reducing that time?
        
         | hessdalenlight wrote:
          | Actually, a day and a half :). I'm all for it, but
          | unfortunately I have pretty old hardware.
        
       | dmos62 wrote:
       | Pretty cool. What use case did you have for this? Text with
       | paragraph breaks missing seems fairly exotic.
        
         | cckolon wrote:
         | This would be useful when chunking PDFs scanned with OCR. I've
         | done that before and paragraph breaks were detected pretty
         | inconsistently.
        
       | cmenge wrote:
       | > I took the base distilbert model
       | 
       | I read "the base Dilbert model", all sorts of weird ideas going
       | through my head, concluded I should re-read and made the same
       | mistake again XD
       | 
       | Guess I better take a break and go for a walk now...
        
       | michaelmarkell wrote:
        | It seems to me like chunking (or some higher-order version of
        | it, like chunking into knowledge graphs) is the highest-leverage
        | thing someone can work on right now if they're trying to
        | improve the intelligence of AI systems like code completion,
        | PDF understanding, etc. I'm surprised more people aren't
        | working on this.
        
         | serjester wrote:
         | Chunking is less important in the long context era with most
         | people just pulling in top 20 K. You obviously don't want to
         | butcher it, but you've got a lot of room for error.
        
           | lmeyerov wrote:
           | Yeah exactly
           | 
            | We still want chunking in practice to avoid LLM confusion
            | and undifferentiated embeddings, and to handle large
            | datasets at lower cost and higher volume. Large context
            | means we can now tolerate multi-paragraph/page chunks, so
            | it's more like chunking by coherent section.
            | 
            | In theory we could do an entire chapter/book, but those
            | other concerns come in, so I only see more niche tools or
            | talk-to-your-PDF apps doing that.
            | 
            | At the same time, embedding is often a significant cost in
            | the above scenarios, so I'm curious about the semantic
            | chunking overheads.
        
           | michaelmarkell wrote:
            | In our use case we have many gigabytes of PDFs that contain
            | some qualitative data but also many pages of inline PDF
            | tables. In an ideal world we'd be "compressing" those
           | embedded tables into some text that says "there's a table
           | here with these columns, if you want to analyze it you can
           | use this <tool>, but basically the table is talking about X,
           | here are the relevant stats like mean, sum, cardinality."
           | 
           | In the naive chunking approach, we would grab random sections
           | of line items from these tables because they happen to
           | reference some similar text to the search query, but there's
           | no guarantee the data pulled into context is complete.
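            | 
            | For what it's worth, the "compression" I have in mind is
            | something like this (purely illustrative; the format and
            | names are made up):
            | 
            |     import pandas as pd
            | 
            |     def summarize_table(df: pd.DataFrame, name: str) -> str:
            |         # Replace an extracted table with a short textual
            |         # stand-in that embeds/retrieves well and points the
            |         # model at a tool for the full data.
            |         stats = df.describe(include="all").to_string()
            |         return (
            |             f"[TABLE {name}] {len(df)} rows, columns: "
            |             f"{', '.join(map(str, df.columns))}.\n"
            |             f"Summary statistics:\n{stats}\n"
            |             f"Use the table tool to query the full data."
            |         )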
        
           | DeveloperErrata wrote:
           | Trueish - for orgs that can't use API models for regulatory
           | or security reasons, or that just need really efficient high
           | throughput models, setting up your own infra for long context
           | models can still be pretty complicated and expensive. Careful
           | chunking and thoughtful design of the RAG system often still
           | matters a lot in that context.
        
           | J_Shelby_J wrote:
           | "Performance is less important in an era of multi-core CPUs."
        
         | J_Shelby_J wrote:
         | That makes me feel better about spending so much time
         | implementing this balanced text chunker last year.
         | https://github.com/ShelbyJenkins/llm_utils
         | 
         | It splits an input text into equal sized chunks using DFS and
         | parallelization (rayon) to do so relatively quickly.
         | 
          | However, the goal for me is to use an LLM to split text by
          | topic. I'm thinking I will implement it as a SaaS API service
          | on top of it being OSS. Do you think that's a viable
          | business? You send in a library of text, and receive a
          | library of single-topic context chunks as output.
        
       | olavfosse wrote:
       | Does it work on other languages?
        
       | andai wrote:
       | Training a splitter based on existing paragraph conventions is
       | really cool. Actually, that's a task I run into frequently
        | (trying to turn a YouTube auto-transcript blob of text into
        | readable sentences). LLMs tend to rewrite the text a bit too
        | much instead of just adding punctuation.
       | 
       | As for RAG, I haven't noticed LLMs struggling with poorly
       | structured text (e.g. the YouTube wall of text blob can just be
       | fed directly into LLMs), though I haven't measured this.
       | 
       | In fact my own "webgrep" (convert top 10 search results into text
       | and run grep on them, optionally followed by LLM summary) works
       | on the _byte_ level (gave up chunking words, sentences and
       | paragraphs entirely): I just shove the 1kb before and after the
       | match into the context. This works fine because LLMs just ignore
       | the  "mutilated" word parts at the beginning and end.
       | 
       | The only downside of this approach is that if I was the LLM, I
       | would probably be unhappy with my job!
       | 
        | As for semantic chunking (in the sense of maximizing the
        | relevance of what goes into the LLM, or indeed as semantic
        | search for the user), I haven't solved it yet, but I can share
        | one amusing experiment: to find the relevant part of the
       | text (having already returned a mostly-relevant big chunk of
       | text), chop off one sentence at a time and re-run the similarity
       | check! So you "distil" the text down to that which is most
       | relevant (according to the embedding model) to the user query.
       | 
       | This is very slow and stupid, especially in real-time (though
       | kinda fun to watch), but kinda works for the "approximately one
       | sentence answers my question" scenario. A much cheaper
       | approximation here would just be to embed at the sentence level
       | as well as the page/paragraph level.
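        | 
        | Roughly, the sentence-chopping experiment looks like this
        | (sketch with sentence-transformers; the model name is just an
        | example and the exact stopping rule is a guess):
        | 
        |     from sentence_transformers import SentenceTransformer, util
        | 
        |     model = SentenceTransformer("all-MiniLM-L6-v2")
        | 
        |     def distil(sentences, query):
        |         # Greedily drop the first or last sentence while doing so
        |         # improves similarity to the query.
        |         q = model.encode(query)
        |         best = util.cos_sim(q, model.encode(" ".join(sentences))).item()
        |         while len(sentences) > 1:
        |             candidates = [sentences[1:], sentences[:-1]]
        |             scores = [util.cos_sim(q, model.encode(" ".join(c))).item()
        |                       for c in candidates]
        |             if max(scores) < best:
        |                 break  # trimming further only hurts relevance
        |             best = max(scores)
        |             sentences = candidates[scores.index(best)]
        |         return sentences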
        
       | fareesh wrote:
        | The non-English space in these fields is so far behind in
        | terms of accuracy and reliability, it's crazy.
        
       | legel wrote:
       | Very cool!
       | 
       | The training objective is clever.
       | 
        | The 50+ filters at Ecodash.ai for 90,000 plants came from a
        | custom RAG model on top of 800,000 raw web pages. Because LLMs
        | are expensive, chunking and semantic search for figuring out
        | what to feed into the LLM for inference is a key part of the
        | pipeline that nobody talks about. I think what I did was: run
        | all text through the cheapest OpenAI embeddings API... then, I
        | recall that nearest-neighbor vector search wasn't enough to
        | catch all the information relevant to a given query to be
        | answered by an LLM. So, I remember generating a large number
        | of diverse queries that mean the same thing (e.g. "plant
        | prefers full sun", "plant thrives in direct sunlight", "...
        | requires at least 6 hours of light per day", ...) and then
        | doing nearest-neighbor vector search on all of the queries,
        | and using the statistics to choose what to feed into the RAG
        | prompt.
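        | 
        | Roughly, the aggregation step was something like this (sketch;
        | `embed` and `chunk_vectors` are placeholders for whatever
        | embedding API and precomputed chunk matrix you use, both
        | assumed unit-normalized):
        | 
        |     import numpy as np
        |     from collections import Counter
        | 
        |     def multi_query_search(paraphrases, embed, chunk_vectors,
        |                            top_k=20):
        |         votes = Counter()
        |         for q in paraphrases:
        |             sims = chunk_vectors @ embed(q)  # cosine similarity
        |             for idx in np.argsort(-sims)[:top_k]:
        |                 votes[int(idx)] += 1
        |         # chunks retrieved by many different phrasings are the
        |         # safest to pass to the LLM
        |         return [idx for idx, _ in votes.most_common(top_k)]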
        
         | throwaway7783 wrote:
         | Have you tried the bm25 + vector search + reranking pipeline
         | for this?
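          | 
          | I.e. something along these lines (rough sketch; the model
          | names are just common defaults):
          | 
          |     from rank_bm25 import BM25Okapi
          |     from sentence_transformers import (SentenceTransformer,
          |                                        CrossEncoder, util)
          | 
          |     def hybrid_search(query, chunks, top_k=20):
          |         # 1. lexical candidates via BM25
          |         bm25 = BM25Okapi([c.lower().split() for c in chunks])
          |         lex = bm25.get_scores(query.lower().split())
          |         # 2. dense candidates via embeddings
          |         emb = SentenceTransformer("all-MiniLM-L6-v2")
          |         dense = util.cos_sim(emb.encode(query), emb.encode(chunks))[0]
          |         cand = list(set(lex.argsort()[-top_k:].tolist())
          |                     | set(dense.argsort(descending=True)[:top_k].tolist()))
          |         # 3. cross-encoder reranking of the merged candidates
          |         reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
          |         scores = reranker.predict([(query, chunks[i]) for i in cand])
          |         ranked = sorted(zip(scores, cand), reverse=True)
          |         return [chunks[i] for _, i in ranked[:top_k]]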
        
       | rekovacs wrote:
       | Really amazing and impressive work!
        
       | kamranjon wrote:
        | Interesting! I previously worked for a company that did
        | automatic generation of short video clips from long videos. I
        | fine-tuned a T5 model by taking many Wikipedia articles,
        | removing the newline characters, and training it to re-insert
        | them.
       | 
       | The idea was that paragraphs are naturally how we segment
       | distinct thoughts in text, and would translate well to segmenting
       | long video clips. It actually worked pretty well! It was able to
       | predict the paragraph breaks in many texts that it wasn't trained
       | on at all.
       | 
       | The problems at the time were around context length and dialog
       | style formatting.
       | 
        | I wanted to try to approach the problem in a less brute-force
        | way, maybe by using sentence embeddings and calculating the
        | probability of a sentence being a "paragraph-ending" sentence,
        | which would likely result in a much smaller model.
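        | 
        | Something like the following is what I had in mind (rough
        | sketch; the encoder model is just an example, and the labels
        | would come from any corpus with known paragraph boundaries):
        | 
        |     from sentence_transformers import SentenceTransformer
        |     from sklearn.linear_model import LogisticRegression
        | 
        |     encoder = SentenceTransformer("all-MiniLM-L6-v2")
        | 
        |     def train_boundary_classifier(sentences, ends_paragraph):
        |         # ends_paragraph: 0/1 label per sentence
        |         X = encoder.encode(sentences)
        |         return LogisticRegression(max_iter=1000).fit(X, ends_paragraph)
        | 
        |     def split_into_paragraphs(sentences, clf):
        |         probs = clf.predict_proba(encoder.encode(sentences))[:, 1]
        |         paragraphs, current = [], []
        |         for sentence, p in zip(sentences, probs):
        |             current.append(sentence)
        |             if p > 0.5:
        |                 paragraphs.append(" ".join(current))
        |                 current = []
        |         if current:
        |             paragraphs.append(" ".join(current))
        |         return paragraphs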
       | 
       | Anyway this is really cool! I'm excited to dive in further to
       | what you've done!
        
       | rybosome wrote:
        | Interesting idea - is the chunking deterministic? It would have
        | to be in order to be useful, but I'm wondering how that
        | interacts with the neural net.
        
       ___________________________________________________________________
       (page generated 2025-04-13 23:01 UTC)