[HN Gopher] Show HN: Chonky - a neural approach for text semantic chunking
___________________________________________________________________
Show HN: Chonky - a neural approach for text semantic chunking
TLDR: I've made a transformer model and a wrapper library that
segments text into meaningful semantic chunks. Current text-splitting
approaches rely on heuristics (although one can use a neural embedder
to group semantically related sentences). I propose a fully neural
approach to semantic chunking: I took the base DistilBERT model and
trained it on BookCorpus to split concatenated text paragraphs back
into the original paragraphs. Basically it's a token classification
task. Fine-tuning took a day and a half on 2x1080Ti. The library
could be used as a text-splitter module in a RAG system, or for
splitting transcripts, for example. The usage pattern I have in
mind is the following: strip all markup tags to produce pure text
and feed that text into the model. The caveat is that although in
theory this should improve overall RAG pipeline performance, I
haven't managed to measure it properly yet. Other limitations: the
model only supports English for now, and the output text is
lowercased. Please give it a try; I'd appreciate any feedback.
The Python library: https://github.com/mirth/chonky The
transformer model:
https://huggingface.co/mirth/chonky_distilbert_base_uncased_...
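A minimal usage sketch (strip markup first, then feed plain text and
iterate over chunks). The ParagraphSplitter name and call signature
follow the repo's README as I recall it; treat them as assumptions:

    from chonky import ParagraphSplitter

    # model weights are fetched from Hugging Face on first run
    splitter = ParagraphSplitter(device="cpu")

    # plain text with markup already stripped, per the usage
    # pattern described above
    text = open("transcript.txt").read()

    for chunk in splitter(text):
        print(chunk)
        print("--")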
Author : hessdalenlight
Score : 146 points
Date : 2025-04-11 12:18 UTC (2 days ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| jaggirs wrote:
| Did you evaluate it on a RAG benchmark?
| hessdalenlight wrote:
| No, I haven't yet. I'd be grateful if you could suggest such a
| benchmark.
| jaggirs wrote:
| Not sure, I haven't done so myself, but I think you can maybe use
| MTEB. Or otherwise an LLM benchmark on large inputs (and compare
| your chunking against naive chunking).
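| For reference, "naive chunking" as a baseline is usually just
| fixed-size windows with overlap; a minimal sketch (the sizes are
| arbitrary):
|
|     def naive_chunks(text, size=1000, overlap=200):
|         # fixed-size character windows with overlap: the usual baseline
|         step = size - overlap
|         return [text[i:i + size] for i in range(0, len(text), step)]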
| suddenlybananas wrote:
| I feel you could improve your README.md considerably just by
| showing the actual output of the little snippet it contains.
| HeavyStorm wrote:
| Came here to write exactly that. The sample includes a long
| passage, so the README should show us the output.
| hessdalenlight wrote:
| Just fixed it.
| mentalgear wrote:
| I applaud the FOSS initiative, but as with anything ML: benchmarks,
| please, so we can see which test cases are covered and how well
| they align with a project's needs.
| petesergeant wrote:
| Love that people are trying to improve chunkers, but just some
| examples of how it chunked some input text in the README would go
| a long way here!
| mathis-l wrote:
| You might want to take a look at
| https://github.com/segment-any-text/wtpsplit
|
| It uses a similar approach, but the focus is on sentence/paragraph
| segmentation in general rather than specifically on RAG. It also
| has some benchmarks. Might be a good source of inspiration for
| where to take Chonky next.
| vunderba wrote:
| This is the library that I use, mainly on very noisy IRC chat
| transcripts, and it works pretty well. OP, I'd love to see a
| paragraph-matching benchmark comparing Chonky against wtpsplit to
| see how well it stacks up.
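| For a quick side-by-side, wtpsplit usage looks roughly like this
| (the SaT model name and the paragraph flag are recalled from its
| README; treat both as assumptions):
|
|     from wtpsplit import SaT
|
|     raw_text = open("transcript.txt").read()
|     sat = SaT("sat-3l-sm")
|     for paragraph in sat.split(raw_text, do_paragraph_segmentation=True):
|         print(" ".join(paragraph))  # a paragraph is a list of sentences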
| oezi wrote:
| Just to understand: the model is trained to put paragraph breaks
| into text, and the training dataset is books (as opposed to, for
| instance, scientific articles or advertising flyers).
|
| It shouldn't break sentences at commas, right?
| sushidev wrote:
| So I could use this to index, e.g., a fiction book in a vector DB,
| right? And the semantic chunking will possibly give better results
| at query time for RAG? Did I understand that correctly?
| hessdalenlight wrote:
| Yes and yes you are correct!
| acstorage wrote:
| You mention that the fine-tuning took half a day; have you ever
| thought about reducing that time?
| hessdalenlight wrote:
| Actually a day and a half :). I'm all for it, but unfortunately I
| have pretty old hardware.
| dmos62 wrote:
| Pretty cool. What use case did you have for this? Text with
| missing paragraph breaks seems fairly exotic.
| cckolon wrote:
| This would be useful when chunking PDFs scanned with OCR. I've
| done that before, and paragraph breaks were detected pretty
| inconsistently.
| cmenge wrote:
| > I took the base distilbert model
|
| I read "the base Dilbert model", all sorts of weird ideas going
| through my head, concluded I should re-read and made the same
| mistake again XD
|
| Guess I better take a break and go for a walk now...
| michaelmarkell wrote:
| It seems to me like chunking (or some higher-order version of it,
| like chunking into knowledge graphs) is the highest-leverage thing
| someone can work on right now to improve the intelligence of AI
| systems like code completion, PDF understanding, etc. I'm
| surprised more people aren't working on this.
| serjester wrote:
| Chunking is less important in the long-context era, with most
| people just pulling in the top ~20 chunks. You obviously don't
| want to butcher it, but you've got a lot of room for error.
| lmeyerov wrote:
| Yeah, exactly.
|
| We still want chunking in practice to avoid LLM confusion and
| undifferentiated embeddings, and to handle large datasets at lower
| cost and larger volumes. Large context means we can now tolerate
| multi-paragraph or multi-page chunks, so it's more like chunking
| by coherent section.
|
| In theory we could do an entire chapter or book, but those other
| concerns kick in, so I only see more niche tools or talk-to-your-
| PDF apps do that.
|
| At the same time, embedding is often a significant cost in the
| above scenarios, so I'm curious about the semantic chunking
| overheads.
| michaelmarkell wrote:
| In our use case we have many gigabytes of PDFs that contain some
| qualitative data but also many pages of inline PDF tables. In an
| ideal world we'd be "compressing" those embedded tables into some
| text that says "there's a table here with these columns; if you
| want to analyze it you can use this <tool>, but basically the
| table is talking about X, and here are the relevant stats like
| mean, sum, and cardinality."
|
| With the naive chunking approach, we'd grab random sections of
| line items from these tables because they happen to reference text
| similar to the search query, but there's no guarantee the data
| pulled into context is complete.
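| A sketch of that "compression" step (summarize_table is a
| hypothetical helper; pandas is just for illustration):
|
|     import pandas as pd
|
|     def summarize_table(df: pd.DataFrame) -> str:
|         # stand-in text for an embedded table: shape, columns, and
|         # basic per-column stats, so it embeds and retrieves sanely
|         parts = [f"Table with {len(df)} rows; columns: "
|                  f"{', '.join(df.columns)}."]
|         for col in df.columns:
|             if pd.api.types.is_numeric_dtype(df[col]):
|                 parts.append(f"{col}: mean={df[col].mean():.3g}, "
|                              f"sum={df[col].sum():.3g}.")
|             else:
|                 parts.append(f"{col}: {df[col].nunique()} distinct values.")
|         return " ".join(parts)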
| DeveloperErrata wrote:
| True-ish: for orgs that can't use API models for regulatory or
| security reasons, or that just need really efficient, high-
| throughput models, setting up your own infra for long-context
| models can still be pretty complicated and expensive. Careful
| chunking and thoughtful design of the RAG system often still
| matter a lot in that context.
| J_Shelby_J wrote:
| "Performance is less important in an era of multi-core CPUs."
| J_Shelby_J wrote:
| That makes me feel better about spending so much time implementing
| this balanced text chunker last year:
| https://github.com/ShelbyJenkins/llm_utils
|
| It splits an input text into equal-sized chunks, using DFS and
| parallelization (rayon) to do so relatively quickly.
|
| However, my goal is to use an LLM to split text by topic. I'm
| thinking I'll implement it as a SaaS API on top of the OSS
| library. Do you think that's a viable business? You send in a
| library of text and receive a library of single-topic context
| chunks as output.
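| The balanced-split idea fits in a few lines; a Python sketch of
| the recursion (the linked repo uses DFS and rayon, so this is only
| the shape of the idea, not its implementation):
|
|     def balanced_split(text, max_len=1000):
|         # depth-first: cut at the whitespace nearest the midpoint,
|         # then recurse into each half for roughly equal-sized chunks
|         if len(text) <= max_len:
|             return [text]
|         mid = len(text) // 2
|         left, right = text.rfind(" ", 0, mid), text.find(" ", mid)
|         cuts = [c for c in (left, right) if c != -1]
|         cut = min(cuts, key=lambda c: abs(c - mid)) if cuts else mid
|         return (balanced_split(text[:cut].rstrip(), max_len)
|                 + balanced_split(text[cut:].lstrip(), max_len))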
| olavfosse wrote:
| Does it work on other languages?
| andai wrote:
| Training a splitter on existing paragraph conventions is really
| cool. Actually, that's a task I run into frequently (trying to
| turn a YouTube auto-transcript blob of text into readable
| sentences). LLMs tend to rewrite the text a bit too much instead
| of just adding punctuation.
|
| As for RAG, I haven't noticed LLMs struggling with poorly
| structured text (e.g. the YouTube wall of text blob can just be
| fed directly into LLMs), though I haven't measured this.
|
| In fact, my own "webgrep" (convert the top 10 search results into
| text and run grep on them, optionally followed by an LLM summary)
| works at the _byte_ level (I gave up chunking words, sentences and
| paragraphs entirely): I just shove the 1 KB before and after the
| match into the context. This works fine because LLMs simply ignore
| the "mutilated" word parts at the beginning and end.
|
| The only downside of this approach is that if I was the LLM, I
| would probably be unhappy with my job!
|
| As for semantic chunking (in the sense of maximizing the relevance
| of what goes into the LLM, or indeed as semantic search for the
| user), I haven't solved it yet, but I can share one amusing
| experiment: to find the relevant part of the text (having already
| retrieved a mostly relevant big chunk), chop off one sentence at a
| time and re-run the similarity check! So you "distil" the text
| down to what is most relevant (according to the embedding model)
| to the user query.
|
| This is very slow and stupid, especially in real time (though
| kinda fun to watch), but it kinda works for the "approximately one
| sentence answers my question" scenario. A much cheaper
| approximation would be to embed at the sentence level as well as
| the page/paragraph level.
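| One reading of that experiment as code (greedy removal; the
| embedding model here is just a stand-in, any embedder works):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def cos(a, b):
|         return float(np.dot(a, b)
|                      / (np.linalg.norm(a) * np.linalg.norm(b)))
|
|     def distil(sentences, query):
|         # drop whichever sentence hurts query similarity least,
|         # re-embedding the remainder each round (slow, as noted)
|         q = model.encode(query)
|         while len(sentences) > 1:
|             sim, drop = max(
|                 (cos(model.encode(" ".join(sentences[:i] + sentences[i + 1:])), q), i)
|                 for i in range(len(sentences)))
|             if sim < cos(model.encode(" ".join(sentences)), q):
|                 break  # every removal makes it worse; stop
|             sentences = sentences[:drop] + sentences[drop + 1:]
|         return sentences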
| fareesh wrote:
| The non-English space in these fields is so far behind in terms of
| accuracy and reliability, it's crazy.
| legel wrote:
| Very cool!
|
| The training objective is clever.
|
| The 50+ filters at Ecodash.ai for 90,000 plants came from a custom
| RAG model on top of 800,000 raw web pages. Because LLMs are
| expensive, chunking and semantic search for figuring out what to
| feed into the LLM at inference time are a key part of the pipeline
| nobody talks about. I think what I did was: run all text through
| the cheapest OpenAI embeddings API; then, I recall that nearest-
| neighbor vector search wasn't enough to catch all the information
| relevant to a given query. So I remember generating a large number
| of diverse queries that mean the same thing (e.g. "plant prefers
| full sun", "plant thrives in direct sunlight", "... requires at
| least 6 hours of light per day", ...), doing nearest-neighbor
| vector search on all of them, and using the statistics to choose
| what to feed into the RAG context.
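| One way to read "using the statistics" is simple vote counting
| across paraphrases; a sketch (embed() stands in for whatever
| embedding function you use, and chunk_vecs is a 2-D array of chunk
| embeddings; both are assumptions here):
|
|     import numpy as np
|
|     def multi_query_search(queries, chunk_vecs, embed, k=10):
|         # take top-k neighbors per paraphrase, then rank chunks by
|         # how many paraphrases voted for them
|         votes = {}
|         for q in queries:
|             qv = embed(q)
|             sims = chunk_vecs @ qv / (
|                 np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(qv))
|             for idx in np.argsort(-sims)[:k]:
|                 votes[int(idx)] = votes.get(int(idx), 0) + 1
|         return sorted(votes, key=votes.get, reverse=True)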
| throwaway7783 wrote:
| Have you tried a BM25 + vector search + reranking pipeline for
| this?
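| That pipeline in rough strokes (the model names are common
| defaults rather than recommendations, and vector_hits is assumed
| to come from an existing ANN index):
|
|     import numpy as np
|     from rank_bm25 import BM25Okapi
|     from sentence_transformers import CrossEncoder
|
|     def hybrid_search(query, chunks, vector_hits, k=20):
|         # union BM25 and vector-search candidates, then let a
|         # cross-encoder reranker pick the final top-k
|         bm25 = BM25Okapi([c.split() for c in chunks])
|         lexical = np.argsort(-bm25.get_scores(query.split()))[:50]
|         cand = sorted(set(lexical.tolist()) | set(vector_hits))
|         ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
|         scores = ce.predict([(query, chunks[i]) for i in cand])
|         return [cand[i] for i in np.argsort(-scores)[:k]]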
| rekovacs wrote:
| Really amazing and impressive work!
| kamranjon wrote:
| Interesting! I previously worked for a company that did automatic
| generation of short video clips from long videos. I fine-tuned a
| T5 model by taking many Wikipedia articles, removing the newline
| characters, and training it to insert them back.
|
| The idea was that paragraphs are naturally how we segment distinct
| thoughts in text, and that this would translate well to segmenting
| long videos into clips. It actually worked pretty well! It was
| able to predict the paragraph breaks in many texts it wasn't
| trained on at all.
|
| The problems at the time were around context length and dialog
| style formatting.
|
| I wanted to approach the problem in a less brute-force way, maybe
| by using sentence embeddings and calculating the probability of a
| sentence being a "paragraph-ending" sentence, which would likely
| result in a much smaller model (a rough sketch of that idea
| follows below).
|
| Anyway, this is really cool! I'm excited to dive further into what
| you've done!
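| That smaller-model idea could be prototyped along these lines (the
| embedder, the classifier choice, and the 0.5 threshold are all
| assumptions for illustration):
|
|     from sentence_transformers import SentenceTransformer
|     from sklearn.linear_model import LogisticRegression
|
|     embedder = SentenceTransformer("all-MiniLM-L6-v2")
|
|     def train_boundary_clf(sents, labels):
|         # labels: 1 if the sentence ends a paragraph in the original
|         # corpus (e.g. Wikipedia with newlines stripped), else 0
|         return LogisticRegression(max_iter=1000).fit(
|             embedder.encode(sents), labels)
|
|     def split_paragraphs(sentences, clf):
|         probs = clf.predict_proba(embedder.encode(sentences))[:, 1]
|         paras, cur = [], []
|         for sent, p in zip(sentences, probs):
|             cur.append(sent)
|             if p > 0.5:
|                 paras.append(" ".join(cur))
|                 cur = []
|         if cur:
|             paras.append(" ".join(cur))
|         return paras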
| rybosome wrote:
| Interesting idea: is the chunking deterministic? It would have to
| be to be useful, but I'm wondering how that interacts with the
| neural net.
___________________________________________________________________
(page generated 2025-04-13 23:01 UTC)