[HN Gopher] Launch HN: Chonkie (YC X25) - Open-Source Library fo...
       ___________________________________________________________________
        
       Launch HN: Chonkie (YC X25) - Open-Source Library for Advanced
       Chunking
        
       Hey HN! We're Shreyash and Bhavnick. We're building Chonkie
       (https://chonkie.ai), an open-source library for chunking and
       embedding data.

       Python: https://github.com/chonkie-inc/chonkie
       TypeScript: https://github.com/chonkie-inc/chonkie-ts

       Here's a video showing our code chunker:
       https://youtu.be/Xclkh6bU1P0.

       Bhavnick and I have been building personal projects with LLMs for a
       few years. For much of this time, we found ourselves writing our
       own chunking logic to support RAG applications. We often hesitated
       to use existing libraries because they either had only basic
       features or felt too bloated (some are 80MB+).

       We built Chonkie to be lightweight, fast, extensible, and easy. The
       space is evolving rapidly, and we wanted Chonkie to be able to
       quickly support the newest strategies. We currently support: Token
       Chunking, Sentence Chunking, Recursive Chunking, Semantic Chunking,
       plus:

       - Semantic Double Pass Chunking: Chunks text semantically first,
       then merges closely related chunks.
       - Code Chunking: Chunks code files by creating an AST and finding
       ideal split points.
       - Late Chunking: Based on the paper
       (https://arxiv.org/abs/2409.04701), where chunk embeddings are
       derived from embedding a longer document.
       - Slumber Chunking: Based on the "LumberChunker" paper
       (https://arxiv.org/abs/2406.17526). It uses recursive chunking,
       then an LLM verifies split points, aiming for high-quality chunks
       with reduced token usage and LLM costs.

       You can see how Chonkie compares to LangChain and LlamaIndex in our
       benchmarks:
       https://github.com/chonkie-inc/chonkie/blob/main/BENCHMARKS....

       Some technical details about the Chonkie package:

       - ~15MB default install vs. ~80-170MB for some alternatives.
       - Up to 33x faster token chunking compared to LangChain and
       LlamaIndex in our tests.
       - Works with major tokenizers (transformers, tokenizers, tiktoken).
       - Zero external dependencies for basic functionality.
       - Implements aggressive caching and precomputation.
       - Uses running mean pooling for efficient semantic chunking.
       - Modular dependency system (install only what you need).

       In addition to chunking, Chonkie also provides an easy way to
       create embeddings. For supported providers (SentenceTransformer,
       Model2Vec, OpenAI), you just specify the model name as a string.
       You can also create custom embedding handlers for other providers.

       RAG is still the most common use case currently. However, Chonkie
       makes chunks that are optimized for creating high quality
       embeddings and vector retrieval, so it is not really tied to the
       "generation" part of RAG. In fact, we're seeing more and more
       people use Chonkie for implementing semantic search and/or setting
       context for agents.

       We are currently focused on building integrations to simplify the
       retrieval process. We've created "handshakes" - thin functions that
       interact with vector DBs like pgVector, Chroma, TurboPuffer, and
       Qdrant, allowing you to interact with storage easily. If there's an
       integration you'd like to see (vector DB or otherwise), please let
       us know.

       We also offer hosted and on-premise versions with OCR, extra
       metadata, all embedding providers, and managed vector databases for
       teams that want a fully managed pipeline. If you're interested,
       reach out at shreyash@chonkie.ai or book a demo:
       https://cal.com/shreyashn/chonkie-demo.

       We're eager to hear your feedback and comments! Thanks!
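
       The post mentions running mean pooling for semantic chunking
       without showing how it works. The following is only a rough
       sketch of the general idea, not Chonkie's implementation: keep a
       running mean of the sentence embeddings in the current chunk and
       start a new chunk when the next sentence is no longer similar to
       that mean. The names semantic_chunks, toy_embed, VOCAB, and the
       0.5 threshold are all hypothetical, and the keyword-count
       "embedding" stands in for a real model.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(
    sentences: Sequence[str],
    embed: Callable[[str], Vector],
    threshold: float = 0.5,  # hypothetical cut-off, tune per model
) -> List[List[str]]:
    """Group consecutive sentences; split when similarity to the running mean drops."""
    chunks: List[List[str]] = []
    current: List[str] = []
    mean: Vector = []
    for sentence in sentences:
        vec = embed(sentence)
        if current and cosine(mean, vec) < threshold:
            chunks.append(current)
            current, mean = [], []
        current.append(sentence)
        n = len(current)
        # Incremental mean update, so the chunk is never re-averaged from scratch.
        mean = list(vec) if n == 1 else [m + (v - m) / n for m, v in zip(mean, vec)]
    if current:
        chunks.append(current)
    return chunks


# Toy embedding (assumption for illustration): counts of hand-picked keywords.
VOCAB = ["cat", "dog", "stock", "market"]


def toy_embed(sentence: str) -> Vector:
    words = sentence.lower().split()
    return [float(words.count(term)) for term in VOCAB]
```

       With the toy embedding, two sentences about cats and dogs
       followed by two about the stock market come back as two chunks.
       With a real embedding model the threshold would need tuning.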
        
       Author : snyy
       Score  : 88 points
       Date   : 2025-06-09 16:09 UTC (6 hours ago)
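
       The Code Chunking strategy in the post is described as building
       an AST and finding split points. A minimal sketch of that idea,
       assuming Python input and the standard-library ast module (the
       post does not say which parser or heuristics Chonkie uses), is to
       split only at top-level statement boundaries and pack neighbours
       together up to a line budget. chunk_python_source and max_lines
       are hypothetical names.

```python
import ast
from typing import List


def chunk_python_source(source: str, max_lines: int = 40) -> List[str]:
    """Split Python source at top-level statement boundaries taken from the AST."""
    lines = source.splitlines()
    starts = []
    for node in ast.parse(source).body:
        # A decorated def/class starts at its first decorator, not at `def`.
        decorators = getattr(node, "decorator_list", [])
        starts.append(min([node.lineno] + [d.lineno for d in decorators]))
    if not starts:
        return [source] if source.strip() else []

    starts[0] = 1  # keep leading comments with the first statement
    bounds = starts + [len(lines) + 1]
    segments = list(zip(bounds, bounds[1:]))  # half-open, 1-based line ranges

    chunks: List[str] = []
    cur_start, cur_end = segments[0]
    for seg_start, seg_end in segments[1:]:
        if seg_end - cur_start <= max_lines:
            cur_end = seg_end  # the next statement still fits in this chunk
        else:
            chunks.append("\n".join(lines[cur_start - 1:cur_end - 1]))
            cur_start, cur_end = seg_start, seg_end
    chunks.append("\n".join(lines[cur_start - 1:cur_end - 1]))
    return chunks
```

       A single statement longer than max_lines still becomes one
       oversized chunk here. A fuller version would recurse into the
       function or class body to find further split points.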
        
       | greymalik wrote:
       | You're part of YC but this is open source - how do you plan to
       | make money off of it?
        
         | tevon wrote:
         | Looks like they will have a cloud offering, and mentioned in
         | this post are on-prem and managed offerings
        
         | snyy wrote:
         | As mentioned in the other reply, we have a cloud/on-prem
         | offering that comes with a managed ETL pipeline built on top of
         | our OSS offering.
        
       | tevon wrote:
       | Was just looking into chunking strategies today, this looks
       | great! Will update with any feedback.
        
         | snyy wrote:
         | Awesome! Keep us posted!
        
       | pzo wrote:
       | Is this only for node (how about bun/deno)? Has it been tested
       | to work with react native?
        
         | snyy wrote:
         | Node and Bun should work. Haven't tested on Deno yet.
         | 
         | We rely on the huggingface/transformers library which might be
         | too heavy for a react-native app.
        
       | Andugal wrote:
       | Congratulations on the launch!
       | 
       | You said that Chonkie works with multiple vector stores. I was
       | wondering what RAG database HN uses? Do you need a specialized
       | one (like Chroma) or is Postgres just fine?
        
         | snyy wrote:
         | Not sure what HN uses :)
         | 
         | If you want agents/LLMs to be able to find relevant data based
         | on similarity to queries, vectorDBs like Chroma (or even
         | pgVector) are great.
        
         | gavmor wrote:
         | Does HN even _use_ a RAG database? What for? They don't even
         | maintain their own search[0].
         | 
         | 0. https://hn.algolia.com/
        
       | elliot07 wrote:
       | Chonkie is great software. Congrats on the launch! Has been a
       | pleasure to use so far.
        
         | snyy wrote:
         | Thank you :)
        
       | pj_mukh wrote:
       | Super cool!
       | 
       | It looks like size and speed are your major advantages. In our RAG
       | pipeline we run the chunking process async as an onboarding type
       | process. Is Chonkie primarily for people looking to process
       | documents in some sort of real-time scenario?
        
         | snyy wrote:
         | In addition to size and speed, we also offer the widest variety
         | of chunking strategies!
         | 
         | Typically, our current users fall into one of two categories:
         | 
         | - People who are running async chunking but need access to a
         | strategy not supported in LangChain/LlamaIndex. Sometimes speed
         | matters here too, especially if the user has a high volume of
         | documents.
         | 
         | - People who need real-time chunking. Super useful for apps
         | like codegen/code review tools.
        
       | _epps_ wrote:
       | Excited to try this out! Also +1 for Moo Deng-ish mascot.
        
       | amir_karbasi wrote:
       | Looks great! I had looked at Chonkie a few months back, but
       | didn't need it in our pipelines. I was just writing a POC for an
       | agentic chunker this week to handle various formatting and
       | chunking requirements. I'll give Chonkie a shot!
        
         | snyy wrote:
         | Awesome! Keep us posted :)
        
       | yawnxyz wrote:
       | I'm curious if chunking is different for embeddings vs. for
       | "agentic retrieval" e.g. an AI or a person operates like a
       | Librarian; they look up in an index at what resources to look up,
       | get the relevant bits, then piece them together into a cohesive
       | narrative whole -- would we do any chunking at all for this, or
       | does this purely rely on the way the DB is set up? I think for
       | certain use cases, even a single DB record could be too large for
       | context windows, so maybe chunking might need to be done to the
       | record? (e.g. a db of research papers)
        
         | snyy wrote:
         | Great questions!
         | 
         | Chunking fundamentals remain the same whether you're doing
         | traditional semantic search or agentic retrieval. The key
         | difference lies in the retrieval strategy, not the chunking
         | approach itself.
         | 
         | For quality agentic retrieval, you still need to create a
         | knowledge base by chunking documents, generating embeddings,
         | and storing them in a vector database. You can add
         | organizational structure here--like creating separate
         | collections for different document categories (Physics papers,
         | Biology papers, etc.)--though the importance of this
         | organization depends on the size and diversity of your source
         | data.
         | 
         | The agent then operates exactly as you described: it queries
         | the vector database, retrieves relevant chunks, and synthesizes
         | them into a coherent response. The chunking strategy should
         | still optimize for semantic coherence and appropriate context
         | window usage.
         | 
         | Regarding your concern about large DB records: you're
         | absolutely right. Even individual research papers often exceed
         | context windows, so you'd still need to chunk them into
         | smaller, semantically meaningful pieces (perhaps by section,
         | abstract, methodology, etc.). The agent can then retrieve and
         | combine multiple chunks from the same paper or across papers as
         | needed.
         | 
         | The main advantage of agentic retrieval is that the agent can
         | make multiple queries, refine its search strategy, and
         | iteratively build context--but it still relies on well-chunked,
         | embedded content in the underlying vector database.
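
         The flow described in this reply (chunks stored per collection,
         an agent issuing several queries and merging what it finds) can
         be sketched roughly as below. Everything here is an assumption
         for illustration: VectorStore is an in-memory stand-in rather
         than a real vector DB client, embed is a toy bag-of-words
         function rather than an embedding model, and the list of
         queries would normally come from the agent itself.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class VectorStore:
    """In-memory stand-in for a vector DB with named collections."""

    def __init__(self) -> None:
        self.collections: Dict[str, List[Tuple[str, Counter]]] = {}

    def add(self, collection: str, chunk: str) -> None:
        self.collections.setdefault(collection, []).append((chunk, embed(chunk)))

    def search(self, collection: str, query: str, k: int = 1) -> List[str]:
        query_vec = embed(query)
        scored = [
            (cosine(query_vec, vec), chunk)
            for chunk, vec in self.collections.get(collection, [])
        ]
        scored = [pair for pair in scored if pair[0] > 0]  # drop non-matches
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [chunk for _, chunk in scored[:k]]


def gather_context(
    store: VectorStore, collection: str, queries: Sequence[str], k: int = 1
) -> List[str]:
    """Run several queries (as an agent might) and merge hits without duplicates."""
    seen: List[str] = []
    for query in queries:
        for chunk in store.search(collection, query, k):
            if chunk not in seen:
                seen.append(chunk)
    return seen
```

         The chunking step is unchanged from plain semantic search. Only
         the loop around search differs, which matches the point made in
         the reply.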
        
       | babuloseo wrote:
       | I like the mascot.
        
         | esafak wrote:
         | I can't help but think of the SNL sketch.
         | https://www.youtube.com/watch?v=hRGKSwsD7ac
        
       | mritchie712 wrote:
       | We (https://www.definite.app/) have a use case I'd imagine is
       | common for people building agents.
       | 
       | When a user works with our agent, they may end up with a large
       | conversation thread (e.g. 200k+ tokens) with many SQL snippets,
       | query results and database metadata (e.g. table and column info).
       | 
       | For example, if they ask "show me any companies that were heavily
       | engaged at one point, but I haven't talked to in the last 90
       | days". This will pull in their schema (e.g. Hubspot), run a bunch
       | of SQL, show them results, etc.
       | 
       | I want to allow the agent to search previous threads for answers
       | so they don't need to have the conversation again, but chunking
       | up the existing thread is non-trivial (e.g. you don't want to
       | separate the question and answer, you may want to remove errors
       | while retaining the correction, etc.).
       | 
       | Do you have any plans to support "auto chunking" for AI
       | message[0] threads?
       | 
       | 0 - e.g. https://platform.openai.com/docs/api-reference/messages/crea...
        
         | snyy wrote:
         | > you may want to remove errors while retaining the correction
         | 
         | Double clicking on this, are these messages you'd want to drop
         | from memory because they're not part of the actual content
         | (e.g. execution errors or warnings)? That kind of cleanup is
         | something Chonkie can help with as a pre-processing step.
         | 
         | If you can share an example structure of your message threads,
         | I can give more specific guidance. We've seen folks use Chonkie
         | to chunk and embed AI chat threads -- treating the resulting
         | vector store as long-term memory. That way, you can RAG over
         | past threads to recover context without redoing the
         | conversation.
         | 
         | P.S. If HN isn't ideal for going back and forth, feel free to
         | send me an email at shreyash@chonkie.ai.
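
         As a rough illustration of the pre-processing discussed in this
         exchange, the sketch below groups a message thread into one
         chunk per user question, keeps each question with its answers,
         and drops messages marked as errors. It is not a Chonkie API.
         The message shape and the is_error flag are assumptions, since
         the thread does not show how errors are marked in practice.

```python
from typing import Dict, List

Message = Dict[str, object]


def chunk_thread(messages: List[Message]) -> List[str]:
    """One chunk per user turn: the question plus every reply that follows it."""
    groups: List[List[Message]] = []
    current: List[Message] = []
    for msg in messages:
        if msg.get("is_error"):  # hypothetical flag set upstream
            continue  # drop failed attempts, keep the later correction
        if msg["role"] == "user" and current:
            groups.append(current)  # a new question closes the previous turn
            current = []
        current.append(msg)
    if current:
        groups.append(current)
    return [
        "\n".join(f"{m['role']}: {m['content']}" for m in group)
        for group in groups
    ]
```

         Each returned string can then be embedded and stored like any
         other chunk, so a question is never separated from its answer.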
        
           | mritchie712 wrote:
           | > We've seen folks use Chonkie to chunk and embed AI chat
           | threads
           | 
           | yep, that's what we're looking for. We'll give it a shot!
           | 
           | I think it's worth creating a guide for this use case. Seems
           | like something many people would want to do and the input
           | should be very similar across your users.
        
       | olavfosse wrote:
       | Very cool!
       | 
       | What's the story for chunking PDFs?
       | 
       | We've been using Marker and handling markdown->chunks manually.
        
         | snyy wrote:
         | Pretty much what you described. Convert the PDF to Markdown,
         | join content across pages so that it's all one string, then
         | chunk it. Our evals show this approach works best.
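
         A minimal sketch of the "join pages, then chunk" step from this
         reply, assuming the PDF has already been converted to one
         Markdown string per page (for example by Marker). join_pages,
         chunk_markdown, and the max_chars budget are hypothetical
         names, and the paragraph packing is a stand-in for whichever
         chunker is actually used.

```python
from typing import List


def join_pages(pages: List[str]) -> str:
    """Join per-page Markdown so text that runs across a page break stays together."""
    return "\n".join(page.strip("\n") for page in pages)


def chunk_markdown(text: str, max_chars: int = 500) -> List[str]:
    """Greedily pack blank-line-separated paragraphs into chunks of about max_chars."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate  # fits, or is an oversized paragraph on its own
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

         Chunking each page separately would cut a sentence that spans a
         page break in two. Joining first lets the chunker treat it as
         one paragraph.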
        
       | hweller wrote:
       | Congratulations on the launch! Would be awesome to see support
       | for MongoDB Atlas as one of the vector stores and Voyage AI as an
       | embedding provider if you are interested. I can imagine quite a
       | few customers that would prefer a lightweight interface for
       | chunking - lmk how I can help make that happen from the Mongo
       | side!
        
       | dbworku wrote:
       | Very cool. Dope maintainers and project!
        
       | ketzo wrote:
       | I'm building out a side project where I need to ingest + chunk a
       | lot of HTML -- wrote my own (terrible) chunker naively thinking
       | that would be easy :')
       | 
       | Definitely gonna give this a try!
        
       | zackify wrote:
       | You guys should steal the ideas I had in mind and partially
       | implemented on https://github.com/zackify/revect
       | 
       | Similar to you I saw a lot of bloated projects out there. Mine is
       | a 90MB container.
       | 
       | I want to do what your project does but in addition have
       | extensions for everyday apps that index into a db.
       | 
       | Your private database for all ai interactions.
       | 
       | I also have a cloud version using the mcp auth spec, but it's all
       | for fun and probably not worth releasing.
       | 
       | Do you have any plans to do further use cases such as this?
        
       ___________________________________________________________________
       (page generated 2025-06-09 23:00 UTC)