[HN Gopher] Ultra efficient vector extension for SQLite
       ___________________________________________________________________
        
       Ultra efficient vector extension for SQLite
        
       Author : marcobambini
       Score  : 96 points
       Date   : 2025-09-23 14:33 UTC (3 days ago)
        
 (HTM) web link (marcobambini.substack.com)
 (TXT) w3m dump (marcobambini.substack.com)
        
       | simonw wrote:
       | This is a neat project - good API design, performance looks
       | impressive.
       | 
       | Note that this one isn't open source:
       | https://github.com/sqliteai/sqlite-vector/blob/main/LICENSE....
       | 
       | The announcement says:
       | 
       | > We believe the community could benefit from sqlite-vector,
       | which is why we've made it entirely free for open-source
       | projects.
       | 
       | But that's not really how this works. If my open source projects
       | incorporate aspects of this project they're no longer open
       | source!
        
         | rudedogg wrote:
         | Dang, I was really excited about this too.
         | 
          | I guess I'll either stick with sqlite-vec or give Turso another
          | look. I'm not fond of the idea of a SQLite fork, though.
         | 
          | Do you know of anything else I should take a look at? I know
          | you use a lot of this stuff for your open-source AI/ML work.
         | I'd like something I can use on device.
        
         | OutOfHere wrote:
         | In contrast, https://github.com/asg017/sqlite-vec is dual-
         | licensed under Apache and MIT, which makes it open source.
        
         | thorsson12 wrote:
         | Ah, yes, this is a "source available" project, not what you
         | would normally call an "open source" project. Still cool!
        
         | tanvach wrote:
         | There is the 'Additional Grant for Open-Source Projects'
          | section that seems to permit inclusion in open source projects.
         | Do you mind explaining why you think this is not enough? I'm
         | not an expert in licenses so genuinely interested in your take.
        
           | matharmin wrote:
           | Let's say I have an open-source project licensed under Apache
           | 2. The grant allows me to include the extension in my
           | project. But it doesn't allow me to relicense it under Apache
           | 2 or any other compatible license. So if I include it, my
           | project can't be Apache 2-licensed anymore.
           | 
           | Apache 2 is just an example here - the same would apply for
           | practically any open source license.
           | 
           | The one place I imagine it could still work is if the open-
           | source project, say a sqlite browser, includes it as an
           | optional plugin. So the project itself stays open-source, but
           | the grant allows using the proprietary plugin with it.
        
           | simonw wrote:
            | The reason I choose to apply open source licenses to my
            | projects is that I want other people to be able to use them
            | without any limitations (beyond those set out in the open
            | source license I selected, which is extremely permissive).
           | 
           | If they depend on software that carries limitations, I can no
           | longer make that promise to my own users.
           | 
            | Or does their extra license term mean I can ship my own
            | project that is the thinnest possible wrapper around theirs
            | but is fully open source? That seems unlikely.
        
         | dmezzetti wrote:
          | Odd licensing strategy here. It's like someone who wants the
          | cachet of saying they are open source without being it.
        
         | MangoToupe wrote:
         | I guess "free software" is well and truly dead as a term with
         | any general cultural weight.
        
         | F3nd0 wrote:
         | Even worse, it seems like it's not Free Software, either.
        
       | fifilura wrote:
       | Honest question, I just want to learn, what are vector databases
       | used for?
        
         | aaronblohowiak wrote:
         | finding similar things quickly, where the "shape" of a thing
         | can be defined by a vector (like embeddings for instance). this
         | can be used in lots of machine learning applications
        
           | fifilura wrote:
           | I figured it would be something like this. And vectors as
           | rows in a regular table would be too slow then?
        
             | sandyarmstrong wrote:
              | SQLite does not have native support for a vector-like
              | column type. Extensions like this and sqlite-vec build on
              | the BLOB column type and provide additional functions to
              | let you efficiently search and manipulate this vector data.
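
The BLOB approach described above can be sketched in Python with the standard sqlite3 and struct modules. A minimal sketch, assuming a little-endian float32 layout; the table and column names are illustrative, not from any extension's API:

```python
import sqlite3
import struct

def to_blob(vector):
    # Serialize a list of floats as a little-endian float32 BLOB,
    # the kind of compact layout vector extensions operate on.
    return struct.pack(f"<{len(vector)}f", *vector)

def from_blob(blob):
    # Each float32 occupies 4 bytes.
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, embedding BLOB)")
db.execute("INSERT INTO docs (embedding) VALUES (?)",
           (to_blob([1.0, 0.5, -0.25]),))
(blob,) = db.execute("SELECT embedding FROM docs").fetchone()
print(from_blob(blob))  # round-trips the stored vector
```

The extension's own functions would then read and compare these BLOBs natively in SQL instead of round-tripping them through Python.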
        
         | simonw wrote:
         | They're useful for embeddings, which let you turn articles (and
         | images and other content) into a huge array of floating point
         | numbers that capture the semantics of the content. Then you can
         | use a vector database to find similar items to each other - or
         | similar items to the user's own search query.
         | 
         | I wrote a big tutorial about embeddings a couple of years ago
         | which still holds up today:
         | https://simonwillison.net/2023/Oct/23/embeddings/
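
The similarity step described above typically boils down to cosine similarity between embedding vectors. A toy sketch in plain Python (the 4-dimensional vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity: dot(a, b) / (|a| * |b|), ranges over [-1, 1].
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: the query should score closer to the
# related article than to the unrelated one.
article = [0.2, 0.9, 0.1, 0.4]
query = [0.25, 0.8, 0.05, 0.5]
unrelated = [0.9, 0.1, 0.8, 0.0]

print(cosine_similarity(article, query) >
      cosine_similarity(article, unrelated))
```

A vector database does exactly this comparison, just over millions of stored vectors at once, with SIMD and/or an index.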
        
       | kamranjon wrote:
       | I'd be interested to understand the query performance when
       | compared to the HNSW implementation (Turso?) they mentioned. In
       | general search performance is more important to me, and I don't
       | mind having an increase in insert overhead to have very fast
       | vector search.
        
         | dandanua wrote:
          | HNSW is approximate, not exact. I guess brute-force means that
          | sqlite-vector returns the exact best match.
        
           | kamranjon wrote:
            | Right, but libsql (Turso) uses HNSW - so I'd be curious to
            | know how the performance of sqlite-vector compares. They do
            | say they "Use a brute-force-like approach, but highly
            | optimized" - which, to me, would be very interesting to see
            | compared with an HNSW approach.
        
       | ashvardanian wrote:
       | My usual line of feedback would be to start with a more
        | aggressive benchmark. Indexing 100K dense vectors (100ish MB here)
       | is not generally a good idea. Brute-force search at that scale is
       | already trivial at 10 GB/s/core.
        
         | sdenton4 wrote:
          | They say in the post that they're doing optimized brute-force
          | search, which honestly makes a lot of sense for the local-scale
          | context.
         | 
         | Vector databases are often over-optimized for getting results
         | into papers (where the be-all-end-all measure is recall@latency
         | benchmarking). Indexed search is very rigid - filtered search
         | is a pain, and you're stuck with the metric you used for
         | indexing.
         | 
         | At smaller data scales, you get a lot more flexibility, and
          | shoving things into an indexed search mindlessly will lead to a
         | lot of pain. Providing optimized flexible search at smaller
         | scales is quite valuable, imo.
        
           | ashvardanian wrote:
           | Ah, I see the article does mention "brute-force-like" -- I
           | must have skimmed past that. I'd be curious what exactly is
           | meant by it in practice.
           | 
           | A small note: since the project seems to include
           | @maratyszcza's fp16 library (MIT), it might be nice to add a
           | line of attribution: https://github.com/maratyszcza/fp16
           | 
           | And if you ever feel like broadening the benchmarks, it could
           | be interesting to compare with USearch. It has had an SQLite
           | connector for a while and covers a wide set of hardware
           | backends and similarity metrics, though it's not yet as
            | straightforward to use on some OSes:
            | https://unum-cloud.github.io/usearch/sqlite
        
             | sdenton4 wrote:
             | To be clear, I'm not the author of the post. But I do
             | maintain a library for folks working with large audio
             | datasets, built on a combination of SQLite and usearch. :)
        
       | aabhay wrote:
       | The repo mentions approximate NN search but the article implies
       | this is mainly brute force. Is there any indexing at all then? If
        | not, is the approximate part an app-space thing, e.g. storing
       | binary vectors alongside the real ones?
       | 
        | In addition, if things are brute-forced, wouldn't a columnar DB
        | perform better than a row-based one, e.g. DuckDB?
        
         | OutOfHere wrote:
         | A columnar database is completely irrelevant to vector search.
         | Vectors aren't stored in columns. Traditional indexing too is
         | altogether irrelevant because brute force means a full pass
         | through the data. Specialized indexes can be relevant, but then
         | the search is generally approximate, not exact.
        
           | roflcopter69 wrote:
            | How is a database being columnar irrelevant to vector search?
            | This very vector search extension shows that brute-force
            | search can work surprisingly well up to a certain dataset
            | size. At that point, columnar storage is great because it
            | gives a perfect memory access pattern for the vector search,
            | instead of iterating over all the rows of a table and only
            | accessing the vector of each row.
        
             | OutOfHere wrote:
             | That makes sense. I withdraw my comment.
        
       | mholt wrote:
       | > [sqlite-vec] works via virtual tables, which means vectors must
       | live in separate tables and queries become more complex.
       | 
       | Not really, you can just call the distance function directly and
       | your vector blob can be in any regular field in any regular
        | table, like I do. Works great.
       | 
       | More info: https://github.com/asg017/sqlite-vec/issues/196
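
The query pattern described above can be sketched as follows. The extension's real SIMD-optimized distance function is replaced here by a Python UDF stand-in (the `vec_distance` name and the table layout are hypothetical), but the query shape - ordering a regular table by distance on a regular BLOB column - is the same:

```python
import math
import sqlite3
import struct

def pack(vector):
    # Little-endian float32 BLOB encoding for the vectors.
    return struct.pack(f"<{len(vector)}f", *vector)

def cosine_distance(a_blob, b_blob):
    # Stand-in for the distance function a vector extension provides.
    a = struct.unpack(f"<{len(a_blob) // 4}f", a_blob)
    b = struct.unpack(f"<{len(b_blob) // 4}f", b_blob)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

db = sqlite3.connect(":memory:")
db.create_function("vec_distance", 2, cosine_distance)
db.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, "
           "label TEXT, embedding BLOB)")
db.executemany("INSERT INTO items (label, embedding) VALUES (?, ?)",
               [("a", pack([1.0, 0.0])),
                ("b", pack([0.0, 1.0])),
                ("c", pack([0.9, 0.1]))])

# Nearest neighbors of [1.0, 0.0], straight from a regular table.
rows = db.execute(
    "SELECT label FROM items ORDER BY vec_distance(embedding, ?) LIMIT 2",
    (pack([1.0, 0.0]),)).fetchall()
print([r[0] for r in rows])  # prints ['a', 'c']
```

No virtual table is involved: the vector column lives alongside any other columns, so ordinary SQL filters compose freely with the distance ordering.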
        
         | roflcopter69 wrote:
          | Sorry for not being on topic, just wanted to say hi @mholt and
          | thank you for making and maintaining Caddy! Happy Caddy user
          | here.
        
           | mholt wrote:
           | Thank you, that's always nice to read! I will pass this along
           | to our maintainer team.
        
       | roflcopter69 wrote:
        | It's unfortunate that this one is not really open source; it
        | uses the Elastic License 2.0.
       | 
       | But it's still a wonderful example for how far you can get with
       | brute-force vector similarity search that has been optimized like
       | crazy by making use of SIMD. Not having to use a fancy index is
       | such a blessing. Think of all the complexities that you don't
       | have when not using an index. You don't have these additional
       | worries about memory and/or disk usage or insert/delete/update
       | costs and you can make full use of SQL filters. It's crazy to me
       | what kind of vector DBs people put up with. They use custom query
       | languages, have to hold huge indices in memory or write humongous
       | indices to disk for something that's not necessarily faster than
        | brute-force search for tables with <5M rows. And let's be
        | honest, who has those gigantic tables with more than 5M rows?
        | Not too many, I'd assume.
        
         | benjiro wrote:
         | > And let's be honest who has those gigantic tables with more
         | than 5M rows? Not too many I'd assume.
         | 
          | /Looks around all innocent ... does 57 billion count? Hate to
          | tell ya, but there are plenty of use cases where you deal with
          | large datasets and a normal table design will get out of hand.
          | And row overhead will bite!
         | 
         | We ended up writing a specialized encoder / decoder to store
         | the information in bytea fields to reduce it to a measly 3
         | billion (row packing is the better term) but we also lost some
         | of the advantages of having the database btree index on some of
         | the date fields.
         | 
          | Thing is, the moment you deal with big data, things start to
          | collapse fast when you weigh brute force vs. storage vs. index
          | handling.
         | 
         | I can think of a lot more projects that can expand to crazy
         | numbers, if you follow basic database normalization.
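
The row-packing idea mentioned above can be sketched with Python's struct module. The fixed-width (timestamp, value) record layout here is a made-up example, not the commenter's actual schema:

```python
import struct

# Hypothetical fixed-width record: a uint32 unix timestamp plus a
# float32 value, 8 bytes per logical row.
RECORD = struct.Struct("<If")

def pack_rows(rows):
    # Concatenate many small logical rows into one binary blob,
    # trading per-row overhead (and per-row btree entries) for a
    # compact encoding stored in a single BLOB/bytea field.
    return b"".join(RECORD.pack(ts, value) for ts, value in rows)

def unpack_rows(blob):
    # Decode the blob back into (timestamp, value) tuples.
    return [RECORD.unpack_from(blob, offset)
            for offset in range(0, len(blob), RECORD.size)]

rows = [(1700000000, 1.5), (1700000060, 2.25), (1700000120, -0.75)]
blob = pack_rows(rows)
print(len(blob))          # 3 records * 8 bytes = 24
print(unpack_rows(blob))  # round-trips the packed rows
```

The trade-off the commenter describes falls out directly: the database now sees one row instead of many, but it can no longer index the fields hidden inside the blob.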
        
       | rcarmo wrote:
        | Seems to work fine, even if the license is a bit of a put-off.
       | 
       | I am however still looking for a fast CPU-bound embedding model
       | (fastembed is quite slow on small ARM chips).
        
       | hbcondo714 wrote:
       | The author replied to one of my comments[1] here on HN a few
       | months ago when I asked about doing ANN on the edge; nice to see
       | it arrive!
       | 
       | [1] https://news.ycombinator.com/item?id=44063950
        
       ___________________________________________________________________
       (page generated 2025-09-26 23:00 UTC)