[HN Gopher] Show HN: HNSW index for vector embeddings in approx ...
___________________________________________________________________
Show HN: HNSW index for vector embeddings in approx 500 LOC
Author : dicroce
Score : 49 points
Date : 2025-04-08 15:43 UTC (7 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| oersted wrote:
| I particularly appreciated the concise and plain explanation of
| the data-structure, it really demystifies it.
|
| > HNSW is a graph structure that consists of levels that are more
| sparsely populated at the top and more densely populated at the
| bottom. Nodes within a layer have connections to other nodes that
| are near them on the same level. When a node is inserted a random
| level is picked and the node is inserted there. It is also
| inserted into all levels beneath that level down to 0.
|
| > When searches arrive they start at the top and search that
| level (following connections) until they find the closest node in
| the top level. The search then descends and keeps searching
| nearby nodes. As the search progresses the code keeps track of
| the K nearest nodes it has seen. Eventually it either finds the
| value OR it finds the closest value on level 0 and the K nearest
| nodes seen are returned.
| imurray wrote:
| Looks neat. It would be useful to compare to other
| implementations: https://ann-benchmarks.com/ -- potentially not
| just speed, but implementation details that might change recall.
| swyx wrote:
| i think with small codebases like this is less about speed and
| more about education of essentials - i actually often encourage
| juniors to do small clones like this, feel proud, and then
| study the diffs with the at-scale repros and either feel
| humbled or feel like they have a contribution to make.
| oersted wrote:
| I see they are still using GloVe word embeddings for the first
| benchmark. Ah good ol' days! Nothing wrong with it, should
| still yield a realistic distribution of vectors. Just brings a
| lot of memories :)
| antirez wrote:
| Yes, HSNWs are not _so_ complex, and they work great. I wrote an
| implementation myself, it 's 2500 lines of code (5x the one of
| dicroce!), but inside there is binary and int8 quantization and
| many advanced features (include true deletions), and I commented
| it as much as possible. I hope you may find it useful to read
| alongside the one proposed by OP:
|
| https://github.com/redis/redis/blob/unstable/modules/vector-...
|
| Still, to be honest, I'm reading the 500 lines of code with great
| interest because I didn't thought it was possible to go so small,
| maybe part of the trick is that's in C++ and not in C, as for
| instance you don't have the queue code.
|
| Also the strategy used by my implementation in order to re-link
| the orphaned nodes upon deletion adds complexity, too. (btw any
| feedback on that part is especially appreciated)
|
| EDIT: Ok after carefully inspection this makes more sense :D
|
| 1. Yes, C++ helps, the vector class, the priority queue, ...
|
| 2. I forgot to say that I implemented serialization other than
| quantization, and this also includes quite some code.
|
| 3. Support for threads is another complexity / code-size price to
| pay.
|
| And so forth. Ok, now it makes a lot of sense. Probably the 500
| LOC implementation is a better first-exposure experience for
| newcomers. After accumulating all the "but how to..." questions,
| maybe my C implementation is a useful read.
| mertleee wrote:
| I guess I'm too simple to understand why this is useful? Just
| because it's been implemented in so few lines or?
___________________________________________________________________
(page generated 2025-04-08 23:00 UTC)