[HN Gopher] 90x Faster Than Pgvector - Lantern's HNSW Index Crea...
___________________________________________________________________
90x Faster Than Pgvector - Lantern's HNSW Index Creation Time
Author : diqi
Score : 68 points
Date : 2024-01-02 18:21 UTC (4 hours ago)
(HTM) web link (lantern.dev)
(TXT) w3m dump (lantern.dev)
| levkk wrote:
| Curious about the "outside of the database" index generation
| part. Is this index WAL-protected eventually?
| diqi wrote:
| Yes, it is WAL-protected. The advantage of external indexing is
| that the HNSW graph is constructed externally on multiple cores
| instead of on a single core inside the Postgres process. But
| eventually the graph is parsed and processed inside Postgres,
| with all the necessary WAL records for the blocks.
| mattashii wrote:
| How does performance scale (vs pgvector) when you have an index
| and start loading data in parallel? Or how does this scale vs the
| to-be-released pgvector 0.5.2?
| mattashii wrote:
| I'm also concerned about these (tested!) errors:
|
| > https://github.com/lanterndata/lantern/blob/040f24253e5a2651...
|
| > Operator <-> can only be used inside of an index
|
| Isn't the use of the distance operator in scan+sort critical
| for generating the expected/correct result that's needed for
| validating the recall of an ANN-only index?
| diqi wrote:
| Ah, thank you for noticing! We actually have a typo in the
| error message. It should be the operator <?> instead of <->.
|
| There's some context on the operator <?> here:
| https://github.com/lanterndata/lantern?tab=readme-ov-file#a-...
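
The recall question raised above can be made concrete with a minimal
sketch, assuming L2 distance and small in-memory lists. The function
names here are hypothetical and are not part of Lantern or pgvector.
The exact scan + sort produces the ground truth, and recall is the
fraction of those true neighbors that an approximate index returns.

```python
import math


def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k):
    """Brute-force scan + sort: ids of the k nearest vectors.

    This is the ground truth that an ANN result is compared against.
    """
    order = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact neighbors that the approximate result found."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)


# Toy data: ids are list positions.
vectors = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.5, 0.0]]
truth = exact_top_k([0.0, 0.0], vectors, k=2)  # ids 0 and 3 are nearest
# Pretend an ANN index returned ids 0 and 1 (a made-up result).
print(recall_at_k([0, 1], truth))  # prints 0.5
```

Without an exact distance computation outside the index there is no
ground truth to compare against, which is why the scan + sort path
matters for validating recall.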
| diqi wrote:
| We haven't benchmarked against 0.5.2 yet so I can't share exact
| numbers. We will benchmark it once it is released.
|
| We think our approach will still significantly outperform
| pgvector because it does less on your production database.
|
| We generate the index remotely, on a compute-optimized machine,
| and only use your production database for index copy.
|
| Parallel pgvector would have to use your production database
| resources to run the compute-intensive HNSW index creation
| workload.
| lettergram wrote:
| As someone who just indexed 6M documents with pgvector, I can say
| it's a massive time sink - on the order of days, even with a
| 32-core, 64 GB RDS instance.
| cyanydeez wrote:
| What were the token sizes, for comparison?
| lettergram wrote:
| I've done a few: 384, 762, 512. All take a few days.
|
| Though index creation is not a big deal, I want good queries
| rapidly for cheap. So IMO RDS with pgvector is the easiest
| approach.
| jn2clark wrote:
| That sounds much longer than it should. I am not sure of your
| exact use case, but I would encourage you to check out Marqo
| (https://github.com/marqo-ai/marqo - disclaimer, I am a
| co-founder). All inference and orchestration is included (no API
| calls) and many open-source or fine-tuned models can be used.
| chatmasta wrote:
| > That [pgvector index creation time] sounds much longer than
| it should... I would encourage you to check out Marqo
|
| Your comment makes it sound like Marqo is a way to speed up
| pgvector indexing, but to be clear, Marqo is just another
| Vector Database and is unrelated to pgvector.
| code_biologist wrote:
| The reason I would use pgvector is because I am uninterested
| in another piece of infrastructure.
| netcraft wrote:
| So approximately 0% chance I could use this on AWS RDS or Aurora,
| correct?
|
| Still, very impressive
| tristan957 wrote:
| This extension is licensed under the Business Source
| License[0], which makes it incompatible with most DBaaS
| offerings. The BSL is a closed-source license. Good choice for
| Lantern, but unusable for everyone else.
|
| Some Postgres offerings allow you to bring your own extensions
| to work around the limitations of these restrictive licenses, for
| instance Neon[1], where I work. I tried to look at the AWS docs
| for you, but couldn't find anything about that. I did find
| Trusted Language Extensions[2], but that seems to be more about
| writing your own extension. Couldn't find a way to upload
| arbitrary extensions.
|
| I will add that you could use logical replication[3] to mirror
| data from your primary database into a Lantern-hosted database
| (or host your own database with the Lantern extension). This
| obviously has a couple of downsides, but I thought I would
| mention it.
|
| [0]:
| https://github.com/lanterndata/lantern/commit/dda7f064ca80af...
|
| [1]: https://neon.tech/docs/extensions/pg-extensions#custom-built...
|
| [2]:
| https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
|
| [3]: https://www.postgresql.org/docs/current/logical-replication....
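
The logical replication route mentioned above could look roughly like
the following sketch. It only builds the two standard Postgres
statements (CREATE PUBLICATION on the primary, CREATE SUBSCRIPTION on
the database that has the Lantern extension) and does not connect to
anything. The names docs_pub, docs_sub, documents, and the connection
string are hypothetical placeholders.

```python
def publication_sql(pub_name, tables):
    """Statement to run on the primary (source) database.

    Identifiers are inserted as-is, so pass only trusted, simple names.
    """
    return f"CREATE PUBLICATION {pub_name} FOR TABLE {', '.join(tables)};"


def subscription_sql(sub_name, conninfo, pub_name):
    """Statement to run on the replica that hosts the vector index."""
    escaped = conninfo.replace("'", "''")  # escape quotes in the SQL literal
    return (
        f"CREATE SUBSCRIPTION {sub_name} "
        f"CONNECTION '{escaped}' PUBLICATION {pub_name};"
    )


# Hypothetical names and host, for illustration only.
print(publication_sql("docs_pub", ["documents"]))
print(subscription_sql("docs_sub", "host=primary.example dbname=app", "docs_pub"))
```

This assumes the primary runs with wal_level = logical and that the
replicated tables already exist on the subscriber with a matching
schema; the index itself would then be built on the subscriber side.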
| themanmaran wrote:
| Likely as an extension eventually. I know RDS has a variety of
| Postgres extensions you can use. pgvector is supported, so
| Lantern could likely get support as well.
|
| [1]
| https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
| TuringNYC wrote:
| You piqued my interest enough to sign up and try...but now it
| needs an Access Code to try the DB, any HN special here?
| diqi wrote:
| Try YCW24! :)
| jbellis wrote:
| Nice to see people care about index construction time.
|
| I'm the lead author of JVector, which scales linearly to at least
| 32 cores and may be the only graph-based vector index designed
| around nonblocking data structures (as opposed to using locks for
| thread safety): https://github.com/jbellis/jvector/
|
| JVector looks to be about 2x as fast at indexing as Lantern,
| ingesting the Sift1M dataset in under 25s on a 32-core AWS box
| (m6i.16xl), compared to 50s for Lantern in the article.
|
| (JVector is based on DiskANN, not HNSW, but the configuration
| parameters are similar -- both are configured with graph degree
| and search width.)
| justinclift wrote:
| Seems related: https://news.ycombinator.com/item?id=38840850
___________________________________________________________________
(page generated 2024-01-02 23:00 UTC)