[HN Gopher] 90x Faster Than Pgvector - Lantern's HNSW Index Crea...
       ___________________________________________________________________
        
       90x Faster Than Pgvector - Lantern's HNSW Index Creation Time
        
       Author : diqi
       Score  : 68 points
       Date   : 2024-01-02 18:21 UTC (4 hours ago)
        
 (HTM) web link (lantern.dev)
 (TXT) w3m dump (lantern.dev)
        
       | levkk wrote:
       | Curious about the "outside of the database" index generation
       | part. Is this index WAL-protected eventually?
        
         | diqi wrote:
         | Yes, it is WAL-protected. The advantage of external indexing is
         | that the HNSW graph is constructed externally on multiple cores
         | instead of on a single core inside the Postgres process. But
         | eventually the graph is parsed and processed inside Postgres
         | with all the necessary WAL logs for blocks.
        
       | mattashii wrote:
       | How does performance scale (vs pgvector) when you have an index
       | and start loading data in parallel? Or how does this scale vs the
       | to-be-released pgvector 0.5.2?
        
         | mattashii wrote:
         | I'm also concerned about these (tested!) errors:
         | 
         | > https://github.com/lanterndata/lantern/blob/040f24253e5a2651...
         | 
         | > Operator <-> can only be used inside of an index
         | 
         | Isn't the use of the distance operator in scan+sort critical
         | for generating the expected/correct result that's needed for
         | validating the recall of an ANN-only index?
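
The validation described above can be sketched as follows. This is a minimal illustration, not Lantern's or pgvector's code: the exact result comes from a brute-force scan plus sort over every vector, and the ANN result is scored against it. The names `l2`, `exact_top_k`, and `recall_at_k` are hypothetical, and the "approximate" result here is simulated rather than produced by a real index.

```python
import math
import random


def l2(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def exact_top_k(query, vectors, k):
    """Ground truth: scan every vector, sort by distance, keep the k nearest ids."""
    order = sorted(range(len(vectors)), key=lambda i: l2(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact nearest neighbours that the ANN result also returned."""
    exact = set(exact_ids)
    if not exact:
        return 1.0
    return len(exact & set(approx_ids)) / len(exact)


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
    query = [rng.random() for _ in range(8)]

    truth = exact_top_k(query, vectors, k=10)

    # Simulated ANN output (assumption): 8 true neighbours plus 2 misses.
    misses = [i for i in range(len(vectors)) if i not in truth][:2]
    approx = truth[:8] + misses

    print(recall_at_k(approx, truth))  # prints 0.8
```

Without a way to run the distance operator in a plain scan and sort, the `exact_top_k` half of this comparison has to be computed outside the database.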
        
           | diqi wrote:
           | Ah, thank you for noticing! We actually have a typo in the
           | error message. It should be the operator <?> instead of
           | <->.
           | 
           | There's some context on the operator <?> here:
           | https://github.com/lanterndata/lantern?tab=readme-ov-file#a-...
        
         | diqi wrote:
         | We haven't benchmarked against 0.5.2 yet so I can't share exact
         | numbers. We will benchmark it once it is released.
         | 
         | We think our approach will still significantly outperform
         | pgvector because it does less on your production database.
         | 
         | We generate the index remotely, on a compute-optimized machine,
         | and only use your production database for index copy.
         | 
         | Parallel pgvector would have to use your production database
         | resources to run the compute-intensive HNSW index creation
         | workload.
        
       | lettergram wrote:
       | As someone who just indexed 6m documents with pgvector, I can say
       | it's a massive time sink - on the order of days, even with a
       | 32-core, 64 GB RDS instance.
        
         | cyanydeez wrote:
         | What were the token sizes, for comparison?
        
           | lettergram wrote:
           | I've done a few: 384, 762, 512. All take a few days.
           | 
           | Though index creation is not a big deal, I want good queries
           | rapidly for cheap. So IMO RDS with pgvector is the easiest
           | approach.
        
         | jn2clark wrote:
         | That sounds much longer than it should. I am not sure about your
         | exact use case, but I would encourage you to check out Marqo
         | (https://github.com/marqo-ai/marqo - disclaimer, I am a
         | co-founder). All inference and orchestration is included (no API
         | calls) and many open-source or fine-tuned models can be used.
        
           | chatmasta wrote:
           | > That [pgvector index creation time] sounds much longer than
           | it should... I would encourage you to check out Marqo
           | 
           | Your comment makes it sound like Marqo is a way to speed up
           | pgvector indexing, but to be clear, Marqo is just another
           | Vector Database and is unrelated to pgvector.
        
           | code_biologist wrote:
           | The reason I would use pgvector is because I am uninterested
           | in another piece of infrastructure.
        
       | netcraft wrote:
       | So approximately 0% chance I could use this on AWS RDS or Aurora,
       | correct?
       | 
       | Still, very impressive
        
         | tristan957 wrote:
         | This extension is licensed under the Business Source
         | License[0], which makes it incompatible with most DBaaS
         | offerings. The BSL is a source-available license, not an
         | open-source one. Good choice for Lantern, but unusable for
         | everyone else.
         | 
         | Some Postgres offerings allow you to bring your own extensions
         | to work around limitations of these restrictive licenses, for
         | instance Neon[1], where I work. I tried to look at the AWS docs
         | for you, but couldn't find anything about that. I did find
         | Trusted Language Extensions[2], but that seems to be more about
         | writing your own extension. Couldn't find a way to upload
         | arbitrary extensions.
         | 
         | I will add that you could use logical replication[3] to mirror
         | data from your primary database into a Lantern-hosted database
         | (or host your own database with the Lantern extension). This
         | obviously has a couple downsides, but thought I would mention
         | it.
         | 
         | [0]:
         | https://github.com/lanterndata/lantern/commit/dda7f064ca80af...
         | 
         | [1]: https://neon.tech/docs/extensions/pg-extensions#custom-built...
         | 
         | [2]:
         | https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
         | 
         | [3]: https://www.postgresql.org/docs/current/logical-replication....
        
         | themanmaran wrote:
         | Likely as an extension eventually. I know RDS has a variety of
         | Postgres extensions you can use. pgvector is supported, so
         | Lantern could likely get support as well.
         | 
         | [1]
         | https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Postg...
        
       | TuringNYC wrote:
       | You piqued my interest enough to sign up and try...but now it
       | needs an Access Code to try the DB, any HN special here?
        
         | diqi wrote:
         | Try YCW24! :)
        
       | jbellis wrote:
       | Nice to see people care about index construction time.
       | 
       | I'm the lead author of JVector, which scales linearly to at least
       | 32 cores and may be the only graph-based vector index designed
       | around nonblocking data structures (as opposed to using locks for
       | thread safety): https://github.com/jbellis/jvector/
       | 
       | JVector looks to be about 2x as fast at indexing as Lantern,
       | ingesting the Sift1M dataset in under 25s on a 32-core AWS box
       | (m6i.16xl), compared to 50s for Lantern in the article.
       | 
       | (JVector is based on DiskANN, not HNSW, but the configuration
       | parameters are similar -- both are configured with graph degree
       | and search width.)
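
The arithmetic behind the "about 2x" figure can be worked through in a few lines. This is a rough sketch only: it takes the 25s and 50s timings from the comment at face value, uses Sift1M's 1,000,000 base vectors, and treats "under 25s" as an upper bound. The helper names are hypothetical, and the thread does not establish that both timings come from the same hardware or index parameters.

```python
SIFT1M_BASE_VECTORS = 1_000_000  # size of the Sift1M base set


def vectors_per_second(num_vectors, seconds):
    """Average ingest throughput over the whole index build."""
    return num_vectors / seconds


def speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    jvector_seconds = 25.0  # "under 25s" in the comment, used as an upper bound
    lantern_seconds = 50.0  # figure the comment attributes to the article

    print(speedup(lantern_seconds, jvector_seconds))                 # prints 2.0
    print(vectors_per_second(SIFT1M_BASE_VECTORS, jvector_seconds))  # prints 40000.0
    print(vectors_per_second(SIFT1M_BASE_VECTORS, lantern_seconds))  # prints 20000.0
```

Because the JVector time is a bound rather than a measurement, 2x and 40,000 vectors per second are lower bounds on its side of the comparison.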
        
       | justinclift wrote:
       | Seems related: https://news.ycombinator.com/item?id=38840850
        
       ___________________________________________________________________
       (page generated 2024-01-02 23:00 UTC)