[HN Gopher] Pgvector - vector similarity search for Postgres
       ___________________________________________________________________
        
       Pgvector - vector similarity search for Postgres
        
       Author : simonpure
       Score  : 94 points
       Date   : 2021-04-22 14:26 UTC (8 hours ago)
        
        
       | robertlagrant wrote:
       | I like the long-running trend of using Postgres as a standard
       | platform to build good tools on.
        
       | minxomat wrote:
       | I've done something similar (heh) in pure Postgres:
       | https://github.com/turbo/pg-costop
        
         | phenkdo wrote:
         | How does this do with large-scale searches over 10M+ rows? Any
         | benchmarks on performance?
        
           | gk1 wrote:
           | Since the answer, in minxomat's words, is "terrible," maybe
           | look at Pinecone (https://www.pinecone.io) which makes light
           | work of searching through 10M+ rows. It sits alongside, not
           | inside, your Postgres (or whatever) warehouse.
           | 
           | Disclosure: I work there.
        
           | minxomat wrote:
           | Oh it's terrible. More educational. You'd never ever want to
           | do a full-scan cosine matching in production. Use locality
           | sensitive hashing (with optimizations like amplification,
           | SuperBit or DenseFly) for real world workloads.
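A minimal sketch of the random-hyperplane flavour of locality sensitive hashing mentioned above, in plain Python. It is an illustration under stated assumptions, not how pg-costop or pgvector work: the helper names (`make_hyperplanes`, `signature`, `build_index`, `query`) are invented for this example, a single hash table is used, and the optimizations named in the comment (amplification, SuperBit, DenseFly) are not shown.

```python
import math
import random


def cosine_similarity(a, b):
    """Exact cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Group row ids by signature so a query only scans its own bucket."""
    buckets = {}
    for row_id, vec in enumerate(vectors):
        buckets.setdefault(signature(vec, hyperplanes), []).append(row_id)
    return buckets


def query(q, vectors, buckets, hyperplanes):
    """Rank only the rows sharing the query's bucket (approximate result)."""
    candidates = buckets.get(signature(q, hyperplanes), [])
    return sorted(
        candidates,
        key=lambda i: cosine_similarity(q, vectors[i]),
        reverse=True,
    )
```

Vectors separated by a small angle tend to fall on the same side of most hyperplanes, so they usually share a bucket and the full scan is avoided. With a single table, near neighbours that land in a different bucket are missed; using several independent tables is one common way to trade memory for recall.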
        
         | jononor wrote:
         | Nice, especially useful on managed services where installing
         | additional extensions may not be an option. If you want wider
         | usage, you may want to add a section on how to install and set
         | this up with a typical PostgreSQL setup. Also, how was the code
         | tested for correctness, and what is the expected performance?
        
       | RicoElectrico wrote:
       | Can somebody chime in and say whether this would be a useful
       | solution to recommend similar texts if I compute fastText,
       | doc2vec or other embeddings?
       | 
       | What do such vector similarity queries use in production when
       | there's no DB support? Surely having the "real" DB like Postgres
       | and one for vectors alongside it would be cumbersome.
        
       | simonw wrote:
       | Side observation, not meant as a criticism of this project at all
       | (which seems to be consistently doing things the PostgreSQL way,
       | I commend that).
       | 
       | The PostgreSQL world really loves operators - this extension uses
       | <#>, <-> and <=>, for example.
       | 
       | I've been working with PostgreSQL JSON queries a bit recently
       | which is also really operator heavy:
       | https://www.postgresql.org/docs/13/functions-json.html - ->, ->>,
       | #>, #>>
       | 
       | I have a really hard time remembering what any of these do - I
       | personally much prefer to use function equivalents like
       | json_extract_path() if they are available.
        
         | jononor wrote:
         | Yeah, I would love to be able to use functions for these vector
         | distance operations instead of the very magical operators...
        
           | megous wrote:
           | https://github.com/ankane/pgvector/blob/master/vector--
           | 0.1.0...
           | 
           | Nothing stops you. All operators are implemented just as
           | procedure calls.
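For readers who find the three operators hard to keep apart: according to the pgvector README, `<->` is Euclidean (L2) distance, `<#>` is negative inner product, and `<=>` is cosine distance. The sketch below spells out that arithmetic in plain Python. The Python function names are chosen for this example only; the SQL-level procedure names should be checked against the extension's SQL file linked above.

```python
import math


def l2_distance(a, b):
    """What `<->` computes: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def negative_inner_product(a, b):
    """What `<#>` computes: the inner product with its sign flipped,
    so that smaller still means more similar when ordering ascending."""
    return -sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """What `<=>` computes: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

All three are written as distances, so an ascending sort puts the closest rows first regardless of which one is used.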
        
         | eloff wrote:
         | I agree, I never remember the syntax for the JSON operators
         | either. I'd rather have functions, which are readable even if
         | you don't remember what they do. Operators need to be looked up.
         | This emphasis on terseness over readability was one of the
         | reasons I abandoned Perl long ago.
        
         | PeterisP wrote:
         | JSON query operators are somewhat unusual because it's very
         | common to need to chain many operators in a row, which would be
         | very verbose if named functions were used instead - the
         | vector similarity is different in this regard, and that
         | terseness isn't necessary.
        
       | kakadzhun wrote:
       | https://github.com/joosephook/sqlite3-numpy
       | 
       | Minimal working example of storing numpy vectors in sqlite3 using
       | Python:
       |
       |     import sqlite3
       |
       |     import numpy as np
       |     from scipy.spatial.distance import cdist
       |
       |     class Vector:
       |         # Only its name is used, as the declared column type.
       |         def __init__(self, arg):
       |             self.arg = arg
       |
       |         def __str__(self):
       |             return 'Vector({!r})'.format(self.arg)
       |
       |     def adapter_func(obj: np.ndarray):
       |         # Serialize the array to raw bytes for storage as a BLOB.
       |         return obj.tobytes()
       |
       |     def converter_func(data: bytes):
       |         # Assumes float64 data, the default dtype of np.frombuffer.
       |         return np.frombuffer(data)
       |
       |     # Register the adapter for the type actually inserted below.
       |     sqlite3.register_adapter(np.ndarray, adapter_func)
       |     # Converter names are matched case-insensitively against the
       |     # declared column type ("vector").
       |     sqlite3.register_converter(Vector.__name__, converter_func)
       |
       |     if __name__ == '__main__':
       |         with sqlite3.connect(
       |             ":memory:", detect_types=sqlite3.PARSE_DECLTYPES
       |         ) as con:
       |             cur = con.cursor()
       |             cur.execute("create table test(v vector)")
       |             cur.execute("insert into test(v) values (?)",
       |                         (np.random.random(1280),))
       |             cur.execute("insert into test(v) values (?)",
       |                         (np.random.random(1280),))
       |             cur.execute('select v from test')
       |
       |             vectors = []
       |             for v, in cur.fetchall():
       |                 print(v.shape, v)
       |                 assert isinstance(v, np.ndarray)
       |                 vectors.append(v)
       |
       |             assert len(vectors) == 2
       |             print(cdist(vectors[:1], vectors[1:],
       |                         metric='cosine')[0])
       |             cur.close()
        
         | patelajay285 wrote:
         | Our startup made a package powered by SQLite for this very
         | purpose: https://github.com/plasticityai/magnitude
         | 
         | Might be worth checking out :)
        
       | sammorrowdrums wrote:
       | I have been using Smlar for a while for cosine similarity. Might
       | this project provide a viable alternative? Smlar can be quite
       | slow.
       | 
       | https://github.com/jirutka/smlar
        
       | BenoitP wrote:
       | Neat!
       | 
       | I guess from the 'ivfflat' keyword that this project is serving
       | vectors produced by FAISS[1] somehow?
       | 
       | [1] https://github.com/facebookresearch/faiss
        
         | heipei wrote:
         | FAISS doesn't produce vectors, it's simply a vector similarity
         | search engine.
        
           | BenoitP wrote:
           | It produces vectors, also does the (distributed) searching.
           | 
           | Here is a partial list of possible vector types:
           | 
           | https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
        
         | gk1 wrote:
         | Seems that way. Faiss is mentioned in the "Thanks" list.
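For context on the 'ivfflat' name discussed above: in Faiss terminology, an IVF-flat index clusters the stored vectors around a set of centroids, keeps one inverted list of unmodified ("flat") vectors per centroid, and at query time scans only the lists whose centroids are closest to the query. The sketch below illustrates that idea in plain Python. It is not pgvector's or Faiss's code: the function names are invented for this example, and the parameter names `n_lists` and `probes` are only loosely modelled on the `lists` and `probes` settings in the pgvector README.

```python
import math
import random


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(centroids, vec, count=1):
    """Indices of the `count` centroids closest to vec."""
    order = sorted(range(len(centroids)), key=lambda c: l2(centroids[c], vec))
    return order[:count]


def train_centroids(vectors, n_lists, iterations=10, seed=0):
    """Plain k-means (Lloyd's algorithm): one centroid per inverted list."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_lists)]
    for _ in range(iterations):
        groups = [[] for _ in range(n_lists)]
        for vec in vectors:
            groups[nearest(centroids, vec)[0]].append(vec)
        for c, members in enumerate(groups):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def build_ivfflat(vectors, n_lists, seed=0):
    """Assign every row id to the list of its nearest centroid."""
    centroids = train_centroids(vectors, n_lists, seed=seed)
    lists = [[] for _ in range(n_lists)]
    for row_id, vec in enumerate(vectors):
        lists[nearest(centroids, vec)[0]].append(row_id)
    return centroids, lists


def search(q, vectors, centroids, lists, k=5, probes=1):
    """Scan only the `probes` closest lists, then rank those rows exactly."""
    candidates = [i for c in nearest(centroids, q, probes) for i in lists[c]]
    return sorted(candidates, key=lambda i: l2(vectors[i], q))[:k]
```

Raising `probes` scans more lists, which improves recall at the cost of speed; with `probes` equal to `n_lists` the search degenerates into an exact full scan.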
        
       | phenkdo wrote:
       | Love it! We need better ANN approaches in databases; using
       | external servers such as Milvus or Faiss is a PITA.
       | 
       | related: https://github.com/netrasys/pgANN
        
         | gk1 wrote:
         | Why are they a pain in the...?
        
           | EmilStenstrom wrote:
           | They typically load all data in memory, so you still need
           | persistence to handle crashes (two setups). And since data is
           | typically huge you need servers with lots of expensive RAM.
        
             | rkt08 wrote:
             | How about a vector-oriented 'database' instead? Pinecone
             | (https://www.pinecone.io/) does both exact and approximate
             | search, and it's fully managed, so you don't have to worry
             | about reliability, availability, etc.
             | 
             | PS: I work there
        
             | patelajay285 wrote:
             | Check out Magnitude, we built it to solve that problem:
             | https://github.com/plasticityai/magnitude
             | 
             | It's still loaded from a file, but heavily uses memory-
             | mapping and caching to be speedy and not overload your RAM
             | immediately. And in production scenarios, multiple worker
             | processes can share that memory due to the memory mapping.
             | 
             | Granted it's read-only, so might not be exactly what you
             | are looking for.
             | 
             | Disclaimer: I'm the author.
        
           | phenkdo wrote:
           | The usual yet-another-moving-part (YAMP) complexity.
        
       ___________________________________________________________________
       (page generated 2021-04-22 23:01 UTC)