[HN Gopher] Pgvector - vector similarity search for Postgres
___________________________________________________________________
Pgvector - vector similarity search for Postgres
Author : simonpure
Score : 94 points
Date : 2021-04-22 14:26 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| robertlagrant wrote:
| I like the long-running trend of using Postgres as a standard
| platform to build good tools on.
| minxomat wrote:
| I've done something similar (heh) in pure Postgres:
| https://github.com/turbo/pg-costop
| phenkdo wrote:
| How does this do with large-scale searches (10M+ rows)? Any
| benchmarks on performance?
| gk1 wrote:
| Since the answer, in minxomat's words, is "terrible," maybe
| look at Pinecone (https://www.pinecone.io) which makes light
| work of searching through 10M+ rows. It sits alongside, not
| inside, your Postgres (or whatever) warehouse.
|
| Disclosure: I work there.
| minxomat wrote:
| Oh, it's terrible. It's more educational than practical. You'd
| never want to do a full-scan cosine match in production. Use
| locality-sensitive hashing (with optimizations like amplification,
| SuperBit, or DenseFly) for real-world workloads.
| jononor wrote:
| Nice, useful especially on managed services where installing
| additional extensions may not be an option. If you want wider
| usage, you may want to add a section on how to install and set
| this up with a typical PostgreSQL installation. Also, how was the
| code tested for correctness, and what is the expected
| performance?
| RicoElectrico wrote:
| Can somebody chime in and say whether this would be a useful
| solution to recommend similar texts if I compute fastText,
| doc2vec or other embeddings?
|
| What do such vector similarity queries use in production when
| there's no DB support? Surely having the "real" DB like Postgres
| and one for vectors alongside it would be cumbersome.
| [deleted]
| simonw wrote:
| Side observation, not meant as a criticism of this project at all
| (which seems to be consistently doing things the PostgreSQL way,
| I commend that).
|
| The PostgreSQL world really loves operators - this extension uses
| <#>, <-> and <=> for example
|
| I've been working with PostgreSQL JSON queries a bit recently
| which is also really operator heavy:
| https://www.postgresql.org/docs/13/functions-json.html - ->, ->>,
| #>, #>>
|
| I have a really hard time remembering what any of these do - I
| personally much prefer to use function equivalents like
| json_extract_path() when they are available.
| jononor wrote:
| Yeah, I would love to be able to use functions for these vector
| distance operations instead of the very magical operators...
| megous wrote:
| https://github.com/ankane/pgvector/blob/master/vector--
| 0.1.0...
|
| Nothing stops you. All operators are implemented just as
| procedure calls.
| eloff wrote:
| I agree, I never remember the syntax for the JSON operators
| either. I'd rather have functions, which are readable even if you
| don't remember what they do; operators have to be looked up. This
| emphasis on terseness over readability was one of the reasons I
| abandoned Perl long ago.
| PeterisP wrote:
| JSON query operators are somewhat unusual because it's very
| common to need to chain many of them in a row, which would be
| very verbose with named functions - vector similarity is
| different in this regard, so that terseness isn't necessary.
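(Editorial aside, not part of the thread: to illustrate the operator/function duality discussed above, the JSON operators simonw lists have named equivalents documented on the same PostgreSQL page; table and column names here are hypothetical.)

```sql
-- Operator form: get member 'a' as json, then member 'b' as text.
SELECT data -> 'a' ->> 'b' FROM t;

-- Equivalent function form.
SELECT json_extract_path_text(data, 'a', 'b') FROM t;
```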
| kakadzhun wrote:
| https://github.com/joosephook/sqlite3-numpy
|
| Minimal working example of storing numpy vectors in sqlite3 using
| Python (the adapter is registered for np.ndarray, since ndarrays
| are what get inserted):
|
|   import sqlite3
|
|   import numpy as np
|   from scipy.spatial.distance import cdist
|
|   def adapter_func(obj: np.ndarray) -> bytes:
|       # Serialize an ndarray to raw bytes for storage as a BLOB.
|       return obj.tobytes()
|
|   def converter_func(data: bytes) -> np.ndarray:
|       # Deserialize the raw bytes back into a float64 ndarray.
|       return np.frombuffer(data)
|
|   # Adapt np.ndarray values on insert; convert columns declared
|   # with type "vector" on read (requires PARSE_DECLTYPES).
|   sqlite3.register_adapter(np.ndarray, adapter_func)
|   sqlite3.register_converter("vector", converter_func)
|
|   if __name__ == '__main__':
|       with sqlite3.connect(":memory:",
|                            detect_types=sqlite3.PARSE_DECLTYPES) as con:
|           cur = con.cursor()
|           cur.execute("create table test(v vector)")
|           cur.execute("insert into test(v) values (?)",
|                       (np.random.random(1280),))
|           cur.execute("insert into test(v) values (?)",
|                       (np.random.random(1280),))
|           cur.execute("select v from test")
|           vectors = []
|           for v, in cur.fetchall():
|               print(v.shape, v)
|               assert isinstance(v, np.ndarray)
|               vectors.append(v)
|           assert len(vectors) == 2
|           print(cdist(vectors[:1], vectors[1:], metric='cosine')[0])
|           cur.close()
| patelajay285 wrote:
| Our startup made a package powered by SQLite for this very
| purpose: https://github.com/plasticityai/magnitude
|
| Might be worth checking out :)
| sammorrowdrums wrote:
| I have been using smlar for a while for cosine similarity. Might
| this project provide a viable alternative? Smlar can be quite
| slow.
|
| https://github.com/jirutka/smlar
| BenoitP wrote:
| Neat!
|
| I guess from the 'ivfflat' keyword that this project is serving
| vectors produced by FAISS[1] somehow?
|
| https://github.com/facebookresearch/faiss
| heipei wrote:
| FAISS doesn't produce vectors; it's simply a vector similarity
| search engine.
| BenoitP wrote:
| It produces vectors, also does the (distributed) searching.
|
| Here is a partial list of possible vector types:
|
| https://github.com/facebookresearch/faiss/wiki/Faiss-indexes
| gk1 wrote:
| Seems that way. Faiss is mentioned in the "Thanks" list.
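(Editorial aside, not part of the thread: for readers unfamiliar with the keyword, "IVF, flat" refers to an inverted-file index - vectors are bucketed by their nearest coarse centroid, and a query exactly scans ("flat") only the few closest buckets instead of the whole table. A rough numpy sketch of the idea, illustrative only and not pgvector's or FAISS's implementation:)

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.random((2000, 32))

# 1. Coarse quantizer: a handful of centroids. Here they are random
#    data points; real implementations run k-means.
n_lists = 20
centroids = data[rng.choice(len(data), n_lists, replace=False)]

# 2. Inverted lists: assign each vector to its nearest centroid.
assign = np.argmin(
    np.linalg.norm(data[:, None] - centroids[None], axis=2), axis=1)
lists = {i: np.where(assign == i)[0] for i in range(n_lists)}

def ivf_flat_search(q, nprobe=3):
    # Probe only the nprobe closest lists, then do exact ("flat")
    # distance computation inside them.
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    cand = np.concatenate([lists[i] for i in order])
    return cand[np.argmin(np.linalg.norm(data[cand] - q, axis=1))]

# Query with a slightly perturbed copy of row 123.
q = data[123] + 0.001
print(ivf_flat_search(q))
```

Raising `nprobe` trades speed for recall: more lists probed means fewer missed neighbors near bucket boundaries.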
| phenkdo wrote:
| Love it! We need better ANN approaches in databases; using
| external servers such as Milvus or FAISS is a pita.
|
| related: https://github.com/netrasys/pgANN
| gk1 wrote:
| Why are they a pain in the...?
| EmilStenstrom wrote:
| They typically load all data in memory, so you still need
| persistence to handle crashes (two setups). And since data is
| typically huge you need servers with lots of expensive RAM.
| rkt08 wrote:
| How about a vector-oriented database instead? Pinecone
| (https://www.pinecone.io/) does both exact and approximate
| search, and it's fully managed so you don't have to worry about
| reliability, availability, etc.
|
| PS: I work there
| patelajay285 wrote:
| Check out Magnitude, we built it to solve that problem:
| https://github.com/plasticityai/magnitude
|
| It's still loaded from a file, but heavily uses memory-
| mapping and caching to be speedy and not overload your RAM
| immediately. And in production scenarios, multiple worker
| processes can share that memory due to the memory mapping.
|
| Granted, it's read-only, so it might not be exactly what you are
| looking for.
|
| Disclaimer: I'm the author.
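(Editorial aside, not part of the thread, and not Magnitude's actual code: the memory-mapping approach described above can be sketched with numpy's `mmap_mode`, where lazily loaded pages are shared across processes via the OS page cache.)

```python
import os
import tempfile

import numpy as np

# Persist a matrix of embeddings once, e.g. in a build step.
path = os.path.join(tempfile.mkdtemp(), "vectors.npy")
np.save(path, np.random.default_rng(0).random((1000, 64)))

# Open it memory-mapped and read-only: pages are loaded lazily on
# access, and multiple worker processes opening the same file share
# the OS page cache, so the RAM cost is paid once, not per process.
vecs = np.load(path, mmap_mode="r")

query = np.asarray(vecs[0])  # materialize one row as a plain array

# Brute-force cosine similarity of the query against every stored row.
sims = (vecs @ query) / (np.linalg.norm(vecs, axis=1)
                         * np.linalg.norm(query))
best = int(np.argmax(sims))
print(best, sims[best])
```

Since row 0 is the query itself, the search comes back with index 0 at similarity ~1.0; only the touched pages of the file are ever resident.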
| phenkdo wrote:
| the usual yet-another-moving-part (YAMP) complexity.
___________________________________________________________________
(page generated 2021-04-22 23:01 UTC)