[HN Gopher] Measuring the popularity of different vector databases
       ___________________________________________________________________
        
       Measuring the popularity of different vector databases
        
       Author : lmeyerov
       Score  : 44 points
       Date   : 2022-09-20 16:48 UTC (6 hours ago)
        
 (HTM) web link (gradientflow.com)
 (TXT) w3m dump (gradientflow.com)
        
       | ramoz wrote:
       | https://github.com/google-research/google-research/tree/mast...
       | 
        | We use ScaNN for large-scale, performant neural search.
        | Otherwise this all feels bloated.
        
         | freediver wrote:
          | Another upvote for ScaNN. Used in production on the Teclis.com
          | search engine (it replaced the previously used Faiss, which
          | was excellent). ScaNN gives almost double the performance,
          | with the same ~5 lines of code to implement locally.
          | 
          | At this point vector search is almost 'solved'; the remaining
          | challenge is the quality of the vectors fed into the index.
          | Producing accurate text/image/... vector representations (aka
          | embeddings) is where a lot more progress needs to be made.
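          | 
          | For reference, those few lines look roughly like the following
          | (a sketch based on ScaNN's builder API; the dataset and tuning
          | parameters here are illustrative, not Teclis's actual
          | configuration):
          | 
          |     import numpy as np
          |     import scann
          | 
          |     # Embeddings, L2-normalized so dot product == cosine
          |     dataset = np.random.rand(100000, 128).astype(np.float32)
          |     dataset /= np.linalg.norm(dataset, axis=1, keepdims=True)
          | 
          |     searcher = (scann.scann_ops_pybind
          |         .builder(dataset, 10, "dot_product")
          |         .tree(num_leaves=1000, num_leaves_to_search=100)
          |         .score_ah(2, anisotropic_quantization_threshold=0.2)
          |         .reorder(100)
          |         .build())
          | 
          |     # Top-10 approximate neighbors for a batch of queries
          |     queries = dataset[:5]
          |     neighbors, distances = searcher.search_batched(queries)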
        
           | thirdtrigger wrote:
           | > At this point vector search is almost 'solved'
           | 
            | Well, there is still a lot to do. Going from memory to disk,
            | etc. But I see your point.
           | 
           | What might be a bit confusing is the mix of vector search
           | libraries with vector search engines. It's a bit like
           | comparing an inverted index library with SOLR :)
           | 
            | But we (i.e., the wider vector search ecosystem) are working
            | on this.
        
           | moab wrote:
           | There are many "extra" features that one needs in practice
           | which are patently not solved, starting with working in a
           | dynamic setting.
        
         | liminal wrote:
          | What sort of scale are you dealing with? ScaNN looks more like
          | a vector index (like FAISS) than a full database.
        
       | thomasahle wrote:
       | http://ann-benchmarks.com/
       | 
       | ScaNN is still doing quite well, but it is definitely not the
       | only game in town.
        
       | txtai wrote:
       | https://github.com/neuml/txtai
       | 
       | txtai can build vector indexes with Faiss/HNSW/Annoy and supports
       | running SQL statements against them. External vector databases
       | can also be plugged in.
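        | 
        | A minimal sketch of what that looks like (assuming txtai's
        | Embeddings API with content storage enabled; the model name and
        | documents are illustrative):
        | 
        |     from txtai.embeddings import Embeddings
        | 
        |     docs = ["vector databases are gaining popularity",
        |             "faiss is an ann library, not a database"]
        | 
        |     # content=True stores the text so SQL can return it
        |     embeddings = Embeddings({
        |         "path": "sentence-transformers/all-MiniLM-L6-v2",
        |         "content": True})
        |     embeddings.index([(uid, text, None)
        |                       for uid, text in enumerate(docs)])
        | 
        |     # SQL with a similar() clause runs a vector query
        |     results = embeddings.search(
        |         "select id, text, score from txtai "
        |         "where similar('vector databases')", 2)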
        
         | metadat wrote:
          | Is there documentation for how to build / train the txtai
          | models from scratch? I looked but haven't found it yet.
         | 
          | p.s. It might be self-evident to some, but Python is missing
          | from the list of supported languages:
          | 
          |     API bindings for JavaScript, Java, Rust and Go
          | 
          | Presumably Python has "native" support :)
         | 
          | p.p.s. This looks awesome, submitted:
         | https://news.ycombinator.com/item?id=32918254
        
           | txtai wrote:
            | Thank you for the kind words!
           | 
           | Best documentation on training models is the trainer pipeline
           | - https://neuml.github.io/txtai/pipeline/train/trainer. There
           | are a lot of great pretrained models available:
           | https://huggingface.co/models?pipeline_tag=sentence-
           | similari...
           | 
            | Funny, on the Python point: I never thought about how that
            | section read. Definitely a good clarification to make.
        
       | fzliu wrote:
       | Great article. Vector databases are still somewhat niche, but
       | there's been rapid growth in interest (I'm a part of the Milvus
       | community). https://db-engines.com also provides a holistic
       | popularity metric (Milvus and Weaviate are on there), but I'm not
       | exactly sure how it's calculated.
       | 
        | One quick note: FAISS is an ANN library rather than a vector
        | database. Vector databases support a multitude of other features
        | that you'd see in a traditional database, such as caching,
        | replication, horizontal scalability, etc. I've also found
       | Elastic's ANN search functionality a bit on the slow side, likely
       | due to its original architecture being focused on more general
       | text search.
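        | 
        | To make that distinction concrete, using FAISS directly is just
        | an in-process index; persistence, replication, and serving are
        | left to you (a minimal sketch with random vectors):
        | 
        |     import numpy as np
        |     import faiss
        | 
        |     d = 128  # embedding dimensionality
        |     xb = np.random.rand(10000, d).astype(np.float32)  # database
        |     xq = np.random.rand(5, d).astype(np.float32)      # queries
        | 
        |     index = faiss.IndexFlatIP(d)  # exact inner-product index
        |     index.add(xb)                 # lives in process memory
        |     distances, ids = index.search(xq, 10)  # top-10 per query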
        
         | thirdtrigger wrote:
         | Agreed with your observations about the difference between the
         | libraries and databases. Maybe for a follow-up article? It's
          | nice to see some independent research on the topic, though.
        
       | tabtab wrote:
       | I keep saying this and it's still applicable. What's really
       | needed is "Dynamic Relational" (
       | https://www.reddit.com/r/Database/comments/qw1erd/are_the_no... )
       | 
        | D.R. allows ad-hoc "schemas" (or equivalent) yet keeps most of
        | the RDBMS idioms, including most of SQL, to reduce the learning
        | curve for those already familiar with RDBMS. Other database
        | categories reinvent everything just to get dynamism (or mostly
        | get it). That's _not_ the shortest path to the goal. It's more
        | logical to tweak only what's needed for dynamism and leave the
        | rest alone (RDBMS-like).
        
         | jandrewrogers wrote:
         | There are deep adverse performance implications for what you
         | are suggesting. The result of trying to dynamically mix and
         | match that much arbitrary structure would likely combine the
         | worst of both worlds. You can essentially do this today with
         | databases like PostgreSQL but there are good reasons no one
         | does. (What you are describing appears to be a thin wrapper on
         | what would conventionally be called a graph database.)
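          | 
          | (For illustration, roughly the kind of thing you can already
          | do today, sketched with a PostgreSQL JSONB column via
          | psycopg2; the table and connection details are hypothetical:)
          | 
          |     import json
          |     import psycopg2
          | 
          |     conn = psycopg2.connect("dbname=example")  # hypothetical
          |     cur = conn.cursor()
          | 
          |     # One JSONB column gives ad-hoc, per-row "schemas" on top
          |     # of an ordinary relational table, queryable with SQL.
          |     cur.execute("CREATE TABLE IF NOT EXISTS items "
          |                 "(id serial PRIMARY KEY, doc jsonb)")
          |     cur.execute("CREATE INDEX IF NOT EXISTS items_doc_idx "
          |                 "ON items USING gin (doc)")
          |     cur.execute("INSERT INTO items (doc) VALUES (%s::jsonb)",
          |                 [json.dumps({"kind": "widget",
          |                              "color": "red"})])
          |     cur.execute("SELECT id, doc FROM items "
          |                 "WHERE doc @> %s::jsonb",
          |                 [json.dumps({"color": "red"})])
          |     print(cur.fetchall())
          |     conn.commit()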
         | 
         | If there is an "obvious" improvement to database capabilities
         | that seems to be mysteriously absent from all competent
         | implementations, one should consider the hypothesis that it
         | would make databases strictly worse across many dimensions
         | people care about. One thing that can be said about the history
         | of database software is that it has tended to exhaustively
         | explore the known phase space of possible implementations.
         | Novelty in database implementations is usually predicated on a
         | material computer science advance at the architectural level;
         | any rearrangement of existing parts and ideas has usually
         | already been tried multiple times.
         | 
         | There are many good ideas for databases that no one implements
         | because we don't know how to make them fast enough. People care
         | greatly about database performance and scalability. You can
         | implement almost any database feature you can imagine if you
          | don't care about performance and scalability; you just won't
         | have any users.
        
           | cmrdporcupine wrote:
            | Ok, all that said (this is what I've heard before), _what_
            | is the actual technical limitation behind "simply" providing
            | alternate columnar storage & index implementations for
            | tables optimized for OLAP, while sharing the surrounding
            | query parser, query execution framework, and potentially
            | even the query planner used elsewhere for OLTP?
           | 
           | Like... Vertica is a fork of Postgres. I'm curious why they
           | chose to fork and implement column-oriented storage and
           | indexes etc. rather than simply add them as _options_ to
           | stock Postgres.
           | 
           | Obviously joins across the two different worlds would be
           | highly problematic, and perhaps query execution, data
           | materialization, query planning etc. could look significantly
           | different.
           | 
           | But the potential advantage to the end user seems high, if
           | moving from "transactional" to "analytical" workloads is a
           | matter of moving data from one table-type to another, within
           | the same underlying database system.
           | 
           | Again, I know there are reasons why this approach has not
           | been successful. I'm curious what they are.
        
             | jandrewrogers wrote:
             | The original OLTP/OLAP dichotomy was based on architectural
             | tradeoffs required for spinning disk and the way indexes
             | worked. Especially at the time Vertica was created, the
             | OLTP-ness of PostgreSQL was essentially hardcoded into the
              | architecture. Vertica made a lot of changes, e.g. to
              | storage behavior, to support their use case, changes that
              | would have significantly impacted OLTP performance. A
              | complicated query that mixes NSM and DSM tables
              | (row-oriented and column-oriented storage, respectively)
              | would be unreasonably complex, since the query building
              | blocks are different depending on the type of storage.
             | 
              | This tradeoff doesn't really need to exist in a SQL
              | database today on modern hardware, with its extremely high
              | storage bandwidth. There are other models that can serve
              | both OLTP and OLAP use cases satisfactorily in a single
              | coherent system with modern internals. The SQL database
              | market is extremely conservative, though, so even if you
              | built it, no one would adopt it for a decade.
        
             | convolvatron wrote:
             | most of it i think is just focus. its alot of work to pull
             | together an OLAP and an OLTP database and they are very
             | different workloads.
             | 
             | the one place where you do start to get into fundamental
             | issues is consistency. the planner can certainly identify
             | read-only transactions and statically remove some
             | conflicts. but whatever scheme you are using (locks, mvcc,
             | optimistic) is going to struggle the more concurrent
             | overlapping transactions there are. since OLAP transactions
             | are very long lived and touch alot of things - they are
             | pretty hostile co-residents with the OLTP traffic.
        
       ___________________________________________________________________
       (page generated 2022-09-20 23:01 UTC)