[HN Gopher] Measuring the popularity of different vector databases
___________________________________________________________________
Measuring the popularity of different vector databases
Author : lmeyerov
Score : 44 points
Date : 2022-09-20 16:48 UTC (6 hours ago)
(HTM) web link (gradientflow.com)
(TXT) w3m dump (gradientflow.com)
| ramoz wrote:
| https://github.com/google-research/google-research/tree/mast...
|
| We use ScaNN for a large scale/performant neural search.
| Otherwise this all feels bloated.
| freediver wrote:
| Another upvote for ScaNN. Used on the Teclis.com search engine
| in production (it replaced Faiss, which was also excellent).
| ScaNN is almost double the performance, with the same ~5 lines
| of code to implement locally.
|
| At this point vector search is almost 'solved'; the challenge
| that remains is the quality of the vectors fed into the index.
| Producing accurate text/image/... vector representations (aka
| embeddings) is where a lot more progress needs to be made.
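The "few lines of code" point is easy to picture: at its core, vector search is just nearest-neighbor lookup over embedding vectors. Below is a minimal exact (brute-force) version in plain NumPy, the baseline that libraries like ScaNN and Faiss approximate in order to stay fast at scale. The array sizes and names here are illustrative, not from any library's API.

```python
import numpy as np

# Exact nearest-neighbor search over normalized embeddings.
# ANN libraries (ScaNN, Faiss, HNSW, ...) approximate this scan at scale.
rng = np.random.default_rng(0)
index = rng.standard_normal((10_000, 128)).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)  # unit vectors

def search(query: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k most similar vectors by cosine similarity."""
    q = query / np.linalg.norm(query)
    scores = index @ q              # for unit vectors, dot product == cosine
    return np.argsort(-scores)[:k]  # top-k by descending similarity

hits = search(index[42])
# the best match for a vector already in the index is the vector itself
```

The quality point in the comment is about what goes *into* `index`: if the embeddings are poor, a perfect nearest-neighbor search still returns poor results.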
| thirdtrigger wrote:
| > At this point vector search is almost 'solved'
|
| Well, there is still a lot to do. Going from memory to
| disk, etc. But I see your point.
|
| What might be a bit confusing is the mix of vector search
| libraries with vector search engines. It's a bit like
| comparing an inverted index library with SOLR :)
|
| But we (i.e., the wider vector search ecosystem) are working
| on this.
| moab wrote:
| There are many "extra" features that one needs in practice
| which are patently not solved, starting with working in a
| dynamic setting.
| liminal wrote:
| What sort of scale are you dealing with? ScaNN looks more like
| a vector index like FAISS, rather than a full database.
| thomasahle wrote:
| http://ann-benchmarks.com/
|
| ScaNN is still doing quite well, but it is definitely not the
| only game in town.
| txtai wrote:
| https://github.com/neuml/txtai
|
| txtai can build vector indexes with Faiss/HNSW/Annoy and supports
| running SQL statements against them. External vector databases
| can also be plugged in.
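The "SQL statements against a vector index" combination can be sketched in miniature. This is a toy illustration, not txtai's actual implementation: metadata filtering happens in SQL (stdlib sqlite3), similarity ranking happens in NumPy, and all table, tag, and vector names are made up.

```python
import sqlite3
import numpy as np

# Toy hybrid query: SQL narrows the candidate set, vectors rank it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs (id INTEGER, tag TEXT)")
con.executemany("INSERT INTO docs VALUES (?, ?)",
                [(0, "news"), (1, "blog"), (2, "news")])

# hypothetical embeddings keyed by row id
vectors = {0: np.array([1.0, 0.0]),
           1: np.array([0.0, 1.0]),
           2: np.array([0.6, 0.8])}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = np.array([1.0, 0.1])
candidates = [row[0] for row in
              con.execute("SELECT id FROM docs WHERE tag = 'news'")]
best = max(candidates, key=lambda i: cosine(vectors[i], query))
```

A real system pushes both halves into one engine; the sketch only shows why the combination is useful: structured predicates prune before the (more expensive) vector ranking runs.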
| metadat wrote:
| Is there documentation for how to build / train the txtai
| models from scratch? I looked but didn't yet find it.
|
| p.s. It might be self-evident to some, but Python is missing
| from the list of supported languages: API
| bindings for JavaScript, Java, Rust and Go
|
| Presumably Python has "native" support :)
|
| p.p.s this looks awesome, submitted:
| https://news.ycombinator.com/item?id=32918254
| txtai wrote:
| Thank you for the kind words!
|
| Best documentation on training models is the trainer pipeline
| - https://neuml.github.io/txtai/pipeline/train/trainer. There
| are a lot of great pretrained models available:
| https://huggingface.co/models?pipeline_tag=sentence-similari...
|
| Funny on Python, never thought of how that section read,
| definitely a good clarification to make.
| fzliu wrote:
| Great article. Vector databases are still somewhat niche, but
| there's been rapid growth in interest (I'm a part of the Milvus
| community). https://db-engines.com also provides a holistic
| popularity metric (Milvus and Weaviate are on there), but I'm not
| exactly sure how it's calculated.
|
| One quick note - FAISS is an ANN library rather than a vector
| database. Vector databases support a multitude of other features
| that you'd see in a traditional database such as caching,
| replication, horizontal scalability, etc. I've also found
| Elastic's ANN search functionality a bit on the slow side, likely
| due to its original architecture being focused on more general
| text search.
| thirdtrigger wrote:
| Agreed with your observations about the difference between the
| libraries and databases. Maybe for a follow-up article? It's
| nice to see some independent research on the topic, though.
| tabtab wrote:
| I keep saying this and it's still applicable. What's really
| needed is "Dynamic Relational" (
| https://www.reddit.com/r/Database/comments/qw1erd/are_the_no... )
|
| D.R. allows ad-hoc "schemas" (or equivalent) yet keeps most of
| the RDBMS idioms, including most of SQL, to reduce the learning
| curve for those already familiar with RDBMS. Other database
| categories reinvent everything just to get dynamism, or to
| mostly get dynamism. That's _not_ the shortest path to the
| goal. It's more logical to tweak just what's needed for
| dynamism and leave the rest alone (RDBMS-like).
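One rough approximation of the "ad-hoc schema inside a relational shell" idea that exists today is a JSON attribute column queried with ordinary SQL. The sketch below uses SQLite's json1 functions (built in to SQLite 3.38+ and most Python builds); the table and attribute names are made up for illustration.

```python
import sqlite3

# Fixed relational shell + an ad-hoc attribute bag per row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, attrs TEXT)")
con.executemany("INSERT INTO items (attrs) VALUES (?)",
                [('{"color": "red", "size": 5}',),
                 ('{"color": "blue", "weight": 2.5}',)])  # no shared schema

# Standard SQL over dynamic attributes via json_extract.
rows = con.execute(
    "SELECT id, json_extract(attrs, '$.color') FROM items "
    "WHERE json_extract(attrs, '$.size') = 5").fetchall()
```

This keeps SQL and the relational model intact while letting each row carry its own attributes; the performance caveats raised in the reply below (no per-attribute indexing or typing unless you add it) are exactly where this approach gets expensive.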
| jandrewrogers wrote:
| There are deep adverse performance implications for what you
| are suggesting. The result of trying to dynamically mix and
| match that much arbitrary structure would likely combine the
| worst of both worlds. You can essentially do this today with
| databases like PostgreSQL but there are good reasons no one
| does. (What you are describing appears to be a thin wrapper on
| what would conventionally be called a graph database.)
|
| If there is an "obvious" improvement to database capabilities
| that seems to be mysteriously absent from all competent
| implementations, one should consider the hypothesis that it
| would make databases strictly worse across many dimensions
| people care about. One thing that can be said about the history
| of database software is that it has tended to exhaustively
| explore the known phase space of possible implementations.
| Novelty in database implementations is usually predicated on a
| material computer science advance at the architectural level;
| any rearrangement of existing parts and ideas has usually
| already been tried multiple times.
|
| There are many good ideas for databases that no one implements
| because we don't know how to make them fast enough. People care
| greatly about database performance and scalability. You can
| implement almost any database feature you can imagine if you
| don't care about performance and scalability, you just won't
| have any users.
| cmrdporcupine wrote:
| Ok, all that said, this is what I've heard before, but _what_
| is the actual technical limitation behind "simply" providing
| alternate columnar storage and index implementations for
| tables optimized for OLAP, while sharing the surrounding query
| parser, query execution framework, and potentially even the
| query planner used elsewhere for OLTP?
|
| Like... Vertica is a fork of Postgres. I'm curious why they
| chose to fork and implement column-oriented storage and
| indexes etc. rather than simply add them as _options_ to
| stock Postgres.
|
| Obviously joins across the two different worlds would be
| highly problematic, and perhaps query execution, data
| materialization, query planning etc. could look significantly
| different.
|
| But the potential advantage to the end user seems high, if
| moving from "transactional" to "analytical" workloads is a
| matter of moving data from one table-type to another, within
| the same underlying database system.
|
| Again, I know there are reasons why this approach has not
| been successful. I'm curious what they are.
| jandrewrogers wrote:
| The original OLTP/OLAP dichotomy was based on architectural
| tradeoffs required for spinning disk and the way indexes
| worked. Especially at the time Vertica was created, the
| OLTP-ness of PostgreSQL was essentially hardcoded into the
| architecture. Vertica made a lot of changes, e.g. to
| storage behavior, to support their use case that would have
| significantly impacted OLTP performance. A complicated
| query that mixes NSM (row-store) and DSM (column-store)
| tables would be unreasonably complex, since the query
| building blocks differ depending on the storage type.
|
| This trade off doesn't really need to exist in a SQL
| database today on modern hardware with its extremely high
| storage bandwidth. There are other models that can satisfy
| both OLTP and OLAP use cases satisfactorily in a single
| coherent system with modern internals. The SQL database
| market is extremely conservative, so even if you built it
| no one would adopt it for a decade.
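The NSM/DSM tradeoff above can be seen in miniature with array memory layouts. This is only an analogy, with NumPy standing in for a storage engine: a row store keeps each record contiguous (good for point reads), a column store keeps each column contiguous (good for analytic scans).

```python
import numpy as np

# The same "table" in two memory layouts.
rng = np.random.default_rng(1)
table = rng.standard_normal((100_000, 20))

row_store = np.ascontiguousarray(table)  # NSM-like: each row contiguous
col_store = np.asfortranarray(table)     # DSM-like: each column contiguous

# OLTP-style point read: fetch one whole record; cheap in the row store.
record = row_store[123]

# OLAP-style aggregate: scan one column; in the column store this is one
# contiguous read instead of a strided walk across every record.
total = col_store[:, 3].sum()
```

Historically the layouts forced two separate systems; the comment's point is that modern storage bandwidth makes a single engine serving both access patterns plausible, even if the market is slow to adopt one.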
| convolvatron wrote:
| Most of it, I think, is just focus. It's a lot of work to
| pull together an OLAP and an OLTP database, and they are very
| different workloads.
|
| The one place where you do start to get into fundamental
| issues is consistency. The planner can certainly identify
| read-only transactions and statically remove some conflicts,
| but whatever scheme you are using (locks, MVCC, optimistic)
| is going to struggle the more concurrent overlapping
| transactions there are. Since OLAP transactions are very
| long-lived and touch a lot of things, they are pretty
| hostile co-residents with the OLTP traffic.
___________________________________________________________________
(page generated 2022-09-20 23:01 UTC)