[HN Gopher] Show HN: Embeddinghub: A vector database built for M...
___________________________________________________________________
Show HN: Embeddinghub: A vector database built for Machine Learning
embeddings
Author : cyrusthegreat
Score : 99 points
Date : 2021-09-16 14:11 UTC (8 hours ago)
(HTM) web link (github.com)
(TXT) w3m dump (github.com)
| andreawangahead wrote:
| keep up the good work!
| nelsondev wrote:
| Cool! Nice work! Do you have any performance numbers you could
| share?
|
| Specifically around nearest-neighbor computation latency, regular
| get-embedding latency, and read/write rates achieved on a single
| machine?
| cyrusthegreat wrote:
| Not yet, this is very much an early release to get it into
| people's hands and to get feedback on the API and the
| functionality. We've purposely held off on optimizing too much
| until we feel more confident that this is useful and that our
| API approach makes sense for people. That said, Simba, one of
| the main devs, comes from a performance-tuning background at
| Google. Also, it's built on HNSWLIB and RocksDB, and is being
| used in real-world workloads today.
| shabbyjoon wrote:
| How is this different from Pinecone, Milvus, and Faiss?
| cyrusthegreat wrote:
| Pinecone is closed source and only available as a SaaS offering.
| We have more overlap with Milvus, but we're focused on the
| embeddings workflow, like versioning and using embeddings
| alongside other features. Milvus is entirely focused on nearest
| neighbor operations.
|
| Faiss is solving the approximate nearest neighbor problem, not
| the storage problem. It's not a database, it's an index. We use a
| lightweight HNSW implementation (HNSWLIB) to index embeddings in
| Embeddinghub.
| gk1 wrote:
| I'm from Pinecone so I can chime in...
|
| The biggest difference, as cyrusthegreat pointed out, is that
| we're a fully managed service. You sign up, spin up a database
| service with a single API call[0], and go from there. There's
| no infrastructure to build and keep available, even as you
| scale to billions of items.
|
| Pinecone also comes with features like metadata filtering[1]
| for better control over results, and hybrid storage for up to
| 10x lower compute costs. EmbeddingHub has a few features
| Pinecone doesn't yet have, like versioning -- though with our
| architecture it's straightforward to add if someone asks.
|
| Hope that helps! And I'm glad to see more projects in this
| space, especially from the feature-store side.
|
| [0] https://www.pinecone.io/docs/api/operation/create_index/
|
| [1] https://www.youtube.com/watch?v=r5CsJ_S9_w4
| deploy wrote:
| This looks awesome - psyched to try! Embeddings are a bitch, nice
| to see some new tools for managing them :)
| cyrusthegreat wrote:
| Thanks for the kind words! We'd love to get your feedback as we
| iterate. Please join our slack community:
| https://join.slack.com/t/featureform-community/shared_invite...
| barefeg wrote:
| Where can I find documentation on versioning? My first use case
| would be to version different embeddings and use it more like a
| storage backend than to search for KNN. Would it be possible to
| not create the NN graph and just use it for versioned storage? We
| currently use opendistro and it nicely allows doing pre and post
| filtering based on other document fields (other than the
| embedding). Therefore I think this could never be a full
| replacement without figuring out how to combine the rest of the
| document structure
| cyrusthegreat wrote:
| Hey! We're actually polishing up a PR that'll add documentation
| and finalize the versioning API; it should be merged this
| weekend. Would you be up for a quick chat with someone on our
| team? It would be interesting to get your feedback and see what
| else we're missing to be a drop-in replacement for opendistro.
| If so, join our Slack and we'll DM you :)
| https://join.slack.com/t/featureform-community/shared_invite...
| cyrusthegreat wrote:
| Hi everyone!
|
| Over the years, I've found myself building hacky solutions to
| serve and manage my embeddings. I'm excited to share
| Embeddinghub, an open-source vector database for ML embeddings.
| It is built with four goals in mind:
|
| Store embeddings durably and with high availability
|
| Allow for approximate nearest neighbor operations
|
| Enable other operations like partitioning, sub-indices, and
| averaging
|
| Manage versioning, access control, and rollbacks painlessly
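[Editor's note: the first two goals, durable storage plus nearest-neighbor lookup, can be pictured with a minimal in-process sketch. This is a toy illustration, not Embeddinghub's actual API (which builds on RocksDB and HNSWLIB); all names here are hypothetical, and exact brute-force search stands in for an ANN index.]

```python
import json, math, os, tempfile

class TinyEmbeddingStore:
    """Toy stand-in for an embedding store: persistent puts plus
    exact nearest-neighbor search. A real system would use an ANN
    index such as HNSW instead of a brute-force scan."""

    def __init__(self, path):
        self.path = path
        self.vectors = {}
        if os.path.exists(path):
            with open(path) as f:
                self.vectors = json.load(f)

    def set(self, key, vec):
        self.vectors[key] = vec
        with open(self.path, "w") as f:  # naive durability: rewrite on every set
            json.dump(self.vectors, f)

    def get(self, key):
        return self.vectors[key]

    def nearest_neighbors(self, key, num):
        q = self.vectors[key]
        dist = lambda v: math.sqrt(sum((a - b) ** 2 for a, b in zip(q, v)))
        ranked = sorted((k for k in self.vectors if k != key),
                        key=lambda k: dist(self.vectors[k]))
        return ranked[:num]

store = TinyEmbeddingStore(os.path.join(tempfile.mkdtemp(), "store.json"))
store.set("apple",  [1.0, 0.0, 0.0])
store.set("orange", [1.0, 1.0, 0.0])
store.set("car",    [0.0, 0.0, 1.0])
print(store.nearest_neighbors("apple", 1))  # ['orange']
```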
|
| It's still in the early stages, and before committing more dev
| time to it, we wanted to get your feedback. Let us know what you
| think and what you'd like to see!
|
| Repo: https://github.com/featureform/embeddinghub
|
| Docs: https://docs.featureform.com/
|
| What's an Embedding? The Definitive Guide to Embeddings:
| https://www.featureform.com/post/the-definitive-guide-to-emb...
| [deleted]
| JPKab wrote:
| Holy shit, this looks amazing!
|
| I see you've got examples for NLP use cases in your docs. Can't
| wait to read them. Embeddings are a constant source of
| complexity when I'm trying to move certain operations to
| Lambda, this looks like it would speed the initializations up
| big time.
| cyrusthegreat wrote:
| We're so glad to hear that! We'd love your feedback as we
| keep building. Please join our community on Slack:
| https://join.slack.com/t/featureform-community/shared_invite...
| localhost wrote:
| Curious about how your solution is different / better than
| nmslib which I've tried in the past?
| cyrusthegreat wrote:
| We actually use HNSWLIB, from the NMSLIB authors, on the
| backend. NMSLIB is solving the approximate nearest neighbor
| problem, not the storage problem. It's not a database, it's an
| index. We handle everything needed to turn their index into a
| full-fledged database with a data science workflow around it
| (versioning, monitoring, etc.).
| localhost wrote:
| That's great. I've been very impressed by the performance
| of nmslib in my scenarios. I'll definitely check out
| Embeddinghub - thanks for sharing!
| ypcx wrote:
| In the "Definitive Guide to Embeddings", in the figure "An
| illustration of One Hot Encoding", the "One Hot Encoding" table
| doesn't make any sense whatsoever. Am I wrong?
| make3 wrote:
| no you're right ahahah wth are these
| cyrusthegreat wrote:
| You are both right. I just realized this and would be
| embarrassed if I wasn't laughing so hard. I gave an
| original drawing to our designer with the correct values
| and we didn't inspect their final image. We'll get this
| fixed, thanks for pointing this out and sorry for the
| confusion :)
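[Editor's note: for readers wondering what the figure should have shown, one-hot encoding maps each distinct category to a vector that is all zeros except for a single 1. A minimal sketch:]

```python
def one_hot(categories):
    """Map each distinct category to a vector with a single 1.

    Categories are sorted so the mapping is deterministic."""
    index = {c: i for i, c in enumerate(sorted(set(categories)))}
    dim = len(index)
    return {c: [1 if i == index[c] else 0 for i in range(dim)]
            for c in index}

encoding = one_hot(["red", "green", "blue"])
print(encoding["blue"])  # [1, 0, 0]  (sorted order: blue, green, red)
```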
| kevin948 wrote:
| This is really great! It speaks very much to my use-case
| (building user embeddings and serving them both to analysts +
| other ML models).
|
| I was wondering if there was a reasonable way to store raw data
| next to the embeddings such that:
|
| 1. Analysts can run queries to filter down to a space they
| understand (the raw data).
|
| 2. Nearest neighbors can be run on top of their selection on the
| embedding space.
|
| Our main use case is segmentation, so giving analysts access to
| the raw feature space is very important.
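[Editor's note: the workflow described above can be sketched independently of any particular database: pre-filter on raw fields, then run nearest neighbors over the surviving embeddings. Brute-force search is used here for clarity, and all record and function names are hypothetical.]

```python
import math

# Each record carries raw fields (for analysts) plus an embedding.
users = [
    {"id": "u1", "country": "US", "age": 34, "emb": [0.9, 0.1]},
    {"id": "u2", "country": "US", "age": 29, "emb": [0.8, 0.2]},
    {"id": "u3", "country": "DE", "age": 41, "emb": [0.1, 0.9]},
    {"id": "u4", "country": "US", "age": 52, "emb": [0.2, 0.8]},
]

def filtered_neighbors(query_emb, predicate, num):
    """Pre-filter on raw fields, then rank the remainder by distance."""
    pool = [u for u in users if predicate(u)]
    pool.sort(key=lambda u: math.dist(query_emb, u["emb"]))
    return [u["id"] for u in pool[:num]]

# Analysts filter to a segment they understand, then search within it.
print(filtered_neighbors([1.0, 0.0], lambda u: u["country"] == "US", 2))
# ['u1', 'u2']
```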
| cyrusthegreat wrote:
| This is in the works! We'd love your feedback on the API, and to
| learn a bit more about your use case so we build the right
| thing. Mind joining our Slack?
| https://join.slack.com/t/featureform-community/shared_invite...
| tourist_on_road wrote:
| Great work! Looks like you are using HNSWLIB. From what I
| understand, the HNSW graph-based approach can be memory intensive
| compared to the PQ-code-based approach. FAISS has support for
| both HNSW and PQ codes. Any plans on extending your work to
| support a PQ-code-based index in the future?
| cyrusthegreat wrote:
| Yes! We plan to bring Faiss in and utilize a lot of its
| functionality. Our goal for this release was to get an end-to-
| end system working so we could get feedback on the API. HNSW was
| a good default with this in mind.
| jamesblonde wrote:
| How does it compare to the OpenDistro for Elastic KNN plugin
| - which also uses HNSW (and also includes scalable storage,
| high availability, backups, and filtering)?
| cyrusthegreat wrote:
| Our API is built from the ground up with the machine
| learning workflow in mind. For example, we have a training
| API that allows you to batch requests and even download
| your embeddings and generate an HNSW index locally. Our
| view of versioning, rollbacks, and more makes a lot of
| sense for an ML index, but very little sense for a search
| index.
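[Editor's note: the "download your embeddings and build an index locally" idea can be sketched generically. This is not Embeddinghub's training API; the batching scheme and all names are illustrative, and a brute-force local index stands in for HNSW.]

```python
import math

def download_in_batches(server_embeddings, batch_size):
    """Simulate batched download of (key, vector) pairs from a remote store."""
    items = list(server_embeddings.items())
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

class LocalIndex:
    """Local stand-in for an HNSW index built from downloaded embeddings."""
    def __init__(self):
        self.vectors = {}

    def add_batch(self, batch):
        self.vectors.update(batch)

    def nearest(self, query, num):
        keys = sorted(self.vectors,
                      key=lambda k: math.dist(query, self.vectors[k]))
        return keys[:num]

remote = {"a": [0.0, 0.0], "b": [1.0, 0.0], "c": [0.0, 1.0], "d": [1.0, 1.0]}
index = LocalIndex()
for batch in download_in_batches(remote, batch_size=2):
    index.add_batch(dict(batch))
print(index.nearest([0.9, 0.2], 2))  # ['b', 'd']
```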
| sathergate wrote:
| which search algorithm does it use?
| cyrusthegreat wrote:
| We use HNSW internally via HNSWLIB, it's the same algorithm
| that Facebook uses to power their embedding search.
| sathergate wrote:
| thanks! how did you make the decision to use hnsw over faiss
| and other search algorithms?
| cyrusthegreat wrote:
| Faiss also includes an HNSW implementation; HNSWLIB is just a
| lighter-weight implementation, which allowed us to iterate
| faster. In the future we will swap it out for Faiss to take
| advantage of its full array of functionality.
| planetsprite wrote:
| What makes this different from something like gensim? They have
| vector search for doc2vec embeddings.
| cyrusthegreat wrote:
| Gensim is great for generating certain types of embeddings, but
| not for operationalizing them. It doesn't do approximate
| nearest neighbor lookup which is a deal breaker for most models
| that use embeddings at scale. It also do not manage versioning
| so you end up having to hack a workflow around it to manage
| embedding. Finally, it's not really data infrastructure like
| this is, so you end up doing hacky things like copying all your
| embeddings to every docker file. With regards to serving
| embeddings, gensim is just a library that supports in-memory
| brute force nearest neighbour look ups.
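[Editor's note: the contrast drawn above, brute force vs. approximate search, comes down to scan cost: exact search scores the query against every stored vector, which is O(n) per lookup. A minimal brute-force cosine-similarity lookup, roughly what an in-memory library does; the toy vectors are illustrative.]

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def brute_force_most_similar(query, vectors, num):
    """O(n) exact search: score the query against every stored vector."""
    ranked = sorted(vectors, key=lambda k: cosine(query, vectors[k]),
                    reverse=True)
    return ranked[:num]

vectors = {"king": [0.9, 0.8], "queen": [0.85, 0.9], "banana": [0.1, -0.7]}
print(brute_force_most_similar([0.85, 0.9], vectors, 1))  # ['queen']
```

An ANN index like HNSW trades a small amount of recall for sub-linear lookups, which is what makes embedding search viable at scale.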
___________________________________________________________________
(page generated 2021-09-16 23:02 UTC)