[HN Gopher] Show HN: Embeddinghub: A vector database built for M...
       ___________________________________________________________________
        
       Show HN: Embeddinghub: A vector database built for Machine Learning
       embeddings
        
       Author : cyrusthegreat
       Score  : 99 points
       Date   : 2021-09-16 14:11 UTC (8 hours ago)
        
 (HTM) web link (github.com)
 (TXT) w3m dump (github.com)
        
       | andreawangahead wrote:
        | Keep up the good work!
        
       | nelsondev wrote:
       | Cool! Nice work! Do you have any performance numbers you could
       | share?
       | 
        | Specifically around nearest neighbor computation latency,
        | regular get-embedding latency, and the read/write rates
        | achieved on a single machine?
        
         | cyrusthegreat wrote:
         | Not yet, this is very much an early release to get it in
         | people's hands and to get feedback on the API and the
         | functionality. We've purposely held off optimizing too much
         | until we feel more confident that this is useful and our API
          | approach makes sense for people. That said, Simba, one of the
          | main devs, actually comes from a performance-tuning background
          | at Google. Also, it's built on HNSWLIB and RocksDB, and is
          | being used in real-world workloads today.
        
       | shabbyjoon wrote:
       | How is this different from Pinecone, Milvus, and Faiss?
        
         | cyrusthegreat wrote:
          | Pinecone is closed source and only available as a SaaS
          | offering. We have more overlap with Milvus, but we're focused
          | on the embeddings workflow, like versioning and using
          | embeddings alongside other features, while Milvus is entirely
          | focused on nearest neighbor operations.
         | 
          | Faiss is solving the approximate nearest neighbor problem, not
          | the storage problem. It's not a database; it's an index. We
          | use a lighter-weight HNSW implementation (HNSWLIB) to index
          | embeddings in Embeddinghub.
        
         | gk1 wrote:
         | I'm from Pinecone so I can chime in...
         | 
         | The biggest difference, as cyrusthegreat pointed out, is that
         | we're a fully managed service. You sign up, spin up a database
         | service with a single API call[0], and go from there. There's
         | no infrastructure to build and keep available, even as you
         | scale to billions of items.
         | 
         | Pinecone also comes with features like metadata filtering[1]
         | for better control over results, and hybrid storage for up to
         | 10x lower compute costs. EmbeddingHub has a few features
         | Pinecone doesn't yet have, like versioning -- though with our
         | architecture it's straightforward to add if someone asks.
         | 
         | Hope that helps! And I'm glad to see more projects in this
         | space, especially from the feature-store side.
         | 
         | [0] https://www.pinecone.io/docs/api/operation/create_index/
         | 
         | [1] https://www.youtube.com/watch?v=r5CsJ_S9_w4
        
       | deploy wrote:
       | This looks awesome - psyched to try! Embeddings are a bitch, nice
       | to see some new tools for managing them :)
        
         | cyrusthegreat wrote:
         | Thanks for the kind words! We'd love to get your feedback as we
         | iterate. Please join our slack community:
         | https://join.slack.com/t/featureform-community/shared_invite...
        
       | barefeg wrote:
        | Where can I find documentation on versioning? My first use case
        | would be to version different embeddings and use it more as a
        | storage backend than for KNN search. Would it be possible to
        | skip creating the NN graph and just use it for versioned
        | storage? We currently use opendistro, which nicely allows pre-
        | and post-filtering based on other document fields (beyond the
        | embedding), so I think this could never be a full replacement
        | without figuring out how to combine the rest of the document
        | structure.
        
         | cyrusthegreat wrote:
          | Hey! We're actually polishing up a PR that'll add
          | documentation and finalize the versioning API; it should be
          | merged this weekend. Would you be up for a quick chat with
          | someone on our team? It would be interesting to get your
          | feedback and see what else we're missing to be a drop-in
          | replacement for opendistro. Join our slack if so and we'll dm
          | you :)
         | https://join.slack.com/t/featureform-community/shared_invite...
        
       | cyrusthegreat wrote:
       | Hi everyone!
       | 
       | Over the years, I've found myself building hacky solutions to
       | serve and manage my embeddings. I'm excited to share
       | Embeddinghub, an open-source vector database for ML embeddings.
       | It is built with four goals in mind:
       | 
       | Store embeddings durably and with high availability
       | 
       | Allow for approximate nearest neighbor operations
       | 
       | Enable other operations like partitioning, sub-indices, and
       | averaging
       | 
       | Manage versioning, access control, and rollbacks painlessly
       | 
        | It's still in the early stages, and before we commit more dev
        | time to it, we want to get your feedback. Let us know what you
       | think and what you'd like to see!
       | 
       | Repo: https://github.com/featureform/embeddinghub
       | 
       | Docs: https://docs.featureform.com/
       | 
       | What's an Embedding? The Definitive Guide to Embeddings:
       | https://www.featureform.com/post/the-definitive-guide-to-emb...
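        | 
        | If you want a feel for the API before diving in, here's a rough
        | sketch along the lines of the repo's quickstart (the README has
        | the canonical version, so treat the exact names as
        | approximate):
        | 
        |   import embeddinghub as eh
        | 
        |   # Connect to a running Embeddinghub instance.
        |   hub = eh.connect(eh.Config())
        | 
        |   # A "space" is a named collection of embeddings.
        |   space = hub.create_space("quickstart", dims=3)
        | 
        |   # Write a batch of embeddings, then query neighbors.
        |   space.multiset({
        |       "apple":  [1, 0, 0],
        |       "orange": [1, 1, 0],
        |       "potato": [0, 1, 0],
        |   })
        |   neighbors = space.nearest_neighbors(key="apple", num=2)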
        
         | [deleted]
        
         | JPKab wrote:
         | Holy shit, this looks amazing!
         | 
         | I see you've got examples for NLP use cases in your docs. Can't
         | wait to read them. Embeddings are a constant source of
         | complexity when I'm trying to move certain operations to
          | Lambda; this looks like it would speed the initializations up
         | big time.
        
           | cyrusthegreat wrote:
           | We're so glad to hear that! We'd love your feedback as we
           | keep building. Please join our community on Slack:
            | https://join.slack.com/t/featureform-community/shared_invite...
        
         | localhost wrote:
          | Curious how your solution is different from / better than
          | nmslib, which I've tried in the past?
        
           | cyrusthegreat wrote:
            | We actually use HNSWLIB, from the NMSLIB authors, on the
            | backend. NMSLIB solves the approximate nearest neighbor
            | problem, not the storage problem. It's not a database; it's
            | an index. We handle everything needed to turn their index
            | into a full-fledged database with a data science workflow
            | around it (versioning, monitoring, etc.).
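            | 
            | To make that concrete, here's a minimal sketch of using
            | HNSWLIB directly. Everything beyond building and querying
            | the in-memory index (durability, replication, versioning)
            | is what a database layer has to add:
            | 
            |   import hnswlib
            |   import numpy as np
            | 
            |   dim = 16
            |   index = hnswlib.Index(space="cosine", dim=dim)
            |   index.init_index(max_elements=10000,
            |                    ef_construction=200, M=16)
            | 
            |   # Build and query entirely in memory.
            |   vectors = np.random.rand(1000, dim).astype(np.float32)
            |   index.add_items(vectors, np.arange(1000))
            |   labels, distances = index.knn_query(vectors[:1], k=5)
            | 
            |   # Persistence is just a single snapshot file:
            |   index.save_index("my_index.bin")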
        
             | localhost wrote:
             | That's great. I've been very impressed by the performance
             | of nmslib in my scenarios. I'll definitely check out eh -
             | thanks for sharing!
        
         | ypcx wrote:
         | In the "Definitive Guide to Embeddings", in the figure "An
         | illustration of One Hot Encoding", the "One Hot Encoding" table
         | doesn't make any sense whatsoever. Am I wrong?
        
           | make3 wrote:
           | no you're right ahahah wth are these
        
             | cyrusthegreat wrote:
             | You are both right. I just realized this and would be
             | embarrassed if I wasn't laughing so hard. I gave an
             | original drawing to our designer with the correct values
             | and we didn't inspect their final image. We'll get this
             | fixed, thanks for pointing this out and sorry for the
             | confusion :)
        
       | kevin948 wrote:
       | This is really great! It speaks very much to my use-case
       | (building user embeddings and serving them both to analysts +
       | other ML models).
       | 
       | I was wondering if there was a reasonable way to store raw data
       | next to the embeddings such that: 1. Analysts can run queries to
       | filter down to a space they understand (the raw data). 2. Nearest
        | neighbors can be run on top of their selection in the embedding
       | space.
       | 
       | Our main use case is segmentation, so giving analysts access to
       | the raw feature space is very important.
        
         | cyrusthegreat wrote:
          | This is in the works! We'd love your feedback on the API, and
          | to learn a bit more about your use-case so we build the right
          | thing. Mind joining our slack?
          | https://join.slack.com/t/featureform-community/shared_invite...
        
       | tourist_on_road wrote:
        | Great work! Looks like you are using HNSWLIB. From what I
        | understand, the HNSW graph-based approach can be memory
        | intensive compared to the PQ-code-based approach. FAISS supports
        | both HNSW and PQ codes. Any plans to extend your work to support
        | a PQ-code-based index in the future?
        
         | cyrusthegreat wrote:
          | Yes! We plan to bring Faiss in and utilize a lot of its
          | functionality. Our goal for this release was to get an end-to-
          | end version working so we could get feedback on the API; HNSW
          | was a good default with that in mind.
        
           | jamesblonde wrote:
           | How does it compare to the OpenDistro for Elastic KNN plugin
           | - which also uses HNSW (and also includes scalable storage,
           | high availability, backups, and filtering)?
        
             | cyrusthegreat wrote:
             | Our API is built from the ground up with the machine
             | learning workflow in mind. For example, we have a training
             | API that allows you to batch requests and even download
             | your embeddings and generate an HNSW index locally. Our
             | view of versioning, rollbacks, and more makes a lot of
             | sense for an ML index, but very little sense for a search
             | index.
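              | 
              | As a sketch of the "generate an HNSW index locally" part
              | (the batch download is stubbed out with random data
              | here; in practice it would come through the training
              | API):
              | 
              |   import hnswlib
              |   import numpy as np
              | 
              |   # Stand-ins for a batch download of embeddings:
              |   ids = np.arange(1000)
              |   vecs = np.random.rand(1000, 64).astype(np.float32)
              | 
              |   index = hnswlib.Index(space="cosine", dim=64)
              |   index.init_index(max_elements=len(ids),
              |                    ef_construction=200, M=16)
              |   index.add_items(vecs, ids)
              |   labels, dists = index.knn_query(vecs[:1], k=10)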
        
       | sathergate wrote:
       | which search algorithm does it use?
        
         | cyrusthegreat wrote:
          | We use HNSW internally via HNSWLIB; it's the same algorithm
          | that Facebook uses to power their embedding search.
        
           | sathergate wrote:
           | thanks! how did you make the decision to use hnsw over faiss
           | and other search algorithms?
        
             | cyrusthegreat wrote:
              | Faiss actually also uses HNSW internally; HNSWLIB is just
              | a lighter-weight implementation, which allowed us to
              | iterate faster. In the future we'll swap it out for Faiss
              | to take advantage of its full array of functionality.
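              | 
              | For reference, a minimal example of Faiss's own HNSW
              | index (this is stock Faiss usage, not Embeddinghub
              | code):
              | 
              |   import numpy as np
              |   import faiss
              | 
              |   d = 64
              |   xb = np.random.random((1000, d)).astype("float32")
              | 
              |   # 32 = max neighbors per node in the HNSW graph
              |   index = faiss.IndexHNSWFlat(d, 32)
              |   index.add(xb)
              |   distances, ids = index.search(xb[:1], 5)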
        
       | planetsprite wrote:
       | What makes this different from something like gensim? They have
       | vector search for doc2vec embeddings.
        
         | cyrusthegreat wrote:
          | Gensim is great for generating certain types of embeddings,
          | but not for operationalizing them. It doesn't do approximate
          | nearest neighbor lookup, which is a deal breaker for most
          | models that use embeddings at scale. It also doesn't manage
          | versioning, so you end up hacking a workflow around it to
          | manage embeddings. Finally, it's not really data
          | infrastructure like this is, so you end up doing hacky things
          | like copying all your embeddings into every Docker image. With
          | regards to serving embeddings, gensim is just a library that
          | supports in-memory, brute-force nearest neighbor lookups.
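          | 
          | For comparison, that in-memory, brute-force lookup in gensim
          | looks like this (toy corpus just to keep it self-contained):
          | 
          |   from gensim.models import Word2Vec
          | 
          |   sentences = [
          |       ["vector", "databases", "store", "embeddings"],
          |       ["gensim", "trains", "embeddings", "in", "memory"],
          |   ]
          |   model = Word2Vec(sentences, vector_size=16, min_count=1)
          | 
          |   # Exact scan over every vector, no ANN index involved.
          |   print(model.wv.most_similar("embeddings", topn=3))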
        
       ___________________________________________________________________
       (page generated 2021-09-16 23:02 UTC)