[HN Gopher] Find anything fast with Google's vector search techn...
___________________________________________________________________
Find anything fast with Google's vector search technology
Author : sshroot
Score : 198 points
Date : 2021-12-14 18:07 UTC (4 hours ago)
(HTM) web link (cloud.google.com)
(TXT) w3m dump (cloud.google.com)
| ahurmazda wrote:
| For similar ANN/vector search capabilities, https://vespa.ai/
| is a great open-source solution. Elasticsearch may offer some
| form of ANN too, but I need to double-check.
| sanxiyn wrote:
| I don't think Elasticsearch has one yet, but OpenSearch does:
| https://opensearch.org/docs/latest/search-plugins/knn/index/
| m_ke wrote:
| Lucene 9.0 just shipped with hnsw support, should make it
| into ES at some point
| (https://twitter.com/msokolov/status/1468395332531003393)
|
| EDIT: ES integration PR:
| https://github.com/elastic/elasticsearch/issues/78473
| ahurmazda wrote:
| Ah! Great to know. ANN searches are becoming table stakes at
| this point. Hopefully, we will see more and more platforms
| adding it to their repertoire.
| ShamelessC wrote:
| This gh repo makes it pretty easy to create similar tech by first
| embedding any images you have using the released "CLIP" model
| from Open AI and then creating a Faiss index over these embeds
| for quick retrieval/decode. You can then do text->image, and
| image->image semantic search.
|
| https://github.com/rom1504/clip-retrieval
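|
| A minimal sketch of the retrieval step described above, assuming the
| embeddings already exist. It uses a brute-force cosine scan in plain
| Python rather than a Faiss index, and the 3-d vectors and file names
| are made up for illustration (real CLIP embeddings have hundreds of
| dimensions).

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(index, query, k=2):
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(index.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

# Hypothetical toy "embeddings"; in practice these come from the model.
index = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.0, 0.2, 0.9],
}
text_query = [1.0, 0.2, 0.0]  # stand-in for an embedded text query
print(search(index, text_query))
```

| Because text and images share one embedding space, the same search
| function serves text->image and image->image queries. An ANN index
| replaces the full scan once the collection gets large.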
| thirdtrigger wrote:
| Interesting - we are working on an open source vector search
| engine called Weaviate and did the same for the complete
| Wikipedia and Wikidata.
|
| [1] Docs:
| https://www.semi.technology/developers/weaviate/current/
|
| [2] Github: https://github.com/semi-technologies/weaviate
|
| [3] Wikipedia demo dataset:
| https://github.com/semi-technologies/semantic-search-through...
|
| [4] Wikidata dataset:
| https://github.com/semi-technologies/biggraph-wikidata-searc...
|
| Last week there was also a feature on Techcrunch about vector
| search and Weaviate: https://techcrunch.com/2021/12/11/2246180/
| mravl wrote:
| Real eye-opener. This will change the search industry completely.
| detaro wrote:
| why?
| thirdtrigger wrote:
| That's a fair question - but I'm going to assume the open-source
| nature is what is meant here.
| CShorten wrote:
| I've made some videos on Weaviate as well (Henry AI Labs) if
| interested:
|
| [1] Wikipedia Vector Search Demo with Weaviate:
| https://www.youtube.com/watch?v=IGB8vjCuay0
|
| [2] Vector Search through Wikidata with Weaviate:
| https://www.youtube.com/watch?v=T4zlvknSbGc
|
| [3] Demonstrations of Deep Learning:
| https://www.youtube.com/watch?v=5jbneytoKi0
|
| [4] Weaviate's GraphQL API for Neurosymbolic Search:
| https://www.youtube.com/watch?v=K_2X48Tln9U
|
| [5] Introducing the Weaviate Vector Search Engine:
| https://www.youtube.com/watch?v=AS_2U_INpKk
| thirdtrigger wrote:
| These are all great!
|
| There is also this video about modern search engines and
| Weaviate on the AI Coffee Break YT channel:
| https://www.youtube.com/watch?v=YkK5IKgxp-c
| gk1 wrote:
| It's great to see more and more talk of vector search and vector
| databases. We've been promoting this technology for over a year
| now and have several intro articles for anyone looking to learn
| more[1], and a generous free tier on our vector search service[2]
| for anyone looking to give vector search a shot.
|
| [1] https://www.pinecone.io/learn/
|
| [2] https://app.pinecone.io/
|
| We are also actively researching the space, and just recently
| published a paper on improving Google's ScaNN:
| https://arxiv.org/abs/2112.02179
| wswope wrote:
| That reference/learning page is a great resource!
|
| As for Pinecone itself, what are the main selling points as you
| see them for a simple application (e.g. comparing trigram-
| vectorized sets of strings) when compared to a home-rolled
| solution using postgres with array types? Better performance,
| ease of indexing, etc.?
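|
| For context, a small sketch of what comparing trigram sets of
| strings could look like. The padding scheme and the choice of
| Jaccard similarity are assumptions for illustration, not a
| description of any particular product or extension.

```python
def trigrams(text):
    """Set of 3-character substrings of the lowercased, padded string."""
    padded = f"  {text.lower()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

def jaccard(a, b):
    """Share of trigrams the two strings have in common (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)

print(jaccard("vector search", "vector serach"))  # typo still scores well
print(jaccard("vector search", "keyword index"))  # unrelated strings score low
```

| This kind of comparison captures spelling overlap only. It does not
| capture meaning, which is where dense embeddings differ.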
| gk1 wrote:
| I pinged someone more technical from our team to chime in.
|
| In the meantime I can say moving to the dense vector + ANN
| search combo turns regular searches into semantic searches,
| which means more relevant results.
|
| If that's the case for you, then you can use Pinecone to go
| further and make those results _fast_ (<100ms), _fresh_
| (CRUD + live index updates), and _filtered_ (apply
| single-stage metadata filtering). All on a fully managed system
| that you can scale up/down with one API call.
| indeed30 wrote:
| Does Pinecone have any position on the status of document
| embeddings and whether they would be considered PII? One of the
| challenges of using a fully managed service is the headache of
| adding yet another data subprocessor and all of the legal and
| compliance questions that raises.
| dvaun wrote:
| I've been toying with making a deckbuilder for Magic: The
| Gathering and could see this being potentially useful for
| finding fun card combinations. Thanks!
| kruptos wrote:
| I love this idea. I would pay for that service!
| gk1 wrote:
| That would be a fun use case for us to promote. Let me know
| when it's ready! The free plan supports as many as 1 million
| items, more than enough for all the MTG cards in existence.
| Plus you can add and filter by metadata, like card type and
| properties.
| dvaun wrote:
| > Plus you can add and filter by metadata, like card type
| and properties.
|
| I read through your docs and figure that will be part of
| the approach.
|
| An idea I had was to find similar, or "next best", cards
| for replacement in popular decks or to achieve similar
| effects in order to bring down the cost of EDH, Modern,
| etc. formats. I'm just getting back into the hobby again,
| so having a tool like this would make my wife and wallet
| happy :)
| 16mb wrote:
| I've resorted to playing Modern with high-quality fakes.
| Otherwise I wouldn't have the budget. Check out bootlegmtg
| on reddit.
| thirdtrigger wrote:
| We are actually discussing this on the Weaviate Slack :-)
| https://weaviate.slack.com/archives/D02JM9D3HND/p16347312830...
| amelius wrote:
| Does anyone know of a good benchmark suite for search technology?
|
| (And how well does the technique of the article work wrt it?)
| CoolGuySteve wrote:
| Is this more or less a k-d tree as a service? Where any distance
| function can be used to index the data?
|
| Or is it something different?
| hamilyon2 wrote:
| I thought k-d trees were useless in the high-dimensional case. So,
| it must be something else.
| contravariant wrote:
| I'd say they're about as useful as euclidean distance is.
| ahurmazda wrote:
| More or less but as always the devil is in the detail. Here is
| a paper[1] that summarizes issues with naive approaches.
| Incidentally, the proposed solution (Hierarchical NSW) in this
| paper performed fairly well in the industry benchmarks.
|
| [1] https://arxiv.org/ftp/arxiv/papers/1603/1603.09320.pdf
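|
| A heavily simplified sketch of the graph-walking idea behind
| navigable small world search: start somewhere and keep hopping to
| whichever neighbour is closest to the query. This is a single-layer
| greedy walk on hand-made toy data. It omits the layer hierarchy and
| the candidate queue that the paper describes, so treat it as an
| illustration of the intuition only.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_search(graph, vectors, entry, query):
    """Walk the graph, always moving to the neighbour closest to the query."""
    current = entry
    while True:
        best = min(graph[current], key=lambda n: sq_dist(vectors[n], query))
        if sq_dist(vectors[best], query) >= sq_dist(vectors[current], query):
            return current  # no neighbour is closer: a (possibly local) optimum
        current = best

# Hypothetical toy data: five 1-d points linked in a chain 0-1-2-3-4.
vectors = {0: [0.0], 1: [1.0], 2: [2.0], 3: [3.0], 4: [4.0]}
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(greedy_search(graph, vectors, entry=0, query=[3.2]))
```

| The walk only inspects nodes along its path, which is why the
| result is approximate: on a poorly connected graph it can stop at a
| local optimum instead of the true nearest neighbour.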
| monkeybutton wrote:
| A k-d tree gives you exact answers to nearest neighbour
| queries.
| srean wrote:
| A k-d tree is a data structure. Whether you use that for
| exact nearest neighbor query or approximate is up to the
| algorithm used. K-d trees work well for a handful of
| dimensions; beyond that they become quite expensive.
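|
| To make the data-structure-versus-algorithm point concrete, here is
| a minimal k-d tree with an exact nearest-neighbour query. The tuple
| layout and function names are illustrative choices. An approximate
| variant would use the same tree but limit how much backtracking the
| query is allowed to do.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build(points, depth=0):
    """Build a k-d tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Exact nearest neighbour: descend first, backtrack only when needed."""
    if node is None:
        return best
    point, left, right = node
    if best is None or sq_dist(point, target) < sq_dist(best, target):
        best = point
    axis = depth % len(target)
    diff = target[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, target, depth + 1, best)
    # The far side can only hold a closer point if the splitting plane
    # is nearer to the target than the best match found so far.
    if diff ** 2 < sq_dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(points)
print(nearest(tree, (9, 2)))
```

| In high dimensions the pruning test rarely succeeds, so the query
| ends up visiting most of the tree. That is the cost referred to
| above.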
| freediver wrote:
| I built multiple systems using vector search, one of them demoed
| in a search engine for non-commercial content at
| http://teclis.com
|
| Running vector search (also sometimes referred to as semantic
| search, or a part of semantic search stack) is a trivial matter
| with open-source libraries like Faiss
| https://github.com/facebookresearch/faiss
|
| It takes 5 minutes to set up. You can search a billion vectors on
| common hardware. For low-latency (up to a couple of hundred
| milliseconds) use cases, it is highly unlikely that any cloud
| solution like this would be a better choice than something
| deployed on premise because of the network overhead.
|
| (worth noting is that there are about two dozen vector search
| libraries, all benchmarked at http://ann-benchmarks.com/ and most
| of them open-source)
|
| A much more interesting (and harder) problem is creating good
| vectors to begin with. This refers to the process of converting a
| text or an image to a multidimensional vector, usually done by a
| machine learning model such as BERT (for text) or a network
| trained on ImageNet (for images).
|
| Try entering a query like 'gpt3' or '2019' into the news search
| demo linked in Google's PR:
|
| https://matchit.magellanic-clouds.com/
|
| The results are nonsensical. Not because the vector search didn't
| do its job well, but because generated vectors were suboptimal to
| begin with. Having good vectors is 99% of the semantic search
| problem.
|
| A nice demo of what semantic search can do is Google's Talk to
| Books https://books.google.com/talktobooks/
|
| This area of research is fascinating. For those who want to play
| with this more, an interesting end-to-end (including both vector
| generation and search) open-source solution is Haystack
| https://github.com/deepset-ai/haystack
| noud wrote:
| I just made a couple of searches with teclis. I have to say,
| it's not bad. It's clearly not complete and I get several empty
| searches. But the content of the results is of higher quality
| than what I get with Google or DDG. Nice work!
| freediver wrote:
| Thanks. The index is tiny and it is just a proof of concept
| of what a single person can do with technologies available
| nowadays. I felt it is better for it to return zero results
| than bad results.
| gk1 wrote:
| > The index is tiny
|
| What was the largest index you've had on Faiss? That seems
| to affect whether people think of it as more than adequate
| or terribly inadequate.
| freediver wrote:
| This demo is only about a million vectors. The largest I
| had in Faiss was embeddings of the entire Wikipedia
| (scale in the neighborhood of ~30 million vectors). I
| know people running a few billion vectors in Faiss.
| leobg wrote:
| So one vector per article? Doesn't this skew results? A
| short article with 0.9 relevance score would rank higher
| than a long article containing one paragraph with 1.0
| relevance. Am I mistaken?
| leobg wrote:
| Also, BERT on cheap hardware? I thought that without a
| GPU, vectorizing millions of snippets or doing sub-second
| queries was basically out of the question.
| petra wrote:
| It's a good experience, for sure better than Google.
|
| But I get a 1/5 - 1/10 hit ratio (successful/empty searches).
| That's not habit-forming or memory-forming for me.
|
| Is there a core use case where I would get a good hit ratio?
| freediver wrote:
| As the site says this demo is by no means meant as a
| replacement for Google, but rather to complement it. I
| would say Teclis is good for content discovery and
| learning new things outside the typical search engine
| filter bubble. A few examples of good queries are listed
| on the site.
|
| A similar concept was shown here recently:
| https://search.marginalia.nu
| mrg3_2013 wrote:
| Thanks for the reference to haystack. I didn't know it existed!
| I was looking into Hugging Face, which seems to let you build and
| train your own language model (still learning - but that's what
| I've learnt so far). I don't know how expensive these get
| (for example, if you have 100K lines?). Any thoughts on how
| this compares to Hugging Face, and any anecdotes on the time it
| would take to custom train?
| hiddencost wrote:
| I recommend readers take parent post with a grain of salt.
|
| (1) Google's offering returns in <5ms, in my experience.
| (2) the demo is for paragraphs, not short text. You're putting
| mismatched data into the input, of course it's not going to
| work. Try a paragraph as suggested.
| freediver wrote:
| Hmm.. There is no web service that returns a response in <5ms,
| unless you are sitting at the very terminal of the hardware
| producing the output.
|
| The demo featured in this PR takes about 800-1000ms total to
| produce search results. How much of that is the actual API is
| not known. Typically an https request to an API in the cloud
| will cost you at least 50ms of network latency, more likely
| 100ms-200ms. If you are running vector search on premise you
| will obviously not have this overhead.
|
| Text embeddings typically work for short text as well as
| paragraphs (paragraph embeddings are usually mean/max of word
| embeddings anyway) simply because most commercial use cases
| demand handling of short text input (because nobody is
| inputting a paragraph into a typical search box; what use is
| a news search if you can not type a single word like 'Biden'
| or 'gpt3' into it).
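|
| A small sketch of the mean/max pooling mentioned above, assuming a
| lookup table of per-word vectors is already available. The
| two-dimensional vectors below are invented for illustration.

```python
def mean_pool(word_vectors):
    """Average the word vectors dimension by dimension."""
    dims = len(word_vectors[0])
    return [sum(v[d] for v in word_vectors) / len(word_vectors) for d in range(dims)]

def max_pool(word_vectors):
    """Take the per-dimension maximum over the word vectors."""
    dims = len(word_vectors[0])
    return [max(v[d] for v in word_vectors) for d in range(dims)]

# Hypothetical word embeddings.
embeddings = {"vector": [1.0, 2.0], "search": [3.0, 4.0]}

def embed(text, pool=mean_pool):
    """Embed a text of any length by pooling its known word vectors."""
    vectors = [embeddings[w] for w in text.lower().split() if w in embeddings]
    return pool(vectors)

print(embed("vector search"))  # a multi-word text
print(embed("search"))         # a single word uses the same code path
```

| With pooling, a one-word query and a full paragraph both map to a
| vector of the same size, which is what lets short inputs be
| compared against longer documents.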
| hiddencost wrote:
| It's a cloud offering, so the machines are located near
| each other. Using it with other cloud services is a fair
| comparison to running it in the same box on-prem.
|
| The offering is similarity search, not a search engine.
| They offer image to image as another comparison point.
| freediver wrote:
| With the caveat of having to use GCP to host your server
| too, I can agree with you (although 5ms still sounds
| incredibly low, how many vectors was that?).
|
| I was obviously talking about a general use case where a
| user considers using an API like this vs running Faiss,
| and their server can be anywhere (a use case that is more
| common to me personally).
| leobg wrote:
| So how are you creating the embeddings for your search engine?
| GloVe? Sentence BERT? Are you training your own models? Are you
| employing any kind of normalization? There are so many
| variables to optimize on many levels. Which is, of course, what
| makes this whole area super exciting.
| 1vuio0pswjnm7 wrote:
| Really like the idea of teclis, i.e., a non-commercial search
| engine. Is it correct that teclis is HTTP-only (via port 1333)
| and TLS is not an option? (NB. I am not suggesting there is
| anything wrong with HTTP. I am simply curious if TLS is
| available.)
| freediver wrote:
| There is nothing wrong with using HTTP if the data
| transferred is not sensitive like in this case for demo
| purposes (and if anything, it is also faster for the user).
| dontreact wrote:
| There are two huge things your 5 minute setup is missing which
| are technically very hard to tackle:
|
| 1. Incrementally updating the search space. Not that easy to
| do, and becomes more important to not just do the dumb thing of
| retraining the entire index on every update for larger
| datasets.
|
| 2. Combining vector search and some database-like search in an
| efficient manner. I don't know if this Google post really
| solves that problem or if they just do the vector lookup
| followed by a parallelized linear scan, but this is still an
| open research/unsolved problem.
| freediver wrote:
| Correct, that would take more than 5 minutes, although still
| possible to do with Faiss (and not that hard relatively
| speaking - in the Teclis demo, I indeed did your second point
| - combine results with a keyword search engine and there are
| many simple solutions you can use out there like Meilisearch,
| Sonic, etc.). If you were to try using an external API for
| vector search, you would still need to build keyword based
| search separately (and then combining/ranking logic) so then
| you may be better off just building the entire stack anyway.
|
| Anyway, for me, the number one priority was latency and it is
| hard to beat on-premise search for that.
|
| Even then, a vector search API is just one component you will
| need in your stack. You need to pick the right model, create
| vectors (GPU intensive), then possibly combine search results
| with keyword based search to improve accuracy etc. I am still
| waiting to see an end-to-end API doing all this.
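|
| One possible shape for the combining/ranking logic, sketched with
| reciprocal rank fusion. This is a commonly used merging technique
| swapped in for illustration, not a description of what Teclis does.
| The document ids are made up and the constant 60 is a conventional
| default that is assumed here.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists; items ranked high in many lists win."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))

vector_hits = ["d1", "d2", "d3"]   # hypothetical output of the vector index
keyword_hits = ["d3", "d1", "d4"]  # hypothetical output of the keyword engine
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

| Rank-based fusion avoids having to put vector distances and
| keyword scores on a common scale, at the price of ignoring how
| large the score gaps were.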
| dontreact wrote:
| Interesting. Did you also tackle the incremental update
| problem with FAISS?
| freediver wrote:
| No, I didn't have a need for it in this demo (but it is
| certainly possible with Faiss).
| gk1 wrote:
| Exactly right. Things like data freshness (live index
| updates), CRUD operations, metadata filtering, and horizontal
| scaling are all "extras" that don't come with Faiss. Hence
| the need for solutions like Google Matching Engine and
| Pinecone.io.
|
| And even if you do just want ANN and nothing else, some
| people just want to make API calls to a live service and not
| worry about anything else.
| moab wrote:
| Can you expand more or provide a concrete example for the
| second point? What kind of database-like searches are you
| thinking about for spatial data? Things like range-queries
| can already be (approximately) done. Or are you thinking
| about relational style queries on data associated with each
| point?
| dontreact wrote:
| Yes exactly, relational style queries with each data point.
| Maybe you have some metadata about your images and maybe
| you need to join against another table to properly query
| them. But at the same time you want to only grab the first
| k nearest neighbors according to vector similarity.
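|
| A toy sketch of why this combination is awkward, comparing two
| naive strategies: filter on metadata and then rank, versus take the
| top k and then filter. The records and field names are
| hypothetical, and a plain linear scan stands in for a real index.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def prefilter_knn(items, query, k, predicate):
    """Apply the metadata filter first, then rank only the survivors."""
    candidates = [it for it in items if predicate(it)]
    candidates.sort(key=lambda it: sq_dist(it["vec"], query))
    return [it["id"] for it in candidates[:k]]

def postfilter_knn(items, query, k, predicate):
    """Take the k nearest first, then drop those failing the filter."""
    top = sorted(items, key=lambda it: sq_dist(it["vec"], query))[:k]
    return [it["id"] for it in top if predicate(it)]

items = [
    {"id": "a", "type": "land", "vec": [0.0, 0.0]},
    {"id": "b", "type": "land", "vec": [0.1, 0.0]},
    {"id": "c", "type": "creature", "vec": [0.5, 0.0]},
    {"id": "d", "type": "creature", "vec": [0.9, 0.0]},
]

def is_creature(item):
    return item["type"] == "creature"

print(prefilter_knn(items, [0.0, 0.0], 2, is_creature))
print(postfilter_knn(items, [0.0, 0.0], 2, is_creature))
```

| Post-filtering can return fewer than k results (here none), while
| pre-filtering gives the right answer but cannot reuse an ANN index
| built over the whole collection. Doing both efficiently in one
| pass is the hard part.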
| gk1 wrote:
| Pinecone does this, at least if I'm understanding your
| use case right:
| https://www.pinecone.io/docs/metadata-filtering/
|
| And you're right, it wasn't easy to build.
| etiennedi wrote:
| Spot on! Both of those were motivating factors when building
| Weaviate (Open Source Vector Search Engine). We really wanted
| it to feel like a full database or search engine. You should
| be able to do anything you would do with Elasticsearch, etc.
| There should be no waiting time between creating an object
| and searching. Incremental updates and deletes are supported,
| etc.
|
| On your second point about efficient filtering, check out
| this article I wrote outlining how filtered vector search
| works in Weaviate:
| https://towardsdatascience.com/effects-of-filtered-hnsw-sear...
|
| For even more details on filtering, check the documentation:
| https://www.semi.technology/developers/weaviate/current/arch...
| monkeybutton wrote:
| If you are interested in how ScaNN compares to other
| approximation algorithms, there are some benchmarks here:
| http://ann-benchmarks.com/
| 323 wrote:
| People say google search is terrible these days, but I find the
| opposite.
|
| I can vaguely describe in a sentence the gist of an article I've
| read, or an image, and the proper result will usually be in the
| first page.
|
| Of course, it doesn't always work, sometimes there are "hash
| collisions" so to speak, but I don't think the old algorithm
| would have been more successful either, since if I knew the
| exact keywords to use, I wouldn't need to start with a vague
| description in the first place.
| [deleted]
| tux3 wrote:
| I'd like to join the chorus disappointed in Google search
| results.
|
| It seems to do more correction for you, which is great if
| you're searching for common popular things. But any uncommon or
| precise query will often be misunderstood as something else.
|
| Plenty of times, no matter how I reword my sentence or what
| sort of analogies I try to give it, I've had Google fail to
| give me something that I know exists and that I have to find
| some other way.
| tehjoker wrote:
| seems like it would be simple to reintroduce advanced search
| but these guys are monomaniacal
| Jemaclus wrote:
| I've literally gone to Google and typed something very similar
| to "That guy in that thing with the dog" and the correct answer
| shows up as the first result. It's quite brilliant and magical
| how they do that.
|
| But sometimes it's a total miss when I want something very
| specific, and it just shows me other things I didn't ask for.
| authed wrote:
| I have more trouble finding exactly what I'm looking for using
| Google. For me it started going downhill when they removed the
| plus operator (and no, quotes don't work the same).
|
| Also, Yandex is much better for reverse image search (similar
| images).
| est31 wrote:
| Yeah I remember that in the past, maybe 5+ years ago, I used
| to phrase things in a certain way to please their algorithm.
| This is not needed any more. Sometimes it doesn't grasp
| concepts and returns false results. But this has become rarer
| as well.
| dqv wrote:
| >I can vaguely describe in a sentence the gist of an article
| I've read, or an image, and the proper result will usually be
| in the first page.
|
| For the specific context of "I can find something I've already
| found", yes, it's useful. I just wish there was a way to change
| that context to "discovery mode" where it uses a different
| algorithm that is oriented toward finding new information. I
| want to find sites in the spirit of those old-fashioned sites
| that are minimally styled with dense information. And not just
| Wikipedia or a few "trusted" sources like it used to be in
| earlier times, but a more well-rounded result set.
| idealmedtech wrote:
| > sites in the spirit of those old-fashioned sites
|
| I think the problem is that such sites are very difficult to
| find algorithmically, especially when it comes to their poor
| SEO. The reason they used to be so prevalent in the early
| 2000s search results is because that's _mostly_ what the web
| was back then; a bunch of personal websites, blogs, etc.
|
| To do that nowadays would require heavy (manual) curation,
| which obviously Google isn't interested in.
| joe_the_user wrote:
| I don't think it's entirely true that the current Google
| results are merely a matter of decent sites being hard to
| find. I'm pretty sure that two years ago, Google found a
| greater portion of "real content" pages than it finds today
| and SEO was already huge then. And Google does a
| significant amount of human testing right now.
|
| It's especially noticeable to me that just in the last few
| months, Google has changed their algorithm so some product
| will be the first item on even the most generic search.
| nullc wrote:
| > For the specific context of "I can find something I've
| already found",
|
| I find it to be utterly terrible for that too.. even when I
| have verbatim strings from the thing I'm looking for it often
| simply doesn't show up. ... often because it rewrites the
| query into something about Kim Kardashian's butt and no
| amount of quotes or pluses will make it stop.
| joe_the_user wrote:
| I disagree about the quality of Google search but I should note
| this has nothing to do with the utility of Google's vector
| search library, which is just one low level part of the process
| of creating a final Google and I'd expect the technical quality
| here to be excellent by default.
|
| Whether one likes or hates current Google search results, their
| qualities and the changes from early search processes are
| clearly intentional and don't relate to how well Google does
| raw indexing.
| porker wrote:
| > I can vaguely describe in a sentence the gist of an article
| I've read, or an image, and the proper result will usually be
| in the first page.
|
| I can never do that. What I remember is so far from the wording
| used or how Google identifies the image that it never comes up.
| I end up scrolling back through my history trying to do it from
| the page title or domain
| nightpool wrote:
| Note that this article is careful to never say that this
| "vector search" technology powers the classic Google Search.
| This sort of automated classification space is probably _part_
| of Google's general search algorithm, but it's probably a very
| small part. Youtube recommendations (based on description +
| thumbnail + potentially video content?) and Google Image Search
| are the two in-practice examples that it focuses on.
| makeset wrote:
| If you don't know what you're looking for, it works well. If
| you know exactly what you want, specific keywords etc., good
| luck because it's gone from the mildly condescending "Did you
| mean..." corrections to "Yeah I am going to ignore all that, I
| bet what you really want to find is some ads."
| bitcharmer wrote:
| I have the exact opposite experience. Search results are
| nowadays ridden with crap from companies that learned how to
| game SEO.
|
| If not that, you get scam websites or other ad/malware infested
| trash.
| radicaldreamer wrote:
| Wish there was an easy way to remove forbes.com results
| greybeardgeek wrote:
| you can append -site:forbes.* to your query string
| jiveturkey wrote:
| https://news.ycombinator.com/item?id=29546433
| Kydlaw wrote:
| There is a lot being done in vector search technology right now. I
| was less fortunate when looking at vector storage. I already looked
| at Pinecone and Weaviate but they are all paid products.
|
| Does anyone have feedback on this?
| Xenoamorphous wrote:
| ElasticSearch supports vectors (dense ones, they supported
| sparse ones at some point but they removed support I think),
| and has things like cosine similarity functions built in.
|
| Not sure how "free" it is though.
| thirdtrigger wrote:
| Not true - Weaviate is open source:
| https://github.com/semi-technologies/weaviate
| gk1 wrote:
| Pinecone has a free tier that's quite generous, fits around 1M
| items and will fit even more soon.
|
| Not sure this helps but just mentioning in case.
| tomcooks wrote:
| I would be happy with "find anything with Google search"
| eliseumds wrote:
| It exists:
| https://developers.google.com/custom-search/v1/overview
| Might not be cheap for search-as-you-type, though.
| sanxiyn wrote:
| GP is complaining Google's search quality is poor, not
| inquiring about Google's search API.
| tomc1985 wrote:
| So... fuzzy logic
|
| Everything old is new again! Again!
| [deleted]
| dorianmariefr wrote:
| Will probably be available as Postgres extension at some point.
| Seems like only special indexing of vectors is needed
| Hokusai wrote:
| It's not very good. I tried different pictures and the results
| are almost random.
|
| A picture from a cartoon returns from logos to any type of
| drawing. A picture of a battery returns cars and shops. A picture
| of food worked as expected and I got more food pictures.
| eob wrote:
| My 2022 wish list is a Postgres plugin that adds vector + AKNN
| support that plays well with relational queries. There are so
| many use cases of that.
|
| I believe Ant Financial has published an open source one but iirc
| the English language documentation is sparse.
| ccleve wrote:
| Do you have a link? I'd like to see it.
|
| I googled and did not find much pertaining to "ant financial"
| and "postgres". Perhaps your google-fu is better than mine...
| etiennedi wrote:
| Check out the open source vector search engine Weaviate:
| https://github.com/semi-technologies/weaviate
|
| It's not a relational db, but it supports Graph-like
| connections between objects, which makes it really easy to
| model your relations.
| thirdtrigger wrote:
| Jup - this is an example from the demo dataset in the docs:
| https://link.semi.technology/3DPcphe
| akane wrote:
| Check out pgvector: https://github.com/ankane/pgvector
| (disclosure: am author)
|
| It uses IVFFlat indexing, but could be extended to support
| product quantization / ScaNN.
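|
| For readers unfamiliar with the term, a rough sketch of the
| inverted-file (IVF) idea in plain Python. It is not pgvector's
| implementation: the centroids are hard-coded here instead of being
| learned by clustering, and the names `build_ivf`, `ivf_search` and
| `probes` are illustrative.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign every vector id to the list of its nearest centroid."""
    lists = {i: [] for i in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest_c = min(range(len(centroids)), key=lambda i: sq_dist(centroids[i], vec))
        lists[nearest_c].append(vid)
    return lists

def ivf_search(vectors, centroids, lists, query, k=1, probes=1):
    """Scan only the lists of the `probes` centroids closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(centroids[i], query))
    candidates = [vid for c in order[:probes] for vid in lists[c]]
    candidates.sort(key=lambda vid: sq_dist(vectors[vid], query))
    return candidates[:k]

# Hypothetical toy data: two clusters and one point between them.
centroids = [[0.0, 0.0], [10.0, 10.0]]
vectors = {"a": [1.0, 1.0], "b": [0.0, 2.0], "c": [9.0, 9.0], "d": [4.9, 4.9]}
lists = build_ivf(vectors, centroids)

print(ivf_search(vectors, centroids, lists, [5.5, 5.5], probes=1))
print(ivf_search(vectors, centroids, lists, [5.5, 5.5], probes=2))
```

| With one probe the search misses "d", the true nearest neighbour,
| because it sits in the other list. Probing more lists trades speed
| for recall.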
___________________________________________________________________
(page generated 2021-12-14 23:00 UTC)