[HN Gopher] Vectors are over, hashes are the future
___________________________________________________________________
Vectors are over, hashes are the future
Author : jsilvers
Score : 98 points
Date : 2022-10-07 16:59 UTC (6 hours ago)
(HTM) web link (www.algolia.com)
(TXT) w3m dump (www.algolia.com)
| nelsondev wrote:
| Seems the author is proposing LSH instead of vectors for doing
| ANN?
|
| There are benchmarks here, http://ann-benchmarks.com/ , but LSH
| underperforms the state of the art ANN algorithms like HNSW on
| recall/throughput.
|
| LSH I believe was state of the art 10ish years ago, but has since
| been surpassed. Although the caching aspect is really nice.
| kvathupo wrote:
| To elaborate on Noe's comment, the article is suggesting the
| use of LSH where the hashing function is learned by a neural
| network such that similar vectors correspond to similar hashes
| via Hamming weight (whilst enforcing some load factor). In
| effect, a good hash is generated by a neural network. It
| appears Elastiknn a priori chooses the hash function? Not sure,
| not my area of knowledge.
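|
| A minimal sketch of the comparison step (numpy only; the random
| matrix below merely stands in for the learned projection weights):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     W = rng.normal(size=(768, 256))  # stand-in for a learned projection
|
|     def neural_hash(v):
|         # 1 bit per output unit: the sign of the projected embedding
|         return (v @ W > 0).astype(np.uint8)
|
|     def hamming(a, b):
|         # similar inputs should land on codes with a small Hamming distance
|         return int(np.count_nonzero(a != b))
|
|     v = rng.normal(size=768)
|     near = v + 0.01 * rng.normal(size=768)   # near-duplicate vector
|     far = rng.normal(size=768)               # unrelated vector
|     print(hamming(neural_hash(v), neural_hash(near)))  # small
|     print(hamming(neural_hash(v), neural_hash(far)))   # around 128 of 256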
|
| This approach seems feasible tbh. For example, a stock's
| historical bids/asks probably don't deviate greatly from month
| to month. That said, the generation of a good hash is dependent
| on the stock ticker, and a human doesn't have the time to find
| a good one for every stock at scale.
| Noe2097 wrote:
| LSH is a _technique_, whose performance largely depends
| on the hashing function and on how this function enables
| neighborhood exploration.
|
| It might not be trendy, but that doesn't mean it can't work as
| well as or better than HNSW. It all depends on the hashing
| function you come up with.
| a-dub wrote:
| when combined with minhashing it approximates jaccard
| similarity, so it seems it would be bounded by that.
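|
| a toy minhash estimator (plain python, nothing google-specific)
| makes that bound concrete:
|
|     import hashlib
|
|     def minhash_signature(tokens, num_hashes=128):
|         # one minimum per seeded hash function
|         return [min(int(hashlib.sha1(f"{seed}:{t}".encode()).hexdigest(), 16)
|                     for t in tokens)
|                 for seed in range(num_hashes)]
|
|     def estimated_jaccard(sig_a, sig_b):
|         # fraction of matching minima estimates |intersection| / |union|
|         return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
|
|     a = set("the quick brown fox jumps over the lazy dog".split())
|     b = set("the quick brown fox leaps over a lazy dog".split())
|     print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))
|     # prints roughly 0.7, the true jaccard similarity of the two sets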
| a-dub wrote:
| 10? no, it's more like 20+. lsh was a core piece of the google
| crawler. it was used for high performance fuzzy deduplication.
|
| see ullman's text: mining massive datasets. it's free on the
| web.
| johanvts wrote:
| I think LSH was only introduced in '99 by Indyk et al. I
| would say it was a pretty active research area 10 years ago.
| a-dub wrote:
| right, but massive scale production use in the google
| crawler to index the entire internet when that was at the
| bleeding edge was state of the art before the art was even
| really recognized as an art.
|
| i don't even think they called it ANN. it was high
| performance, scalable deduplication. (which is, in fact,
| just fast/scalable lossy clustering)
|
| collaborative filtering was kind of a cute joke at the
| time. meanwhile they had lsh, in production, actually
| deduplicating the internet.
| molodec wrote:
| It is true that HNSW outperforms LSH on recall and throughput,
| but for some use cases LSH outperforms HNSW. I just deployed
| to prod this week a new system for short-text streaming
| clustering using LSH. I used algorithms from this crate, which I
| also built: https://github.com/serega/gaoya
|
| An HNSW index is slow to construct, so it is best suited for
| search or recommendation engines where you build the index and
| then serve it. For workloads where you continuously mutate the
| index, like streaming clustering/deduplication, LSH outperforms
| HNSW.
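|
| A rough sketch of why inserts stay cheap with a banded LSH index
| (generic Python, not the gaoya API):
|
|     from collections import defaultdict
|
|     class BandedLSHIndex:
|         # signatures are split into bands; each band hashes to a bucket.
|         # inserting a new item only touches its own buckets, so the index
|         # can be mutated continuously, unlike an HNSW graph.
|         def __init__(self, num_bands=16, rows_per_band=8):
|             self.num_bands, self.rows = num_bands, rows_per_band
|             self.buckets = defaultdict(set)
|
|         def _keys(self, sig):
|             for b in range(self.num_bands):
|                 band = tuple(sig[b * self.rows:(b + 1) * self.rows])
|                 yield (b, hash(band))
|
|         def insert(self, item_id, sig):
|             for key in self._keys(sig):
|                 self.buckets[key].add(item_id)
|
|         def candidates(self, sig):
|             # items sharing at least one band are near-duplicate candidates
|             out = set()
|             for key in self._keys(sig):
|                 out |= self.buckets[key]
|             return out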
| whycombinetor wrote:
| The article's 0.65 vs 0.66 float64 example doesn't indicate much,
| since neither 0.65 nor 0.66 has a terminating representation in
| base 2...
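|
| A quick check with Python's decimal module shows the doubles that
| actually get stored:
|
|     from decimal import Decimal
|
|     print(Decimal(0.65))  # 0.65000000000000002220446...
|     print(Decimal(0.66))  # 0.66000000000000003108624...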
| whatever1 wrote:
| Omg NN "research" is just heuristics on top of heuristics on top
| of mumbo jumbo.
|
| Hopefully someone who knows math will enter the field one day and
| build the theoretical basis for all this mess and allow us to
| make real progress.
| auraham wrote:
| Old post of Yann LeCun [1]:
|
| > But another important goal is inventing new methods, new
| techniques, and yes, new tricks. In the history of science and
| technology, the engineering artifacts have almost always
| preceded the theoretical understanding: the lens and the
| telescope preceded optics theory, the steam engine preceded
| thermodynamics, the airplane preceded flight aerodynamics,
| radio and data communication preceded information theory, the
| computer preceded computer science.
|
| [1]
| https://www.reddit.com/r/MachineLearning/comments/7i1uer/n_y...
| sramam wrote:
| (I know nothing about the area.)
|
| Am I incorrect in thinking we are headed to future AIs that jump
| to conclusions? Or is it just my "human neural hash" being
| triggered in error?!
| [deleted]
| mrkeen wrote:
| > The analogy here would be the choice between a 1 second flight
| to somewhere random in the suburb of your choosing in any city in
| the world versus a 10 hour trip putting you at the exact house
| you wanted in the city of your choice.
|
| Wouldn't the first part of the analogy actually be:
|
| A 1 second flight that will probably land at your exact
| destination, but could potentially land you anywhere on earth?
| steve76 wrote:
| olliej wrote:
| So my interpretation of the neural hash approach is largely that
| it is essentially trading a much larger number of very small
| "neurons" vs a smaller number of floats. Given that I'd be
| curious about what the total size difference is.
|
| I could see the hash approach, at a functional level, resulting
| in different features essentially getting a different number of
| bits directly, which would be approximately equivalent to having
| a NN with variable-precision floats, all in a very hand-wavy way.
|
| E.g. we could say a NN/NH needs N bits of information to work
| accurately, in which case you're trading the format of, and
| operations on, those N bits.
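|
| As a back-of-the-envelope size comparison (made-up but typical
| numbers: a 768-dim float32 embedding vs a 256-bit hash code):
|
|     dims, bytes_per_float = 768, 4
|     hash_bits = 256
|
|     vector_bytes = dims * bytes_per_float  # 3072 bytes per item
|     hash_bytes = hash_bits // 8            # 32 bytes per item
|     print(vector_bytes // hash_bytes)      # 96x smaller per item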
| gk1 wrote:
| This is a rehash (pardon me) of this post from 2021:
| https://www.search.io/blog/vectors-versus-hashes
|
| The demand for vector embedding models (like those released by
| OpenAI, Cohere, HuggingFace, etc) and vector databases (like
| https://pinecone.io -- disclosure: I work there) has only grown
| since then. The market has decided that vectors are not, in fact,
| over.
| packetlost wrote:
| Pinecone seems interesting. Is the storage backend open source?
| I've been working on a persistent hashmap database that's
| somewhat similar (albeit not done) and should have lower RAM
| requirements than Bitcask (i.e. larger-than-RAM keysets).
| fzliu wrote:
| Hashes are fine, but to say that "vectors are over" is just plain
| nonsense. We continue to see vectors as a core part of production
| systems for entity representation and recommendation (example:
| https://slack.engineering/recommend-api) and within models
| themselves (example: multimodal and diffusion models). For folks
| into metrics, we're building a vector database specifically for
| storing, indexing, and searching across massive quantities of
| vectors (https://github.com/milvus-io/milvus), and we've seen
| close to exponential growth in terms of total downloads.
|
| Vectors are just getting started.
| PaulHoule wrote:
| Frequently people use vectors as a hash. It's a bit like a
| fashionista declaring clothes obsolete.
| kvathupo wrote:
| Click-bait title aside : ^ ), I'd agree. Neural hashes seem to
| be a promising advancement imo, but I question their impact on
| the convergence time of AI models. In the pecking order of
| neural network bottlenecks, I'd imagine it's not terribly
| expensive to access training data from some database. Rather,
| hardware considerations for improving parallelism seem to be
| the biggest hurdle [1].
|
| [1] - https://www.nvidia.com/en-us/data-center/nvlink/
| jurschreuder wrote:
| For searching on faces I also needed to find vectors in a
| database.
|
| I used random projection hashing to increase the search speed,
| because you can just match directly (or at least narrow down the
| search) instead of calculating the Euclidean distance for each
| row.
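|
| Roughly, the trick looks like this (an illustrative sketch; the
| dimensions and counts are made up):
|
|     import numpy as np
|     from collections import defaultdict
|
|     rng = np.random.default_rng(1)
|     planes = rng.normal(size=(128, 8))  # 8 random hyperplanes, 128-dim vectors
|
|     def bucket_key(v):
|         # one bit per hyperplane; nearby vectors tend to share the key
|         return tuple(int(x) for x in (v @ planes > 0))
|
|     index = defaultdict(list)
|     for row_id, v in enumerate(rng.normal(size=(10_000, 128))):
|         index[bucket_key(v)].append((row_id, v))
|
|     def search(query):
|         # Euclidean distance is computed only inside the matching bucket
|         candidates = index.get(bucket_key(query), [])
|         return min(candidates,
|                    key=lambda rv: np.linalg.norm(rv[1] - query),
|                    default=None)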
| gauddasa wrote:
| True. The title is just clickbait and what we find inside is
| suggestions for dimensionality reduction by a person who
| appears to be on the verge of reinventing autoencoders
| disguised as neural hashes. Is it a mere coincidence that the
| article fails to mention autoencoders?
| aaaaaaaaaaab wrote:
| Pfhew, I thought you wanted to ditch std::vector for hash maps!
| PLenz wrote:
| Hashes are just short, constrained membership vectors
| robotresearcher wrote:
| A state vector can represent a point in the state space of
| floating-point representation, a point in the state space of a
| hash function, or any other discrete space.
|
| Vectors didn't go anywhere. The article is discussing which
| function to use to interpret a vector.
|
| Is there a special meaning of 'vector' here that I am missing? Is
| it so synonymous in the ML context with 'multidimensional
| floating point state space descriptor' that any other use is not
| a vector any more?
| Firmwarrior wrote:
| The title probably makes a lot more sense in the context of
| where it was originally posted
|
| I was as confused and annoyed as you were, though, since I
| don't have a machine learning background
| cratermoon wrote:
| And then there's this:
| https://news.ycombinator.com/item?id=33125640
___________________________________________________________________
(page generated 2022-10-07 23:00 UTC)