[HN Gopher] Introduction to Locality-Sensitive Hashing
___________________________________________________________________
Introduction to Locality-Sensitive Hashing
Author : polm23
Score : 233 points
Date : 2021-06-24 05:51 UTC (17 hours ago)
(HTM) web link (tylerneylon.com)
(TXT) w3m dump (tylerneylon.com)
| dgan wrote:
| Wow, I just discovered LSH a couple of hours ago, before
| opening HN... only to see it covered here
| tux3 wrote:
| Crackpot intuition: Knowing very little about either LSH or
| Lattices, could something like this be used to speed up the
| Shortest Vector Problem and attack lattice-based post-quantum
| cryptography?
|
| Edit: Googling finds a paper from 2015
| (https://eprint.iacr.org/2014/744.pdf). Can't tell how relevant
| this is or how far off the mark I am, but either way this ain't
| novel. Back to humbly minding my own business =)
| KMag wrote:
| A few observations: we don't yet know how to perform long-term
| qubit storage, so all multi-day storage at present is classical
| bits.
|
| If the proposal is to hash all of the possible lattice points
| using irreversible computation on Earth in a reasonable amount
| of time, for a cryptographically useful lattice, the waste heat
| would literally boil the oceans. That limitation doesn't hold
| for quantum computers, but results need to be converted into
| classical bits for storage. Presumably rainbow-table-like
| operations could reduce the storage space and waste heat
| necessary in storing these bits. Maybe quantum rainbow tables
| are a fruitful area of research.
|
| If we're not pre-hashing all possible lattice points, then
| either we hash the public value and we have some heuristic that
| allows us to hash only some small subset of lattice points for
| this particular value (... but then why use a hash? ... just
| compare points in your heuristic sequence) or we have some
| inverse hash operation where we hash the public value, then
| explore the "un-hash space" of lattice points corresponding to
| the LSH for the point in question.
|
| This last general line of inquiry (trying to find search-space
| reducing heuristics) is probably where most of the research
| effort is going, though perhaps not specifically using LSH. In
| order to be useful for the lattice problem, the distance
| metrics for the input and output spaces of the LSH need to have
| specific relations to the lattice distance metric.
|
| So... yea... maybe it helps to re-cast the problem in terms of
| LSH, but not in any obvious way. Also, any search-space
| reducing heuristic could probably be re-interpreted as a LSH if
| you squint at it hard enough.
| ganzuul wrote:
| Replying in kind: if it optimizes memory layout, then it
| speeds up computation. However, if you only reduce the
| hardness from 5 bn years to 1 bn years, it's notable but
| merely amusing.
|
| They discuss distance metrics, which may be sensitive to
| inductive bias, which may perhaps be substantiated by external
| knowledge about the specific instance of the problem. Then
| things are not amusing anymore.
| alexfromapex wrote:
| This is actually an amazing idea. I wonder if this would help
| with authenticating real online identities and distinguishing
| them from bots? It obviously can be spoofed, but it's
| definitely a useful tool in the toolkit.
| tgb wrote:
| Well written article. Does this get used at all in
| bioinformatics, maybe for de novo sequence construction from
| short reads or something? Seems like it would have many
| applications in that field.
|
| Edit: I see now that user inciampati already addressed this.
| the_arun wrote:
| What is a real-world use case where you could use LSH?
| buffalobuffalo wrote:
| It actually has a lot of overlap with deep learning. Here's an
| NLP example.
|
| Let's say you have millions of phrases and you want to find
| the ones most similar to a given one you've chosen. One way of
| doing this would be to create an embedding for each phrase and
| look at some metric like cosine similarity of the embeddings
| to determine closeness of the phrases. The problem is, you
| don't want to have to compare the embedding of your phrase to
| every other phrase in your collection. In this case, LSH can
| help you narrow it down.
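|
| A minimal numpy sketch of that narrowing step, using random
| hyperplanes (the sign of a dot product gives one hash bit);
| the "embeddings" and "query" arrays here are toy stand-ins for
| real phrase embeddings:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d, n_bits = 300, 16          # embedding dim, hash bits
|     planes = rng.normal(size=(n_bits, d))
|     embeddings = rng.normal(size=(10_000, d))  # toy data
|     query = embeddings[0] + 0.01 * rng.normal(size=d)
|
|     def lsh_code(v):
|         # One bit per hyperplane: which side is v on?
|         return tuple((planes @ v) > 0)
|
|     buckets = {}
|     for i, v in enumerate(embeddings):
|         buckets.setdefault(lsh_code(v), []).append(i)
|
|     # Only the query's bucket gets the exact cosine check.
|     candidates = buckets.get(lsh_code(query), [])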
| zer01 wrote:
| Malware detection uses it as a heuristic to peg specific
| "variants" to each other. Basically, LSH is great for comparing
| corpora of data that are slightly different.
|
| If you hash the data using something like SHA256 and even one
| byte is different, it'll produce a radically different hash
| (which is by design). With LSH you get a measurement that lets
| you say "corpus A has a 90% overlap with corpus B", or, in a
| single bit-flip case, a very high correlation.
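|
| A quick sketch of that contrast, with a toy SimHash (one
| common LSH choice; real malware tooling like TLSH differs in
| detail) next to SHA-256:
|
|     import hashlib
|
|     a = b"The quick brown fox jumps over the lazy dog"
|     b = b"The quick brown fox jumps over the lazy cog"
|
|     # Cryptographic hash: one changed byte flips ~half the
|     # output bits by design.
|     print(hashlib.sha256(a).hexdigest())
|     print(hashlib.sha256(b).hexdigest())
|
|     def simhash(data, bits=64):
|         # Toy SimHash over byte trigrams: per-bit majority
|         # vote across the trigrams' hashes.
|         votes = [0] * bits
|         for i in range(len(data) - 2):
|             h = int.from_bytes(
|                 hashlib.md5(data[i:i+3]).digest()[:8], "big")
|             for j in range(bits):
|                 votes[j] += 1 if (h >> j) & 1 else -1
|         return sum(1 << j for j in range(bits) if votes[j] > 0)
|
|     # Near-identical inputs land a small Hamming distance apart.
|     print(bin(simhash(a) ^ simhash(b)).count("1"), "/ 64 bits")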
| steventhedev wrote:
| It seems like this approach could be adapted as a dimensionality
| reduction technique: select a set of k random hyperplanes denoted
| p_i, then map each point to a new vector r such that r_i is the
| distance to the hyperplane p_i. My gut tells me it may be more
| efficient to reuse existing nearest neighbor approaches in that
| embedding space than to bucket them and compute the partial
| intersection of 20 sets.
|
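| A tiny numpy sketch of that embedding: with hyperplanes
| through the origin and unit normals, the signed distance is
| just a dot product (random toy data assumed):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d, k = 128, 20        # input dim, number of hyperplanes
|     normals = rng.normal(size=(k, d))
|     normals /= np.linalg.norm(normals, axis=1, keepdims=True)
|
|     points = rng.normal(size=(1000, d))
|     # r[i, j] = signed distance of point i to hyperplane j
|     r = points @ normals.T       # shape (1000, k)
|
| Any off-the-shelf nearest-neighbor index could then be built
| over r instead of the original points.
|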
| That having been said, this was a wonderful read, and beautifully
| presented.
| KMag wrote:
| Yes, given that the dimensionality of the input is at least
| the dimensionality of the hash value, locality-sensitive
| hashing is always a dimensionality reduction operation.
|
| Conversely, any dimensionality-reducing technique with a fixed
| number of finite-range output dimensions can be used for
| locality-sensitive hashing as long as a couple of extra
| constraints hold: (1) nearby inputs (given an appropriate
| distance metric) will produce nearby outputs (given a second
| appropriate distance metric), and (2) far-apart inputs are
| likely to produce far-apart outputs.
| steventhedev wrote:
| That seems to imply that LSHs are a subset of DRs. Are there
| LSH techniques that are not applicable to DR?
| mumblemumble wrote:
| Like so many of these algorithms, there are a few
| perspectives, and the most useful way to contextualize it
| depends on the problem you're trying to solve.
|
| I would say that it's both a dimensionality reduction _and_
| an approximate nearest neighbor search technique.
|
| One example of a scenario where I think that the ANN
| perspective is more useful than the DR one is near
| duplicate detection. There, the buckets tend not to be
| treated as dimensions so much as filtered lists of items to
| consider for more detailed comparison. Assuming you're
| trying to do a high-recall search for everything within a
| given distance, you may just look at the union of all the
| matching buckets.
|
| On the other hand, if you're doing a k-nearest-neighbors
| search (rather than a dragnet search for all items within
| some threshold), you might treat them more like dimensions,
| since the items that share more buckets with the search
| vector are likely to also be the most similar. So perhaps
| you start by ranking by Hamming distance in the LSH space,
| and then refine using comparisons in the higher-dimensional
| space.
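|
| A sketch of that two-stage kNN flow (random-hyperplane codes,
| a Hamming-distance shortlist, then exact cosine refinement;
| all data synthetic):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d, n_bits = 64, 32
|     planes = rng.normal(size=(n_bits, d))
|     data = rng.normal(size=(10_000, d))
|     query = data[42] + 0.05 * rng.normal(size=d)
|
|     codes = (data @ planes.T) > 0       # (N, n_bits) bools
|     qcode = (planes @ query) > 0
|
|     # Stage 1: rank by Hamming distance in the LSH space.
|     ham = (codes != qcode).sum(axis=1)
|     shortlist = np.argsort(ham)[:100]
|
|     # Stage 2: refine the shortlist with exact cosine.
|     cand = data[shortlist]
|     cos = cand @ query / (
|         np.linalg.norm(cand, axis=1) * np.linalg.norm(query))
|     top10 = shortlist[np.argsort(-cos)[:10]]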
| nighthawk454 wrote:
| UMAP is sort of along those lines: https://umap-
| learn.readthedocs.io/en/latest/how_umap_works.h...
| edoliberty wrote:
| Hi All, it's true that LSH (locality sensitive hashing) and DR
| (dimension reduction) are very related. They both derive their
| correctness from the same strong concentration phenomenon.
|
| But, they are used/designed for different use cases. DR is mostly
| concerned with preserving distances such that
|
| c * D(f(x), f(y)) < d(x, y) < C * D(f(x), f(y))
|
| Here, f(x) is the mapping of x to a lower dimensional space (or
| any other object, really). d(-,-) and D(-,-) are the original
| distance measure and new distance measure respectively. The
| smaller the ratio C/c, the better. Clearly, the computational
| overhead of applying f and D is also important.
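|
| A quick empirical check of that bound, using a plain random
| projection as one standard choice of f (toy Gaussian data):
|
|     import numpy as np
|     from scipy.spatial.distance import pdist
|
|     rng = np.random.default_rng(0)
|     n, d, k = 500, 1000, 64
|     X = rng.normal(size=(n, d))
|     G = rng.normal(size=(k, d)) / np.sqrt(k)  # f(x) = Gx
|     Y = X @ G.T
|
|     ratios = pdist(Y) / pdist(X)  # D(f(x), f(y)) / d(x, y)
|     # The band [min, max] stays narrow around 1 (small C/c).
|     print(ratios.min(), ratios.max())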
|
| LSH, on the other hand, is mostly concerned with
| probabilistically filtering and searching for nearby points
| efficiently. Here are some (old) class notes from a course I
| gave in 2013:
| https://edoliberty.github.io//classnotes/datamining2013/pdfs.
| ..
|
| While LSH is mathematically beautiful, it is no longer considered
| the gold standard for ANN. There are many open source algorithms
| that outperform LSH on most datasets.
|
| (I founded a vector search company: https://www.pinecone.io)
| molodec wrote:
| There are many great algorithms that outperform LSH on ANN
| vector search, but vectorization is in itself a compute-heavy
| task, which adds latency and requires GPUs, and this is not
| reported in benchmarks. For text, an LSH hash can be
| efficiently created directly from tokens. Of course, this does
| not work for semantic similarity, but it can be used for
| lexical similarity search.
|
| By the way, I enjoyed listening to the interview with you on
| the Practical AI podcast!
| azinman2 wrote:
| What are the gold standard algorithms now?
| cgreerrun wrote:
| This site keeps track of them:
| https://github.com/erikbern/ann-benchmarks#glove-100-angular
|
| ScaNN is currently SOTA:
| https://ai.googleblog.com/2020/07/announcing-scann-
| efficient...
| azinman2 wrote:
| Thanks!
| edoliberty wrote:
| My 2c, the choice of algorithm is quite complex... It is
| data dependent and also depends on your read/write ratio,
| latency and accuracy needs, memory/cpu, and many other
| factors. If I were you I'd look for a versatile solution
| more than "the best" algorithm cos that choice will
| likely change... just sayin'...
| IshKebab wrote:
| I can't see how LSH would be any faster for 2D/3D data than just
| sorting into regular X, Y buckets (like the first example).
| Surely this is only useful for high dimensional data?
| vaibhavsagar wrote:
| Aaron Levin did a very interesting talk about this at !!Con 2017:
| https://www.youtube.com/watch?v=kKRvEJrvvso
| treebog wrote:
| I remember learning about locality sensitive hashing in school a
| decade ago, but since then I've never seen it used in an ML
| system in industry. Does anyone actually use this technique at
| scale? If you have, I'd love to hear about your application and
| experience with it.
| polm23 wrote:
| I used it in ecommerce to find duplicate product titles.
| dcolkitt wrote:
| I've used it in an HFT system. Basically as a form of latency-
| minimizing model compression for evaluation in production.
|
| The problem was that the base layer model (tree ensembles) did
| very well in terms of training error but was much too slow to
| evaluate in production. With LSH, you can start with a bunch of
| points from the base model, then train the compressed model
| against the reference samples.
| gpderetta wrote:
| A lifetime ago I used it on a document clustering system. It is
| not very good as a general similarity function, but excellent
| for quickly finding near duplicates in linear time. About half
| of the documents in the set I was working on (news articles) were
| duplicates, so an early removal pass would speed up the actual
| clustering algorithm pass.
|
| My recollection is a bit fuzzy, but I remember specifically
| using Min-Hash together with random projection.
| weimerica wrote:
| We used it at work (an anti-cybercrime / phishing-focused
| company) in conjunction with our crawler for clustering of
| phishing landing pages.
|
| Typically, you'll have loads of script kiddies hosting
| slightly modified copies of the Apple ID landing page on a
| compromised web server... Alternatively you'd have loads of
| these pages built out of "exploit frameworks" / "kits", so
| we'd want to categorize and group them to identify the
| prevalence of a given framework / author.
|
| Had the nice side-effect of prioritizing, speeding up, and
| automating takedowns for our SOC folks.
| v3rt3x wrote:
| I've used it for a wide variety of search and recommendation
| problems, albeit not in isolation. In practice, I've found that
| it is a great way to reduce the size of a search space before
| handing the problem off to a more precise algorithm.
| ospider wrote:
| When I was at ByteDance, it was used extensively to deduplicate
| news articles.
| midjji wrote:
| It was used extensively in classic computer vision descriptor
| matching, including most image lookup, e.g. Google image
| search today. There are many reasons it's not much used in
| recent computer vision. The first is that it does not work
| when the descriptors themselves are explicitly designed to
| make visually similar features descriptively distinct, and the
| second is that it generally works poorly for the binary
| vectors which now dominate. With regards to ML, it's been
| tried, but the datasets are either too small or not available
| to researchers, and the result would be far too slow compared
| to the current approaches. It's actually one of the areas
| where deep learning hasn't reached parity, to my current
| knowledge. It would be a great project, the dataset issue
| aside.
| addictedcs wrote:
| For binary vectors you can choose a different distance metric
| (not a geometric one, e.g. Jaccard) that can be used to
| effectively hash similar data points into similar buckets.
|
| Treating your binary vector as a set allows you to use min-
| hashing as your LSH scheme (a min-hash is just the smallest
| element of the set under a random permutation of the
| universe). This simple trick makes LSH with min-hashing quite
| a powerful tool for binary vectors, which are extensively used
| in recommender systems and other domains.
|
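| A minimal sketch of that scheme, with salted hashes standing
| in for the random permutations (toy word sets):
|
|     import random
|
|     def minhash_sig(s, n_hashes=128, seed=0):
|         # One salted hash per "permutation"; taking the min
|         # over the set emulates min under a random permutation.
|         rnd = random.Random(seed)
|         salts = [rnd.getrandbits(64) for _ in range(n_hashes)]
|         return [min(hash(x) ^ t for x in s) for t in salts]
|
|     a = set("the quick brown fox jumps".split())
|     b = set("the quick brown dog jumps".split())
|
|     sa, sb = minhash_sig(a), minhash_sig(b)
|     # Fraction of matching slots estimates Jaccard(a, b).
|     est = sum(x == y for x, y in zip(sa, sb)) / len(sa)
|     print(est, len(a & b) / len(a | b))
|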
| I've used LSH + Min-Hash for image search (and subsequently
| for audio fingerprinting). If interested, I've blogged about
| it here [1].
|
| [1] - https://emysound.com/blog/open-source/2020/06/12/how-
| audio-f...
| gpderetta wrote:
| Agree. Also cosine distance.
| jalammar wrote:
| It's widely used in recommenders based on embeddings. See:
|
| https://github.com/spotify/annoy
|
| https://github.com/facebookresearch/faiss
| roye wrote:
| one more, from google:
| https://ai.googleblog.com/2020/07/announcing-scann-
| efficient...
| gk1 wrote:
| And https://www.pinecone.io...
| Loranubi wrote:
| It's used in the cybersecurity world for fuzzy file comparison.
| See https://github.com/trendmicro/tlsh
| roye wrote:
| Academic example, but I'm sure it's made its way to this
| industry as well: in bioinformatics, this has proven
| tremendously useful for approximate string matching to do
| things like identify microbes in sequenced microbiome samples -
| cf
| https://genomebiology.biomedcentral.com/articles/10.1186/s13...
| pilotneko wrote:
| It is being used in the Reformer architecture to approximate
| the dot-product attention layer. This enables much longer
| input sequences and reduces computational complexity from
| O(n^2) to O(n*log(n)), where n is the length of the sequence.
|
| https://huggingface.co/transformers/model_doc/reformer.html
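|
| A sketch of just the bucketing step, per the Reformer paper's
| angular LSH (random projections, then argmax over the signed
| projections; toy vectors):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     seq_len, d, n_buckets = 512, 64, 16
|     x = rng.normal(size=(seq_len, d))  # shared query/key vecs
|
|     R = rng.normal(size=(d, n_buckets // 2))
|     proj = x @ R
|     buckets = np.argmax(
|         np.concatenate([proj, -proj], axis=-1), axis=-1)
|
|     # Attention is then computed only among positions that
|     # share a bucket (after sorting by bucket id).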
| inciampati wrote:
| Locality sensitive hashing is extensively used in
| bioinformatics and genomics.
|
| mash is probably the best-known implementation
| (https://doi.org/10.1186/s13059-016-0997-x). It allows you to
| do _highly_ efficient pairwise comparisons of genome sequences,
| where it provides a good approximation of their actual
| pairwise sequence divergence.
|
| Is MinHash used industrially outside of genomics? In theory,
| it'd be useful on any kind of text corpus. The design of the
| hash functions that the sketch is built from is non-trivial.
| The hashes typically come from k-mers of a fixed length and
| various fast hash functions over them.
| phillypham wrote:
| It was used in https://ai.googleblog.com/2020/01/reformer-
| efficient-transfo... for faster and more memory-efficient
| attention.
| GistNoesis wrote:
| The general idea of quantizing points so that similar points
| fall into the same bucket or a nearby bucket is a powerful one.
|
| It's more appropriately used to answer ball queries (i.e. find
| all points within distance r of the query) than nearest-
| neighbor queries.
|
| I've used it in multiple contexts:
|
| -Low dimension: computing all pair-wise interactions for
| nearby particles in physics simulations. You hash particle
| positions to cells, and you check interactions between nearby
| cells (see the cell-list sketch after this list).
|
| -Image similarity search for loop-closure candidates in SLAM
| algorithms (depending on the number of images you have, brute-
| forcing in Hamming space is often sufficient).
|
| -Visual vocabulary: each key-point has a descriptor value;
| when you use OpenCV with SIFT/SURF for matching, you can use
| FLANN with LSH.
|
| -K-means-variant clustering with large K: when K is large you
| need an index to find the closest point efficiently (used to
| compute the visual vocabulary above).
|
| -Neural networks: for example, it can be very tempting to put
| some kind of LSH index on the GPU as a standard neural-network
| layer, but if it's too small you are better off with standard
| brute-force, and if it's too big you run out of memory. So you
| are looking for the sweet spot. But remember that if you have
| an index with 10M elements and you sparsely update only 10
| elements per iteration, you'll need at least 1M iterations
| before each element has moved, so your algorithm will take a
| long time to converge.
|
| -P2P indices: tables of hashes that can be used to recover
| similar data when querying them.
|
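| A sketch of the low-dimension case above: a cell list for
| neighbor queries in a particle simulation (toy positions and
| an assumed interaction radius):
|
|     import numpy as np
|     from collections import defaultdict
|
|     rng = np.random.default_rng(0)
|     pos = rng.uniform(0.0, 10.0, size=(1000, 3))
|     r = 0.5                      # interaction radius
|
|     cells = defaultdict(list)    # cell coords -> particle ids
|     for i, p in enumerate(pos):
|         cells[tuple((p // r).astype(int))].append(i)
|
|     def neighbors(i):
|         # Only the 27 cells around particle i need checking.
|         c = (pos[i] // r).astype(int)
|         out = []
|         for dx in (-1, 0, 1):
|             for dy in (-1, 0, 1):
|                 for dz in (-1, 0, 1):
|                     key = (c[0]+dx, c[1]+dy, c[2]+dz)
|                     for j in cells.get(key, ()):
|                         if j != i and np.linalg.norm(
|                                 pos[j] - pos[i]) < r:
|                             out.append(j)
|         return out
|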
| The version of LSH described in the article is not often used
| because in real data you can often take advantage of some
| hierarchical representation inside the data.
|
| It's one of many space-partitioning techniques with multiple
| overlapping partitions.
|
| Depending on the constraints on your data, it's hard to make
| it shine. It shines best when there are large quantities of
| data, but it consumes a lot of memory, so it may be useful to
| have the index on disk, and ideally you'd like to be able to
| update the index incrementally. The various existing libraries
| work well for the specific situations they are designed to
| handle, but quite often you are still better off rolling your
| own space-partitioning technique for your specific
| constraints.
|
| Also, quite often, instead of doing an approximate search and
| having to fiddle with tuning parameters, it's better to use an
| exact search: if you can project into Hamming space, there
| are fast and exact algorithms (although memory-intensive)
| based on the pigeonhole principle.
| Mehdi2277 wrote:
| The recommender system mention is a very valuable use case
| that I've seen at some of the largest ML-using companies in
| the world. If you have a recommendation system over a pool of
| millions of pieces of content or more, you will need an ANN
| index. I would expect Google image search is using an ANN
| index. Facebook definitely uses ANN indices. Spotify is where
| Annoy comes from. They're pretty key at TikTok.
|
| Whether you use locality-sensitive hashing or something else
| is a separate question. Personal experience is that graph-
| based ANN indices (HNSW is one nice one) have tended to do a
| bit better, but LSH is competitive, so I wouldn't strongly be
| against a design decision choosing it. One downside I've seen
| with some ANN index libraries is that a lot of them don't
| support incremental updates/deletions and force you to build
| the index as a batch job. That's fine in some use cases, but
| breaks others. The LSH-based approach Spotify uses doesn't
| support incremental updates, but that's not because LSH can't
| support it; Annoy just wasn't designed for it.
| andyxor wrote:
| Great post, and my favorite CS topic. LSH is particularly
| relevant to machine learning because of its use in indexing
| embeddings, which are now omnipresent in ML (from word2vec to
| transformers to image and graph embeddings, etc.)
|
| It is now supported in the Elasticsearch kNN index (they use
| HNSWLIB, but you can call it a descendant of the original LSH
| in a way)
|
| check out ANN benchmarks [0] for comparison of LSH performance to
| other state of the art methods like proximity graphs/HNSWLIB [1]
| and quantization/SCANN [2]
|
| As an introduction, LSH (with MinHash) is also described in
| detail in the book "Mining of Massive Datasets", ch. 3,
| "Finding Similar Items"; highly recommended [3]
|
| if you want to play with LSH, the Python "annoy" library is
| the best place to start [4] (minimal usage sketch after the
| links)
|
| [0] https://github.com/erikbern/ann-benchmarks
|
| [1] https://github.com/nmslib/hnswlib
|
| [2] https://github.com/google-research/google-
| research/tree/mast...
|
| [3] http://infolab.stanford.edu/~ullman/mmds
|
| [4] https://github.com/spotify/annoy
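|
| A minimal Annoy usage sketch (toy random vectors; "angular"
| is Annoy's cosine-style metric):
|
|     import random
|     from annoy import AnnoyIndex
|
|     f = 40                          # vector dimensionality
|     index = AnnoyIndex(f, "angular")
|     for i in range(1000):
|         v = [random.gauss(0, 1) for _ in range(f)]
|         index.add_item(i, v)
|     index.build(10)                 # 10 trees
|     print(index.get_nns_by_item(0, 5))  # 5 nearest neighbors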
| zimpenfish wrote:
| http://infolab.stanford.edu/~ullman/mmds is 401 for me -
| presumably because directory indexing is turned off.
|
| http://infolab.stanford.edu/~ullman/mmds/book.pdf is the book;
| http://infolab.stanford.edu/~ullman/mmds/bookL.pdf is the
| hyperlinked book; and http://www.mmds.org/ seems to be the
| official homepage.
| Lichtso wrote:
| Any relation to: https://en.wikipedia.org/wiki/Rolling_hash ?
|
| Also I wonder if hexagonal lattices would be a better choice
| for the 2D use case. Having multiple randomly sized, shifted,
| and rotated hexagonal lattices would come very close to:
| https://en.wikipedia.org/wiki/Grid_cell
| bvrmn wrote:
| Are the TeX equations intended to be presented in raw form?
| They're very hard to read without proper rendering.
| encypruon wrote:
| It renders fine on my end. Try checking if your browser is
| blocking cdnjs.cloudflare.com for some reason.
| bvrmn wrote:
| Yep, chrome works fine!
| sidcool wrote:
| Getting a 404 'Site not Found' page.
| zaarn wrote:
| Try disabling HTTPS Everywhere or Firefox's HTTPS-Only Mode.
| Looks like this site is badly set up and can't handle HTTPS
| requests properly.
| wcrossbow wrote:
| This reminds me of the fuzzy simplicial sets used by UMAP. Is
| there any relationship between the two? Could it be used to
| accelerate the algorithm?
| jtnag wrote:
| A few years ago I made a PoC with LSH to find similar
| functions within a keyword-driven test framework, to reduce
| duplicate keywords. It did find duplicates with fairly high
| accuracy, although I did not get any feedback from real-life
| usage.
| bruce343434 wrote:
| This is pretty awesome! But I'm not a fan of the RNG involved.
| Is there an approach where you use a circle or sphere with a
| falloff function instead of layering random mosaics/hash
| functions?
| Hnrobert42 wrote:
| I was barely hanging on to a lot of the math notation, but the
| author argues the advantage of the randomization is that on
| average the performance will be constant across all datasets.
| That is, once you develop a sense for how the system performs,
| you shouldn't be surprised by wild performance swings.
|
| If that is not a valuable feature for you, then I believe you
| could use any clustering hash function you want, including a
| deterministic one.
| bruce343434 wrote:
| The randomization is to prevent bias in the sampling, as they
| illustrate with the grid. They then overlay random "grid
| shapes" to roughly approximate a sphere. I can't prove it
| right now, but I'm betting that if you extended this approach
| to infinitely many hashes you would eventually get a perfect
| circular falloff.
|
| So, randomization is not the only way to prevent bias: you
| could avoid it entirely by cutting to the chase and using a
| circular falloff from the start. What I'm asking is how one
| would implement such a thing.
| bnj wrote:
| A little off topic, but after enjoying the piece (nice Tufte-
| style layout, kudos) I was digging around the other posts.
| Disappointed that I couldn't find an RSS feed to follow future
| ones!
___________________________________________________________________
(page generated 2021-06-24 23:02 UTC)