[HN Gopher] Introduction to Locality-Sensitive Hashing
       ___________________________________________________________________
        
       Introduction to Locality-Sensitive Hashing
        
       Author : polm23
       Score  : 233 points
       Date   : 2021-06-24 05:51 UTC (17 hours ago)
        
 (HTM) web link (tylerneylon.com)
 (TXT) w3m dump (tylerneylon.com)
        
       | dgan wrote:
        | Wow, I just discovered LSH a couple of hours ago, before
        | opening HN... only to see it being covered here.
        
       | tux3 wrote:
       | Crackpot intuition: Knowing very little about either LSH or
       | Lattices, could something like this be used to speed up the
       | Shortest Vector Problem and attack lattice-based post-quantum
       | cryptography?
       | 
       | Edit: Googling finds a paper from 2015
       | (https://eprint.iacr.org/2014/744.pdf). Can't tell how relevant
       | this is or how far off the mark I am, but either way this ain't
       | novel. Back to humbly minding my own business =)
        
         | KMag wrote:
          | A few observations: we don't yet know how to perform long-term
          | qubit storage, so all multi-day storage at present is classical
          | bits.
          | 
          | If the proposal is to hash all of the possible lattice points
          | using irreversible computation on Earth in a reasonable amount
          | of time, for a cryptographically useful lattice, the waste heat
          | would literally boil the oceans. That limitation doesn't hold
          | for quantum computers, but results need to be converted into
          | classical bits for storage. Presumably rainbow-table-like
          | operations could reduce the storage space and waste heat
          | necessary in storing these bits. Maybe quantum rainbow tables
          | are a fruitful area of research.
         | 
          | If we're not pre-hashing all possible lattice points, then
          | either we have the public value and some heuristic that allows
          | us to hash only a small subset of lattice points for this
          | particular value (... but then why use a hash? ... just
          | compare points in your heuristic sequence), or we have some
          | inverse hash operation where we hash the public value, then
          | explore the "un-hash space" of lattice points corresponding to
          | the LSH of the point in question.
         | 
         | This last general line of inquiry (trying to find search-space
         | reducing heuristics) is probably where most of the research
         | effort is going, though perhaps not specifically using LSH. In
         | order to be useful for the lattice problem, the distance
         | metrics for the input and output spaces of the LSH need to have
         | specific relations to the lattice distance metric.
         | 
          | So... yeah... maybe it helps to re-cast the problem in terms
          | of LSH, but not in any obvious way. Also, any search-space-
          | reducing heuristic could probably be re-interpreted as an LSH
          | if you squint at it hard enough.
        
         | ganzuul wrote:
          | Replying in kind: if it optimizes memory layout, then it
          | speeds up computation. However, if you only reduce the
          | hardness from 5 bn years to 1 bn years, it's important but
          | still just amusing.
          | 
          | They discuss distance metrics, which may be sensitive to
          | inductive bias, which may perhaps be substantiated by external
          | knowledge about the specific instance of the problem. Then
          | things are not amusing anymore.
        
       | alexfromapex wrote:
        | This is actually an amazing idea. I wonder if this would help
        | with authenticating real online identities and distinguishing
        | them from bots? It can obviously be spoofed, but it's definitely
        | a useful tool in the toolkit.
        
       | tgb wrote:
        | Well-written article. Does this get used at all in
        | bioinformatics, maybe for de novo sequence construction from
        | short reads or something? It seems like it would have many
        | applications in that field.
       | 
       | Edit: I see now that user inciampati already addressed this.
        
       | the_arun wrote:
        | What is a real-world use case where you could use LSH?
        
         | buffalobuffalo wrote:
         | It actually has a lot of overlap with deep learning. Here's an
         | NLP example.
         | 
          | Let's say you have millions of phrases and you want to find the
          | ones that are most similar to a given one that you've chosen.
          | One way of doing this would be to create an embedding for each
          | phrase and look at some metric like the cosine similarity of
          | the embeddings to determine the closeness of the phrases. The
          | problem is, you don't want to have to compare the embedding of
          | your phrase to every other phrase in your collection. In this
          | case, LSH can help you narrow it down.
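          | 
          | A rough sketch of one common way to do this, with random
          | hyperplanes (untested; assumes numpy, and `phrase_embeddings`
          | / `query_embedding` stand in for your own data):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     dim, n_bits = 300, 16  # embedding size, signature bits
          | 
          |     # n_bits random hyperplanes through the origin
          |     planes = rng.normal(size=(n_bits, dim))
          | 
          |     def signature(v):
          |         # which side of each hyperplane v falls on
          |         return tuple(planes @ v > 0)
          | 
          |     # vectors pointing in similar directions (high cosine
          |     # similarity) tend to share a signature
          |     buckets = {}
          |     for i, v in enumerate(phrase_embeddings):
          |         buckets.setdefault(signature(v), []).append(i)
          | 
          |     # compare the query only against its own bucket
          |     candidates = buckets.get(signature(query_embedding), [])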
        
         | zer01 wrote:
          | Malware detection uses it as a heuristic to peg specific
          | "variants" to each other. Basically, LSH is great for comparing
          | corpora of data that are slightly different.
          | 
          | If you hash the data using something like SHA-256 and even one
          | byte is different, it'll produce a radically different hash
          | (which is by design). With LSH you get a measurement that says
          | "corpus A has a 90% overlap with corpus B", or, in the case of
          | a single bit flip, a very high correlation.
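          | 
          | A toy illustration of the difference (real tools use something
          | like ssdeep or TLSH; `simhash8` below is just a sketch of the
          | idea):
          | 
          |     import hashlib
          | 
          |     a = b"the quick brown fox jumps over the lazy dog"
          |     b = b"the quick brown fox jumps over the lazy cog"
          | 
          |     # cryptographic hash: one changed byte, totally
          |     # different digest (the avalanche effect, by design)
          |     print(hashlib.sha256(a).hexdigest())
          |     print(hashlib.sha256(b).hexdigest())
          | 
          |     # toy LSH: an 8-bit simhash over byte trigrams
          |     def simhash8(data):
          |         counts = [0] * 8
          |         for i in range(len(data) - 2):
          |             h = hashlib.blake2b(data[i:i + 3],
          |                                 digest_size=1).digest()[0]
          |             for bit in range(8):
          |                 counts[bit] += 1 if (h >> bit) & 1 else -1
          |         return sum((c > 0) << bit
          |                    for bit, c in enumerate(counts))
          | 
          |     # near-identical inputs give near-identical signatures
          |     print(bin(simhash8(a)), bin(simhash8(b)))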
        
       | steventhedev wrote:
       | It seems like this approach could be adapted as a dimensionality
       | reduction technique: select a set of k random hyperplanes denoted
       | p_i, then map each point to a new vector r such that r_i is the
       | distance to the hyperplane p_i. My gut tells me it may be more
       | efficient to reuse existing nearest neighbor approaches in that
       | embedding space than to bucket them and compute the partial
       | intersection of 20 sets.
       | 
       | That having been said, this was a wonderful read, and beautifully
       | presented.
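        | 
        | To sketch what I mean by the projection idea (untested numpy):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     k, dim = 20, 128
        | 
        |     # k random hyperplanes: unit normals plus offsets
        |     normals = rng.normal(size=(k, dim))
        |     normals /= np.linalg.norm(normals, axis=1, keepdims=True)
        |     offsets = rng.normal(size=k)
        | 
        |     def embed(points):
        |         # signed distance of each point to each hyperplane
        |         return points @ normals.T - offsets
        | 
        |     # then run any standard nearest-neighbor search on the
        |     # k-dimensional embed(points) instead of bucketing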
        
         | KMag wrote:
          | Yes, given that the dimensionality of the input is at least the
          | dimensionality of the hash value, locality-sensitive hashing is
          | always a dimensionality reduction operation.
          | 
          | Conversely, any dimensionality-reducing technique with a fixed
          | number of finite-range output dimensions can be used for
          | locality-sensitive hashing as long as a couple of extra
          | constraints hold: (1) nearby inputs (given an appropriate
          | distance metric) will produce nearby outputs (given a second
          | appropriate distance metric), and (2) far-apart inputs are
          | likely to produce far-apart outputs.
        
           | steventhedev wrote:
           | That seems to imply that LSHs are a subset of DRs. Are there
           | LSH techniques that are not applicable to DR?
        
             | mumblemumble wrote:
             | Like so many of these algorithms, there are a few
             | perspectives, and the most useful way to contextualize it
             | depends on the problem you're trying to solve.
             | 
             | I would say that it's both a dimensionality reduction _and_
             | an approximate nearest neighbor search technique.
             | 
             | One example of a scenario where I think that the ANN
             | perspective is more useful than the DR one is near
             | duplicate detection. There, the buckets tend not to be
             | treated as dimensions so much as filtered lists of items to
              | consider for more detailed comparison. Assuming you're
              | trying to do a high-recall search for everything within
              | some distance, you may just look at the union of all the
              | matching buckets.
             | 
             | On the other hand, if you're doing a k-nearest-neighbors
             | search (rather than a dragnet search for all items within
             | some threshold), you might treat them more like dimensions,
             | since the items that share more buckets with the search
             | vector are likely to also be the most similar. So perhaps
             | you start by ranking by Hamming distance in the LSH space,
             | and then refine using comparisons in the higher-dimensional
             | space.
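              | 
              | In code, the two treatments look roughly like this
              | (sketch only; `tables` is a list of dicts mapping
              | signature -> item ids, and `sigs(q)` yields the
              | query's signature per table):
              | 
              |     from collections import Counter
              | 
              |     def dragnet(q, tables, sigs):
              |         # high-recall: union of all matching buckets
              |         out = set()
              |         for t, s in zip(tables, sigs(q)):
              |             out.update(t.get(s, ()))
              |         return out
              | 
              |     def knn_candidates(q, tables, sigs):
              |         # rank by number of shared buckets, then
              |         # refine the top items in the original space
              |         counts = Counter()
              |         for t, s in zip(tables, sigs(q)):
              |             counts.update(t.get(s, ()))
              |         return [i for i, _ in counts.most_common()]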
        
         | nighthawk454 wrote:
         | UMAP is sort of along those lines: https://umap-
         | learn.readthedocs.io/en/latest/how_umap_works.h...
        
       | edoliberty wrote:
       | Hi All, it's true that LSH (locality sensitive hashing) and DR
       | (dimension reduction) are very related. They both derive their
       | correctness from the same strong concentration phenomenon.
       | 
       | But, they are used/designed for different use cases. DR is mostly
       | concerned with preserving distances such that
       | 
        | c * D(f(x), f(y)) < d(x, y) < C * D(f(x), f(y))
       | 
        | Here, f(x) is the mapping of x to a lower-dimensional space (or
        | any other object, really). d(-,-) and D(-,-) are the original
        | distance measure and the new distance measure, respectively. The
        | smaller the ratio C/c, the better. Clearly, the computational
        | overhead of applying f and D is also important.
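        | 
        | For intuition, a quick numeric check of that sandwich with a
        | plain random projection (numpy sketch, not production code):
        | 
        |     import numpy as np
        | 
        |     rng = np.random.default_rng(0)
        |     n, d, k = 100, 1000, 100
        |     X = rng.normal(size=(n, d))
        | 
        |     # f(x) = Rx / sqrt(k), a classic random projection
        |     R = rng.normal(size=(k, d))
        | 
        |     diffs = X[:50] - X[50:]               # 50 point pairs
        |     orig = np.linalg.norm(diffs, axis=1)  # d(x, y)
        |     proj = np.linalg.norm(diffs @ R.T, axis=1) / np.sqrt(k)
        | 
        |     ratio = orig / proj           # d(x,y) / D(f(x),f(y))
        |     print(ratio.min(), ratio.max())  # c and C, close to 1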
       | 
       | LSH, on the other hand, is mostly concerned with
       | probabilistically filtering and searching for nearby points
        | efficiently. Here are some (old) class notes from a course I
        | gave in 2013:
        | https://edoliberty.github.io//classnotes/datamining2013/pdfs...
       | 
       | While LSH is mathematically beautiful, it is no longer considered
       | the gold standard for ANN. There are many open source algorithms
       | that outperform LSH on most datasets.
       | 
       | (I founded a vector search company: https://www.pinecone.io)
        
         | molodec wrote:
          | There are many great algorithms that outperform LSH on ANN
          | vector search, but vectorization is itself a compute-heavy
          | task, which adds latency and requires GPUs, and this is not
          | reported in benchmarks. For text, an LSH hash can be created
          | efficiently, directly from tokens. Of course, this does not
          | work for semantic similarity, but it can be used for lexical
          | similarity search.
          | 
          | By the way, I enjoyed listening to the interview with you on
          | the Practical AI podcast!
        
         | azinman2 wrote:
         | What are the gold standard algorithms now?
        
           | cgreerrun wrote:
           | This site keeps track of them:
           | https://github.com/erikbern/ann-benchmarks#glove-100-angular
           | 
           | ScaNN is currently SOTA:
           | https://ai.googleblog.com/2020/07/announcing-scann-
           | efficient...
        
             | azinman2 wrote:
             | Thanks!
        
               | edoliberty wrote:
                | My 2c: the choice of algorithm is quite complex... It is
                | data-dependent and also depends on your read/write
                | ratio, latency and accuracy needs, memory/CPU, and many
                | other factors. If I were you, I'd look for a versatile
                | solution more than "the best" algorithm, cos that choice
                | will likely change... just sayin'...
        
       | IshKebab wrote:
       | I can't see how LSH would be any faster for 2D/3D data than just
       | sorting into regular X, Y buckets (like the first example).
       | Surely this is only useful for high dimensional data?
        
       | vaibhavsagar wrote:
       | Aaron Levin did a very interesting talk about this at !!Con 2017:
       | https://www.youtube.com/watch?v=kKRvEJrvvso
        
       | treebog wrote:
       | I remember learning about locality sensitive hashing in school a
       | decade ago, but since then I've never seen it used in an ML
       | system in industry. Does anyone actually use this technique at
       | scale? If you have, I'd love to hear about your application and
       | experience with it.
        
         | polm23 wrote:
         | I used it in ecommerce to find duplicate product titles.
        
         | dcolkitt wrote:
          | I've used it in an HFT system, basically as a form of latency-
          | minimizing model compression for evaluation in production.
         | 
         | The problem was that the base layer model (tree ensembles) did
         | very well in terms of training error but was much too slow to
         | evaluate in production. With LSH, you can start with a bunch of
         | points from the base model, then train the compressed model
         | against the reference samples.
        
         | gpderetta wrote:
          | A lifetime ago I used it in a document clustering system. It is
          | not very good as a general similarity function, but excellent
          | for quickly finding near duplicates in linear time. About half
          | of the documents in the set I was working on (news articles)
          | were duplicates, so an early removal pass would speed up the
          | actual clustering pass.
         | 
         | My recollection is a bit fuzzy, but I remember specifically
         | using Min-Hash together with random projection.
        
           | weimerica wrote:
            | We used it at work (an anti-cybercrime / phishing-focused
            | company) in conjunction with our crawler for clustering
            | phishing landing pages.
            | 
            | Typically, you'll have loads of script kiddies hosting
            | slightly modified copies of the Apple ID landing page on a
            | compromised web server... Alternatively, you'd have loads of
            | these pages built out of "exploit frameworks" / "kits", so
            | we'd want to categorize and group them to identify the
            | prevalence of a given framework / author.
            | 
            | It had the nice side effect of prioritizing, speeding up,
            | and automating takedowns for our SOC folks.
        
         | v3rt3x wrote:
         | I've used it for a wide variety of search and recommendation
         | problems, albeit not in isolation. In practice, I've found that
         | it is a great way to reduce the size of a search space before
         | handing the problem off to a more precise algorithm.
        
         | ospider wrote:
          | When I was at ByteDance, it was used extensively to deduplicate
          | news articles.
        
         | midjji wrote:
          | It was used extensively in classic computer-vision descriptor
          | matching, including most image lookup, e.g. Google image search
          | today. There are several reasons it's not much used in recent
          | computer vision. The first is that it does not work when the
          | descriptors themselves are explicitly designed to make visually
          | similar features descriptively distinct, and the second is that
          | it generally works poorly for the binary vectors which now
          | dominate. With regard to ML, it's been tried, but the datasets
          | are either too small or not available to researchers, and the
          | result would be far too slow compared to the current
          | approaches. It's actually one of the areas where deep learning
          | hasn't reached parity, to my current knowledge. It would be a
          | great project, the dataset issue aside.
        
           | addictedcs wrote:
            | For binary vectors you can choose a different distance metric
            | (not a geometric one, e.g. Jaccard) that can be used to
            | effectively hash similar data points into similar buckets.
           | 
            | Treating your binary vector as a set allows you to use min-
            | hashing as your LSH scheme (a min-hash is the minimum element
            | of the set under a random permutation of the universe). This
            | simple trick makes LSH with min-hashing quite a powerful tool
            | for binary vectors, which are extensively used in recommender
            | systems and other domains.
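            | 
            | A minimal min-hash sketch (illustrative only; real
            | implementations use cheaper hash families):
            | 
            |     import hashlib
            | 
            |     def minhash(items, num_perm=64):
            |         # one min value per simulated permutation
            |         sig = []
            |         for seed in range(num_perm):
            |             salt = seed.to_bytes(16, "little")
            |             sig.append(min(
            |                 hashlib.blake2b(x, digest_size=8,
            |                                 salt=salt).digest()
            |                 for x in items))
            |         return sig
            | 
            |     # fraction of matching positions estimates Jaccard
            |     a = minhash({b"cat", b"dog", b"fox"})
            |     b = minhash({b"cat", b"dog", b"owl"})
            |     print(sum(x == y for x, y in zip(a, b)) / len(a))
            |     # -> roughly 0.5, the true Jaccard similarity (2/4)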
           | 
            | I've used LSH + Min-Hash for image search (and subsequently
            | for audio fingerprinting). If you're interested, I've blogged
            | about it here [1].
           | 
           | [1] - https://emysound.com/blog/open-source/2020/06/12/how-
           | audio-f...
        
             | gpderetta wrote:
             | Agree. Also cosine distance.
        
         | jalammar wrote:
         | It's widely used in recommenders based on embeddings. See:
         | 
         | https://github.com/spotify/annoy
         | 
         | https://github.com/facebookresearch/faiss
        
           | roye wrote:
           | one more, from google:
           | https://ai.googleblog.com/2020/07/announcing-scann-
           | efficient...
        
             | gk1 wrote:
             | And https://www.pinecone.io...
        
         | Loranubi wrote:
         | It's used in the cybersecurity world for fuzzy file comparison.
         | See https://github.com/trendmicro/tlsh
        
         | roye wrote:
          | Academic example, but I'm sure it's made its way into this
          | industry as well: in bioinformatics, this has proven
          | tremendously useful for approximate string matching to do
          | things like identify microbes in sequenced microbiome samples -
          | cf.
          | https://genomebiology.biomedcentral.com/articles/10.1186/s13...
        
         | pilotneko wrote:
          | It is being used in the Reformer architecture to approximate
          | the dot-product attention layer. This enables much longer input
          | sequences and reduces computational complexity from O(n^2) to
          | O(n log(n)), where n is the length of the sequence.
         | 
         | https://huggingface.co/transformers/model_doc/reformer.html
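          | 
          | Stripped of the Reformer-specific details (shared QK,
          | chunking, multiple hash rounds), the core trick is just
          | bucketing positions by a random-projection hash so attention
          | runs within buckets instead of over all n^2 pairs. A very
          | rough numpy sketch:
          | 
          |     import numpy as np
          | 
          |     def lsh_buckets(q, n_bits=4, seed=0):
          |         # q: (n, d) query vectors for one attention head
          |         rng = np.random.default_rng(seed)
          |         planes = rng.normal(size=(n_bits, q.shape[1]))
          |         bits = (q @ planes.T > 0).astype(int)
          |         codes = bits @ (1 << np.arange(n_bits))
          |         buckets = {}
          |         for pos, c in enumerate(codes):
          |             buckets.setdefault(int(c), []).append(pos)
          |         return buckets  # attend only within each bucket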
        
         | inciampati wrote:
         | Locality sensitive hashing is extensively used in
         | bioinformatics and genomics.
         | 
          | mash is probably the best-known implementation
          | (https://doi.org/10.1186/s13059-016-0997-x). It allows you to
          | do _highly_ efficient pairwise comparisons of genome sequences,
          | where it provides a good approximation of their actual pairwise
          | sequence divergence.
         | 
          | Is MinHash used industrially outside of genomics? In theory,
          | it'd be useful on any kind of text corpus. The design of the
          | hash functions that the sketch is built from is non-trivial.
          | The hashes typically come from k-mers of a fixed length and
          | various fast hash functions over them.
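          | 
          | The k-mer part is simple; the sketch is just MinHash over this
          | set (k=21 is a typical choice in mash):
          | 
          |     def kmers(seq, k=21):
          |         # the set of all length-k substrings of a sequence
          |         return {seq[i:i + k]
          |                 for i in range(len(seq) - k + 1)}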
        
         | phillypham wrote:
         | It was used in https://ai.googleblog.com/2020/01/reformer-
         | efficient-transfo... for faster and more memory-efficient
         | attention.
        
         | GistNoesis wrote:
          | The general idea of quantizing points so that similar points
          | fall into the same bucket or a nearby bucket is a powerful one.
          | 
          | It's more appropriately used to answer ball queries (i.e. find
          | all points within distance r of the query) than nearest-
          | neighbor queries.
          | 
          | I've used it in multiple contexts:
         | 
          | - Low dimension: computing all pairwise interactions for nearby
          | particles in physics simulations. You hash particle positions
          | to cells, and you check interactions between nearby cells (see
          | the sketch after this list).
          | 
          | - Image similarity search for loop-closure candidates in SLAM
          | algorithms (depending on the number of images you have, brute-
          | forcing in Hamming space is often sufficient).
          | 
          | - Visual vocabulary: each key-point has a descriptor value;
          | when you use OpenCV with SIFT/SURF for matching, you can use
          | FLANN with LSH.
          | 
          | - K-means-variant clustering with large K: when K is large you
          | need an index to find the closest point efficiently (used to
          | compute the visual vocabulary above).
          | 
          | - Neural networks: for example, it can be very tempting to put
          | some kind of LSH index on the GPU as a standard neural-network
          | layer, but if it's too small you are better off with standard
          | brute force, and if it's too big you run out of memory, so you
          | are looking for the sweet spot. But remember that if you have
          | an index with 10M elements and you sparsely update only 10
          | elements per iteration, you'll need at least 1M iterations
          | before each element has moved, so your algorithm will take a
          | long time to converge.
          | 
          | - P2P indices: tables of hashes that can be used to recover
          | similar data when querying them.
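          | 
          | For the particle case, the cell hashing looks roughly like
          | this (2D sketch):
          | 
          |     import math
          |     from collections import defaultdict
          | 
          |     def build_grid(positions, r):
          |         # hash each particle into a square cell of side r
          |         grid = defaultdict(list)
          |         for i, (x, y) in enumerate(positions):
          |             grid[(math.floor(x / r),
          |                   math.floor(y / r))].append(i)
          |         return grid
          | 
          |     def neighbors(i, positions, grid, r):
          |         # all candidates within r are in the 3x3 cell block
          |         x, y = positions[i]
          |         cx, cy = math.floor(x / r), math.floor(y / r)
          |         for dx in (-1, 0, 1):
          |             for dy in (-1, 0, 1):
          |                 for j in grid.get((cx + dx, cy + dy), ()):
          |                     if j != i and math.dist(
          |                             positions[i], positions[j]) <= r:
          |                         yield j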
         | 
          | The version of LSH described in the article is not often used,
          | because with real data you can often take advantage of some
          | hierarchical representation inside the data.
          | 
          | It's one among many space-partitioning techniques with
          | multiple overlapping partitions.
          | 
          | Depending on the constraints on your data, it can be hard to
          | make it shine. It shines best when there are large quantities
          | of data, but it consumes a lot of memory, so it may be useful
          | to have the index on disk, and ideally you'd like to be able to
          | update the index incrementally. The various existing libraries
          | work well for the specific situations they are designed to
          | handle, but quite often you are still better off rolling your
          | own space-partitioning technique for your specific constraints.
          | 
          | Also, quite often, instead of doing an approximate search and
          | having to fiddle with tuning parameters, it's better to use an
          | exact search: if you can project into Hamming space, there are
          | fast and exact algorithms (although memory-intensive) based on
          | the pigeonhole principle.
        
         | Mehdi2277 wrote:
          | The recommender-system mention is a very valuable use case that
          | I've seen at some of the largest ML-using companies in the
          | world. If you have a recommendation system over a pool of
          | millions (or far more) items of content, you will need an ANN
          | index. I would expect Google image search is using an ANN
          | index. Facebook definitely uses ANN indices. Spotify is where
          | Annoy comes from. They're pretty key at TikTok.
          | 
          | Whether you use locality-sensitive hashing or something else is
          | a separate question. My personal experience is that graph-based
          | ANN indices (HNSW is one nice one) have tended to do a bit
          | better, but LSH is competitive, so I wouldn't be strongly
          | against a design decision choosing it. One downside I've seen
          | with some ANN index libraries is that a lot of them don't
          | support incremental updates/deletions and force you to build
          | the index as a batch job. That's fine in some use cases, but
          | breaks others. The LSH-based approach Spotify uses doesn't
          | support incremental updates, but that's not because LSH can't
          | support it; Annoy just wasn't designed for it.
        
       | andyxor wrote:
        | Great post, and my favorite CS topic. LSH is particularly
        | relevant to machine learning because of its use in indexing
        | embeddings, which are now omnipresent in ML (from word2vec to
        | transformers to image and graph embeddings, etc.)
        | 
        | It is now supported in the Elasticsearch kNN index (they use
        | HNSWLIB, but you can call it a descendant of the original LSH in
        | a way).
        | 
        | Check out the ANN benchmarks [0] for a comparison of LSH
        | performance to other state-of-the-art methods like proximity
        | graphs/HNSWLIB [1] and quantization/ScaNN [2].
        | 
        | As an introduction, LSH (with MinHash) is also described in
        | detail in the book "Mining of Massive Datasets", ch. 3, "Finding
        | Similar Items"; highly recommended [3].
        | 
        | If you want to play with LSH, the Python "annoy" library is the
        | best place to start [4].
       | 
       | [0] https://github.com/erikbern/ann-benchmarks
       | 
        | [1] https://github.com/nmslib/hnswlib
        | 
        | [2] https://github.com/google-research/google-
        | research/tree/mast...
       | 
       | [3] http://infolab.stanford.edu/~ullman/mmds
       | 
       | [4] https://github.com/spotify/annoy
        
         | zimpenfish wrote:
         | http://infolab.stanford.edu/~ullman/mmds is 401 for me -
         | presumably because directory indexing is turned off.
         | 
         | http://infolab.stanford.edu/~ullman/mmds/book.pdf is the book;
         | http://infolab.stanford.edu/~ullman/mmds/bookL.pdf is the
         | hyperlinked book; and http://www.mmds.org/ seems to be the
         | official homepage.
        
       | Lichtso wrote:
       | Any relation to: https://en.wikipedia.org/wiki/Rolling_hash ?
       | 
        | Also, I wonder if hexagonal lattices would be a better choice for
        | the 2D use case. Having multiple randomly sized, shifted, and
        | rotated hexagonal lattices would come very close to:
        | https://en.wikipedia.org/wiki/Grid_cell
        
       | bvrmn wrote:
        | Are the TeX equations intended to be presented in raw form? It's
        | very hard to read without proper rendering.
        
         | encypruon wrote:
         | It renders fine on my end. Try checking if your browser is
         | blocking cdnjs.cloudflare.com for some reason.
        
           | bvrmn wrote:
           | Yep, chrome works fine!
        
       | sidcool wrote:
       | Getting a 404 'Site not Found' page.
        
         | zaarn wrote:
          | Try disabling HTTPS Everywhere or Firefox's HTTPS-Only Mode.
          | It looks like this site is badly set up and can't serve HTTPS
          | requests properly.
        
       | wcrossbow wrote:
        | This reminds me of the fuzzy simplicial sets used by UMAP. Is
        | there any relationship between the two? Could it be used to
        | accelerate the algorithm?
        
       | jtnag wrote:
        | A few years ago I made a PoC with LSH to find similar functions
        | within a keyword-driven test framework, to reduce duplicate
        | keywords. It did find duplicates with fairly high accuracy,
        | although I never got any feedback from real-life usage.
        
       | bruce343434 wrote:
        | This is pretty awesome! But I'm not a fan of the RNG involved. Is
        | there an approach where you use a circle or sphere with a falloff
        | function instead of layering random mosaics/hash functions?
        
         | Hnrobert42 wrote:
          | I was barely hanging on to a lot of the math notation, but the
          | author argues that the advantage of the randomization is that,
          | on average, the performance will be constant across all
          | datasets. That is, once you develop a sense for how the system
          | performs, you shouldn't be surprised by wild performance
          | swings.
          | 
          | If that is not a valuable feature for you, then I believe you
          | could use any clustering hash function you want, including a
          | deterministic one.
        
           | bruce343434 wrote:
            | The randomization is there to prevent bias in the sampling,
            | as they illustrate with the grid. They then overlay random
            | "grid shapes" to roughly approximate a sphere. I can't prove
            | it right now, but I'm betting that if you extended this
            | approach to infinitely many hashes you would eventually get a
            | perfect circular falloff.
            | 
            | So randomization is not the only way to prevent bias: you can
            | also avoid it from the start by cutting to the chase and
            | using a circular falloff. But what I'm asking is how one
            | would implement such a thing.
        
       | bnj wrote:
        | A little off-topic, but after enjoying the piece (nice Tufte-
        | style layout, kudos) I was digging around the other posts.
        | Disappointed that I couldn't find an RSS feed to add for the
        | future!
        
       ___________________________________________________________________
       (page generated 2021-06-24 23:02 UTC)