[HN Gopher] Introduction to Locality-Sensitive Hashing (2018)
___________________________________________________________________
Introduction to Locality-Sensitive Hashing (2018)
Author : signa11
Score : 145 points
Date : 2022-12-23 06:30 UTC (16 hours ago)
(HTM) web link (tylerneylon.com)
(TXT) w3m dump (tylerneylon.com)
| AR_1 wrote:
| A well-written article. The problem of efficient peer discovery
| in a large network of nodes will become even more pronounced in
| the near future.
|
| Another design using proximity-based hashing is a particle swarm
| optimization approach [1].
|
| [1] Proximity-based Networking: Small world overlays optimized
| with particle swarm optimization
| https://arxiv.org/pdf/2006.02006.pdf
| eruci wrote:
| I currently use a modified Hilbert curve implementation for
| nearest-neighbor searches; I'll check this out to see how it
| compares.
|
| My use case is here: https://geocode.xyz/2451481561372 ->
| https://geocode.xyz/45.17310,-75.62670 /
| https://geocode.xyz/2451481561373 ->
| https://geocode.xyz/45.17300,-75.62670
| glass_of_water wrote:
| How do you deal with false negatives when using Hilbert curves
| for nearest neighbor searches? Based on my limited
| understanding, points close in space can be far from each other
| on the curve. Do you use multiple curves or something?
| lgvld wrote:
| Thank you for sharing! Very interesting. I love visual
| explanations [1]. If I remember correctly, LSH can be used in
| computer graphics, for instance when computing fluid simulations
| (where _a lot_ of particles are involved).
|
| The author also wrote an article in which he illustrates RNN in
| an original and stunning way: https://tylerneylon.com/a/randnn/
|
| [1]: Not related, but you might also enjoy the work of Bartosz
| Ciechanowski: https://ciechanow.ski/
| abalaji wrote:
| This is a fantastic article! Though, for the first half, I do
| think folks would benefit from having a rough understanding of
| the nuances of hash maps -- if I may plug my own rough notes [1].
|
| [1]
| https://github.com/adithyabsk/datastructures/blob/main/docs/...
| tylerneylon wrote:
| Hi, I'm the author of this article. I'm grateful for the positive
| comments! I plan to explore FALCONN (pointed to by gajjanag) and
| the beautiful article of Bartosz Ciechanowski pointed out by
| lgvld.
|
| My original motivation for writing this was to explain (in a
| follow-up article) an ANN algorithm I published in 2010:
| https://epubs.siam.org/doi/pdf/10.1137/1.9781611973075.94
|
| I never wrote that up (as a non-academic article), but the fact
| that people like this article is a nudge for me to do that.
|
| The above paper gave the first LSH algorithm that
| deterministically returns all close-enough points and guarantees
| no false positives for distant-enough points. I've heard that
| other algorithms with that property have been published since. If
| I have time, I would like to one day see how that algorithm
| compares against the others in Erik Bernhardsson's benchmarks.
| ctxc wrote:
| Forgive my lack of knowledge, but I was lost the instant I
| found the first equation :(
|
| H is the hash function and x is a point (its coordinates). But
| what are R and Z?
| tylerneylon wrote:
| R is the set of real numbers.
|
| Z is the set of integers {.., -2, -1, 0, 1, 2, ...}.
|
| Z is called "Z" due to influential German mathematicians who
| chose the letter to stand for "Zahlen" ("numbers" in German).
|
| As a fun fact, the letter "N" stands for "natural numbers,"
| meaning either positive integers, or positive integers and
| zero. However, that letter (N) is _not_ standardized because
| mathematicians disagree about whether or not to include zero.
| When I separately told a couple of my professors this in grad
| school, they didn't believe me -- one of them thinking it was
| crazy for zero to be excluded, the other thinking it was
| crazy to include it!
|
| https://en.wikipedia.org/wiki/Natural_number
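|
| If it helps to make the notation concrete, here is a toy Python
| sketch (illustrative only -- not the exact function from the
| article) of a hash H that maps points in R^d to grid cells in
| Z^d, so nearby points often share a hash value:
|
|     import math
|     import random
|
|     def make_grid_hash(dim, cell_size):
|         # Random offsets so cells aren't aligned with the data.
|         offsets = [random.uniform(0, cell_size) for _ in range(dim)]
|         def h(x):
|             # x is a point in R^dim; the result is a tuple in Z^dim.
|             return tuple(math.floor((xi + oi) / cell_size)
|                          for xi, oi in zip(x, offsets))
|         return h
|
|     h = make_grid_hash(dim=2, cell_size=1.0)
|     print(h((45.1731, -75.6267)))  # e.g. (45, -76); nearby points
|                                    # often land in the same cell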
| dang wrote:
| Discussed at the time:
|
| _Introduction to Locality-Sensitive Hashing_ -
| https://news.ycombinator.com/item?id=27614381 - June 2021 (63
| comments)
| [deleted]
| zbird wrote:
| I like meself a well-written and formatted article. Thank you.
| tipsytoad wrote:
| Fantastic article! How does this compare to ANN for the use case?
|
| https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...
| gpderetta wrote:
| Reminds me of the random-projections dimensionality reduction
| scheme, which is often used with LSH.
|
| In fact, the described forest-of-trees scheme can probably be
| interpreted as an LSH.
|
| Disclaimer: I haven't touched this stuff in more than 10 years,
| so I don't know what the state of the art is now.
| uniqueuid wrote:
| I agree that random hyperplanes underpin both, but as far as I
| recall, usually only one set of random planes is used in LSH
| (i.e. one set of trees). The granularity of approximate proximity
| rests on the number of identical signs, i.e. how many planes two
| points fall on the same side of.
|
| There is another good and more technical explanation (using
| bands) in chapter 2 of _Mining of Massive Datasets_ by Leskovec,
| Rajaraman, and Ullman.
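|
| A rough Python sketch of the sign-bit idea (illustrative; the
| dimension and plane count below are arbitrary):
|
|     import numpy as np
|
|     def make_hyperplane_hash(dim, n_planes, seed=0):
|         rng = np.random.default_rng(seed)
|         # One random hyperplane (its normal vector) per row.
|         planes = rng.standard_normal((n_planes, dim))
|         def h(x):
|             # Record which side of each plane the vector x falls on.
|             return tuple((planes @ x) >= 0)
|         return h
|
|     h = make_hyperplane_hash(dim=128, n_planes=16)
|     # Two vectors with a small angle between them agree on most
|     # sign bits, so the fraction of matching bits estimates
|     # 1 - theta/pi, which tracks cosine similarity.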
| gpderetta wrote:
| As I said, it has been a long time!
|
| I have dim memories of using k random basis vectors to
| convert high-dimensional feature vectors to k dimensions,
| and doing this m times to generate multiple projections as
| part of an LSH scheme. Min-hashing might have been involved.
| senderista wrote:
| IIRC, minhashing is used to approximate Jaccard
| similarity (a set-theoretic measure), while random
| hyperplane hashing (aka simhashing) is used to approximate
| cosine similarity (a geometric/algebraic measure). So
| they solve different problems, even though some problems
| can be cast in terms of either framework.
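|
| For a concrete contrast, a minimal minhash sketch in Python
| (illustrative; salted hashes stand in for true random
| permutations):
|
|     import hashlib
|
|     def hval(salt, token):
|         # Salted hash of a token, as a big integer.
|         data = f"{salt}:{token}".encode()
|         return int(hashlib.sha1(data).hexdigest(), 16)
|
|     def minhash_signature(tokens, n_hashes=64):
|         # One "permutation" per salt: keep the minimum salted
|         # hash value over the token set.
|         return [min(hval(i, t) for t in tokens)
|                 for i in range(n_hashes)]
|
|     a = minhash_signature({"the", "quick", "brown", "fox"})
|     b = minhash_signature({"the", "quick", "brown", "dog"})
|     # The fraction of positions where the signatures agree estimates
|     # the Jaccard similarity (intersection over union) of the sets.
|     est = sum(x == y for x, y in zip(a, b)) / len(a)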
| foobar502 wrote:
| Read the article and googled a bit -- what are some more example
| use cases for LSH beyond those described?
| madiator wrote:
| Here's an example I can think of. Suppose you have a bunch of
| text documents, and you know that some documents are similar
| but not identical (e.g. plagiarized and slightly modified). You
| want to find out which documents are similar.
|
| You can first run the contents through some sort of embedding
| model (e.g. the recent OpenAI embedding model [1]), and then
| apply LSH to those embeddings. Documents that end up with the
| same LSH value will have had very similar embeddings, and thus
| very similar content.
|
| [1] https://beta.openai.com/docs/guides/embeddings
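|
| Roughly, that pipeline could look like the sketch below. The
| random "embeddings" here stand in for whatever embedding model
| you actually use, and the sign-bit hash is just one LSH family
| that suits cosine similarity on embeddings:
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     dim = 1536                               # embedding size (example)
|     planes = rng.standard_normal((16, dim))  # 16 random hyperplanes
|
|     def lsh_key(vec):
|         # Sign pattern of the projections, used as a bucket key.
|         return tuple((planes @ vec) >= 0)
|
|     # Stand-in embeddings; in practice they come from your model.
|     doc_vecs = {"doc_a": rng.standard_normal(dim),
|                 "doc_b": rng.standard_normal(dim)}
|
|     buckets = {}
|     for name, vec in doc_vecs.items():
|         buckets.setdefault(lsh_key(vec), []).append(name)
|
|     # Documents sharing a bucket are near-duplicate candidates and
|     # are the only pairs that need an exact similarity check.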
| tylerneylon wrote:
| The use case I see the most in my career is to use LSH to help
| solve the "ANN" problem = approximate nearest neighbors
| (typically with ranked results). I've seen ANN used many times
| for near-duplicate detection and in recommendation systems.
|
| Although I don't have access to the proprietary code used, it's
| most likely that an LSH algorithm is behind the scenes in every
| modern search engine (to avoid serving duplicates), many modern
| ranking systems such as Elasticsearch (because items are
| typically vectorized and retrieved based on these vectors), and
| most recommendation systems (for similar reasons as ranking).
| For example, all of these pages probably have an LSH algorithm
| at some point (either batch processing before your request, or
| in some cases real-time lookups):
|
| * Every search result page on Google
| * Every product page on Amazon (similar products)
| * All music suggestions on Spotify or similar
| * Every video recommendation from TikTok, YouTube, or Instagram
|
| etc.
| molodec wrote:
| Another interesting use case for LSH is search results caching,
| used by Amazon:
| https://www.linkedin.com/feed/update/urn:li:activity:6943348...
| https://www.amazon.science/blog/more-efficient-caching-for-p...
| senderista wrote:
| Yes, e.g. many IR systems use cosine similarity to compute
| query vector/term vector similarity, and simhashing
| approximates cosine similarity. OTOH, some IR systems instead
| use a set-theoretic measure, Jaccard similarity, which can
| be approximated by minhashing.
| gajjanag wrote:
| By the way, https://github.com/FALCONN-LIB/FALCONN contains a
| really good LSH implementation. Also see
| https://www.mit.edu/~andoni/LSH/ if you want to know more about
| the research literature.
|
| The fastest approach for Euclidean space that I know of that
| works well in practice is via Leech lattice decoding:
| https://www.semanticscholar.org/paper/Maximum-likelihood-dec...
| or https://www.semanticscholar.org/paper/Efficient-bounded-dist...
|
| It is possible to create an implementation based on the above
| that decodes 24-dimensional points to the closest Leech lattice
| vector in < 1 microsecond per point on my AMD Ryzen laptop.
| Combine it with some fast random projections (Johnson-Lindenstrauss),
| as described in the article, to form the LSH.
|
| This LSH family is unfortunately not present in FALCONN, but the
| alternatives in FALCONN are pretty good.
|
| Source: thought extensively about LSH for my PhD.
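|
| To make the "random projection, then decode" recipe concrete,
| here is a minimal sketch of just the projection step (dimensions
| are arbitrary; the Leech lattice decoder is the nontrivial part
| and is only referenced in a comment):
|
|     import numpy as np
|
|     rng = np.random.default_rng(0)
|     d = 512                         # original dimension (example)
|     # Johnson-Lindenstrauss style random projection down to 24 dims.
|     proj = rng.standard_normal((24, d)) / np.sqrt(24)
|
|     def project(x):
|         # Roughly distance-preserving map from R^d to R^24.
|         return proj @ x
|
|     # In the scheme above, the LSH value of x would then be the
|     # closest Leech lattice point to project(x), as found by a
|     # fast lattice decoder.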
| [deleted]
| gumby wrote:
| Interesting concept! Until I read it I thought locality was
| exactly what I would _not_ want in a hash function.
|
| I love it when I read something that upends my assumptions and
| opens up new possibilities.
___________________________________________________________________
(page generated 2022-12-23 23:00 UTC)