[HN Gopher] Introduction to Locality-Sensitive Hashing (2018)
___________________________________________________________________
Introduction to Locality-Sensitive Hashing (2018)
Author : signa11
Score : 220 points
Date : 2022-12-23 06:30 UTC (1 day ago)
(HTM) web link (tylerneylon.com)
(TXT) w3m dump (tylerneylon.com)
| AR_1 wrote:
| A well-written article; the problem of efficient peer
| discovery in a large network of nodes will become even more
| pronounced in the near future.
|
| Another design using proximity-based hashing is a particle
| swarm optimization approach [1].
|
| [1]
|
| Proximity-based Networking: Small world overlays optimized with
| particle swarm optimization
|
| https://arxiv.org/pdf/2006.02006.pdf
| eruci wrote:
| I currently use a modified Hilbert curve implementation for
| nearest-neighbor searches; I will check this out to see how it
| compares.
|
| My use case is here: https://geocode.xyz/2451481561372 ->
| https://geocode.xyz/45.17310,-75.62670 /
| https://geocode.xyz/2451481561373 ->
| https://geocode.xyz/45.17300,-75.62670
| glass_of_water wrote:
| How do you deal with false negatives when using Hilbert curves
| for nearest-neighbor searches? Based on my limited
| understanding, points close in space can be far from each other
| on the curve. Do you use multiple curves or something?
| eruci wrote:
| I use two curves.
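| A rough sketch of the multiple-curve idea, using a Morton
| (Z-order) curve as a stand-in for a Hilbert curve; the second
| index works on shifted copies of the points, so neighbors split
| by a discontinuity on one curve are usually adjacent on the
| other. All names and parameters here are illustrative:
|
|   import bisect
|
|   def interleave(x, y, bits=16):
|       """Morton code: interleave the bits of integer coords."""
|       code = 0
|       for i in range(bits):
|           code |= ((x >> i) & 1) << (2 * i)
|           code |= ((y >> i) & 1) << (2 * i + 1)
|       return code
|
|   def build_index(points, shift=0):
|       """Sort points along the curve after an optional shift."""
|       return sorted((interleave(x + shift, y + shift), (x, y))
|                     for x, y in points)
|
|   def candidates(indexes, shifts, query, k=3):
|       """Union of the k curve-neighbors from each index."""
|       found = set()
|       for index, shift in zip(indexes, shifts):
|           code = interleave(query[0] + shift, query[1] + shift)
|           pos = bisect.bisect_left(index, (code, query))
|           lo, hi = max(0, pos - k), min(len(index), pos + k)
|           found.update(pt for _, pt in index[lo:hi])
|       return found
|
|   points = [(3, 4), (3, 5), (10, 2), (8, 8), (4, 4)]
|   shifts = [0, 5]
|   indexes = [build_index(points, s) for s in shifts]
|   print(candidates(indexes, shifts, (3, 4)))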
| lgvld wrote:
| Thank you for sharing! Very interesting. I love visual
| explanations [1]. If I remember correctly, LSH can be used in
| computer graphics, for instance when computing fluid simulations
| (where _a lot_ of particles are involved).
|
| The author also wrote an article in which he illustrates RNNs in
| an original and stunning way: https://tylerneylon.com/a/randnn/
|
| [1]: Not related, but you might also enjoy the work of Bartosz
| Ciechanowski: https://ciechanow.ski/
| abalaji wrote:
| This is a fantastic article! For the first half, though, I
| think folks would benefit from a rough understanding of the
| nuances of hash maps; if I may plug my own rough notes [1]
|
| [1]
| https://github.com/adithyabsk/datastructures/blob/main/docs/...
| bitL wrote:
| The Reformer NLP model uses LSH to reduce the size of the
| transformer/attention layer while keeping the same accuracy,
| making it able to model much longer context lengths in NLP or
| do image/video generation:
|
| https://ai.googleblog.com/2020/01/reformer-efficient-transfo...
| tylerneylon wrote:
| Hi, I'm the author of this article. I'm grateful for the positive
| comments! I plan to explore FALCONN (pointed to by gajjanag) and
| the beautiful article of Bartosz Ciechanowski pointed out by
| lgvld.
|
| My original motivation for writing this was to explain (in a
| follow-up article) an ANN algorithm I published in 2010:
| https://epubs.siam.org/doi/pdf/10.1137/1.9781611973075.94
|
| I never wrote that up (as a non-academic article), but the fact
| that people like this article is a nudge for me to do that.
|
| The above paper described the first LSH algorithm that
| deterministically returns all close-enough points and guarantees
| no false positives for distant-enough points. I've heard there
| are also other algorithms with that property published since. If
| I have time, I would like to one day see how that algorithm
| compares against the others in Erik Bernhardsson's benchmarks.
| mhuffman wrote:
| This is a really well-done article! Very approachable for a
| non-math-expert. If it is useful at all, I noticed while
| reading that there are two misspellings. "moives" for "moves"
| and "tht" for "that". I am looking forward to more of your
| articles.
| ctxc wrote:
| Forgive my lack of knowledge, but I was lost the instant I
| found the first equation :(
|
| H is the hash function and x are coordinates. But what are R
| and Z?
| tylerneylon wrote:
| R is the set of real numbers.
|
| Z is the set of integers {.., -2, -1, 0, 1, 2, ...}.
|
| Z is called "Z" due to influential German mathematicians who
| chose the letter to stand for "Zahlen" ("numbers" in German).
|
| As a fun fact, the letter "N" stands for "natural numbers,"
| meaning either positive integers, or positive integers and
| zero. However, that letter (N) is _not_ standardized because
| mathematicians disagree about whether or not to include zero.
| When I separately told a couple of my professors this in grad
| school, they didn't believe me -- one of them thinking it was
| crazy for zero to be excluded, the other thinking it was
| crazy to include it!
|
| https://en.wikipedia.org/wiki/Natural_number
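|
| For intuition, a hash with that signature takes points whose
| coordinates are real numbers and maps them to integer grid
| labels, e.g. by rounding down. A minimal sketch of that idea
| (the cell size and names are mine, not the article's):
|
|   import math
|
|   def grid_hash(point, cell_size=1.0):
|       """Map a point in R^2 to the cell in Z^2 containing it;
|       nearby points often land in the same cell."""
|       x, y = point
|       return (math.floor(x / cell_size),
|               math.floor(y / cell_size))
|
|   print(grid_hash((3.7, -1.2)))  # -> (3, -2)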
| ctxc wrote:
| Thank you!
| fithisux wrote:
| Is there any blog where I can learn more on this fascinating
| area? Especially new discoveries?
|
| I've read the relevant part of the book "Mining Massive
| Datasets" and lots of old notes about "The Probabilistic
| Method".
|
| But new things?
| _a_a_a_ wrote:
| @author a question if you don't mind. I saw this article and it
| was very interesting, but I get the impression (or heard
| somewhere) that LSH is actually better suited to much higher-
| dimensional problems than 2D. Put another way, I think LSH is
| perhaps inappropriate for a two-dimensional case like this
| example, where you could solve the problem more simply, but
| that simpler approach would not scale to higher dimensions
| easily or at all, which is where LSH would work. Is this clear,
| and am I correct in my understanding?
|
| Thanks
| flashfaffe2 wrote:
| Maybe this is off topic, but how does this algorithm adapt to
| time series?
|
| I mean, you want to cluster data which might be autocorrelated...
| Is this case taken into account, or do you need to preprocess
| your data beforehand?
| dang wrote:
| Discussed at the time:
|
| _Introduction to Locality-Sensitive Hashing_ -
| https://news.ycombinator.com/item?id=27614381 - June 2021 (63
| comments)
| [deleted]
| zbird wrote:
| I like meself a well-written and formatted article. Thank you.
| tipsytoad wrote:
| Fantastic article! How does this compare to ANN for the use case?
|
| https://erikbern.com/2015/10/01/nearest-neighbors-and-vector...
| gpderetta wrote:
| Reminds me of the random projections dimensionality reduction
| scheme, which is often used with LSH.
|
| In fact the described forest-of-trees scheme can probably be
| interpreted as an LSH.
|
| Disclaimer: I haven't touched this stuff for more than 10
| years. Don't know what's the state of the art now.
| uniqueuid wrote:
| I agree that random hyperplanes underpin both, but as far as
| I recall, usually only one set of random planes is used in
| LSH (i.e. one set of trees). The granularity of approximate
| proximity rests in the number of identical signs, i.e. being
| on the same side of the plane.
|
| There is another good and more technical explanation (using
| bands) in chapter 3 of Mining Massive Datasets by Leskovec,
| Rajaraman and Ullman.
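|
| A minimal sketch of that sign-based scheme: one set of random
| hyperplanes, with proximity estimated by how many sign bits two
| vectors share (all sizes here are illustrative):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   n_planes, dim = 16, 64
|   planes = rng.standard_normal((n_planes, dim))
|
|   def signature(v):
|       """Bit i records which side of plane i the vector is on."""
|       return (planes @ v) > 0
|
|   def matching_signs(a, b):
|       return int(np.sum(signature(a) == signature(b)))
|
|   u = rng.standard_normal(dim)
|   near = u + 0.1 * rng.standard_normal(dim)  # small perturbation
|   far = rng.standard_normal(dim)
|   print(matching_signs(u, near), matching_signs(u, far))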
| gpderetta wrote:
| As I said, it has been a long time!
|
| I have dim memories of using k random basis vectors to
| convert high-dimensionality feature vectors to k
| dimensions, and doing this m times to generate multiple
| projections as part of an LSH scheme. Min-hashing might
| have been involved.
| senderista wrote:
| IIRC, minhashing is used to approximate Jaccard
| similarity (a set-theoretic measure), while random
| hyperplanes (aka simhashing) are used to approximate
| cosine similarity (a geometric/algebraic measure). So
| they solve different problems, even though some problems
| can be cast in terms of either framework.
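|
| For the Jaccard side, a minimal minhash sketch: the chance
| that two sets agree on one min-hash equals their Jaccard
| similarity, so averaging over many hash functions estimates
| it (hash choice and sizes here are illustrative):
|
|   import random
|
|   random.seed(0)
|   M = 2**31 - 1
|   coeffs = [(random.randrange(1, M), random.randrange(M))
|             for _ in range(128)]
|
|   def minhash(s):
|       return [min((a * x + b) % M for x in s)
|               for a, b in coeffs]
|
|   def estimated_jaccard(s, t):
|       ms, mt = minhash(s), minhash(t)
|       return sum(x == y for x, y in zip(ms, mt)) / len(ms)
|
|   A = set(range(0, 100))
|   B = set(range(20, 120))  # true Jaccard = 80/120 ~ 0.67
|   print(estimated_jaccard(A, B))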
| foobar502 wrote:
| Read the article and googled a bit - what are some more example
| use cases for LSH beyond those described?
| nl wrote:
| Face matching. You use embeddings created by DeepFace or
| MediaPipe, put them in an LSH, and then images of the same
| face end up close to each other.
| madiator wrote:
| Here's an example I can think of. Suppose you have a bunch of
| text documents, and you know that some documents are similar
| but not identical (e.g. plagiarized and slightly modified). You
| want to find out which documents are similar.
|
| You can first run the contents through some sort of embedding
| model (e.g. the recent OpenAI embedding model [1]), and then
| apply LSH on those embeddings. The documents that have the same
| LSH value would have had very similar embeddings, and thus very
| similar content.
|
| [1] https://beta.openai.com/docs/guides/embeddings
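|
| A rough sketch of that pipeline; real embeddings would come
| from a model (the OpenAI API is not called here), so the
| vectors below are hand-made, with doc0 and doc1 nearly equal:
|
|   from collections import defaultdict
|   import numpy as np
|
|   rng = np.random.default_rng(1)
|   emb = {"doc0": rng.standard_normal(32)}
|   emb["doc1"] = emb["doc0"] + 0.05 * rng.standard_normal(32)
|   emb["doc2"] = rng.standard_normal(32)  # unrelated document
|
|   planes = rng.standard_normal((8, 32))
|
|   def lsh_key(v):
|       """Sign pattern against 8 random hyperplanes."""
|       return tuple(bool(b) for b in (planes @ v) > 0)
|
|   buckets = defaultdict(list)
|   for name, v in emb.items():
|       buckets[lsh_key(v)].append(name)
|   print(buckets)  # doc0 and doc1 very likely share a bucket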
| ww520 wrote:
| Geohash is one use case. A latitude/longitude address is
| converted to a geohash, which has the nice property that
| nested rectangles share the same hash prefix.
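|
| A minimal geohash encoder showing the prefix property (this is
| the standard algorithm, but treat the code as a sketch):
|
|   BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"
|
|   def geohash(lat, lon, length=9):
|       lat_lo, lat_hi = -90.0, 90.0
|       lon_lo, lon_hi = -180.0, 180.0
|       even, bits, code, out = True, 0, 0, []
|       while len(out) < length:
|           if even:  # even bit positions refine longitude
|               mid = (lon_lo + lon_hi) / 2
|               if lon >= mid:
|                   code, lon_lo = code * 2 + 1, mid
|               else:
|                   code, lon_hi = code * 2, mid
|           else:     # odd bit positions refine latitude
|               mid = (lat_lo + lat_hi) / 2
|               if lat >= mid:
|                   code, lat_lo = code * 2 + 1, mid
|               else:
|                   code, lat_hi = code * 2, mid
|           even, bits = not even, bits + 1
|           if bits == 5:  # 5 bits per base32 character
|               out.append(BASE32[code])
|               bits, code = 0, 0
|       return "".join(out)
|
|   # Two nearby points from the geocode.xyz example upthread:
|   print(geohash(45.17310, -75.62670))
|   print(geohash(45.17300, -75.62670))  # shares a long prefix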
| magic123_ wrote:
| In Stuff Made Here's most recent video, he uses LSH to solve a
| 4000 pieces all-white jigsaw puzzle:
| https://www.youtube.com/watch?v=WsPHBD5NsS0
| zamalek wrote:
| Collision detection in games. Naively this problem is O(n^2)
| because you have to check every object against every other
| object.
|
| You can _almost_ check only objects that inhabit the same
| buckets (there are caveats; usually neighboring buckets are
| also checked), eliminating objects that couldn't possibly
| collide by virtue of e.g. being on the other side of the map.
| Of course this is still O(n^2) in the worst case, because
| every object could be in the same bucket (but that's unlikely).
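|
| A minimal sketch of that bucketing scheme (grid cell size and
| names are illustrative):
|
|   from collections import defaultdict
|   from itertools import combinations
|
|   CELL = 10.0
|
|   def cell_of(x, y):
|       return (int(x // CELL), int(y // CELL))
|
|   def candidate_pairs(objects):
|       """objects: name -> (x, y). Pairs worth a narrow-phase
|       test: same cell or a neighboring cell."""
|       grid = defaultdict(list)
|       for name, (x, y) in objects.items():
|           grid[cell_of(x, y)].append(name)
|       pairs = set()
|       for (cx, cy), members in grid.items():
|           nearby = list(members)
|           for dx in (-1, 0, 1):
|               for dy in (-1, 0, 1):
|                   if (dx, dy) != (0, 0):
|                       nearby += grid.get((cx + dx, cy + dy), [])
|           for a, b in combinations(sorted(set(nearby)), 2):
|               pairs.add((a, b))
|       return pairs
|
|   objs = {"a": (1, 1), "b": (2, 3), "c": (95, 95)}
|   print(candidate_pairs(objs))  # "c" never paired with a or b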
| tylerneylon wrote:
| The use case I see the most in my career is to use LSH to help
| solve the "ANN" problem = approximate nearest neighbors
| (typically with ranked results). I've seen ANN used many times
| for near-duplicate detection and in recommendation systems.
|
| Although I don't have access to the proprietary code used, it's
| most likely that an LSH algorithm is behind the scenes in every
| modern search engine (to avoid serving duplicates), many modern
| ranking systems such as Elasticsearch (because items are
| typically vectorized and retrieved based on these vectors), and
| most recommendation systems (for similar reasons as ranking).
| For example, all of these pages probably have an LSH algorithm
| at some point (either batch processing before your request, or
| in some cases real-time lookups):
|
| * Every search result page on Google
| * Every product page on Amazon (similar products)
| * All music suggestions on Spotify or similar
| * Every video recommendation from TikTok, YouTube, or Instagram
|
| etc.
| molodec wrote:
| Another interesting use case for LSH is search results
| caching. Used by Amazon:
| https://www.linkedin.com/feed/update/urn:li:activity:6943348...
| https://www.amazon.science/blog/more-efficient-caching-for-p...
| senderista wrote:
| Yes, e.g. many IR systems use cosine similarity to compute
| query vector/term vector similarity, and simhashing
| approximates cosine similarity. OTOH, some IR systems instead
| use a set-theoretic measure, Jaccard similarity, which can
| be approximated by minhashing.
| gajjanag wrote:
| By the way, https://github.com/FALCONN-LIB/FALCONN contains a
| really good LSH implementation. Also see
| https://www.mit.edu/~andoni/LSH/ if you want to know more about
| the research literature.
|
| The fastest approach for Euclidean space that I know of that
| works well in practice is via Leech lattice decoding:
| https://www.semanticscholar.org/paper/Maximum-likelihood-dec...
| or https://www.semanticscholar.org/paper/Efficient-bounded-dist...
|
| It is possible to create an implementation based on the above
| that decodes 24-dimensional points to the closest Leech lattice
| vector in < 1 microsecond per point on my AMD Ryzen laptop.
| Combine this with some fast random projections/Johnson-
| Lindenstrauss transforms, as described in the article, to form
| the LSH.
|
| This LSH family is unfortunately not present in FALCONN, but the
| alternatives in FALCONN are pretty good.
|
| Source: thought extensively about LSH for my PhD.
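|
| Not the lattice decoder itself, but a sketch of the random
| projection / Johnson-Lindenstrauss step mentioned above: a
| Gaussian projection down to 24 dimensions that approximately
| preserves pairwise distances (sizes are illustrative):
|
|   import numpy as np
|
|   rng = np.random.default_rng(0)
|   d, k = 512, 24
|   P = rng.standard_normal((k, d)) / np.sqrt(k)
|
|   u, v = rng.standard_normal(d), rng.standard_normal(d)
|   print(np.linalg.norm(u - v))          # original distance
|   print(np.linalg.norm(P @ u - P @ v))  # roughly preserved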
| [deleted]
| sendfoods wrote:
| FALCONN doesn't look maintained, unfortunately
| gumby wrote:
| Interesting concept! Until I read it I thought locality was
| exactly what I would _not_ want in a hash function.
|
| I love it when I read something that upends my assumptions and
| opens up new possibilities.
| twelve40 wrote:
| I ran into lunatics who were asking things like these at a
| programming job interview at a prominent company. There should be
| a moral to this but I don't know what it is.
___________________________________________________________________
(page generated 2022-12-24 23:02 UTC)