[HN Gopher] Finding near-duplicates with Jaccard similarity and ...
___________________________________________________________________
Finding near-duplicates with Jaccard similarity and MinHash
Author : brianyu8
Score : 159 points
Date : 2024-07-04 05:04 UTC (17 hours ago)
(HTM) web link (blog.nelhage.com)
(TXT) w3m dump (blog.nelhage.com)
| BiteCode_dev wrote:
| I worked with a client that implemented their own Python version
| of this to deduplicate citizen entries in a big French government
| database. It worked well.
|
| Of course, nowadays I would probably just tell them to use
| datasketch (https://pypi.org/project/datasketch/).
|
| With this trip down memory lane, I looked around a little and
| noticed people are still creating new stuff on the topic, e.g.:
|
| https://pypi.org/project/rensa/
|
| Which is basically a more specialized but faster version of
| datasketch's MinHash, written in Rust, with a little Python on top.
| RobinL wrote:
| For deduplicating people, the Fellegi-Sunter model is also a
| powerful approach. Splink[0] is a free Python library that
| implements this for big datasets. You could probably combine
| parts of both approaches as well. Full disclosure: I am the
| lead author.
|
| I've also written up some interactive tutorials on how the
| method works [1] if anyone's interested.
|
| [0]https://github.com/moj-analytical-services/splink
| [1]https://www.robinlinacre.com/intro_to_probabilistic_linkage/
| BiteCode_dev wrote:
| Interesting.
|
| What would you say are the main differences with other
| approaches?
| RobinL wrote:
| The Fellegi-Sunter model is able to estimate the importance
| of different types of information from the data itself
| (i.e. unsupervised learning).
|
| For instance, a match on a date of birth column lends a
| greater weight of evidence in favour of a match than a
| match on first name (since dob has higher cardinality).
|
| The method is also able to estimate weights for fuzzy
| matches (how much evidence a close match on dob with one
| character difference lends in favour of a match), and also
| how much evidence a mismatch lends against a match.
|
| For instance, if you have very high data quality on gender,
| then a match on gender doesn't tell you much, but a
| mismatch on gender is quite strong evidence against the
| idea that two records match.
|
| I have a blog post here that delves into this a bit more:
| https://www.robinlinacre.com/fellegi_sunter_accuracy/
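|
| A rough sketch of the core idea in Python (toy, hand-picked match
| weights rather than weights estimated from the data, and not
| Splink's actual API):
|
|     # Toy per-column match weights (log2 Bayes factors); in practice
|     # these are estimated from the data rather than hard-coded.
|     match_weights = {
|         ("dob", "exact"): 6.0,        # high cardinality -> strong evidence
|         ("dob", "fuzzy"): 4.0,        # one character off still counts
|         ("first_name", "exact"): 3.0,
|         ("gender", "exact"): 0.5,     # matches are common, weak evidence
|         ("gender", "mismatch"): -5.0, # a mismatch counts strongly against
|     }
|     prior_weight = -10.0  # log2 odds that a random pair is a match
|
|     def match_score(comparisons):
|         # Prediction is just the prior plus the sum of match weights
|         return prior_weight + sum(match_weights[c] for c in comparisons)
|
|     score = match_score([("dob", "exact"), ("first_name", "exact"),
|                          ("gender", "mismatch")])
|     probability = 1 / (1 + 2 ** -score)  # convert log2 odds to a probability
|     print(score, probability)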
| BiteCode_dev wrote:
| Ok, so you get better accuracy if your datasets have
| obviously weighted fields. Do you pay for that in
| performance, and if so, how much?
| RobinL wrote:
| The performance is pretty good because the prediction is
| ultimately just adding up match weights. Much of the
| performance is dictated by:
|
| (1) The blocking approach you choose (how wide you cast
| the net in searching for matches). This is actually
| somewhere where minhash can be used in conjunction with it.
|
| (2) Whether you choose to use complex fuzzy matching
| functions, and how many - this is chosen by the user.
|
| There are some benchmarking results here:
| https://www.robinlinacre.com/fast_deduplication/
|
| Overall it's an approach which makes a pretty good trade-off
| between speed and accuracy. That's why it's used by many
| national stats institutes (UK, US, Aus, Germany, etc.) -
| because it's capable of working on country-population-sized
| datasets.
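|
| For intuition, blocking just means only scoring candidate pairs
| that share some cheap key; a minimal sketch (the records and the
| choice of dob as the blocking key are made up):
|
|     from collections import defaultdict
|     from itertools import combinations
|
|     records = [
|         {"id": 1, "first_name": "Ann", "dob": "1980-01-02"},
|         {"id": 2, "first_name": "Anne", "dob": "1980-01-02"},
|         {"id": 3, "first_name": "Bob", "dob": "1975-06-30"},
|     ]
|
|     # Group records by date of birth; only records sharing a dob
|     # become candidate pairs, so we never score all n^2 comparisons.
|     blocks = defaultdict(list)
|     for r in records:
|         blocks[r["dob"]].append(r)
|
|     candidate_pairs = [
|         pair for block in blocks.values() for pair in combinations(block, 2)
|     ]
|     print(candidate_pairs)  # only the (Ann, Anne) pair survives blocking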
| bugglebeetle wrote:
| Wonderful library and a really great set of explainers!
| Thanks for sharing this. Wish I had this for quite a few past
| projects...
| molodec wrote:
| There is also gaoya (I am the author), which is also written
| in Rust and has Python bindings. datasketch is great, but it is
| not performant enough for my use case. gaoya is used in a
| production system for large-scale clustering.
| https://github.com/serega/gaoya
| is_true wrote:
| I had to do this with sports teams, but I used Levenshtein for
| the names. I ended up creating a vector with other data (country,
| gender) and using that vector to calculate the distance
| (similarity).
| vivzkestrel wrote:
| I have this problem right now in Postgres. I have 600,000
| feed_items with the schema (feed_item_id uuid, author varchar,
| content text, guid varchar, link varchar, title varchar, summary
| text, feed_id integer). The content and summary columns, especially
| for some of the news items, are highly similar but not equal. For
| any given 2 such news items, I am trying to cut them down to 1.
| Any ideas?
| RobinL wrote:
| One useful technique here could be to use text embeddings and
| cosine similarity:
| https://simonwillison.net/2023/Oct/23/embeddings/
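|
| A minimal sketch of that idea (sentence-transformers is just one
| way to get embeddings; the model name and example texts are
| placeholders):
|
|     import numpy as np
|     from sentence_transformers import SentenceTransformer
|
|     model = SentenceTransformer("all-MiniLM-L6-v2")
|     texts = [
|         "Fed raises interest rates by 25 basis points",
|         "Federal Reserve hikes rates 0.25%",
|     ]
|
|     # normalize_embeddings=True gives unit vectors, so the dot
|     # product is the cosine similarity.
|     emb = model.encode(texts, normalize_embeddings=True)
|     similarity = float(np.dot(emb[0], emb[1]))
|     print(similarity)  # values near 1.0 suggest near-duplicates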
| swasheck wrote:
| Love this, and I have been using TF-IDF for embeddings and
| various measures of similarity for some personal pet
| projects. One thing I came across in my research is that
| cosine similarity was more useful for vectors of different
| lengths and that Euclidean distance was useful for vectors of
| similar length, but Simon alludes to a same-length
| requirement. I'm not formally trained in this area, so I was
| hoping someone could shed some light on this for me.
| dleeftink wrote:
| I've implemented a minhash-like system in BigQuery, and was
| able to calculate cosine similarities (within a certain
| similarity threshold) between all Stack Overflow posts in
| reasonable time. To give you a broad idea of the procedure:
|
| 1. Concatenate and split all text fields into an array of
| ngrams (2...n chars)
|
| 2. Declare two global arrays A & B of length n and fill them
| with random 32-64 bit integers
|
| 3. Hash each ngram to a 32-64 bit integer, multiply the
| resulting hash by each random value in array A, and modulo the
| resulting output with each random value in array B. Take the
| minimum value. Per row, your aim is to end up with an array of
| 'minhashed' integers of equal length to the arrays in step 2.
| If you declare the global arrays to be length of 64, the
| minhashed array for each row will also be of length 64.
|
| 4. Bucketise the resulting hash arrays by summing N consecutive
| minhash values using a window function (e.g. sum every 4
| consecutive values)
|
| If all went well, you can now unnest these arrays (call them
| 'source rows') and self-join the dataset on each bucketed
| minhash value (resulting in an additional column of 'target
| rows'). Group by the source and target columns and count the
| occurrences to get an indication of how likely it is that two
| rows are similar.
|
| In essence, the more often two items hash to the same bucket,
| the more similar they are, and it's up to you to define the
| cut-off from where to calculate actual pairwise Jaccard (or
| cosine) similarities between items.
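|
| A rough Python translation of the procedure above (parameter
| choices and the hash function are arbitrary):
|
|     import hashlib
|     import random
|
|     NUM_HASHES = 64  # length of arrays A and B (step 2)
|     BAND_SIZE = 4    # minhash values summed per bucket (step 4)
|     random.seed(42)
|     A = [random.randrange(1, 2**32) for _ in range(NUM_HASHES)]
|     B = [random.randrange(1, 2**32) for _ in range(NUM_HASHES)]
|
|     def ngrams(text, n=3):
|         return {text[i:i + n] for i in range(len(text) - n + 1)}
|
|     def h(gram):
|         # Stable 32-bit hash of an ngram (step 3)
|         return int.from_bytes(hashlib.md5(gram.encode()).digest()[:4], "big")
|
|     def minhash(text):
|         hashes = [h(g) for g in ngrams(text)]
|         # One minimum per (a, b) pair, mirroring the (hash * a) % b scheme
|         return [min((x * a) % b for x in hashes) for a, b in zip(A, B)]
|
|     def buckets(signature):
|         # Sum each band of BAND_SIZE consecutive minhash values (step 4)
|         return [sum(signature[i:i + BAND_SIZE])
|                 for i in range(0, NUM_HASHES, BAND_SIZE)]
|
|     sig = buckets(minhash("the quick brown fox"))
|     # Rows sharing many bucket values are candidates for an exact
|     # pairwise Jaccard (or cosine) comparison.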
| probably_wrong wrote:
| If you think two items cover the same highly similar _keywords_
| then Jaccard distance should work.
|
| If you think two items share highly similar _text_ then give
| Levenshtein distance a try.
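|
| A quick illustration of the difference (word-level Jaccard vs a
| hand-rolled edit distance; in practice you'd likely use a library
| such as python-Levenshtein):
|
|     def jaccard(a, b):
|         # Order-insensitive overlap of the keyword sets
|         a, b = set(a.lower().split()), set(b.lower().split())
|         return len(a & b) / len(a | b)
|
|     def levenshtein(a, b):
|         # Classic dynamic-programming edit distance over characters
|         prev = list(range(len(b) + 1))
|         for i, ca in enumerate(a, 1):
|             cur = [i]
|             for j, cb in enumerate(b, 1):
|                 cur.append(min(prev[j] + 1, cur[j - 1] + 1,
|                                prev[j - 1] + (ca != cb)))
|             prev = cur
|         return prev[-1]
|
|     print(jaccard("rates hiked by the fed", "the fed hiked rates"))  # 0.8
|     print(levenshtein("rates hiked", "rates hikes"))                 # 1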
| kordlessagain wrote:
| Have an LLM generate a reverse index for the entries, but force
| it to keep the cardinality low. Then you can use Jaccard
| similarity.
| pkeenan11 wrote:
| Holy synchronicity Batman! I just implemented a minhash system
| that some may find interesting.
|
| The problem is to find (pseudo) inverses of many different proper
| submatrices of a big square matrix.
|
| Certain matrix identities (Woodbury, Banachiewicz) allow you to
| update the inverse of a "close" submatrix to cheaply calculate a
| new one.
|
| So you store the inverses you've calculated already, with their
| row/column indices as keys. Then for each new submatrix you want
| to find a close already-computed inverse to update from.
|
| This is solved with minhashing. You minwise hash the indices so
| that close matrices are likely to have the same hash.
|
| In my particular implementation I did a "multi-resolution" hash
| so that I can change the selectivity of the search as the
| number of prior computed inverses grows.
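|
| Not the parent's code, but the index-hashing part might look
| roughly like this (the cache layout and seeds are made up):
|
|     import random
|
|     NUM_HASHES = 8
|     random.seed(0)
|     SEEDS = [random.randrange(2**32) for _ in range(NUM_HASHES)]
|
|     def minhash_indices(indices):
|         # MinHash signature of a set of row/column indices; similar
|         # index sets are likely to agree in many signature positions.
|         return tuple(min(hash((seed, i)) & 0xFFFFFFFF for i in indices)
|                      for seed in SEEDS)
|
|     cache = {}  # signature -> (index set, precomputed inverse)
|
|     def find_close(indices):
|         # Pick the cached inverse whose signature agrees with ours in
|         # the most positions; update it via Woodbury/Banachiewicz.
|         sig = minhash_indices(indices)
|         best, best_overlap = None, 0
|         for other_sig, entry in cache.items():
|             overlap = sum(a == b for a, b in zip(sig, other_sig))
|             if overlap > best_overlap:
|                 best, best_overlap = entry, overlap
|         return best  # None means compute the inverse from scratch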
| tpoacher wrote:
| One thing many people miss about set-based metrics like the
| Jaccard similarity (aka Tanimoto coefficient) and F1 score (aka
| Dice coefficient) is that they can also be used identically with
| fuzzy sets.
|
| The only complication is that you then need to choose a suitable
| T-Norm / T-Conorm pair, which express the concept of intersection
| and union for fuzzy sets, and there's an infinite family of them.
| But that's a good thing, since you get to pick the pair with your
| desired semantics.
|
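| For the common min/max pair, for example, a fuzzy Jaccard is just
| (a sketch, not the exact formulation from the papers below):
|
|     import numpy as np
|
|     def fuzzy_jaccard(a, b):
|         # Jaccard / Tanimoto for fuzzy membership values in [0, 1],
|         # using min as the T-norm (intersection) and max as the
|         # T-conorm (union).
|         a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
|         return np.minimum(a, b).sum() / np.maximum(a, b).sum()
|
|     # Probabilistic segmentation vs fuzzy ground truth, no 0.5
|     # thresholding needed
|     seg = [0.9, 0.6, 0.2, 0.0]
|     gt = [1.0, 0.5, 0.4, 0.0]
|     print(fuzzy_jaccard(seg, gt))  # 0.8
|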
| I wrote about this ([0][1]) in the context of validating medical
| image segmentations when both the segmentation and ground truth
| are probabilistic/fuzzy rather than binary masks.
|
| Otherwise what most people do instead is simply threshold at
| 0.5 to obtain binary sets for use with the binary variants of the
| Jaccard / Dice coefficients. Which apparently decreases the
| precision of your _validation_ operator by 2 orders of magnitude.
| It's like, you publish your algorithm claiming it's better than
| SOTA by 0.001, ignoring the fact that your validation operator
| has an error margin of 0.1 ...
|
| [0]
| https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
|
| [1]
| https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
| derefr wrote:
| As a document clustering / dataset deduplication technique, how
| well does "throwing ML at the problem" (e.g. using a pre-trained
| LLM encoder to generate vector embeddings of your documents, and
| then sticking those vectors into a vector DB and doing k-means to
| cluster the vectors) compare, quality-wise and performance-wise,
| to these simpler discrete-algorithm methods?
| motohagiography wrote:
| There was a really simple string sort-by-similarity one-liner I
| can't find in my history, but the basic idea was to take a
| wordlist, split each word in half, reverse the halves,
| concatenate them back together, sort the resulting strings, then
| flip the strings back to their original words.
|
| This simple shuffle was a poor man's similarity sort that I used
| to find sound-alike words. The thing about the quality of
| similarity is that it's hard to know for sure whether it worked,
| but this was sufficient.
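|
| A rough reconstruction in Python (one reading of the shuffle: put
| the second half of each word first, so that words sharing an
| ending sort next to each other):
|
|     def shuffle_key(word):
|         # Swap the two halves of the word
|         mid = len(word) // 2
|         return word[mid:] + word[:mid]
|
|     words = ["night", "note", "fright", "light", "bright"]
|     # Sort on the shuffled form, then read the original words back
|     for _, word in sorted((shuffle_key(w), w) for w in words):
|         print(word)  # bright, fright, light, night, note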
___________________________________________________________________
(page generated 2024-07-04 23:00 UTC)