[HN Gopher] Finding near-duplicates with Jaccard similarity and ...
       ___________________________________________________________________
        
       Finding near-duplicates with Jaccard similarity and MinHash
        
       Author : brianyu8
       Score  : 159 points
       Date   : 2024-07-04 05:04 UTC (17 hours ago)
        
 (HTM) web link (blog.nelhage.com)
 (TXT) w3m dump (blog.nelhage.com)
        
       | BiteCode_dev wrote:
       | I worked with a client that implemented their own Python version
       | of this to deduplicate citizen entries in a big french gov
       | database. It worked well.
       | 
       | Of course, nowaday I would probably just tell them to use
       | datasketch (https://pypi.org/project/datasketch/).
       | 
       | With this trip to memory lane, I looked around a little, and
       | noticed people are still creating new stuff on the topic. E.G:
       | 
       | https://pypi.org/project/rensa/
       | 
       | Which is basically a more specialized but faster version of
       | datasketch minhash, written in rust, with a little python on top.
        
         | RobinL wrote:
         | For deduplicating people, the Fellegi Sunter model is also a
         | powerful approach. Splink[0] is a free Python library that
         | implements this for big datasets. Probably you could combine
         | parts of both approaches as well. Full disclosure: I am the
         | lead author.
         | 
         | I've also written up some interactive tutorials on how the
         | method works [1] if anyone's interested
         | 
         | [0]https://github.com/moj-analytical-services/splink
         | [1]https://www.robinlinacre.com/intro_to_probabilistic_linkage/
        
           | BiteCode_dev wrote:
           | Interesting.
           | 
           | What would you say are the main differences with other
           | approaches?
        
             | RobinL wrote:
             | The Fellegi Sunter model is able to estimate the importance
             | of different types of information from the data itself
             | (i.e. unsupervised learning).
             | 
             | For instance, a match on a date of birth column lends a
             | greater weight of evidence in favour of a match than a
             | match on first name (since dob has higher cardinality).
             | 
             | The method is also able to estimate weights for fuzzy
             | matches (how much evidence in favour of a match is close
             | match on dob with one character difference), and also how
             | much evidence against a match a mismatch is.
             | 
             | For instance, if you have very high data quality on gender,
             | then a match on gender doesn't tell you much, but a
             | mismatch on gender is quite strong evidence against the
             | idea two records match.
             | 
             | I have a blog post here that delves into this a bit more:
             | https://www.robinlinacre.com/fellegi_sunter_accuracy/
        
               | BiteCode_dev wrote:
               | Ok, so you get better accuracy if you datasets have
               | obviously weighted fields. Do you pay that in perfs, and
               | if yes how much?
        
               | RobinL wrote:
               | The performance is pretty good because the prediction is
               | ultimately just adding up match weights. Much of the
               | performance is dictated by
               | 
               | (1) The blocking approach you choose (how wide you cast
               | the net in searching for matches. This is actually
               | somewhere were minhash can be used in conjunction
               | 
               | (2) whether you choose to use complex fuzzy matching
               | functions and how many - this is chosen by the user.
               | 
               | There's some benchmarking results here:
               | https://www.robinlinacre.com/fast_deduplication/
               | 
               | Overall it's an approach which makes a pretty good trade
               | off between speed and accuracy. That's why it's used by
               | many national stats institutes (UK, US, Aus, Germany
               | etc.) - because it's capable of working on country-
               | population sized datasets.
        
           | bugglebeetle wrote:
           | Wonderful library and a really great set of explainers!
           | Thanks for sharing this. Wish I had this for quite a few past
           | projects...
        
         | molodec wrote:
         | There is also gaoya ( I am the author ), which is also written
         | in Rust and has python bindings. datasketch is great, but it is
         | not performant enough for my use case. gaoya is used in a
         | production system for large scale clustering
         | https://github.com/serega/gaoya
        
       | is_true wrote:
       | I had to do this with sport teams but I used levenshtein for the
       | names. I ended up creating a vector with other data (country,
       | gender) and using that vector to calculate the distance
       | (similarity).
        
       | vivzkestrel wrote:
       | I have this problem right now in postgres. I have 600000
       | feed_items with the schema (feed_item_id uuid, author varchar,
       | content text, guid varchar, link varchar, title varchar, summary
       | text, feed_id integer) The content and summary columns especially
       | for some of the news items are highly similar but not equal. For
       | any given 2 such news items, I am trying to cut them down to 1.
       | Any ideas?
        
         | RobinL wrote:
         | One useful technique here could be to use text embeddings and
         | cosine similarity:
         | https://simonwillison.net/2023/Oct/23/embeddings/
        
           | swasheck wrote:
           | love this and have been using tf/idf for embeddings and
           | various measures of similarity for some personal pet
           | projects. one thing i came across in my research is that
           | cosine similarity was more useful for vectors of different
           | lengths and that euclidean distance was useful for vectors of
           | similar length but simon alludes to a same-length
           | requirement. i'm not formally trained in this area so i was
           | hoping someone could shed some light on this for me.
        
         | dleeftink wrote:
         | I've implemented a minhash-like system in Bigquery, and was
         | able to calculate cosine similarities (within a certain
         | similarity threshold) between all Stack Overflow in reasonable
         | time. To give you a broad idea of the procedure:
         | 
         | 1. Concatenate and split all text fields into an array of
         | ngrams (2...n chars)
         | 
         | 2. Declare two global arrays A & B of length-n and fill them
         | with random 32-64 bit integers
         | 
         | 3. Hash each ngram to a 32-64 bit integer, multiply the
         | resulting hash by each random value in array A, and modulo the
         | resulting output with each random value in array B. Take the
         | minimum value. Per row, your aim is to end up with an array of
         | 'minhashed' integers of equal length to the arrays in step 2.
         | If you declare the global arrays to be length of 64, the
         | minhashed array for each row will also be of length 64.
         | 
         | 4. Bucketise the resulting hash arrays by summing N consecutive
         | minhash values using a window function (e.g. sum each 4
         | consecutive rows)
         | 
         | If all went well, you can now unnest these arrays (call them
         | 'source rows') and self join the dataset on each bucketed
         | minhash value (resulting in an additional column of 'target
         | rows'). Group by source, target columns and count the
         | occurances to get an indication of how likely two rows are
         | similar.
         | 
         | In essence, the more often two items hash to a similar bucket,
         | the more similar they are, and it's up to you to define the
         | cut-off from where to calculate actual pairwise Jaccard (or
         | cosine) similarities between items.
        
         | probably_wrong wrote:
         | If you think two items cover the same highly similar _keywords_
         | then Jaccard distance should work.
         | 
         | If you think two items share highly similar _text_ then give
         | Levenshtein distance a try.
        
         | kordlessagain wrote:
         | Have an LLM generate a reverse index for the entries, but force
         | it to keep the cardinality low. Then you can use a Jaccard
         | similarity.
        
       | pkeenan11 wrote:
       | Holy synchronicity Batman! I just implemented a minhash system
       | that some may find interesting.
       | 
       | The problem is to find (pseudo) inverses of many different proper
       | submatrices of a big square matrix.
       | 
       | Certain matrix identities (Woodbury, banachiewicz) allow you to
       | update the inverse of a "close" submatrix to cheaply calculate a
       | new one.
       | 
       | So you store the inverses you've calculated already, with their
       | row/column indices as keys. Then for each new submatrix you want
       | to find a close already-computed inverse to update from.
       | 
       | This is solved with minhashing. You minwise hash the indices so
       | that close matrices are likely to have the same hash.
       | 
       | In my particular implementation I did a "multi-resolution" hash
       | so that I can change the selectiveness of the search as the
       | number of prior computed inverses grows
        
       | tpoacher wrote:
       | One thing many people miss about set based metrics like the
       | jaccard similarity (aka Tanimoto coefficient) and F1 score (aka
       | Dice coefficient), is that they can also be used identically with
       | fuzzy sets.
       | 
       | The only complication is that you then need to choose a suitable
       | T-Norm / T-Conorm pair, which express the concept of intersection
       | and union for fuzzy sets, and there's an infinite family of them.
       | But that's a good thing, since you get to pick the pair with your
       | desired semantics.
       | 
       | I wrote about this ([0][1]) in the context of validating medical
       | image segmentations when both the segmentation and ground truth
       | are probabilistic/fuzzy rather than binary masks.
       | 
       | Otherwise what most people do instead is to simply threshold at
       | 0.5 to obtain binary sets for use with the binary variants of the
       | jaccard / dice coefficients. Which apparently decreases the
       | precision of your _validation_ operator by 2 orders of magnitude.
       | It 's like, you publish your algorithm claiming it's better than
       | SOTA by 0.001, ignoring the fact that your validation operator
       | has an error margin of 0.1 ...
       | 
       | [0]
       | https://link.springer.com/chapter/10.1007/978-3-319-46723-8_...
       | 
       | [1]
       | https://ora.ox.ac.uk/objects/uuid:dc352697-c804-4257-8aec-08...
        
       | derefr wrote:
       | As a document clustering / dataset deduplication technique, how
       | well does "throwing ML at the problem" (e.g. using a pre-trained
       | LLM encoder to generate vector embeddings of your documents, and
       | then sticking those vectors into a vector DB and doing k-means to
       | cluster the vectors) compare, quality-wise and performance-wise,
       | to these simpler discrete-algorithm methods?
        
       | motohagiography wrote:
       | there was a really simple string sort by similarity oneliner I
       | can't find in my history, but the basic idea was to take a
       | wordlist, split each word in half, reverse the halves,
       | concatenate them back together, sort the resulting strings, then
       | flip the strings back to their original words.
       | 
       | this simple shuffle was a poor man's similarity sort that I used
       | to find sound-alike words. the thing about the quality of
       | similarity is that it's hard to know for sure if it worked, but
       | this was sufficient.
        
       ___________________________________________________________________
       (page generated 2024-07-04 23:00 UTC)