[HN Gopher] Jaccard Index
       ___________________________________________________________________
        
       Jaccard Index
        
       Author : dedalus
       Score  : 91 points
       Date   : 2023-03-19 19:33 UTC (3 hours ago)
        
 (HTM) web link (en.wikipedia.org)
 (TXT) w3m dump (en.wikipedia.org)
        
       | SubiculumCode wrote:
       | Jaccard my Dice please.
        
         | SubiculumCode wrote:
          | Jests aside, I've mostly used the closely related Dice
          | coefficient when measuring segmentation reliability
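For readers wondering how the two relate: a minimal Python sketch comparing the Dice coefficient and the Jaccard index on two toy "segmentation masks" (the pixel coordinates are invented for illustration).

```python
# Dice vs. Jaccard on two segmentation masks represented as sets of
# (row, col) pixel coordinates. The masks below are made up.
def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))

mask_a = {(0, 0), (0, 1), (1, 0)}
mask_b = {(0, 1), (1, 0), (1, 1)}

j = jaccard(mask_a, mask_b)  # 2 shared pixels / 4 total = 0.5
d = dice(mask_a, mask_b)     # 2*2 / (3 + 3) = 0.666...

# The two are monotonically related: D = 2J / (1 + J), so they rank
# segmentations identically, but Dice weights the overlap more heavily.
assert abs(d - 2 * j / (1 + j)) < 1e-9
```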
        
       | psyklic wrote:
       | This is one of my favorite distance metrics* to show people!
       | 
       | For example, perhaps one person likes Reddit and HN, while
       | someone else likes HN and SO.
       | 
       | Then their Jaccard Index would be 1/3, since they have one thing
       | in common out of three.
       | 
       | * Technically it computes "similarity" (larger number == more
       | similar), but `1 - Jaccard Index` is a distance (smaller number
       | == more similar).
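The example above as a minimal Python sketch (the site names are just illustrative strings):

```python
def jaccard_similarity(a: set, b: set) -> float:
    """|A intersect B| / |A union B|: 1.0 for identical sets, 0.0 for disjoint."""
    if not (a | b):
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

alice = {"reddit", "hn"}
bob = {"hn", "so"}

similarity = jaccard_similarity(alice, bob)  # 1 shared site of 3 total -> 1/3
distance = 1 - similarity                    # this direction is a true metric
```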
        
       | startup_eng wrote:
       | I just used this at work the other day to calculate similarities
       | between different data models that had overlapping children
       | models. One of our teams was going to go through manually to
       | check these overlaps and consolidate, but by using this
       | clustering algo based on Jaccard distance we were able to give
       | them clusters to consolidate up front. Super cool stuff!
        
         | sonofaragorn wrote:
          | What is a children model? I'm curious but can't really follow
          | what you wrote, can you add a bit more context?
        
       | coeneedell wrote:
       | I recently used Jaccard similarity as a measurement of distance
       | between two sets of online articles. It's amazing how versatile
       | it is for all sorts of weird tasks.
        
         | paulgb wrote:
          | I used to use Jaccard similarity combined with w-shingling at
         | the character level to detect clusters of fraud sites. It was
         | surprisingly effective, because it was able to pick up common
         | patterns in the code even if they used completely different
         | styles and text.
         | 
         | https://en.m.wikipedia.org/wiki/W-shingling
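A rough sketch of the approach described above: character-level shingles scored with Jaccard. The HTML snippets are invented, but they show how a shared page template dominates the score even when the visible text differs.

```python
# Character-level w-shingling + Jaccard similarity, as described above.
def shingles(text: str, w: int = 5) -> set:
    """All contiguous character substrings of length w."""
    return {text[i:i + w] for i in range(len(text) - w + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if (a | b) else 1.0

page_a = "<div class='hero'><h1>Claim your prize!</h1></div>"
page_b = "<div class='hero'><h1>Claim your reward!</h1></div>"
page_c = "<p>An unrelated article about gardening.</p>"

# Pages a and b share most of their markup shingles, so they score high;
# page c shares almost nothing with either.
same_template = jaccard(shingles(page_a), shingles(page_b))
different = jaccard(shingles(page_a), shingles(page_c))
```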
        
           | jethkl wrote:
           | Interesting - I also used Jaccard similarity to classify
           | clusters of malicious ad traffic schemes. The idea worked
           | well. It was unclear if the similarity was due to mimicry or
           | authorship, but that did not matter for our use.
        
       | unethical_ban wrote:
       | What is the predicted bounding and the ground truth bounding, as
       | related by a stop sign? I have no idea what's happening there.
        
         | montroser wrote:
         | The "predicted" box there would be a best guess from a
         | statistical model powered by AI or computer vision, answering,
         | "where is the stop sign in this image?". The "ground truth"
         | would be an annotation by a human answering the same question.
          | The Jaccard similarity metric would say that these bounding
         | boxes are highly similar, and so the prediction could be
         | evaluated as high quality.
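In code, the bounding-box case is usually called intersection over union (IoU) and is exactly the Jaccard index applied to the two rectangles' areas. A minimal sketch with made-up coordinates:

```python
# IoU (Jaccard index) for axis-aligned boxes given as (x1, y1, x2, y2),
# with x1 < x2 and y1 < y2. The coordinates below are illustrative.
def iou(box_a, box_b) -> float:
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero area if the boxes are disjoint)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union else 0.0

ground_truth = (10, 10, 50, 50)  # human annotation
predicted = (12, 12, 52, 52)     # model output, slightly offset

score = iou(ground_truth, predicted)  # near 1.0 -> a high-quality prediction
```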
        
       | pncnmnp wrote:
       | I recently wrote a fun blog post
       | (https://pncnmnp.github.io/blogs/odd-sketches.html) about how to
       | estimate Jaccard Similarity using min hashing, what b-bit min
       | hashing is, and how to improve upon its limitations using a 2014
       | data structure called odd sketches.
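This is not the blog post's implementation, but a toy sketch of the MinHash idea it covers: with k independent hash functions, the probability that two sets agree on their minimum hash equals their Jaccard similarity, so the fraction of agreeing minima is an unbiased estimate. Python's built-in `hash` salted with a random seed stands in for proper hash functions here.

```python
import random

def minhash_signature(items: set, seeds) -> list:
    # One "hash function" per seed: the built-in hash salted by the seed.
    return [min(hash((seed, x)) for x in items) for seed in seeds]

def estimate_jaccard(a: set, b: set, k: int = 256) -> float:
    rng = random.Random(42)  # fixed seed so the sketch is reproducible
    seeds = [rng.random() for _ in range(k)]
    sig_a = minhash_signature(a, seeds)
    sig_b = minhash_signature(b, seeds)
    return sum(x == y for x, y in zip(sig_a, sig_b)) / k

a = set(range(0, 100))
b = set(range(50, 150))  # true Jaccard = 50/150 = 0.333...

estimate = estimate_jaccard(a, b)  # should land near 1/3, within sampling error
```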
       | 
       | Jaccard Similarity's history is also quite interesting. From my
       | blog:
       | 
       | > In the late 19th century, the United States and several
       | European nations were focused on developing strategies for
       | weather forecasting, particularly for storm warnings. In 1884,
       | Sergeant John Finley of the U.S. Army Signal Corps conducted
       | experiments aimed at creating a tornado forecasting program for
       | 18 regions in the United States east of the Rockies. To the
       | surprise of many, Finley claimed his programs were 95.6% to 98.6%
       | accurate, with some areas even achieving a 100% accuracy rate.
       | Upon publishing his findings, Finley's methods were criticized by
       | contemporaries who pointed out flaws in his verification
       | strategies and proposed their solutions. This sparked a renewed
       | interest in weather prediction, which is now referred to as the
       | "Finley Affair."
       | 
       | > One of these contemporaries was Grove Karl Gilbert. Just two
       | months after Finley's publication, Gilbert pointed out that,
       | based on Finley's strategy, a 98.2% accuracy rate could be
       | achieved simply by forecasting no tornado warning. Gilbert then
       | introduced an alternative strategy, which is now known as Jaccard
       | Similarity.
       | 
       | > So why is it named Jaccard Similarity? As it turns out, nearly
       | three decades after Sergeant John Finley's tornado forecasting
       | program in the 1880s, Paul Jaccard independently developed the
       | same concept while studying the distribution of alpine flora.
        
       | stygiansonic wrote:
       | The name may be an example of this:
       | https://en.m.wikipedia.org/wiki/Stigler%27s_law_of_eponymy
       | 
       |  _It was developed by Grove Karl Gilbert in 1884 as his ratio of
       | verification (v)[1] and now is frequently referred to as the
       | Critical Success Index in meteorology.[2] It was later developed
       | independently by Paul Jaccard..._
        
         | jszymborski wrote:
         | Another example of this sort of thing (that is vaguely related
         | in that it's commonly used as a metric) is (what I call) the
         | Matthew's Correlation Coefficient
         | https://en.wikipedia.org/wiki/Phi_coefficient
         | 
         | > In machine learning, it is known as the Matthews correlation
         | coefficient (MCC) ... introduced by biochemist Brian W.
         | Matthews in 1975.[1] Introduced by Karl Pearson,[2] and also
         | known as the Yule phi coefficient from its introduction by Udny
         | Yule in 1912
        
         | dalke wrote:
         | In my field, cheminformatics, we refer to it as "Tanimoto
         | similarity" because it was (quoting Wikipedia) "independently
         | formulated again by T. Tanimoto."
         | 
         | It's an odd set of linkages to get there. First, "Dr. David J.
         | Rogers of the New York Botanical Gardens" proposed a problem to
         | Tanimoto, who published the writeup in an internal IBM report
         | in 1958. (I understand there was a lot of mathematical research
         | in taxonomy at the time.) In 1960 Rogers and Tanimoto published
         | an updated version in Science.
         | 
         | In 1973 Adamson and Bush at Sheffield University developed a
         | method for the automatic classification of chemical structures.
         | They tried Dice, phi, and Sneath as their comparison methods
          | but not Tanimoto. In their updated 1975 publication they write
         | "Several coefficients have been proposed based on this
         | criterion", with a list of citations, including the Rogers and
         | Tanimoto paper as citation 14.
         | 
         | In 1986, Peter Willett at Sheffield revisits this work and
         | finds that Tanimoto gives overall better results when applied
         | to what are now called cheminformatics "fingerprints". He uses
         | "Tanimoto", with no direct citation for the source of that
         | definition.
         | 
         | This similarity method is easy to implement, and many
         | organizations already have pre-computed fingerprints (they are
         | used as pre-filters for graph queries), so the concept and
         | nomenclature takes off almost immediately, with "Tanimoto" as
          | the preferred name.
         | 
          | It's not until 1991 that I can find a paper in my field referring
         | to the earlier work by Jaccard (the paper uses "Tanimoto
         | (Jaccard)").
         | 
         | I have found some papers in related fields (eg, in IR and mass
         | spectra analysis) which reference Tanimoto similarity, but
         | nothing to the extent that my field uses it.
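For the bit-vector fingerprints mentioned above, Tanimoto similarity reduces to two population counts. A sketch with invented toy fingerprints (real cheminformatics fingerprints are typically 1024+ bits):

```python
# Tanimoto similarity on binary fingerprints, here Python ints used as
# bitsets: popcount(a & b) / popcount(a | b), i.e. Jaccard on the sets
# of set-bit positions. The fingerprints below are made up.
def tanimoto(fp_a: int, fp_b: int) -> float:
    inter = bin(fp_a & fp_b).count("1")
    union = bin(fp_a | fp_b).count("1")
    return inter / union if union else 1.0

fp1 = 0b1011_0110
fp2 = 0b1010_0111

similarity = tanimoto(fp1, fp2)  # 4 shared bits / 6 distinct set bits = 0.666...
```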
        
       | ketralnis wrote:
       | Quoting myself from a while ago[0]
       | 
       | At reddit many moons ago before machine learning was a buzzword
       | one early iteration of recommendations was based on Jaccard
       | distance using the number of co-voters between subreddits. But
       | with one twist: divide by the size of the smaller subreddit.
        | relatedness a b =
        |     numerator   = | voters on(a) ∩ voters on(b) |
        |     denominator = | voters on(a) ∪ voters on(b) |
        |     weight      = min(|voters on(a)|, |voters on(b)|)
        |     numerator / (weight * denominator)
       | 
       | That gives you a directional relatedness, that is
       | programming->python but not necessarily python->programming. Used
       | this way you account for the giant subreddit problem[1]
       | automatically but now the results are less "amitheasshole is
       | related to askreddit" and more like "linguisticshumor is a more
       | niche version of linguistics".
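A sketch of that relatedness score, assuming each subreddit is just a set of voter IDs. The toy data below is invented to show the niche-boost effect: a small subset community outscores a similarly-overlapping but larger one.

```python
def relatedness(voters_a: set, voters_b: set) -> float:
    numerator = len(voters_a & voters_b)        # co-voters
    denominator = len(voters_a | voters_b)      # ordinary Jaccard denominator
    weight = min(len(voters_a), len(voters_b))  # size of the smaller sub
    if numerator == 0:
        return 0.0
    return numerator / (weight * denominator)

# A big sub, a niche subset of it, and a mostly-unrelated mid-sized sub.
programming = set(range(1000))
python = set(range(100))           # every python voter also votes on programming
gardening = set(range(900, 1400))  # same 100 co-voters, but a larger community

# Both overlap programming by 100 voters, yet the niche subset scores
# higher because the min-size weight rewards smaller communities.
scores = {"python": relatedness(programming, python),
          "gardening": relatedness(programming, gardening)}
```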
       | 
       | The great thing is that it's actually more actionable as far as
       | recommendations go! Everybody has already heard of the bigger
       | version of this subreddit, but they probably haven't heard of the
       | smaller versions. And it's self-correcting: as a subreddit gets
       | bigger we are less likely to recommend it, which is great because
       | it needs our help less.
       | 
       | It's also easy to compute this because it lends itself to one
       | giant SQL query that postgres or even sqlite[2] optimises
        | reasonably well. It has some discontinuities around very tiny
        | subreddits, so there was also a hack to exclude them with a
        | crude heuristic. It does get fairly static, so once we've picked
        | 3 subreddits to recommend on subreddit A, we'll just keep
        | showing them even if you don't like them. I had a hack in
       | mind for that (use the computed values as random weights so we'll
       | still occasionally show lower-scoring ones) but by this time
       | people much smarter than I took over recommendations with more
       | holistic solutions to the problem we were trying to solve in the
       | first place. Still, as a first pass it worked great and based on
       | my experience I'd recommend simple approaches like this before
        | you break out the linear algebra.
       | 
       | Side note, I tried co-commenters in addition to co-voters. The
        | results tended to be more accurate in my spot tests but the difference
       | fell away in more proper cross-validation testing and I didn't
       | look into where the qualitative difference was. But since there
       | are more votes than comments on small subreddits the number of
       | recommendable subreddits was higher with votes. I reasoned that
       | co-submitters (of posts) should be even more accurate but it was
       | thrown off by a small number of spammers and I didn't want to
       | mess with combining those tasks at the time.
       | 
       | [0]: https://news.ycombinator.com/item?id=22178517
       | 
       | [1]: that votes are distributed according to a power law, meaning
       | that everybody has voted on the largest subreddits so most
       | clustering approaches recommend askreddit to everybody. That's
       | okay for product recommendations where "you should buy the most
       | popular CPU, it's most popular for a reason" but for subreddits
       | you already know that so we want a way to bias to the most
       | "surprising" of your votes.
       | 
       | [2]: I prototyped it on sqlite on my laptop and even with close
        | to the production amount of data it ran reasonably well. Not
        | fast, but fine. This was on considerably less traffic than today,
       | mind.
        
       | seydor wrote:
        | Didn't know there was a name for it.
        
       ___________________________________________________________________
       (page generated 2023-03-19 23:00 UTC)