Jaccard index

From Wikipedia, the free encyclopedia

Measure of similarity and diversity between sets

[Figure: Intersection and union of two sets A and B.]
[Figure: Intersection over Union as a similarity measure for object detection on images, an important task in computer vision.]

The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v)[1] and is now frequently referred to as the Critical Success Index in meteorology.[2] It was later developed independently by Paul Jaccard, who originally gave it the French name coefficient de communauté,[3] and was independently formulated again by T. Tanimoto.[4] Thus, the names Tanimoto index and Tanimoto coefficient are also used in some fields. These formulations are essentially identical, in general taking the ratio of intersection over union.

The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

    J(A,B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}.

Note that by design, 0 \leq J(A,B) \leq 1. If A \cap B is empty, then J(A,B) = 0. The Jaccard coefficient is widely used in computer science, ecology, genomics, and other sciences where binary or binarized data are used. Both exact and approximate methods are available for hypothesis testing with the Jaccard coefficient.[5]
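As a minimal illustration of this definition (a sketch added here, not part of the article), it maps directly onto Python's built-in set operations; the handling of two empty sets is a convention chosen for the sketch, since the ratio is then 0/0:

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity J(A, B) = |A intersect B| / |A union B| for finite sets."""
    if not a and not b:
        return 1.0  # convention for this sketch; the ratio 0/0 is otherwise undefined
    return len(a & b) / len(a | b)

# |A intersect B| = 2 (the elements 2 and 3) and |A union B| = 6, so J(A, B) = 1/3.
A = {0, 1, 2, 3}
B = {2, 3, 4, 5}
print(jaccard(A, B))  # 0.3333...
```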
Jaccard similarity also applies to bags, i.e., multisets. This has a similar formula,[6] but the symbols mean bag intersection and bag sum (not union). The maximum value is 1/2:

    J(A,B) = \frac{|A \cap B|}{|A \uplus B|} = \frac{|A \cap B|}{|A| + |B|}.

The Jaccard distance, which measures dissimilarity between sample sets, is complementary to the Jaccard coefficient and is obtained by subtracting the Jaccard coefficient from 1, or, equivalently, by dividing the difference of the sizes of the union and the intersection of two sets by the size of the union:

    d_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}.

An alternative interpretation of the Jaccard distance is as the ratio of the size of the symmetric difference A \triangle B = (A \cup B) - (A \cap B) to the size of the union. The Jaccard distance is commonly used to calculate an n \times n matrix for clustering and multidimensional scaling of n sample sets. This distance is a metric on the collection of all finite sets.[7][8][9]

There is also a version of the Jaccard distance for measures, including probability measures. If \mu is a measure on a measurable space X, then we define the Jaccard coefficient by

    J_\mu(A,B) = \frac{\mu(A \cap B)}{\mu(A \cup B)},

and the Jaccard distance by

    d_\mu(A,B) = 1 - J_\mu(A,B) = \frac{\mu(A \triangle B)}{\mu(A \cup B)}.

Care must be taken if \mu(A \cup B) = 0 or \infty, since these formulas are not well defined in those cases.

The MinHash min-wise independent permutations locality sensitive hashing scheme may be used to efficiently compute an accurate estimate of the Jaccard similarity coefficient of pairs of sets, where each set is represented by a constant-sized signature derived from the minimum values of a hash function.
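To make the MinHash scheme concrete, here is a small sketch (an illustration added here, not the article's own; it salts Python's built-in hash with random seeds as a stand-in for a family of independent hash functions, which is fine for demonstration but not for production use):

```python
import random

def minhash_signature(s, seeds):
    # One signature component per "hash function": the minimum hash
    # value over the set, with each seed acting as a different salt.
    return [min(hash((seed, item)) for item in s) for seed in seeds]

def estimate_jaccard(sig_a, sig_b):
    # Two sets share the same minimum under a random hash function with
    # probability J(A, B), so the fraction of matching signature
    # components estimates the Jaccard index.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(42)
seeds = [random.getrandbits(32) for _ in range(512)]
A = set(range(0, 60))
B = set(range(30, 90))  # true J(A, B) = 30/90 = 1/3
print(estimate_jaccard(minhash_signature(A, seeds),
                       minhash_signature(B, seeds)))  # approximately 0.33
```

Since each signature component matches with probability J(A, B), the estimator is unbiased and its standard error shrinks on the order of 1/sqrt(k) in the signature length k, which is why a constant-sized signature suffices regardless of the sizes of the sets.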
Similarity of asymmetric binary attributes

Given two objects, A and B, each with n binary attributes, the Jaccard coefficient is a useful measure of the overlap that A and B share with their attributes. Each attribute of A and B can either be 0 or 1. The total number of each combination of attributes for both A and B is specified as follows:

* M_{11} represents the total number of attributes where A and B both have a value of 1.
* M_{01} represents the total number of attributes where the attribute of A is 0 and the attribute of B is 1.
* M_{10} represents the total number of attributes where the attribute of A is 1 and the attribute of B is 0.
* M_{00} represents the total number of attributes where A and B both have a value of 0.

            A = 0    A = 1
    B = 0   M_{00}   M_{10}
    B = 1   M_{01}   M_{11}

Each attribute must fall into one of these four categories, meaning that

    M_{11} + M_{01} + M_{10} + M_{00} = n.

The Jaccard similarity coefficient, J, is given as

    J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}.

The Jaccard distance, d_J, is given as

    d_J = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}} = 1 - J.

Statistical inference can be made based on the Jaccard similarity coefficients, and consequently on related metrics.[5] Given two sample sets A and B with n attributes, a statistical test can be conducted to see if an overlap is statistically significant. The exact solution is available, although computation can be costly as n increases.[5] Estimation methods are available either by approximating a multinomial distribution or by bootstrapping.[5]

Difference with the simple matching coefficient (SMC)

When used for binary attributes, the Jaccard index is very similar to the simple matching coefficient (SMC). The main difference is that the SMC has the term M_{00} in its numerator and denominator, whereas the Jaccard index does not. Thus, the SMC counts both mutual presences (when an attribute is present in both sets) and mutual absences (when an attribute is absent in both sets) as matches and compares them to the total number of attributes in the universe, whereas the Jaccard index only counts mutual presence as a match and compares it to the number of attributes that have been chosen by at least one of the two sets.

In market basket analysis, for example, the baskets of two consumers we wish to compare might contain only a small fraction of all the available products in the store, so the SMC will usually return very high similarity values even when the baskets bear very little resemblance, making the Jaccard index the more appropriate measure of similarity in that context. For example, consider a supermarket with 1000 products and two customers. The basket of the first customer contains salt and pepper, and the basket of the second contains salt and sugar. In this scenario, the similarity between the two baskets as measured by the Jaccard index would be 1/3, but the similarity becomes 0.998 using the SMC.
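The supermarket example can be checked with a short sketch (added here for illustration; which positions the three products occupy in the attribute vectors is an arbitrary choice):

```python
def counts(a, b):
    # Tally the four attribute combinations for binary vectors a and b.
    m11 = sum(x == 1 and y == 1 for x, y in zip(a, b))
    m01 = sum(x == 0 and y == 1 for x, y in zip(a, b))
    m10 = sum(x == 1 and y == 0 for x, y in zip(a, b))
    m00 = sum(x == 0 and y == 0 for x, y in zip(a, b))
    return m11, m01, m10, m00

def jaccard_binary(a, b):
    m11, m01, m10, _ = counts(a, b)
    return m11 / (m01 + m10 + m11)

def smc(a, b):
    m11, m01, m10, m00 = counts(a, b)
    return (m11 + m00) / (m11 + m01 + m10 + m00)

# Supermarket example: 1000 products; customer 1 buys salt and pepper,
# customer 2 buys salt and sugar (placed at positions 0, 1, 2 here).
n = 1000
basket1 = [1, 1, 0] + [0] * (n - 3)      # salt, pepper
basket2 = [1, 0, 1] + [0] * (n - 3)      # salt, sugar
print(jaccard_binary(basket1, basket2))  # 0.3333... (M11=1, M01=1, M10=1)
print(smc(basket1, basket2))             # 0.998     (M00=997 dominates)
```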
In other contexts, where 0 and 1 carry equivalent information (symmetry), the SMC is a better measure of similarity. For example, vectors of demographic variables stored in dummy variables, such as gender, would be better compared with the SMC than with the Jaccard index, since the impact of gender on similarity should be equal independently of whether male is defined as 0 and female as 1 or the other way around. However, when we have symmetric dummy variables, one could replicate the behaviour of the SMC by splitting the dummies into two binary attributes (in this case, male and female), thus transforming them into asymmetric attributes and allowing the use of the Jaccard index without introducing any bias. The SMC remains, however, more computationally efficient in the case of symmetric dummy variables, since it does not require adding extra dimensions.

Weighted Jaccard similarity and distance

If \mathbf{x} = (x_1, x_2, \ldots, x_n) and \mathbf{y} = (y_1, y_2, \ldots, y_n) are two vectors with all real x_i, y_i \geq 0, then their Jaccard similarity coefficient (also known then as Ruzicka similarity) is defined as

    J_\mathcal{W}(\mathbf{x}, \mathbf{y}) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)},

and their Jaccard distance (also known then as Soergel distance) as

    d_{J\mathcal{W}}(\mathbf{x}, \mathbf{y}) = 1 - J_\mathcal{W}(\mathbf{x}, \mathbf{y}).

With even more generality, if f and g are two non-negative measurable functions on a measurable space X with measure \mu, then we can define

    J_\mathcal{W}(f, g) = \frac{\int \min(f, g) \, d\mu}{\int \max(f, g) \, d\mu},

where \max and \min are pointwise operators. Then the Jaccard distance is

    d_{J\mathcal{W}}(f, g) = 1 - J_\mathcal{W}(f, g).

Then, for example, for two measurable sets A, B \subseteq X, we have J_\mu(A, B) = J(\chi_A, \chi_B), where \chi_A and \chi_B are the characteristic functions of the corresponding sets.
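A minimal sketch of the weighted (Ruzicka) similarity and Soergel distance (added here for illustration, assuming plain Python lists of non-negative numbers as vectors):

```python
def weighted_jaccard(x, y):
    # Ruzicka similarity: sum of pointwise minima over sum of pointwise maxima.
    num = sum(min(a, b) for a, b in zip(x, y))
    den = sum(max(a, b) for a, b in zip(x, y))
    return num / den if den else 1.0  # convention for two all-zero vectors

x = [0.5, 0.0, 2.0, 1.0]
y = [0.25, 1.0, 2.0, 0.0]
print(weighted_jaccard(x, y))        # (0.25 + 0 + 2 + 0) / (0.5 + 1 + 2 + 1) = 0.5
print(1 - weighted_jaccard(x, y))    # Soergel distance = 0.5
```

On 0/1 indicator vectors, the sums of pointwise minima and maxima count the intersection and union respectively, so this reduces to the ordinary Jaccard index of the corresponding sets.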
Probability Jaccard similarity and distance

The weighted Jaccard similarity described above generalizes the Jaccard index to positive vectors, where a set corresponds to a binary vector given by its indicator function, i.e. x_i \in \{0, 1\}. However, it does not generalize the Jaccard index to probability distributions, where a set corresponds to a uniform probability distribution, i.e.

    x_i = \begin{cases} \frac{1}{|X|} & i \in X \\ 0 & \text{otherwise} \end{cases}

The weighted Jaccard similarity of two such uniform distributions is always less than the Jaccard index of the underlying sets if the sets differ in size: if |X| > |Y| and x_i = \mathbf{1}_X(i)/|X|, y_i = \mathbf{1}_Y(i)/|Y|, then

    J_\mathcal{W}(x, y) = \frac{|X \cap Y|}{|X \setminus Y| + |X|} < J(X, Y).

Instead, the probability Jaccard index generalizes the Jaccard index to probability distributions:[10]

    J_\mathcal{P}(x, y) = \sum_{i : x_i \neq 0, y_i \neq 0} \frac{1}{\sum_j \max\left(\frac{x_j}{x_i}, \frac{y_j}{y_i}\right)},

with the probability Jaccard distance d_\mathcal{P}(x, y) = 1 - J_\mathcal{P}(x, y). For uniform distributions over X and Y, this recovers J(X, Y) exactly.

Optimality of the Probability Jaccard Index

The probability Jaccard index is optimal among collision probabilities of sampling methods. If G is a sampling method assigning each distribution a sample from its support, and \Pr[G(x) = G(y)] > J_\mathcal{P}(x, y) for some pair x, y, then for some z where J_\mathcal{P}(x, z) > J_\mathcal{P}(x, y) and J_\mathcal{P}(y, z) > J_\mathcal{P}(x, y), either \Pr[G(x) = G(z)] < J_\mathcal{P}(x, z) or \Pr[G(y) = G(z)] < J_\mathcal{P}(y, z). That is, no sampling method can achieve more collisions than J_\mathcal{P} on one pair of distributions without sacrificing collisions on a more similar pair.[10]
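As a numerical check of the probability Jaccard index (a sketch added here; it evaluates the displayed formula directly and confirms that uniform distributions over two sets recover the ordinary Jaccard index of those sets):

```python
def probability_jaccard(x, y):
    # J_P(x, y) = sum over i with x_i, y_i > 0 of
    #             1 / sum_j max(x_j / x_i, y_j / y_i)
    n = len(x)
    total = 0.0
    for i in range(n):
        if x[i] > 0 and y[i] > 0:
            total += 1.0 / sum(max(x[j] / x[i], y[j] / y[i]) for j in range(n))
    return total

# Uniform distributions over X = {0,1,2,3} and Y = {2,3,4,5} in a universe
# of 6 items: J_P recovers J(X, Y) = |X intersect Y| / |X union Y| = 2/6.
x = [0.25, 0.25, 0.25, 0.25, 0.0, 0.0]
y = [0.0, 0.0, 0.25, 0.25, 0.25, 0.25]
print(probability_jaccard(x, y))  # 0.3333...
```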
References

1. Murphy, Allan H. (1996). "The Finley Affair: A Signal Event in the History of Forecast Verification". Weather and Forecasting. 11 (1): 3-20. doi:10.1175/1520-0434(1996)011<0003:TFAASE>2.0.CO;2. ISSN 1520-0434. S2CID 54532560.
2. "Forecast Verification Glossary" (PDF). https://www.swpc.noaa.gov/sites/default/files/images/u30/Forecast%20Verification%20Glossary.pdf
3. Jaccard, Paul (February 1912). "The Distribution of the Flora in the Alpine Zone". New Phytologist. 11 (2): 37-50. doi:10.1111/j.1469-8137.1912.tb05611.x. ISSN 0028-646X.
4. Tanimoto, T. T. (17 Nov 1958). "An Elementary Mathematical Theory of Classification and Prediction". Internal IBM Technical Report. 1957 (8?).
5. Chung, N. C.; Miasojedow, B.; Startek, M.; Gambin, A. (December 2019). "Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data". BMC Bioinformatics. 20 (Suppl 15): 644. arXiv:1903.11372. doi:10.1186/s12859-019-3118-5. PMC 6929325. PMID 31874610.
6. Leskovec, J.; Rajaraman, A.; Ullman, J. (2020). Mining of Massive Datasets. Cambridge. ISBN 9781108476348. See also pp. 76-77 of an earlier version: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
7. Kosub, S. (April 2019). "A note on the triangle inequality for the Jaccard distance". Pattern Recognition Letters. 120: 36-38. arXiv:1612.02696. Bibcode:2019PaReL.120...36K. doi:10.1016/j.patrec.2018.12.007. S2CID 564831.
8. Lipkus, A. H. (1999). "A proof of the triangle inequality for the Tanimoto distance". Journal of Mathematical Chemistry. 26 (1-3): 263-265. doi:10.1023/A:1019154432472. S2CID 118263043.
9. Levandowsky, M.; Winter, D. (1971). "Distance between sets". Nature. 234 (5): 34-35. Bibcode:1971Natur.234...34L. doi:10.1038/234034a0. S2CID 4283015.
10. Moulton, R.; Jiang, Y. (2018). "Maximally Consistent Sampling and the Jaccard Index of Probability Distributions". International Conference on Data Mining, Workshop on High Dimensional Data Mining: 347-356. arXiv:1809.04052. doi:10.1109/ICDM.2018.00050. ISBN 978-1-5386-9159-5. S2CID 49746072.
11. For example Huihuan, Q.; Xinyu, W.; Yangsheng, X. (2011). Intelligent Surveillance Systems. Springer. p. 161. ISBN 978-94-007-1137-2.
12. Rogers, D. J.; Tanimoto, T. T. (October 1960). "A Computer Program for Classifying Plants". Science. 132 (3434): 1115-1118. Bibcode:1960Sci...132.1115R. doi:10.1126/science.132.3434.1115. PMID 17790723.
13. Aziz Taha, Abdel (2015). "Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool". BMC Medical Imaging. 15 (29): 1-28. doi:10.1186/s12880-015-0068-x. PMC 4533825. PMID 26263899.

Further reading

* Tan, P. N.; Steinbach, M.; Kumar, V. (2005). Introduction to Data Mining. ISBN 0-321-32136-7.
* Jaccard, P. (1901). "Étude comparative de la distribution florale dans une portion des Alpes et des Jura". Bulletin de la Société vaudoise des sciences naturelles. 37: 547-579.
* Jaccard, P. (1912). "The Distribution of the flora in the alpine zone". New Phytologist. 11 (2): 37-50. doi:10.1111/j.1469-8137.1912.tb05611.x.

External links

* Introduction to Data Mining lecture notes from Tan, Steinbach, Kumar
* SimMetrics: a SourceForge implementation of the Jaccard index and many other similarity metrics
* A web-based calculator for finding the Jaccard coefficient
* Tutorial on how to calculate different similarities
* Intersection over Union (IoU) for object detection
* Kaggle Dstl Satellite Imagery Feature Detection - Evaluation
* Similarity and dissimilarity measures used in data science