[HN Gopher] A New Coefficient of Correlation
       ___________________________________________________________________
        
       A New Coefficient of Correlation
        
       Author : malshe
       Score  : 649 points
       Date   : 2021-12-25 22:30 UTC (2 days ago)
        
 (HTM) web link (arxiv.org)
 (TXT) w3m dump (arxiv.org)
        
       | pmayrgundter wrote:
       | I didn't get the idea of ranking. But it's simple:
       | 
       | "In statistics, ranking is the data transformation in which
       | numerical or ordinal values are replaced by their rank when the
       | data are sorted. For example, the numerical data 3.4, 5.1, 2.6,
       | 7.3 are observed, the ranks of these data items would be 2, 3, 1
       | and 4 respectively. For example, the ordinal data hot, cold, warm
       | would be replaced by 3, 1, 2."
       | 
       | https://en.m.wikipedia.org/wiki/Ranking
       | 
       | Also learned that a Spearman coeff is just the Pearson
       | coefficient taken on the _rank_ of the data, instead of on the
       | raw data.
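        | 
        | A quick check of that fact (a rough sketch, not from the paper,
        | assuming numpy and scipy are available):
        | 
        |     import numpy as np
        |     from scipy.stats import pearsonr, spearmanr, rankdata
        | 
        |     rng = np.random.default_rng(0)
        |     x = rng.normal(size=1000)
        |     y = np.exp(x) + rng.normal(scale=0.1, size=1000)
        | 
        |     # Spearman's rho is Pearson's r computed on the ranks.
        |     print(spearmanr(x, y)[0])
        |     print(pearsonr(rankdata(x), rankdata(y))[0])  # same value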
       | 
        | But whereas Pearson/Spearman takes the normalized sum of
        | products of deviations from the means, S (x_i - xbar)(y_i -
        | ybar) / ((n-1) s_x s_y), where xbar is the mean and s the std.
        | dev., Chatterjee takes a sum of absolute rank differences, xi =
        | 1 - 3 S |r_(i+1) - r_i| / (n^2 - 1), looking only at the ranks
        | r_i of the Y data after the (X, Y) pairs have been sorted by X.
       | 
       | But still missing the intuition for why the sum of difference of
       | ranks is so useful or where the magic numbers come from.
        
         | cornel_io wrote:
         | jmount has an explanation elsewhere in this thread linking to
         | https://win-vector.com/2021/12/26/how-to-read-sourav-chatter...
         | which does a great job of explaining the intuition, but in a
         | nutshell the normalization factor of 3 comes from the fact that
          | if you select two random points between 0 and 1, the mean
          | distance between them will be 1/3 (which is pretty easy to
          | write down and solve; it boils down to the fact that a pyramid
          | of height 1 with a 1x1 base has volume 1/3).
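          | 
          | That 1/3 is easy to sanity-check numerically (a throwaway
          | sketch, assuming numpy):
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     u = rng.random(1_000_000)
          |     v = rng.random(1_000_000)
          |     print(np.abs(u - v).mean())  # ~0.333, i.e. about 1/3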
        
           | pmayrgundter wrote:
           | Thanks! That did it
        
       | lngnmn2 wrote:
        
         | Jweb_Guru wrote:
         | I mean, if you choose a bullshit ordering relationship, sure.
         | But that's (essentially) the only assumption.
         | 
         | Ah, I missed the reference to "the humanities" the first
         | time... seems like you have an ideological axe to grind.
        
         | nl wrote:
         | What are you talking about?
         | 
         | Correlation is absolutely useful for analysis of non-physical
         | based systems.
        
           | lngnmn2 wrote:
           | "analysis"
        
       | asdf_snar wrote:
       | I once attended a summer school in Saint-Flour, France, where
       | Sourav Chatterjee gave a masterful exposition of results on large
       | deviations for random graphs. All chalk talk, clear presentation.
       | My impression is he's a deep thinker. On that basis alone I give
       | such an article much more weight than the plethora of articles
       | that pass before the eyes; reading the masters has always been a
       | good rule of thumb.
        
         | rootveg wrote:
         | what kind of summer school was that?
        
           | spekcular wrote:
           | This one: https://recherche.math.univ-bpclermont.fr/stflour/
        
         | [deleted]
        
       | [deleted]
        
       | derbOac wrote:
       | Also worth looking at the cited and related paper with the same
       | author:
       | 
       | https://arxiv.org/abs/1910.12327
       | 
       | This follow-up paper presents a related measure of conditional
       | dependence that has a "natural interpretation as a nonlinear
       | generalization of the familiar partial R2 statistic for measuring
       | conditional dependence by regression."
       | 
       | The follow-up paper also provides some additional interpretive
       | insights, I think.
       | 
       | My intuitive impression is that both of these are important
       | developments.
       | 
       | I also have a very vague suspicion, based on the form of the
       | function, that the correlation measure has some interpretation in
       | terms of mutual information involving rank transformations of
       | random variables.
        
         | malshe wrote:
         | Thanks for finding this article. I agree, these are important
         | developments in particular because so many econometric models
         | are now using machine learning without any distributional
         | assumptions. Using correlation coefficients based on linearity
         | is grossly insufficient.
        
       | ur-whale wrote:
       | Is there code?
        
         | loxias wrote:
          | The equation is on the second page, and if you know enough to
          | know what correlation _is_, you know enough to implement it
          | from the equation given. Takes O(N log N) to run, though, if
          | implemented naively (because you have to sort your data).
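          | 
          | For instance, a minimal sketch of the no-ties formula (my own
          | reading of it, assuming numpy/scipy; ties and the paper's
          | randomized tie-breaking are ignored):
          | 
          |     import numpy as np
          |     from scipy.stats import rankdata
          | 
          |     def xi(x, y):
          |         # Sort the pairs by x; only the ranks of y matter.
          |         n = len(x)
          |         r = rankdata(np.asarray(y)[np.argsort(x)])
          |         return 1 - 3 * np.sum(np.abs(np.diff(r))) / (n*n - 1)
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.uniform(-1, 1, 10_000)
          |     print(xi(x, np.sin(3 * x)))            # near 1: y = f(x)
          |     print(xi(x, rng.normal(size=10_000)))  # near 0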
        
         | malshe wrote:
         | Yes, the author has shared the link to R package here:
         | 
         | https://cran.r-project.org/web/packages/XICOR/index.html
         | 
         | Edit: R code from Dr. Chatterjee's Stanford page is here -
         | https://souravchatterjee.su.domains//xi.R
         | 
          | If you have never worked with R, the code seems clunky, so I
          | suggest checking out the Python implementation on GitHub here:
         | 
         | https://github.com/czbiohub/xicor
         | 
         | The Python library is not from the original author though. But
         | it's easy to read the code and it works with pandas as well.
        
           | ur-whale wrote:
           | Thanks, the Python code is very clear and simple and makes it
           | super easy to understand the idea without having to digest
           | the paper.
        
           | zmachinaz wrote:
            | The current version of the Python lib seems to be extremely
            | badly written code. Or is the algo just that bad? It takes
            | something like 21s to compute the correlation for just 10k
            | samples.
        
             | flyingmutant wrote:
             | This issue contains simple code that is claimed to be >300x
             | faster: https://github.com/czbiohub/xicor/issues/17
        
               | malshe wrote:
               | Thanks for locating the solution. I didn't check the
               | Python code myself so I wasn't sure what was going on
               | with the slow processing
        
       | mellavora wrote:
       | Best line of the paper:
       | 
       | "The formula is so simple that it is likely that there are many
       | such coefficients, some of them possibly having better properties
       | than the one presented below."
       | 
       | suggesting that the author is seeking to highlight a basic
       | principle, not tune for a specific application.
        
       | malshe wrote:
       | This article is now published in the Journal of the American
       | Statistical Association [Volume 116, 2021 - Issue 536]
       | 
       | https://doi.org/10.1080/01621459.2020.1758115
        
         | mjcohen wrote:
          | That costs $47 (US). The arXiv article is free.
        
           | ateng wrote:
            | I think pointing out that it is published in a known journal
            | is meant to show that the paper is now peer-reviewed.
        
             | ekianjo wrote:
              | Peer reviews don't mean much anymore these days.
        
               | malshe wrote:
               | Perhaps to you. In academia we take them seriously.
        
               | Nevermark wrote:
               | Do you have a measure of low correlation to support that?
               | 
                | Regardless, I expect reviews are pretty straightforward
                | for a concept that stands on its own like this one.
        
               | derbOac wrote:
               | https://link.springer.com/article/10.1007/s11192-017-2516
               | -6
        
               | sdoering wrote:
               | In answer to sister comment by derbOac:
               | 
               | Thanks for sharing this. On the other hand the cited
               | paper clearly relates to interdisciplinary peer review.
               | 
               | I am not sure how or if this can be transferred to
               | mathematics.
        
               | nl wrote:
               | In Math they really do.
        
             | malshe wrote:
             | Yes, that's right. I wanted to highlight that this is a
             | peer-reviewed article published in one of the top
             | statistical journals.
        
       | appleflaxen wrote:
       | This is incredibly cool. Thanks HN for bringing things like this
       | to the front page, which I would have never otherwise heard
       | about.
        
       | hdm41bc wrote:
       | One of the aspects of math papers that I dislike is how
       | unapproachable they are if you're unfamiliar with some of the
       | terminology and conventions. The esoteric symbols don't make it
       | any easier to Google their definitions either.
       | 
        | For instance, what does this mean?
        | 
        | > m is the law of Y
        | 
        | m is usually the mean or average. Is "law" something else?
        
         | ebb_earl_co wrote:
          | The law of Y means the distribution of Y. See the last phrase
          | of this [0] Math StackExchange answer.
         | 
         | [0] https://math.stackexchange.com/a/1397467
        
           | hdm41bc wrote:
           | Great thanks! I'll remember to use math.stackexchange.com as
           | a resource in the future.
        
         | 331c8c71 wrote:
         | Do you think a random production code snippet is approachable
         | for someone lacking the necessary background?
         | 
         | The notation in this paper is totally standard.
         | 
         | EDIT: \mu there refers to a probability measure. Nothing to do
         | with averages.
        
           | hdm41bc wrote:
           | Yeah, working code snippets would be great. That would
           | provide an unambiguous implementation that someone could use
           | to dig into the underlying functions used and learn the basic
           | concepts that would be tedious for the author to go through.
           | 
           | In terms of the notation, it seemed like the author actually
           | tried to keep his paper accessible, so my complaint isn't
           | with the author. My gripe is more with math notation in
           | general.
           | 
           | In my opinion, unless you've read the appropriate textbooks
           | or taken the right classes, math notation is hard to learn.
            | The symbols are hard to Google for. Integral signs, R for
            | the real numbers, sigma, delta, the round E (the element-of
            | sign) that stands for "in" are not found on a standard
            | keyboard, so it's challenging for a layman to Google and
            | learn that notation on their own. Math
           | evolved over millennia and the notation wasn't constructed
           | with SEO in mind, so I understand why things are the way they
           | are, but it's a stumbling block for the uninitiated trying to
           | learn more advanced math. Maybe there are resources like
           | math.stackexchange out there that I'm unaware of that would
           | help make learning notation more approachable.
        
       | jmount wrote:
       | I share some notes on reading/explaining the basic definition
       | here: https://win-vector.com/2021/12/26/how-to-read-sourav-
       | chatter...
        
         | [deleted]
        
       | zaptheimpaler wrote:
       | How is it possible to make a general coefficient of correlation
       | that works for any non-linear relationship? Say if y=sha256(x),
       | doesn't that mean y is a predictable function of x, but its
       | statistically impossible to tell from looking at inputs/outputs
       | alone?
        
         | [deleted]
        
         | [deleted]
        
         | cirpis wrote:
          | Two things: 1. The result is asymptotic, i.e. it holds as the
          | number of samples approaches infinity.
         | 
         | 2. The result is an "almost surely" result, i.e. in the
         | collection of all possible infinite samples, the set of samples
          | for which it fails has 0 measure. In non-technical terms this
         | means that it works for typical random samples and may not work
         | for handpicked counterexamples.
         | 
         | In our particular case let f=Sha256. Then X must be discrete,
         | i.e. a natural number. Now the particulars depend on the
         | distribution on X, but the general idea is that since we have
         | discrete values, the probability that we get an infinite sample
         | where the values tend to infinity is 0. So we get that in a
          | typical sample there's going to be an infinitude of x ties and
          | furthermore most x values aren't too large (in a way you can
          | make precise), so the tie factors l_i dominate since there
          | just aren't that many distinct values encountered in total.
          | And so we get that the coefficient tends to 1.
        
         | nnm wrote:
          | This works for a hash function because a hash function does
          | not introduce uncertainty, i.e. Y = f(X), not Y = f(X) +
          | random_noise. I just tested it:
         | 
         | https://ibb.co/nCXVSqB
        
         | idealmedtech wrote:
         | Perhaps they mean continuous or differentiable functions
         | (evident by the desire to put an order on the elements)
         | 
         | edit: was wrong about sha256
        
           | dandanua wrote:
           | That's right, the paper has practical estimates of
           | convergence only for continuous functions (kind of). For hash
           | functions this coefficient is useless.
        
           | bawolff wrote:
           | sha256() doesn't involve state or decision making.
        
         | meiji163 wrote:
         | In Theorem 1.1, f is a function of random variables, which
         | might be where you're confused.
         | 
         | > doesn't that mean y is a predictable function of x
         | 
          | Sort of: as a function of real numbers, sha256 is just some
          | deterministic function. But the point is that its output
          | "looks like" a uniform random variable for any reasonable
          | input distribution, i.e. as a function of random variables the
          | input and output variables should have 0 correlation.
        
         | Sniffnoy wrote:
         | No. If you have a good hash function, that means it's
         | _computationally_ infeasible to determine anything about x
          | based only on y. It's not _statistically_ impossible at all;
         | "statistically" doesn't concern itself with computational
         | difficulties.
         | 
         | This is similar to how, e.g., we generally assume that AES is
         | unbreakable from a _computational_ point of view, but if you
         | want a _statistically_ unbreakable cipher, your only (IINM)
         | option is a one-time pad.
        
         | hinoki wrote:
         | The summary says that it works if it is a measurable function
         | [0], which is structure preserving. So sha256 would scramble
         | the input too much for it to be detected here.
         | 
         | [0] https://en.m.wikipedia.org/wiki/Measurable_function
        
           | [deleted]
        
           | kkylin wrote:
           | I've not yet read the article, just the abstract. But the
           | abstract is pretty precise, and being "measurable" is a very
           | weak constraint. For (computationally) complex functions like
           | a hash, my first guess is the "escape clause" is in the
           | number of samples needed for the statistic to converge to its
           | expected value.
        
           | nnm wrote:
           | Any function that can be implemented in a computer is a
           | Lebesgue measurable function.
        
             | bollu wrote:
             | I don't doubt it, but I don't immediately see the proof
             | either. What's the key idea?
        
               | contravariant wrote:
               | Most spaces that computers deal with are basically
               | discrete.
               | 
               | Technically this may not always be the case but it's very
               | hard to construct a convincing counter example.
        
         | Hermel wrote:
         | Simple: if in the whole sample, the same x always comes with
         | the same y, then y is a function of x.
         | 
         | Example:
         | 
         | x1 = 1, y1 = 2
         | 
         | x2 = 1.1, y2 = 2
         | 
          | Here, y is a function of x, because if we know that x = 1,
          | then we also know that y = 2. However, x is not a function of
          | y, as we don't know what value x has given that y = 2.
         | 
         | I hope that made it more clear. Here, "function" simply means
         | that every time we started with the same x, we also got the
         | same y.
        
           | canjobear wrote:
           | Seems that by this logic if you don't have any repeats in
           | your x values then you are bound to conclude that any set of
           | y values is a function of the x.
        
           | roenxi wrote:
            | Either Chatterjee has made a mistake, that claim is wrong,
            | or that definition has some extremely precise meaning,
            | because:
            |     > counterexample = tibble(x = 1:6,
            |     +   y = c(1.1,-1.2,1.3,-1.4,1.5,-1.6))
            |     > counterexample
            |     # A tibble: 6 x 2
            |           x     y
            |       <int> <dbl>
            |     1     1   1.1
            |     2     2  -1.2
            |     3     3   1.3
            |     4     4  -1.4
            |     5     5   1.5
            |     6     6  -1.6
            |     > XICOR::calculateXI(xvec = counterexample$x,
            |     +   yvec = counterexample$y)
            |     [1] -0.2857143
            |     > -0.2857143 == 1
            |     [1] FALSE
           | 
            | & I assume you could get data like that in the wild by
            | sampling a sine wave as a time series.
            | 
            | I think the abstract is a bit strong; it probably means
            | "converges to" for a large repeating sample.
        
             | cirpis wrote:
             | There is no mistake in the definition and this is all
             | elaborated upon in page 4 of the article.
             | 
             | Quote: " On the other hand, it is not very hard to prove
             | that the minimum possible value of xn(X, Y ) is -1/2 +
             | O(1/n), and the minimum is attained when the top n/2 values
             | of Yi are placed alternately with the bottom n/2 values.
             | This seems to be paradoxical, since Theorem 1.1 says that
             | the limiting value is in [0, 1]. The resolution is that
             | Theorem 1.1 only applies to i.i.d. samples. Therefore a
             | large negative value of xn has only one possible
             | interpretation: the data does not resemble an i.i.d.
             | sample."
             | 
              | You propose Y = f(X) = (-1)^(X+1)*(1+X/10) as your
              | functional relation, which is measurable if X is discrete,
              | and indeed if we let x_n = n, then the limiting value of
              | the estimator will be -1/2, not 1.
             | 
             | However, this is just a particular value of the estimator
             | on a particular sample of (x,y). The theorem is an "almost
             | surely statement", which means that it fails for a set of
              | samples with 0 probability.
             | 
             | Indeed, if we actually picked a random sample of (X, f(X))
              | with your f, then independently of the distribution of X,
             | since X is discrete, we would expect to see many ties
             | (infinitely many ties as the number of samples goes to
             | infinity). This would mean that |r_{i+1}-r_i| is 0 for all
             | but finitely many i and so the estimator would be 1.
             | 
             | This also covers the case of f being a hash function as
              | mentioned before. Worse yet, it only has finitely many
              | different values, so once again infinitely many ties.
             | 
             | The way the estimator is defined, it will take care of the
             | (X, f(X)) case fairly easily as for a typical sample you
             | will get x values that cluster and for a measurable
             | function this implies that the resulting values will be
             | close together and so the rank differences will be small.
             | 
              | This discussion probably wasn't included in the abstract
              | since it's fairly simple measure theory which most experts
              | reading the article will be intimately familiar with.
             | mellavora wrote:
              | As an alternative to cirpis's reply, and noting that in
              | roenxi's example roenxi is using the R package supplied by
              | the paper's author (suggesting a real interest in
              | understanding), I again refer to page 4 of the article.
             | 
             | "it is not very hard to prove that the minimum possible
             | value of [the proposed measure] is -1/2 + O(1/n), and the
             | minimum is attained when the top n/2 values of Yi are
             | placed alternately with the bottom n/2 values. ...Theorem
             | 1.1 only applies to i.i.d. samples. Therefore a large
             | negative value of [the measure] has only one possible
             | interpretation: the data does not resemble an i.i.d.
             | sample."
             | 
              | roenxi, I congratulate you on your example; it shows that
              | you are working to understand the measure. "Where does it
              | break?" is always a good question.
        
       | petters wrote:
       | I implemented this in Python and it seems to work really well!
       | Thanks for this link!
       | 
       | Pretty cool that it can also work for e.g. textual data in
       | principle, as someone mentioned.
        
       | neeleshs wrote:
       | There is also another paper comparing the power of Chatterjee
       | coefficient with other traditional ones.
       | 
       | https://arxiv.org/abs/2008.11619
        
         | abeppu wrote:
         | It is interesting to note that the Chatterjee paper makes a
         | point of mentioning Pearson, Spearman, Kendall's tau, whereas
         | the ones focused on in this paper all appear as citations but
         | aren't explicitly discussed.
        
           | nestorD wrote:
            | To be fair, Pearson, Spearman and Kendall's tau are the
            | coefficients people use in practice. Had I been in the
            | author's position I would have done the same: cite all the
            | interesting developments but compare with what people
            | actually use.
            | 
            | Comparing with something people barely know might be nice,
            | but people have a limited attention span, so I would push
            | that to the annexes at best and focus on the more important
            | parts.
        
       | fieldcny wrote:
        | Dumb question: what is the significance of this?
        
         | csee wrote:
         | It's fast to calculate, simple to understand, and doesn't make
         | assumptions about the underlying distributions. This makes it a
         | more effective generic tool for practitioners. Perhaps useful
         | in the way the Pearson correlation is useful.
         | 
         | I'd like to learn more about the small sample properties.
         | Proofs of asymptotics are necessary but less interesting. But
          | the author's demonstrations on example data sets suggest it
          | behaves sensibly.
        
         | singhrac wrote:
         | Well, in the abstract it says: "[a coefficient] which is 0 if
         | and only if the variables are independent and 1 if and only if
         | one is a measurable function of the other", the former property
         | which is not true of general random variables (but is true of
         | Gaussians, which is one part of the reason they are used
         | everywhere). I'm not sure about the latter property, actually,
         | but I also doubt it's true.
         | 
         | Worth noting the author is a highly regarded professor at
         | Stanford.
        
         | rwilson4 wrote:
         | Correlation typically means y is a linear function of x, but
          | people usually interpret it (incorrectly) as: knowing x tells
         | you something about y. If y = x^2, then y is determined
         | completely by x, but since it's nonlinear the correlation may
         | actually be zero depending on the distribution of x. This paper
         | proposes a statistic that will indicate if y is related to any
         | function of x, linear or nonlinear.
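          | 
          | A concrete sketch of the y = x^2 case (assuming numpy and
          | scipy are available):
          | 
          |     import numpy as np
          |     from scipy.stats import pearsonr
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.uniform(-1, 1, 100_000)
          |     y = x ** 2                # fully determined by x
          |     print(pearsonr(x, y)[0])  # ~0: no linear relationship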
        
           | ouid wrote:
           | I don't think that there's a standard enough mathematical
           | definition of correlation to say that. Perhaps the word has
            | been co-opted, but the linked paper suggests that the
            | co-option isn't accepted.
        
           | KarlKemp wrote:
            | This is... quite wrong? The dictionary says:
            | 
            |   1. a mutual relationship or connection between two or more
            |      things
            |   2. [Statistics] interdependence of variable quantities.
            |   3. [Statistics] a quantity measuring the extent of the
            |      interdependence of variable quantities.
            | 
            | The most sympathetic to your definition is Wikipedia:
            | 
            |   In statistics, correlation or dependence is any
            |   statistical relationship, whether causal or not, between
            |   two random variables or bivariate data. In the broadest
            |   sense correlation is any statistical association, though
            |   it actually refers to the degree to which a pair of
            |   variables are linearly related.
            | 
            | And that's the _mathematical_ formulation. Correlation also
            | has a meaning in everyday speech, and mathematics doesn't
            | have the authority to just adopt terms and then claim people
            | are wrong after they've changed the meaning.
           | 
           | Also correlation very definitely means that knowing <x> tells
           | you something about <y>. And vice versa. Like, for example:
           | its value. Or at least a better idea of it than pure guessing
           | without correlation.
        
             | doovd wrote:
              | That's a bit unnecessarily pedantic; I think we all
              | understand in which context we are talking about
              | correlation.
        
             | PopeUrbanX wrote:
             | I think it's safe to say you're not the intended audience
             | for anything math-related, given that you're going to a
             | dictionary to try to figure things out...
        
             | popcube wrote:
              | Sorry, but scientists always take normal English words and
              | give them a specific meaning within their own field. It
              | may be confusing, but the advantage of most people
              | understanding English is worth it.
        
             | UncombedCoconut wrote:
             | Hello, see here for an explanation: https://en.wikipedia.or
             | g/wiki/Pearson_correlation_coefficien... It's widely
             | understood that the words "correlation" and "uncorrelated",
             | when used in the context of statistics and not otherwise
             | qualified, are shorthand for this definition in particular.
             | By "otherwise qualified" I mean, for example, saying
             | "Spearman's correlation" (in in the OP's abstract) to
             | specify this one: https://en.wikipedia.org/wiki/Spearman%27
             | s_rank_correlation_...
        
               | csee wrote:
               | I think that depends on context. Sometimes, in a
               | technical setting correlation just means dependence as an
               | abstract concept, and this includes non-linear
                | dependence. Similar to how in financial circles,
                | volatility
               | doesn't mean standard deviation, but in colloquial
               | settings it does.
        
               | mattkrause wrote:
               | That matches my experience too.
               | 
                | Correlation, in general, just means some sort of
                | statistical dependence: knowing _x_ tells you something
                | about _y_. It's often "operationalized" by computing
                | Pearson's _r_: it's easy to do and there's lots of
                | associated theory.
               | 
               | However, I would find it absolutely bizarre if someone
               | showed a plot with obvious non-linear dependence and
               | described it as "uncorrelated". In that case, the low _r_
               | reflects a failure of the measuring tool rather than
               | something being measured.
        
         | femto wrote:
         | From a signal processing perspective: being able to recognise
         | signals in the presence of interference, noise and distortion.
         | 
         | For example, you might have a radio signal (such as WiFi) that
         | you want to receive. First step is that you have to pick that
         | signal out of whatever signal comes out of your radio receiver:
         | which will be the WiFi signal along with all sorts of noise and
         | interference from other users. Typically the search will be
         | done with the mentioned "Pearson's Correlation", using it to
         | compare the received signal with an expected template: a value
         | of 1.0 meaning the received signal is a perfect match with the
         | template, a value of 0.0 meaning no match at all. If the wanted
         | signal is present, interference, noise and distortion will
         | reduce the result of the correlation to less than 1.0, meaning
         | you might miss the WiFi signal, even though it is present.
         | 
         | This article is about coming up with a measure that gives a
         | more robust result in the face of noise, interference and
         | distortion. It's fundamental stuff, in that it has quite
         | general application.
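          | 
          | A toy version of that search (a sketch assuming numpy, nothing
          | WiFi-specific): slide a known template along a noisy signal
          | and score each offset with a normalized correlation.
          | 
          |     import numpy as np
          | 
          |     rng = np.random.default_rng(0)
          |     template = np.sin(np.linspace(0, 8 * np.pi, 64))
          |     signal = rng.normal(scale=0.5, size=1024)
          |     signal[400:464] += template   # wanted signal, buried
          | 
          |     def ncc(a, b):
          |         a = (a - a.mean()) / a.std()
          |         b = (b - b.mean()) / b.std()
          |         return np.mean(a * b)
          | 
          |     scores = [ncc(signal[i:i + 64], template)
          |               for i in range(len(signal) - 64)]
          |     print(int(np.argmax(scores)))  # peaks near offset 400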
        
           | loxias wrote:
           | (Yay signal processing!)
           | 
           | Skimming it now, this looks wild. Using the variance of the
           | _rank_ of the dataset (for a given point, how many are less
           | than that point) seems... weird, and throwing out some
            | information. The author seems legit tho, so I can't wait to
           | try drop-in implementing this in a few things.
        
             | mattkrause wrote:
             | Rank-transforms are pretty common: they show up in a lot of
             | non-parametric hypothesis tests, for example.
             | 
             | The neat thing about ranks is that, in aggregate, they're
             | very robust. You can make an estimate of the mean
             | arbitrarily bad by tweaking a single data point: just send
             | it towards +/- infinity and the mean will follow. The
             | median, on the other hand, is barely affected by that sort
             | of shenanigans.
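              | 
              | A tiny sketch of that robustness (assuming numpy):
              | 
              |     import numpy as np
              | 
              |     data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
              |     bad = data.copy()
              |     bad[-1] = 1e9   # one wild outlier
              |     print(np.mean(data), np.mean(bad))
              |     # 3.0 vs ~2e8: the mean follows the outlier
              |     print(np.median(data), np.median(bad))
              |     # 3.0 vs 3.0: the median barely notices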
        
       | ndynan wrote:
       | My understanding is that one of the major critiques of
       | statistics, especially its use in psychology, has been the use of
       | models which are derived from the mean.
       | 
       | There are inherent flaws/assumptions to this approach which Peter
       | Molenaar has done extensive work to critique (See Todd Rose's
       | book on the subject). For anyone who understands the technique
       | presented in this paper, does it also depend on the mean as a
       | model like when calculating Pearson's r?
        
         | extremelearning wrote:
         | This is an order-based algorithm, so it is more related to the
         | median than the mean.
         | 
         | Another very useful consequence of being order-based, is that
          | this new coefficient is much more robust to noise/outliers
          | than the classical Pearson correlation coefficient.
        
         | LudwigNagasena wrote:
         | I think there are far bigger problems with the lack of
         | theoretical foundations and abuse of p-values rather than with
         | ergodicity or whatever is the pet peeve of Peter Molenaar.
        
         | mellavora wrote:
         | Isn't Molenaar looking at networks of symptoms over time? Yes,
         | in any multi-variate time series, when one is searching for
         | relationships between the variables (and allowing that you may
         | have multiple sets of time series which may have some type of
         | grouping, i.e. observations from a set of people with one
         | diagnosis vs observations from a set of people with a contrary
          | or with no diagnosis), then yes, any attempt to find
          | correlations in the multivariate signal needs to account for
          | the underlying statistical distribution of the signal
          | components.
         | The normal distribution isn't a bad first a priori
         | approximation, but you really need to check.
         | 
         | side note- it also isn't clear that you can group by diagnosis,
         | see, for example, https://pubmed.ncbi.nlm.nih.gov/29154565/,
         | which shows that even within diagnostic groups there is
         | substantial individual variation.
        
       | neycoda wrote:
        | Correlation never proves causation, but it often hints at it,
        | and the two are often directly entangled. Take the lead in
        | gasoline being tied to city crime: the correlation likely was
        | directly related to causation. That's not true of all
        | correlations, but typically cause and effect have a correlation
        | to something, even if there's only one and many other
        | correlations that appear meaningful are meaningless.
        
       | The_rationalist wrote:
        | Why would someone use Pearson or Spearman coefficients vs this
        | new one?
        
         | [deleted]
        
         | tomrod wrote:
         | Pearson is for affine/linear+intercept relationships. Spearman
         | is for monotonic relationships.
        
         | rwilson4 wrote:
         | If the relationship is linear, I'm guessing a test based on
         | Pearson has greater power.
        
           | mellavora wrote:
           | power and interpretability.
           | 
           | Assuming a linear relationship, if you know the correlation
           | coefficient, you can predict unobserved values of y based on
           | a known x with good accuracy.
           | 
           | y = ax + b + error
           | 
           | where strong correlation means error is small.
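            | 
            | A quick sketch of that prediction step (assuming numpy; the
            | least-squares slope is a = r * s_y / s_x):
            | 
            |     import numpy as np
            | 
            |     rng = np.random.default_rng(0)
            |     x = rng.normal(size=10_000)
            |     y = 2 * x + 1 + rng.normal(scale=0.3, size=10_000)
            | 
            |     r = np.corrcoef(x, y)[0, 1]
            |     a = r * y.std() / x.std()
            |     b = y.mean() - a * x.mean()
            |     print(a, b)   # close to the true a = 2, b = 1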
        
           | im3w1l wrote:
           | Seems likely. The presented coefficient only looks at the
           | ordering of the X-values, and how it relates to the ordering
           | of the Y-values. All other information is thrown away. That's
           | how it can be so general, but it should come at the expense
           | of power.
        
             | emerged wrote:
              | Ah, that makes sense intuitively. I was confused about how
              | nonlinear correlation could be detected without knowing
              | anything about the function itself.
        
       | tdullien wrote:
       | I have a very naive question: What are the downsides of
       | estimating mutual information instead?
       | 
       | I have a math (but not statistics) background, and am sometimes
       | bewildered by the many correlation coefficients that float around
       | when MI describes pretty exactly what one wants ("how much does
       | knowledge of one variable tell you about the other variable"?).
       | 
       | So ... what am I not understanding?
        
         | tibbar wrote:
         | Different coefficients help you look at different kinds of
         | relationships. For example, Pearson's R tells you about linear
         | relationships between variables -- it's closely tied to the
         | covariance: "how useful is it to draw a line through these data
         | points, and how accurate is interpolating likely to be?".
         | 
         | Spearman's correlation helps you understand monotonic/rank-
         | order relationships between variables: "Is there a trend where
         | increasing X tends to also increase Y?" (This way we can be
         | just as good at measuring the existence of linear, logarithmic,
         | or exponential relationships, although we can't tell them
         | apart.)
         | 
         | Mutual information helps you understand how similar two
         | collections of data are, in the sort of unstructured way that's
         | useful in building decision trees. You could have high mutual
         | information without any sort of linear or monotonic
         | relationship at all. Thus it's more general while at the same
         | time not telling you anything that would be helpful in
         | building, for instance, a predictive multivariate linear model.
         | 
         | TLDR; More specific coefficients leverage assumptions about the
         | structure of the data (eg linearity), which can help you
         | construct optimal versions of models under those assumptions.
         | Mutual information doesn't make any assumptions about the
         | structure of the data so it won't feed into such a model, but
         | it still has lots of applications!
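          | 
          | As a rough illustration (a sketch assuming numpy, scipy and
          | scikit-learn): a non-monotonic relationship that Pearson
          | misses but a binned mutual-information estimate picks up.
          | 
          |     import numpy as np
          |     from scipy.stats import pearsonr
          |     from sklearn.metrics import mutual_info_score
          | 
          |     rng = np.random.default_rng(0)
          |     x = rng.uniform(-1, 1, 50_000)
          |     y = np.cos(3 * x) + rng.normal(scale=0.05, size=50_000)
          | 
          |     print(pearsonr(x, y)[0])   # ~0: no linear relationship
          |     xb = np.digitize(x, np.linspace(-1, 1, 20))
          |     yb = np.digitize(y, np.linspace(y.min(), y.max(), 20))
          |     print(mutual_info_score(xb, yb))  # clearly > 0 (nats)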
        
         | mellavora wrote:
         | Instead of mutual information, can I recommend the Transfer
         | Entropy?
         | 
         | https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.85...
         | 
         | https://en.wikipedia.org/wiki/Transfer_entropy
         | 
          | It is designed for time series, but can be adapted to a more
          | general case. The advantage over MI is that with MI you can
          | only see that A and B are related, while TE can show that A
          | drives B but not the reverse.
        
           | LittlePeter wrote:
           | arxiv link: https://arxiv.org/abs/nlin/0001042
        
         | afiori wrote:
          | In the article it is explained that the purpose of this
          | coefficient is to estimate how much Y is a function of X [1]
          | (or how noisy this association is); in particular this
          | coefficient is 1 iff Y is a function of X.
          | 
          | With MI (the article claims that) you can have a coefficient
          | of 1 without Y being a function of X.
          | 
          | [1] This means that this coefficient is intentionally not
          | symmetric.
        
         | anewhnaccount2 wrote:
         | I'm also interested in this, having tried and semi-successfully
         | used mutual information for finding associations between
         | multinomial variables. As an even more naive question, I find
         | the actual selection of estimators bewildering. How do I know
         | which estimator to use for mutual information? How do I know if
         | my chosen estimator has converged or is doing a bad job on my
         | data? Bringing it back to the topic at hand, does the estimator
         | presented in the paper provide good estimates for a wider
         | variety of cases than the mutual information plug-in estimator?
         | If so I can see it might be nice for simplicity reasons alone.
         | Can we have different estimators for this new correlation
         | coefficient? Any ideas what that would look like?
        
         | tbenst wrote:
         | MI is quite useful and widely used. It typically requires
         | binning data though when distributions are unknown /
         | empirically estimated. This approach is a rank-based score,
         | more similar to Spearman correlation than Pearson. This allows
         | for nonlinear relationships between the two variables.
         | 
          | A slightly critical review of the work can be seen here:
         | https://academic.oup.com/biomet/advance-
         | article/doi/10.1093/.... They argue that the older forms of
         | rank correlation, namely D, R, and tau*, are superior.
         | Nonetheless, it seems like a nice contribution to the stats
         | literature, although I doubt the widespread use of correlation
         | is going anywhere.
        
           | spekcular wrote:
           | Freely available version of the review paper, for those who
           | want it: https://arxiv.org/abs/2008.11619
        
         | dcanelhas wrote:
          | Mutual information is not trivial (or sometimes even possible)
          | to estimate in many practical situations, as far as I know.
          | Example applications in robotics or computer vision where
          | mutual information would be useful are segmentation and
          | denoising of unordered 3D point data.
        
           | mellavora wrote:
           | Yes, as someone mentioned above, the problem is getting the
            | underlying distribution of the data, so you can measure -SUM
            | p_i log(p_i); this usually involves some binning, which can
            | be tricky (and yes, I know the formula I gave is entropy,
            | not MI).
           | 
           | I try to remind myself that "it is just a model", as a
           | corollary to "all models are wrong, some are useful." You are
           | _never_ dealing with the real world. And you are usually
           | trying to estimate some future as of yet unobserved signal
            | based on existing data. In other words, if your bins are
            | reasonable and reasonably accurate, you can build a working,
            | if not perfect, system.
           | 
           | Don't try to optimize testing error performance to a value
           | lower than the irreducible error in the system.
        
             | AstralStorm wrote:
             | Even with binning, the problem is one of accurate sampling
             | from an unknown probability distribution.
             | 
             | Biased samples produce biased results and this OP
             | correlation coefficient might be sensitive to such an
             | issue.
             | 
             | In one of the projects we were assuming gamma distribution
             | (for speech processing) and sampling that is notoriously
              | hard. Trying to use binned MI produced serious errors, as
              | opposed to the minimum-MSE one; even maximum likelihood
              | did better (if noisy).
        
             | dcanelhas wrote:
             | I'm not sure I understand how binning applies in e.g.
             | segmentation of point clouds into distinct objects. The
             | data would likely contain a mix of unknown distributions,
             | partially observed (due to occlusions) and not easily
             | parametrized (chair, table, toaster, etc.)... Locally you
             | can find planar patches though, so correlation can still be
             | useful.
        
       | nmca wrote:
       | non-pdf link to the article via arxiv vanity: https://www.arxiv-
       | vanity.com/papers/1909.10140/
        
       | laplacesdemon48 wrote:
       | Is there a way to extend this to measure a multivariate
       | relationship?
       | 
       | For example: Cor(X, Y & Z)
       | 
       | I know you could run them pairwise but it's possible Cor(X, Y)
       | and Cor(X, Z) are close to zero but Cor(X, Y & Z) is close to 1.
        
         | tomrod wrote:
         | This is a good question. When I read the article I'll keep it
         | in mind!
        
         | kortex wrote:
         | Seems easy enough to play around with, and need not be strictly
         | numbers either, as long as rank is defined on your fields (they
         | are sortable...total ordering? I'm rusty on my terminology).
         | Basically sort X, take the variance of the rank of Y, Z, etc.
          | Much in the same way you would compute multivariable
          | correlation.
        
       | SubiculumCode wrote:
       | I think there could be a lot of use of distance based
       | correlations https://towardsdatascience.com/introducing-distance-
       | correlat...
        
       | infocollector wrote:
        | Am I reading the paper's definition correctly? If yes, why do a
        | vertical line and a horizontal line in 2D have such different
        | correlations?
        | 
        |     import numpy as np
        |     from scipy.stats import rankdata
        | 
        |     def corr(x, y):
        |         n = len(x)
        |         # sort the (x, y) pairs by x, then rank the y values
        |         an_array = np.stack((x, y), axis=-1)
        |         sorted_array = an_array[np.argsort(an_array[:, 0])]
        |         r = rankdata(sorted_array[:, 1])
        |         s = sum(np.abs(np.diff(r)))
        |         return 1.0 - (3.0 * s / (n * n - 1.0))
        
         | yorwba wrote:
         | You're using the formula for the case where there are no ties,
         | but the vertical line has all X values tied and the horizontal
         | line has all Y values tied. Also, the case where Y is a
         | constant is explicitly excluded from consideration in the
         | paper.
        
       | The_rationalist wrote:
       | could this better represent semantic textual similarity?
       | https://paperswithcode.com/sota/semantic-textual-similarity-...
        
       | m1sta_ wrote:
       | How does this relate to Maximal Information Coefficient (MIC),
       | MGC, PPS and the like?
        
         | derbOac wrote:
         | The paper discusses MIC in particular at least. They suggest
         | that MIC sometimes overstates the strength of a noisy
         | relationship, giving the example of a bivariate normal mixture
         | (Figure 9). In that example MIC is 1, but the new measure is
         | 0.48 or something, which seems more reasonable.
        
       | d883kd8 wrote:
       | There will be no AI fire alarm
        
       | kortex wrote:
       | There are so many super cool aspects to this!
       | 
       | - it's always fun when a new equation or formula is discovered,
       | doubly so when it is very practical
       | 
       | - actually really easy to wrap your head around it
       | 
       | - seems very computationally efficient. Basically boils down to
       | sorts and distance
       | 
        | - not limited to strict numeric fields (integers and reals).
       | Anything with an ordering defined can act as your Xs/Ys:
       | characters, words, complex/vectors (under magnitude). I think you
       | could even apply it recursively to divide and conquer high
       | dimensional datasets.
       | 
        | - looks useful in both LTI and non-stationary signal analysis
       | 
       | - possible use cases: exoplanet search, cosmology in general,
       | ecology, climate, cryptanalysis, medicine, neuroscience, and of
       | course, someone will find a way to win (more likely lose) money
       | on stonks.
        
         | thesz wrote:
         | You can update Pearson correlation online, you cannot update
         | online OP correlation coefficient.
         | 
          | Pearson and other correlation coefficients are linear, O(n);
          | sorting would incur a logarithmic multiplier, O(N log N).
         | 
         | Thus, it is not computationally efficient.
         | 
         | To break ties randomly one has to have good random number
         | generator, which is a problem in itself.
         | 
         | Finally, if you want to have differentiable version of this
         | correlation coefficient, you will need to use sorting networks
         | which are O(Nlog^2(N)).
         | 
          | But it is a cool idea nevertheless; it brings up the use of
          | ranking in statistics, and there are ties to other areas of
          | statistical science.
         | 
          | For example, it appears that you can more efficiently prune
          | language models if you use a rank metric rather than
          | probability [1].
         | 
         | [1] https://aclanthology.org/P02-1023.pdf
        
           | TeeMassive wrote:
            | > Pearson and other correlation coefficients are linear,
            | O(n); sorting would incur a logarithmic multiplier, O(N log
            | N). Thus, it is not computationally efficient.
           | 
           | That doesn't seem a deal breaker to me. Sure if you're
           | dealing with billions of data entries it will be 9 times
           | slower, but throwing 9 times more computational power to
           | solve a problem is far from something unheard of nowadays.
        
             | tromp wrote:
             | > it will be 9 times slower
             | 
             | Closer to 30 times slower, as the logarithm in sorting
             | complexity is a binary rather than decimal one.
        
             | estomagordo wrote:
             | I guess more like 30 times slower. Consider mergesort and
             | quicksort, which both (aim to) halve the search space with
             | every iteration. There is a log2 amount of steps. 2 is our
             | base. And yes, base 2 and base 10 logarithms will always be
             | within a constant factor of one another, but seeing as we
             | _are_ talking about constant factors here to begin with...
        
               | asQuirreL wrote:
               | If we are talking about billions of entries, even of just
               | 32bit integers, we are talking about 4GB of data, your
               | sort is probably going to be IO dominated, not CPU
               | (comparison) dominated. In which case you'd likely use a
               | sorted data structure with a much higher branching factor
               | than 2 (like a B-tree or LSM tree).
               | 
               | Take the B-tree example, with 4KB pages, you can fit 1000
               | integers in there for a branching factor of 500, and a
               | performance multiplier of just over 3.
        
           | jazzkingrt wrote:
           | Can you explain in more detail why this can't be updated
           | online?
           | 
           | We can track the rank of newly sampled pairs in LogN using
           | something like an order statistic tree:
           | 
           | https://en.wikipedia.org/wiki/Order_statistic_tree
           | 
           | But I guess the problem is that with each new pair, O(n)
           | pairs could change their contribution to the correlation
           | coefficient.
        
             | thesz wrote:
              | Your point is a very valid one. I was not able to find out
              | how to keep ranks updated efficiently.
        
             | phkahler wrote:
             | But now we're moving away from being simple to compute.
        
           | asxd wrote:
           | Out of curiosity, why do you think having a good random
           | number generator is problematic? It seems like it's easy
           | enough to access one in most situations.
        
             | medo-bear wrote:
             | how do you sample from an arbitrary probability
             | distribution? it is not a trivial problem at all
        
               | kortex wrote:
               | Why do you need an arbitrary distribution? We are talking
               | about breaking ties, just use uniform. You can probably
               | just use RDRAND.
        
               | Hendrikto wrote:
               | They are equivalent. Generating an arbitrary distribution
               | from a uniform one is trivial.
        
               | kortex wrote:
               | If you are willing to accept lookup tables or
               | approximations, then yeah, arbitrary distributions are
               | trivial. However for certain domains, you may want a more
               | closed form solution for mapping uniform to your
               | distribution, which may not be obvious. E.g. I would say
               | generating pink noise is not "trivial" because there is
               | no closed form solution (c.f. generating gaussian
               | distribution via Box-Muller, that is trivial), so you
               | need to pick a method which may have tradeoffs.
        
               | medo-bear wrote:
                | I agree. To break ties, a uniform distribution should be
                | sufficient.
        
               | asxd wrote:
               | I guess I assumed the operative word was "good". The term
               | "random number generator" almost always refers to a
               | generator that intends to produce a uniform distribution.
        
               | Hendrikto wrote:
               | If you can sample from a uniform distribution, it is
               | trivial to turn that into any arbitrary distribution.
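                | 
                | The standard trick is inverse-transform sampling
                | (a sketch assuming numpy), e.g. for Exp(1) whose
                | inverse CDF is -ln(1 - u):
                | 
                |     import numpy as np
                | 
                |     rng = np.random.default_rng(0)
                |     u = rng.random(100_000)
                |     samples = -np.log1p(-u)
                |     print(samples.mean())  # ~1.0 for Exp(1)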
        
             | thesz wrote:
             | Yes, exactly. There are so many to choose from. ;)
             | 
             | There are several discussions here in HN about PRNGs of
             | different kinds and people still want to invent new ones.
             | Why? Because old ones do not satisfy them in full.
             | 
             | Returning to the problem at hand. How many consecutive ties
              | do we expect? This would help us define how many bits we
              | should extract from the state: for N expected consecutive
              | ties we should extract 2 log_2(N) bits or more (birthday
              | paradox). For 64K ties we need 32 bits, which is a fair
              | amount of state, 1/8 of a typical PRNG discussed here.
             | 
             | Most PRNGs are splittable and this also brings the question
             | "how does splitting affect bits generated?" Will these bits
              | be correlated between the split generators?
             | 
              | I definitely do not want a correlation coefficient
              | computation that produces close results for provably
              | different sets due to correlation between tie-broken
              | sections filled with random numbers.
        
               | ykonstant wrote:
               | > Yes, exactly. There are so many to choose from. ;)
               | 
               | Just use a good PRNG to pick one of them :D
        
           | eutectic wrote:
           | You don't have to use sorting networks; to differentiate (or
           | get a subgradient, anyway) a sorted list you just pass the
           | derivative back to the original index, so you can use any
           | sorting algorithm as long as you keep track of the
           | permutation. You could even use radix sort.
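            | 
            | A tiny sketch of that idea (assuming PyTorch is available;
            | torch.sort keeps the permutation and routes gradients back
            | through it):
            | 
            |     import torch
            | 
            |     x = torch.tensor([3.0, 1.0, 2.0], requires_grad=True)
            |     sorted_x, perm = torch.sort(x)
            |     w = torch.tensor([1.0, 10.0, 100.0])
            |     loss = (sorted_x * w).sum()
            |     loss.backward()
            |     print(perm, x.grad)  # grads land on the original indices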
           | 
           | Also, I would say that random number generation is a well-
           | solved problem.
        
           | omnicognate wrote:
           | > Thus, it is not computationally efficient.
           | 
           | Is there some standard sense in which only (sub)linear
           | algorithms are considered "computationally efficient"?
           | O(nlogn) is fine for a very wide variety of practical uses.
        
             | thesz wrote:
                | I should point out that Pearson's correlation coefficient
                | can be updated on-line; the storage complexity is O(1).
                | The storage complexity for sorting is O(N).
                | 
                | Thus, this particular algorithm will not fit into, say, a
                | traffic analysis framework.
        
               | eru wrote:
               | The runtime and storage requirements you cite only apply
               | when you need exact answers.
               | 
               | If you want to compute a correlation coefficient, you are
               | probably happy to get eg 3 significant digits of
               | precision. Thus you can use approximation algorithms and
               | sampling.
        
               | omnicognate wrote:
               | Yeah, this has worse complexity than Pearson. However,
               | it's not competing on complexity.
               | 
               | The point of it is to detect general dependencies, where
               | Pearson only detects linear ones (cases where Y is
               | completely determined by X but their Pearson correlation
               | is 0 are trivial to construct). The "linear" in linear
               | dependency isn't the same as the "linear" in linear
               | complexity here, but it doesn't seem surprising for an
               | algorithm that can detect more general dependencies to
               | have worse than linear performance (and nlogn is pretty
               | much the next best).
               | 
               | In the absence of a known algorithm with the same
               | properties and better complexity, it doesn't seem
               | unreasonable to call it "computationally efficient" when
               | it comes with a complexity sufficient for most uses,
               | rather than say quadratic or exponential.
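                | 
                | One of those trivial constructions, as a quick sketch:
                | 
                |   import numpy as np
                | 
                |   x = np.linspace(-1, 1, 1001)  # symmetric around 0
                |   y = x ** 2                    # Y fully determined by X
                |   print(np.corrcoef(x, y)[0, 1])  # Pearson: ~0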
        
               | thesz wrote:
                | Rank-based correlation coefficients:
                | https://en.wikipedia.org/wiki/Correlation_coefficient#Rank
               | 
               | There are several such algorithms.
               | 
                | Also, there is a comment with a link to a paper comparing
                | this correlation coefficient to some others:
                | https://news.ycombinator.com/item?id=29690208
               | 
               | It looks like this "new coefficient of correlation" is
               | not without flaws compared to others.
        
           | hansvm wrote:
           | Pearson still has a lot of advantages on modern hardware and
           | in practice should be a ton faster, but:
           | 
           | (1) Rank coefficients can absolutely be updated online.
           | 
           | (2) There aren't many applications where an extra
           | polylogarithmic factor is the difference between a good-
           | enough algorithm and something too slow.
           | 
           | (3) The RNG is an implementation detail I'd probably omit.
           | It's asymptotically equivalent to the deterministic operation
           | of rescaling the sum of all pairwise absolute differences of
           | Y values for each run of identical X values. If in your
            | initial sort of the X values you instead lexicographically
            | sort by X then Y, you can get away with a linear algorithm
            | for computing those pairwise absolute differences (see the
            | sketch below).
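            | 
            | Concretely, the linear piece is just the standard identity
            | for sums of pairwise absolute differences over an already-
            | sorted run (a rough sketch; the rescaling into the final
            | coefficient is omitted here):
            | 
            |   def sum_pairwise_abs_diffs(ys_sorted):
            |       # For sorted y_0 <= ... <= y_{k-1}:
            |       # sum_{i<j} (y_j - y_i) = sum_i y_i * (2i - (k - 1))
            |       k = len(ys_sorted)
            |       return sum(y * (2 * i - (k - 1))
            |                  for i, y in enumerate(ys_sorted))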
        
           | petters wrote:
           | > Finally, if you want to have differentiable version of this
           | correlation coefficient
           | 
           | Could you expand a little bit more on how to differentiate
           | this coefficient? Since it is only based on ranks, it feels
           | very non-differentiable.
        
           | kortex wrote:
           | > Pearson and other correlation coefficients are linear,
            | O(n), sorting would incur logarithmic multiplier O(NlogN).
           | 
           | > Thus, it is not computationally efficient.
           | 
            | That is not really what I meant. First, I personally consider
            | NlogN the bar for "efficient" as far as algorithms go.
            | Anything polynomial is inefficient, and anything better than
            | NlogN is, like, "super efficient" in my mind. Maybe that is a
            | weird opinion - should I update my mental model? My reasoning
            | is that a lot of well-known algorithms operate in better than
            | polynomial but worse than linear time.
           | 
            | Second, sorting is a solved problem. Regardless of
            | theoretical efficiency, we have so many ways to sort things
            | that there is almost always a way of sorting data that is
            | really fast in practice.
           | 
           | I didn't know about sorting networks, that is fascinating!
        
             | ummonk wrote:
             | Indeed, in practice the logN factor can only get so big,
             | even if N is taken to be the number of particles in the
             | observable universe.
        
               | [deleted]
        
             | HelloNurse wrote:
              | > Second, sorting is a solved problem. Regardless of
              | theoretical efficiency, we have so many ways to sort things
              | that there is almost always a way of sorting data that is
              | really fast in practice.
             | 
              | By the same reasoning, the economy would be a solved
              | problem: you just need to have money.
              | 
              | Computing correlation by sorting a large number of samples
              | instead of reading them once is a cost difference that can
              | make your data processing practically infeasible, despite
              | the efficiency of sorting algorithms.
        
             | anovikov wrote:
             | I can only second this. Few algorithms are as efficient as
             | NlogN.
        
               | thesz wrote:
               | Compare Fourier (O(NlogN)) and wavelet (O(N)) transforms.
               | Both can be used for similarity (correlation) search
               | through projections.
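                | 
                | (A toy Haar transform, just to show where the O(N)
                | comes from; the per-pass work sums to N + N/2 + N/4 +
                | ... = O(N). A sketch, not a tuned implementation.)
                | 
                |   def haar(x):
                |       # Unnormalized Haar decomposition; len(x) must
                |       # be a power of two.
                |       out = list(x)
                |       n = len(out)
                |       while n > 1:
                |           half = n // 2
                |           a = out[0:n:2]   # even-indexed entries
                |           b = out[1:n:2]   # odd-indexed entries
                |           avg = [(u + v) / 2 for u, v in zip(a, b)]
                |           dif = [(u - v) / 2 for u, v in zip(a, b)]
                |           out[:half] = avg
                |           out[half:n] = dif
                |           n = half
                |       return out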
        
             | eru wrote:
             | > First, I personally consider NlogN the bar for
             | "efficient" as far as algorithms go. Anything polynomial is
             | inefficient, and anything better than NlogN is like, "super
             | efficient" in my mind.
             | 
             | What is efficient and what ain't depends totally on
             | context.
             | 
              | For example, in some domains even proving that a problem
              | is in NP (instead of something worse) means the problem is
              | considered tractable, because we have good solvers for
              | many NP-hard problems in practice.
             | 
             | For other domains, you want O(1) or O(log n) or O(log* n)
             | etc.
             | 
             | In general, P vs NP is such a big deal, because polynomial
             | runtime is generally seen as tractable, and anything worse
             | as intractable.
             | 
             | (Btw, O(n log n) _is_ a polynomial runtime, and so is O(N)
             | or even O(1). Though I can see that as a shorthand of
             | notation you are excluding them.)
        
               | kortex wrote:
               | > (Btw, O(n log n) _is_ a polynomial runtime, and so is
               | O(N) or even O(1). Though I can see that as a shorthand
               | of notation you are excluding them.)
               | 
               | I mean yes, technically n^1 is a polynomial, but normally
               | this is just called linear. You hit the nail on the head
               | with the last sentence. Here I'm using polynomial to
                | basically mean "n^k, k>1". From an NP vs P point of view,
                | P is considered tractable. But from an "actually writing
                | libraries" point of view, I think software developers
                | consider O(n^2) "kinda bleh" if it can be at all avoided.
               | Also developers tend to give much more weight to the
               | actual coefficients / parallelism potential / practical
               | runtime. Neural networks have superlinear bounds
               | everywhere, but GPU goes brrr.
               | 
                | Brass tacks: sorting some tuples, diffing the ranks, and
                | then reducing strikes me as likely to be (see the sketch
                | below):
                | 
                | - pretty fast even in a naive implementation
                | 
                | - well-trodden ground, since it leans entirely on sorting
                | 
                | - easy to write things like shaders for
                | 
                | - amenable to speculative execution
                | 
                | - amenable to divide and conquer
                | 
                | So that is what I mean by efficient.
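                | 
                | The whole "sort, diff the ranks, reduce" pipeline fits
                | in a few lines (a sketch of my understanding of the
                | paper's estimator, assuming no ties; ties are where the
                | RNG discussion above comes in):
                | 
                |   import numpy as np
                | 
                |   def xi(x, y):
                |       # Sort the pairs by x, rank the y values in that
                |       # order, diff the ranks, reduce.
                |       x = np.asarray(x)
                |       y = np.asarray(y)
                |       n = len(x)
                |       y_by_x = y[np.argsort(x)]
                |       r = np.argsort(np.argsort(y_by_x)) + 1
                |       s = np.abs(np.diff(r)).sum()
                |       return 1 - 3 * s / (n**2 - 1)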
        
               | frutiger wrote:
               | The "P" in NP means a solution to a decision problem is
               | verifiable in polynomial time.
               | 
               | The "N" in NP means the solution can be produced by a
               | non-deterministic Turing machine (i.e. a computer that
               | can search the problem space in many dimensions in each
               | step).
               | 
               | NP does not mean "non-polynomial" time to produce a
               | solution, otherwise P != NP by definition. Instead P is a
               | subset of NP.
               | 
                | The open question is whether, for all decision problems
                | whose solutions are verifiable in polynomial time, those
                | solutions can also be found in polynomial time.
        
               | eru wrote:
               | Yes.
               | 
               | I know that the N stands for non-deterministic. Read my
               | comment carefully, it agrees with everything you write
               | here. (Or at least, does not disagree. It's silent on
               | most of the intricacies.)
               | 
               | For many people moving a problem from NP to P means the
               | same as having a tractable solution. See eg Scott
               | Aaronson's writing on the topic.
        
               | frutiger wrote:
               | > Read my comment carefully
               | 
                | Please take a spoonful of your own medicine and read my
                | comment even more carefully. I never suggested my
                | comments contradicted yours.
               | 
               | I added some information for bystanders that may have
               | read your comment and come to (or reinforced) a wrong
               | conclusion.
        
               | eru wrote:
               | Thanks for the clarification!
               | 
               | Yes, interpreting NP as non-polynomial is a common
               | misunderstanding. Especially because it's 'sort-of almost
               | right'.
        
               | saurik wrote:
               | (That you were trying to add and not correct was
               | extremely unclear, FWIW.)
        
               | [deleted]
        
       | shahbazac wrote:
        | This tweet contains a visual comparison between linear
        | correlation and "Chatterjee correlation":
       | 
       | https://twitter.com/adad8m/status/1474754752193830912?s=21
        
         | extremelearning wrote:
         | This new coefficient of correlation is really really awesome,
         | and this visualization shows its value in such a beautifully
         | simple presentation.
         | 
          | It would be great if someone who has Wikipedia edit privileges
          | could edit the Wikipedia article at [1] to describe/link how
          | Chatterjee's correlation coefficient solves many of the known
          | limitations of Pearson's correlation coefficient. ;)
         | 
         | [1]
         | https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
         | 
          | (especially the second top-left diagram)
        
           | [deleted]
        
           | Mathnerd314 wrote:
            | The right place to add it would be here AFAICT:
            | https://en.wikipedia.org/wiki/Correlation#Other_measures_of_...
            | The article on Pearson's shouldn't get sidetracked by
            | discussing other types of variables. (or rather, it already
            | has too many)
           | 
           | But the part that one would add would not necessarily be the
            | definition of the coefficient ξn, but rather the interesting
           | discussion at the beginning about what makes for a good
           | correlation coefficient.
        
             | extremelearning wrote:
              | Thanks. This is what I was intending to say, but you said
              | it much better. :)
        
           | querez wrote:
           | Easy there. New correlation coefficients get proposed all the
            | time (e.g. the introduction of the linked paper lists ~10-20
           | alone!). It's not a good idea to add every newly proposed
           | coefficient to established wiki pages, just because they
           | trend on social media. Yes, the paper looks nice, but if you
           | read any new paper proposing a new measure, they all do!
           | They're meant to be written that way. Let the community
           | decide and test and discuss, and if in 10 years this new
           | coefficient is well accepted and has proven itself, we can
           | think about your proposed edit. Doing it before is putting
           | the cart before the horse, and is a recipe for astroturfing.
        
         | querez wrote:
         | Since it's rank-based, it would be nice to see a comparison to
         | Spearman's correlation instead. It's very easy to find failure
         | cases for Pearson, so the visualization is next to meaningless,
         | IMO.
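          | 
          | Spearman has the same blind spot for non-monotonic dependence,
          | which is exactly why it would be the interesting baseline
          | (a quick sketch with scipy; the cosine example is arbitrary):
          | 
          |   import numpy as np
          |   from scipy.stats import pearsonr, spearmanr
          | 
          |   rng = np.random.default_rng(0)
          |   x = rng.uniform(-1, 1, 10_000)
          |   y = np.cos(4 * np.pi * x)    # non-monotonic dependence
          | 
          |   print(pearsonr(x, y)[0])     # ~0
          |   print(spearmanr(x, y)[0])    # also ~0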
        
       | nextaccountic wrote:
       | Can information theoretic measures like Granger causality[0] be
       | used for this purpose?
       | 
       | [0] https://en.wikipedia.org/wiki/Granger_causality
        
         | csee wrote:
         | Granger causality is about linear inter-temporal dependence.
         | Chatterjee correlation is about non-linear contemporaneous (or
         | time-independent) dependence. They're tools for quite different
         | applications.
        
         | pizza wrote:
          | Looks like you could do the opposite: use this measure as a
          | proxy for correlation between input data X and Y, and see if
          | increased correlation between X and Y is superior to just
          | predicting Y without X. Maybe model the correlation value
          | going over a threshold as a sequence of Bernoulli trials.
        
       ___________________________________________________________________
       (page generated 2021-12-27 23:02 UTC)