[HN Gopher] A New Coefficient of Correlation
___________________________________________________________________
A New Coefficient of Correlation
Author : malshe
Score : 579 points
Date : 2021-12-25 22:30 UTC (1 day ago)
(HTM) web link (arxiv.org)
(TXT) w3m dump (arxiv.org)
| pmayrgundter wrote:
| I didn't get the idea of ranking. But it's simple:
|
| "In statistics, ranking is the data transformation in which
| numerical or ordinal values are replaced by their rank when the
| data are sorted. For example, the numerical data 3.4, 5.1, 2.6,
| 7.3 are observed, the ranks of these data items would be 2, 3, 1
| and 4 respectively. For example, the ordinal data hot, cold, warm
| would be replaced by 3, 1, 2."
|
| https://en.m.wikipedia.org/wiki/Ranking
|
| Also learned that a Spearman coeff is just the Pearson
| coefficient taken on the _rank_ of the data, instead of on the
| raw data.
|
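| A quick check of that fact (a minimal scipy sketch, not from the
| article; the data here is made up):
|
|       import numpy as np
|       from scipy.stats import pearsonr, rankdata, spearmanr
|
|       rng = np.random.default_rng(0)
|       x = rng.normal(size=100)
|       y = x**3 + rng.normal(scale=0.1, size=100)
|
|       # Spearman's rho computed directly...
|       rho, _ = spearmanr(x, y)
|       # ...equals Pearson's r computed on the ranks of the data
|       r_on_ranks, _ = pearsonr(rankdata(x), rankdata(y))
|       print(rho, r_on_ranks)  # agree up to floating point
|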
| But whereas Pearson/Spearman takes the sum of products of
| data/mean differences (Sum (x_i - xbar)(y_i - ybar) / (n*s_x*s_y),
| where xbar is the mean and s = std. dev.), Chatterjee takes a sum
| of rank differences (1 - 3*Sum |r_{i+1} - r_i| / (n^2 - 1)),
| concerning just the ranks of the Y data after the X,Y pairs have
| been sorted by X.
|
| But still missing the intuition for why the sum of difference of
| ranks is so useful or where the magic numbers come from.
| cornel_io wrote:
| jmount has an explanation elsewhere in this thread linking to
| https://win-vector.com/2021/12/26/how-to-read-sourav-chatter...
| which does a great job of explaining the intuition, but in a
| nutshell the normalization factor of 3 comes from the fact that
| if you select two random points between 0 and 1, the mean
| distance between them will be 1/3 (which is pretty easy to write
| down and solve; it boils down to the fact that a pyramid of
| height 1 with a 1x1 base has volume 1/3).
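|
| For what it's worth, a quick numpy check of that 1/3 (a made-up
| snippet, not from the linked post):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       u, v = rng.random(1_000_000), rng.random(1_000_000)
|       # E|U - V| for independent U, V uniform on [0, 1] is 1/3
|       print(np.abs(u - v).mean())  # ~0.333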
| pmayrgundter wrote:
| Thanks! That did it
| lngnmn2 wrote:
| This, by the way, is exactly what is wrong with so-called modern
| (abstract model based) science and why it is a bubble.
|
| This is not even alchemy, this is pure astrology.
|
| If a variable is abstract (not based on a particular measure of
| something real) the whole subsequent reasoning is just pseudo
| scientific bullshit. Correlations between abstract variables are
| imaginary.
|
| The "a correlation is not a causation" statement is much deeper
| than it seems. It basically implies that whatever cannot have its
| causality clearly established (logically, and confirmed
| experimentally) is just abstract bullshit, the kind of which most
| of the humanities consist.
| Jweb_Guru wrote:
| I mean, if you choose a bullshit ordering relationship, sure.
| But that's (essentially) the only assumption.
|
| Ah, I missed the reference to "the humanities" the first
| time... seems like you have an ideological axe to grind.
| nl wrote:
| What are you talking about?
|
| Correlation is absolutely useful for analysis of non-physical
| based systems.
| lngnmn2 wrote:
| "analysis"
| asdf_snar wrote:
| I once attended a summer school in Saint-Flour, France, where
| Sourav Chatterjee gave a masterful exposition of results on large
| deviations for random graphs. All chalk talk, clear presentation.
| My impression is he's a deep thinker. On that basis alone I give
| such an article much more weight than the plethora of articles
| that pass before the eyes; reading the masters has always been a
| good rule of thumb.
| rootveg wrote:
| what kind of summer school was that?
| spekcular wrote:
| This one: https://recherche.math.univ-bpclermont.fr/stflour/
| [deleted]
| [deleted]
| derbOac wrote:
| Also worth looking at the cited and related paper with the same
| author:
|
| https://arxiv.org/abs/1910.12327
|
| This follow-up paper presents a related measure of conditional
| dependence that has a "natural interpretation as a nonlinear
| generalization of the familiar partial R2 statistic for measuring
| conditional dependence by regression."
|
| The follow-up paper also provides some additional interpretive
| insights, I think.
|
| My intuitive impression is that both of these are important
| developments.
|
| I also have a very vague suspicion, based on the form of the
| function, that the correlation measure has some interpretation in
| terms of mutual information involving rank transformations of
| random variables.
| malshe wrote:
| Thanks for finding this article. I agree, these are important
| developments in particular because so many econometric models
| are now using machine learning without any distributional
| assumptions. Using correlation coefficients based on linearity
| is grossly insufficient.
| ur-whale wrote:
| Is there code?
| loxias wrote:
| The equation is on the second page, and if you know enough to
| know what correlation _is_, you know enough to implement it from
| the equation given. It takes O(N log N) to run, though, if
| implemented naively (because you have to sort your data).
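|
| For reference, here is a minimal numpy/scipy sketch of the
| estimator as I read the paper (the general, ties-allowed form;
| the function name and fixed seed are just for illustration - a
| sketch, not the author's code):
|
|       import numpy as np
|       from scipy.stats import rankdata
|
|       def xi_n(x, y, seed=0):
|           x, y = np.asarray(x), np.asarray(y)
|           n = len(x)
|           rng = np.random.default_rng(seed)
|           # sort pairs by x, breaking ties in x at random
|           order = np.lexsort((rng.random(n), x))
|           y = y[order]
|           r = rankdata(y, method="max")   # r_i = #{j: y_j <= y_i}
|           l = rankdata(-y, method="max")  # l_i = #{j: y_j >= y_i}
|           num = n * np.abs(np.diff(r)).sum()
|           den = 2.0 * (l * (n - l)).sum()
|           # with no ties in y this reduces to
|           # 1 - 3*sum|r_{i+1} - r_i| / (n^2 - 1)
|           return 1.0 - num / den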
| malshe wrote:
| Yes, the author has shared the link to R package here:
|
| https://cran.r-project.org/web/packages/XICOR/index.html
|
| Edit: R code from Dr. Chatterjee's Stanford page is here -
| https://souravchatterjee.su.domains//xi.R
|
| If you have never worked with R, the code may seem clunky, so I
| suggest checking out the Python implementation on GitHub here:
|
| https://github.com/czbiohub/xicor
|
| The Python library is not from the original author though. But
| it's easy to read the code and it works with pandas as well.
| ur-whale wrote:
| Thanks, the Python code is very clear and simple and makes it
| super easy to understand the idea without having to digest
| the paper.
| zmachinaz wrote:
| The current version of the Python lib seems to be extremely
| badly written code. Or is the algorithm just that slow? It takes
| something like 21s to compute the correlation for just 10k
| samples.
| flyingmutant wrote:
| This issue contains simple code that is claimed to be >300x
| faster: https://github.com/czbiohub/xicor/issues/17
| mellavora wrote:
| Best line of the paper:
|
| "The formula is so simple that it is likely that there are many
| such coefficients, some of them possibly having better properties
| than the one presented below."
|
| suggesting that the author is seeking to highlight a basic
| principle, not tune for a specific application.
| malshe wrote:
| This article is now published in the Journal of the American
| Statistical Association [Volume 116, 2021 - Issue 536]
|
| https://doi.org/10.1080/01621459.2020.1758115
| mjcohen wrote:
| That costs $47 (US). The arXiv article is free.
| ateng wrote:
| I think pointing out that it is published in a known journal
| is to show that the paper is now peer-reviewed.
| ekianjo wrote:
| Peer reviews don't mean much anymore these days.
| malshe wrote:
| Perhaps to you. In academia we take them seriously.
| Nevermark wrote:
| Do you have a measure of low correlation to support that?
|
| Regardless, I expect reviews are pretty straightforward
| for a concept that stands on its own like this one.
| derbOac wrote:
| https://link.springer.com/article/10.1007/s11192-017-2516-6
| sdoering wrote:
| In answer to sister comment by derbOac:
|
| Thanks for sharing this. On the other hand the cited
| paper clearly relates to interdisciplinary peer review.
|
| I am not sure how or if this can be transferred to
| mathematics.
| nl wrote:
| In Math they really do.
| malshe wrote:
| Yes, that's right. I wanted to highlight that this is a
| peer-reviewed article published in one of the top
| statistical journals.
| appleflaxen wrote:
| This is incredibly cool. Thanks HN for bringing things like this
| to the front page, which I would have never otherwise heard
| about.
| hdm41bc wrote:
| One of the aspects of math papers that I dislike is how
| unapproachable they are if you're unfamiliar with some of the
| terminology and conventions. The esoteric symbols don't make it
| any easier to Google their definitions either.
|
| For instance, what does this mean?
|
|       > μ is the law of Y
|
| μ is usually the mean or average. Is "law" something else?
| ebb_earl_co wrote:
| The law of Y means the distribution of Y. See the last phrase
| from this [0] StackOverflow answer.
|
| [0] https://math.stackexchange.com/a/1397467
| hdm41bc wrote:
| Great, thanks! I'll remember to use math.stackexchange.com as
| a resource in the future.
| 331c8c71 wrote:
| Do you think a random production code snippet is approachable
| for someone lacking the necessary background?
|
| The notation in this paper is totally standard.
|
| EDIT: \mu there refers to a probability measure. Nothing to do
| with averages.
| hdm41bc wrote:
| Yeah, working code snippets would be great. That would
| provide an unambiguous implementation that someone could use
| to dig into the underlying functions used and learn the basic
| concepts that would be tedious for the author to go through.
|
| In terms of the notation, it seemed like the author actually
| tried to keep his paper accessible, so my complaint isn't
| with the author. My gripe is more with math notation in
| general.
|
| In my opinion, unless you've read the appropriate textbooks
| or taken the right classes, math notation is hard to learn.
| The symbols are hard to Google for. Integral symbols, R for
| real numbers, sigma, delta, the round E that stands for IN
| are not found on a standard keyboard so it's challenging for
| a layman to Google and learn that notation on their own. Math
| evolved over millennia and the notation wasn't constructed
| with SEO in mind, so I understand why things are the way they
| are, but it's a stumbling block for the uninitiated trying to
| learn more advanced math. Maybe there are resources like
| math.stackexchange out there that I'm unaware of that would
| help make learning notation more approachable.
| jmount wrote:
| I share some notes on reading/explaining the basic definition
| here: https://win-vector.com/2021/12/26/how-to-read-sourav-
| chatter...
| [deleted]
| zaptheimpaler wrote:
| How is it possible to make a general coefficient of correlation
| that works for any non-linear relationship? Say y = sha256(x):
| doesn't that mean y is a predictable function of x, but it's
| statistically impossible to tell from looking at inputs/outputs
| alone?
| [deleted]
| [deleted]
| cirpis wrote:
| Two things:
|
| 1. The result is asymptotic, i.e. it holds as the number of
| samples approaches infinity.
|
| 2. The result is an "almost surely" result, i.e. in the
| collection of all possible infinite samples, the set of samples
| for which it fails has measure 0. In non-technical terms this
| means that it works for typical random samples and may not work
| for handpicked counterexamples.
|
| In our particular case let f=Sha256. Then X must be discrete,
| i.e. a natural number. Now the particulars depend on the
| distribution on X, but the general idea is that since we have
| discrete values, the probability that we get an infinite sample
| where the values tend to infinity is 0. So we get that in a
| typical sample there's going to be an infinitude of x ties and
| furthermore most x values aren't too large (in a way you can
| make precise), so the tie factors l_i dominate since there just
| aren't that many distinct values encountered in total. And so we
| get that the coefficient tends to 1.
| nnm wrote:
| This works for a hash function, as a hash function does not
| introduce uncertainty, i.e. Y = f(X), not Y = f(X) +
| random_noise. I just tested it:
|
| https://ibb.co/nCXVSqB
| idealmedtech wrote:
| Perhaps they mean continuous or differentiable functions
| (as evidenced by the desire to put an order on the elements)
|
| edit: was wrong about sha256
| dandanua wrote:
| That's right, the paper has practical estimates of
| convergence only for continuous functions (kind of). For hash
| functions this coefficient is useless.
| bawolff wrote:
| sha256() doesn't involve state or decision making.
| meiji163 wrote:
| In Theorem 1.1, f is a function of random variables, which
| might be where you're confused.
|
| > doesn't that mean y is a predictable function of x
|
| Sort of: as a function of real numbers, sha256 is just some
| deterministic function. But the point is that its output "looks
| like" a uniform random variable for any reasonable input
| distribution, i.e. as functions of random variables the input
| and output should have 0 correlation.
| Sniffnoy wrote:
| No. If you have a good hash function, that means it's
| _computationally_ infeasible to determine anything about x
| based only on y. It's not _statistically_ impossible at all;
| "statistically" doesn't concern itself with computational
| difficulties.
|
| This is similar to how, e.g., we generally assume that AES is
| unbreakable from a _computational_ point of view, but if you
| want a _statistically_ unbreakable cipher, your only (IINM)
| option is a one-time pad.
| hinoki wrote:
| The summary says that it works if it is a measurable function
| [0], which is structure preserving. So sha256 would scramble
| the input too much for it to be detected here.
|
| [0] https://en.m.wikipedia.org/wiki/Measurable_function
| [deleted]
| kkylin wrote:
| I've not yet read the article, just the abstract. But the
| abstract is pretty precise, and being "measurable" is a very
| weak constraint. For (computationally) complex functions like
| a hash, my first guess is the "escape clause" is in the
| number of samples needed for the statistic to converge to its
| expected value.
| nnm wrote:
| Any function that can be implemented in a computer is a
| Lebesgue measurable function.
| bollu wrote:
| I don't doubt it, but I don't immediately see the proof
| either. What's the key idea?
| contravariant wrote:
| Most spaces that computers deal with are basically
| discrete.
|
| Technically this may not always be the case but it's very
| hard to construct a convincing counter example.
| Hermel wrote:
| Simple: if in the whole sample, the same x always comes with
| the same y, then y is a function of x.
|
| Example:
|
| x1 = 1, y1 = 2
|
| x2 = 1.1, y2 = 2
|
| Here, y is a function of x, because if we know that x = 1, then
| we also know that y = 2. However, x is not a function of y, as
| we don't know what value x has given that y = 2.
|
| I hope that made it more clear. Here, "function" simply means
| that every time we started with the same x, we also got the
| same y.
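|
| That check is easy to run mechanically on a finite sample (a toy
| sketch, unrelated to the paper's estimator):
|
|       from collections import defaultdict
|
|       def looks_like_function_of(xs, ys):
|           """True if a repeated x always comes with the same y."""
|           seen = defaultdict(set)
|           for x, y in zip(xs, ys):
|               seen[x].add(y)
|           return all(len(vals) == 1 for vals in seen.values())
|
|       print(looks_like_function_of([1, 1.1], [2, 2]))  # True
|       print(looks_like_function_of([2, 2], [1, 1.1]))  # False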
| canjobear wrote:
| Seems that by this logic if you don't have any repeats in
| your x values then you are bound to conclude that any set of
| y values is a function of the x.
| roenxi wrote:
| Either Chatterjee has made a mistake, that is wrong or that
| definition has some extremely precise meanings because:
|
|       > counterexample = tibble(x = 1:6, y = c(1.1,-1.2,1.3,-1.4,1.5,-1.6))
|       > counterexample
|       # A tibble: 6 x 2
|             x     y
|         <int> <dbl>
|       1     1   1.1
|       2     2  -1.2
|       3     3   1.3
|       4     4  -1.4
|       5     5   1.5
|       6     6  -1.6
|       > XICOR::calculateXI(xvec = counterexample$x, yvec = counterexample$y)
|       [1] -0.2857143
|       > -0.2857143 == 1
|       [1] FALSE
|
| & I assume you could get data like that in the wild by sampling
| a sine wave as a time series.
|
| I think the abstract is a bit strong; it probably means
| "converges to" for a large repeating sample.
| cirpis wrote:
| There is no mistake in the definition and this is all
| elaborated upon in page 4 of the article.
|
| Quote: " On the other hand, it is not very hard to prove
| that the minimum possible value of xn(X, Y ) is -1/2 +
| O(1/n), and the minimum is attained when the top n/2 values
| of Yi are placed alternately with the bottom n/2 values.
| This seems to be paradoxical, since Theorem 1.1 says that
| the limiting value is in [0, 1]. The resolution is that
| Theorem 1.1 only applies to i.i.d. samples. Therefore a
| large negative value of xn has only one possible
| interpretation: the data does not resemble an i.i.d.
| sample."
|
| You propose Y = f(X) = (-1)^(X+1)*(1+X/10) as your functional
| relation, which is measurable if X is discrete, and indeed
| if we let x_n = n, then the limiting value of the estimator
| will be -1/2, not 1.
|
| However, this is just a particular value of the estimator
| on a particular sample of (x, y). The theorem is an "almost
| surely" statement, which means that it fails only for a set of
| samples with probability 0.
|
| Indeed, if we actually picked a random sample of (X, f(X))
| with your f, then independent on the distribution on X,
| since X is discrete, we would expect to see many ties
| (infinitely many ties as the number of samples goes to
| infinity). This would mean that |r_{i+1}-r_i| is 0 for all
| but finitely many i and so the estimator would be 1.
|
| This also covers the case of f being a hash function as
| mentioned before. Worse yet it only has finitely many
| different values so once again infinitely many ties.
|
| The way the estimator is defined, it will take care of the
| (X, f(X)) case fairly easily as for a typical sample you
| will get x values that cluster and for a measurable
| function this implies that the resulting values will be
| close together and so the rank differences will be small.
|
| This discussion probably wasn't included in the abstract
| since it's fairly simple measure theory which most experts
| reading the article will be intimately familiar with.
| mellavora wrote:
| As an alternative to cirpis's reply, and noting that in
| roenxi's example roenxi is using the R package supplied by
| the paper's author (suggesting a real interest in
| understanding), I again refer to page 4 of the article.
|
| "it is not very hard to prove that the minimum possible
| value of [the proposed measure] is -1/2 + O(1/n), and the
| minimum is attained when the top n/2 values of Yi are
| placed alternately with the bottom n/2 values. ...Theorem
| 1.1 only applies to i.i.d. samples. Therefore a large
| negative value of [the measure] has only one possible
| interpretation: the data does not resemble an i.i.d.
| sample."
|
| roenxi, I applaud your example; it shows that you are
| working to understand the measure. "Where does it break?" is
| always a good question.
| petters wrote:
| I implemented this in Python and it seems to work really well!
| Thanks for this link!
|
| Pretty cool that it can also work for e.g. textual data in
| principle, as someone mentioned.
| neeleshs wrote:
| There is also another paper comparing the power of Chatterjee
| coefficient with other traditional ones.
|
| https://arxiv.org/abs/2008.11619
| abeppu wrote:
| It is interesting to note that the Chatterjee paper makes a
| point of mentioning Pearson, Spearman, Kendall's tau, whereas
| the ones focused on in this paper all appear as citations but
| aren't explicitly discussed.
| nestorD wrote:
| To be fair, Pearson, Spearman and Kendall's tau are the
| coefficients people use in practice. Had I been in the
| author's position I would have done the same: cite all the
| interesting developments but compare with what people
| actually use.
|
| Comparing with something people barely know would be nice,
| but people have a limited attention span, so I would push
| that to the annexes at best and focus on the more important
| parts.
| fieldcny wrote:
| Dumb question: what is the significance of this?
| csee wrote:
| It's fast to calculate, simple to understand, and doesn't make
| assumptions about the underlying distributions. This makes it a
| more effective generic tool for practitioners. Perhaps useful
| in the way the Pearson correlation is useful.
|
| I'd like to learn more about the small sample properties.
| Proofs of asymptotics are necessary but less interesting. But
| the author's examples on sample data sets suggest it behaves
| sensibly.
| singhrac wrote:
| Well, in the abstract it says: "[a coefficient] which is 0 if
| and only if the variables are independent and 1 if and only if
| one is a measurable function of the other". The former property
| does not hold for the usual correlation with general random
| variables (but does hold for Gaussians, which is one part of
| the reason they are used everywhere). I'm not sure about the
| latter property for the usual correlation, actually, but I also
| doubt it's true.
|
| Worth noting the author is a highly regarded professor at
| Stanford.
| rwilson4 wrote:
| Correlation typically measures whether y is a linear function of
| x, but people usually interpret it (incorrectly) as: knowing x
| tells you something about y. If y = x^2, then y is determined
| completely by x, but since the relationship is nonlinear the
| correlation may actually be zero, depending on the distribution
| of x. This paper proposes a statistic that will indicate whether
| y is related to any function of x, linear or nonlinear.
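|
| A tiny illustration of that point (a made-up numpy example, not
| from the paper):
|
|       import numpy as np
|
|       x = np.linspace(-1, 1, 1001)  # symmetric around 0
|       y = x**2                      # y fully determined by x
|       r = np.corrcoef(x, y)[0, 1]
|       print(r)  # essentially 0: no *linear* relationship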
| ouid wrote:
| I don't think that there's a standard enough mathematical
| definition of correlation to say that. Perhaps the word has
| been co-opted, but the paper linked suggests that the co-option
| isn't accepted.
| KarlKemp wrote:
| This is... quite wrong? The dictionary says:
|
|       1. a mutual relationship or connection between two or more
|          things
|       2. [Statistics] interdependence of variable quantities.
|       3. [Statistics] a quantity measuring the extent of the
|          interdependence of variable quantities.
|
| The most sympathetic to your definition is Wikipedia:
|
|       In statistics, correlation or dependence is any statistical
|       relationship, whether causal or not, between two random
|       variables or bivariate data. In the broadest sense
|       correlation is any statistical association, though it
|       actually refers to the degree to which a pair of variables
|       are linearly related.
|
| And that's the _mathematical_ formulation. Correlation also
| has a meaning in everyday speech, and mathematics doesn't
| have the authority to just adopt terms and then claim people
| are wrong after they've changed the meaning.
|
| Also correlation very definitely means that knowing <x> tells
| you something about <y>. And vice versa. Like, for example:
| its value. Or at least a better idea of it than pure guessing
| without correlation.
| doovd wrote:
| That's a bit unnecessarily pedantic; I think we all
| understand the context in which we are talking about
| correlation.
| PopeUrbanX wrote:
| I think it's safe to say you're not the intended audience
| for anything math-related, given that you're going to a
| dictionary to try to figure things out...
| popcube wrote:
| Sorry, but scientists routinely take normal English words and
| give them a specific meaning within their own field. It may be
| confusing, but the advantage that most people understand
| English is enough.
| UncombedCoconut wrote:
| Hello, see here for an explanation:
| https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
| It's widely understood that the words "correlation" and
| "uncorrelated", when used in the context of statistics and not
| otherwise qualified, are shorthand for this definition in
| particular. By "otherwise qualified" I mean, for example, saying
| "Spearman's correlation" (as in the OP's abstract) to specify
| this one:
| https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_...
| csee wrote:
| I think that depends on context. Sometimes, in a
| technical setting correlation just means dependence as an
| abstract concept, and this includes non-linear
| dependence. Similar to how in financial circles, volatility
| doesn't mean standard deviation, but in colloquial
| settings it does.
| mattkrause wrote:
| That matches my experience too.
|
| Correlation, in general, just means some sort of
| statistical dependence: knowing _x_ tells you something
| about _y_. It 's often "operationalized" by computing
| Pearson's _r_ : it's easy to do and there's lots of
| associated theory.
|
| However, I would find it absolutely bizarre if someone
| showed a plot with obvious non-linear dependence and
| described it as "uncorrelated". In that case, the low _r_
| reflects a failure of the measuring tool rather than
| something being measured.
| femto wrote:
| From a signal processing perspective: being able to recognise
| signals in the presence of interference, noise and distortion.
|
| For example, you might have a radio signal (such as WiFi) that
| you want to receive. The first step is that you have to pick that
| signal out of whatever comes out of your radio receiver,
| which will be the WiFi signal along with all sorts of noise and
| interference from other users. Typically the search will be
| done with the mentioned "Pearson's Correlation", using it to
| compare the received signal with an expected template: a value
| of 1.0 meaning the received signal is a perfect match with the
| template, a value of 0.0 meaning no match at all. If the wanted
| signal is present, interference, noise and distortion will
| reduce the result of the correlation to less than 1.0, meaning
| you might miss the WiFi signal, even though it is present.
|
| This article is about coming up with a measure that gives a
| more robust result in the face of noise, interference and
| distortion. It's fundamental stuff, in that it has quite
| general application.
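|
| A toy version of that idea (hypothetical numbers, and a real
| receiver would also correlate against the template at every
| time offset):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       t = np.linspace(0, 1, 1000)
|       template = np.sin(2 * np.pi * 20 * t)  # expected signal
|
|       clean = template.copy()
|       noisy = template + 2.0 * rng.normal(size=t.size)
|
|       def match(a, b):
|           # normalized (Pearson) correlation of two signals
|           return np.corrcoef(a, b)[0, 1]
|
|       print(match(template, clean))  # 1.0: perfect match
|       print(match(template, noisy))  # well below 1.0 due to noise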
| loxias wrote:
| (Yay signal processing!)
|
| Skimming it now, this looks wild. Using the variance of the
| _rank_ of the dataset (for a given point, how many are less
| than that point) seems... weird, and like it throws out some
| information. The author seems legit though, so I can't wait to
| try drop-in implementing this in a few things.
| mattkrause wrote:
| Rank-transforms are pretty common: they show up in a lot of
| non-parametric hypothesis tests, for example.
|
| The neat thing about ranks is that, in aggregate, they're
| very robust. You can make an estimate of the mean
| arbitrarily bad by tweaking a single data point: just send
| it towards +/- infinity and the mean will follow. The
| median, on the other hand, is barely affected by that sort
| of shenanigans.
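|
| A one-outlier demonstration of that robustness (toy numbers):
|
|       import numpy as np
|
|       data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
|       corrupted = data.copy()
|       corrupted[-1] = 1e9  # drag one point towards infinity
|
|       print(np.mean(data), np.mean(corrupted))      # mean blows up
|       print(np.median(data), np.median(corrupted))  # 3.0 both times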
| ndynan wrote:
| My understanding is that one of the major critiques of
| statistics, especially its use in psychology, has been the use of
| models which are derived from the mean.
|
| There are inherent flaws/assumptions to this approach which Peter
| Molenaar has done extensive work to critique (See Todd Rose's
| book on the subject). For anyone who understands the technique
| presented in this paper, does it also depend on the mean as a
| model like when calculating Pearson's r?
| extremelearning wrote:
| This is an order-based algorithm, so it is more related to the
| median than the mean.
|
| Another very useful consequence of being order-based is that
| this new coefficient is much more robust to noise/outliers than
| the classic (Pearson) correlation coefficient.
| LudwigNagasena wrote:
| I think there are far bigger problems with the lack of
| theoretical foundations and abuse of p-values rather than with
| ergodicity or whatever is the pet peeve of Peter Molenaar.
| mellavora wrote:
| Isn't Molenaar looking at networks of symptoms over time? Yes,
| in any multi-variate time series, when one is searching for
| relationships between the variables (and allowing that you may
| have multiple sets of time series which may have some type of
| grouping, i.e. observations from a set of people with one
| diagnosis vs observations from a set of people with a contrary
| or with no diagnosis), then yes, any attempt to find co-
| relations in the multivariate signal needs to account for the
| underlying statistical distribution of the signal components.
| The normal distribution isn't a bad first a priori
| approximation, but you really need to check.
|
| side note- it also isn't clear that you can group by diagnosis,
| see, for example, https://pubmed.ncbi.nlm.nih.gov/29154565/,
| which shows that even within diagnostic groups there is
| substantial individual variation.
| neycoda wrote:
| Correlation never means causation, but it typically implies it,
| and often the two are directly entangled. Take lead in gasoline
| being tied to city crime: the correlation was likely directly
| related to causation. That's not true of all correlations, but a
| cause-effect relationship typically has a correlation with
| something, even if there's only one real one and many other
| correlations that appear meaningful are meaningless.
| The_rationalist wrote:
| Why would someone use Pearson or Spearman coefficients vs this
| new one?
| [deleted]
| tomrod wrote:
| Pearson is for affine/linear+intercept relationships. Spearman
| is for monotonic relationships.
| rwilson4 wrote:
| If the relationship is linear, I'm guessing a test based on
| Pearson has greater power.
| mellavora wrote:
| power and interpretability.
|
| Assuming a linear relationship, if you know the correlation
| coefficient, you can predict unobserved values of y based on
| a known x with good accuracy.
|
| y = ax + b + error
|
| where strong correlation means error is small.
| im3w1l wrote:
| Seems likely. The presented coefficient only looks at the
| ordering of the X-values, and how it relates to the ordering
| of the Y-values. All other information is thrown away. That's
| how it can be so general, but it should come at the expense
| of power.
| emerged wrote:
| Ah that makes sense intuitively. I was confused how non
| linear correlation could be detected without knowing
| anything about the function itself.
| tdullien wrote:
| I have a very naive question: What are the downsides of
| estimating mutual information instead?
|
| I have a math (but not statistics) background, and am sometimes
| bewildered by the many correlation coefficients that float around
| when MI describes pretty exactly what one wants ("how much does
| knowledge of one variable tell you about the other variable"?).
|
| So ... what am I not understanding?
| tibbar wrote:
| Different coefficients help you look at different kinds of
| relationships. For example, Pearson's R tells you about linear
| relationships between variables -- it's closely tied to the
| covariance: "how useful is it to draw a line through these data
| points, and how accurate is interpolating likely to be?".
|
| Spearman's correlation helps you understand monotonic/rank-
| order relationships between variables: "Is there a trend where
| increasing X tends to also increase Y?" (This way we can be
| just as good at measuring the existence of linear, logarithmic,
| or exponential relationships, although we can't tell them
| apart.)
|
| Mutual information helps you understand how similar two
| collections of data are, in the sort of unstructured way that's
| useful in building decision trees. You could have high mutual
| information without any sort of linear or monotonic
| relationship at all. Thus it's more general while at the same
| time not telling you anything that would be helpful in
| building, for instance, a predictive multivariate linear model.
|
| TLDR; More specific coefficients leverage assumptions about the
| structure of the data (eg linearity), which can help you
| construct optimal versions of models under those assumptions.
| Mutual information doesn't make any assumptions about the
| structure of the data so it won't feed into such a model, but
| it still has lots of applications!
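|
| For what it's worth, the plug-in (binned) MI estimate people
| usually reach for looks roughly like this; the helper name, bin
| count and data are made up, and this estimator is known to be
| biased:
|
|       import numpy as np
|
|       def binned_mi(x, y, bins=20):
|           """Plug-in MI estimate (nats) from a 2D histogram."""
|           counts, _, _ = np.histogram2d(x, y, bins=bins)
|           p_xy = counts / counts.sum()
|           p_x = p_xy.sum(axis=1, keepdims=True)
|           p_y = p_xy.sum(axis=0, keepdims=True)
|           nz = p_xy > 0  # avoid log(0)
|           ratio = p_xy[nz] / (p_x @ p_y)[nz]
|           return float((p_xy[nz] * np.log(ratio)).sum())
|
|       rng = np.random.default_rng(0)
|       x = rng.normal(size=10_000)
|       print(binned_mi(x, x**2))                     # large: dependent
|       print(binned_mi(x, rng.normal(size=10_000)))  # near 0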
| mellavora wrote:
| Instead of mutual information, can I recommend the Transfer
| Entropy?
|
| https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.85...
|
| https://en.wikipedia.org/wiki/Transfer_entropy
|
| It is designed for time series, but can be adapted to a more
| general case. The advantage over MI is that with MI, you can
| only see that A and B are related, while TE can show that A
| drives B, but not the reverse.
| LittlePeter wrote:
| arxiv link: https://arxiv.org/abs/nlin/0001042
| afiori wrote:
| In the article it is explained that the purpose of this
| coefficient is to estimate how much X is a function of Y [1](or
| how noisy this association is); in particular this coefficient
| is 1 iff X is a function of Y.
|
| With MI (the article claims that) you can have a coefficient of
| 1 without X being a function of Y.
|
| [1] this means that this coefficient is intentionally not
| symmetric.
| anewhnaccount2 wrote:
| I'm also interested in this, having tried and semi-successfully
| used mutual information for finding associations between
| multinomial variables. As an even more naive question, I find
| the actual selection of estimators bewildering. How do I know
| which estimator to use for mutual information? How do I know if
| my chosen estimator has converged or is doing a bad job on my
| data? Bringing it back to the topic at hand, does the estimator
| presented in the paper provide good estimates for a wider
| variety of cases than the mutual information plug-in estimator?
| If so I can see it might be nice for simplicity reasons alone.
| Can we have different estimators for this new correlation
| coefficient? Any ideas what that would look like?
| tbenst wrote:
| MI is quite useful and widely used. It typically requires
| binning data though when distributions are unknown /
| empirically estimated. This approach is a rank-based score,
| more similar to Spearman correlation than Pearson. This allows
| for nonlinear relationships between the two variables.
|
| A slightly critical review of the work can be seen here:
| https://academic.oup.com/biomet/advance-article/doi/10.1093/....
| They argue that the older forms of rank correlation, namely D,
| R, and tau*, are superior.
| Nonetheless, it seems like a nice contribution to the stats
| literature, although I doubt the widespread use of correlation
| is going anywhere.
| spekcular wrote:
| Freely available version of the review paper, for those who
| want it: https://arxiv.org/abs/2008.11619
| dcanelhas wrote:
| Mutual information is not trivial, or even possible, to estimate
| in many practical situations as far as I know. Example
| applications in robotics or computer vision where mutual
| information would be useful are segmentation and denoising of
| unordered 3D point data.
| mellavora wrote:
| Yes, as someone mentioned above, the problem is getting the
| underlying distribution of the data, so you can measure SUM
| p_i log(p_i); this usually involves some binning, which can
| be tricky (and yes, I know the formula I gave is entropy not
| MI)
|
| I try to remind myself that "it is just a model", as a
| corollary to "all models are wrong, some are useful." You are
| _never_ dealing with the real world. And you are usually
| trying to estimate some future as of yet unobserved signal
| based on existing data. In other words, if your bins are
| reasonable and reasonably usefully accurate, you can build a
| working if not perfect system.
|
| Don't try to optimize testing error performance to a value
| lower than the irreducible error in the system.
| AstralStorm wrote:
| Even with binning, the problem is one of accurate sampling
| from an unknown probability distribution.
|
| Biased samples produce biased results and this OP
| correlation coefficient might be sensitive to such an
| issue.
|
| In one of the projects we were assuming gamma distribution
| (for speech processing) and sampling that is notoriously
| hard. Trying to use binned MI produced serious errors, as
| opposed to the minimum-MSE one; even maximum likelihood did
| better (if noisy).
| dcanelhas wrote:
| I'm not sure I understand how binning applies in e.g.
| segmentation of point clouds into distinct objects. The
| data would likely contain a mix of unknown distributions,
| partially observed (due to occlusions) and not easily
| parametrized (chair, table, toaster, etc.)... Locally you
| can find planar patches though, so correlation can still be
| useful.
| nmca wrote:
| non-pdf link to the article via arxiv vanity:
| https://www.arxiv-vanity.com/papers/1909.10140/
| laplacesdemon48 wrote:
| Is there a way to extend this to measure a multivariate
| relationship?
|
| For example: Cor(X, Y & Z)
|
| I know you could run them pairwise but it's possible Cor(X, Y)
| and Cor(X, Z) are close to zero but Cor(X, Y & Z) is close to 1.
| tomrod wrote:
| This is a good question. When I read the article I'll keep it
| in mind!
| kortex wrote:
| Seems easy enough to play around with, and need not be strictly
| numbers either, as long as rank is defined on your fields (they
| are sortable...total ordering? I'm rusty on my terminology).
| Basically sort X, take the variance of the rank of Y, Z, etc.
| Much in the same way you would compute multivariable
| correlation.
| SubiculumCode wrote:
| I think there could be a lot of use of distance based
| correlations https://towardsdatascience.com/introducing-distance-
| correlat...
| infocollector wrote:
| Am I reading the paper's definition correctly? If yes, why do a
| vertical line and a horizontal line in 2D have such different
| correlations?
|
|       import numpy as np
|       from scipy.stats import rankdata
|
|       def corr(x, y):
|           n = len(x)
|           an_array = np.stack((x, y), axis=-1)
|           sorted_array = an_array[np.argsort(an_array[:, 0])]
|           r = rankdata(sorted_array[:, 1])
|           s = sum(np.abs(np.diff(r)))
|           return 1.0 - (3.0 * s / (n*n - 1.0))
| yorwba wrote:
| You're using the formula for the case where there are no ties,
| but the vertical line has all X values tied and the horizontal
| line has all Y values tied. Also, the case where Y is a
| constant is explicitly excluded from consideration in the
| paper.
| The_rationalist wrote:
| could this better represent semantic textual similarity?
| https://paperswithcode.com/sota/semantic-textual-similarity-...
| m1sta_ wrote:
| How does this relate to Maximal Information Coefficient (MIC),
| MGC, PPS and the like?
| derbOac wrote:
| The paper discusses MIC in particular at least. They suggest
| that MIC sometimes overstates the strength of a noisy
| relationship, giving the example of a bivariate normal mixture
| (Figure 9). In that example MIC is 1, but the new measure is
| 0.48 or something, which seems more reasonable.
| d883kd8 wrote:
| There will be no AI fire alarm
| kortex wrote:
| There are so many super cool aspects to this!
|
| - it's always fun when a new equation or formula is discovered,
| doubly so when it is very practical
|
| - actually really easy to wrap your head around it
|
| - seems very computationally efficient. Basically boils down to
| sorts and distance
|
| - not limited to strict numeric fields ( integers and reals).
| Anything with an ordering defined can act as your Xs/Ys:
| characters, words, complex/vectors (under magnitude). I think you
| could even apply it recursively to divide and conquer high
| dimensional datasets.
|
| - looks useful in both LTI and non stationary signal analysis
|
| - possible use cases: exoplanet search, cosmology in general,
| ecology, climate, cryptanalysis, medicine, neuroscience, and of
| course, someone will find a way to win (more likely lose) money
| on stonks.
| thesz wrote:
| You can update the Pearson correlation online; you cannot update
| the OP's correlation coefficient online.
|
| Pearson and other correlation coefficients are linear, O(N);
| sorting would incur a logarithmic multiplier, O(N log N).
|
| Thus, it is not computationally efficient.
|
| To break ties randomly one has to have a good random number
| generator, which is a problem in itself.
|
| Finally, if you want to have differentiable version of this
| correlation coefficient, you will need to use sorting networks
| which are O(Nlog^2(N)).
|
| But it is a cool idea nevertheless; it brings up the use of
| ranking in statistics, and there are ties to other areas of
| statistics.
|
| For example, it appears that you can more efficiently prune
| language models if you use rank metric than probability [1].
|
| [1] https://aclanthology.org/P02-1023.pdf
| TeeMassive wrote:
| > Pearson and other correlation coefficients are linear, O(N);
| sorting would incur a logarithmic multiplier, O(N log N). Thus,
| it is not computationally efficient.
|
| That doesn't seem a deal breaker to me. Sure if you're
| dealing with billions of data entries it will be 9 times
| slower, but throwing 9 times more computational power to
| solve a problem is far from something unheard of nowadays.
| tromp wrote:
| > it will be 9 times slower
|
| Closer to 30 times slower, as the logarithm in sorting
| complexity is a binary rather than decimal one.
| estomagordo wrote:
| I guess more like 30 times slower. Consider mergesort and
| quicksort, which both (aim to) halve the search space with
| every iteration. There is a log2 amount of steps. 2 is our
| base. And yes, base 2 and base 10 logarithms will always be
| within a constant factor of one another, but seeing as we
| _are_ talking about constant factors here to begin with...
| asQuirreL wrote:
| If we are talking about billions of entries, even of just
| 32bit integers, we are talking about 4GB of data, your
| sort is probably going to be IO dominated, not CPU
| (comparison) dominated. In which case you'd likely use a
| sorted data structure with a much higher branching factor
| than 2 (like a B-tree or LSM tree).
|
| Take the B-tree example, with 4KB pages, you can fit 1000
| integers in there for a branching factor of 500, and a
| performance multiplier of just over 3.
| jazzkingrt wrote:
| Can you explain in more detail why this can't be updated
| online?
|
| We can track the rank of newly sampled pairs in LogN using
| something like an order statistic tree:
|
| https://en.wikipedia.org/wiki/Order_statistic_tree
|
| But I guess the problem is that with each new pair, O(n)
| pairs could change their contribution to the correlation
| coefficient.
| thesz wrote:
| Your point is a very valid one. I was not able to find out
| how to keep ranks updated efficiently.
| phkahler wrote:
| But now we're moving away from being simple to compute.
| asxd wrote:
| Out of curiosity, why do you think having a good random
| number generator is problematic? It seems like it's easy
| enough to access one in most situations.
| medo-bear wrote:
| how do you sample from an arbitrary probability
| distribution? it is not a trivial problem at all
| kortex wrote:
| Why do you need an arbitrary distribution? We are talking
| about breaking ties, just use uniform. You can probably
| just use RDRAND.
| Hendrikto wrote:
| They are equivalent. Generating an arbitrary distribution
| from a uniform one is trivial.
| kortex wrote:
| If you are willing to accept lookup tables or
| approximations, then yeah, arbitrary distributions are
| trivial. However for certain domains, you may want a more
| closed form solution for mapping uniform to your
| distribution, which may not be obvious. E.g. I would say
| generating pink noise is not "trivial" because there is
| no closed-form solution (cf. generating a Gaussian
| distribution via Box-Muller, which is trivial), so you
| need to pick a method which may have tradeoffs.
| medo-bear wrote:
| I agree. To break ties a uniform distribution should be
| sufficient.
| Hendrikto wrote:
| If you can sample from a uniform distribution, it is
| trivial to turn that into any arbitrary distribution.
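|
| The standard trick is inverse-transform sampling; a toy example
| for the exponential distribution (the idea works for any
| distribution whose CDF you can invert, at least numerically):
|
|       import numpy as np
|
|       rng = np.random.default_rng(0)
|       u = rng.random(1_000_000)          # uniform on [0, 1)
|       lam = 2.0
|       samples = -np.log(1.0 - u) / lam   # inverse CDF of Exp(lam)
|       print(samples.mean())              # ~1/lam = 0.5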
| thesz wrote:
| Yes, exactly. There are so many to choose from. ;)
|
| There are several discussions here in HN about PRNGs of
| different kinds and people still want to invent new ones.
| Why? Because old ones do not satisfy them in full.
|
| Returning to the problem at hand: how many consecutive ties
| do we expect? This would help us define how many bits we
| should extract from the state - for N expected consecutive
| ties we should extract 2*log_2(N) bits or more (birthday
| paradox). For 64K ties we need 32 bits, which is a fair
| amount of state, 1/8 of a typical PRNG discussed here.
|
| Most PRNGs are splittable, and this also brings up the question
| "how does splitting affect the bits generated?" Will these bits
| be correlated between split generators?
|
| I definitely do not want a correlation coefficient
| computation that produces close results for provably
| different sets due to correlation between tie-broken
| sections filled with random numbers.
| ykonstant wrote:
| > Yes, exactly. There are so many to choose from. ;)
|
| Just use a good PRNG to pick one of them :D
| eutectic wrote:
| You don't have to use sorting networks; to differentiate (or
| get a subgradient, anyway) a sorted list you just pass the
| derivative back to the original index, so you can use any
| sorting algorithm as long as you keep track of the
| permutation. You could even use radix sort.
|
| Also, I would say that random number generation is a well-
| solved problem.
| omnicognate wrote:
| > Thus, it is not computationally efficient.
|
| Is there some standard sense in which only (sub)linear
| algorithms are considered "computationally efficient"?
| O(nlogn) is fine for a very wide variety of practical uses.
| thesz wrote:
| I should point out that Pearson's correlation coefficient can
| be updated online; the storage complexity is O(1). The
| storage complexity for sorting is O(N).
|
| Thus, this particular algorithm will not fit into, say, a
| traffic-analysis framework.
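|
| For reference, the O(1)-state online update for Pearson is the
| standard Welford-style co-moment recursion; a sketch, not tied
| to any particular library:
|
|       class OnlinePearson:
|           """Streaming Pearson correlation with O(1) memory."""
|           def __init__(self):
|               self.n = 0
|               self.mean_x = self.mean_y = 0.0
|               self.m2_x = self.m2_y = self.cov = 0.0
|
|           def update(self, x, y):
|               self.n += 1
|               dx, dy = x - self.mean_x, y - self.mean_y
|               self.mean_x += dx / self.n
|               self.mean_y += dy / self.n
|               # one factor uses the *updated* mean (Welford trick)
|               self.m2_x += dx * (x - self.mean_x)
|               self.m2_y += dy * (y - self.mean_y)
|               self.cov += dx * (y - self.mean_y)
|
|           def correlation(self):
|               return self.cov / (self.m2_x * self.m2_y) ** 0.5
|
|       est = OnlinePearson()
|       for x, y in [(1, 2), (2, 3.9), (3, 6.1), (4, 8)]:
|           est.update(x, y)
|       print(est.correlation())  # ~1 for this near-linear toy data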
| eru wrote:
| The runtime and storage requirements you cite only apply
| when you need exact answers.
|
| If you want to compute a correlation coefficient, you are
| probably happy to get eg 3 significant digits of
| precision. Thus you can use approximation algorithms and
| sampling.
| omnicognate wrote:
| Yeah, this has worse complexity than Pearson. However,
| it's not competing on complexity.
|
| The point of it is to detect general dependencies, where
| Pearson only detects linear ones (cases where Y is
| completely determined by X but their Pearson correlation
| is 0 are trivial to construct). The "linear" in linear
| dependency isn't the same as the "linear" in linear
| complexity here, but it doesn't seem surprising for an
| algorithm that can detect more general dependencies to
| have worse than linear performance (and nlogn is pretty
| much the next best).
|
| In the absence of a known algorithm with the same
| properties and better complexity, it doesn't seem
| unreasonable to call it "computationally efficient" when
| it comes with a complexity sufficient for most uses,
| rather than say quadratic or exponential.
| thesz wrote:
| Rank-based correlation coefficients:
| https://en.wikipedia.org/wiki/Correlation_coefficient#Rank
|
| There are several such algorithms.
|
| Also, there is a comment with a link to paper comparing
| this correlation coefficient to some other:
| https://news.ycombinator.com/item?id=29690208
|
| It looks like this "new coefficient of correlation" is
| not without flaws compared to others.
| hansvm wrote:
| Pearson still has a lot of advantages on modern hardware and
| in practice should be a ton faster, but:
|
| (1) Rank coefficients can absolutely be updated online.
|
| (2) There aren't many applications where an extra
| polylogarithmic factor is the difference between a good-
| enough algorithm and something too slow.
|
| (3) The RNG is an implementation detail I'd probably omit.
| It's asymptotically equivalent to the deterministic operation
| of rescaling the sum of all pairwise absolute differences of
| Y values for each run of identical X values. If in your
| initial sort of the X values you instead lexicographically
| sort by X then Y then you can get away with a linear
| algorithm for computing those pairwise absolute differences.
| petters wrote:
| > Finally, if you want to have differentiable version of this
| correlation coefficient
|
| Could you expand a little bit more on how to differentiate
| this coefficient? Since it is only based on ranks, it feels
| very non-differentiable.
| kortex wrote:
| > Pearson and other correlation coefficients are linear, O(N);
| sorting would incur a logarithmic multiplier, O(N log N).
|
| > Thus, it is not computationally efficient.
|
| That is not really what I meant. First, I personally consider
| NlogN the bar for "efficient" as far as algorithms go.
| Anything polynomial is inefficient, and anything better than
| NlogN is like, "super efficient" in my mind. Maybe that is a
| weird opinion, should I update my mental model? My reasoning
| is a lot of well known algorithms operate in better than
| polynomial but worse than linear time.
|
| Second, sorting is a solved problem. Regardless of
| theoretical efficiency, we have so many ways to sort things,
| there is almost always really fast in practice ways of
| sorting data.
|
| I didn't know about sorting networks, that is fascinating!
| ummonk wrote:
| Indeed, in practice the logN factor can only get so big,
| even if N is taken to be the number of particles in the
| observable universe.
| [deleted]
| anovikov wrote:
| I can only second this. Few algorithms are as efficient as
| NlogN.
| thesz wrote:
| Compare Fourier (O(NlogN)) and wavelet (O(N)) transforms.
| Both can be used for similarity (correlation) search
| through projections.
| eru wrote:
| > First, I personally consider NlogN the bar for
| "efficient" as far as algorithms go. Anything polynomial is
| inefficient, and anything better than NlogN is like, "super
| efficient" in my mind.
|
| What is efficient and what ain't depends totally on
| context.
|
| For example, for some domains even proving that a problem
| is in NP (instead of something worse), means that they
| consider the problem tractable, because we have good
| solvers for many NP hard problems in practice.
|
| For other domains, you want O(1) or O(log n) or O(log* n)
| etc.
|
| In general, P vs NP is such a big deal, because polynomial
| runtime is generally seen as tractable, and anything worse
| as intractable.
|
| (Btw, O(n log n) _is_ a polynomial runtime, and so is O(N)
| or even O(1). Though I can see that as a shorthand of
| notation you are excluding them.)
| kortex wrote:
| > (Btw, O(n log n) _is_ a polynomial runtime, and so is
| O(N) or even O(1). Though I can see that as a shorthand
| of notation you are excluding them.)
|
| I mean yes, technically n^1 is a polynomial, but normally
| this is just called linear. You hit the nail on the head
| with the last sentence. Here I'm using polynomial to
| basically mean "n^k, k>1". From a NP vs P point of view,
| P is considered tractable. But from a "actually writing
| libraries" point of view, I think software developers
| consider O(n^2) "kinda bleh" if it can be at all avoided.
| Also developers tend to give much more weight to the
| actual coefficients / parallelism potential / practical
| runtime. Neural networks have superlinear bounds
| everywhere, but GPU goes brrr.
|
| Brass tacks, sorting some tuples and diffing the rank
| then reducing strikes me as likely to be:
|
| - pretty fast even in the naive implementation
|
| - sorting is a well trodden field
|
| - easy to write things like shaders for
|
| - amenable to speculative execution
|
| - amenable to divide and conquer
|
| So that is what I mean by efficient.
| frutiger wrote:
| The "P" in NP means a solution to a decision problem is
| verifiable in polynomial time.
|
| The "N" in NP means the solution can be produced by a
| non-deterministic Turing machine (i.e. a machine that can
| explore many branches of the search space in parallel at
| each step).
|
| NP does not mean "non-polynomial" time to produce a
| solution, otherwise P != NP by definition. Instead P is a
| subset of NP.
|
| The open question is if all decision problems with
| solutions verifiable in polynomial time can also be
| produced in polynomial time.
| [deleted]
| shahbazac wrote:
| This tweet contains a visualization comparing linear
| correlation and "Chatterjee correlation":
|
| https://twitter.com/adad8m/status/1474754752193830912?s=21
| extremelearning wrote:
| This new coefficient of correlation is really really awesome,
| and this visualization shows its value in such a beautifully
| simple presentation.
|
| It would be great if someone who has Wikipedia edit privileges
| could edit the Wikipedia article at [1] to describe/link how
| Chatterjee's correlation coefficient solves many of the known
| limitations of Pearson's correlation coefficient. ;)
|
| [1]
| https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
|
| (especially second top-left diagram)
| [deleted]
| querez wrote:
| Easy there. New correlation coefficients get proposed all the
| time (eg. the introduction of the linked paper lists ~10-20
| alone!). It's not a good idea to add every newly proposed
| coefficient to established wiki pages, just because they
| trend on social media. Yes, the paper looks nice, but if you
| read any new paper proposing a new measure, they all do!
| They're meant to be written that way. Let the community
| decide and test and discuss, and if in 10 years this new
| coefficient is well accepted and has proven itself, we can
| think about your proposed edit. Doing it before is putting
| the cart before the horse, and is a recipe for astroturfing.
| querez wrote:
| Since it's rank-based, it would be nice to see a comparison to
| Spearman's correlation instead. It's very easy to find failure
| cases for Pearson, so the visualization is next to meaningless,
| IMO.
| nextaccountic wrote:
| Can information theoretic measures like Granger causality[0] be
| used for this purpose?
|
| [0] https://en.wikipedia.org/wiki/Granger_causality
| csee wrote:
| Granger causality is about linear inter-temporal dependence.
| Chatterjee correlation is about non-linear contemporaneous (or
| time-independent) dependence. They're tools for quite different
| applications.
| pizza wrote:
| Looks like you could do the opposite: use this measure as a
| proxy for correlation between input data X and Y, and see
| whether increased correlation between X and Y makes prediction
| of Y better than prediction of Y without X. Maybe model the
| correlation value going over a threshold as a sequence of
| Bernoulli trials.
___________________________________________________________________
(page generated 2021-12-26 23:01 UTC)