[HN Gopher] Histogram vs. ECDF
___________________________________________________________________
Histogram vs. ECDF
Author : r4um
Score : 57 points
Date : 2022-09-03 05:42 UTC (1 days ago)
(HTM) web link (brooker.co.za)
(TXT) w3m dump (brooker.co.za)
| mike-the-mikado wrote:
| I recommend Kernel Density Estimation as an alternative to
| histograms if you are specifically interested in the density -
| e.g. which values are particularly likely to occur (perhaps for
| multimodal distributions).
|
| https://en.wikipedia.org/wiki/Kernel_density_estimation
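|
| A minimal sketch of a KDE in Python, assuming scipy and
| matplotlib are available (the bandwidth default matters, so
| treat it as illustrative):
|
|   import numpy as np
|   from scipy.stats import gaussian_kde
|   import matplotlib.pyplot as plt
|
|   # Bimodal sample: the KDE shows both modes without picking bins
|   samples = np.concatenate([np.random.normal(0, 1, 500),
|                             np.random.normal(5, 0.5, 500)])
|
|   kde = gaussian_kde(samples)  # Gaussian kernels, Scott's rule bandwidth
|   xs = np.linspace(samples.min(), samples.max(), 200)
|   plt.plot(xs, kde(xs))        # smooth density estimate
|   plt.show()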
| iamcreasy wrote:
| I've been told box plots with a kernel density estimate on the
| side (axis) are very useful.
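|
| One way to get that combination is a violin plot, e.g. with
| seaborn (a sketch, assuming seaborn is installed; inner="box"
| draws a small box plot inside the mirrored KDE):
|
|   import numpy as np
|   import seaborn as sns
|   import matplotlib.pyplot as plt
|
|   values = np.random.lognormal(mean=0, sigma=1, size=1000)
|   sns.violinplot(x=values, inner="box")  # KDE outline + box plot
|   plt.show()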
| ttpphd wrote:
| I'm a behavioral scientist and I find both are useful. If you
| never look at a histogram it's surprisingly easy to fool yourself
| about what exactly the ecdf is telling you in certain situations,
| particularly when comparing distributions.
| aquafox wrote:
| The ECDF is particularly useful to compare two distributions. And
| it has a nice connection to the Kolmogorov-Smirnov test for
| testing whether two distributions are different: its test
| statistic is the maximum distance between the two ECDFs.
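|
| A sketch of that connection with scipy's ks_2samp (the reported
| statistic is the largest vertical gap between the two ECDFs):
|
|   import numpy as np
|   from scipy.stats import ks_2samp
|
|   a = np.random.normal(0.0, 1.0, 1000)
|   b = np.random.normal(0.2, 1.0, 1000)
|
|   # statistic = sup |ECDF_a(x) - ECDF_b(x)|; a small pvalue means
|   # "unlikely to be draws from the same distribution"
|   result = ks_2samp(a, b)
|   print(result.statistic, result.pvalue)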
| lukego wrote:
| This test seems really underrated. It's my go-to for comparing
| computer system performance (e.g. between versions on CI), since
| performance measurements often have very peculiar distributions
| and it's relatively cheap to produce enough samples.
| bigbillheck wrote:
| I've stopped using it as I found it far too sensitive to
| small differences.
| klipt wrote:
| Perhaps this comes from a misunderstanding of what
| statistical significance means. A test reporting a
| statistically significant difference doesn't mean the
| difference is big, just that it's big enough to separate
| out as a "signal" from the underlying random noise.
|
| It's basically saying "yes, given this data I am very
| confident that there is an underlying difference that is
| not just an artefact of random sampling".
| mapme wrote:
| Hmm, I was trying to do the same with the KS test on performance
| data, but it seemed extremely sensitive to outliers even when an
| eyeball comparison of the two distributions looks near identical.
| Have you run into any of those issues?
| lukego wrote:
| Not with outliers, no; I think it handles them particularly
| well (conservatively). But yes to small consistent differences
| (e.g. a uniform 1%) that affect the relative order of results
| but not by a very important amount. So you have to consider
| effect size even when the test statistic is strong.
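|
| One way to act on that (a sketch; the thresholds are made-up
| knobs, not anything standard): treat the KS statistic itself,
| the maximum ECDF gap between 0 and 1, as the effect size, and
| only flag a regression when it is both significant and large
| enough to matter.
|
|   from scipy.stats import ks_2samp
|
|   def flag_regression(baseline, candidate, min_effect=0.05,
|                       alpha=0.01):
|       # With many samples even tiny, unimportant shifts become
|       # "significant", so require a minimum effect size too.
|       res = ks_2samp(baseline, candidate)
|       return res.pvalue < alpha and res.statistic >= min_effect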
| dafelst wrote:
| While this is nice, it seems like without bucketing you would run
| into complexity issues with large amounts of data, right? i.e. to
| plot a true eCDF you need a sorted list of all the collected
| datapoints. I guess for actual plotting you have to effectively
| bucketize based on the number of pixels in your plot, but that
| seems fairly arbitrary.
|
| Histograms are nice in that they effectively compress non-trivial
| datasets (at least those that have a reasonable bounded domain)
| to something quite manageable.
|
| I guess there is nothing stopping you from doing the same thing
| here, but it does kind of discount the author's claim of not
| being able to go between histogram and eCDF.
|
| Am I missing something?
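|
| For reference, the unbucketed construction is just a sort (a
| rough sketch):
|
|   import numpy as np
|
|   def ecdf(samples):
|       # O(n log n) sort; each sample becomes one step of height 1/n
|       x = np.sort(samples)
|       y = np.arange(1, len(x) + 1) / len(x)
|       return x, y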
| AstralStorm wrote:
| Computing a sorted list online is an amortized O(1) operation,
| O(log n) worst case.
| ironSkillet wrote:
| Adding elements one at a time is O(log n) into an already
| sorted list. But producing a complete sorted list requires
| doing that n times, so you end up with n log n anyways. Am I
| missing something?
| marcosdumay wrote:
| If you have more data points than horizontal pixels, yes, you
| will bucket the data on your display resolution. That happens
| with any kind of plotting.
|
| Which is a completely different thing from the arbitrary
| bucketing for histograms: a CDF doesn't go to zero or become
| misleading if you bucket it wrong, you just lose detail.
| dafelst wrote:
| My point is more that for an eCDF you need to store n values
| (where n is the number of samples), or 2k if there are dupes
| (where k is the number of distinct values); and if you store
| that anyway, you could generate a histogram from the same data.
|
| If there are duplicate sample values, you can still store a
| sorted list of (sample,count) here and generate either a
| histogram OR an eCDF, or any other plot really.
|
| Effectively it is not a fair comparison to compare the two
| methods since they both have storage tradeoffs that are not
| really discussed.
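|
| A rough sketch of that shared representation, turning sorted
| (value, count) pairs into either plot:
|
|   import numpy as np
|
|   def from_pairs(values, counts):
|       # values sorted ascending; counts per distinct value
|       values = np.asarray(values, dtype=float)
|       counts = np.asarray(counts, dtype=float)
|       # eCDF: cumulative fraction of samples <= each value
|       ecdf_y = np.cumsum(counts) / counts.sum()
|       # histogram: re-bucket the same pairs into fixed-width bins
|       hist, edges = np.histogram(values, bins=20, weights=counts)
|       return (values, ecdf_y), (hist, edges)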
| uluyol wrote:
| This is a nice article, but one thing that's not quite right:
| you actually can go from a histogram to an eCDF (basically view
| the bucketing as a loss in measurement precision).
|
| I mention this because histograms, especially HDR histograms, are
| a very compact way of measuring distributions, and it's nice that
| you can keep those benefits and still convert to an eCDF.
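|
| A sketch of that conversion: normalize the cumulative bucket
| counts and plot them against the bucket upper bounds (the
| bucketing just limits the x-resolution):
|
|   import numpy as np
|
|   def histogram_to_ecdf(bucket_upper_bounds, bucket_counts):
|       # Each point says: this fraction of samples was <= the
|       # bucket's upper bound; precision inside a bucket is lost.
|       counts = np.asarray(bucket_counts, dtype=float)
|       cum_fraction = np.cumsum(counts) / counts.sum()
|       return np.asarray(bucket_upper_bounds), cum_fraction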
| TTPrograms wrote:
| I think there's an issue with the histogram rendering in this
| post. The rapid descent from the spike on the left is not
| consistent with high ECDF impact and the apparent binning
| resolution visible in the piecewise line-segments. In general
| histograms should not be visualized with connected line-graphs in
| this way - the standard bar graph depiction makes the bin-width
| apparent and resolves some of the issues the article needs the
| ECDF for (e.g. relative impact can be assessed visually by
| comparing the relative areas of the associated bars). The bar
| visualization also makes it possible to use varying bin sizes,
| which is extremely useful with any distribution that has tails.
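|
| A sketch of that with matplotlib's bar and variable-width bins
| (the bin edges here are made up; density=True keeps bar areas
| proportional to the fraction of samples, so areas stay
| comparable across widths):
|
|   import numpy as np
|   import matplotlib.pyplot as plt
|
|   samples = np.random.lognormal(mean=3, sigma=1, size=10_000)
|   edges = np.concatenate([np.linspace(0, 100, 21),
|                           [150, 200, 400, 1000]])
|   counts, edges = np.histogram(samples, bins=edges, density=True)
|   plt.bar(edges[:-1], counts, width=np.diff(edges),
|           align="edge", edgecolor="k")
|   plt.show()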
| chrsig wrote:
| For anyone finding themselves doing a bit of analysis using an
| eCDF, seaborn [0] has a plot for it.
|
| [0] https://seaborn.pydata.org/generated/seaborn.ecdfplot.html
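|
| Minimal usage (a sketch):
|
|   import numpy as np
|   import seaborn as sns
|   import matplotlib.pyplot as plt
|
|   latencies = np.random.lognormal(mean=3, sigma=0.5, size=1000)
|   sns.ecdfplot(x=latencies)  # step plot of the empirical CDF
|   plt.show()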
___________________________________________________________________
(page generated 2022-09-04 23:01 UTC)