[HN Gopher] How percentile approximation works and why it's more...
___________________________________________________________________
How percentile approximation works and why it's more useful than
averages
Author : od0
Score : 415 points
Date : 2021-09-14 16:14 UTC (6 hours ago)
(HTM) web link (blog.timescale.com)
(TXT) w3m dump (blog.timescale.com)
| axpy906 wrote:
| There's something called a five number summary in statistics:
| mean, median, standard deviation, 25th percentile and 75th
| percentile.
|
| The bonus is that the 75th - 50th gives you the interquartile
| range.
|
| The mean is not a robust measure, and as such you need to look at
| the variability to truly understand the spread of your data.
| bluesmoon wrote:
| IQR is 75th - 25th, aka, the middle-50%
| axpy906 wrote:
| You're right, I mistyped.
| monkeydust wrote:
| On top of this, box plots are a good way to visualize and show a
| distribution, especially to a not-so-stats-heavy audience.
| 10000truths wrote:
| If you're going to use multiple quantities to summarize a
| distribution, wouldn't using percentiles for all of them give
| you the most information? The mean and standard deviation could
| then be estimated from that data.
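If you do keep a fine grid of percentiles around, the mean and standard
deviation can indeed be backed out approximately, since the mean is the
integral of the quantile function over [0, 1]. A minimal sketch of that
idea (hypothetical helper, assuming a value is stored at every whole
percentile):

    import numpy as np

    def moments_from_percentiles(pcts):
        """Approximate mean and std dev from stored percentile values.

        pcts: 101 values, pcts[k] = k-th percentile of the data.
        Uses mean = integral of the quantile function Q(p) over [0, 1],
        approximated with the trapezoid rule; E[X^2] is done likewise.
        """
        pcts = np.asarray(pcts, dtype=float)
        dp = 1.0 / (len(pcts) - 1)

        def trapezoid(y):
            return float(np.sum((y[1:] + y[:-1]) * 0.5) * dp)

        mean = trapezoid(pcts)
        second_moment = trapezoid(pcts ** 2)
        std = (max(second_moment - mean ** 2, 0.0)) ** 0.5
        return mean, std

    # Compare against exact values on synthetic data.
    data = np.random.default_rng(0).normal(100, 15, size=100_000)
    pcts = np.percentile(data, np.arange(101))
    print(moments_from_percentiles(pcts))  # close to (100, 15)
    print(data.mean(), data.std())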
| [deleted]
| achenatx wrote:
| I've been trying to get the marketing team to always include a std
| deviation with averages. Average alone is simply not useful;
| standard deviation is a simple way to essentially include
| percentiles.
|
| They regularly compare experiments to the mean but don't use a
| t-test to ensure the results are actually different from the mean.
| doctorsher wrote:
| I heavily caution against the feeling that "standard deviation
| is a simple way to essentially include percentiles." The
| usefulness of the standard deviation depends on the
| distributions that you are working with. Heavy tailed
| distributions appear a fair amount in practice, and the combo
| of summary statistics mentioned would not do well on those.
| Also, Madars' comment in this thread is a beautiful example of
| this: 4 completely different distributions, with identical mean
| and standard deviation (among other things). Histograms and
| percentiles, and if necessary their approximations, are more
| desirable for the above reasons.
| slaymaker1907 wrote:
| I assume most of the distributions a marketing department
| would be dealing with are generally normal in which case
| stddev is a great way to analyze the data. This can be easily
| verified by just plotting said data and making sure the tails
| don't look weird.
| djk447 wrote:
| NB: Post author here.
|
| Std deviation definitely helps a lot, but it's still often not as
| good as percentiles. I was actually thinking about adding some of
| that in the post, but it was already getting so long. It's funny
| how things you think are simple sometimes take the most effort to
| explain; I definitely found that on this one.
| waynecochran wrote:
| Yeah -- std deviation has a similar problem to the mean in
| that it doesn't give you a full picture unless the
| distribution is close to normal / gaussian.
| esyir wrote:
| Pretty much why summary statistics often give the IQR,
| which gives some idea of the skew and shape of the
| distribution as well.
|
| Unfortunately, BD and marketing just want a single number
| to show that the value is bigger and hate anything more
| complicated than a barchart.
| varelaz wrote:
| A bar chart is basically your percentiles (just more of
| them), so why not show it? Box-and-whiskers could be more
| complicated for them, but it's still the same sort of data.
| fwip wrote:
| Barcharts across categorical data :P
|
| That is, the first bar is "Our Number" and the second bar
| is "Competitor's number."
| djk447 wrote:
| NB: Post author here.
|
| We've been meaning to add IQR as an accessor function for
| these, may have to go back and do it...the frequency
| trails [1] stuff from Brendan Gregg also goes into some
| of this and it's really cool as a visualization.
|
| [1]:
| https://www.brendangregg.com/FrequencyTrails/mean.html
| LoriP wrote:
| One thing I like about this post is that it explains things in
| an accessible way before getting into a deep dive. Might be
| worth sharing with the marketing team as they'll "get" long
| tail in the context of web search, so the concept is fairly
| transferable to stuff that they would know about.
| 123pie123 wrote:
| I had to explain the data usage of an interface that looked
| extremely busy from the standard graphs.
|
| I did a percentile graph of the usage - the data was typically
| using only 5% of the maximum throughput - but no one could really
| understand the graph.
|
| So I did a zoomed-in version of the normal data usage graph, and
| it looked like a blip lasting 1/20 of the time. Everyone got that:
| it was peaking every few seconds and then doing nothing for ages.
| madars wrote:
| Good opportunity to plug
| https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't
| know much about the underlying distribution, simple statistics
| don't describe it well.
|
| From Wikipedia description: Anscombe's quartet comprises four
| data sets that have nearly identical simple descriptive
| statistics, yet have very different distributions and appear very
| different when graphed. Each dataset consists of eleven (x,y)
| points. They were constructed in 1973 by the statistician Francis
| Anscombe to demonstrate both the importance of graphing data
| before analyzing it, and the effect of outliers and other
| influential observations on statistical properties.
| doctorsher wrote:
| This is excellent information, thank you for posting this! I
| was not familiar with this example previously, but it is a
| _perfect_ example of summary statistics not capturing certain
| distributions well. It's very approachable, even if you had to
| limit the discussion to mean and variance alone. Bookmarked,
| and much appreciated.
| pdpi wrote:
| There's a fun paper by Autodesk where they make datasets that
| look whatever way you want them to.
|
| https://www.autodesk.com/research/publications/same-stats-di...
| djk447 wrote:
| NB: Post author here.
|
| This is great! So fun...will have to use in the future...
| pdpi wrote:
| While I have your attention...
|
| > For the median and average to be equal, the points less
| than the median and greater than the median must have the
| same distribution (i.e., there must be the same number of
| points that are somewhat larger and somewhat smaller and
| much larger and much smaller).
|
| [0, 2, 5, 9, 9] has both median and mean = 5, but the two
| sides don't really have the same distribution.
| djk447 wrote:
| Totally true...thoughts on how I could rephrase? I guess
| it's more the "weight" of points greater than and less
| than the median should be the same, so symmetric
| distributions definitely have it, asymmetric may or may
| not. Definitely open to revising...
| pdpi wrote:
| > symmetric distributions definitely have it, asymmetric
| may or may not
|
| Doesn't have to be any more complicated than that. It's
| more a curio than an important point anyway :)
| s_gourichon wrote:
| Yes, a lesson that is both funny and insightful on how weak
| basic indicators (mean, standard deviation, correlation) can be,
| on the interest and limits of the box plot with quartiles and
| whiskers, and on the benefit of the violin plot.
|
| Definitely worth a quick look.
| djk447 wrote:
| NB: Post author here.
|
| That's really nifty, wish I'd heard about it earlier. Might go
| back and add a link to it in the post at some point too! Very
| useful. Definitely know I wasn't breaking new ground or
| anything, but fun to see it represented so succinctly.
| [deleted]
| pachico wrote:
| Surprisingly, many software engineers I know have never used
| percentiles and keep using the arithmetic mean. True story.
| yongjik wrote:
| Bah, I'd be happy if I could even get correct averages. I see
| pipelines getting value X1 from a server that served 100
| requests and another value X2 from a server that served one
| request, and then returning (X1+X2)/2.
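Combining per-server means correctly requires the request counts as
weights; a minimal sketch of the weighted version (hypothetical data
layout, purely to illustrate):

    def combine_means(per_server):
        """Combine per-server means into a global mean, weighting each
        mean by its request count instead of averaging the averages.

        per_server: list of (mean_latency, request_count) tuples.
        """
        total_requests = sum(count for _, count in per_server)
        if total_requests == 0:
            return None
        return sum(mean * count for mean, count in per_server) / total_requests

    # X1 from a server that served 100 requests, X2 from one that served 1:
    print(combine_means([(0.120, 100), (3.5, 1)]))  # ~0.153, not (0.120 + 3.5) / 2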
| mschuetz wrote:
| The mean is something you can easily compute progressively and
| with trivial resources. Median and percentiles, on the other
| hand, can be super expensive and potentially unsuitable for
| some real-time applications, since you need to maintain a
| sorted list of all relevant samples.
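For contrast, here is how cheap the progressive mean is; a minimal
one-pass sketch with O(1) memory:

    class RunningMean:
        """Progressive mean: constant memory, one update per sample."""

        def __init__(self):
            self.count = 0
            self.mean = 0.0

        def update(self, x):
            self.count += 1
            # Incremental form: no stored samples, less rounding drift.
            self.mean += (x - self.mean) / self.count
            return self.mean

    rm = RunningMean()
    for latency_ms in [12, 15, 11, 240, 13]:
        rm.update(latency_ms)
    print(rm.mean)  # 58.2 (to within float rounding)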
| groaner wrote:
| Not surprising, because computing mean is O(n) and median is
| O(n log n).
|
| Lack of resources or pure laziness doesn't make it the right
| measure to use though.
| gpderetta wrote:
| Introselect is O(n), right?
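Selection does avoid the full sort: numpy.partition, for example, uses
introselect, so an exact median falls out without an O(n log n) sort.
A minimal sketch:

    import numpy as np

    def median_via_selection(a):
        """Exact median via selection (introselect) rather than a full sort."""
        a = np.asarray(a, dtype=float)
        n = a.size
        mid = n // 2
        if n % 2 == 1:
            return np.partition(a, mid)[mid]
        # For even n, both middle order statistics are needed.
        part = np.partition(a, [mid - 1, mid])
        return 0.5 * (part[mid - 1] + part[mid])

    data = np.random.default_rng(1).pareto(2.0, size=1_000_001)
    print(median_via_selection(data), np.median(data))  # identical values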
| lewispb wrote:
| Side note, but I love the animations, code snippet design and
| typography in this blog post.
|
| Will think about how I can improve my own blog with these ideas.
| djk447 wrote:
| Thank you! Huge shout out to Shane, Jacob and others on our
| team who helped with the graphics / design elements!
| solumos wrote:
| looking at a histogram is probably the best thing you can do to
| understand how data is distributed
| airstrike wrote:
| Sorry for the minor nitpick, but I find it a bit unusual
| (disappointing?) that there's an image of a candlestick chart at
| the very top, but the article only uses API response times as
| examples...
| Lightbody wrote:
| Whenever this topic comes up, I always encourage folks to watch
| this 2011 classic 15m talk at the Velocity conference by John
| Rauser:
|
| https://www.youtube.com/watch?v=coNDCIMH8bk
| mherdeg wrote:
| I've skimmed some of the literature here when I've spent time
| trying to help people with their bucket boundaries for
| Prometheus-style instrumentation of things denominated in
| "seconds", such as processing time and freshness.
|
| My use case is a little different from what's described here or
| in a lot of the literature. Some of the differences:
|
| (1) You have to pre-decide on bucket values, often hardcoded or
| stored in code-like places, and realistically won't bother to
| update them often unless the data look unusably noisy.
|
| (2) Your maximum number of buckets is pretty small -- like, no
| more than 10 or 15 histogram buckets probably. This is because my
| metrics are very high cardinality (my times get recorded
| alongside other dimensions that may have 5-100 distinct values,
| things like server instance number, method name, client name, or
| response status).
|
| (3) I think I know what percentiles I care about -- I'm
| particularly interested in minimizing error for, say, p50, p95,
| p99, p999 values and don't care too much about others.
|
| (4) I think I know what values I care about knowing precisely!
| Sometimes people call my metrics "SLIs" and sometimes they even
| set an "SLO" which says, say, I want no more than 0.1% of
| interactions to take more than 500ms. (Yes, those people say, we
| have accepted that this means that 0.1% of people may have an
| unbounded bad experience.) So, okay, fine, let's force a bucket
| boundary at 500ms and then we'll always be measuring that SLO
| with no error.
|
| (5) I know that the test data I use as input don't always reflect
| how the system will behave over time. For example I might feed my
| bucket-designing algorithm yesterday's freshness data and that
| might have been a day when our async data processing pipeline was
| never more than 10 minutes backlogged. But in fact in the real
| world every few months we get a >8 hour backlog and it turns out
| we'd like to be able to accurately measure the p99 age of
| processed messages even if they are very old... So despite our
| very limited bucket budget we probably do want some buckets at 1,
| 2, 4, 8, 16 hours, even if at design time they seem useless.
|
| I have always ended up hand-writing my own error-approximation
| function which takes inputs like
|
| (1) sample data - a representative subset of the actual times
| observed in my system yesterday
|
| (2) proposed buckets - a bundle of, say, 15 bucket boundaries
|
| (3) percentiles I care about
|
| then returns as output info about how far off (%age error) each
| estimated percentile is from the actual value for my sample data.
|
| Last time I looked at this I tried using libraries that purport
| to compute very good bucket boundaries but they give me, like,
| 1500 buckets with very nice tiny error, but no clear way to make a
| real-world choice about collapsing them into a much smaller set
| of buckets with comparatively huge, but manageable, error.
|
| I ended up just advising people to
|
| * set bucket boundaries at SLO boundaries, and be sure to update
| when the SLO does
|
| * actually look at your data and understand the data's shape
|
| * minimize error for the data set you have now; logarithmic
| bucket sizes with extra buckets near the distribution's current
| median value seems to work well
|
| * minimize worst-case error if the things you're measuring grow
| very small or very large and you care about being able to observe
| that (add extra buckets)
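A minimal sketch of that kind of error-approximation helper
(hypothetical names throughout; it assumes Prometheus-style cumulative
buckets and linear interpolation within a bucket, roughly what
histogram_quantile does):

    import numpy as np

    def percentile_error(samples, bucket_bounds, percentiles):
        """Compare exact percentiles to what a bucketed histogram reports.

        samples:       observed values (e.g. yesterday's latencies, seconds)
        bucket_bounds: proposed upper bounds, ascending (an implicit +Inf
                       bucket is added, as in Prometheus histograms)
        percentiles:   e.g. [50, 95, 99, 99.9]

        Returns {percentile: (exact, estimated, pct_error)}.
        """
        samples = np.sort(np.asarray(samples, dtype=float))
        bounds = list(bucket_bounds) + [float("inf")]
        counts = np.array([np.searchsorted(samples, b, side="right")
                           for b in bounds])  # cumulative counts per bucket

        results = {}
        for p in percentiles:
            exact = np.percentile(samples, p)
            rank = p / 100.0 * len(samples)
            i = int(np.searchsorted(counts, rank))   # first bucket reaching rank
            lo = 0.0 if i == 0 else bounds[i - 1]
            hi = bounds[i]
            prev = 0 if i == 0 else counts[i - 1]
            in_bucket = counts[i] - prev
            if np.isinf(hi) or in_bucket == 0:
                est = lo                              # no interpolation possible
            else:
                est = lo + (hi - lo) * (rank - prev) / in_bucket
            err = abs(est - exact) / exact * 100 if exact else float("nan")
            results[p] = (exact, est, err)
        return results

    # e.g. force a bucket boundary at a 500ms SLO:
    bounds = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2, 4, 8]
    data = np.random.default_rng(2).lognormal(mean=-3, sigma=1, size=50_000)
    for p, (exact, est, err) in percentile_error(data, bounds, [50, 95, 99, 99.9]).items():
        print(f"p{p}: exact={exact:.4f} est={est:.4f} error={err:.1f}%")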
| convolvatron wrote:
| look at Greenwald-Khanna (and the follow-on work). It adapts the
| bucket size to minimize the total error (and the number of
| buckets is proportional to log-epsilon I think).
|
| with all the competition and 'innovation', you would think the
| data dogs of this world would grow up beyond time series.
| slaymaker1907 wrote:
| One thing you could try is variance-optimal histograms.
| These are histograms which set bucket boundaries such that the
| weighted average variance in the buckets is minimized.
| Unfortunately, this algorithm is quadratic with the number of
| data points, but you could try approximating the optimum by
| taking random subsets of the data, computing the histogram,
| then seeing how well the histogram does on the whole dataset.
|
| http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...
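A minimal sketch of that dynamic program (the classic O(n^2 * B)
formulation; minimizing the total within-bucket sum of squared error
is equivalent to minimizing the count-weighted average variance):

    import numpy as np

    def v_optimal_buckets(values, n_buckets):
        """Variance-optimal (V-optimal) histogram boundaries via DP.

        Splits the sorted values into n_buckets contiguous groups so that
        the total within-bucket squared deviation from each bucket mean is
        minimal. O(n^2 * n_buckets), so run it on a random subsample for
        large data. Returns (bucket_upper_bounds, total_error).
        """
        x = np.sort(np.asarray(values, dtype=float))
        n = len(x)
        ps = np.concatenate(([0.0], np.cumsum(x)))        # prefix sums
        ps2 = np.concatenate(([0.0], np.cumsum(x ** 2)))  # prefix sums of squares

        def sse(i, j):  # sum of squared error of x[i:j], j exclusive
            s = ps[j] - ps[i]
            return (ps2[j] - ps2[i]) - s * s / (j - i)

        INF = float("inf")
        cost = np.full((n + 1, n_buckets + 1), INF)
        split = np.zeros((n + 1, n_buckets + 1), dtype=int)
        cost[0][0] = 0.0
        for j in range(1, n + 1):
            for b in range(1, n_buckets + 1):
                for i in range(b - 1, j):            # last bucket covers x[i:j]
                    c = cost[i][b - 1] + sse(i, j)
                    if c < cost[j][b]:
                        cost[j][b], split[j][b] = c, i
        # Walk the split points back to recover bucket upper bounds.
        bounds, j = [], n
        for b in range(n_buckets, 0, -1):
            bounds.append(float(x[j - 1]))
            j = split[j][b]
        return sorted(bounds), float(cost[n][n_buckets])

    print(v_optimal_buckets([1, 1.1, 1.2, 5, 5.1, 20, 21, 22], 3))
    # ([1.2, 5.1, 22.0], ~2.03)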
| djk447 wrote:
| NB: Post author here.
|
| This is interesting and I totally get at least some of the
| problems you're facing. I wonder if you could take some of the
| strategies from t-digest and modify a bit to accomplish...I'd
| be interested in seeing some sort of implementation of this and
| would love to see if we can get it into our toolkit if you
| do...or you can also open up a ticket for us and we'll see if
| we can prioritize to work on something like it ourselves.
|
| I do think there are some interesting ways you can cut corners
| if you know things about the "SLO" or other sort of cutoff
| values, but I'd have to think more deeply about it to say more.
| Essentially we want a variable error rate based on the distance
| away from a known value, where you have little error in the
| values relatively close to the known value, care little if at
| all for fine distinctions on the low end and, once you get past
| some high end value you could probably bucket everything above
| it into a "large outliers" bucket or something too...meaning
| your p999 could get out of hand if you start getting lots of
| stuff above a threshold, but, if that starts happening,
| probably everything's already burning down, so might not be
| that useful to know it's burning down in a very specific way...
| louisnow wrote:
| > To calculate the 10th percentile, let's say we have 10,000
| values. We take all of the values, order them from largest to
| smallest, and identify the 1001st value (where 1000 or 10% of the
| values are below it), which will be our 10th percentile.
|
| Isn't this contradictory? If we order the values from largest to
| smallest and take the 1001st value, then 10 % of the values are
| above/larger and not below/smaller. I believe it should say order
| from smallest to largest.
| emgo wrote:
| Yes, this looks like a typo in the article. It should be
| smallest to largest.
| djk447 wrote:
| NB: Post author here.
|
| Oops, yep, that should probably be order from smallest to
| largest. Thanks for the correction!
| djk447 wrote:
| Fixed!
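For reference, the corrected nearest-rank reading as a tiny sketch (the
exact index convention varies between implementations; this is one
common choice):

    def percentile_nearest_rank(values, p):
        """Sort ascending, then pick the value with roughly p% of the
        data at or below it (nearest-rank style)."""
        ordered = sorted(values)                 # smallest to largest
        k = round(p / 100 * (len(ordered) - 1))  # index into the sorted list
        return ordered[k]

    data = list(range(1, 10_001))                # 10,000 values
    # The 1001st value, with 1,000 values (10%) below it:
    print(percentile_nearest_rank(data, 10))     # 1001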
| tiffanyh wrote:
| Statistics 101. Mean, median and mode.
| k__ wrote:
| Shouldn't averages & variance be enough for a start?
| deft wrote:
| Looks like timescale did a big marketing push this morning only
| for their whole service to go down minutes later. lol.
| satvikpendem wrote:
| I often see HN articles crop up soon after a related post, in
| this case this Ask HN poster [0] being driven crazy by people
| averaging percentiles and them not seeing that it's a big deal.
| It's pretty funny to see such tuples of posts appearing.
|
| [0]: https://news.ycombinator.com/item?id=28518795
| michaelmdresser wrote:
| Looking particularly at latency measurements, I found the "How
| NOT to Measure Latency" [1] talk very illuminating. It goes quite
| deep into discussing how percentiles can be used and abused for
| measurement.
|
| [1]: https://www.infoq.com/presentations/latency-response-time/
| lanstin wrote:
| I watch this video once a year and send it to my co-workers
| whenever averages or medians show up in a graph for public
| consumption.
| peheje wrote:
| Are the points written in a readable format anywhere?
| varelaz wrote:
| Percentiles are for sure better than the average if you want to
| explore a distribution: there are several percentiles compared to
| a single average. However, the median is harder to use if you want
| to do calculations based on this metric. For example, the
| distribution of the sample median could be a problem if you want
| to understand a confidence interval for it.
| djk447 wrote:
| NB: Post author here.
|
| That can totally be true. In our case, we use these approximation
| methods that allow you to get multiple percentiles "for free", but
| you definitely need to choose the right ones for the job. (We talk
| a bit more about the whole approach where we do the aggregation
| then the accessor thing in the previous post on two-step
| aggregation [1]). But there are definitely times when
| averages/stddev and potentially the 3rd and 4th moments are
| more useful etc.
|
| [1]: https://blog.timescale.com/blog/how-postgresql-
| aggregation-w...
| motohagiography wrote:
| The recovering product manager in me sees 90th percentile queries
| with outlier high latency and starts asking not how to reduce it,
| but how we can spin it out into a dedicated premium query feature:
| if they're willing to wait, they're probably also willing to pay.
|
| Highly recommend modelling your solution using queueing theory
| with this: https://queueing-tool.readthedocs.io/en/latest/
|
| As an exercise in discovering the economics of how your platform
| works, even just thinking about it in these terms can save a
| great deal of time.
| tunesmith wrote:
| I just recently tried giving a presentation to my department
| (they're developers, I'm the architect) about this stuff and they all
| just sort of blinked at me. It brought in Little's Law and
| Kingman's Formula, in an attempt to underscore why we need to
| limit variation in the response times of our requests.
|
| There are a bunch of queuing theory formulas that are really cool
| but don't exactly apply if your responses vary a lot like the
| article describes. I think the assumption is that response time
| distributions are exponential distributions, which I don't think
| is a good assumption (is an Erlang distribution an exponential
| distribution?) Nevertheless, hooking the equations up to some
| models is a good way to get directional intuition. I didn't
| realize how steep the performance drop-off is for server
| utilization until I started moving sliders around.
|
| Our ops team doesn't really follow this either. We're not a huge
| department though - is this the kind of stuff that an SRE team
| usually pays attention to?
| djk447 wrote:
| NB: Post author here.
|
| I found it surprisingly difficult to explain well. Took a lot
| of passes and a lot more words than I was expecting. It seems
| like such a simple concept. I thought the post was gonna be the
| shortest of my recent ones, and then after really explaining it
| and getting lots of edits and rewriting, it was 7000 words and
| ...whoops! But I guess it's what I needed to explain it well
| (hope you thought so anyway).
|
| It's somewhat exponential, but yeah, not necessarily, it's
| definitely long-tailed in some way and it sort of doesn't
| matter what the theoretical description is (at least in my
| mind) the point is these types of distributions really don't
| get described well by a lot of typical statistics.
|
| Can't talk too much about SREs at the ad analytics company I
| mentioned, we were the backend team that wrote a lot of the
| backend stores / managed databases / ran the APIs and monitored
| this stuff a bit (and probably not all that well). It was a bit
| more ad hoc I guess, probably now that company is large enough
| they have a dedicated team for that...
| tunesmith wrote:
| The article hit the pocket pretty exactly for how I felt I
| needed to explain it. (Actually I had the same experience
| where I thought I would be able to go over it quickly and
| then I felt like I was getting mired.) The graphs look great
| too - I've been able to explore it pretty well with
| JupyterLab but I can't export that into something super-
| interactive that is read-only. I've thought about creating an
| idyll page out of it to help others explore but that's a bit
| overkill for the work style our team has.
|
| I _think_ the weirdly-shaped long-tail graphs we come across
| are just sums of more naturally-distributed response times,
| for different types of responses. Another reason to limit
| variation I think.
| benlivengood wrote:
| > I think the weirdly-shaped long-tail graphs we come
| across are just sums of more naturally-distributed response
| times, for different types of responses. Another reason to
| limit variation I think.
|
| The best method I've found for digging into latency is
| profiling RPC traces to see where time is spent. This at
| least separates the code and parameters that have simple
| latency distributions from those that don't. Some
| distributions will be very data-dependent (SQL queries).
| bqe wrote:
| A while ago I wrote a Python library called LiveStats[1] that
| computed any percentile for any amount of data using a fixed
| amount of memory per percentile. It uses an algorithm I found in
| an old paper[2] called P^2. It uses a polynomial to find good
| approximations.
|
| The reason I made this was an old Amazon interview question. The
| question was basically, "Find the median of a huge data set
| without sorting it," and the "correct" answer was to have a fixed
| size sorted buffer and randomly evict items from it and then use
| the median of the buffer. However, a candidate I was interviewing
| had a really brilliant insight: if we estimate the median and
| move it a small amount for each new data point, it would be
| pretty close. I ended up doing some research on this and found
| P^2, which is a more sophisticated version of that insight.
|
| [1]: https://github.com/cxxr/LiveStats
|
| [2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
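That interview insight on its own, separate from the full P^2 algorithm
(which also does piecewise-parabolic interpolation), is sometimes
called a "frugal" streaming estimate. A minimal sketch, with a
hypothetical fixed step size:

    import random

    def frugal_median(stream, step=0.1):
        """Estimate the median in O(1) memory: nudge the estimate up when
        a sample lands above it, down when one lands below it. It settles
        into the neighborhood of the median for long streams."""
        estimate = None
        for x in stream:
            if estimate is None:
                estimate = x
            elif x > estimate:
                estimate += step
            elif x < estimate:
                estimate -= step
        return estimate

    random.seed(0)
    data = [random.gauss(100, 15) for _ in range(1_000_000)]
    print(frugal_median(data), sorted(data)[len(data) // 2])  # both near 100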
| [deleted]
| [deleted]
| cyral wrote:
| There are some newer data structures that take this to the next
| level such as T-Digest[1], which remains extremely accurate
| even when determining percentiles at the very tail end (like
| 99.999%)
|
| [1]: https://arxiv.org/pdf/1902.04023.pdf /
| https://github.com/tdunning/t-digest
| fotta wrote:
| yep I had to implement t-digest in a monitoring library.
| another alternative (although older) that the prometheus
| libraries use is CKMS quantiles [0].
|
| [0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-
| icde.pd...
| djk447 wrote:
| NB: Post author here.
|
| Yeah, that was one of the reasons we chose it as one of the
| ones to implement; it seemed like a really interesting
| tradeoff. We also used UDDSketch[1], which provides relative
| error guarantees, which is pretty nifty. We thought they
| provided different enough tradeoffs that we wanted to
| implement both.
|
| [1]: https://arxiv.org/abs/1908.10693
| vvern wrote:
| Is it using https://github.com/tvondra/tdigest under the
| hood, or a separate implementation?
| Lockerman wrote:
| a separate implementation
| mfreed wrote:
| If folks are interested:
|
| https://github.com/timescale/timescaledb-
| toolkit/blob/main/e...
|
| (The TimescaleDB Toolkit is also implemented in Rust)
| riskneutral wrote:
| Also relevant: Single-Pass Online Statistics Algorithms
|
| [1] http://www.numericalexpert.com/articles/single_pass_stat/
| breuleux wrote:
| That's pretty neat! Can these be used to efficiently compute
| rolling percentiles (over windows of the data), or just
| incremental?
| [deleted]
| WireBaron wrote:
| The UDDSketch (default) implementation will allow rolling
| percentiles, though we still need a bit of work on our end
| to support it. There isn't a way to do this with TDigest
| however.
| jeffbee wrote:
| Sure there is. You simply maintain N phases of digests,
| and every T time you evict a phase and recompute the
| summary (because T-digests are easily merged).
| cyral wrote:
| This is what I do, it's not a true rolling digest but it
| works well enough for my purposes.
| djk447 wrote:
| I think this would be a tumbling window rather than a
| true "rolling" tdigest. I suppose you could decrement the
| buckets, but it gets a little weird as splits can't
| really be unsplit. The tumbling window one would probably
| work, though TDigest is a little weird on merge etc., as
| it's not completely deterministic with respect to
| ordering and merging (UDDSketch is), so it's _likely_ you
| get something that is more than good enough, but it
| wouldn't be the same as if you just calculated it
| directly, so it gets a little confusing and difficult.
|
| (NB: Post author here).
| convolvatron wrote:
| i think the new ones started with Greenwald-Khanna. but i
| definitely agree - p^2 can be a little silly and misleading.
| in particular it is really poor at finding those little modes
| on the tail that correspond to interesting system behaviours.
| cyral wrote:
| That sounds familiar, I remember reading about Greenwald-
| Khanna before I found T-Digest after I ran into the "how to
| find a percentile of a massive data set" problem myself.
| djk447 wrote:
| NB: Post author here.
|
| Thanks for sharing! Hadn't heard of that algorithm, have seen a
| number of other ones out there, we chose a couple that we knew
| about / were requested by users. (And we are open to more user
| requests if folks want to use other ones!
| https://github.com/timescale/timescaledb-toolkit and open an
| issue!)
| tantalor wrote:
| > Find the median ... randomly evict items
|
| So, not find, but approximate. That's a different thing.
| Izikiel43 wrote:
| > The question was basically, "Find the median of a huge data
| set without sorting it,"
|
| Isn't this done using a min heap and a max heap in conjunction?
| 5faulker wrote:
| It's funny that this is often left out of a data structures
| & algorithms class.
| lanstin wrote:
| Because unlike many dynamic programming algorithms, it is
| something anyone running a large system will need.
| thaumasiotes wrote:
| Is the minheap-maxheap approach faster than sorting the
| data? The obvious approach (process each element, one by
| one, into the appropriate heap, and rebalance the heaps so
| they are of equal size) takes n log n time and linear
| space. You can use the same resources to just produce a
| sorted copy of the input, which is a much better thing to
| have than two heaps that center on the median.
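For reference, the two-heap running median being discussed, as a
minimal sketch; it is exact, but as noted it holds every sample in
memory:

    import heapq

    class StreamingMedian:
        """Exact running median: a max-heap for the lower half of the
        samples and a min-heap for the upper half, kept balanced to
        within one element."""

        def __init__(self):
            self.lo = []  # max-heap via negated values
            self.hi = []  # min-heap

        def add(self, x):
            if not self.lo or x <= -self.lo[0]:
                heapq.heappush(self.lo, -x)
            else:
                heapq.heappush(self.hi, x)
            # Rebalance so the halves differ in size by at most one.
            if len(self.lo) > len(self.hi) + 1:
                heapq.heappush(self.hi, -heapq.heappop(self.lo))
            elif len(self.hi) > len(self.lo) + 1:
                heapq.heappush(self.lo, -heapq.heappop(self.hi))

        def median(self):
            if len(self.lo) > len(self.hi):
                return -self.lo[0]
            if len(self.hi) > len(self.lo):
                return self.hi[0]
            return (-self.lo[0] + self.hi[0]) / 2

    sm = StreamingMedian()
    for x in [5, 1, 9, 2, 7, 3]:
        sm.add(x)
    print(sm.median())  # 4.0 (median of 1, 2, 3, 5, 7, 9)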
| ilikebits wrote:
| The real constraint here is probably "find the median of a
| huge data set without holding the entire data set in memory".
| robotresearcher wrote:
| 'Estimate the median of an arbitrary sized data set using a
| constant amount of memory'.
| tantalor wrote:
| No, for 2 reasons,
|
| 1. huge data set: heaps requires storing the whole set, but
| "huge" means "more than you can store"
|
| 2. without sorting it: heaps are basically semi-sorted, so
| you are breaking the rules
| xyzzyz wrote:
| You can actually construct the heap from unsorted data in
| O(n) time, so constructing the heap is definitely not
| sorting. However, yeah, to actually use the heap to find
| median in O(n) time, you need to do something similar to
| magic-five (median of medians) algorithm.
| thaumasiotes wrote:
| > but "huge" means "more than you can store"
|
| Really? Where's it coming from?
| namrog84 wrote:
| I think they mean having more than you can store
| simultaneously on a single device.
|
| With a few exceptions this is pretty common scenario.
| Dylan16807 wrote:
| More than _you_ can store.
|
| And possibly it's live data.
| abnry wrote:
| One way to think about why we tend to use averages instead of
| medians is that it is related to a really deep theorem in
| probability: The Central Limit Theorem.
|
| But I think if we twist our heads a bit, we can see that this is
| backwards. Mathematically, the mean is much easier to work with
| because it is linear and we can do algebra with it. That's how we
| got the Central Limit Theorem. Percentiles and the median, except
| for symmetric distributions, are not as easy to work with. They
| involve solving for the inverse of the cumulative function.
|
| But in many ways, the median and percentiles are a more relevant
| and intuitive number to think about. Especially in contexts where
| linearity is inappropriate!
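One concrete way to see the "algebra" point: expectations add, but
medians generally do not. A quick simulation sketch:

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(1.0, size=1_000_000)  # skewed distribution
    y = rng.exponential(1.0, size=1_000_000)

    # Means are linear: mean(x + y) == mean(x) + mean(y), up to noise.
    print(np.mean(x + y), np.mean(x) + np.mean(y))        # both ~2.0

    # Medians are not: median(x + y) != median(x) + median(y) here.
    print(np.median(x + y), np.median(x) + np.median(y))  # ~1.68 vs ~1.39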
| a-dub wrote:
| i think of it as: if the data is gaussian, use a mean,
| otherwise go non-parametric (medians/percentiles).
|
| or put another way, if you can't model it, you're going to have
| to sort, or estimate a sort, because that's all that's really
| left to do.
|
| this shows up in things from estimating centers with
| means/percentiles to doing statistical tests with things like
| the wilcoxon tests.
| lanstin wrote:
| Assume up front none of your measured latencies from a
| software networked system will be Gaussian, or
| <exaggeration> you will die a painful death </exaggeration>.
| Even ping times over the internet have no mean. The only good
| thing about means is you can combine them easily, but since
| they are probably a mathematical fiction, combining them is
| even worse. Use T-Digest or one of the other algorithms being
| highlighted here.
| a-dub wrote:
| yep, have made that mistake before. even turned in a write-
| up for a measurement project in a graduate level systems
| course that reported network performance dependent
| measurements with means over trials with error bars from
| standard deviations.
|
| sadly, the instructor just gave it an A and moved on. (that
| said, the amount of work that went into a single semester
| project was a bit herculean, even if i do say so myself)
| bluesmoon wrote:
| This is exactly the algorithm we developed at LogNormal (now part
| of Akamai) 10 years ago for doing fast, low-memory percentiles on
| large datasets.
|
| It's implemented in this Node library:
| https://github.com/bluesmoon/node-faststats
|
| Side note: I wish everyone would stop using the term Average to
| refer to the Arithmetic mean. "Average" just means some statistic
| used to summarize a dataset. It could be the Arithmetic Mean,
| Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch
| of other statistics. We're stuck with AVG because that's the
| function used by early databases and Lotus 123.
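For what it's worth, Python's standard library already treats these as
distinct summaries; a quick sketch of how far apart they land on skewed
data:

    import statistics

    latencies_ms = [12, 13, 13, 14, 15, 16, 18, 25, 40, 900]  # one slow outlier

    print(statistics.mean(latencies_ms))            # 106.6 (arithmetic mean)
    print(statistics.median(latencies_ms))          # 15.5
    print(statistics.mode(latencies_ms))            # 13
    print(statistics.geometric_mean(latencies_ms))  # ~25.4
    print(statistics.harmonic_mean(latencies_ms))   # ~17.9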
| JackFr wrote:
| No, we're stuck with it because "average" has been used
| colloquially for the arithmetic mean for decades.
|
| I wish people would stop bad-mouthing the arithmetic mean. If
| you have to convey information about a distribution and you've
| got only one number to do it, the arithmetic mean is for you.
| bluesmoon wrote:
| Yes, Lotus 123 came out 38 years ago :)
| jrochkind1 wrote:
| I think it still depends on the nature of the data and your
| questions about it, and median is often the better choice if
| you have to pick only one.
| hnuser123456 wrote:
| Gamers have an intuitive sense of this. Your average framerate
| can be arbitrarily high, but if you have a big stutter every
| second between the smooth moments, then a lower but more
| consistent framerate may be preferable. This is typically
| expressed as the 1% and 0.1% slowest frames, which, at a
| relatively typical 100fps, represent the slowest frame every
| second and every 10 seconds, respectively.
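A rough sketch of how those "1% lows" get computed from frame times
(conventions differ between tools; this version averages the slowest
fraction of frames and reports it as FPS):

    import numpy as np

    def low_fps(frame_times_ms, worst_fraction=0.01):
        """Average the slowest `worst_fraction` of frames, report as FPS."""
        ft = np.sort(np.asarray(frame_times_ms, dtype=float))[::-1]  # slowest first
        n_worst = max(1, int(len(ft) * worst_fraction))
        return 1000.0 / ft[:n_worst].mean()

    rng = np.random.default_rng(4)
    frames = rng.normal(10.0, 0.5, size=6000)  # ~100 fps, smooth
    frames[::600] = 80.0                       # a big stutter every ~6 seconds
    print(1000.0 / frames.mean())              # average FPS still looks fine (~99)
    print(low_fps(frames, 0.01))               # 1% low (~44 fps)
    print(low_fps(frames, 0.001))              # 0.1% low (12.5 fps: the stutters)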
| slaymaker1907 wrote:
| Yep, I often find it useful to limit framerate to a number I
| know my GPU can handle so that I don't get as much stuttering.
| It's better to run at 60 FPS all the time than 80 FPS most of
| the time but with stuttering.
| djk447 wrote:
| NB: Post author here.
|
| _Love_ this example. Might have to use that in a future post.
| Feel like a lot of us are running into a similar thing with
| remote work and video calls these days...
| tzs wrote:
| I have no hope of finding a cite for this, but a long time ago
| I read some command line UI research that found if you had a
| system where commands ranged from instant to taking a small but
| noticeable time and you introduced delays in the faster
| commands to make it so all commands took the same small but
| noticeable time people would think that the system was now
| faster overall.
| wruza wrote:
| I guess that's because our minds (and animal minds as well)
| are always aware of the pace of repetitive events. If
| something is off, the anxiety alarm rings. One old book on
| the brain machinery described an example of a cat that was
| relaxing near the metronome and became alert when it was
| suddenly stopped. Unpredictable delays are disturbing,
| because a mispredicted event means you may be in a dangerous
| situation and have to recalibrate _now_.
| [deleted]
| kazinator wrote:
| Gamers do not have a more intuitive sense for this than movie
| watchers, video/voice talkers, not to mention users who type
| text into bloated web browsers or lagged remote login sessions.
|
| I would rather have a rock-steady consistent 500 ms response
| time when typing text, than to have a 100 ms average response
| time which randomly spikes to outliers that go past one second.
|
| A rock-steady, though poor, event rate in a paint application
| is better for drawing a freehand curve (especially if the
| program interpolates well) than a really fast rate that
| suddenly has a glitch in it, spoiling your work.
| hnuser123456 wrote:
| Maybe not all, but ones who create and consume analysis like
| this
|
| https://www.techpowerup.com/review/msi-geforce-
| rtx-3090-gami...
| exolymph wrote:
| "Slow is smooth and smooth is fast"
| rahimnathwani wrote:
| For some things, you can't even sensibly measure the mean. For
| example, if you're measuring the mean response time for a
| service, a single failure/timeout makes the mean response time
| infinite (because 100 years from now the response still hasn't
| been received).
|
| "Why Averages Suck and Percentiles are Great":
| https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...
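A tiny illustration of that point: one request that never returns blows
up the mean, while the median barely moves. (A sketch that treats the
timed-out request as effectively infinite.)

    import math
    import statistics

    response_times = [0.12, 0.10, 0.11, 0.13, 0.09, 0.12, 0.10, 0.11, 0.14]

    print(sum(response_times) / len(response_times))  # ~0.113 s
    print(statistics.median(response_times))          # 0.11 s

    with_timeout = response_times + [math.inf]        # one request never returns

    print(sum(with_timeout) / len(with_timeout))      # inf -- the mean is gone
    print(statistics.median(with_timeout))            # 0.115 s -- barely moves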
| djk447 wrote:
| totally. that blog was also one of the sources I mentioned in
| the post! Good stuff
|
| NB: Post author here.
| contravariant wrote:
| It's indeed always worth pointing out that a mean may or may not
| exist. Same with variances/standard-deviation.
|
| The central limit theorem seems to have given people the
| slightly wrong idea that things will _always_ average out
| eventually.
| maxnoe wrote:
| Coughs Cauchy
| ruchin_k wrote:
| Spent several years in venture capital investing and averages
| were always misleading - as Nassim Taleb says "Never cross a
| river that is on average 4 feet deep"
| robbrown451 wrote:
| Median/50 percentile isn't a whole lot better in that case.
| aidenn0 wrote:
| Be careful translating percentiles of requests to percentiles of
| users; if less than 10% of your requests take over 1 second, but
| a typical user makes 10 requests, it's possible that the majority
| of your users are seeing a request take over 1 second.
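The arithmetic behind that warning, as a quick sketch (assuming,
unrealistically, that slow requests hit users independently):

    # If 10% of requests take over 1 second and a typical user makes 10
    # requests, the chance that a given user sees at least one slow one is:
    p_slow_request = 0.10
    requests_per_user = 10

    p_user_sees_slow = 1 - (1 - p_slow_request) ** requests_per_user
    print(p_user_sees_slow)  # ~0.65, roughly two thirds of users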
| djk447 wrote:
| NB: Post author here.
|
| Yep! Briefly noted that in the post, but it deserves re-stating!
| It's definitely a more complex analysis to figure out the
| percentage of users affected (though often more important). It
| could be the majority, or it could be one user who has some data
| scientist programmatically making hundreds of long API calls for
| some task... (can you tell that I ran into that? Even worse, it
| was one of our own data scientists ;) ).
___________________________________________________________________
(page generated 2021-09-14 23:00 UTC)