[HN Gopher] How percentile approximation works and why it's more...
       ___________________________________________________________________
        
       How percentile approximation works and why it's more useful than
       averages
        
       Author : od0
       Score  : 650 points
       Date   : 2021-09-14 16:14 UTC (1 days ago)
        
 (HTM) web link (blog.timescale.com)
 (TXT) w3m dump (blog.timescale.com)
        
       | axpy906 wrote:
       | There's something called a five number summary in statistics:
       | mean, median, standard deviation, 25th percentile and 75th
       | percentile.
       | 
       | The bonus is that the 75th - 50th gives you the interquartile
       | range.
       | 
        | Mean is not a robust measure, and as such you need to look at
        | variability to truly understand the spread of your data.
        
         | bluesmoon wrote:
         | IQR is 75th - 25th, aka, the middle-50%
        
           | axpy906 wrote:
           | You're right I mistyped.
        
           | monkeydust wrote:
            | This. Box plots are a good way to visualize and show a
            | distribution, especially to a not-so-stats-heavy audience.
        
         | 10000truths wrote:
         | If you're going to use multiple quantities to summarize a
         | distribution, wouldn't using percentiles for all of them give
         | you the most information? The mean and standard deviation could
         | then be estimated from that data.
        
       | ableal wrote:
       | Just as a note, same topic:
       | https://news.ycombinator.com/item?id=19552112
       | 
       | """ Averages Can Be Misleading: Try a Percentile (2014)
       | (elastic.co) 199 points by donbox on April 2, 2019 | | 55
       | comments """
        
       | jpgvm wrote:
       | If you are doing this sort of work I highly recommend the
       | Datasketches library. It's used in Druid, which is a database
       | specifically designed for these sorts of aggregations over
       | ridiculously large datasets.
        
       | [deleted]
        
       | achenatx wrote:
        | I've been trying to get the marketing team to always include a
        | standard deviation with averages. An average alone is simply not
        | useful; standard deviation is a simple way to essentially include
        | percentiles.
        | 
        | They regularly compare experiments to the mean but don't use a
        | t-test to ensure the results are actually different from the mean.
        
         | doctorsher wrote:
         | I heavily caution against the feeling that "standard deviation
         | is a simple way to essentially include percentiles." The
         | usefulness of the standard deviation depends on the
         | distributions that you are working with. Heavy tailed
         | distributions appear a fair amount in practice, and the combo
         | of summary statistics mentioned would not do well on those.
         | Also, Madars' comment in this thread is a beautiful example of
         | this: 4 completely different distributions, with identical mean
         | and standard deviation (among other things). Histograms and
         | percentiles, and if necessary their approximations, are more
         | desirable for the above reasons.
        
           | slaymaker1907 wrote:
           | I assume most of the distributions a marketing department
           | would be dealing with are generally normal in which case
           | stddev is a great way to analyze the data. This can be easily
           | verified by just plotting said data and making sure the tails
           | don't look weird.
        
             | im3w1l wrote:
             | I can't help but idly wonder what humans are doing when
             | they are eyeballing the tails, to see if things look good.
             | Like lets say we wanted to do the eyeball test but
             | automatically. Would the best way be to use an image
             | classifier on the plot? Is there something magic about the
             | plot representation that would make it good even for
             | computers to use?
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Std deviation definitely helps a lot, still often not as good
         | as percentiles, was actually thinking about adding some of that
         | in the post but it was already getting so long. It's funny how
         | things you think are simple sometimes take the most effort to
         | explain, definitely found that on this one.
        
           | waynecochran wrote:
           | Yeah -- std deviation has a similar problem to the mean in
           | that it doesn't give you a full picture unless the
           | distribution is close to normal / gaussian.
        
             | esyir wrote:
              | Pretty much why summary statistics often give the IQR,
              | which gives some idea of the skew and shape of the
              | distribution as well.
             | 
             | Unfortunately, BD and marketing just want a single number
             | to show that the value is bigger and hate anything more
             | complicated than a barchart.
        
               | varelaz wrote:
                | A bar chart is basically your percentiles (just more
                | of them), so why not show it? Box-and-whiskers could
                | be more complicated for them, but it's still the same
                | sort of data.
        
               | fwip wrote:
               | Barcharts across categorical data :P
               | 
               | That is, the first bar is "Our Number" and the second bar
               | is "Competitor's number."
        
               | esyir wrote:
               | Someone clearly gets it. Variability viz and spread
               | detracts from that clear message.
        
               | djk447 wrote:
               | NB: Post author here.
               | 
               | We've been meaning to add IQR as an accessor function for
               | these, may have to go back and do it...the frequency
               | trails [1] stuff from Brendan Gregg also goes into some
               | of this and it's really cool as a visualization.
               | 
               | [1]:
               | https://www.brendangregg.com/FrequencyTrails/mean.html
        
         | LoriP wrote:
         | One thing I like about this post is that it explains things in
         | an accessible way before getting into a deep dive. Might be
         | worth sharing with the marketing team as they'll "get" long
         | tail in the context of web search, so the concept is fairly
         | transferable to stuff that they would know about.
        
       | 123pie123 wrote:
        | I had to explain the data usage of an interface that looked
        | extremely busy in the standard graphs.
        | 
        | I did a percentile graph of the usage - the data was typically
        | using only 5% of the maximum throughput - but no-one could really
        | understand the graph.
        | 
        | So I did a zoomed-in version of the normal data usage graph and it
        | looked like a blip lasting 1/20 of the time - everyone got that -
        | i.e. it was peaking every few seconds and then doing nothing for
        | ages.
        
       | madars wrote:
       | Good opportunity to plug
       | https://en.wikipedia.org/wiki/Anscombe%27s_quartet : if you don't
       | know much about the underlying distribution, simple statistics
       | don't describe it well.
       | 
       | From Wikipedia description: Anscombe's quartet comprises four
       | data sets that have nearly identical simple descriptive
       | statistics, yet have very different distributions and appear very
       | different when graphed. Each dataset consists of eleven (x,y)
       | points. They were constructed in 1973 by the statistician Francis
       | Anscombe to demonstrate both the importance of graphing data
       | before analyzing it, and the effect of outliers and other
       | influential observations on statistical properties.
        
         | hoseja wrote:
         | >importance of graphing data before analyzing it
         | 
         | Very discouraging if one is trying to analyze data
         | algorithmically. Often when faced with a problem in statistics,
         | the answer is: "Look at the graph and use intuition!".
        
         | potatoman22 wrote:
         | Interesting, thanks for the concept. What's one to do for high
         | dimensional data then?
        
         | doctorsher wrote:
         | This is excellent information, thank you for posting this! I
         | was not familiar with this example previously, but it is a
         | _perfect_ example of summary statistics not capturing certain
          | distributions well. It's very approachable, even if you had to
         | limit the discussion to mean and variance alone. Bookmarked,
         | and much appreciated.
        
         | pdpi wrote:
         | There's a fun paper by Autodesk where they make datasets that
         | look whatever way you want them to.
         | 
         | https://www.autodesk.com/research/publications/same-stats-di...
        
           | djk447 wrote:
           | NB: Post author here.
           | 
           | This is great! So fun...will have to use in the future...
        
             | pdpi wrote:
             | While I have your attention...
             | 
             | > For the median and average to be equal, the points less
             | than the median and greater than the median must have the
             | same distribution (i.e., there must be the same number of
             | points that are somewhat larger and somewhat smaller and
             | much larger and much smaller).
             | 
             | [0, 2, 5, 9, 9] has both median and mean = 5, but the two
             | sides don't really have the same distribution.
        
               | djk447 wrote:
               | Totally true...thoughts on how I could rephrase? I guess
               | it's more the "weight" of points greater than and less
               | than the median should be the same, so symmetric
               | distributions definitely have it, asymmetric may or may
               | not. Definitely open to revising...
        
               | pdpi wrote:
               | > symmetric distributions definitely have it, asymmetric
               | may or may not
               | 
               | Doesn't have to be any more complicated than that. It's
               | more a curio than an important point anyway :)
        
           | s_gourichon wrote:
            | Yes, a funny and insightful lesson on how weak basic
            | indicators (mean, standard deviation, correlation) can be, on
            | the interest and limits of box plots with quartiles and
            | whiskers, and on the benefits of the violin plot.
           | 
           | Definitely worth a quick look.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | That's really nifty, wish I'd heard about it earlier. Might go
         | back and add a link to it in the post at some point too! Very
         | useful. Definitely know I wasn't breaking new ground or
         | anything, but fun to see it represented so succinctly.
        
         | [deleted]
        
       | Bostonian wrote:
       | If you think the data is normal-ish but want to account for skew
       | and kurtosis, you can try fitting a distribution such as the
       | skewed Student's t -- there are R packages for this.
        
       | pachico wrote:
        | Surprisingly, many software engineers I know have never used
        | percentiles and keep using the plain mean. True story.
        
         | yongjik wrote:
         | Bah, I'll be happy if I could even get correct averages. I see
         | pipelines getting value X1 from a server that served 100
         | requests, another value X2 from a server that served one
         | request, and then it returns (X1+X2)/2.
        
         | mschuetz wrote:
         | The mean is something you can easily compute progressively and
         | with trivial resources. Median and percentiles, on the other
         | hand, can be super expensive and potentially unsuitable for
         | some real-time applications, since you need to maintain a
         | sorted list of all relevant samples.
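          | 
          | For example, a running mean needs only a count and a sum,
          | while an exact percentile needs every sample kept around
          | (a rough Python sketch, using one simple rank convention):
          | 
          |     class RunningMean:
          |         def __init__(self):
          |             self.n, self.total = 0, 0.0
          |         def add(self, x):
          |             self.n += 1
          |             self.total += x
          |         def value(self):
          |             return self.total / self.n
          | 
          |     def exact_percentile(samples, p):
          |         # must keep and sort all samples
          |         xs = sorted(samples)
          |         i = min(len(xs) - 1, int(p / 100 * len(xs)))
          |         return xs[i]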
        
         | groaner wrote:
         | Not surprising, because computing mean is O(n) and median is
         | O(n log n).
         | 
         | Lack of resources or pure laziness doesn't make it the right
         | measure to use though.
        
           | gpderetta wrote:
           | Introselect is O(n), right?
        
       | lewispb wrote:
       | Side note, but I love the animations, code snippet design and
       | typography in this blog post.
       | 
       | Will think about how I can improve my own blog with these ideas.
        
         | djk447 wrote:
         | Thank you! Huge shout out to Shane, Jacob and others on our
         | team who helped with the graphics / design elements!
        
       | jaygreco wrote:
       | FYI, typo:
       | 
       | > "In the below graph, half of the data is to the left (shaded in
       | blue), and a half is to the right (shaded in purple), with the
       | 50th percentile directly in the center."
       | 
       | But the half on the right is actually shaded yellow.
        
         | djk447 wrote:
         | Thanks! Will get that fixed!
        
       | camel_gopher wrote:
       | Toss the percentiles to the curb and just use a histogram.
        
       | solumos wrote:
       | looking at a histogram is probably the best thing you can do to
       | understand how data is distributed
        
       | airstrike wrote:
       | Sorry for the minor nitpick, but I find it a bit unusual
       | (disappointing?) that there's an image of a candlestick chart at
       | the very top, but the article only uses API response times as
       | examples...
        
       | Lightbody wrote:
       | Whenever this topic comes up, I always encourage folks to watch
       | this 2011 classic 15m talk at the Velocity conference by John
       | Rauser:
       | 
       | https://www.youtube.com/watch?v=coNDCIMH8bk
        
       | jonnydubowsky wrote:
       | I really enjoyed this post! The author also wrote an interactive
       | demonstration of the concepts (using Desmos). It's super helpful.
       | 
       | https://www.desmos.com/calculator/ty3jt8ftgs
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Glad you liked it! I was so excited to actually be able to get
         | to use Desmos for something work wise, I've been wanting to do
         | it for years!
        
       | tdiff wrote:
       | One of the difficulties when dealing with percentiles is
       | estimating error of the estimated percentile values, which is not
       | necessarily trivial compared to estimating error of mean
       | approximation (e.g. see
       | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6294150/)
        
       | zeteo wrote:
       | If you want to calculate exact percentiles there's a simple in-
       | place algorithm that runs in expected O(n) time. You basically do
       | quicksort but ignore the "wrong" partition in each recursive
        | step. For instance, if your pivot lands at the 25th percentile
        | and you're looking for the 10th percentile, you only recurse on
        | the "left" partition at that point. It's pretty easy to implement.
       | (And rather straightforward to change to a loop, if necessary.)
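        | 
        | A rough Python sketch of that recursion (not in-place like the
        | above, but the same "recurse into only one partition" idea):
        | 
        |     import random
        | 
        |     def quickselect(xs, k):
        |         # k-th smallest (0-based), expected O(n)
        |         pivot = random.choice(xs)
        |         lows = [x for x in xs if x < pivot]
        |         highs = [x for x in xs if x > pivot]
        |         n_eq = len(xs) - len(lows) - len(highs)
        |         if k < len(lows):
        |             return quickselect(lows, k)
        |         if k < len(lows) + n_eq:
        |             return pivot
        |         return quickselect(highs, k - len(lows) - n_eq)
        | 
        |     def percentile(xs, p):
        |         # nearest-rank style index; conventions vary
        |         k = min(len(xs) - 1, int(p / 100 * len(xs)))
        |         return quickselect(list(xs), k)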
        
       | ChuckMcM wrote:
       | This is a really important topic if you're doing web services.
       | Especially if you're doing parallel processing services. When I
       | was at Blekko the "interesting" queries were the ones above the
       | 95th percentile because they always indicated "something" that
       | hadn't worked according to plan. Sometimes it was a disk going
       | bad on one of the bucket servers, sometimes it was a network port
       | dropping packets, and sometimes it was a corrupted index file.
       | But it was always something that needed to be looked at and then
       | (usually) fixed.
       | 
        | It also always separated the 'good' ad networks from the 'bad'
        | ones, as the bad ones would take too long to respond.
        
       | mherdeg wrote:
       | I've skimmed some of the literature here when I've spent time
       | trying to help people with their bucket boundaries for
       | Prometheus-style instrumentation of things denominated in
       | "seconds", such as processing time and freshness.
       | 
       | My use case is a little different from what's described here or
       | in a lot of the literature. Some of the differences:
       | 
       | (1) You have to pre-decide on bucket values, often hardcoded or
       | stored in code-like places, and realistically won't bother to
       | update them often unless the data look unusably noisy.
       | 
       | (2) Your maximum number of buckets is pretty small -- like, no
       | more than 10 or 15 histogram buckets probably. This is because my
       | metrics are very high cardinality (my times get recorded
       | alongside other dimensions that may have 5-100 distinct values,
       | things like server instance number, method name, client name, or
       | response status).
       | 
       | (3) I think I know what percentiles I care about -- I'm
       | particularly interested in minimizing error for, say, p50, p95,
       | p99, p999 values and don't care too much about others.
       | 
       | (4) I think I know what values I care about knowing precisely!
       | Sometimes people call my metrics "SLIs" and sometimes they even
       | set an "SLO" which says, say, I want no more than 0.1% of
       | interactions to take more than 500ms. (Yes, those people say, we
       | have accepted that this means that 0.1% of people may have an
       | unbounded bad experience.) So, okay, fine, let's force a bucket
       | boundary at 500ms and then we'll always be measuring that SLO
       | with no error.
       | 
       | (5) I know that the test data I use as input don't always reflect
       | how the system will behave over time. For example I might feed my
       | bucket-designing algorithm yesterday's freshness data and that
       | might have been a day when our async data processing pipeline was
       | never more than 10 minutes backlogged. But in fact in the real
       | world every few months we get a >8 hour backlog and it turns out
       | we'd like to be able to accurately measure the p99 age of
       | processed messages even if they are very old... So despite our
       | very limited bucket budget we probably do want some buckets at 1,
       | 2, 4, 8, 16 hours, even if at design time they seem useless.
       | 
       | I have always ended up hand-writing my own error approximation
       | function which takes as input like
       | 
       | (1) sample data - a representative subset of the actual times
       | observed in my system yesterday
       | 
       | (2) proposed buckets - a bundle of, say, 15 bucket boundaries
       | 
       | (3) percentiles I care about
       | 
       | then returns as output info about how far off (%age error) each
       | estimated percentile is from the actual value for my sample data.
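        | 
        | Roughly like this (a hypothetical sketch; it reads the estimate
        | off as the upper bound of the bucket the exact value falls in,
        | which is one simple convention):
        | 
        |     import bisect
        | 
        |     def bucket_error(samples, bounds, percentiles):
        |         # assumes all sample values are > 0
        |         xs, bs = sorted(samples), sorted(bounds)
        |         out = {}
        |         for p in percentiles:
        |             i = min(len(xs) - 1, int(p / 100 * len(xs)))
        |             exact = xs[i]
        |             j = bisect.bisect_left(bs, exact)
        |             est = bs[j] if j < len(bs) else float("inf")
        |             out[p] = 100 * abs(est - exact) / exact
        |         return out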
       | 
        | Last time I looked at this I tried using libraries that purport
        | to compute very good bucket boundaries, but they give me, like,
        | 1500 buckets with very nice tiny error and no clear way to make a
        | real-world choice about collapsing this into a much smaller set
        | of buckets with comparatively huge, but manageable, error.
       | 
       | I ended up just advising people to
       | 
       | * set bucket boundaries at SLO boundaries, and be sure to update
       | when the SLO does
       | 
       | * actually look at your data and understand the data's shape
       | 
       | * minimize error for the data set you have now; logarithmic
       | bucket sizes with extra buckets near the distribution's current
       | median value seems to work well
       | 
       | * minimize worst-case error if the things you're measuring grow
       | very small or very large and you care about being able to observe
       | that (add extra buckets)
        
         | convolvatron wrote:
          | Look at Greenwald-Khanna (and the follow-on work). It adapts
          | the bucket size to minimize the total error (and the number of
          | buckets is proportional to log-epsilon, I think).
          | 
          | With all the competition and 'innovation', you would think the
          | data dogs of this world would grow up beyond time series.
        
         | slaymaker1907 wrote:
         | One thing you could try to use are variance optimal histograms.
         | These are histograms which set bucket boundaries such that the
         | weighted average variance in the buckets is minimized.
         | Unfortunately, this algorithm is quadratic with the number of
         | data points, but you could try approximating the optimum by
         | taking random subsets of the data, computing the histogram,
         | then seeing how well the histogram does on the whole dataset.
         | 
         | http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/pap...
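          | 
          | For reference, one way to write the quadratic DP (a rough
          | sketch; bucket cost is the within-bucket sum of squared
          | deviations, computed from prefix sums):
          | 
          |     def v_optimal_bounds(values, nb):
          |         xs = sorted(values)
          |         n = len(xs)
          |         s = [0.0] * (n + 1)
          |         s2 = [0.0] * (n + 1)
          |         for i, x in enumerate(xs):
          |             s[i + 1] = s[i] + x
          |             s2[i + 1] = s2[i] + x * x
          | 
          |         def sse(i, j):  # cost of one bucket xs[i:j]
          |             t = s[j] - s[i]
          |             return (s2[j] - s2[i]) - t * t / (j - i)
          | 
          |         INF = float("inf")
          |         cost = [[INF] * (n + 1) for _ in range(nb + 1)]
          |         cut = [[0] * (n + 1) for _ in range(nb + 1)]
          |         cost[0][0] = 0.0
          |         for k in range(1, nb + 1):
          |             for i in range(k, n + 1):
          |                 for j in range(k - 1, i):
          |                     c = cost[k - 1][j] + sse(j, i)
          |                     if c < cost[k][i]:
          |                         cost[k][i], cut[k][i] = c, j
          | 
          |         # walk back through the cuts to recover boundaries
          |         bounds, i = [], n
          |         for k in range(nb, 0, -1):
          |             j = cut[k][i]
          |             if j > 0:
          |                 bounds.append(xs[j])
          |             i = j
          |         return sorted(bounds)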
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | This is interesting and I totally get at least some of the
         | problems you're facing. I wonder if you could take some of the
         | strategies from t-digest and modify a bit to accomplish...I'd
         | be interested in seeing some sort of implementation of this and
         | would love to see if we can get it into our toolkit if you
         | do...or you can also open up a ticket for us and we'll see if
         | we can prioritize to work on something like it ourselves.
         | 
         | I do think there are some interesting ways you can cut corners
         | if you know things about the "SLO" or other sort of cutoff
         | values, but I'd have to think more deeply about it to say more.
         | Essentially we want a variable error rate based on the distance
         | away from a known value, where you have little error in the
         | values relatively close to the known value, care little if at
         | all for fine distinctions on the low end and, once you get past
         | some high end value you could probably bucket everything above
         | it into a "large outliers" bucket or something too...meaning
          | your p999 could get out of hand if you start getting lots of
         | stuff above a threshold, but, if that starts happening,
         | probably everything's already burning down, so might not be
         | that useful to know it's burning down in a very specific way...
        
           | wikibob wrote:
           | Are you following the work on sparse high resolution
           | histograms in both Prometheus and OpenTelemetry?
           | 
           | It's coming together and will be available soon.
           | 
           | What this means is that you won't have to compromise and pick
           | histogram bucket boundaries anymore. And each bucket will be
           | much narrower.
        
       | louisnow wrote:
       | ```To calculate the 10th percentile, let's say we have 10,000
       | values. We take all of the values, order them from largest to
       | smallest, and identify the 1001st value (where 1000 or 10% of the
       | values are below it), which will be our 10th percentile.```
       | 
       | Isn't this contradictory? If we order the values from largest to
       | smallest and take the 1001st value, then 10 % of the values are
       | above/larger and not below/smaller. I believe it should say order
       | from smallest to largest.
        
         | LoriP wrote:
         | Thanks, we corrected this quite quickly. Appreciated!
        
         | emgo wrote:
         | Yes, this looks like a typo in the article. It should be
         | smallest to largest.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Oops, yep, that should probably be order from smallest to
         | largest. Thanks for the correction!
        
           | djk447 wrote:
           | Fixed!
        
       | tiffanyh wrote:
       | Statistics 101. Mean, median and mode.
        
       | k__ wrote:
        | Shouldn't averages & variance be enough for a start?
        
       | deft wrote:
       | Looks like timescale did a big marketing push this morning only
       | for their whole service to go down minutes later. lol.
        
       | satvikpendem wrote:
        | I often see HN articles crop up soon after a related post - in
        | this case, this Ask HN poster [0] being driven crazy by people
        | averaging percentiles and not seeing that it's a big deal. It's
        | pretty funny to see such tuples of posts appearing.
       | 
       | https://news.ycombinator.com/item?id=28518795
        
       | michaelmdresser wrote:
       | Looking particularly at latency measurements, I found the "How
       | NOT to Measure Latency" [1] talk very illuminating. It goes quite
       | deep into discussing how percentiles can be used and abused for
       | measurement.
       | 
       | [1]: https://www.infoq.com/presentations/latency-response-time/
        
         | rattray wrote:
         | What are some of the ways percentiles can be abused?
        
           | alexanderdmitri wrote:
           | # Power Statement
           | 
           | !*Plowser*!* is in the 99th percentile of browsers by global
           | user count!
           | 
           | # Fact Sheet
           | 
           | - _!*Plowser*!_ is used by 2,050 people
           | 
           | - sample size of browsers is 10,000 (this includes toy apps
           | on GitHub and almost all browsers in the sample have no
           | record of active users)
           | 
           | - those using _!*Plowser*!_ have no choice as the developing
           | company _$*DendralRot Inc*$_ forces all its employees,
           | contractors and users of their enterprise  '? __shuiteware__?
           | ' (a name derived by mashing _|-- >shite|software|suite<--|_
           | into one word) to use their browser
           | 
           | - if we place the number of global browser users at a
           | conservative 1,000,000,000, _!*Plowser*!_ actually has
            | 0.000205% of users
        
         | lanstin wrote:
          | I watch this video once a year and send it to my co-workers
          | whenever averages or medians show up in a graph for public
          | consumption.
        
           | peheje wrote:
           | Are the points written in a readable format anywhere?
        
             | severine wrote:
             | http://highscalability.com/blog/2015/10/5/your-load-
             | generato...
             | 
             | Discussed previously:
             | https://news.ycombinator.com/item?id=10334335
        
       | varelaz wrote:
        | Percentiles are certainly better than the average if you want to
        | explore a distribution: there are several percentiles, compared
        | to a single average. However, the median is harder to use if you
        | want to do further calculations based on the metric. For example,
        | the distribution of the sample median can be a problem if you
        | want to derive a confidence interval for it.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
          | Totally can be true. In our case, we use approximation methods
          | that allow you to get multiple percentiles "for free", but you
          | definitely need to choose the right ones for the job. (We talk
          | a bit more about the whole approach, where we do the
          | aggregation and then the accessor, in the previous post on
          | two-step aggregation [1].) But there are definitely times when
          | averages/stddev and potentially the 3rd and 4th moments are
          | more useful.
         | 
         | [1]: https://blog.timescale.com/blog/how-postgresql-
         | aggregation-w...
        
       | motohagiography wrote:
        | The recovering product manager in me sees 90th percentile queries
        | with outlier high latency and, instead of asking how to reduce
        | it, starts asking how we can spin it out into a dedicated premium
        | query feature: if they're willing to wait, they're probably also
        | willing to pay.
       | 
       | Highly recommend modelling your solution using queueing theory
       | with this: https://queueing-tool.readthedocs.io/en/latest/
       | 
       | As an exercise in discovering the economics of how your platform
       | works, even just thinking about it in these terms can save a
       | great deal of time.
        
       | tunesmith wrote:
       | I just recently tried giving a presentation to my department
       | (they're developers, I'm architect) about this stuff and they all
       | just sort of blinked at me. It brought in Little's Law and
       | Kingman's Formula, in an attempt to underscore why we need to
       | limit variation in the response times of our requests.
       | 
       | There are a bunch of queuing theory formulas that are really cool
       | but don't exactly apply if your responses vary a lot like the
       | article describes. I think the assumption is that response time
       | distributions are exponential distributions, which I don't think
       | is a good assumption (is an Erlang distribution an exponential
       | distribution?) Nevertheless, hooking the equations up to some
       | models is a good way to get directional intuition. I didn't
       | realize how steep the performance drop-off is for server
       | utilization until I started moving sliders around.
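        | 
        | For anyone who wants that slider intuition without a notebook,
        | Kingman's approximation is easy to poke at directly (a rough
        | sketch; ca/cs are the coefficients of variation of inter-arrival
        | and service times):
        | 
        |     # G/G/1 mean wait in queue, Kingman's approximation:
        |     #   Wq ~ rho/(1-rho) * (ca^2 + cs^2)/2 * service_time
        |     def kingman_wait(rho, ca, cs, service_time):
        |         v = (ca ** 2 + cs ** 2) / 2
        |         return rho / (1 - rho) * v * service_time
        | 
        |     for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        |         # e.g. 10 ms mean service time, cv = 1 on both sides
        |         print(rho, kingman_wait(rho, 1.0, 1.0, 10.0))
        | 
        | With those toy numbers the wait goes from 10 ms at 50%
        | utilization to 90 ms at 90% and nearly a second at 99%, which is
        | the steep drop-off in action.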
       | 
       | Our ops team doesn't really follow this either. We're not a huge
       | department though - is this the kind of stuff that an SRE team
       | usually pays attention to?
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | I found it surprisingly difficult to explain well. Took a lot
         | of passes and a lot more words than I was expecting. It seems
         | like such a simple concept. I thought the post was gonna be the
         | shortest of my recent ones, and then after really explaining it
         | and getting lots of edits and rewriting, it was 7000 words and
         | ...whoops! But I guess it's what I needed to explain it well
         | (hope you thought so anyway).
         | 
          | It's somewhat exponential, but yeah, not necessarily; it's
          | definitely long-tailed in some way, and it sort of doesn't
          | matter what the theoretical description is (at least in my
          | mind). The point is that these types of distributions really
          | don't get described well by a lot of typical statistics.
         | 
         | Can't talk too much about SREs at the ad analytics company I
         | mentioned, we were the backend team that wrote a lot of the
         | backend stores / managed databases / ran the APIs and monitored
         | this stuff a bit (and probably not all that well). It was a bit
         | more ad hoc I guess, probably now that company is large enough
         | they have a dedicated team for that...
        
           | tunesmith wrote:
           | The article hit the pocket pretty exactly for how I felt I
           | needed to explain it. (Actually I had the same experience
           | where I thought I would be able to go over it quickly and
           | then I felt like I was getting mired.) The graphs look great
           | too - I've been able to explore it pretty well with
           | JupyterLab but I can't export that into something super-
           | interactive that is read-only. I've thought about creating an
           | idyll page out of it to help others explore but that's a bit
           | overkill for the work style our team has.
           | 
           | I _think_ the weirdly-shaped long-tail graphs we come across
           | are just sums of more naturally-distributed response times,
           | for different types of responses. Another reason to limit
           | variation I think.
        
             | benlivengood wrote:
             | > I think the weirdly-shaped long-tail graphs we come
             | across are just sums of more naturally-distributed response
             | times, for different types of responses. Another reason to
             | limit variation I think.
             | 
             | The best method I've found for digging into latency is
             | profiling RPC traces to see where time is spent. This at
             | least separates the code and parameters that have simple
             | latency distributions from those that don't. Some
             | distributions will be very data-dependent (SQL queries).
        
       | bqe wrote:
        | A while ago I wrote a Python library called LiveStats[1] that
       | computed any percentile for any amount of data using a fixed
       | amount of memory per percentile. It uses an algorithm I found in
       | an old paper[2] called P^2. It uses a polynomial to find good
       | approximations.
       | 
       | The reason I made this was an old Amazon interview question. The
       | question was basically, "Find the median of a huge data set
       | without sorting it," and the "correct" answer was to have a fixed
       | size sorted buffer and randomly evict items from it and then use
       | the median of the buffer. However, a candidate I was interviewing
       | had a really brilliant insight: if we estimate the median and
       | move it a small amount for each new data point, it would be
       | pretty close. I ended up doing some research on this and found
       | P^2, which is a more sophisticated version of that insight.
       | 
       | [1]: https://github.com/cxxr/LiveStats
       | 
       | [2]: https://www.cs.wustl.edu/~jain/papers/ftp/psqr.pdf
        
         | bo1024 wrote:
         | > if we estimate the median and move it a small amount for each
         | new data point, it would be pretty close.
         | 
         | Yeah, this is gradient descent on the absolute loss.
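          | 
          | A minimal sketch of that idea (fixed step size for
          | simplicity; in practice the step has to be tuned to the
          | data's scale, and P^2 is much smarter about it):
          | 
          |     def streaming_median(stream, step=0.1):
          |         est = None
          |         for x in stream:
          |             if est is None:
          |                 est = float(x)
          |             elif x > est:
          |                 est += step   # nudge up
          |             elif x < est:
          |                 est -= step   # nudge down
          |         return est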
        
         | [deleted]
        
         | [deleted]
        
         | Cd00d wrote:
         | Did the candidate get an offer? Genuinely curious.
         | 
         | I had a basic screening call fail once because the expected
         | answer was (in my perspective) more naive than my answer. I'd
         | love it if generating curiosity were an interview +1.
        
           | williamdclt wrote:
           | Whether they got it or not probably isn't useful information.
           | Having a good/brilliant answer probably isn't the only point
           | of the question, this probably wasn't the only question of
           | the interview, and this probably wasn't the only interview
        
         | [deleted]
        
         | cyral wrote:
         | There are some newer data structures that take this to the next
         | level such as T-Digest[1], which remains extremely accurate
         | even when determining percentiles at the very tail end (like
         | 99.999%)
         | 
         | [1]: https://arxiv.org/pdf/1902.04023.pdf /
         | https://github.com/tdunning/t-digest
        
           | fotta wrote:
            | Yep, I had to implement t-digest in a monitoring library.
            | Another alternative (although older) that the Prometheus
            | libraries use is CKMS quantiles [0].
           | 
           | [0] http://dimacs.rutgers.edu/~graham/pubs/papers/bquant-
           | icde.pd...
        
           | djk447 wrote:
           | NB: Post author here.
           | 
            | Yeah, that was one of the reasons we chose it as one of the
            | ones to implement; that seemed like a really interesting
            | tradeoff. We also used UDDSketch [1], which provides relative
            | error guarantees, which is pretty nifty. We thought they
            | provided different enough tradeoffs that we wanted to
            | implement both.
           | 
           | [1]: https://arxiv.org/abs/1908.10693
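            | 
            | For the curious, the core trick behind that relative-error
            | guarantee is logarithmic bucketing (a simplified sketch
            | that ignores UDDSketch's bucket collapsing and only
            | handles positive values):
            | 
            |     import math
            |     from collections import Counter
            | 
            |     class LogBucketSketch:
            |         def __init__(self, alpha=0.01):
            |             self.gamma = (1 + alpha) / (1 - alpha)
            |             self.counts = Counter()
            |             self.n = 0
            | 
            |         def add(self, x):  # x > 0
            |             g = self.gamma
            |             self.counts[math.ceil(math.log(x, g))] += 1
            |             self.n += 1
            | 
            |         def quantile(self, q):
            |             # bucket i covers (g**(i-1), g**i]; its
            |             # "midpoint" keeps relative error <= alpha
            |             g = self.gamma
            |             rank, seen = q * (self.n - 1), 0
            |             for i in sorted(self.counts):
            |                 seen += self.counts[i]
            |                 if seen > rank:
            |                     return 2 * g ** i / (1 + g)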
        
             | vvern wrote:
             | Is it using https://github.com/tvondra/tdigest under the
             | hood, or a separate implementation?
        
               | Lockerman wrote:
               | a separate implementation
        
               | mfreed wrote:
               | If folks are interested:
               | 
               | https://github.com/timescale/timescaledb-
               | toolkit/blob/main/e...
               | 
               | (The TimescaleDB Toolkit is also implemented in Rust)
        
               | rogers18445 wrote:
                | Facebook seems to have an even better-performing
               | implementation using sqrt. Might make sense to port that
               | over to Rust. https://github.com/facebook/folly/blob/mast
               | er/folly/stats/TD...
        
             | _0ffh wrote:
             | Hi, in an unrelated nitpick: The relative error should be
              | calculated by dividing the error by the true value, not by
              | its approximation. Still, a very nice writeup!
        
               | djk447 wrote:
                | Thanks! Messed up the formula but had it right in the
                | text :( Fixed now.
        
           | riskneutral wrote:
           | Also relevant: Single-Pass Online Statistics Algorithms
           | 
           | [1] http://www.numericalexpert.com/articles/single_pass_stat/
        
           | breuleux wrote:
           | That's pretty neat! Can these be used to efficiently compute
           | rolling percentiles (over windows of the data), or just
           | incremental?
        
             | [deleted]
        
             | WireBaron wrote:
             | The UDDSketch (default) implementation will allow rolling
             | percentiles, though we still need a bit of work on our end
             | to support it. There isn't a way to do this with TDigest
             | however.
        
               | jeffbee wrote:
               | Sure there is. You simply maintain N phases of digests,
               | and every T time you evict a phase and recompute the
               | summary (because T-digests are easily merged).
        
               | cyral wrote:
               | This is what I do, it's not a true rolling digest but it
               | works well enough for my purposes.
        
               | djk447 wrote:
                | I think this would be a tumbling window rather than a
                | true "rolling" tdigest. I suppose you could decrement
                | the buckets, but it gets a little weird since splits
                | can't really be unsplit. The tumbling window approach
                | would probably work, though TDigest is a little weird
                | on merges, as it's not completely deterministic with
                | respect to ordering and merging (UDDSketch is), so it's
                | _likely_ you'd get something more than good enough, but
                | it wouldn't be the same as if you just calculated it
                | directly, so it gets a little confusing and difficult.
               | 
               | (NB: Post author here).
        
           | convolvatron wrote:
            | I think the new ones started with Greenwald-Khanna, but I
            | definitely agree - P^2 can be a little silly and misleading.
            | In particular it is really poor at finding those little modes
            | on the tail that correspond to interesting system behaviours.
        
             | cyral wrote:
             | That sounds familiar, I remember reading about Greenwald-
             | Khanna before I found T-Digest after I ran into the "how to
             | find a percentile of a massive data set" problem myself.
        
           | jeremysalwen wrote:
           | Not an expert on this topic but I noticed that the KLL
           | algorithm (published in 2016) was not mentioned in this
           | thread, which provides theoretically optimal performance for
           | streaming quantiles with guaranteed worst case performance:
           | http://courses.csail.mit.edu/6.854/20/sample-
           | projects/B/stre... (And is pretty fast in practice).
        
             | djk447 wrote:
             | NB: Post author here.
             | 
             | Interesting will have to take a look! Thanks for sharing!
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         | Thanks for sharing! Hadn't heard of that algorithm, have seen a
         | number of other ones out there, we chose a couple that we knew
         | about / were requested by users. (And we are open to more user
         | requests if folks want to use other ones!
         | https://github.com/timescale/timescaledb-toolkit and open an
         | issue!)
        
         | eru wrote:
         | > The question was basically, "Find the median of a huge data
         | set without sorting it," [...]
         | 
         | You can use eg QuickSelect
         | (https://en.wikipedia.org/wiki/Quickselect) or Median of
         | Medians.
         | 
         | They don't sort the data, but they do need linear amount of
         | storage.
        
         | tantalor wrote:
         | > Find the median ... randomly evict items
         | 
         | So, not find, but approximate. That's a different thing.
        
           | rendaw wrote:
           | > without sorting it... have a fixed size sorted buffer
           | 
           | (that you sort yourself)
        
             | raxxorrax wrote:
             | That doesn't really make sense to me at all. Don't sort it,
             | just have it?
             | 
             | Is the storage restriction the point?
        
         | Izikiel43 wrote:
         | > The question was basically, "Find the median of a huge data
         | set without sorting it,"
         | 
          | Isn't this done using a min heap and a max heap in
          | conjunction?
        
           | 5faulker wrote:
            | It's funny that this is often left out of a data structures
            | & algorithms class.
        
             | lanstin wrote:
             | Because unlike many dynamic programming algorithms, it is
             | something anyone running a large system will need.
        
             | thaumasiotes wrote:
             | Is the minheap-maxheap approach faster than sorting the
             | data? The obvious approach (process each element, one by
             | one, into the appropriate heap, and rebalance the heaps so
             | they are of equal size) takes n log n time and linear
             | space. You can use the same resources to just produce a
             | sorted copy of the input, which is a much better thing to
             | have than two heaps that center on the median.
        
               | Izikiel43 wrote:
               | The minheap-maxheap approach is better for streaming
               | data, to get the median as data comes in.
               | 
               | I agree that if you have the whole thing, doing heapsort
               | and pulling a[N/2] or a[1 + N/2] is simpler.
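                | 
                | A minimal sketch of the two-heap approach (heapq
                | is a min-heap, so the lower half is negated):
                | 
                |     import heapq
                | 
                |     lo, hi = [], []  # lower half, upper half
                | 
                |     def add(x):
                |         heapq.heappush(lo, -x)
                |         heapq.heappush(hi, -heapq.heappop(lo))
                |         if len(hi) > len(lo):
                |             heapq.heappush(lo, -heapq.heappop(hi))
                | 
                |     def median():
                |         if len(lo) > len(hi):
                |             return -lo[0]
                |         return (-lo[0] + hi[0]) / 2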
        
               | thaumasiotes wrote:
               | > The minheap-maxheap approach is better for streaming
               | data, to get the median as data comes in.
               | 
               | I see that it's better if you need to know "what is the
               | median of the amount of the list that I've consumed so
               | far?"
               | 
               | But if what you want is the median of the whole list,
               | which might be in a random order, the medians of random
               | prefixes of the list don't seem especially relevant. And
               | if you _do_ have an indefinite amount of data coming in,
               | so that you need a  "well, this is what we've seen so
               | far" data point, the minheap-maxheap approach doesn't
               | seem very well suited since it requires you to remember
               | the entirety of the data stream so far.
               | 
               | My first instinct is to divide the possible data values
               | into buckets, and just count the number of datapoints
               | that fall into each bucket. This gives you a histogram
               | with arbitrary resolution. You won't know the median
               | value, but you will know which bucket contains the median
               | value, and your storage requirements depend only on the
               | number of buckets you want to use.
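                | 
                | A rough sketch of that bucket-counting idea
                | (fixed bucket width chosen up front):
                | 
                |     from collections import Counter
                | 
                |     def median_bucket(stream, width=5.0):
                |         counts, n = Counter(), 0
                |         for x in stream:
                |             counts[int(x // width)] += 1
                |             n += 1
                |         seen = 0
                |         for b in sorted(counts):
                |             seen += counts[b]
                |             if seen > n // 2:
                |                 return b * width, (b + 1) * width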
        
           | ilikebits wrote:
           | The real constraint here is probably "find the median of a
           | huge data set without holding the entire data set in memory".
        
             | robotresearcher wrote:
             | 'Estimate the median of an arbitrary sized data set using a
             | constant amount of memory'.
        
           | ZephyrBlu wrote:
           | How does this get you the median?
        
           | tantalor wrote:
           | No, for 2 reasons,
           | 
           | 1. huge data set: heaps requires storing the whole set, but
           | "huge" means "more than you can store"
           | 
           | 2. without sorting it: heaps are basically semi-sorted, so
           | you are breaking the rules
        
             | xyzzyz wrote:
             | You can actually construct the heap from unsorted data in
             | O(n) time, so constructing the heap is definitely not
             | sorting. However, yeah, to actually use the heap to find
             | median in O(n) time, you need to do something similar to
             | magic-five (median of medians) algorithm.
        
               | eru wrote:
               | Something like QuickSelect is probably better in practice
               | than median-of-medians.
        
             | thaumasiotes wrote:
             | > but "huge" means "more than you can store"
             | 
             | Really? Where's it coming from?
        
               | namrog84 wrote:
               | I think they mean having more than you can store
               | simultaneously on a single device.
               | 
                | With a few exceptions this is a pretty common scenario.
        
               | Dylan16807 wrote:
               | More than _you_ can store.
               | 
               | And possibly it's live data.
        
           | hnfong wrote:
           | The two heap method helps you maintain the current median
           | given an incoming stream of "online" data, and technically is
           | not "sorting".
           | 
           | The original problem is probably more accurately described as
           | "what if you have too much data to sort" though.
        
       | abnry wrote:
       | One way to think about why we tend to use averages instead of
       | medians is that it is related to a really deep theorem in
       | probability: The Central Limit Theorem.
       | 
        | But I think, if we twist our heads, we can see that this is
        | backwards. Mathematically, the mean is much easier to work with
       | because it is linear and we can do algebra with it. That's how we
       | got the Central Limit Theorem. Percentiles and the median, except
       | for symmetric distributions, are not as easy to work with. They
       | involve solving for the inverse of the cumulative function.
       | 
       | But in many ways, the median and percentiles are a more relevant
       | and intuitive number to think about. Especially in contexts where
       | linearity is inappropriate!
        
         | a-dub wrote:
         | i think of it as: if the data is gaussian, use a mean,
         | otherwise go non-parametric (medians/percentiles).
         | 
         | or put another way, if you can't model it, you're going to have
         | to sort, or estimate a sort, because that's all that's really
         | left to do.
         | 
         | this shows up in things from estimating centers with
         | means/percentiles to doing statistical tests with things like
         | the wilcoxon tests.
        
           | lanstin wrote:
            | Assume up front that none of your measured latencies from a
            | networked software system will be Gaussian, or
            | <exaggeration> you will die a painful death </exaggeration>.
           | Even ping times over the internet have no mean. The only good
           | thing about means is you can combine them easily, but since
           | they are probably a mathematical fiction, combining them is
           | even worse. Use T-Digest or one of the other algorithms being
           | highlighted here.
        
             | mhh__ wrote:
              | This is why I try to plot a proper graph of times for any
              | "optimization" I see in a PR. Too many times I see people
              | making this assumption, for example, and even if they're
              | right they usually forget to take the width of the Gaussian
              | into account (i.e. wow, your speedup is 5% of a standard
              | deviation!)
        
             | a-dub wrote:
             | yep, have made that mistake before. even turned in a write-
             | up for a measurement project in a graduate level systems
             | course that reported network performance dependent
             | measurements with means over trials with error bars from
             | standard deviations.
             | 
             | sadly, the instructor just gave it an A and moved on. (that
             | said, the amount of work that went into a single semester
             | project was a bit herculean, even if i do say so myself)
        
         | nancarrow wrote:
         | It's more related to the law of large numbers than the CLT
        
       | bluesmoon wrote:
       | This is exactly the algorithm we developed at LogNormal (now part
       | of Akamai) 10 years ago for doing fast, low-memory percentiles on
       | large datasets.
       | 
       | It's implemented in this Node library:
       | https://github.com/bluesmoon/node-faststats
       | 
       | Side note: I wish everyone would stop using the term Average to
       | refer to the Arithmetic mean. "Average" just means some statistic
       | used to summarize a dataset. It could be the Arithmetic Mean,
       | Median, Mode(s), Geometric Mean, Harmonic Mean, or any of a bunch
       | of other statistics. We're stuck with AVG because that's the
       | function used by early databases and Lotus 123.
        
         | JackFr wrote:
          | No, we're stuck with it because "average" has been used
          | colloquially for the arithmetic mean for decades.
         | 
         | I wish people would stop bad-mouthing the arithmetic mean. If
         | you have to convey information about a distribution and you've
         | got only one number to do it, the arithmetic mean is for you.
        
           | bluesmoon wrote:
           | Yes, Lotus 123 came out 38 years ago :)
        
           | jrochkind1 wrote:
           | I think it still depends on the nature of the data and your
           | questions about it, and median is often the better choice if
           | you have to pick only one.
        
             | hnfong wrote:
             | "It depends" is always right but which function is better
             | for arbitrary, unknown data?
             | 
              | At least the arithmetic mean is fine for Gaussian
              | distributions, and conveys a sense of the data even on
              | non-Gaussian ones, but the median doesn't even work at all
              | on some common distributions, like scores of very difficult
              | exams (where median=0).
             | 
             | For the mean, at least every data point contributes to the
             | final value.
             | 
             | Just my 2c
        
       | hnuser123456 wrote:
       | Gamers have an intuitive sense of this. Your average framerate
       | can be arbitrarily high, but if you have a big stutter every
       | second between the smooth moments, then a lower but more
        | consistent framerate may be preferable. This is typically
        | expressed as the 1% and 0.1% slowest frames, which, at a
        | relatively typical 100 fps, represent the slowest frame every
        | second and every 10 seconds respectively.
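        | 
        | For reference, benchmarks usually compute those figures from
        | per-frame times rather than the fps counter (a rough sketch;
        | some tools average the slowest 1% instead of taking a single
        | percentile frame):
        | 
        |     def one_percent_low_fps(frame_times_ms):
        |         # 99th-percentile frame time, reported as fps
        |         xs = sorted(frame_times_ms)
        |         i = min(len(xs) - 1, int(0.99 * len(xs)))
        |         return 1000.0 / xs[i]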
        
         | slaymaker1907 wrote:
         | Yep, I often find it useful to limit framerate to a number I
         | know my GPU can handle so that I don't get as much stuttering.
         | It's better to run at 60 FPS all the time than 80 FPS most of
         | the time but with stuttering.
        
         | djk447 wrote:
         | NB: Post author here.
         | 
         |  _Love_ this example. Might have to use that in a future post.
         | Feel like a lot of us are running into a similar thing with
         | remote work and video calls these days...
        
         | tzs wrote:
          | I have no hope of finding a cite for this, but a long time ago
          | I read some command-line UI research which found that if you
          | had a system where commands ranged from instant to taking a
          | small but noticeable time, and you introduced delays in the
          | faster commands so that all commands took the same small but
          | noticeable time, people would think that the system was now
          | faster overall.
        
           | wruza wrote:
           | I guess that's because our minds (and animal minds as well)
           | are always aware of the pace of repetitive events. If
           | something is off, the anxiety alarm rings. One old book on
           | the brain machinery described an example of a cat that was
           | relaxing near the metronome and became alert when it was
           | suddenly stopped. Unpredictable delays are disturbing,
           | because a mispredicted event means you may be in a dangerous
           | situation and have to recalibrate _now_.
        
             | im3w1l wrote:
             | I think the explanation may be even more low level than
             | that. Iirc, even with a single neuron (or maybe if it was
             | very small clusters, sorry recollection is a bit hazy) you
             | can see that it learns to tune out a repetitive signal.
        
           | barrenko wrote:
            | Presumably also the reason we have those fake queue lines at
            | airports (not sure what the correct word for it is).
        
         | [deleted]
        
         | elmalto wrote:
         | Similarly, I would rather have no wifi than bad/spotty wifi for
         | this reason
        
         | account42 wrote:
         | For games, one compounding factor is that you need to estimate
         | how long the next frame will take in order to know how much
         | time to simulate. If the variance is greater, the prediction
         | will be worse leading the perceived game "speed" to vary from
         | frame to frame.
        
         | kazinator wrote:
         | Gamers do not have a more intuitive sense for this than movie
         | watchers, video/voice talkers, not to mention users who type
         | text into bloated web browsers or lagged remote login sessions.
         | 
            | I would rather have a rock-steady, consistent 500 ms response
            | time when typing text than a 100 ms average response time
            | which randomly spikes to outliers that go past one second.
         | 
         | A rock-steady, though poor, event rate in a paint application
         | is better for drawing a freehand curve (especially if the
         | program interpolates well) than a really fast rate that
         | suddenly has a glitch in it, spoiling your work.
        
           | spiderice wrote:
           | > Gamers do not have a more intuitive sense for this than
           | movie watchers, video/voice talkers, not to mention users who
           | type text into bloated web browsers or lagged remote login
           | sessions
           | 
           | I would be shocked if gamers didn't on average have a more
           | intuitive sense of this than any of the groups you mentioned.
        
             | 9935c101ab17a66 wrote:
              | Yeah, I don't think anyone's arguing that ONLY gamers will
              | observe the phenomenon, but I would be shocked if they
              | weren't more sensitive to it than most groups.
        
               | kazinator wrote:
               | "more sensitive" != "more intuitive"
        
           | hnuser123456 wrote:
           | Maybe not all, but ones who create and consume analysis like
           | this
           | 
           | https://www.techpowerup.com/review/msi-geforce-
           | rtx-3090-gami...
        
         | exolymph wrote:
         | "Slow is smooth and smooth is fast"
        
       | rahimnathwani wrote:
       | For some things, you can't even sensibly measure the mean. For
       | example, if you're measuring the mean response time for a
       | service, a single failure/timeout makes the mean response time
       | infinite (because 100 years from now the response still hasn't
       | been received).
       | 
       | "Why Averages Suck and Percentiles are Great":
       | https://www.dynatrace.com/news/blog/why-averages-suck-and-pe...
        
         | djk447 wrote:
          | Totally. That blog was also one of the sources I mentioned in
          | the post! Good stuff.
         | 
         | NB: Post author here.
        
         | contravariant wrote:
          | It's indeed always worth pointing out that a mean may or may
          | not exist. Same with variances/standard deviations.
         | 
         | The central limit theorem seems to have given people the
         | slightly wrong idea that things will _always_ average out
         | eventually.
        
           | maxnoe wrote:
           | Coughs Cauchy
        
       | ruchin_k wrote:
       | Spent several years in venture capital investing and averages
       | were always misleading - as Nassim Taleb says "Never cross a
       | river that is on average 4 feet deep"
        
         | robbrown451 wrote:
          | Median/50th percentile isn't a whole lot better in that case.
        
       | fbinky wrote:
       | Thanks for the refresher! I am using the timescaledb-toolkit with
       | much success. LTTB, for example. Excellent.
        
       | ekianjo wrote:
        | Is there anyone who still uses averages?
        
       | datavirtue wrote:
        | Can someone boil this down to a sentence? I think I understand
        | what they are saying: having been faced with using these metrics,
        | and having been tempted to use average response times, I
        | recognized that the average is not a good baseline since it moves
        | in relation to the outliers (which are usually the problem
        | requests, and there can be many or few).
        | 
        | How do you calculate these percentiles and use them as a trigger
        | for alerts?
        
       | aidenn0 wrote:
       | Be careful translating percentiles of requests to percentiles of
       | users; if less than 10% of your requests take over 1 second, but
        | a typical user makes 10 requests, it's possible that the
        | majority of your users are seeing a request take over 1 second.
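        | 
        | Back-of-the-envelope: if exactly 10% of requests were slow and
        | slow requests landed independently, a 10-request session would
        | see at least one with probability 1 - 0.9^10, or about 65%.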
        
         | djk447 wrote:
         | NB: Post author here.
         | 
          | Yep! Briefly noted that in the post, but it deserves
          | re-stating! It's definitely a more complex analysis to figure
          | out the percentage of users affected (though often more
          | important). It could be a majority, or it could be one user who
          | has some data scientist programmatically making hundreds of
          | long API calls for some task...(can you tell that I ran into
          | that? Even worse, it was one of our own data scientists ;) ).
        
       ___________________________________________________________________
       (page generated 2021-09-15 23:02 UTC)