[HN Gopher] Latency Sneaks Up on You
       ___________________________________________________________________
        
       Latency Sneaks Up on You
        
       Author : luord
       Score  : 114 points
       Date   : 2021-08-27 12:27 UTC (10 hours ago)
        
 (HTM) web link (brooker.co.za)
 (TXT) w3m dump (brooker.co.za)
        
       | azundo wrote:
       | This principle applies as much to the work we schedule for
       | ourselves (or our teams) as it does to our servers.
       | 
        | As teams get pushed to utilize scarce and expensive developer
        | resources to their max, they can also end up with huge latency
        | issues for unanticipated requests. It's not always easy to
        | justify keeping planned work well under a team's capacity,
        | though, even if it leads to better overall outcomes at the end
        | of the day.
        
         | mjb wrote:
         | Yes, for sure.
         | 
         | As another abstract example that's completely disconnected from
         | the real world: if we're running the world's capacity to make
          | N95 masks at high utilization, it may take a while to be able
         | to handle a sudden spike in demand.
        
       | shitlord wrote:
       | If you're interested, there's a whole branch of mathematics that
       | models these sorts of phenomena:
       | https://en.wikipedia.org/wiki/Queueing_theory
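        | 
        | For a taste of what that theory predicts, here's a minimal
        | Python sketch of the classic M/M/1 result (Poisson arrivals,
        | one exponential server, made-up rates): mean time in system is
        | W = 1/(mu - lambda), which blows up as utilization nears 1.
        | 
        |     mu = 100.0  # service rate: requests/second one server handles
        | 
        |     # Mean time in system for an M/M/1 queue: W = 1 / (mu - lam),
        |     # where utilization rho = lam / mu.
        |     for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        |         lam = rho * mu
        |         w = 1.0 / (mu - lam)
        |         print(f"utilization {rho:.2f} -> mean latency {w * 1000:6.1f} ms")
        | 
        | Going from 50% to 99% utilization takes the mean from 20 ms to
        | a full second.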
        
       | the_sleaze9 wrote:
       | Seems like a trivially simple article, and I remain unconvinced
       | of the conclusion. I think this is a confident beginner giving
       | holistic, overly prescriptive advice. That is to say: feel free
       | to skip and ignore.
       | 
        | In my experience, if you want monitoring (or measuring for
        | performance) to provide any value whatsoever, you must measure
        | multiple different aspects of the system all at once:
        | percentiles, averages, load, responses, I/O, memory, etc.
       | 
       | The only time you would need a single metric would possibly be
       | for alerting, and a good alert (IMHO) is one that triggers for
       | impending doom, which the article states percentiles are good
        | for. But I think alerts are outside the scope of this article.
       | 
        | TL;DR review of the article: `Duh`
        
         | MatteoFrigo wrote:
          | Your characterization of Marc Brooker as a "confident beginner"
         | is incorrect. The guy is a senior principal engineer at AWS,
         | was the leader of EBS when I interacted with him, and has built
         | more systems than I care to mention. The phenomenon he is
         | describing is totally real. Of course the article is a
         | simplification that attempts to isolate the essence of a
         | terrifyingly complex problem.
        
           | LambdaComplex wrote:
           | And it even says as much in the sidebar on that page:
           | 
           | > I'm currently an engineer at Amazon Web Services (AWS) in
           | Seattle, where I lead engineering on AWS Lambda and our other
           | serverless products. Before that, I worked on EC2 and EBS.
        
         | christogreeff wrote:
         | https://brooker.co.za/blog/publications.html
        
       | wpietri wrote:
       | Mostly agreed, and I think the point about efficiency working
        | against latency is both important and widely ignored. And not
        | just in software, but in software process.
       | 
       | There's a great book called Principles of Product Development
       | Flow. It carefully looks at the systems behind how things get
       | built. Key to any good feedback loop is low latency. So if we
       | want our software to get better for users over time, low
       | latencies from idea to release are vital. But most software
       | processes are tuned for keeping developers 100% busy (or more!),
       | which drastically increases system latency. That latency means we
       | get a gain in efficiency (as measured by how busy developers are)
       | but a loss in how effective the system is (as determined by
       | creation of user and business value).
        
       | ksec wrote:
        | OK. I am stupid. I don't understand the article.
       | 
       | >> If you must use latency to measure efficiency, use mean (avg)
       | latency. Yes, average latency
       | 
        | What is wrong with measuring latency at the 99.99th percentile,
        | with a clear guideline that optimising efficiency (in this
        | article, higher utilisation) should not trade off latency?
       | 
       | Because latency is part of user experience. And UX comes first
       | before anything else.
       | 
        | Or does it imply that there are a lot of people who don't know
        | the trade-off between latency and utilisation? Because I don't
        | know anyone who runs utilisation at 1, or even 0.5, in
        | production.
        
         | srg0 wrote:
          | Percentiles are order statistics: they are robust and not
          | sensitive to outliers. That is why they are sometimes very
          | useful, and also why they do not capture how big the
          | remaining 0.01% of the data is.
          | 
          | Let's take the median, which is also an order statistic, and
          | a sequence of latency measurements: 0.005 s, 0.010 s, 3600 s.
          | The median latency is 0.010 s, and this number does not tell
          | you how bad latency can actually be. The mean latency is
          | 1200.005 s, which is far more indicative of how bad the
          | worst case is.
          | 
          | In other words, percentiles show how often a problem happens
          | (or does not happen). Mean values show the impact of the
          | problem.
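          | 
          | A quick sketch of the same three measurements in Python, for
          | anyone who wants to play with it:
          | 
          |     import statistics
          | 
          |     lat = [0.005, 0.010, 3600.0]   # seconds, as above
          |     print(statistics.median(lat))  # 0.01 -- hides the outlier
          |     print(statistics.mean(lat))    # 1200.005 -- shows its impact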
        
         | mjb wrote:
         | The other two answers you got are good. I will say that
         | monitoring p99 (or 99.9 or whatever) is a good thing,
         | especially if you're building human-interactive stuff. Here's
         | my colleague Andrew Certain talking about how Amazon came to
         | that conclusion: https://youtu.be/sKRdemSirDM?t=180
         | 
         | But p99 is just one summary statistic. Most importantly, it's a
         | robust statistic that rejects outliers. That's a very good
         | thing in some cases. It's also a very bad thing if you care
         | about throughput, because throughput is proportional to
         | 1/latency, and if you reject the outliers then you'll
         | overestimate throughput substantially.
         | 
         | p99 is one tool. A great and useful one, but not for every
         | purpose.
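          | 
          | Here's a toy illustration of that overestimate, with made-up
          | numbers: a single serial worker (so throughput is simply
          | 1/mean-latency) where 1% of requests hit a slow path.
          | 
          |     import random
          |     random.seed(0)
          | 
          |     # 1% of requests hit a 5 s slow path; the rest take 10 ms.
          |     lat = [5.0 if random.random() < 0.01 else 0.01
          |            for _ in range(100_000)]
          | 
          |     true_tput = len(lat) / sum(lat)               # one serial worker
          |     trimmed = sorted(lat)[:int(len(lat) * 0.99)]  # reject worst 1%
          |     naive_tput = len(trimmed) / sum(trimmed)
          | 
          |     print(f"actual:            {true_tput:6.1f} req/s")  # ~16.7
          |     print(f"outliers rejected: {naive_tput:6.1f} req/s") # ~100.0
          | 
          | Rejecting the worst 1% makes the system look roughly six
          | times faster than it actually is.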
         | 
          | > Because I don't know anyone who runs utilisation at 1, or
          | even 0.5, in production.
         | 
         | Many real systems like to run much hotter than that. High
         | utilization reduces costs, and reduces carbon footprint. Just
         | running at low utilization is a reasonable solution for a lot
         | of people in a lot of cases, but as margins get tighter and
         | businesses get bigger, pushing on utilization can be really
         | worthwhile.
        
           | bostik wrote:
            | In my previous job, I and the latency-sensitive engineering
            | teams in general mostly went with just four core latency
            | measurements:[ss]
            | 
            |     - p50, to see the baseline
            |     - p95, to see the most common latency peaks
            |     - p99, to see what the "normal" waiting times under
            |       load were
            |     - max, because that's what the most unfortunate
            |       customers experienced
           | 
            | In a normal distributed system the spread between p99 and
            | max can be enormous, but the mindset of ensuring a smooth
            | customer experience, with the awareness that a real person
            | had to wait that long, is exceptionally useful. You need
            | just _one_ slightly slower service for the worst-case
            | latency to skyrocket. GraphQL in particular is
            | exceptionally bad at this without real discipline: the
            | minimum request latency is dictated by the SLOWEST
            | downstream service.
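            | 
            | A back-of-envelope sketch of that fan-out effect, assuming
            | independent downstream latencies:
            | 
            |     # A request that fans out to n parallel calls is only as
            |     # fast as the slowest one, so the chance it sees at least
            |     # one worst-1% downstream latency is 1 - 0.99**n.
            |     for n in (1, 5, 10, 50, 100):
            |         print(f"{n:3d} parallel calls -> "
            |               f"{1 - 0.99 ** n:5.1%} of requests")
            | 
            | With 100 parallel calls, almost two thirds of requests see
            | at least one dependency's p99.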
           | 
            | To be fair, it was a real-time gambling operation. And we
           | _were_ operating within the first Nielsen threshold.
           | 
           | ss: bucketing these by request route was quite useful.
           | 
           | EDIT: formatting
        
         | dastbe wrote:
          | Every statistic is a summary and a lie. p50/p99 metrics are
          | good in the sense that they tell you a number someone
          | actually experienced, and they put an upper bound on that
          | experience. They are bad because they won't tell you how the
          | experience below that bound looks.
          | 
          | The mean and all its variants won't show you a number that
          | someone in your system necessarily experienced, but they
          | incorporate the entire distribution and show when it has
          | changed.
          | 
          | In the context of efficiency, the mean is beneficial because
          | it can be used to measure concurrency in a system via
          | Little's law, and it will signal changes in your concurrency
          | that a percentile metric won't necessarily catch.
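          | 
          | A minimal sketch of that Little's law point, with made-up
          | numbers:
          | 
          |     # Little's law: mean concurrency L = arrival rate * mean
          |     # latency W. Only the mean plugs into this identity.
          |     arrival_rate = 500.0  # requests/second (assumed)
          |     mean_latency = 0.080  # seconds (assumed)
          |     print(f"requests in flight: {arrival_rate * mean_latency:.0f}")
          |     # -> 40; if this creeps up, concurrency is growing.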
        
       | dwohnitmok wrote:
       | I think the article is missing one big reason why we care about
       | 99.99% or 99.9% latency metrics and that is that we can have high
       | latency spikes even with low utilization.
       | 
       | The majority of computer systems do not deal with high
       | utilization. As has been pointed out many times, computers are
        | really fast these days, and many businesses may be able to get
        | through their entire lifetime on a single machine if the
        | underlying software makes efficient use of the hardware
        | resources. And yet even at low utilization we still see
        | occasional high latency, often enough to frustrate users. Why
        | is that? Because a lot of software these
       | days is based on a design that intersperses low-latency
       | operations with occasional high-latency ones. This shows up
       | everywhere: garbage collection, disk and memory fragmentation,
       | growable arrays, eventual consistency, soft deletions followed by
       | actual hard deletions, etc.
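        | 
        | A little simulation of that bimodal shape, with made-up
        | numbers (a ~2 ms fast path plus a rare 250 ms stop-the-world
        | pause; utilization isn't modeled at all):
        | 
        |     import random, statistics
        |     random.seed(1)
        | 
        |     # 99.8% fast path, 0.2% pause; no queueing involved.
        |     lat = sorted(random.gauss(0.002, 0.0002)
        |                  if random.random() < 0.998 else 0.25
        |                  for _ in range(1_000_000))
        | 
        |     print(f"median: {lat[len(lat) // 2] * 1000:7.2f} ms")
        |     print(f"mean:   {statistics.fmean(lat) * 1000:7.2f} ms")
        |     print(f"p99.9:  {lat[int(len(lat) * 0.999)] * 1000:7.2f} ms")
        | 
        | The median and mean sit at a few milliseconds; only the high
        | percentile (~250 ms) reveals the pauses.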
       | 
       | What this article is advocating for is essentially an amortized
       | analysis of throughput and latency, in which case you do have a
       | nice and steady relationship between utilization and latency. But
       | in a system which may never come close to full utilization of its
       | underlying hardware resources (which is a large fraction of
       | software running on modern hardware), this amortized analysis is
       | not very valuable because even with very low utilization we can
        | still have very different latency distributions due to the
        | aforementioned software design choices and whatever tweaks you
        | make to them.
       | 
       | This is why many software systems don't care about the median
       | latency or the average latency, but care about the 99 or 99.9
       | percentile latency: there is a utilization-independent component
       | to the statistical distribution of your latency over time and for
       | those many software systems which have low utilization of
       | hardware resources that is the main determinant of your overall
       | latency profile, not utilization.
        
         | MatteoFrigo wrote:
         | Even worse, the effects that you mention (garbage collection,
         | etc.) are morally equivalent to an increase in utilization,
         | which pushes you towards the latency singularity that the
         | article is talking about.
         | 
          | As an oversimplified example, suppose that your system is 10%
          | utilized and that $BAD_THING (GC, or whatever) happens and
          | temporarily slows the system down by a factor of 10. Your
          | latency does not merely go up by 10x: it grows without
          | bound, because now your effective utilization is 100%.
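          | 
          | Plugging that into an M/M/1 back-of-envelope (made-up rates,
          | and only meaningful while $BAD_THING lasts):
          | 
          |     mu, lam = 100.0, 10.0  # service/arrival rates; rho = 0.10
          |     for slowdown in (1, 2, 5, 9, 10):
          |         eff_mu = mu / slowdown  # $BAD_THING cuts the service rate
          |         w = 1.0 / (eff_mu - lam) if eff_mu > lam else float("inf")
          |         print(f"{slowdown:2d}x slowdown -> rho {lam / eff_mu:.2f}, "
          |               f"mean latency {w:.3f} s")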
        
       | dvh wrote:
       | Grace Hopper explaining 1 nanosecond:
       | https://youtu.be/9eyFDBPk4Yw
        
       | MatteoFrigo wrote:
       | Great article, and a line of reasoning that ought to be more
       | widely known. There is a similar tradeoff between latency and
       | utilization in hash tables, for essentially the same reason.
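        | 
        | (For the curious: under the standard uniform-hashing
        | assumption, an unsuccessful lookup in an open-addressing table
        | at load factor alpha takes about 1/(1 - alpha) probes, the
        | same shape of curve as the queueing result.)
        | 
        |     # Expected probes for an unsuccessful open-addressing lookup,
        |     # under the uniform-hashing assumption: ~1 / (1 - alpha).
        |     for alpha in (0.5, 0.8, 0.9, 0.95, 0.99):
        |         print(f"load factor {alpha:.2f} -> "
        |               f"~{1 / (1 - alpha):5.0f} probes")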
       | 
       | The phenomenon described by the author can lead to interesting
       | social dynamics over time. The initial designer of a system
       | understands the latency/utilization tradeoff and dimensions the
       | system to be underutilized so as to meet latency goals. Then the
       | system is launched and successful, so people start questioning
       | the low utilization, and apply pressure to increase utilization
       | in order to reduce costs. Invariably latency goes up, customers
       | complain. Customers escalate, and projects are started to reduce
        | latency. People screw around at the margins, changing the
        | number of threads etc., but the fundamental tradeoff cannot be
        | avoided.
       | Nobody is happy in the end. (Been through this cycle a few times
       | already.)
        
         | dharmab wrote:
          | In my org, we define latency targets early (based on our
          | users' needs where possible) and then our goal is to maximize
         | utilization within those constraints.
        
       | jrochkind1 wrote:
        | Mostly a reminder/clarification of things I knew, but a good
        | and welcome one, well stated, because I probably sometimes
        | forget. (I don't do performance work a lot.)
       | 
       | But this:
       | 
       | > If you must use latency to measure efficiency, use mean (avg)
       | latency. Yes, average latency
       | 
       | Not sure if I ever thought about it before, but after following
       | the link[1] where OP talks more about it, they've convinced me.
       | Definitely want mean latency at least in addition to median, not
       | median alone.
       | 
       | [1]: https://brooker.co.za/blog/2017/12/28/mean.html
        
         | zekrioca wrote:
         | > If we're expecting 10 requests per second at peak this
         | holiday season, we're good.
         | 
         | Problem is, sometimes system engineers do not know what to
         | expect, but they still need to have a plan for this case.
        
           | [deleted]
        
         | marcosdumay wrote:
         | > at least in addition to median
         | 
          | There was an interesting article here not long ago that made
          | the point that the median is basically useless. If you load 5
          | resources on a page load, the odds that all of them are
          | faster than the median (so that it represents the user
          | experience) are about 3%. You need a very high percentile to
          | get any useful information, probably one with a number of 9s.
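          | 
          | The arithmetic, for anyone who wants to check it:
          | 
          |     n = 5  # resources fetched per page load
          |     print(f"all beat the median: {0.5 ** n:.1%}")  # 3.1%
          |     # For 95% of page loads to have every resource beat the
          |     # mark, each one needs roughly its p99:
          |     print(f"percentile needed: {0.95 ** (1 / n):.4f}")  # 0.9898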
        
           | jrochkind1 wrote:
           | Median for a particular action/page might be more useful.
        
       ___________________________________________________________________
       (page generated 2021-08-27 23:01 UTC)