[HN Gopher] Latency Sneaks Up on You
___________________________________________________________________
Latency Sneaks Up on You
Author : luord
Score : 114 points
Date : 2021-08-27 12:27 UTC (10 hours ago)
(HTM) web link (brooker.co.za)
(TXT) w3m dump (brooker.co.za)
| azundo wrote:
| This principle applies as much to the work we schedule for
| ourselves (or our teams) as it does to our servers.
|
| As teams get pushed to utilize scarce and expensive developer
| resources to the max, they can also end up with huge latency
| issues for unanticipated requests. It's not always easy to
| justify why planned work is way under a team's capacity,
| though, even if it leads to better overall outcomes at the end
| of the day.
| mjb wrote:
| Yes, for sure.
|
| As another abstract example that's completely disconnected from
| the real world: if we're running the world's capacity to make
| N95 masks at high utilization, it may take a while to handle a
| sudden spike in demand.
| shitlord wrote:
| If you're interested, there's a whole branch of mathematics that
| models these sorts of phenomena:
| https://en.wikipedia.org/wiki/Queueing_theory
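|
| The single-server M/M/1 case already shows the shape the
| article describes: mean time in system is service_time /
| (1 - utilization), so latency creeps up and then explodes as
| utilization approaches 1. A quick sketch (the 10 ms service
| time is arbitrary):
|
|     # M/M/1: W = 1 / (mu - lambda) = service_time / (1 - rho)
|     service_time = 0.010  # 10 ms per request, arbitrary
|
|     for utilization in (0.5, 0.8, 0.9, 0.95, 0.99):
|         w = service_time / (1 - utilization)
|         print(utilization, w)
|
|     # 0.5 -> 20 ms, 0.9 -> 100 ms, 0.99 -> 1 s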
| the_sleaze9 wrote:
| Seems like a trivially simple article, and I remain unconvinced
| of the conclusion. I think this is a confident beginner giving
| holistic, overly prescriptive advice. That is to say: feel free
| to skip and ignore.
|
| In my experience, if you want monitoring (or measuring for
| performance) to provide any value whatsoever, you must measure
| multiple different aspects of the system all at once:
| percentiles, averages, load, responses, I/O, memory, etc.
|
| The only time you would need a single metric would possibly be
| for alerting, and a good alert (IMHO) is one that triggers for
| impending doom, which the article states percentiles are good
| for. But I think alerts are outside of the scope of this article.
|
| TL;DR review of the article: `Duh`
| MatteoFrigo wrote:
| Your characterization of Marc Brooker as a "confident beginner"
| is incorrect. The guy is a senior principal engineer at AWS,
| was the leader of EBS when I interacted with him, and has built
| more systems than I care to mention. The phenomenon he is
| describing is totally real. Of course the article is a
| simplification that attempts to isolate the essence of a
| terrifyingly complex problem.
| LambdaComplex wrote:
| And it even says as much in the sidebar on that page:
|
| > I'm currently an engineer at Amazon Web Services (AWS) in
| Seattle, where I lead engineering on AWS Lambda and our other
| serverless products. Before that, I worked on EC2 and EBS.
| christogreeff wrote:
| https://brooker.co.za/blog/publications.html
| wpietri wrote:
| Mostly agreed, and I think the point about efficiency working
| against latency is both important and widely ignored. And not
| just in software, but in software process.
|
| There's a great book called Principles of Product Development
| Flow. It carefully looks at the systems behind how things get
| built. Key to any good feedback loop is low latency. So if we
| want our software to get better for users over time, low
| latencies from idea to release are vital. But most software
| processes are tuned for keeping developers 100% busy (or more!),
| which drastically increases system latency. The trade is a gain
| in efficiency (as measured by how busy developers are) for a
| loss in effectiveness (as measured by the creation of user and
| business value).
| ksec wrote:
| OK. I am stupid. I don't understand the article.
|
| >> If you must use latency to measure efficiency, use mean (avg)
| latency. Yes, average latency
|
| What is wrong with measuring latency at the 99.99th percentile
| with a clear guideline that optimising efficiency (in this
| article, higher utilisation) should not come at the expense of
| latency?
|
| Because latency is part of the user experience. And UX comes
| before anything else.
|
| Or does it imply that there are a lot of people who don't know
| the trade-off between latency and utilisation? Because I don't
| know anyone who runs utilisation at 1 or even 0.5 in production.
| srg0 wrote:
| Percentiles are order statistics; they are robust and not
| sensitive to outliers. That is why they are sometimes very
| useful, and also why they do not capture how big the remaining
| 0.01% of the data is.
|
| Take the median, which is also an order statistic, and a
| sequence of latency measurements: 0.005 s, 0.010 s, 3600 s.
| The median latency is 0.010 s, and this number does not tell
| you how bad latency can actually be. The mean latency is
| 1200.005 s, which is far more indicative of how bad the worst
| case is.
|
| In other words, percentiles show how often a problem happens
| (or does not happen). Mean values show the impact of the problem.
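|
| For concreteness, the same three numbers in Python:
|
|     import statistics
|
|     # two fast requests and one pathologically slow one
|     latencies = [0.005, 0.010, 3600.0]
|
|     print(statistics.median(latencies))  # 0.01, looks fine
|     print(statistics.mean(latencies))    # 1200.005, clearly not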
| mjb wrote:
| The other two answers you got are good. I will say that
| monitoring p99 (or 99.9 or whatever) is a good thing,
| especially if you're building human-interactive stuff. Here's
| my colleague Andrew Certain talking about how Amazon came to
| that conclusion: https://youtu.be/sKRdemSirDM?t=180
|
| But p99 is just one summary statistic. Most importantly, it's a
| robust statistic that rejects outliers. That's a very good
| thing in some cases. It's also a very bad thing if you care
| about throughput, because throughput is proportional to
| 1/latency, and if you reject the outliers then you'll
| overestimate throughput substantially.
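|
| A toy illustration of that overestimate (the numbers below are
| invented, not from any real system):
|
|     import statistics
|
|     # 99.5% of requests take 10 ms, 0.5% hit a 100 s timeout
|     latencies = sorted([0.010] * 995 + [100.0] * 5)
|
|     p99 = latencies[int(0.99 * len(latencies)) - 1]  # ~0.01 s
|     mean = statistics.mean(latencies)                # ~0.51 s
|
|     # back-to-back requests from one worker: rate = 1 / latency
|     print(1 / p99)   # ~100 req/s, what the robust number promises
|     print(1 / mean)  # ~2 req/s, what the worker actually sustains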
|
| p99 is one tool. A great and useful one, but not for every
| purpose.
|
| > Because I don't know anyone who runs utilisation at 1 or even
| 0.5 in production.
|
| Many real systems like to run much hotter than that. High
| utilization reduces costs, and reduces carbon footprint. Just
| running at low utilization is a reasonable solution for a lot
| of people in a lot of cases, but as margins get tighter and
| businesses get bigger, pushing on utilization can be really
| worthwhile.
| bostik wrote:
| In my previous job, I (and latency-sensitive engineering teams
| in general) mostly went with just four core latency
| measurements:[ss]
|
|       - p50, to see the baseline
|       - p95, to see the most common latency peaks
|       - p99, to see what the "normal" waiting times under load
|         were
|       - max, because that's what the most unfortunate customers
|         experienced
|
| In a typical distributed system the spread between p99 and max
| can be enormous, but the mindset of ensuring a smooth customer
| experience, with the awareness that a real person had to wait
| that long, is exceptionally useful. You need just _one_
| slightly slower service for the worst-case latency to
| skyrocket. In particular, GraphQL is exceptionally bad at
| this without real discipline - the minimum request latency is
| dictated by the SLOWEST downstream service.
|
| To be fair, it was a real-time gambling operation. And we
| _were_ operating within the first Nielsen threshold.
|
| ss: bucketing these by request route was quite useful.
|
| EDIT: formatting
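|
| A minimal sketch of that per-route bucketing (route names and
| numbers are invented):
|
|     import math, random
|     from collections import defaultdict
|
|     # invented samples: latency per request, keyed by route
|     by_route = defaultdict(list)
|     for _ in range(10_000):
|         route = random.choice(["/bet", "/wallet", "/history"])
|         by_route[route].append(random.expovariate(50))  # ~20 ms
|
|     def pct(vals, p):
|         # nearest-rank percentile of an already-sorted list
|         return vals[math.ceil(p / 100 * len(vals)) - 1]
|
|     for route, vals in sorted(by_route.items()):
|         vals.sort()
|         print(route, pct(vals, 50), pct(vals, 95),
|               pct(vals, 99), max(vals))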
| dastbe wrote:
| every statistic is a summary and a lie. p50/p99 metrics are
| good in the sense that they tell you a number someone actually
| experienced, and they put an upper bound on that experience.
| they are bad because they won't tell you how the experience
| below that bound looks.
|
| mean and all its variants won't show you a number that someone
| in your system necessarily experienced, but they will
| incorporate the entire distribution and show when it has
| changed.
|
| in the context of efficiency, mean is beneficial because it can
| be used to measure concurrency in a system via Little's law,
| and it will signal changes in your concurrency that a
| percentile metric won't necessarily show.
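|
| a rough sketch of that point (the rate and latency numbers are
| invented):
|
|     # little's law: mean concurrency = arrival rate * mean latency
|     arrival_rate = 200.0   # requests per second
|
|     mean_before = 0.050    # seconds
|     mean_after  = 0.250    # 0.5% of requests now take ~40 s
|
|     print(arrival_rate * mean_before)  # ~10 requests in flight
|     print(arrival_rate * mean_after)   # ~50 requests in flight
|
| a p50 or p99 chart can stay flat through a change like that if
| the regression lives in the far tail, while the in-flight count
| (and the memory, connections and threads behind it) quintuples.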
| dwohnitmok wrote:
| I think the article is missing one big reason why we care about
| 99.99% or 99.9% latency metrics and that is that we can have high
| latency spikes even with low utilization.
|
| The majority of computer systems do not deal with high
| utilization. As has been pointed out many times, computers are
| really fast these days, and many businesses could get by for
| their entire lifetime on a single machine if the underlying
| software made efficient use of the hardware resources. And yet
| even with low utilization, we still see occasional high
| latency, often enough to frustrate a user. Why is that?
| Because a lot of software these
| days is based on a design that intersperses low-latency
| operations with occasional high-latency ones. This shows up
| everywhere: garbage collection, disk and memory fragmentation,
| growable arrays, eventual consistency, soft deletions followed by
| actual hard deletions, etc.
|
| What this article is advocating for is essentially an amortized
| analysis of throughput and latency, in which case you do have a
| nice and steady relationship between utilization and latency. But
| in a system which may never come close to full utilization of its
| underlying hardware resources (which is a large fraction of
| software running on modern hardware), this amortized analysis is
| not very valuable because even with very low utilization we can
| still have very different latency distributions, depending on
| the software design choices above and how they are tuned.
|
| This is why many software systems don't care about the median
| latency or the average latency, but care about the 99 or 99.9
| percentile latency: there is a utilization-independent
| component to the statistical distribution of your latency over
| time, and for the many software systems with low utilization of
| hardware resources, that component, not utilization, is the
| main determinant of the overall latency profile.
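|
| A toy version of that utilization-independent component
| (everything here is invented): a service that answers requests
| in 1 ms but runs a 100 ms amortized operation every 500th
| request, at essentially zero utilization.
|
|     import math
|
|     # think GC pause, growable-array resize, compaction...
|     latencies = [0.100 if i % 500 == 499 else 0.001
|                  for i in range(100_000)]
|
|     latencies.sort()
|     p50 = latencies[len(latencies) // 2]
|     p999 = latencies[math.ceil(0.999 * len(latencies)) - 1]
|
|     print(p50, p999)  # 0.001 vs 0.1: the spike has nothing
|                       # to do with load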
| MatteoFrigo wrote:
| Even worse, the effects that you mention (garbage collection,
| etc.) are morally equivalent to an increase in utilization,
| which pushes you towards the latency singularity that the
| article is talking about.
|
| As an oversimplified example, suppose that your system is 10%
| utilized and that $BAD_THING (gc, or whatever) happens that
| effectively slows down the system by a factor of 10 at least
| temporarily. Your latency does not go up by 10x---it grows
| unbounded because now your effective utilization is 100%.
| dvh wrote:
| Grace Hopper explaining 1 nanosecond:
| https://youtu.be/9eyFDBPk4Yw
| MatteoFrigo wrote:
| Great article, and a line of reasoning that ought to be more
| widely known. There is a similar tradeoff between latency and
| utilization in hash tables, for essentially the same reason.
|
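| A little linear-probing sketch (table size and keys arbitrary)
| shows the same curve: probes per insert stay near 1 at half
| load and blow up as the table fills.
|
|     import random
|
|     SIZE = 1 << 16
|     table = [None] * SIZE
|
|     def insert(key):
|         # linear probing; returns the number of slots inspected
|         i = hash(key) % SIZE
|         probes = 1
|         while table[i] is not None:
|             i = (i + 1) % SIZE
|             probes += 1
|         table[i] = key
|         return probes
|
|     inserted = 0
|     for target in (0.5, 0.9, 0.99):
|         window = []
|         while inserted < SIZE * target:
|             window.append(insert(random.getrandbits(64)))
|             inserted += 1
|         # mean probes while filling up to this load factor
|         print(target, sum(window) / len(window))
|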
| The phenomenon described by the author can lead to interesting
| social dynamics over time. The initial designer of a system
| understands the latency/utilization tradeoff and dimensions the
| system to be underutilized so as to meet latency goals. Then the
| system is launched and successful, so people start questioning
| the low utilization, and apply pressure to increase utilization
| in order to reduce costs. Invariably, latency goes up and
| customers complain. Customers escalate, and projects are
| started to reduce latency. People screw around at the margins,
| changing the number of threads and so on, but the fundamental
| tradeoff cannot be avoided.
| Nobody is happy in the end. (Been through this cycle a few times
| already.)
| dharmab wrote:
| In my org, we define latency targets early (based on our users'
| needs where possible) and then our goal is to maximize
| utilization within those constraints.
| jrochkind1 wrote:
| Mostly a reminder/clarification of things I knew, but a good
| and welcome one, well stated, because I probably sometimes
| forget. (I don't do a lot of performance work.)
|
| But this:
|
| > If you must use latency to measure efficiency, use mean (avg)
| latency. Yes, average latency
|
| Not sure if I ever thought about it before, but after following
| the link[1] where OP talks more about it, they've convinced me.
| Definitely want mean latency at least in addition to median, not
| median alone.
|
| [1]: https://brooker.co.za/blog/2017/12/28/mean.html
| zekrioca wrote:
| > If we're expecting 10 requests per second at peak this
| holiday season, we're good.
|
| The problem is, sometimes system engineers do not know what to
| expect, but they still need to have a plan for that case.
| [deleted]
| marcosdumay wrote:
| > at least in addition to median
|
| There was an interesting article here not long ago that made
| the point that the median is basically useless. If you load 5
| resources on a page load, the odds of all of them being faster
| than the median (so that the median represents the user
| experience) are about 3%. You need a very high percentile to
| get any useful information, probably one with a number of 9s.
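|
| The arithmetic, for reference (assuming the 5 fetches are
| independent):
|
|     # chance that all 5 resources beat a given percentile
|     for p in (0.50, 0.90, 0.99, 0.999):
|         print(p, p ** 5)
|
|     # 0.50 -> ~0.03: the ~3% above
|     # 0.99 -> ~0.95: p99 per resource gives a fast page
|     #                about 95% of the time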
| jrochkind1 wrote:
| Median for a particular action/page might be more useful.
___________________________________________________________________
(page generated 2021-08-27 23:01 UTC)