[HN Gopher] FBDetect: Catching Tiny Performance Regressions at H...
___________________________________________________________________
FBDetect: Catching Tiny Performance Regressions at Hyperscale [pdf]
Author : pjmlp
Score : 51 points
Date : 2024-11-12 07:42 UTC (3 days ago)
(HTM) web link (tangchq74.github.io)
(TXT) w3m dump (tangchq74.github.io)
| dataflow wrote:
| Detecting a 0.005% regression means detecting that a 20s task now
| takes 20.001s.
|
| It's not even easy to reliably detect such a small performance
| regression on a single thread on a single machine.
|
| I suppose in theory having multiple machines _could_ actually
| improve the situation, by letting them average out the noise? But
| on the other hand, it's not like you have identically
| distributed samples to work with - workloads have variance over
| time and space, so there's extra noise that isn't uniform across
| machines.
|
| Color me a little skeptical, but it's super cool if actually
| true.
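|
| To make the averaging intuition above concrete, here is a toy
| back-of-the-envelope sketch (the 200 ms jitter figure and the
| normal-noise model are my own assumptions, not from the paper):
|
|     import statistics, random
|
|     random.seed(0)
|     base_s = 20.0                  # hypothetical 20 s task
|     delta_s = base_s * 0.00005     # a 0.005% regression = 1 ms
|     noise_s = 0.200                # assumed 200 ms run-to-run jitter
|
|     def mean_of(n, shift):
|         # average n noisy samples; std error shrinks ~ 1/sqrt(n)
|         return statistics.fmean(
|             random.gauss(base_s + shift, noise_s) for _ in range(n))
|
|     for n in (100, 10_000, 1_000_000):
|         print(n, mean_of(n, delta_s) - mean_of(n, 0.0))
|
| With a few hundred samples the 1 ms shift is invisible under the
| jitter; around a million samples it starts to stand out, provided
| the noise really is independent and does average out.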
| yuliyp wrote:
| The 0.005% is a bit of a headline-grabber for sure, but the idea
| makes sense given the context: They're monitoring large, multi-
| tenant/multi-use-case systems which have very large amounts of
| diverse traffic. In these cases, the regression may be 0.005%
| of overall cost, but you don't detect it at that level; rather,
| you detect a 0.5% regression in a use case that accounts for 1%
| of the cost. They can and do slice data in various ways (group by
| endpoint, group by function, etc.) to improve the odds of
| detecting a regression.
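|
| As a toy illustration of that arithmetic (the numbers are made
| up for the example, not taken from the paper):
|
|     total_cpu = 1_000_000        # arbitrary units of fleet CPU
|     slice_share = 0.01           # affected use case is 1% of cost
|     slice_regression = 0.005     # and it regressed by 0.5%
|
|     slice_cpu = total_cpu * slice_share
|     extra_cpu = slice_cpu * slice_regression
|
|     overall = extra_cpu / total_cpu
|     print(overall)               # 0.00005, i.e. 0.005% overall
|
| Measured fleet-wide the change is 0.005% and hopeless to see;
| measured within the slice it's a 0.5% jump, a much easier signal.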
| danielheath wrote:
| > They can and do slice data in various ways
|
| If you're looking for statistical validity, this is not the
| way to go about it; if you perform 20 analyses, there's too
| high a chance that at least one of them will spuriously come
| out significant at p < .05 (see p-hacking).
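|
| The multiple-comparisons problem is easy to put numbers on (this
| is the generic textbook calculation, nothing specific to
| FBDetect):
|
|     alpha = 0.05    # per-test false positive rate
|     k = 20          # number of independent slices/analyses
|
|     # chance at least one test is spuriously "significant"
|     family_wise = 1 - (1 - alpha) ** k
|     print(family_wise)                 # ~0.64
|
|     # a Bonferroni-style corrected threshold pulls it back down
|     print(1 - (1 - alpha / k) ** k)    # ~0.049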
| yuliyp wrote:
| They're not trying to write a paper about what they've identified.
| They're trying to identify when regressions creep in so
| they can be fixed. Half the paper talks about all the ways
| they filter signals into things which are likely
| actionable. Sure they'll still get some wrong, but as long
| as the precision is high enough for engineers to treat the
| regression reports seriously and the recall is high enough
| that costs stay down, that's all that matters.
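|
| Concretely, the trade-off is the usual precision/recall one
| (the counts below are invented purely for illustration):
|
|     true_pos = 40    # reports that were real regressions
|     false_pos = 10   # reports that turned out to be noise
|     false_neg = 5    # real regressions that were never flagged
|
|     precision = true_pos / (true_pos + false_pos)   # 0.8
|     recall = true_pos / (true_pos + false_neg)      # ~0.89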
| vlovich123 wrote:
| You're right to be skeptical. This entire space, as far as I can
| tell, is filled with people who overpromise, underdeliver, and
| use bad metrics to claim success. If you look at their false
| positive and false negative sections, they perform poorly but
| use wording that claims it's actually good, and they use flawed
| logic to extrapolate over missing data (e.g. assuming their
| rates stay the same for non-responses, rather than "people are
| tired of our tickets and ignore our system"). And as follow-up
| work their solution is to keep tuning their parameters (i.e.
| keep fiddling to overfit past data). You can even tell how it's
| perceived: they describe people not even bothering to interact
| with their system during 2 of the 4 high-impact incidents
| examined, and they blame the developer for one of them because
| "they didn't integrate the metrics". If a system could provide a
| meaningful cost/benefit, teams would be clamoring to adjust
| their processes. Until demonstrated clearly otherwise, it's
| dressed-up numerology.
|
| I saw a team at Oculus try to do this for highly constrained,
| isolated environments with repeatable workloads and much more
| conservative thresholds (e.g. 1-10%), and fail. This paper is
| promulgating filtering your data all to hell, to the point of
| overfitting.
| dataflow wrote:
| Thanks, I hadn't read that far into the paper. But I have to
| say I had what I feel is a good reason to be skeptical before
| even reading a single word in the paper, honestly. Which is
| that Facebook never felt so... blazing fast, shall we say, to
| make me believe anyone even _wanted_ to pay attention to tiny
| performance regressions, let alone had the drive and tooling to
| do so.
| vlovich123 wrote:
| > blazing fast, shall we say, to make me believe anyone
| even wanted to pay attention to tiny performance
| regressions
|
| Important to distinguish frontend vs backend performance of
| course. This is about backend performance, where they care
| about this stuff a lot because it multiplies at scale &
| starts costing them real money. Frontend performance has
| less of a direct impact on their numbers; the only data I
| know of on that is the oft-cited Google work claiming a
| direct correlation between latency and lost revenue (which
| I haven't seen anyone else bother to try to replicate & see
| if it holds up).
| dataflow wrote:
| > Important to distinguish frontend vs backend
| performance of course.
|
| I'm not sure what you're calling frontend in this context
| (client side? or client-facing "front-end" servers?), but
| I'm talking about their server side specifically. I've
| looked at their API calls and understand when slowness is
| coming from the client or the server. The client side is
| even slower for sure, but the server side also never felt
| like it was optimized to the point where deviations this
| small mattered.
| jacques_chester wrote:
| I think the confusion arises because of the difference
| between optimization and control, which are superficially
| similar.
|
| Having control lets you see if things changed.
| Optimization is changing things.
|
| This team seems to be focused on control. I assume
| optimization is left to the service teams.
| dataflow wrote:
| I think by control you mean observability?
|
| I get that they're different, but the whole point _is_
| optimization here. They're not gathering performance
| metrics just to hang them up on a wall and marvel at the
| number of decimal points, right? They presumably invested
| all the effort into this infrastructure because they
| think this much precision has significant ROI on the
| optimization side.
| jeffbee wrote:
| Trying to do fine-grained regression detection in a
| controlled environment is indeed a fool's errand, but that's
| the opposite of what this paper is about.
| vlovich123 wrote:
| Your claim is that doing fine-grained detection in a more
| chaotic and dynamic environment with unrepeatable inputs is
| easier than fine-grained detection in a controlled
| environment with consistent inputs? Not sure I follow.
| jeffbee wrote:
| Yes, it often is. In a "controlled" environment you
| control what you control and not the stuff that you can't
| control or don't even know about. It's tedious and sort
| of a chore to set up, and a source of ongoing woes
| afterwards. On the other hand natural experiments abound
| when you have large scale. Just having really large N on
| both arms of an experiment where the difference is the
| compiled program, and there are no systematic biases in
| traffic patterns, will reveal smaller regressions that
| weren't found in the controlled load tests.
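|
| The large-N effect is just the standard two-sample power
| calculation (a generic sketch; the noise figure is an assumed
| coefficient of variation, not a number from the paper):
|
|     import math
|
|     sigma = 0.30        # assumed relative per-request noise
|     z = 1.96 + 0.84     # ~95% confidence, ~80% power
|
|     def min_detectable_effect(n_per_arm):
|         # smallest relative shift resolvable between two arms
|         return z * sigma * math.sqrt(2.0 / n_per_arm)
|
|     for n in (1_000, 1_000_000, 1_000_000_000):
|         print(n, min_detectable_effect(n))
|
| A load test with a few thousand requests per arm resolves shifts
| of a few percent at best; a fleet serving billions of requests
| per arm can, in principle, resolve shifts well below 0.01%, as
| long as traffic is split without systematic bias.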
| jacques_chester wrote:
| I don't agree. This is basically an elaborate form of
| statistical process control, which has been proving itself
| useful and effective for nearly a century. We can quibble
| about the thresholds and false positive rates, but I think
| the idea of automating regression detection is perfectly
| sound.
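|
| For reference, the textbook version of that idea is a control
| chart: establish a baseline, then flag points that drift outside
| a few standard deviations (a minimal sketch; FBDetect's actual
| detection logic is considerably more elaborate):
|
|     import statistics
|
|     def out_of_control(baseline, new_points, k=3.0):
|         # flag observations outside mean +/- k*stdev of baseline
|         mu = statistics.fmean(baseline)
|         sd = statistics.stdev(baseline)
|         return [x for x in new_points if abs(x - mu) > k * sd]
|
|     baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 10.0]  # ms/call
|     print(out_of_control(baseline, [10.1, 10.3, 11.5]))  # [11.5]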
| protomolecule wrote:
| "...measuring CPU usage at the subroutine level rather than at
| the overall service level. However, if this 0.005% regression
| originates from a single subroutine that consumes 0.1% of the
| total CPU, the relative change at the subroutine level is
| 0.005% / 0.1% = 5%, which is much more substantial.
| Consequently, small regressions are easier to detect at the
| subroutine level."
| ein0p wrote:
| This is the company that famously runs their main website on ...
| drum roll ... PHP. I know it's not off-the-shelf PHP, and they do
| a lot of clever stuff with it, but to worry about single digit
| milliseconds when you have such a barnacle as the centerpiece of
| your business is hilarious.
| tekknolagi wrote:
| They are extraordinarily different languages and runtime
| systems.
| summerlight wrote:
| I think most of their performance-critical infrastructure is
| written in native languages like C++, and even their PHP
| (actually Hack) parts are significantly modified and optimized.
| woadwarrior01 wrote:
| Hack superficially looks like PHP, but is a completely
| different beast under the hood.
| toast0 wrote:
| I know it's popular to hate on PHP. But when you do it right,
| you can turn pages out fast enough that single digit
| milliseconds are important. I'm not a fan of Hack or how FB
| structures their PHP; but some of their pages are pretty quick.
| Personally, with just regular PHP, my baseline is about 10 ms
| for a minimum page, and 50 ms for something that's complex and
| maybe has a few database queries. That's not the quickest thing
| in the world, and not all of my PHP goes that fast, but single
| digit milliseconds are still significant at that scale.
| yuliyp wrote:
| HHVM is an amazingly optimized VM for their workload. That PHP
| runs many times faster than off-the-shelf PHP does. I suspect
| that waving a magic wand to rewrite it all in another language
| on any other widely available VM would result in something that
| performed significantly worse.
| paxys wrote:
| There is no PHP at Facebook/Meta
___________________________________________________________________
(page generated 2024-11-15 23:01 UTC)