[HN Gopher] FBDetect: Catching Tiny Performance Regressions at H...
       ___________________________________________________________________
        
       FBDetect: Catching Tiny Performance Regressions at Hyperscale [pdf]
        
       Author : pjmlp
       Score  : 51 points
       Date   : 2024-11-12 07:42 UTC (3 days ago)
        
 (HTM) web link (tangchq74.github.io)
 (TXT) w3m dump (tangchq74.github.io)
        
       | dataflow wrote:
       | Detecting a 0.005% regression means detecting that a 20s task now
       | takes 20.001s.
       | 
       | It's not even easy to reliably detect such a small performance
       | regression on a single thread on a single machine.
       | 
       | I suppose in theory having multiple machines _could_ actually
       | improve the situation, by letting them average out the noise? But
        | on the other hand, it's not like you have identically
       | distributed samples to work with - workloads have variance over
       | time and space, so there's extra noise that isn't uniform across
       | machines.
       | 
       | Color me a little skeptical, but it's super cool if actually
       | true.
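        | 
        | Rough back-of-envelope on the averaging idea (my own toy
        | numbers, assuming i.i.d. samples with ~1% noise each, which a
        | real fleet won't give you):
        | 
        |     import math
        | 
        |     noise_frac = 0.01      # ~1% relative noise per sample (assumed)
        |     shift_frac = 0.00005   # the 0.005% regression
        |     z = 2.0                # rough 95% detection threshold
        | 
        |     # need z * noise_frac / sqrt(n) < shift_frac
        |     n = math.ceil((z * noise_frac / shift_frac) ** 2)
        |     print(n)               # ~160,000 independent samples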
        
         | yuliyp wrote:
         | The .005% is a bit of a headline-grabber for sure, but the idea
         | makes sense given the context: They're monitoring large, multi-
         | tenant/multi-use-case systems which have very large amounts of
         | diverse traffic. In these cases, the regression may be .005% of
         | the overall size, but you don't detect it like that, but rather
         | by detecting a 0.5% regression in a use case which was 1% of
         | the cost. They can and do slice data in various ways (group by
         | endpoint, group by function, etc.) to improve the odds of
         | detecting a regression.
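          | 
          | The slice arithmetic, spelled out with toy numbers (mine,
          | not the paper's exact figures):
          | 
          |     # a 0.5% slowdown in a use case that owns 1% of the
          |     # fleet's cost is only 0.005% of the total
          |     slice_share      = 0.01    # use case is 1% of total cost
          |     slice_regression = 0.005   # that use case got 0.5% slower
          |     overall = slice_share * slice_regression
          |     print(f"{overall:.5%}")    # 0.00500%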
        
           | danielheath wrote:
           | > They can and do slice data in various ways
           | 
            | If you're looking for statistical validity, this is not
            | the way to go about it; if you perform 20 analyses,
            | there's too high a chance that one of them will
            | spuriously look significant at p < .05 (see p-hacking).
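            | 
            | The multiple-comparisons arithmetic behind that worry
            | (standard stats, nothing from the paper):
            | 
            |     # at a 5% false-positive rate, 20 independent slices
            |     # give a ~64% chance at least one looks "significant"
            |     # by luck alone
            |     alpha, tests = 0.05, 20
            |     p_any = 1 - (1 - alpha) ** tests
            |     print(f"{p_any:.0%}")           # 64%
            |     print(f"{alpha / tests:.4f}")   # 0.0025 Bonferroni cut-off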
        
             | yuliyp wrote:
              | They're not trying to write a paper about the
              | identified regressions.
             | They're trying to identify when regressions creep in so
             | they can be fixed. Half the paper talks about all the ways
             | they filter signals into things which are likely
             | actionable. Sure they'll still get some wrong, but as long
             | as the precision is high enough for engineers to treat the
             | regression reports seriously and the recall is high enough
             | that costs stay down, that's all that matters.
        
         | vlovich123 wrote:
          | You're right to be skeptical. As far as I can tell this
          | entire space is filled with people who overpromise,
          | under-deliver, and use bad metrics to claim success. If you
          | look at their false positive and false negative sections,
          | they perform terribly, but the paper uses words to claim
          | it's actually good and flawed logic to extrapolate over
          | missing data (e.g. assuming their rates stay the same for
          | non-responses rather than "people are tired of our tickets
          | and ignore our system"). And as follow-up work their
          | solution is to keep tuning their parameters (i.e. keep
          | fiddling to overfit past data). You can even tell how the
          | system is perceived: they describe people not even
          | bothering to interact with it during 2 of the 4 high-impact
          | incidents examined, and blame the developer for one of them
          | because "they didn't integrate the metrics". If a system
          | really provided a meaningful cost/benefit, teams would be
          | clamoring to adjust their processes around it. Until
          | demonstrated clearly otherwise, it's dressed-up numerology.
          | 
          | I saw a team at Oculus fail to do this in highly
          | constrained, isolated environments with repeatable
          | workloads and a far more conservative threshold (e.g.
          | 1-10%). This paper is promulgating filtering your data all
          | to hell, to the point of overfitting.
        
           | dataflow wrote:
           | Thanks, I hadn't read that far into the paper. But I have to
           | say I had what I feel is a good reason to be skeptical before
           | even reading a single word in the paper, honestly. Which is
           | that Facebook never felt so... blazing fast, shall we say, to
           | make me believe anyone even _wanted_ to pay attention to tiny
            | performance regressions, let alone had the drive and
            | tooling to do so.
        
             | vlovich123 wrote:
             | > blazing fast, shall we say, to make me believe anyone
             | even wanted to pay attention to tiny performance
             | regressions
             | 
             | Important to distinguish frontend vs backend performance of
             | course. This is about backend performance where they care
             | about this stuff a lot because it multiplies at scale &
             | starts costing them real money. Frontend performance has
              | less of a direct impact on their numbers; the only data
              | I know of on that front is the oft-cited Google work
              | claiming a direct correlation between lost revenue and
              | latency (which I haven't seen anyone else bother to try
              | to replicate to see if it holds up).
        
               | dataflow wrote:
               | > Important to distinguish frontend vs backend
               | performance of course.
               | 
               | I'm not sure what you're calling frontend in this context
               | (client side? or client-facing "front-end" servers?), but
               | I'm talking about their server side specifically. I've
               | looked at their API calls and understand when slowness is
               | coming from the client or the server. The client side is
               | even slower for sure, but the server side also never felt
               | like it was optimized to the point where deviations this
               | small mattered.
        
               | jacques_chester wrote:
               | I think the confusion arises because of the difference
               | between optimization and control, which are superficially
               | similar.
               | 
               | Having control lets you see if things changed.
               | Optimization is changing things.
               | 
               | This team seems to be focused on control. I assume
               | optimization is left to the service teams.
        
               | dataflow wrote:
               | I think by control you mean observability?
               | 
               | I get that they're different, but the whole point _is_
                | optimization here. They're not gathering performance
               | metrics just to hang them up on a wall and marvel at the
               | number of decimal points, right? They presumably invested
               | all the effort into this infrastructure because they
               | think this much precision has significant ROI on the
               | optimization side.
        
           | jeffbee wrote:
           | Trying to do fine-grained regression detection in a
           | controlled environment is indeed a fool's errand, but that's
           | the opposite of what this paper is about.
        
             | vlovich123 wrote:
             | Your claim is that doing fine-grained detection in a more
             | chaotic and dynamic environment with unrepeatable inputs is
             | easier than fine-grained detection in a controlled
             | environment with consistent inputs? Not sure I follow.
        
               | jeffbee wrote:
               | Yes, it often is. In a "controlled" environment you
               | control what you control and not the stuff that you can't
               | control or don't even know about. It's tedious and sort
                | of a chore to set up and a source of ongoing woes
               | afterwards. On the other hand natural experiments abound
               | when you have large scale. Just having really large N on
               | both arms of an experiment where the difference is the
               | compiled program, and there are no systematic biases in
               | traffic patterns, will reveal smaller regressions that
               | weren't found in the controlled load tests.
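                | 
                | A minimal sketch of what that natural experiment
                | looks like (my own toy framing, a plain two-sample
                | z-test, not the paper's machinery):
                | 
                |     import math, random
                | 
                |     random.seed(1)
                |     N = 2_000_000        # requests per arm
                | 
                |     def arm_mean(mu):
                |         # mean of N requests with 0.2 s per-request noise
                |         return sum(random.gauss(mu, 0.2)
                |                    for _ in range(N)) / N
                | 
                |     diff = arm_mean(20.001) - arm_mean(20.000)
                |     se = math.sqrt(2 * 0.2 ** 2 / N)   # se of the diff
                |     print(f"delta={diff:.5f}s  z={diff / se:.1f}")
                | 
                | The 0.005% shift only clears the noise because N is
                | enormous and traffic is assumed unbiased between the
                | two arms.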
        
           | jacques_chester wrote:
           | I don't agree. This is basically an elaborate form of
           | statistical process control, which has been proving itself
           | useful and effective for nearly a century. We can quibble
           | about the thresholds and false positive rates, but I think
           | the idea of automating regression detection is perfectly
           | sound.
        
         | protomolecule wrote:
         | "...measuring CPU usage at the subroutine level rather than at
         | the overall service level. However, if this 0.005% regression
         | originates from a single subroutine that consumes 0.1% of the
         | total CPU, the relative change at the subroutine level is
         | 0.005% / 0.1% = 5%, which is much more substantial.
         | Consequently, small regressions are easier to detect at the
         | subroutine level."
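          | 
          | Spelling out that arithmetic (numbers taken from the quoted
          | passage):
          | 
          |     total_regression = 0.00005   # 0.005% of service CPU
          |     subroutine_share = 0.001     # subroutine is 0.1% of CPU
          |     relative = total_regression / subroutine_share
          |     print(f"{relative:.0%}")     # 5% at the subroutine level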
        
       | ein0p wrote:
       | This is the company that famously runs their main website on ...
       | drum roll ... PHP. I know it's not off-the-shelf PHP, and they do
       | a lot of clever stuff with it, but to worry about single digit
       | milliseconds when you have such a barnacle as the centerpiece of
       | your business is hilarious.
        
         | tekknolagi wrote:
         | They are extraordinarily different languages and runtime
         | systems.
        
         | summerlight wrote:
          | I think most of their performance-critical infrastructure
          | is written in native languages like C++, and even their
          | PHP (actually Hack) parts are significantly modified and
          | optimized.
        
         | woadwarrior01 wrote:
         | Hack superficially looks like PHP, but is a completely
         | different beast under the hood.
        
         | toast0 wrote:
         | I know it's popular to hate on PHP. But when you do it right,
         | you can turn pages out fast enough that single digit
         | milliseconds are important. I'm not a fan of Hack or how FB
         | structures their PHP; but some of their pages are pretty quick.
         | Personally, with just regular PHP, my baseline is about 10 ms
         | for a minimum page, and 50 ms for something that's complex and
         | maybe has a few database queries. That's not the quickest thing
         | in the world, and not all of my PHP goes that fast, but single
         | digit milliseconds are still significant at that scale.
        
         | yuliyp wrote:
         | HHVM is an amazingly-optimized VM for their workload. That PHP
         | runs many times faster than off-the-shelf PHP does. I suspect
         | taking any other widely available VM and waving a magic wand to
         | rewrite in another language would result in something that
         | performed significantly worse.
        
         | paxys wrote:
         | There is no PHP at Facebook/Meta
        
       ___________________________________________________________________
       (page generated 2024-11-15 23:01 UTC)