[HN Gopher] Reverse Debugging at Scale
       ___________________________________________________________________
        
       Reverse Debugging at Scale
        
       Author : kiyanwang
       Score  : 57 points
       Date   : 2021-05-03 11:26 UTC (11 hours ago)
        
 (HTM) web link (engineering.fb.com)
 (TXT) w3m dump (engineering.fb.com)
        
       | [deleted]
        
       | jeffbee wrote:
       | It is fairly surprising to me that FB would pay a roughly 5%
       | throughput penalty to get this.
        
         | adamfeldman wrote:
         | Where does the 5% number come from? I didn't see it in the
         | article or on the linked page for Intel Processor Trace.
        
           | jeffbee wrote:
           | Pulled from my experience. How much stuff needs to be logged
           | depends on the structure of the program, and different
           | programs are more or less sensitive to having some of their
           | memory store bandwidth stolen.
        
             | adamfeldman wrote:
             | Thank you for sharing!
        
         | sacheendra wrote:
          | It is not that surprising. Facebook is a complex ecosystem,
          | and therefore experiences lots of outages/problems.
          | 
          | Looking at their status page
          | (https://developers.facebook.com/status/dashboard/), they
          | seem to have problems every week. These are only the public
          | ones!
         | 
         | I guess they figured they are losing more money due to these
         | problems than the additional 5% they have to spend on
         | infrastructure.
        
           | jeffbee wrote:
           | I can see why an org would choose to do this, but the number
           | is still frightening. At Google, we were held to a limit of
            | at most 0.01% cost for sampling LBR events. 5% for
            | debuggability just seems really high.
        
         | slver wrote:
          | In a car, you get the best speed when you press the pedal to
          | the metal and close your eyes. Yet we pay a performance
          | penalty by driving carefully instead.
        
           | r00fus wrote:
            | The car analogy is not applicable - think instead of
            | lowering the speed of an entire high-speed roadway by 5 mph
            | (e.g. via road material/quality) - that has a qualitative
            | difference at that scale.
        
             | slver wrote:
              | Point is, slowing down a bit allows us to see what's
              | happening, so we can make course corrections and crash
              | less.
             | 
             | That's an understandable tradeoff for car driving.
             | 
             | It's an understandable tradeoff for Facebook debugging.
             | 
             | Ergo, we do it.
        
         | DSingularity wrote:
         | They must have a reason. Probably helps them resolve otherwise
         | costly failures in good time.
        
       | bluedino wrote:
        | Sort of surprised to see VS Code and LLDB mentioned. So Java
        | or C++? Rust?
        
         | Veserv wrote:
          | The technology they are describing is largely language-agnostic,
          | as it is just reconstructing the sequence of hardware
          | instructions that executed. So, in principle, you can apply the
          | underlying technique to any language as long as you can
          | determine the source line that corresponds to a hardware
          | instruction at a point in time. Any standard debugger already
          | does this, at least for AOT-compiled languages: it is how a
          | debugger uses the hardware instruction the processor stopped
          | at to tell you which source code line you are stopped at. For
          | JIT or interpreted languages it is slightly more complex, but
          | still a solved problem.
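          | 
          | A minimal sketch of that address -> source-line lookup, using
          | DWARF line tables via pyelftools (the binary path and address
          | below are made up for illustration):
          | 
          |     from elftools.elf.elffile import ELFFile
          | 
          |     def addr_to_line(binary_path, target_addr):
          |         # Walk each CU's DWARF line program and find the
          |         # row whose address range covers target_addr.
          |         with open(binary_path, 'rb') as f:
          |             dwarf = ELFFile(f).get_dwarf_info()
          |             for cu in dwarf.iter_CUs():
          |                 lp = dwarf.line_program_for_CU(cu)
          |                 prev = None
          |                 for entry in lp.get_entries():
          |                     state = entry.state
          |                     if state is None:
          |                         continue  # not a table row
          |                     if (prev is not None and
          |                             prev.address <= target_addr
          |                             < state.address):
          |                         fe = lp['file_entry'][prev.file - 1]
          |                         return fe.name.decode(), prev.line
          |                     prev = None if state.end_sequence else state
          |         return None
          | 
          |     # e.g. addr_to_line('./myprog', 0x401132)
          |     #      -> ('main.c', 42)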
        
           | roca wrote:
           | It won't work for anything with a JIT or interpreter, not
           | without _significantly_ more work.
        
             | Veserv wrote:
              | Assuming that a Java debugger can convert a breakpoint to
              | its corresponding source line, it must maintain some sort
              | of source<->assembly mapping that transforms over time to
              | do that lookup. As long as you record those changes,
              | namely the introduction or destruction of any branches
              | that Intel PT would record, the same underlying approach
              | should work.
              | 
              | The primary complexities are making sure those JIT records
              | are ordered correctly with respect to branches in the
              | actual program, and handling the case where the JIT
              | deletes the original program text, as that might require
              | actually reversing the execution and JIT history to
              | recover the instructions at the time of recording. This
              | would require adding some instrumentation to the JIT to
              | record branches that were inserted or deleted, but that
              | seems like something that can be implemented as a
              | post-processing step at a relatively minor performance
              | cost, so it seems quite doable. If there are no deletions,
              | you could just use the final JIT state for the
              | source<->assembly mapping. Is there something I am
              | missing, beyond glossing over the potential difficulties
              | of engaging with a giant code base that might not be
              | amenable to changes? See the sketch below.
              | 
              | As for an interpreter, I have not really thought about it
              | too hard. It might be harder than I was originally
              | considering, because I was thinking in the context of a
              | full data trace, which would just let you re-run the
              | program + interpreter. With just an instruction trace you
              | might need a lot more support from the interpreter.
              | Alternatively, you might be able to do it if the
              | interpreter internals properly separate out handling for
              | the interpreted instructions, and you could use that to
              | reverse engineer what the interpreted program executed.
              | Though that would probably require a fair bit of
              | language/interpreter-specific work. Also, given the
              | expected relative execution speeds of probably ~10x, it
              | would probably not be so great, since you get so much less
              | execution per unit of storage.
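              | 
              | A hypothetical sketch of those timestamped JIT mapping
              | records (the record format is invented for illustration,
              | not taken from the article): replay the region events up
              | to a trace timestamp, then resolve the address.
              | 
              |     from dataclasses import dataclass
              | 
              |     @dataclass
              |     class JitEvent:
              |         timestamp: int   # trace-correlated clock
              |         start: int       # code region base address
              |         size: int
              |         source: str      # e.g. "Foo.java:42"
              |         freed: bool = False  # True = region removed
              | 
              |     def resolve(events, addr, at_time):
              |         # Rebuild the set of live JIT code regions
              |         # as they were at this point in the trace.
              |         live = {}
              |         for e in sorted(events,
              |                         key=lambda e: e.timestamp):
              |             if e.timestamp > at_time:
              |                 break
              |             if e.freed:
              |                 live.pop(e.start, None)
              |             else:
              |                 live[e.start] = e
              |         for start, e in live.items():
              |             if start <= addr < start + e.size:
              |                 return e.source
              |         return None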
        
       | Veserv wrote:
        | From what I can tell, they are just using a standard
        | instruction trace rather than a full trace, so they can only
        | inspect execution history rather than the full data history
        | that most other time travel debugging solutions provide. The
        | advantage of just using the standard hardware instruction trace
        | functionality is that it works even on shared-memory
        | multithreaded applications at "full" speed, unlike most other
        | time travel debugging solutions. The disadvantages are that it
        | requires higher storage bandwidth, that Intel does not seem to
        | support data trace, and that even if it did, data trace would
        | require significantly more storage bandwidth (something like
        | 10x-100x).
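        | 
        | A back-of-envelope sketch of that gap; every number below is
        | an illustrative assumption, not a measured figure:
        | 
        |     rate = 3e9            # retired instructions/sec/core
        |     it_bits_per_insn = 1  # control-flow trace: ~1 bit/insn
        |     mem_fraction = 0.3    # share of insns touching memory
        |     bytes_per_access = 16 # 8-byte value + 8-byte address
        | 
        |     it_bw = rate * it_bits_per_insn / 8   # ~0.4 GB/s
        |     dt_bw = rate * mem_fraction * bytes_per_access  # ~14 GB/s
        |     print(f"~{dt_bw / it_bw:.0f}x")       # ~38x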
        
       | inglor wrote:
        | Ok, so how is thousands of servers 0.1%? That implies they
        | have millions of servers - one for every 9000 people on earth.
        | Are companies this size really that wasteful in terms of
        | servers needed?
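        | 
        | For concreteness, the implied arithmetic (the sampled count is
        | a made-up illustration):
        | 
        |     sampled = 1_000    # "thousands of servers"
        |     fraction = 0.001   # 0.1% of the fleet
        |     fleet = sampled / fraction  # => 1,000,000 servers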
        
         | akiselev wrote:
          | If they have two million servers that would mean about 1,000
          | daily active users per server. Assuming the average user
          | makes 2,000 requests (API calls, images, videos, etc.) a day
          | mindlessly browsing the infinite feed, that works out to
          | about 23 requests per second per server.
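          | 
          | Working that estimate through (all inputs are the
          | assumptions above, not real Facebook numbers):
          | 
          |     servers = 2_000_000
          |     dau = 2_000_000_000       # ~2B daily active users
          |     req_per_user_day = 2_000
          | 
          |     users_per_server = dau / servers          # 1,000
          |     req_per_server_sec = (users_per_server *
          |                           req_per_user_day) / 86_400
          |     print(round(req_per_server_sec))          # ~23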
         | 
         | Facebook makes its money from advertisers so that's likely
         | where most of the compute resources are going - users just see
         | the ads at the end of all that computation. Combined with the
          | mandatory overprovisioning, the overhead of a massive
          | distributed system, tracing, etc., I'm not surprised those are
         | the numbers.
         | 
          | Assuming each server costs an average of $20k, that's $40
          | billion, which is two quarters' worth of revenue, but
          | amortized over 5+ years. It's really not all that much.
        
         | packetslave wrote:
         | The official public answer is "millions of servers" (see, for
         | example, https://engineering.fb.com/2018/07/19/data-
         | infrastructure/lo...).
         | 
         | Keep in mind that this includes Instagram and Whatsapp too, as
          | far as I know. As for "wasteful", well..
          | 
          |     1.88 billion DAU (Q1 earnings report)
          |     / 86400
          |     = 21759 "users per second" (note: I made this up)
         | 
         | Multiply that by N where N is the number of frontend and
         | backend queries it takes to service one user, and you have a
         | _lot_ of QPS. Now add in headroom, redundancy, edge POPs and
         | CDN to put servers as close to users as possible, etc.
         | 
         | It's hard to fathom just how _big_ traffic to FAANG services
         | can be, until you see what it takes to serve it. Is there some
          | waste? Sure, probably, but not as much as you'd think.
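          | 
          | To make that concrete (N below is a made-up multiplier, not
          | a real figure):
          | 
          |     dau = 1.88e9
          |     users_per_sec = dau / 86_400  # ~21,759
          |     n = 100   # queries to service one user
          |     qps = users_per_sec * n       # ~2.2M QPS, before
          |                                   # headroom/redundancy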
        
       ___________________________________________________________________
       (page generated 2021-05-03 23:02 UTC)