[HN Gopher] Hash-based bisect debugging in compilers and runtimes
       ___________________________________________________________________
        
       Hash-based bisect debugging in compilers and runtimes
        
       Author : rsc
       Score  : 236 points
        Date   : 2024-07-18 14:36 UTC (1 day ago)
        
 (HTM) web link (research.swtch.com)
 (TXT) w3m dump (research.swtch.com)
        
       | tekknolagi wrote:
       | This is very cool and seems similar to what we do/did in Cinder:
       | https://bernsteinbear.com/blog/cinder-jit-bisect/
       | 
        | EDIT: Oops, this tool is already mentioned in the post (but
        | this write-up is not linked there, so here it is if you want
        | to read about it). Neat!
        
         | rsc wrote:
         | Thanks for that link. I've changed the link beneath the Cinder
         | mention to that page instead of the GitHub page.
        
       | tabbott wrote:
       | This is so cool!
       | 
        | It reminds me a bit of one of my favorite debugging techniques:
       | Running an individual test with coverage reporting enabled, and
       | then clicking around the HTML coverage report to see exactly what
       | code path the test followed, without needing to use either print
       | statements or a specialized debugger.
       | 
       | Very helpful for answering questions of the form "Why does this
       | test not exercise the code I thought it did?".
        
       | chc4 wrote:
        | When I read
        | https://marcan.st/2017/12/debugging-an-evil-go-runtime-bug/ and
        | saw the hash bisection trick for the first time I was super
        | impressed, it really does sound incredibly slick :) I imagine
        | that's how first coming across normal git bisect must feel to
        | new engineers.
        
       | carry_bit wrote:
       | A closely related technique for debugging optimization passes is
       | that of "optimization fuel". Each rewrite decreases the fuel by
       | one, and when the fuel is gone no more rewrites happen. You can
       | then perform binary search on the optimization fuel to find a
       | specific rewrite instance that breaks things.
        
         | rsc wrote:
         | Yes, I believe LLVM has a flag for "run only the first N
         | optimization passes" that gets used with binary search this
         | way.
         | 
         | A global optimization fuel that worked at finer granularity
         | would be even more precise but you'd have to have the compiler
         | run single-threaded to make sure the numbering always matches.
         | At least in the Go compiler, we compile different functions in
         | different threads, so that fouls up any global numbering.
        
       | camgunz wrote:
       | This is pretty close to what I built for maintaining demo
       | compatibility in Doom engines. Basically it runs a demo and dumps
       | a save game to a file every frame. As soon as there's a
       | divergence it says what the difference is (monster 17 is at (4,
       | 22); should be (4, 21)) and bails. Not a ton of difference
       | between that and diffing the stack.
       | 
       | https://github.com/camgunz/democomp
        
         | rsc wrote:
         | What you are describing sounds like comparing traces of two
         | supposed-to-be-identical programs to see where they first
         | diverge. That's not what this is describing. There is no trace
         | at all, nor any decision about what to trace or what not to
         | trace.
         | 
         | There is only a decision about whether to use the old or new
         | implementation based on the hash of the call stack at that
         | moment. Then you binary search on hash values to identify the
         | exact call stack hash for which the new implementation causes a
         | problem. Then you run the program once more with the
         | instructions "print the stack with this hash".
        
         | IshKebab wrote:
         | What you implemented is just checking output against a
         | reference. That's an extremely common testing method.
         | 
         | This is significantly different because it actually changes the
         | program that is being run. It's not common at all - probably
         | because there aren't too many situations where it applies. You
         | need to be processing some large input that you don't
         | understand, and commonly change bits of your code in ways that
         | you can turn on and off dynamically during a single run.
        
       | Scaevolus wrote:
       | Aside: bisecting flakes doesn't _have_ to involve repeated runs.
       | You can reformulate bisection as an information probing
       | operation, expanding the scope to support noisy benchmarks or
       | low-probability flakes. Bayesian inference narrows down the
       | probable range of the failure for each new observation, and you
       | can choose new probes to maximize information gain-- or even run
       | them in parallel to minimize the total time.
       | 
       | You _do_ have to provide flake rate probability to do the
       | probability estimates, but even roughly correct rates work fine.
       | Running bisects assuming a 5% chance of getting a false positive
       | or negative barely adds more steps and greatly improves
       | robustness.
       | 
       | The math is pretty simple too-- my old prototype might still be
       | in Google's monorepo; I should reimplement it for the open source
       | world.
        
         | rsc wrote:
         | Indeed. Something like this is what I intend to do when I get
         | some time.
        
         | eru wrote:
         | Could you briefly sketch out the math (or point to a sketch) so
         | that other people can pick it up? I'm quite interested!
         | 
         | (I suspect you could get pretty far with a Monte Carlo
         | simulation, and that would let you bypass most of the math
         | anyway.)
        
           | ajb wrote:
           | Here is a doc explaining a Bayesian search algorithm; not
           | sure if it's precisely what GP has in mind.
           | 
           | https://github.com/Ealdwulf/BBChop/blob/master/BBChop/doc/Ba.
           | ..
           | 
           | (Edit- had wrong link originally)
        
             | eru wrote:
             | Thanks!
        
           | dwattttt wrote:
           | For Bayesian Inference, the example in the Wikipedia article
           | is good (who doesn't like cookies?):
           | https://en.m.wikipedia.org/wiki/Bayesian_inference#Examples
           | 
           | When gathering more evidence, you'd use your new belief about
           | which cookie bowl Fred has as P(H1)=0.6 and P(H2)=0.4
        
             | eru wrote:
             | That just explains Bayesian inference in general, which is
             | useful, but not what I'm after.
             | 
             | I was specifically interested in the application to binary
             | search / bisection in the presence of flaky tests.
        
           | jhfdbkofdchk wrote:
           | The Probabilistic Bisection Algorithm by Horstein, original
           | paper is "Sequential transmission using noiseless feedback,"
           | IEEE Trans. Inform. Theory, 9(3):136-143.
           | https://ieeexplore.ieee.org/document/1057832
           | 
           | Here's a nice presentation on an analysis of the method,
           | references start on slide 47
           | 
           | https://people.orie.cornell.edu/pfrazier/Presentations/2014..
           | ..
           | 
           | Paraphrasing the theorem from Jedynak, Frazier, Sznitman
           | (2011):
           | 
           | Suppose [flake probability] is constant, known, and bounded
           | away from 1/2, and we use the entropy loss function. ... The
           | policy that chooses [next test point] at the median of
           | [current posterior distribution] is optimal.
           | 
           | And some python code that implements a version of the
           | analysis
           | 
           | https://github.com/choderalab/thresholds/blob/master/thresho.
           | ..
        
             | eru wrote:
             | Thanks!
        
       | drivebycomment wrote:
       | In my former life, I used to maintain a script that can be given
       | two sets of objects files, one compiled with optimization, and
       | one without, and the script will effectively do a binary search
       | by choosing which object files you link and run the executable to
       | determine success/fail. Each iteration is quick, since linking
       | step is usually fast. This was useful when troubleshooting a big
       | binary, since optimized build back then was often quite slow for
       | a large executable.
        
       | millipede wrote:
       | Are there any non-compiler use cases for this technique?
        
         | saagarjha wrote:
         | You can use this for any codebase of sufficient complexity
         | where you want to identify which change is causing the effect
         | you're observing.
        
           | IshKebab wrote:
           | Not really, it's only suitable where your code is processing
           | some other large input that you don't understand, _and_ when
           | your changes are usually easily toggleable at runtime and can
           | be applied to only part of the input.
        
             | saagarjha wrote:
              | It worked just fine for me when that wasn't the case.
        
             | rsc wrote:
             | I would frame it differently: stack hash bisect is only
             | suitable where your code is linked into some other large
             | program that you don't understand, and when your changes
             | can be toggled on a per-invocation basis. Then you can use
             | it to identify the exact stack through the larger program
             | that leads to your code and works with the old behavior but
             | breaks with the new behavior.
             | 
             | (saagarjha seems to be talking about 'git bisect', which is
             | not really what the post is about.)
        
         | rsc wrote:
         | Anyone working on libraries that are used by large programs can
         | use them, like in the sort and timer cases described toward the
         | end of the paper. When you work on libraries used by other
         | larger programs you inevitably break them accidentally. This
         | technique pinpoints the exact context of the breakage.
        
         | Twirrim wrote:
         | All sorts of stuff, I git bisect reasonably regularly.
         | 
         | You can even do things like automatically git bisect the linux
         | kernel, with a little care. I wrote this up a few years ago
         | https://paulgraydon.co.uk/posts/2020-12-27-automated-
         | kernel-....
         | 
          | I can't talk about the actual case that led me to ever need to
         | git bisect a kernel in the first place, but at the same time I
         | also learned how to make a minimalist one-C-file initrd,
         | because what I needed to catch was visible under a /sys or /dev
         | mount, or something like that. I later realised it's possible
         | to do the same thing with Go in slightly easier syntax, if you
          | cajole it into producing a statically compiled binary!
        
           | IshKebab wrote:
           | He means the runtime bisection that this article is about,
           | not `git bisect`.
        
       | saagarjha wrote:
       | I'm running one of these now, interestingly enough. Apparently
       | something's broken in OpenJDK if you build with the new Xcode. So
       | I'm bisecting on all the files (choosing either the old or new
       | compiler) trying to see which one is breaking things.
        
       | kragen wrote:
       | this is a wonderful post!
       | 
       | the algorithms for using binary search to efficiently reduce a
       | set satisfying some predicate to a locally minimal satisfying
       | subset* are new to me (though cox says zeller published a
       | slightly buggy version in 01999! and meta's cinder a correct one
       | in 02021), and seem brilliant; their applications are not limited
       | to debugging. i wonder how it relates to hypothesis's test-case
       | reduction algorithm; can one of them be considered an application
       | of the other?
       | 
       | also, this idea of binary-search debugging of the program's call
       | tree rather than your revision history (or program input, or a
       | set of data instances) is also a brilliant one. and although they
       | published it a decade ago, i hadn't heard about it until now
       | 
       | the examples of asynctimerchan=1, changing optimization settings,
       | and changing sort algorithms have in common that in some sense
       | they are behavior-preserving, so you can toggle them on and off
       | at will during execution without breaking anything. i wonder how
       | to apply this call-tree debugging if the change you're trying to
       | narrow down is a change that has to be consistent throughout the
       | program's execution. for example, suppose some code using your
       | hash tables breaks when you switch to a new hash function, maybe
       | because it inadvertently depended on enumeration order. if you
       | change the hash function partway through the program, you won't
       | be able to find things in your hash tables after that. you could
       | change the algorithm per table, of course, and narrow it down to
       | a particular table, but that won't give you the particular line
       | of code
       | 
       | i need to think a bit more about this issue of 'hashing a list of
       | program counters'. you could of course number the sequence of all
       | subroutine invocations during a (deterministic! single-threaded!)
       | execution, as gas does for macro invocations, and binary-search
       | that dense numbering. (this is a variant of the technique
       | carry_bit is calling 'optimization fuel', but one that requires
       | support from a compiler or debugger.) but, since you're toggling
       | options on and off that will change the number of subroutine
       | calls, the numbering won't be stable; so this will tend to only
       | reliably find single-culprit failures
       | 
       | you could possibly get a stable-enough numbering using pathnames
        | like /3/5/1, meaning the 1st subroutine called from the 5th
       | subroutine called from the 3rd subroutine called from main().
       | that seems like it might in some sense be _stabler_ than hashing
       | the _entire_ list of return addresses, and it would certainly
       | permit a lower-overhead implementation using a debugger and
        | breakpoints rather than a check in every leaf call. plausibly
        | i'm overlooking a flaw in this form of 'sequential numbering'?
       | does the hashed list get truncated at some point for stability?
       | 
       | often when you have a change that is in some sense behavior-
       | preserving, which is to say, you have two ways to do the same
       | thing, you can use generative testing systems like hypothesis to
       | detect bugs in either of them: process the same input through
       | both paths and verify that the results are equivalent in the
       | appropriate sense. this doesn't require the instrumentation
       | infrastructure russ is using here, but it does depend on you
       | being able to identify the relevant 'input', which can be very
       | hard
       | 
       | in itself that doesn't help with the kinds of bugs he's talking
       | about here, though: bugs where both the old and new code is
       | 'equivalent' by your lights, but some other client code that
       | calls it doesn't find it equivalent. this suggests a different
       | generative-testing approach: generatively inject behavioral
       | perturbations which don't violate equivalence, attempting to
       | provoke failures in client code. aslr and hash-table seed
       | randomization are doing this for us for some purposes, but unlike
       | generative-testing frameworks, they provoke outages in
       | production, don't do test-case minimization, and don't record
       | failing cases to make bisection easy and prevent later
       | regressions. and they don't do things like shuffling the input to
       | a non-stable sort subroutine
       | 
       | binary-search debugging does indeed feel magical. scaevolus seems
       | to be saying there's a bayesian generalization of it for
       | nondeterministic bugs that are effectively random? you can of
       | course run the test 5 (or 1000) times on each revision you're
       | binary-searching over, but it feels like, if the number of
       | revisions you're searching over is several thousand, you ought to
       | be able to get some additional advantage out of running the test
       | once on each of 5 (or 1000) revisions. can you solve this just by
       | maximizing the expected shannon information of each test?
       | 
       | on a side note, it's pretty appalling that 30 years ago the plan9
       | group had `yesterday -d -n 7 anyfilename` to see what changed in
       | the last week, thanks to their optical jukebox, while in the
       | mainstream we still struggle with accidental file deletion and
       | overwrites despite routinely carrying around terabytes in our
       | pockets
       | 
       | on an even more marginally relevant note, earlier this week i was
       | perusing the 7th edition unix kernel, in which the subroutine
       | that switches stacks (the one with the well-known comment in 6th
       | edition) is called swtch(). and tonight i just realized why russ
       | cox uses that domain name
       | 
       | ______
       | 
       | * conventionally this is just called a 'minimal satisfying
       | subset', because it's 'minimal' in the partial-order sense, but i
       | think cox's term 'locally minimal' is clearer
        
         | derdi wrote:
         | > this suggests a different generative-testing approach:
         | generatively inject behavioral perturbations which don't
         | violate equivalence, attempting to provoke failures in client
         | code
         | 
         | There is a compiler fuzzing method called "equivalence modulo
         | inputs" that does something like this. Take a fuzzed program
         | with fixed inputs and profile it to find parts that are never
         | executed. Change only code in those never executed parts. If
         | the original and modified programs behave differently, you
         | found a compiler bug: The compiler got confused by code that it
         | can see but that (unknown to the compiler) cannot dynamically
         | influence the program's behavior on the given inputs.
         | 
         | Like so many things in fuzzing, this sounds silly but actually
         | finds bugs in practice.
        
         | rsc wrote:
         | One small correction: we were doing bisect over compiler
         | optimizations a decade ago, but bisect over call trees only
         | happened in the past year or so.
         | 
         | For the hash table case, deciding the function per-table as you
         | suggest is probably good enough. In a large program there are
         | going to be tons of hash tables, and if bisect gets you to
         | "it's this specific hash table allocated by this call stack
         | that matters", that's a huge win.
         | 
         | The timer change is much like the hash table one, in that we
         | toggle the "timer kind" at creation time, but it's only later
         | operations that actually change behavior. Still, identifying
         | the specific timer and call stack that created it turned out to
         | be good enough in all cases.
        
           | kragen wrote:
           | thank you very much for the correction and the clarifications
           | --and for the excellent post
           | 
           | does the list of callsite return counters being hashed get
           | truncated at some point for stability? i'd think that hashing
            | the call stack f-g-j-k-m-n-o-p to a hash unrelated to
            | f-g-h-j-k-m-n-o-p would tend to interfere with the 'forced' set,
           | but maybe this is only being used for things where the
           | behavior is so closely equivalent that this problem doesn't
           | arise
        
             | rsc wrote:
             | The assumption is that for a given test program, the call
             | stack used at the specific moment where the failure begins
             | is the same each run. If that's not true, there will be
             | problems.
             | 
             | If there are two different call stacks that get to the
             | function in question, but only one of them is where the
             | failure begins, you absolutely want to distinguish those.
             | 
             | In practice we do cut off the stacks, using just 16 frames,
             | but that's to avoid O(stack depth) overhead, not to
             | coalesce different things.
        
       | MatzeBraun wrote:
       | LLVM can bisect down to individual transformation steps within a
       | pass (for passes that implement the interface):
       | https://llvm.org/docs/OptBisect.html
       | 
       | And there is a script bisecting functions when given assembly
       | produced by working baseline and "bad" compiler to compare:
       | https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...
        
         | tekknolagi wrote:
         | This is pretty cool. Thanks for the pointer.
         | 
         | Also, hi Matthias!
        
       | mirrorlake wrote:
       | In the event that this is added to the standard library, I'm
       | going to be really curious to see what a "hello world"
       | project/example would look like.
       | 
       | I went so far as to find the commit where David Chase added for
       | loopvar on Mar 6, 2023 (github: golang/go/commit/c20d959) to try
       | to design my own hello world with x/tools/cmd/bisect, but I'm out
       | of my depth.
       | 
       | The hash tree is a great visualization. I wouldn't have grasped
       | the importance of the hash suffix until I saw the tree. Awesome
       | stuff.
        
         | rsc wrote:
         | If we add it to the standard library it will be for runtime
         | uses (identify the call stack for which swapping in the new
         | implementation or behavior breaks the test). It is pretty
         | simple. You call 'godebug.New' in a global init to get a lookup
         | keyed to a particular setting, and then you look at that
         | setting at runtime. For example take a look at
         | https://go.dev/src/crypto/x509/x509.go#L881 and look at the
         | uses of x509sha1 a few lines later.
         | 
         | You just call x509sha1.Value() and make a decision about old or
         | new behavior based on that. Bisect can then toggle that result
         | based on the hash of the call stack to narrow down the exact
         | path that leads to the critical decision.
         | 
         | You can ignore the IncNonDefault call a few lines later. That's
         | about providing monitoring for whether any non-default (old)
         | behaviors are being executed in a program, so that people can
         | hook it up to something like prometheus and see whether they
         | would be affected by changing to "always default behavior".
        
       ___________________________________________________________________
       (page generated 2024-07-19 23:10 UTC)