[HN Gopher] Hash-based bisect debugging in compilers and runtimes
___________________________________________________________________
Hash-based bisect debugging in compilers and runtimes
Author : rsc
Score : 236 points
Date : 2024-07-18 14:36 UTC (1 day ago)
(HTM) web link (research.swtch.com)
(TXT) w3m dump (research.swtch.com)
| tekknolagi wrote:
| This is very cool and seems similar to what we do/did in Cinder:
| https://bernsteinbear.com/blog/cinder-jit-bisect/
|
| EDIT: Oops, this tool is mentioned in the post already (but this
| blog post about it is not, so here it is if you want to read
| about it). Neat!
| rsc wrote:
| Thanks for that link. I've changed the link beneath the Cinder
| mention to that page instead of the GitHub page.
| tabbott wrote:
| This is so cool!
|
| It reminds me a bit of one of my favorite debugging techniques:
| running an individual test with coverage reporting enabled, and
| then clicking around the HTML coverage report to see exactly what
| code path the test followed, without needing to use either print
| statements or a specialized debugger.
|
| Very helpful for answering questions of the form "Why does this
| test not exercise the code I thought it did?".
| chc4 wrote:
| When I read https://marcan.st/2017/12/debugging-an-evil-go-
| runtime-bug/ and saw the hash bisection trick for the first time
| I was super impressed, it really does sound incredibly slick :) I
| imagine that's how first coming across normal git bisect must
| feel to new engineers.
| carry_bit wrote:
| A closely related technique for debugging optimization passes is
| that of "optimization fuel". Each rewrite decreases the fuel by
| one, and when the fuel is gone no more rewrites happen. You can
| then perform binary search on the optimization fuel to find a
| specific rewrite instance that breaks things.
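The fuel idea described above can be sketched in a few lines. This is a toy, not any real compiler's flag: `compileAndTest` is a hypothetical stand-in for "rebuild with at most `fuel` rewrites applied, then run the failing test", with the bug planted at rewrite 37 for illustration.

```go
package main

import "fmt"

// compileAndTest is a stand-in for "build with at most fuel rewrites
// applied, then run the test suite". The hypothetical bug here is
// introduced by rewrite number 37: any build applying it fails.
func compileAndTest(fuel int) bool {
	const buggyRewrite = 37
	return fuel < buggyRewrite // passes only if the bad rewrite was skipped
}

// findFirstBadRewrite binary-searches on the fuel value, assuming
// fuel 0 passes and maxFuel fails. The smallest failing fuel value
// is the number of the culprit rewrite.
func findFirstBadRewrite(maxFuel int) int {
	lo, hi := 0, maxFuel // invariant: lo passes, hi fails
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		if compileAndTest(mid) {
			lo = mid
		} else {
			hi = mid
		}
	}
	return hi
}

func main() {
	fmt.Println("first bad rewrite:", findFirstBadRewrite(1000))
}
```

Each probe rebuilds once, so the culprit among N rewrites is found in about log2(N) builds.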
| rsc wrote:
| Yes, I believe LLVM has a flag for "run only the first N
| optimization passes" that gets used with binary search this
| way.
|
| A global optimization fuel that worked at finer granularity
| would be even more precise but you'd have to have the compiler
| run single-threaded to make sure the numbering always matches.
| At least in the Go compiler, we compile different functions in
| different threads, so that fouls up any global numbering.
| camgunz wrote:
| This is pretty close to what I built for maintaining demo
| compatibility in Doom engines. Basically it runs a demo and dumps
| a save game to a file every frame. As soon as there's a
| divergence it says what the difference is (monster 17 is at (4,
| 22); should be (4, 21)) and bails. Not a ton of difference
| between that and diffing the stack.
|
| https://github.com/camgunz/democomp
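The frame-by-frame comparison described here can be sketched as follows; the `state` type and the monster example are illustrative stand-ins for democomp's per-frame savegame dumps, and both runs are assumed to have the same length.

```go
package main

import "fmt"

// state is a stand-in for a per-frame savegame dump: here just each
// monster's position.
type state map[string][2]int

// firstDivergence replays two supposedly identical runs frame by
// frame and reports the first frame and entity where they differ,
// bailing immediately as democomp does.
func firstDivergence(ref, got []state) (frame int, msg string, ok bool) {
	for f := range ref {
		for name, want := range ref[f] {
			if have := got[f][name]; have != want {
				return f, fmt.Sprintf("%s is at %v; should be %v", name, have, want), false
			}
		}
	}
	return -1, "", true
}

func main() {
	ref := []state{
		{"monster17": {4, 20}},
		{"monster17": {4, 21}},
	}
	got := []state{
		{"monster17": {4, 20}},
		{"monster17": {4, 22}}, // diverges on frame 1
	}
	if f, msg, ok := firstDivergence(ref, got); !ok {
		fmt.Printf("divergence at frame %d: %s\n", f, msg)
	}
}
```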
| rsc wrote:
| What you are describing sounds like comparing traces of two
| supposed-to-be-identical programs to see where they first
| diverge. That's not what this is describing. There is no trace
| at all, nor any decision about what to trace or what not to
| trace.
|
| There is only a decision about whether to use the old or new
| implementation based on the hash of the call stack at that
| moment. Then you binary search on hash values to identify the
| exact call stack hash for which the new implementation causes a
| problem. Then you run the program once more with the
| instructions "print the stack with this hash".
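A simplified sketch of that decision point, assuming nothing about the real runtime's internals: hash the current call stack's PCs, and enable the new implementation only when the hash's low bits match a pattern. A driver such as x/tools/cmd/bisect grows the matched suffix run over run to home in on one stack; the helper names here are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"runtime"
)

// hashStack hashes the current call stack's program counters. The
// real implementation is more careful; any stable hash of the PCs
// works for the sketch.
func hashStack() uint64 {
	pcs := make([]uintptr, 16) // cap the depth, as the Go runtime does
	n := runtime.Callers(2, pcs)
	h := fnv.New64a()
	for _, pc := range pcs[:n] {
		var b [8]byte
		for i := range b {
			b[i] = byte(pc >> (8 * i))
		}
		h.Write(b[:])
	}
	return h.Sum64()
}

// useNew reports whether the new implementation should run at this
// moment: yes iff the low `bits` bits of the stack hash equal
// pattern. Binary-searching on longer and longer suffixes isolates
// the single culprit stack.
func useNew(pattern uint64, bits uint) bool {
	mask := uint64(1)<<bits - 1
	return hashStack()&mask == pattern
}

func fromA() uint64 { return hashStack() }
func fromB() uint64 { return hashStack() }

func main() {
	// Different call stacks hash differently, so each call path can
	// be toggled independently.
	fmt.Println("stacks differ:", fromA() != fromB())
	// With bits == 0 every stack matches: everything runs new.
	fmt.Println("empty suffix matches:", useNew(0, 0))
}
```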
| IshKebab wrote:
| What you implemented is just checking output against a
| reference. That's an extremely common testing method.
|
| This is significantly different because it actually changes the
| program that is being run. It's not common at all - probably
| because there aren't too many situations where it applies. You
| need to be processing some large input that you don't
| understand, and commonly change bits of your code in ways that
| you can turn on and off dynamically during a single run.
| Scaevolus wrote:
| Aside: bisecting flakes doesn't _have_ to involve repeated runs.
| You can reformulate bisection as an information probing
| operation, expanding the scope to support noisy benchmarks or
| low-probability flakes. Bayesian inference narrows down the
| probable range of the failure for each new observation, and you
| can choose new probes to maximize information gain-- or even run
| them in parallel to minimize the total time.
|
| You _do_ have to provide flake rate probability to do the
| probability estimates, but even roughly correct rates work fine.
| Running bisects assuming a 5% chance of getting a false positive
| or negative barely adds more steps and greatly improves
| robustness.

|
| The math is pretty simple too-- my old prototype might still be
| in Google's monorepo; I should reimplement it for the open source
| world.
| rsc wrote:
| Indeed. Something like this is what I intend to do when I get
| some time.
| eru wrote:
| Could you briefly sketch out the math (or point to a sketch) so
| that other people can pick it up? I'm quite interested!
|
| (I suspect you could get pretty far with a Monte Carlo
| simulation, and that would let you bypass most of the math
| anyway.)
| ajb wrote:
| Here is a doc explaining a Bayesian search algorithm; not
| sure if it's precisely what GP has in mind.
|
| https://github.com/Ealdwulf/BBChop/blob/master/BBChop/doc/Ba.
| ..
|
| (Edit- had wrong link originally)
| eru wrote:
| Thanks!
| dwattttt wrote:
| For Bayesian Inference, the example in the Wikipedia article
| is good (who doesn't like cookies?):
| https://en.m.wikipedia.org/wiki/Bayesian_inference#Examples
|
| When gathering more evidence, you'd use your new belief about
| which cookie bowl Fred has as P(H1)=0.6 and P(H2)=0.4
| eru wrote:
| That just explains Bayesian inference in general, which is
| useful, but not what I'm after.
|
| I was specifically interested in the application to binary
| search / bisection in the presence of flaky tests.
| jhfdbkofdchk wrote:
| The Probabilistic Bisection Algorithm by Horstein, original
| paper is "Sequential transmission using noiseless feedback,"
| IEEE Trans. Inform. Theory, 9(3):136-143.
| https://ieeexplore.ieee.org/document/1057832
|
| Here's a nice presentation on an analysis of the method,
| references start on slide 47
|
| https://people.orie.cornell.edu/pfrazier/Presentations/2014..
| ..
|
| Paraphrasing the theorem from Jedynak, Frazier, Sznitman
| (2011):
|
| Suppose [flake probability] is constant, known, and bounded
| away from 1/2, and we use the entropy loss function. ... The
| policy that chooses [next test point] at the median of
| [current posterior distribution] is optimal.
|
| And some python code that implements a version of the
| analysis
|
| https://github.com/choderalab/thresholds/blob/master/thresho.
| ..
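The median-probing policy from that theorem can be sketched directly: keep a posterior over "the failure starts at index i", probe at the posterior median, and do a Bayesian update assuming a known flake rate p < 1/2. This is an illustrative toy, not the referenced code; the demo's tester happens to never flake, but the search still budgets for 5% noise.

```go
package main

import "fmt"

// probabilisticBisect locates the first failing index in [0, n) when
// the tester may lie with probability p (p < 0.5). test(i) reports
// whether index i is (noisily) observed bad.
func probabilisticBisect(n int, p float64, test func(int) bool, probes int) int {
	post := make([]float64, n)
	for i := range post {
		post[i] = 1.0 / float64(n) // uniform prior
	}
	for k := 0; k < probes; k++ {
		m := median(post)
		bad := test(m)
		// If the failure starts at i, test(m) should be bad iff m >= i,
		// up to noise. Reweight each hypothesis accordingly.
		total := 0.0
		for i := range post {
			if bad == (m >= i) {
				post[i] *= 1 - p // answer consistent with hypothesis i
			} else {
				post[i] *= p // answer explained only by a flake
			}
			total += post[i]
		}
		for i := range post {
			post[i] /= total // renormalize
		}
	}
	return argmax(post)
}

// median returns the first index whose cumulative posterior reaches 1/2.
func median(post []float64) int {
	c := 0.0
	for i, v := range post {
		c += v
		if c >= 0.5 {
			return i
		}
	}
	return len(post) - 1
}

func argmax(post []float64) int {
	best := 0
	for i, v := range post {
		if v > post[best] {
			best = i
		}
	}
	return best
}

func main() {
	culprit := 613
	test := func(i int) bool { return i >= culprit }
	fmt.Println("most likely first bad index:",
		probabilisticBisect(1000, 0.05, test, 60))
}
```

With truthful answers this costs only slightly more than plain binary search; with real flakes, extra probes buy back confidence instead of derailing the search.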
| eru wrote:
| Thanks!
| drivebycomment wrote:
| In my former life, I used to maintain a script that can be given
| two sets of objects files, one compiled with optimization, and
| one without, and the script will effectively do a binary search
| by choosing which object files you link and run the executable to
| determine success/fail. Each iteration is quick, since linking
| step is usually fast. This was useful when troubleshooting a big
| binary, since optimized build back then was often quite slow for
| a large executable.
| millipede wrote:
| Are there any non-compiler use cases for this technique?
| saagarjha wrote:
| You can use this for any codebase of sufficient complexity
| where you want to identify which change is causing the effect
| you're observing.
| IshKebab wrote:
| Not really, it's only suitable where your code is processing
| some other large input that you don't understand, _and_ when
| your changes are usually easily toggleable at runtime and can
| be applied to only part of the input.
| saagarjha wrote:
| It worked just fine for me when that wasn't the case.
| rsc wrote:
| I would frame it differently: stack hash bisect is only
| suitable where your code is linked into some other large
| program that you don't understand, and when your changes
| can be toggled on a per-invocation basis. Then you can use
| it to identify the exact stack through the larger program
| that leads to your code and works with the old behavior but
| breaks with the new behavior.
|
| (saagarjha seems to be talking about 'git bisect', which is
| not really what the post is about.)
| rsc wrote:
| Anyone working on libraries that are used by large programs can
| use them, like in the sort and timer cases described toward the
end of the paper. When you work on libraries used by other
larger programs, you inevitably break them accidentally. This
| technique pinpoints the exact context of the breakage.
| Twirrim wrote:
| All sorts of stuff, I git bisect reasonably regularly.
|
| You can even do things like automatically git bisect the linux
| kernel, with a little care. I wrote this up a few years ago
| https://paulgraydon.co.uk/posts/2020-12-27-automated-
| kernel-....
|
| I can't talk about the actual case that led me to ever need to
| git bisect a kernel in the first place, but at the same time I
| also learned how to make a minimalist one-C-file initrd,
| because what I needed to catch was visible under a /sys or /dev
| mount, or something like that. I later realised it's possible
| to do the same thing with Go in slightly easier syntax, if you
| cajole it into producing a statically compiled binary!
| IshKebab wrote:
| He means the runtime bisection that this article is about,
| not `git bisect`.
| saagarjha wrote:
| I'm running one of these now, interestingly enough. Apparently
| something's broken in OpenJDK if you build with the new Xcode. So
| I'm bisecting on all the files (choosing either the old or new
| compiler) trying to see which one is breaking things.
| kragen wrote:
| this is a wonderful post!
|
| the algorithms for using binary search to efficiently reduce a
| set satisfying some predicate to a locally minimal satisfying
| subset* are new to me (though cox says zeller published a
| slightly buggy version in 01999! and meta's cinder a correct one
| in 02021), and seem brilliant; their applications are not limited
| to debugging. i wonder how it relates to hypothesis's test-case
| reduction algorithm; can one of them be considered an application
| of the other?
|
| also, this idea of binary-search debugging of the program's call
| tree rather than your revision history (or program input, or a
| set of data instances) is also a brilliant one. and although they
| published it a decade ago, i hadn't heard about it until now
|
| the examples of asynctimerchan=1, changing optimization settings,
| and changing sort algorithms have in common that in some sense
| they are behavior-preserving, so you can toggle them on and off
| at will during execution without breaking anything. i wonder how
| to apply this call-tree debugging if the change you're trying to
| narrow down is a change that has to be consistent throughout the
| program's execution. for example, suppose some code using your
| hash tables breaks when you switch to a new hash function, maybe
| because it inadvertently depended on enumeration order. if you
| change the hash function partway through the program, you won't
| be able to find things in your hash tables after that. you could
| change the algorithm per table, of course, and narrow it down to
| a particular table, but that won't give you the particular line
| of code
|
| i need to think a bit more about this issue of 'hashing a list of
| program counters'. you could of course number the sequence of all
| subroutine invocations during a (deterministic! single-threaded!)
| execution, as gas does for macro invocations, and binary-search
| that dense numbering. (this is a variant of the technique
| carry_bit is calling 'optimization fuel', but one that requires
| support from a compiler or debugger.) but, since you're toggling
| options on and off that will change the number of subroutine
| calls, the numbering won't be stable; so this will tend to only
| reliably find single-culprit failures
|
| you could possibly get a stable-enough numbering using pathnames
| like /3/5/1, meaning the 1st subroutine called from the 5th
| subroutine called from the 3rd subroutine called from main().
| that seems like it might in some sense be _stabler_ than hashing
| the _entire_ list of return addresses, and it would certainly
| permit a lower-overhead implementation using a debugger and
| breakpoints rather than a check in every leaf call. plausibly
| i'm overlooking a flaw in this form of 'sequential numbering'?
| does the hashed list get truncated at some point for stability?
|
| often when you have a change that is in some sense behavior-
| preserving, which is to say, you have two ways to do the same
| thing, you can use generative testing systems like hypothesis to
| detect bugs in either of them: process the same input through
| both paths and verify that the results are equivalent in the
| appropriate sense. this doesn't require the instrumentation
| infrastructure russ is using here, but it does depend on you
| being able to identify the relevant 'input', which can be very
| hard
|
| in itself that doesn't help with the kinds of bugs he's talking
| about here, though: bugs where both the old and new code is
| 'equivalent' by your lights, but some other client code that
| calls it doesn't find it equivalent. this suggests a different
| generative-testing approach: generatively inject behavioral
| perturbations which don't violate equivalence, attempting to
| provoke failures in client code. aslr and hash-table seed
| randomization are doing this for us for some purposes, but unlike
| generative-testing frameworks, they provoke outages in
| production, don't do test-case minimization, and don't record
| failing cases to make bisection easy and prevent later
| regressions. and they don't do things like shuffling the input to
| a non-stable sort subroutine
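That last idea, shuffling the input to a sort whose contract leaves equal-key order unspecified, can be sketched concretely. Everything here is illustrative: `sortByKey` promises only ordering by key, and `buggyClient` stands in for client code that silently depends on one particular order of equal keys.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

type rec struct {
	key int
	val string
}

// sortByKey promises only ordering by key; the relative order of
// equal keys is unspecified by its contract.
func sortByKey(rs []rec) {
	sort.Slice(rs, func(i, j int) bool { return rs[i].key < rs[j].key })
}

// buggyClient stands in for client code that silently assumes the
// "b" record sorts first among the key-1 records.
func buggyClient(rs []rec) bool {
	for _, r := range rs {
		if r.key == 1 {
			return r.val == "b"
		}
	}
	return false
}

// countFailures shuffles the input before every run. Each shuffle is
// an equivalent input as far as sortByKey's contract goes, so any
// run where the client misbehaves exposes its order dependence.
func countFailures(runs int, seed int64) int {
	base := []rec{{1, "b"}, {2, "x"}, {1, "a"}, {1, "c"}}
	rng := rand.New(rand.NewSource(seed))
	failures := 0
	for run := 0; run < runs; run++ {
		rs := append([]rec(nil), base...)
		rng.Shuffle(len(rs), func(i, j int) { rs[i], rs[j] = rs[j], rs[i] })
		sortByKey(rs)
		if !buggyClient(rs) {
			failures++
		}
	}
	return failures
}

func main() {
	fmt.Printf("order-dependent client failed in %d of 100 shuffled runs\n",
		countFailures(100, 1))
}
```

Unlike production-time randomization (ASLR, hash seeds), this runs at test time with a recorded seed, so a failing case can be replayed and minimized.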
|
| binary-search debugging does indeed feel magical. scaevolus seems
| to be saying there's a bayesian generalization of it for
| nondeterministic bugs that are effectively random? you can of
| course run the test 5 (or 1000) times on each revision you're
| binary-searching over, but it feels like, if the number of
| revisions you're searching over is several thousand, you ought to
| be able to get some additional advantage out of running the test
| once on each of 5 (or 1000) revisions. can you solve this just by
| maximizing the expected shannon information of each test?
|
| on a side note, it's pretty appalling that 30 years ago the plan9
| group had `yesterday -d -n 7 anyfilename` to see what changed in
| the last week, thanks to their optical jukebox, while in the
| mainstream we still struggle with accidental file deletion and
| overwrites despite routinely carrying around terabytes in our
| pockets
|
| on an even more marginally relevant note, earlier this week i was
| perusing the 7th edition unix kernel, in which the subroutine
| that switches stacks (the one with the well-known comment in 6th
| edition) is called swtch(). and tonight i just realized why russ
| cox uses that domain name
|
| ______
|
| * conventionally this is just called a 'minimal satisfying
| subset', because it's 'minimal' in the partial-order sense, but i
| think cox's term 'locally minimal' is clearer
| derdi wrote:
| > this suggests a different generative-testing approach:
| generatively inject behavioral perturbations which don't
| violate equivalence, attempting to provoke failures in client
| code
|
| There is a compiler fuzzing method called "equivalence modulo
| inputs" that does something like this. Take a fuzzed program
| with fixed inputs and profile it to find parts that are never
| executed. Change only code in those never executed parts. If
| the original and modified programs behave differently, you
| found a compiler bug: The compiler got confused by code that it
| can see but that (unknown to the compiler) cannot dynamically
| influence the program's behavior on the given inputs.
|
| Like so many things in fuzzing, this sounds silly but actually
| finds bugs in practice.
| rsc wrote:
| One small correction: we were doing bisect over compiler
| optimizations a decade ago, but bisect over call trees only
| happened in the past year or so.
|
| For the hash table case, deciding the function per-table as you
| suggest is probably good enough. In a large program there are
| going to be tons of hash tables, and if bisect gets you to
| "it's this specific hash table allocated by this call stack
| that matters", that's a huge win.
|
| The timer change is much like the hash table one, in that we
| toggle the "timer kind" at creation time, but it's only later
| operations that actually change behavior. Still, identifying
| the specific timer and call stack that created it turned out to
| be good enough in all cases.
| kragen wrote:
| thank you very much for the correction and the clarifications
| --and for the excellent post
|
| does the list of callsite return addresses being hashed get
| truncated at some point for stability? i'd think that hashing
| the call stack f-g-j-k-m-n-o-p to a hash unrelated to f-g-h-
| j-k-m-n-o-p would tend to interfere with the 'forced' set,
| but maybe this is only being used for things where the
| behavior is so closely equivalent that this problem doesn't
| arise
| rsc wrote:
| The assumption is that for a given test program, the call
| stack used at the specific moment where the failure begins
| is the same each run. If that's not true, there will be
| problems.
|
| If there are two different call stacks that get to the
| function in question, but only one of them is where the
| failure begins, you absolutely want to distinguish those.
|
| In practice we do cut off the stacks, using just 16 frames,
| but that's to avoid O(stack depth) overhead, not to
| coalesce different things.
| MatzeBraun wrote:
| LLVM can bisect down to individual transformation steps within a
| pass (for passes that implement the interface):
| https://llvm.org/docs/OptBisect.html
|
| And there is a script that bisects over functions, given assembly
| produced by a working baseline compiler and a "bad" compiler:
| https://github.com/llvm/llvm-project/blob/main/llvm/utils/ab...
| tekknolagi wrote:
| This is pretty cool. Thanks for the pointer.
|
| Also, hi Matthias!
| mirrorlake wrote:
| In the event that this is added to the standard library, I'm
| going to be really curious to see what a "hello world"
| project/example would look like.
|
| I went so far as to find the commit where David Chase added for
| loopvar on Mar 6, 2023 (github: golang/go/commit/c20d959) to try
| to design my own hello world with x/tools/cmd/bisect, but I'm out
| of my depth.
|
| The hash tree is a great visualization. I wouldn't have grasped
| the importance of the hash suffix until I saw the tree. Awesome
| stuff.
| rsc wrote:
| If we add it to the standard library it will be for runtime
| uses (identify the call stack for which swapping in the new
| implementation or behavior breaks the test). It is pretty
| simple. You call 'godebug.New' in a global init to get a lookup
| keyed to a particular setting, and then you look at that
| setting at runtime. For example take a look at
| https://go.dev/src/crypto/x509/x509.go#L881 and look at the
| uses of x509sha1 a few lines later.
|
| You just call x509sha1.Value() and make a decision about old or
| new behavior based on that. Bisect can then toggle that result
| based on the hash of the call stack to narrow down the exact
| path that leads to the critical decision.
|
| You can ignore the IncNonDefault call a few lines later. That's
| about providing monitoring for whether any non-default (old)
| behaviors are being executed in a program, so that people can
| hook it up to something like prometheus and see whether they
| would be affected by changing to "always default behavior".
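The setting-lookup pattern rsc describes can be sketched in miniature. The real godebug package lives in Go's internal packages and is not importable; the names below (`newSetting`, `registry`) are toy stand-ins, with `x509sha1` playing the role of the real crypto/x509 setting.

```go
package main

import "fmt"

// setting is a toy version of a godebug-style named toggle.
type setting struct {
	name string
	val  string
}

var registry = map[string]*setting{}

// newSetting registers a named toggle, mirroring the "call
// godebug.New in a global init" step.
func newSetting(name, def string) *setting {
	s := &setting{name: name, val: def}
	registry[name] = s
	return s
}

func (s *setting) Value() string { return s.val }

// x509sha1 plays the role of the real crypto/x509 setting: code
// consults it at runtime to pick the old or new behavior.
var x509sha1 = newSetting("x509sha1", "0")

func verifySignature() string {
	if x509sha1.Value() == "1" {
		return "old behavior: SHA-1 allowed"
	}
	return "new behavior: SHA-1 rejected"
}

func main() {
	fmt.Println(verifySignature())
	// A bisect driver would flip the setting per call stack rather
	// than globally; flipping it globally suffices for the sketch.
	registry["x509sha1"].val = "1"
	fmt.Println(verifySignature())
}
```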
___________________________________________________________________
(page generated 2024-07-19 23:10 UTC)