[HN Gopher] How Antithesis finds bugs
___________________________________________________________________
How Antithesis finds bugs
Author : wwilson
Score : 92 points
Date : 2024-04-17 18:07 UTC (4 hours ago)
(HTM) web link (antithesis.com)
(TXT) w3m dump (antithesis.com)
| wwilson wrote:
| This is Will (I gave the talk linked in the post). Happy to
| answer any questions about this work, or how it generalizes to
| testing things that aren't Nintendo games.
| thsksbd wrote:
| How does this distinguish between a bug and a "feature"?
|
| Edit: i mean this is the spirit of Knuth's quip that when he
| dies all the bugs in tex will become features
| wwilson wrote:
| There was a LOT of discussion of this in the Q&A after the
| talk. Currently we have 4 main approaches:
|
| (1) There's some stuff that's pretty much a bug for every
| program. If it segfaults, exits with a nonzero code, OOMs,
| triggers a TSAN error, fills the disk with fatal error
| messages, etc., etc., that's pretty easy to qualify.
|
| (2) You can use our SDK to define additional custom test
| properties. Think like a normal assertions library, but you
| can also do existential quantification ("this code is
| reachable/this situation can happen") and soon temporal
| assertions ("this should never happen without this other
| thing happening first, possibly on a different node").
|
| (3) We store all the output of your system in every timeline
| in a giant analytic database and support ad-hoc querying
| against it. Think "pre-observability", observability but for
| mirror universes. You can then do all the spelunking and
| analysis you would do with your production traces, but before
| your real customers are exposed to any issue.
|
| (4) We have some very cool ML approaches in the pipeline that
| I can't talk about quite yet.
| ajb wrote:
| Can you define equivalence classes (mutations that
| shouldn't change the result) eg timing, order of events,
| idempotence, etc? So that you can use (3) to define the
| correct result for all members of the class
|
| (Sorry if this is explained in the talk - I'll watch it but
| it's now too late in the day in my timezone)
| cperciva wrote:
| Why is Mario so jumpy?
| wwilson wrote:
| This is actually an incredibly deep and difficult question to
| answer. I would expect no less from cperciva. :-)
|
| As I mention in the talk, you get very bad tactical
| performance from taking a uniform random distribution and
| piping it into the emulator. The fuzzer is exponentially
| unlikely to hold the jump button for many successive frames
| without a break. In the fully general case, I think instead
| of maximum entropy, you want something more like Marcus
| Hutter's AIXI where you spend energy on inputs inversely
| proportional to their Kolmogorov complexity. Unfortunately,
| that's uncomputable, but it turns out that just switching to
| toggling bits with low probability does a lot better than
| pure randomness. The approach is analogous to swarm testing
| (https://users.cs.utah.edu/~regehr/papers/swarm12.pdf).
|
| All of which is to say, the result that we show here is
| vastly _less_ jumpy than our first tries. The reason it 's
| still more jumpy than a human player is that our platform has
| no idea where it is in the game, or even that it's playing a
| game. So if a jump doesn't harm it in the exploration
| process, there's some chance the first input getting
| somewhere new will involve a jump, and that will then get
| locked in.
|
| We do have the capability to do optimization on inputs (what
| conventional PBT calls "shrinking"), and indeed if you apply
| this to Mario you can get it to jump a lot less and complete
| levels a lot faster. That capability didn't exist yet when
| this video was recorded. We should totally do a another post
| on this topic!
| someplaceguy wrote:
| > Unfortunately, that's uncomputable
|
| Minor nitpick, but while Kolmogorov complexity as typically
| defined is uncomputable, I would argue that this result is
| only a theoretical curiosity and mostly irrelevant.
|
| That is, the "uncomputable" Kolmogorov complexity
| computation presupposes that you have a Turing machine,
| i.e. a machine with _literally_ infinite memory, which is
| not possible to construct in our universe. Or
| alternatively, it presupposes that "computable function"
| is one that can be computed by a machine with an infinite
| amount of storage, which amounts to the same thing as
| having a Turing machine.
|
| You could probably define some version of Kolmogorov
| complexity that is parametrized by the memory size (e.g. of
| a linear bounded automaton or similar model that better
| represents a computer with finite resources), which should
| make it computable. That said, in practice it would
| probably take an unreasonable amount of time to perform
| this computation (but that is orthogonal to whether it's
| computable or not).
| wwilson wrote:
| Thank you. Comments like this are why I love HN.
| anyfoo wrote:
| While reading your comment, I kept having the vague
| feeling that your idea of "theoretically computable"
| (within the given memory bounds) will still leave some
| exponential time and/or space complexity, and so
| effectively will still be "practically not computable".
|
| Your last paragraph then seemed to confirm that, i.e.
| there was no particular shortcut for this specific case
| that would make it any different from general Kolmogorov
| complexity.
|
| In that sense, isn't your comment itself also "only a
| theoretical curiosity"? As we went to from
| "uncomputable", to "theoretically computable under the
| given constraints", to "practically uncomputable"?
|
| While suffering from lack of rigor, I think a lot of
| times--probably even the majority of times outside purely
| cs-theoretical treatments--when we colloquially speak of
| "uncomputable", we are always talking about practical
| computers without an infinite touring tape, and so
| actually mean "practically uncomputable".
|
| Because, yes, while everyone immediately understands that
| actual computers don't have infinite memory, at the same
| time everyone understands that "exponential time" is
| still "never" in practical terms.
| gwern wrote:
| Most RL agents are 'jumpy' or jitter a lot, because it makes
| no difference to the reward, and where it does make a
| difference, that tends to only matter close to convergence
| where slight speedups are all that's left. If you want to
| reduce that, you have to reward-shape it to penalize excess
| movement. (Which is relevant in robotics, where 'jitter' can
| be very bad for the machinery in a way not reflected in the
| simple tasks' reward functions.)
| anyfoo wrote:
| Case in point: Watch an actual Mario speed running world
| record video, and the human players there are jumping
| around a lot while running around flat sections with no
| obstacles.
|
| As I understand, the jumping (at least in this case) does
| nothing to Mario's horizontal speed at all, so they
| basically just do it for "fun".
|
| The difference is that the human players know that they're
| doing something inconsequential for variety, while the
| fuzzer has no "idea" (i.e. no concept) that the jumping
| does not matter, or even that it is "jumping". More
| generally, it does not know nor care that this particular
| part of the input is irrelevant. It just found an input
| that works, and sometimes that happens to include a useless
| jump.
| mustknow2201 wrote:
| How close is or isn't this to a genetic algorithm in practice?
| It seems like as soon as you start scaling out to multiple
| threads/nodes you'd benefit from crossover, selection
| techniques and so on?
| Taikonerd wrote:
| Super Mario is such a fun example. Well chosen.
| mrkmarron wrote:
| FYI playing Super Mario with fuzzing (AFL) was done in a fun 2020
| S&P paper. Also finds bugs and security issues.
|
| "IJON: Exploring Deep State Spaces via Fuzzing"
| https://casa.rub.de/fileadmin/img/Publikationen_PDFs/2020_IJ...
| wwilson wrote:
| Thanks for flagging this! The work we're announcing today was
| completed in 2018, and we have since moved on to much more
| challenging problems both in the Nintendo domain and elsewhere.
| Totally not looking to pick a fight over priority though. This
| is such a hilariously understudied and under-explored area, we
| really value anybody else who's trying to work on these
| problems.
| mrkmarron wrote:
| I think you underestimate the level to which this area has
| been studied. And I wish you would talk about these new
| results then instead of announcing 5+ year old results then.
|
| It would be great to see progress in this area (not my
| primary area of work BTW) but I am not seeing anything here,
| technically, that is going to make that happen -- maybe it is
| just getting all the parts in place and magic happens. It
| just makes me scratch my head a bit.
| wwilson wrote:
| It's possible you did not make it to the end of the talk
| where I explain this, but the thing that excites me is that
| we can now apply fuzzing and related techniques to things
| which are neither Nintendo games nor tiny stateless
| libraries and parsers, because of this:
| https://antithesis.com/blog/deterministic_hypervisor/
|
| As for getting to the newer stuff, yeah, totally, just give
| us some time. There's a bit of a backlog. :-)
| mrkmarron wrote:
| I just rewatched the end of the video to make sure I
| didn't miss anything. Deterministic execution and replay
| is very-very well-known and understood. It is possible
| that your packaging and market fit is right on. Lots of
| cottage industry in DB testing and bug finding -- but not
| clear how this generalizes and why something like Coyote
| [1] (to pick one) wouldn't work as well.
|
| So, fuzzing has been applied to very stateful and very
| large industrial systems for some time. And yes it is
| very cool but I feel like I am seeing more "sizzle than
| steak" so to speak. Great engineering though, hypervisor
| work is very challenging.
|
| [1] https://www.microsoft.com/en-us/research/blog/coyote-
| making-...
| wwilson wrote:
| It is absolutely possible to write a large stateful
| system from the ground up so that autonomous testing
| techniques can be applied to it. FoundationDB and Tiger
| Beetle are both examples of this, I think Resonate might
| be another one, and Toby Bell's talk at Strange Loop last
| year is a great guide on how to do so.
|
| What's much harder is to take an arbitrary system,
| written in an arbitrary way, without these techniques in
| mind, and make it amenable to this sort of testing. From
| the start of our company, we believed that unless this
| was possible, the market would be too hard to crack,
| because most human beings are not very foresightful and
| not able to justify a bunch of extra work.
|
| Hypervisor-based snapshot fuzzing like Nyx-Net and
| deterministic userspaces like Facebook's now-discontinued
| Hermit project are the other ways I know of accomplishing
| that goal. We believe that both of them have some pretty
| serious practical limitations which our approach does not
| share.
|
| EDIT: Maybe the way to get to the crux of the
| disagreement is for me to turn the question around. Why
| do you believe that the vast majority of stateful and
| concurrent systems are not tested with fuzzing?
| mrkmarron wrote:
| At the end of the day you have 2 problems (1) how to make
| execution deterministic within some boundary be it a
| process, hypervisor, or distributed system and (2) how
| you handle non-determinism when data crosses this
| boundary. You can move the boundaries and effort around
| but the problems always exist. So, if you are claiming
| that you have a sweet spot on this tradeoff then I could
| certainly believe that, if you claim that you eliminated
| this boundary issue then I am highly credulous.
|
| I'll agree with you on indeterminate behaviors though, I
| suspect they will eventually be seen like the "billion
| dollar" mistake of null pointers.
| mrkmarron wrote:
| Just saw the edit. I have 2 answers:
|
| 1) Fuzzing is under-utilized even for simple code. AFL is
| dead easy to use and, even so, most projects don't have
| it in CI runs. So, despite how much I like it, in general
| it seems people do not see value in this type of testing.
|
| 2) The effort to handle external state (say a restful
| call to get stock ticker info) needs to be mocked --
| which is deeply unpopular -- or handled by record/replay
| which works ok-ish but eventually breaks down with
| divergences. Outside of well-chosen domains it these
| eventually pop-up and add an additional pain point that
| builds on item 1.
| chc4 wrote:
| A lot of fuzzers use Mario or other simple games as an internal
| testcase. I'm aware of a hypervisor fuzzer from 2016 that did
| it, and I'm positive there are others (both before and since).
| Hell, tom7 has a fuzzer for exploring program states that uses
| Super Mario Bros as the example from 2013
| (https://www.cs.cmu.edu/~tom7/mario/mario.pdf, plus a youtube
| video https://youtu.be/xOCurBYI_gY), and he's definitely not
| the first either.
| mrkmarron wrote:
| Thanks for sharing, I felt like there were earlier but the
| x,y trick jumped out at me and that was the one I remembered
| off the top of my head.
| wwilson wrote:
| We are huge fans of tom7 and that paper was one of our
| inspirations for using NES as a domain for researching
| autonomous state space search! I think he does a very good
| job of explaining why the problem is hard.
| chc4 wrote:
| Right, in case it wasn't clear to readers this isn't a bad
| thing. Lots of people use games because they're good
| analogs for other programs and evocatively show fuzzing
| exploration progress. Not being the first to point a fuzzer
| at Mario doesn't matter.
| infogulch wrote:
| Have any of these methods found clips and speedrunning
| shortcuts? Examples: Clip the base of the flagpole to skip some
| animation time. Clip into and walk below the floor to run past
| obstacles. Etc.
| wwilson wrote:
| We find a ton of glitches in Mario, and even more in other
| games. See, e.g.: https://vimeo.com/807601164
| wwilson wrote:
| Hmm... looks like Vimeo is currently having some kind of
| outage. If anybody at Vimeo is reading this and wants some
| help with testing backend systems, email is in bio.
| yanniszark wrote:
| This is fascinating! I thought only Reinforcement Learning was
| doing things like this but you're saying you can do this via
| fuzzying? What does this mean exactly? How is it able to learn to
| advance through all these levels? Is there an underlying learning
| mechanism at play?
| vojev wrote:
| There's no learning exactly, as the post explains the fuzzer is
| aware of various RAM addresses (as well as having a tactic for
| how it "presses" buttons in the game). It's just trying to
| explore the space of Mario's level + his x and y coordinates.
|
| (I'm an Antithesis employee.)
| bbor wrote:
| At what point can we start suing companies on behalf of the
| commons for taking words from the lexicon? I miss the days when
| this would be called "Wilson & Co.'s automated testing solution"
| instead of such a beautiful, philosophically meaningful word.
| Same thoughts on that Devin.AI scam taking the name "Cognition"
| and Vercel somehow bribing their way into claiming the "ai" name
| on NPM.
|
| Technically awesome post tho! Love the heatmap esp. Maybe bring
| up changing your name to investors because some rando online
| doesn't like it though, please.
| m3kw9 wrote:
| Doesn't explain how it finds bugs it's just had the AI play Mario
| bros
| anyfoo wrote:
| I think everyone in the target audience of this blog post is
| immediately able to make the connection.
___________________________________________________________________
(page generated 2024-04-17 23:00 UTC)