[HN Gopher] How Antithesis finds bugs
       ___________________________________________________________________
        
       How Antithesis finds bugs
        
       Author : wwilson
       Score  : 92 points
       Date   : 2024-04-17 18:07 UTC (4 hours ago)
        
 (HTM) web link (antithesis.com)
 (TXT) w3m dump (antithesis.com)
        
       | wwilson wrote:
       | This is Will (I gave the talk linked in the post). Happy to
       | answer any questions about this work, or how it generalizes to
       | testing things that aren't Nintendo games.
        
         | thsksbd wrote:
         | How does this distinguish between a bug and a "feature"?
         | 
         | Edit: i mean this is the spirit of Knuth's quip that when he
         | dies all the bugs in tex will become features
        
           | wwilson wrote:
           | There was a LOT of discussion of this in the Q&A after the
           | talk. Currently we have 4 main approaches:
           | 
           | (1) There's some stuff that's pretty much a bug for every
           | program. If it segfaults, exits with a nonzero code, OOMs,
           | triggers a TSAN error, fills the disk with fatal error
           | messages, etc., etc., that's pretty easy to qualify.
           | 
           | (2) You can use our SDK to define additional custom test
           | properties. Think like a normal assertions library, but you
           | can also do existential quantification ("this code is
           | reachable/this situation can happen") and soon temporal
           | assertions ("this should never happen without this other
           | thing happening first, possibly on a different node").
           | 
           | (3) We store all the output of your system in every timeline
           | in a giant analytic database and support ad-hoc querying
           | against it. Think "pre-observability", observability but for
           | mirror universes. You can then do all the spelunking and
           | analysis you would do with your production traces, but before
           | your real customers are exposed to any issue.
           | 
           | (4) We have some very cool ML approaches in the pipeline that
           | I can't talk about quite yet.
        
             | ajb wrote:
             | Can you define equivalence classes (mutations that
             | shouldn't change the result) eg timing, order of events,
             | idempotence, etc? So that you can use (3) to define the
             | correct result for all members of the class
             | 
             | (Sorry if this is explained in the talk - I'll watch it but
             | it's now too late in the day in my timezone)
        
         | cperciva wrote:
         | Why is Mario so jumpy?
        
           | wwilson wrote:
           | This is actually an incredibly deep and difficult question to
           | answer. I would expect no less from cperciva. :-)
           | 
           | As I mention in the talk, you get very bad tactical
           | performance from taking a uniform random distribution and
           | piping it into the emulator. The fuzzer is exponentially
           | unlikely to hold the jump button for many successive frames
           | without a break. In the fully general case, I think instead
           | of maximum entropy, you want something more like Marcus
           | Hutter's AIXI where you spend energy on inputs inversely
           | proportional to their Kolmogorov complexity. Unfortunately,
           | that's uncomputable, but it turns out that just switching to
           | toggling bits with low probability does a lot better than
           | pure randomness. The approach is analogous to swarm testing
           | (https://users.cs.utah.edu/~regehr/papers/swarm12.pdf).
           | 
           | All of which is to say, the result that we show here is
           | vastly _less_ jumpy than our first tries. The reason it 's
           | still more jumpy than a human player is that our platform has
           | no idea where it is in the game, or even that it's playing a
           | game. So if a jump doesn't harm it in the exploration
           | process, there's some chance the first input getting
           | somewhere new will involve a jump, and that will then get
           | locked in.
           | 
           | We do have the capability to do optimization on inputs (what
           | conventional PBT calls "shrinking"), and indeed if you apply
           | this to Mario you can get it to jump a lot less and complete
           | levels a lot faster. That capability didn't exist yet when
           | this video was recorded. We should totally do a another post
           | on this topic!
        
             | someplaceguy wrote:
             | > Unfortunately, that's uncomputable
             | 
             | Minor nitpick, but while Kolmogorov complexity as typically
             | defined is uncomputable, I would argue that this result is
             | only a theoretical curiosity and mostly irrelevant.
             | 
             | That is, the "uncomputable" Kolmogorov complexity
             | computation presupposes that you have a Turing machine,
             | i.e. a machine with _literally_ infinite memory, which is
             | not possible to construct in our universe. Or
             | alternatively, it presupposes that  "computable function"
             | is one that can be computed by a machine with an infinite
             | amount of storage, which amounts to the same thing as
             | having a Turing machine.
             | 
             | You could probably define some version of Kolmogorov
             | complexity that is parametrized by the memory size (e.g. of
             | a linear bounded automaton or similar model that better
             | represents a computer with finite resources), which should
             | make it computable. That said, in practice it would
             | probably take an unreasonable amount of time to perform
             | this computation (but that is orthogonal to whether it's
             | computable or not).
        
               | wwilson wrote:
               | Thank you. Comments like this are why I love HN.
        
               | anyfoo wrote:
               | While reading your comment, I kept having the vague
               | feeling that your idea of "theoretically computable"
               | (within the given memory bounds) will still leave some
               | exponential time and/or space complexity, and so
               | effectively will still be "practically not computable".
               | 
               | Your last paragraph then seemed to confirm that, i.e.
               | there was no particular shortcut for this specific case
               | that would make it any different from general Kolmogorov
               | complexity.
               | 
               | In that sense, isn't your comment itself also "only a
               | theoretical curiosity"? As we went to from
               | "uncomputable", to "theoretically computable under the
               | given constraints", to "practically uncomputable"?
               | 
               | While suffering from lack of rigor, I think a lot of
               | times--probably even the majority of times outside purely
               | cs-theoretical treatments--when we colloquially speak of
               | "uncomputable", we are always talking about practical
               | computers without an infinite touring tape, and so
               | actually mean "practically uncomputable".
               | 
               | Because, yes, while everyone immediately understands that
               | actual computers don't have infinite memory, at the same
               | time everyone understands that "exponential time" is
               | still "never" in practical terms.
        
           | gwern wrote:
           | Most RL agents are 'jumpy' or jitter a lot, because it makes
           | no difference to the reward, and where it does make a
           | difference, that tends to only matter close to convergence
           | where slight speedups are all that's left. If you want to
           | reduce that, you have to reward-shape it to penalize excess
           | movement. (Which is relevant in robotics, where 'jitter' can
           | be very bad for the machinery in a way not reflected in the
           | simple tasks' reward functions.)
        
             | anyfoo wrote:
             | Case in point: Watch an actual Mario speed running world
             | record video, and the human players there are jumping
             | around a lot while running around flat sections with no
             | obstacles.
             | 
             | As I understand, the jumping (at least in this case) does
             | nothing to Mario's horizontal speed at all, so they
             | basically just do it for "fun".
             | 
             | The difference is that the human players know that they're
             | doing something inconsequential for variety, while the
             | fuzzer has no "idea" (i.e. no concept) that the jumping
             | does not matter, or even that it is "jumping". More
             | generally, it does not know nor care that this particular
             | part of the input is irrelevant. It just found an input
             | that works, and sometimes that happens to include a useless
             | jump.
        
         | mustknow2201 wrote:
         | How close is or isn't this to a genetic algorithm in practice?
         | It seems like as soon as you start scaling out to multiple
         | threads/nodes you'd benefit from crossover, selection
         | techniques and so on?
        
       | Taikonerd wrote:
       | Super Mario is such a fun example. Well chosen.
        
       | mrkmarron wrote:
       | FYI playing Super Mario with fuzzing (AFL) was done in a fun 2020
       | S&P paper. Also finds bugs and security issues.
       | 
       | "IJON: Exploring Deep State Spaces via Fuzzing"
       | https://casa.rub.de/fileadmin/img/Publikationen_PDFs/2020_IJ...
        
         | wwilson wrote:
         | Thanks for flagging this! The work we're announcing today was
         | completed in 2018, and we have since moved on to much more
         | challenging problems both in the Nintendo domain and elsewhere.
         | Totally not looking to pick a fight over priority though. This
         | is such a hilariously understudied and under-explored area, we
         | really value anybody else who's trying to work on these
         | problems.
        
           | mrkmarron wrote:
           | I think you underestimate the level to which this area has
           | been studied. And I wish you would talk about these new
           | results then instead of announcing 5+ year old results then.
           | 
           | It would be great to see progress in this area (not my
           | primary area of work BTW) but I am not seeing anything here,
           | technically, that is going to make that happen -- maybe it is
           | just getting all the parts in place and magic happens. It
           | just makes me scratch my head a bit.
        
             | wwilson wrote:
             | It's possible you did not make it to the end of the talk
             | where I explain this, but the thing that excites me is that
             | we can now apply fuzzing and related techniques to things
             | which are neither Nintendo games nor tiny stateless
             | libraries and parsers, because of this:
             | https://antithesis.com/blog/deterministic_hypervisor/
             | 
             | As for getting to the newer stuff, yeah, totally, just give
             | us some time. There's a bit of a backlog. :-)
        
               | mrkmarron wrote:
               | I just rewatched the end of the video to make sure I
               | didn't miss anything. Deterministic execution and replay
               | is very-very well-known and understood. It is possible
               | that your packaging and market fit is right on. Lots of
               | cottage industry in DB testing and bug finding -- but not
               | clear how this generalizes and why something like Coyote
               | [1] (to pick one) wouldn't work as well.
               | 
               | So, fuzzing has been applied to very stateful and very
               | large industrial systems for some time. And yes it is
               | very cool but I feel like I am seeing more "sizzle than
               | steak" so to speak. Great engineering though, hypervisor
               | work is very challenging.
               | 
               | [1] https://www.microsoft.com/en-us/research/blog/coyote-
               | making-...
        
               | wwilson wrote:
               | It is absolutely possible to write a large stateful
               | system from the ground up so that autonomous testing
               | techniques can be applied to it. FoundationDB and Tiger
               | Beetle are both examples of this, I think Resonate might
               | be another one, and Toby Bell's talk at Strange Loop last
               | year is a great guide on how to do so.
               | 
               | What's much harder is to take an arbitrary system,
               | written in an arbitrary way, without these techniques in
               | mind, and make it amenable to this sort of testing. From
               | the start of our company, we believed that unless this
               | was possible, the market would be too hard to crack,
               | because most human beings are not very foresightful and
               | not able to justify a bunch of extra work.
               | 
               | Hypervisor-based snapshot fuzzing like Nyx-Net and
               | deterministic userspaces like Facebook's now-discontinued
               | Hermit project are the other ways I know of accomplishing
               | that goal. We believe that both of them have some pretty
               | serious practical limitations which our approach does not
               | share.
               | 
               | EDIT: Maybe the way to get to the crux of the
               | disagreement is for me to turn the question around. Why
               | do you believe that the vast majority of stateful and
               | concurrent systems are not tested with fuzzing?
        
               | mrkmarron wrote:
               | At the end of the day you have 2 problems (1) how to make
               | execution deterministic within some boundary be it a
               | process, hypervisor, or distributed system and (2) how
               | you handle non-determinism when data crosses this
               | boundary. You can move the boundaries and effort around
               | but the problems always exist. So, if you are claiming
               | that you have a sweet spot on this tradeoff then I could
               | certainly believe that, if you claim that you eliminated
               | this boundary issue then I am highly credulous.
               | 
               | I'll agree with you on indeterminate behaviors though, I
               | suspect they will eventually be seen like the "billion
               | dollar" mistake of null pointers.
        
               | mrkmarron wrote:
               | Just saw the edit. I have 2 answers:
               | 
               | 1) Fuzzing is under-utilized even for simple code. AFL is
               | dead easy to use and, even so, most projects don't have
               | it in CI runs. So, despite how much I like it, in general
               | it seems people do not see value in this type of testing.
               | 
               | 2) The effort to handle external state (say a restful
               | call to get stock ticker info) needs to be mocked --
               | which is deeply unpopular -- or handled by record/replay
               | which works ok-ish but eventually breaks down with
               | divergences. Outside of well-chosen domains it these
               | eventually pop-up and add an additional pain point that
               | builds on item 1.
        
         | chc4 wrote:
         | A lot of fuzzers use Mario or other simple games as an internal
         | testcase. I'm aware of a hypervisor fuzzer from 2016 that did
         | it, and I'm positive there are others (both before and since).
         | Hell, tom7 has a fuzzer for exploring program states that uses
         | Super Mario Bros as the example from 2013
         | (https://www.cs.cmu.edu/~tom7/mario/mario.pdf, plus a youtube
         | video https://youtu.be/xOCurBYI_gY), and he's definitely not
         | the first either.
        
           | mrkmarron wrote:
           | Thanks for sharing, I felt like there were earlier but the
           | x,y trick jumped out at me and that was the one I remembered
           | off the top of my head.
        
           | wwilson wrote:
           | We are huge fans of tom7 and that paper was one of our
           | inspirations for using NES as a domain for researching
           | autonomous state space search! I think he does a very good
           | job of explaining why the problem is hard.
        
             | chc4 wrote:
             | Right, in case it wasn't clear to readers this isn't a bad
             | thing. Lots of people use games because they're good
             | analogs for other programs and evocatively show fuzzing
             | exploration progress. Not being the first to point a fuzzer
             | at Mario doesn't matter.
        
         | infogulch wrote:
         | Have any of these methods found clips and speedrunning
         | shortcuts? Examples: Clip the base of the flagpole to skip some
         | animation time. Clip into and walk below the floor to run past
         | obstacles. Etc.
        
           | wwilson wrote:
           | We find a ton of glitches in Mario, and even more in other
           | games. See, e.g.: https://vimeo.com/807601164
        
             | wwilson wrote:
             | Hmm... looks like Vimeo is currently having some kind of
             | outage. If anybody at Vimeo is reading this and wants some
             | help with testing backend systems, email is in bio.
        
       | yanniszark wrote:
       | This is fascinating! I thought only Reinforcement Learning was
       | doing things like this but you're saying you can do this via
       | fuzzying? What does this mean exactly? How is it able to learn to
       | advance through all these levels? Is there an underlying learning
       | mechanism at play?
        
         | vojev wrote:
         | There's no learning exactly, as the post explains the fuzzer is
         | aware of various RAM addresses (as well as having a tactic for
         | how it "presses" buttons in the game). It's just trying to
         | explore the space of Mario's level + his x and y coordinates.
         | 
         | (I'm an Antithesis employee.)
        
       | bbor wrote:
       | At what point can we start suing companies on behalf of the
       | commons for taking words from the lexicon? I miss the days when
       | this would be called "Wilson & Co.'s automated testing solution"
       | instead of such a beautiful, philosophically meaningful word.
       | Same thoughts on that Devin.AI scam taking the name "Cognition"
       | and Vercel somehow bribing their way into claiming the "ai" name
       | on NPM.
       | 
       | Technically awesome post tho! Love the heatmap esp. Maybe bring
       | up changing your name to investors because some rando online
       | doesn't like it though, please.
        
       | m3kw9 wrote:
       | Doesn't explain how it finds bugs it's just had the AI play Mario
       | bros
        
         | anyfoo wrote:
         | I think everyone in the target audience of this blog post is
         | immediately able to make the connection.
        
       ___________________________________________________________________
       (page generated 2024-04-17 23:00 UTC)