[HN Gopher] Debugging in the Multiverse
       ___________________________________________________________________
        
       Debugging in the Multiverse
        
       Author : wwilson
       Score  : 161 points
       Date   : 2024-09-10 13:05 UTC (9 hours ago)
        
 (HTM) web link (antithesis.com)
 (TXT) w3m dump (antithesis.com)
        
       | gguergabo wrote:
       | Antithesis employee here. Happy to jump in and answer any burning
       | questions people might have about multiverse debugging.
        
         | yellow_lead wrote:
         | > Let's get more concrete. Let's use this to solve a real
         | problem. My server has crashed and its process has exited! No
         | worries, I'll just rewind time, attach a debugger to the
         | process, and set a breakpoint or capture a thread dump:
         | 
         | Is this kind of stuff only possible in an Antithesis
         | Environment?
        
           | wwilson wrote:
           | Yes, unfortunately we have not figured out how to rewind time
           | in the real world yet. When we do, there are a lot of choices
           | I'm going to revisit...
        
             | qarl wrote:
             | > Yes, unfortunately we have not figured out how to rewind
             | time in the real world yet.
             | 
             | 10 bucks says you get complaints for not implementing the
             | "real world" feature.
        
             | abeppu wrote:
             | ... but the intro makes it sound like this system is
             | valuable in investigating bugs that occurred in prod
             | systems:
             | 
             | > I've been involved in too many production outages and
             | emergencies whose aftermath felt just like that. Eventually
             | all the alerts and alarms get resolved and the error rates
             | creep back down. And then what? Cordon the servers off with
             | yellow police tape? The bug that caused the outage is there
             | in your code somewhere, but it may have taken some
             | outrageously specific circumstances to trigger it.
             | 
             | So practically, if a production outage (where I think
             | "production" means it cannot be in a simulated environment,
             | since the customers you're serving are real) is caused by
             | very specific circumstances, and your production system
             | records some, but not every attribute of its inputs and
             | state ... how does one make use of antithesis? Concretely,
             | when you have a fully-deterministic system that can help
             | your investigation, but you have only a partial view of the
             | conditions that caused the bug ... how do you proceed?
             | 
             | I feel like this post is over-promising but perhaps there's
             | something I just don't understand since I've never worked
             | with a tool set like this.
        
               | yellow_lead wrote:
               | This was my thinking as well. Prod environments can be
               | extremely complicated and issues often come down to
               | specific configuration or data issues in production. So I
               | had a lot of trouble understanding how the premise is
               | connected to the product here.
        
               | jackschu wrote:
               | (I work at Antithesis)
               | 
               | I think you're right that the framing leans towards
               | providing value in prod issues, but we left out how we
               | provide value there. I think you're also right that we're
               | just used to experiencing the value here, but it needs
               | some explanation.
               | 
               | Basically this is where guided, tree-based fuzzing comes
               | in. If something in the real world is caused by very
               | specific circumstances, we're well positioned to have also
               | generated those specific circumstances. This is thanks to
               | parallelism, intelligent exploration, fault injection,
               | our ability to revisit interesting states in the past
               | with fast snapshots, etc.
               | 
               | We've had some super notable instances where a customer
               | finds a bug in prod, recalls it's that weird bug they've
               | been ignoring that we surfaced a month ago, and then uses
               | this approach to debug.
               | 
               | The best docs on this are probably here:
               | https://antithesis.com/docs/introduction/how_antithesis_work...
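The guided, snapshot-based exploration described above can be illustrated with a toy sketch. This is not Antithesis's algorithm: the `SECRET` sequence, the `step` function, and the "resume from the deepest state seen" heuristic are all assumptions made up for illustration. The point is only that branching from saved states reaches a very specific input sequence far faster than restarting from scratch each time.

```python
import random
from typing import Optional

SECRET = [3, 1, 4, 1, 5]  # hypothetical "outrageously specific" trigger sequence


def step(progress: int, value: int) -> int:
    """Toy system under test: advances only when the next secret input arrives."""
    return progress + 1 if value == SECRET[progress] else 0


def explore(seed: int, budget: int) -> Optional[int]:
    """Branch from saved states; return how many steps it took to hit the bug."""
    rng = random.Random(seed)
    snapshots = {0}  # states we can cheaply return to
    for used in range(1, budget + 1):
        start = max(snapshots)  # crude guidance: resume from the deepest state
        nxt = step(start, rng.randrange(8))
        if nxt == len(SECRET):
            return used  # the "bug" state was reached
        snapshots.add(nxt)
    return None


if __name__ == "__main__":
    print("steps needed:", explore(seed=0, budget=2000))
```

With 8 possible inputs and a 5-step trigger, blind restarts need on the order of 8^5 attempts, while branching from snapshots needs roughly 8 tries per level.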
        
         | abeppu wrote:
         | The intro mentions that ordinarily, we have to pay a high
         | upfront cost to record info that we might need to debug later.
         | 
         | > When we succeed at this, we collect huge volumes of logs
         | "just in case" they provide some crucial clue, incurring
         | equally huge storage costs.
         | 
         | The 'packets from the past' section says we can just
         | retroactively decide what we should have recorded.
         | 
         | Doesn't that mean we're effectively recording everything
         | always? What's the cost of this? Or is all of this under the
         | assumption that we never have to debug something that happened
         | outside of the simulation environment, e.g. in response to an
         | actual in-bound request from a customer? If this is just saying
         | we can afford to save everything in our development environment
         | ... well in that context recording the logs probably wasn't a
         | "huge storage cost" either, right? Or am I missing something
         | basic here?
        
           | wwilson wrote:
           | You're right that if you tried to do something like this
           | using record/replay, you would pay an enormous cost.
           | Antithesis does not use record/replay, but rather a
           | deterministic hypervisor
           | (https://antithesis.com/blog/deterministic_hypervisor/). So
           | all we have to remember is the set of inputs/changes to
           | entropy that got us somewhere, not the result of every system
           | operation.
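A minimal sketch of why remembering only the inputs is enough, assuming a toy system whose sole source of entropy is a seeded RNG. The `run_simulation` function and its event names are hypothetical; a real deterministic hypervisor controls far more than this.

```python
import hashlib
import random


def run_simulation(seed: int, steps: int) -> str:
    """Run a toy system whose only nondeterminism is a seeded RNG.

    Returns a digest of the final state so that two runs can be compared.
    """
    rng = random.Random(seed)  # all entropy flows through this one object
    state = {"counter": 0, "log": []}
    for i in range(steps):
        event = rng.choice(["request", "timeout", "disk_error"])
        if event == "request":
            state["counter"] += 1
        elif event == "timeout":
            state["counter"] -= 1
        state["log"].append((i, event))
    return hashlib.sha256(repr(state).encode()).hexdigest()


# Only the inputs are stored; the full state is recomputed on demand.
recorded_inputs = {"seed": 1234, "steps": 10_000}

if __name__ == "__main__":
    first = run_simulation(**recorded_inputs)
    second = run_simulation(**recorded_inputs)
    print("identical:", first == second)
```

The stored record is two integers, while the state it reproduces includes a 10,000-entry log.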
        
             | slippy wrote:
             | The classic time space tradeoff question: If I run
             | Antithesis for X time, say 4 hours, do you take periodic
             | snapshot / deltas of state so that I don't have to re-run
             | the capture for O(4 hours) again, from scratch just to go
             | back 5 seconds?
        
               | wwilson wrote:
               | Yes! See Alex's talk here:
               | https://www.youtube.com/watch?v=0E6GBg13P60
               | 
               | In fact, we just made a radical upgrade to this
               | functionality. Expect a blog post about that soon.
        
         | Veserv wrote:
         | Is the hypervisor multicore? How do you handle shared memory
         | non-determinism? What is the runtime slowdown for shared memory
         | multicore (let's say 16 cores if you need a concrete example)
         | execution?
        
           | Veserv wrote:
           | Found the answer in a different post [1]. The hypervisor and
           | virtual machines are single-core only. The talk also
           | indicates that all I/O operations need to be manually
           | rewritten to use the instrumented mechanism, so it demands a
           | highly paravirtualized guest OS. Logically, that means there
           | are probably no cross-VM shared memory interfaces either. So,
           | no shared memory and thus no need to deal with shared memory
           | non-determinism.
           | 
           | This is just a standard replay engine from what I can tell.
           | 
           | [1] https://news.ycombinator.com/item?id=41501577
        
             | wwilson wrote:
             | No, we don't require any paravirtualization at all, and
             | nothing needs to be manually rewritten. I'm not sure where
             | you got that impression.
             | 
             | It also is not in any sense a replay engine. We don't need
             | to record anything except the inputs!
        
               | Veserv wrote:
               | At timestamp 23:40 in the video by Alex Pshenichkin from
               | 2024-06-10 it says data ingestion comes via VMCALL
               | interactions. As such a call is literal nonsense if you
               | are not virtualized, any such call inherently means you
               | are using a paravirtualized interface. Now maybe FreeBSD
               | has enough standardized paravirtualized drivers similar
               | to virtio that you can just link it up, but that would
               | still be a paravirtualization solution with manual
               | rewrites, just somebody else already did the manual
               | rewrites. Has the fundamental design changed in the last
               | 3 months?
               | 
               | This is exactly a replay engine (or I guess you could say
               | replay engines are deterministic simulators). How do you
               | think you replay a recording except with a deterministic
               | execution system that injects the non-deterministic
               | inputs at precise execution points? This is literally how
               | all replay engines work. Furthermore, how do you think
               | recordings work except by recording the inputs? That is
               | literally how all recording systems designed to feed
               | replay engines work. The only distinction is what
               | constitutes non-determinism in a given context. At the
               | whole hypervisor level, it is just I/O into the guest; at
               | the process level, it is just system calls that write
               | into the process; at the threading level, it is all
               | writes into the process. These distinctions are somewhat
               | interesting at an implementation level, but do not change
               | the fundamental character of the solution which is that
               | they are all a replay engine or deterministic simulator,
               | whatever you want to call it.
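The mechanism described in this comment (record the nondeterministic inputs, then inject them at the same execution points into an otherwise deterministic run) can be sketched as follows. The `program`, `record`, and `replay` names are hypothetical, and `random.SystemRandom` merely stands in for real, unrepeatable I/O.

```python
import random
from typing import Callable, List, Tuple

InputLog = List[Tuple[int, int]]  # (execution point, value) pairs


def program(read_input: Callable[[int], int], steps: int) -> int:
    """Deterministic logic whose only nondeterminism comes from read_input."""
    acc = 0
    for i in range(steps):
        if i % 3 == 0:  # execution points where external input arrives
            acc = (acc * 31 + read_input(i)) % 1_000_003
        else:
            acc = (acc + i) % 1_000_003
    return acc


def record(steps: int) -> Tuple[int, InputLog]:
    """Run once against 'real' input, logging every value and where it arrived."""
    log: InputLog = []
    source = random.SystemRandom()  # stand-in for unrepeatable I/O

    def read_input(i: int) -> int:
        value = source.randrange(256)
        log.append((i, value))
        return value

    return program(read_input, steps), log


def replay(log: InputLog, steps: int) -> int:
    """Re-run the program, injecting logged values at the same execution points."""
    pending = iter(log)

    def read_input(i: int) -> int:
        point, value = next(pending)
        assert point == i, "replay diverged from the recording"
        return value

    return program(read_input, steps)
```

Usage: `result, log = record(300)` followed by `replay(log, 300)` yields the same `result`, even though the recorded values differ from run to run.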
        
       | vikR0001 wrote:
       | This looks very interesting! Is it possible to implement this in
       | a node.js web app? Does it work with any build tool? How much
       | latency does it add to a production server?
        
         | wwilson wrote:
         | The simulation is a completely generic Linux system, so we can
         | run anything (including NodeJS). If your build tool can produce
         | Docker containers, then it will work with us.
         | 
         | We don't run this on your production server, but in the same
         | simulation that we use to find your bugs. See also:
         | https://antithesis.com/product/how_does_antithesis_work/
        
       | ripped_britches wrote:
       | How do you handle side effects that interact with third party
       | systems? In my own tests, I use network request mocks. Do you
       | need to provide a test mode flag to indicate that mocks should be
       | used?
        
         | wwilson wrote:
         | Any third party service does need to be mocked or stubbed out.
         | We have a partnership with Localstack that lets us provide very
         | polished AWS mocks that require zero configuration on your part
         | (https://antithesis.com/docs/using_antithesis/environment.htm...).
         | 
         | If you need something else, reach out and ask us about it,
         | because we have a few of them in the pipeline.
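For readers unfamiliar with stubbing out a third-party call, here is a minimal sketch using only the Python standard library. It is unrelated to the Localstack integration mentioned above; the `fetch_status` helper and the `payments.example.invalid` URL are hypothetical.

```python
import urllib.request
from unittest import mock


def fetch_status(url: str) -> int:
    """Code under test: asks a third-party service for its HTTP status."""
    with urllib.request.urlopen(url) as response:
        return response.status


def check_with_stub() -> int:
    """Replace the network call with a stub so no real request is made."""
    with mock.patch("urllib.request.urlopen") as fake_urlopen:
        # The stubbed response object reports a fixed status code.
        fake_urlopen.return_value.__enter__.return_value.status = 200
        return fetch_status("https://payments.example.invalid/health")
```

Because the stub always answers the same way, the test behaves identically on every run, which is the property a deterministic environment needs from anything outside its boundary.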
        
       | casept wrote:
       | Does anything like the Antithesis hypervisor exist as open
       | source?
       | 
       | The closest I've seen is Qemu record/replay, but that's very slow
       | (no KVM acceleration, no multicore), and broken in current Qemu
       | versions (replayed system just gets stuck).
        
         | spencerflem wrote:
         | There are tools and languages that support time travel
         | debugging, like rr (used with GDB) or Smalltalk, but no open
         | source system-wide thing like Antithesis that I know of yet.
        
           | dzaima wrote:
           | rr can record process trees; i.e. basically any
           | part/descendant of a process you spawn will be recorded and
           | can be replayed (userspace CPU & memory, that is); won't
           | record the entire OS though.
        
             | casept wrote:
             | My experience with RR is that the chance of it working
             | without hitting a missing syscall or desync is only about
             | 50%, which is why I want a different solution that doesn't
             | rely on the fragile syscall recording approach.
        
               | dzaima wrote:
               | Huh. In my experience it works nearly flawlessly,
               | certainly far above 50%. And even when there are spurious
               | failures in replaying it's easy enough to just re-replay
               | (though I do wish there was some way to export the
               | current position & checkpoints with instruction-level
               | precision to import in a fresh replay). I suppose it
               | depends massively on the recorded program (most of mine
               | are simple C programs, but also a decent bit of Java for
               | JIT inspection or FFI, and I've also recorded an Electron
               | app a couple times, and for fun Factorio)
        
               | vlovich123 wrote:
               | Same, I haven't had it have too many problems but it's
               | not perfect & missing support for io_uring is a problem
               | (they'll add it eventually I suspect once someone ponies
               | up the money for it).
        
         | eatonphil wrote:
         | https://github.com/facebookexperimental/hermit but it hasn't
         | worked for me and is now unmaintained.
        
       | mattgreenrocks wrote:
       | It is really interesting to me that this sort of thing didn't
       | come from programming language folks like I'd expect. You'd think
       | PLs are in the absolute perfect spot to implement things, because
       | they define the semantics and runtime. And there are a few PLs
       | that have time-travel demos, but they've never really been seen
       | as more than a cool tech demo.
       | 
       | Perhaps the language is too small a vantage point to really get
       | into what's happening when debugging.
        
         | barumrho wrote:
         | Elm debugger did something like this, but it's much more
         | limited in scope.
        
           | the_sleaze_ wrote:
           | As does Vue - in 5+ years I've never used it
        
         | munificent wrote:
         | I know time-travel debugging is very very close to Gilad
         | Bracha's heart and something he was really hoping would make
         | its way into Dart.
         | 
         | I don't know to what degree this is true for other language
         | teams but one thing I've observed is that language designers,
         | compiler people, VM people, and IDE/debugger people have more
         | distinct cultures than you might expect. That can make it hard
         | to ship features that cut across those domains. I think we've
         | gotten a lot better at doing that kind of holistic design on
         | the Dart team, but it took years of team-building to get there.
        
           | vlovich123 wrote:
           | There's reverse debugging and then there's what antithesis
           | does which is a deterministic guarantee of the state. So for
           | example, if you rewind, you'll get the exact same disk &
           | network I/O happening across each call. And it supports
           | arbitrary OS operations whereas typically at the PL level
           | you'll be left at the mercy of whichever OS APIs the PL
           | chooses to support for recording (i.e. similar to rr in terms
           | of what it'll be able to do). Often times, PLs don't even
           | bother with recording state across OS calls since they don't
           | actually know what are OS calls vs normal function calls.
        
         | GuB-42 wrote:
         | From the little I have seen, most programming language folks
         | don't seem to care much about debugging. They care a lot about
         | bugs not happening in the first place, which is good, and
         | testability is sometimes taken into consideration, but not much
         | about what to do after a bug has happened.
         | 
         | No language will prevent you from misimplementing the specs,
         | but languages can be designed in such a way that it is easy to
         | trace back why the button is green and not red.
         | 
         | It seems like those who are the most serious about debugging
         | are from the video game industry. They get all the cool stuff
         | with time travel, hot reload, etc... So much that I expected to
         | see something about video games, and was surprised it wasn't.
        
           | corysama wrote:
           | Coming out of the games industry, I am constantly amazed by
           | how rarely people outside of games use debuggers. And, how
           | slow they are to debug everything because of that...
        
         | vlovich123 wrote:
         | > Perhaps the language is too small a vantage point to really
         | get into what's happening when debugging
         | 
         | A little bit. The big thing that others are missing is that
         | it's basically impossible for a PL to accomplish this.
         | Antithesis is basically recording all the state including I/O,
         | network I/O, all RNGs (including the OS) and the big one which
         | everyone has trouble with which is time. So basically you don't
         | need to set up your code and how it interfaces with its
         | environment to be deterministic - you can run within a
         | deterministic container instead which flips the problem on its
         | head and makes it much easier. I'm sure there are tradeoffs. A
         | notable one is how expensive and slow this approach is vs
         | making your code deterministic. But given how basically no one
         | bothers to make their code deterministic and this is a drop-in
         | solution for scenarios like that, it's really worth it.
         | Additionally, unlike approaches like rr which offer similar
         | capabilities, this is even more generic & not dependent on
         | adding support for every OS interface (e.g. rr doesn't support
         | io_uring yet but I believe antithesis would since it's running
         | at the VM level)
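To make the tradeoff above concrete, this is roughly what "making your code deterministic" by hand looks like: time and randomness are passed in rather than read from the environment. The `FakeClock` class and `retry_with_jitter` function are hypothetical examples, not part of any tool discussed here.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class FakeClock:
    """Simulated clock: sleeping advances time instantly and reproducibly."""

    now: float = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


def retry_with_jitter(
    operation: Callable[[int], bool],
    clock: FakeClock,
    rng: random.Random,
    attempts: int = 5,
) -> Tuple[bool, float]:
    """Retry with jittered exponential backoff; return (succeeded, elapsed)."""
    start = clock.time()
    for attempt in range(attempts):
        if operation(attempt):
            return True, clock.time() - start
        # Both the delay and its jitter come from injected dependencies.
        clock.sleep((2 ** attempt) * rng.uniform(0.5, 1.0))
    return False, clock.time() - start
```

Every function that touches time or randomness has to be written this way for the approach to work, which is the per-codebase cost that running inside a deterministic container avoids.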
        
         | jackschu wrote:
         | Yeah it's fun that we get to do this at the hypervisor level.
         | This opens up time-traveling in systems where there's cross-
         | machine or inter-process communication, which really widens
         | what we're able to do.
         | 
         | (I work at Antithesis, if you're interested in chatting more
         | once this thread has gone cold come join discord.gg/antithesis)
        
         | jbverschoor wrote:
         | Emm.. so many tools are/were available for C and Java. But
         | yeah, we need to reinvent wheels every so often
        
       | 1970-01-01 wrote:
       | >Seems obvious enough. But maybe, just maybe, the brake lines
       | were cut by somebody who wanted the driver dead. Or what if he
       | was drugged? Can we distinguish that scenario from him being
       | sleepy?
       | 
       | If this is prod, your job is going to be finding what combination
       | of these things caused it this time.
        
       | vngzs wrote:
       | There's a binary analysis time travel debugger similar to this,
       | Qira [0][1].
       | 
       | [0]: https://www.usenix.org/conference/enigma2016/conference-prog...
       | 
       | [1]: https://qira.me/
        
       | nynx wrote:
       | Pretty much no software, even when run deterministically, is
       | bijective. There are almost always cases where two different
       | states map to the same state.
       | 
       | How does this tooling deal with that?
        
         | wwilson wrote:
         | This makes the mapping "injective":
         | https://antithesis.com/blog/deterministic_hypervisor/
         | 
         | The "onto" direction doesn't really matter.
        
           | nynx wrote:
           | How can it reverse time? Does it record a stack of every
           | decision point?
        
             | intuitionist wrote:
             | You don't need to reverse time if you can deterministically
             | reproduce everything that led up to the point of interest.
             | (In practice we save a snapshot of your system at some
             | intermediate point and replay from there.)
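The "snapshot at an intermediate point, then replay" idea can be sketched with a toy deterministic system. The `Simulation` class, the snapshot interval, and the `rewind_to` helper are assumptions for illustration only; real VM snapshots capture far more state.

```python
import random
from typing import Dict, List, Tuple


class Simulation:
    """Toy deterministic system: state evolves only from a seeded RNG."""

    def __init__(self, seed: int) -> None:
        self.rng = random.Random(seed)
        self.step = 0
        self.total = 0

    def advance(self) -> None:
        self.total += self.rng.randint(-5, 5)
        self.step += 1

    def snapshot(self) -> Dict:
        # Capture everything needed to resume, including the RNG's state.
        return {"step": self.step, "total": self.total, "rng": self.rng.getstate()}

    def restore(self, snap: Dict) -> None:
        self.step = snap["step"]
        self.total = snap["total"]
        self.rng.setstate(snap["rng"])


def run_with_snapshots(seed: int, steps: int, every: int) -> Tuple[Simulation, List[Dict], List[int]]:
    """Run forward, saving a snapshot every `every` steps."""
    sim = Simulation(seed)
    snaps = [sim.snapshot()]
    history = [sim.total]  # kept only so the rewind can be checked afterwards
    for _ in range(steps):
        sim.advance()
        history.append(sim.total)
        if sim.step % every == 0:
            snaps.append(sim.snapshot())
    return sim, snaps, history


def rewind_to(sim: Simulation, snaps: List[Dict], target_step: int) -> int:
    """Restore the nearest earlier snapshot and replay forward to target_step.

    Returns the number of steps that had to be replayed.
    """
    base = max((s for s in snaps if s["step"] <= target_step), key=lambda s: s["step"])
    sim.restore(base)
    while sim.step < target_step:
        sim.advance()
    return target_step - base["step"]
```

Going back 5 steps from step 1000 with a snapshot every 100 steps replays 95 steps instead of 995, which is the time-space tradeoff raised earlier in the thread.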
        
       | shoggouth wrote:
       | Is this like UndoDB[0]?
       | 
       | [0]: https://undo.io/products/udb/
        
       | quickgist wrote:
       | I've enjoyed reading many of the blog posts by Antithesis, really
       | cool work.
       | 
       | I don't really see a fit for the automated testing product in our
       | stack at the moment, but I would love to use a time traveling
       | hypervisor that I can hop into whenever I'd like.
       | 
       | Currently, it seems your pricing is pretty focused on the
       | automated testing service. Do you have pricing or plans that
       | offer just the deterministic dev environment?
        
         | bobafett-9902 wrote:
         | (antithesis employee here) We don't currently just offer the
         | deterministic dev environment, but we do offer extended 30 day
         | demos for prospects interested in trying out the tech and
         | seeing how it works. If you're interested contact us directly!
         | contact@antithesis.com
        
       | __0x01 wrote:
       | Is this designed to be run in production?
        
         | terpimost wrote:
         | No, it is for testing your system before production
        
       | bluelightning2k wrote:
       | I know I'm taking the wrong thing from this - but I really
       | struggle to read this site. Something about the contrast and
       | aggro gradients.
        
         | terpimost wrote:
         | Hey, designer here. Thank you for this feedback. Do you prefer
         | dark or light theme usually? And do you find text on this
         | background here https://antithesis.com/security/manifesto/
         | any easier to read?
        
       | emeryberger wrote:
       | Would love to hear a technical comparison between this and King
       | et al.'s classic paper on Time-Traveling VMs from USENIX ATC
       | 2005: "Debugging operating systems with time-traveling virtual
       | machines"
       | (https://www.usenix.org/legacy/events/usenix05/tech/general/k...,
       | 505 citations).
        
         | Terr_ wrote:
         | Segueing to talks regarding _literal_ time traveling VMs, I'm
         | reminded of Damian Conway's "Temporally Quaquaversal Virtual
         | Nanomachine Programming In Multiple Topologically Connected
         | Quantum-Relativistic Parallel Spacetimes... Made Easy!"
         | presentation.
         | 
         | Essentially, it involves a series of sci-fi concepts, and then
         | showing the kind of program (in modified perl) that someone
         | might write to take advantage of those capabilities.
        
       | grumbelbart wrote:
       | I was once working in a company producing software / operating
       | systems for smart cards (such as the chips on your credit cards).
       | We developed a simulator for the hardware that logged all changes
       | to registers, memory and other states in a very large ring
       | buffer, allowing us to undo / step backwards through code. With
       | RAM being large, those chips being slow, and some snapshotting,
       | we were usually able to undo back to the reset of the card. That
       | was a game changer regarding debugging the OS.
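A minimal sketch of the undo-log technique described in this comment, assuming a toy machine with two registers and 16 bytes of memory. The `UndoableMachine` class is hypothetical; the original simulator's design is not known beyond what the comment says.

```python
from collections import deque


class UndoableMachine:
    """Toy hardware simulator that logs every state change for stepping back."""

    def __init__(self, capacity: int) -> None:
        self.registers = {"r0": 0, "r1": 0}
        self.memory = bytearray(16)
        # Bounded ring buffer: the oldest entries fall off once it is full.
        self.undo_log = deque(maxlen=capacity)

    def write_register(self, name: str, value: int) -> None:
        self.undo_log.append(("reg", name, self.registers[name]))
        self.registers[name] = value

    def write_memory(self, addr: int, value: int) -> None:
        self.undo_log.append(("mem", addr, self.memory[addr]))
        self.memory[addr] = value

    def step_back(self) -> bool:
        """Undo the most recent write; return False once the log is exhausted."""
        if not self.undo_log:
            return False
        kind, where, old = self.undo_log.pop()
        if kind == "reg":
            self.registers[where] = old
        else:
            self.memory[where] = old
        return True
```

The ring buffer's capacity bounds how far back one can step, which is why the comment pairs it with snapshotting to reach all the way back to reset.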
        
       ___________________________________________________________________
       (page generated 2024-09-10 23:00 UTC)