[HN Gopher] Debugging in the Multiverse
___________________________________________________________________
Debugging in the Multiverse
Author : wwilson
Score : 161 points
Date : 2024-09-10 13:05 UTC (9 hours ago)
(HTM) web link (antithesis.com)
(TXT) w3m dump (antithesis.com)
| gguergabo wrote:
| Antithesis employee here. Happy to jump in and answer any burning
| questions people might have about multiverse debugging.
| yellow_lead wrote:
| > Let's get more concrete. Let's use this to solve a real
| problem. My server has crashed and its process has exited! No
| worries, I'll just rewind time, attach a debugger to the
| process, and set a breakpoint or capture a thread dump:
|
| Is this kind of stuff only possible in an Antithesis
| Environment?
| wwilson wrote:
| Yes, unfortunately we have not figured out how to rewind time
| in the real world yet. When we do, there are a lot of choices
| I'm going to revisit...
| qarl wrote:
| > Yes, unfortunately we have not figured out how to rewind
| time in the real world yet.
|
| 10 bucks says you get complaints for not implementing the
| "real world" feature.
| abeppu wrote:
| ... but the intro makes it sound like this system is
| valuable in investigating bugs that occurred in prod
| systems:
|
| > I've been involved in too many production outages and
| emergencies whose aftermath felt just like that. Eventually
| all the alerts and alarms get resolved and the error rates
| creep back down. And then what? Cordon the servers off with
| yellow police tape? The bug that caused the outage is there
| in your code somewhere, but it may have taken some
| outrageously specific circumstances to trigger it.
|
| So practically, if a production outage (where I think
| "production" means it cannot be in a simulated environment,
| since the customers you're serving are real) is caused by
| very specific circumstances, and your production system
| records some, but not every attribute of its inputs and
| state ... how does one make use of antithesis? Concretely,
| when you have a fully-deterministic system that can help
| your investigation, but you have only a partial view of the
| conditions that caused the bug ... how do you proceed?
|
| I feel like this post is over-promising but perhaps there's
| something I just don't understand since I've never worked
| with a tool set like this.
| yellow_lead wrote:
| This was my thinking as well. Prod environments can be
| extremely complicated and issues often come down to
| specific configuration or data issues in production. So I
| had a lot of trouble understanding how the premise is
| connected to the product here.
| jackschu wrote:
| (I work at Antithesis)
|
| I think you're right that the framing leans towards
| providing value in prod issues, but we left out how we
| provide value there. I think you're also right that we're
| just used to experiencing the value here, but it needs
| some explanation.
|
| Basically this is where guided, tree-based fuzzing comes
| in. If something in the real world is caused by very
| specific circumstances, we're well positioned to have also
| generated those specific circumstances. This is thanks to
| parallelism, intelligent exploration, fault injection,
| our ability to revisit interesting states in the past
| with fast snapshots, etc.
|
| We've had some super notable instances where a customer
| finds a bug in prod, recalls that it's the weird bug they've
| been ignoring that we surfaced a month ago, and then uses
| this approach to debug it.
|
| The best docs on this are probably here:
| https://antithesis.com/docs/introduction/how_antithesis_work...
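|
| To make "branching from saved states" concrete, here is a toy
| sketch in Python. It is not Antithesis's explorer: SECRET, step
| and explore are hypothetical names, the "system" is a tiny state
| machine, and "guidance" only means keeping a snapshot whenever
| a never-before-seen state shows up.

```python
SECRET = (2, 0, 3, 1)  # hypothetical input sequence that triggers the bug


def step(progress, inp):
    """Advance the toy system: a wrong input resets it to 0."""
    return progress + 1 if inp == SECRET[progress] else 0


def explore(max_rounds=10):
    """Branch from saved states, keeping only states never seen before."""
    seen = {0}
    frontier = [(0, [])]  # snapshots: (state, inputs that led to it)
    steps = 0
    for _ in range(max_rounds):
        next_frontier = []
        for progress, path in frontier:
            for inp in range(4):  # try every input from the saved state
                steps += 1
                new = step(progress, inp)
                if new == len(SECRET):  # full sequence seen: the "bug"
                    return path + [inp], steps
                if new not in seen:  # novel state: keep a snapshot of it
                    seen.add(new)
                    next_frontier.append((new, path + [inp]))
        frontier = next_frontier
    return None, steps


path, steps = explore()
```

| In this toy, branching from snapshots reaches the failing
| sequence in 14 steps, while guessing whole four-input sequences
| from a cold start faces 4^4 = 256 candidates.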
| abeppu wrote:
| The intro mentions that ordinarily, we have to pay a high
| upfront cost to record info that we might need to debug later.
|
| > When we succeed at this, we collect huge volumes of logs
| "just in case" they provide some crucial clue, incurring
| equally huge storage costs.
|
| The 'packets from the past' section says we can just
| retroactively decide what we should have recorded.
|
| Doesn't that mean we're effectively recording everything
| always? What's the cost of this? Or is all of this under the
| assumption that we never have to debug something that happened
| outside of the simulation environment, e.g. in response to an
| actual in-bound request from a customer? If this is just saying
| we can afford to save everything in our development environment
| ... well in that context recording the logs probably wasn't a
| "huge storage cost" either, right? Or am I missing something
| basic here?
| wwilson wrote:
| You're right that if you tried to do something like this
| using record/replay, you would pay an enormous cost.
| Antithesis does not use record/replay, but rather a
| deterministic hypervisor
| (https://antithesis.com/blog/deterministic_hypervisor/). So
| all we have to remember is the set of inputs/changes to
| entropy that got us somewhere, not the result of every system
| operation.
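|
| A minimal sketch of why remembering only the inputs can be
| enough, assuming every source of randomness is routed through
| one seeded generator. The run function and the recording dict
| are illustrative only and say nothing about how the hypervisor
| itself is built.

```python
import random


def run(seed, inputs):
    """Toy deterministic system: all 'entropy' comes from one seeded RNG."""
    rng = random.Random(seed)
    state = 0
    trace = []
    for value in inputs:
        jitter = rng.randint(0, 9)  # stands in for timing, scheduling, etc.
        state = (state * 31 + value + jitter) % 1_000_003
        trace.append(state)
    return trace


# The whole "recording" is the seed plus the inputs, not the trace itself.
recording = {"seed": 42, "inputs": [3, 1, 4, 1, 5]}
first = run(**recording)
second = run(**recording)
```

| Re-running with the same recording rebuilds the same trace, so
| intermediate results never have to be stored.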
| slippy wrote:
| The classic time-space tradeoff question: If I run
| Antithesis for X time, say 4 hours, do you take periodic
| snapshot / deltas of state so that I don't have to re-run
| the capture for O(4 hours) again, from scratch just to go
| back 5 seconds?
| wwilson wrote:
| Yes! See Alex's talk here:
| https://www.youtube.com/watch?v=0E6GBg13P60
|
| In fact, we just made a radical upgrade to this
| functionality. Expect a blog post about that soon.
| Veserv wrote:
| Is the hypervisor multicore? How do you handle shared memory
| non-determinism? What is the runtime slowdown for shared memory
| multicore (let's say 16 cores if you need a concrete example)
| execution?
| Veserv wrote:
| Found the answer in a different post [1]. The hypervisor and
| virtual machines are single-core only. The talk also
| indicates that all I/O operations need to be manually
| rewritten to use the instrumented mechanism, so it demands a
| highly paravirtualized guest OS. Logically, that means there
| are probably no cross-VM shared memory interfaces either. So,
| no shared memory and thus no need to deal with shared memory
| non-determinism.
|
| This is just a standard replay engine from what I can tell.
|
| [1] https://news.ycombinator.com/item?id=41501577
| wwilson wrote:
| No, we don't require any paravirtualization at all, and
| nothing needs to be manually rewritten. I'm not sure where
| you got that impression.
|
| It also is not in any sense a replay engine. We don't need
| to record anything except the inputs!
| Veserv wrote:
| At timestamp 23:40 in the video by Alex Pshenichkin from
| 2024-06-10 it says data ingestion comes via VMCALL
| interactions. As such a call is literal nonsense if you
| are not virtualized, any such call inherently means you
| are using a paravirtualized interface. Now maybe FreeBSD
| has enough standardized paravirtualized drivers similar
| to virtio that you can just link it up, but that would
| still be a paravirtualization solution with manual
| rewrites, just somebody else already did the manual
| rewrites. Has the fundamental design changed in the last
| 3 months?
|
| This is exactly a replay engine (or I guess you could say
| replay engines are deterministic simulators). How do you
| think you replay a recording except with a deterministic
| execution system that injects the non-deterministic
| inputs at precise execution points? This is literally how
| all replay engines work. Furthermore, how do you think
| recordings work except by recording the inputs? That is
| literally how all recording systems designed to feed
| replay engines work. The only distinction is what
| constitutes non-determinism in a given context. At the
| whole hypervisor level, it is just I/O into the guest; at
| the process level, it is just system calls that write
| into the process; at the threading level, it is all
| writes into the process. These distinctions are somewhat
| interesting at an implementation level, but do not change
| the fundamental character of the solution which is that
| they are all a replay engine or deterministic simulator,
| whatever you want to call it.
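|
| The record-then-inject pattern described here can be sketched
| in a few lines of Python. Recorder, Replayer and nondet are
| hypothetical names for illustration; a real engine intercepts
| these values at the hypervisor, syscall or thread level rather
| than through an explicit call.

```python
import random
import time


class Recorder:
    """Record mode: read the real non-deterministic source and log the value."""

    def __init__(self):
        self.log = []

    def nondet(self, source):
        value = source()
        self.log.append(value)
        return value


class Replayer:
    """Replay mode: ignore the real source and inject logged values in order."""

    def __init__(self, log):
        self._values = iter(log)

    def nondet(self, source):
        return next(self._values)


def program(env):
    """Otherwise deterministic code; all non-determinism goes through env."""
    total = 0
    for _ in range(5):
        total += env.nondet(lambda: random.randint(1, 6))
        total += env.nondet(time.time_ns) % 7
    return total


rec = Recorder()
original = program(rec)
replayed = program(Replayer(rec.log))
```

| Because the program is deterministic apart from what flows
| through nondet, the replayed run follows the same path as the
| recorded one.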
| vikR0001 wrote:
| This looks very interesting! Is it possible to implement this in
| a node.js web app? Does it work with any build tool? How much
| latency does it add to a production server?
| wwilson wrote:
| The simulation is a completely generic Linux system, so we can
| run anything (including NodeJS). If your build tool can produce
| Docker containers, then it will work with us.
|
| We don't run this on your production server, but in the same
| simulation that we use to find your bugs. See also:
| https://antithesis.com/product/how_does_antithesis_work/
| ripped_britches wrote:
| How do you handle side effects that interact with third party
| systems? In my own tests, I use network request mocks. Do you
| need to provide a test mode flag to indicate that mocks should be
| used?
| wwilson wrote:
| Any third party service does need to be mocked or stubbed out.
| We have a partnership with Localstack that lets us provide very
| polished AWS mocks that require zero configuration on your part
| (https://antithesis.com/docs/using_antithesis/environment.htm...).
|
| If you need something else, reach out and ask us about it,
| because we have a few of them in the pipeline.
| casept wrote:
| Does anything like the Antithesis hypervisor exist as open
| source?
|
| The closest I've seen is Qemu record/replay, but that's very slow
| (no KVM acceleration, no multicore), and broken in current Qemu
| versions (replayed system just gets stuck).
| spencerflem wrote:
| There are tools and languages that support time travel
| debugging, like rr for GDB, or Smalltalk, but no open source
| system-wide thing like Antithesis that I know of yet.
| dzaima wrote:
| rr can record process trees; i.e. basically any
| part/descendant of a process you spawn will be recorded and
| can be replayed (userspace CPU & memory, that is); won't
| record the entire OS though.
| casept wrote:
| My experience with RR is that the chance of it working
| without hitting a missing syscall or desync is only about
| 50%, which is why I want a different solution that doesn't
| rely on the fragile syscall recording approach.
| dzaima wrote:
| Huh. In my experience it works nearly flawlessly,
| certainly far above 50%. And even when there are spurious
| failures in replaying it's easy enough to just re-replay
| (though I do wish there was some way to export the
| current position & checkpoints with instruction-level
| precision to import in a fresh replay). I suppose it
| depends massively on the recorded program (most of mine
| are simple C programs, but also a decent bit of Java for
| JIT inspection or FFI, and I've also recorded an Electron
| app a couple times, and for fun Factorio)
| vlovich123 wrote:
| Same, I haven't hit too many problems with it, but it's
| not perfect & missing support for io_uring is a problem
| (they'll add it eventually I suspect once someone ponies
| up the money for it).
| eatonphil wrote:
| https://github.com/facebookexperimental/hermit but it hasn't
| worked for me and is now unmaintained.
| mattgreenrocks wrote:
| It is really interesting to me that this sort of thing didn't
| come from programming language folks like I'd expect. You'd think
| PLs are in the absolute perfect spot to implement things, because
| they define the semantics and runtime. And there are a few PLs
| that have time-travel demos, but they've never really been seen as
| more than a cool tech demo.
|
| Perhaps the language is too small a vantage point to really get
| into what's happening when debugging.
| barumrho wrote:
| Elm debugger did something like this, but it's much more
| limited in scope.
| the_sleaze_ wrote:
| As does Vue - in 5+ years I've never used it
| munificent wrote:
| I know time-travel debugging is very very close to Gilad
| Bracha's heart and something he was really hoping would make
| its way into Dart.
|
| I don't know to what degree this is true for other language
| teams but one thing I've observed is that language designers,
| compiler people, VM people, and IDE/debugger people have more
| distinct cultures than you might expect. That can make it hard
| to ship features that cut across those domains. I think we've
| gotten a lot better at doing that kind of holistic design on
| the Dart team, but it took years of team-building to get there.
| vlovich123 wrote:
| There's reverse debugging and then there's what antithesis
| does which is a deterministic guarantee of the state. So for
| example, if you rewind, you'll get the exact same disk &
| network I/O happening across each call. And it supports
| arbitrary OS operations whereas typically at the PL level
| you'll be left at the mercy of whichever OS APIs the PL
| chooses to support for recording (i.e. similar to rr in terms
| of what it'll be able to do). Oftentimes, PLs don't even
| bother with recording state across OS calls since they don't
| actually know what are OS calls vs normal function calls.
| GuB-42 wrote:
| From the little I have seen, most programming language folks
| don't seem to care much about debugging. They care a lot about
| bugs not happening in the first place, which is good, and
| testability is sometimes taken into consideration, but not much
| about what to do after a bug has happened.
|
| No language will prevent you from misimplementing the specs,
| but languages can be designed in such a way that it is easy to
| trace back why the button is green and not red.
|
| It seems like those who are the most serious about debugging
| are from the video game industry. They get all the cool stuff
| with time travel, hot reload, etc... So much that I expected to
| see something about video games, and was surprised it wasn't.
| corysama wrote:
| Coming out of the games industry, I am constantly amazed by
| how rarely people outside of games use debuggers. And, how
| slow they are to debug everything because of that...
| vlovich123 wrote:
| > Perhaps the language is too small a vantage point to really
| get into what's happening when debugging
|
| A little bit. The big thing that others are missing is that
| it's basically impossible for a PL to accomplish this.
| Antithesis is basically recording all the state including I/O,
| network I/O, all RNGs (including the OS) and the big one which
| everyone has trouble with which is time. So basically you don't
| need to set up your code and how it interfaces with its
| environment to be deterministic - you can run within a
| deterministic container instead which flips the problem on its
| head and makes it much easier. I'm sure there are tradeoffs. A
| notable one is how expensive and slow this approach is vs
| making your code deterministic. But given how basically no one
| bothers to make their code deterministic and this is a drop-in
| solution for scenarios like that, it's really worth it.
| Additionally, unlike approaches like rr which offer similar
| capabilities, this is even more generic & not dependent on
| adding support for every OS interface (e.g. rr doesn't support
| io_uring yet but I believe antithesis would since it's running
| at the VM level)
| jackschu wrote:
| Yeah it's fun that we get to do this at the hypervisor level.
| This opens up time-traveling in systems where there's
| cross-machine or inter-process communication, which really widens
| what we're able to do.
|
| (I work at Antithesis. If you're interested in chatting more
| once this thread has gone cold, come join discord.gg/antithesis)
| jbverschoor wrote:
| Emm.. so many tools are/were available for C and Java. But
| yeah, we need to reinvent wheels every so often
| 1970-01-01 wrote:
| >Seems obvious enough. But maybe, just maybe, the brake lines
| were cut by somebody who wanted the driver dead. Or what if he
| was drugged? Can we distinguish that scenario from him being
| sleepy?
|
| If this is prod, your job is going to be finding what combination
| of these things caused it this time.
| vngzs wrote:
| There's a binary analysis time travel debugger similar to this,
| Qira [0][1].
|
| [0]: https://www.usenix.org/conference/enigma2016/conference-prog...
|
| [1]: https://qira.me/
| nynx wrote:
| Pretty much no software, even when run deterministically, is
| bijective. There are almost always cases where two different
| states map to the same state.
|
| How does this tooling deal with that?
| wwilson wrote:
| This makes the mapping "injective":
| https://antithesis.com/blog/deterministic_hypervisor/
|
| The "onto" direction doesn't really matter.
| nynx wrote:
| How can it reverse time? Does it record a stack of every
| decision point?
| intuitionist wrote:
| You don't need to reverse time if you can deterministically
| reproduce everything that led up to the point of interest.
| (In practice we save a snapshot of your system at some
| intermediate point and replay from there.)
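|
| A toy sketch of "snapshot, then replay forward", assuming a
| deterministic step function that is deliberately lossy (it
| cannot be run backwards). Sim, run_with_snapshots and state_at
| are hypothetical names, not anything from the product.

```python
import copy
import random


class Sim:
    """Toy deterministic machine whose step function is not invertible."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.state = 0
        self.tick = 0

    def step(self):
        # Integer halving discards information, so there is no inverse step.
        self.state = (self.state + self.rng.randint(0, 100)) // 2
        self.tick += 1


def run_with_snapshots(seed, ticks, every):
    """Run forward, saving a full copy of the machine every `every` ticks."""
    sim = Sim(seed)
    snapshots = {0: copy.deepcopy(sim)}
    history = [sim.state]  # kept only to check the replayed results
    for _ in range(ticks):
        sim.step()
        history.append(sim.state)
        if sim.tick % every == 0:
            snapshots[sim.tick] = copy.deepcopy(sim)
    return snapshots, history


def state_at(snapshots, target_tick):
    """'Rewind' by restoring the nearest earlier snapshot and replaying."""
    base = max(t for t in snapshots if t <= target_tick)
    sim = copy.deepcopy(snapshots[base])
    while sim.tick < target_tick:
        sim.step()
    return sim.state


snapshots, history = run_with_snapshots(seed=7, ticks=1000, every=100)
five_ticks_back = state_at(snapshots, 995)
```

| Going back to tick 995 restores the tick-900 snapshot and
| replays 95 steps rather than re-running all 995 from scratch.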
| shoggouth wrote:
| Is this like UndoDB[0]?
|
| [0]: https://undo.io/products/udb/
| quickgist wrote:
| I've enjoyed reading many of the blog posts by Antithesis, really
| cool work.
|
| I don't really see a fit for the automated testing product in our
| stack at the moment, but I would love to use a time traveling
| hypervisor that I can hop into whenever I'd like.
|
| Currently, it seems your pricing is pretty focused on the
| automated testing service. Do you have pricing or plans that
| offer just the deterministic dev environment?
| bobafett-9902 wrote:
| (antithesis employee here) We don't currently just offer the
| deterministic dev environment, but we do offer extended 30 day
| demos for prospects interested in trying out the tech and
| seeing how it works. If you're interested contact us directly!
| contact@antithesis.com
| __0x01 wrote:
| Is this designed to be run in production?
| terpimost wrote:
| No, it is for testing your system before it reaches production
| bluelightning2k wrote:
| I know I'm taking the wrong thing from this - but I really
| struggle to read this site. Something about the contrast and
| aggro gradients.
| terpimost wrote:
| Hey, designer here. Thank you for this feedback. Do you prefer
| dark or light theme usually? And do you find reading text on
| this background here https://antithesis.com/security/manifesto/
| is any easier?
| emeryberger wrote:
| Would love to hear a technical comparison between this and King
| et al.'s classic paper on Time-Traveling VMs from USENIX ATC
| 2005: "Debugging operating systems with time-traveling virtual
| machines"
| (https://www.usenix.org/legacy/events/usenix05/tech/general/k...,
| 505 citations).
| Terr_ wrote:
| Segueing to talks regarding _literal_ time traveling VMs, I'm
| reminded of Damian Conway's "Temporally Quaquaversal Virtual
| Nanomachine Programming In Multiple Topologically Connected
| Quantum-Relativistic Parallel Spacetimes... Made Easy!"
| presentation.
|
| Essentially, it involves a series of sci-fi concepts, and then
| showing the kind of program (in modified Perl) that someone
| might write to take advantage of those capabilities.
| grumbelbart wrote:
| I was once working in a company producing software / operating
| systems for smart cards (such as the chips on your credit cards).
| We developed a simulator for the hardware that logged all changes
| to registers, memory and other states in a very large ring
| buffer, allowing us to undo / step backwards through code. With
| RAM being large, those chips being slow, and some snapshotting,
| we were usually able to undo back to the reset of the card. That
| was a game changer regarding debugging the OS.
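|
| The undo mechanism can be sketched as a bounded log of old
| values. This is a guess at the general shape, not the original
| simulator: Machine, write and step_back are hypothetical names,
| and only memory writes are logged here.

```python
from collections import deque


class Machine:
    """Toy simulator that logs the old value before every memory write."""

    def __init__(self, mem_size=16, log_capacity=1024):
        self.mem = [0] * mem_size
        # Bounded deque acts as a ring buffer: the oldest entries fall off.
        self.undo_log = deque(maxlen=log_capacity)

    def write(self, addr, value):
        self.undo_log.append((addr, self.mem[addr]))  # remember old value
        self.mem[addr] = value

    def step_back(self):
        """Undo the most recent write; return False once history runs out."""
        if not self.undo_log:
            return False
        addr, old = self.undo_log.pop()
        self.mem[addr] = old
        return True


m = Machine(log_capacity=3)
m.write(0, 10)
m.write(1, 20)
m.write(0, 30)
m.write(2, 40)  # the log now holds only the three most recent writes
```

| With a capacity of 3, the first write can no longer be undone,
| which is the limit that a very large buffer and occasional
| snapshots push back toward the reset.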
___________________________________________________________________
(page generated 2024-09-10 23:00 UTC)