https://antithesis.com/blog/multiverse_debugging/

Debugging in the Multiverse

Will Wilson, CEO
September 10, 2024

Would figuring out your bugs and outages be easier if you had a time machine? We are now making a time machine directly available to all of our customers.

Sometimes, being a software engineer is a lot like being a crime scene investigator. Picture the situation: a car has crashed on an icy road early in the morning. Seems obvious enough. But maybe, just maybe, the brake lines were cut by somebody who wanted the driver dead. Or what if he was drugged? Can we distinguish that scenario from him being sleepy? Our best bet is to surround the scene with that yellow police tape so nobody disturbs it, and to hope that time and chance haven't obliterated the evidence we'd need to figure it out.

What if there was a better way?

I've been involved in too many production outages and emergencies whose aftermath felt just like that. Eventually all the alerts and alarms get resolved and the error rates creep back down. And then what? Cordon the servers off with yellow police tape? The bug that caused the outage is there in your code somewhere, but it may have taken some outrageously specific circumstances to trigger it. You better pray that somebody added exactly the piece of logging or telemetry that you needed to figure it out, because it could be impossible to reproduce the issue in a controlled way.

This puts us in an awkward spot. When writing the code, we need to somehow anticipate what our future selves, who are investigating some disaster, will wish that we had done in the past.
When we succeed at this, we collect huge volumes of logs "just in case" they provide some crucial clue, incurring equally huge storage costs. But I am not a very foresightful person, so I usually don't even get that far. In these cases, what I really wish I had is a time machine that would let me rewind to 5 seconds before the crash, freeze time, and give the car (or my servers) a good look.

Bring back my files

Obviously the first feature I want from my time machine is the same one I want whenever I accidentally delete data from my hard drive, install malware, or say something dumb in a sensitive conversation: sleep -5. We can do that! (See video.)

What exactly is happening here? Antithesis simulates a purely deterministic universe. The reasons we do that are to find bugs faster, and to make them perfectly reproducible once found. But if you can perfectly simulate something, then you can also perfectly simulate it up until 5 seconds from the end.^1 Then we can crack open that universe and give you a bash terminal inside of it. The resulting universe is still deterministic, just dependent on what you decided to do.

1. In practice, we never need to replay the simulation from the beginning, because our hypervisor also supports fast and efficient snapshotting of the state of the guest system. See this talk by Alex Pshenichkin.

Information from the past

Let's get more concrete. Let's use this to solve a real problem. My server has crashed and its process has exited! No worries, I'll just rewind time, attach a debugger to the process, and set a breakpoint or capture a thread dump:

Packets from the past

Or you know what? I can't count the number of times I was trying to figure out where my consensus protocol went wrong and wished I had a dump of all the network traffic. No biggie, I'll just go back in time and decide I was capturing the traffic all along:

What was slow?

Strange and transient performance problems are a snap.
Once Antithesis has found them for me, I can rewind time and enable profiling for the period of interest. I don't need to worry about figuring out how to trigger the pathology again:

Back to the future

Like any good time machine, we can travel to the future too. The nice thing about a deterministic universe is that if the thing you're simulating is mostly idle, you can just simulate it faster. Here we are waiting for 10 minutes to pass in just a few seconds. This kind of time compression is very useful when debugging networked systems:

Change the past

But let's be real: if you or I had an actual time machine, we wouldn't be able to resist the temptation to go back to some historic event, change it, and then return to the present and see what's different. But that's a pretty useful technique when debugging too! We call it "multiverse debugging." Let's rewind time, turn off fault injection on our Kafka cluster, and see if the NPE still happens:

Imagine an extreme version of this. You could rewind a second, explore a thousand tiny variations of the past, and compute the proportion of them that still see the bug. Then you could rewind two seconds and do the same thing. Do that enough times and you've just recreated the Antithesis bug report. But with this new tool we're giving you, you could have invented that bug report yourself. What else could you invent?

A reactive multiverse

You may be wondering what's up with this interface I'm showing you. It's just a browser-based reactive notebook. But by connecting the notebook to a deterministic hypervisor, we get access to a definitionally side-effect-free world. This means there's no room to smuggle state, so we can make it truly reactive, even when it's running commands on the Linux system running inside the hypervisor. When you change the text in the notebook it immediately reacts, and the hypervisor reacts too. The UI is just the inevitable result of the notebook text causing a deterministic computation.
UI = f(code), even when that code is injecting commands into a distributed system.

It's available now

We haven't even begun to scratch the surface of this capability, but this post is already too long. We are rolling this out to all of our existing customers today. If you're already a customer, get started with our new documentation. If you're not already a customer, contact us. We'll find your bugs and then give you a universe-hopping time machine to fix them.
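
One last thing, for readers who want the "rewind a second, explore a thousand tiny variations" idea from "Change the past" in code. This is a toy sketch in Python, not the Antithesis implementation or API. Everything in it is invented for illustration: `World`, `run`, `find_crash`, `branch`, and `bug_rate` are hypothetical names, and the "system" is a single queue that crashes once it backs up to 8 items. The only thing it borrows from the post is the principle that when all randomness flows from a seed, you can snapshot a run, rewind to any tick, reseed the future, and count how many alternate universes still hit the bug.

```python
import copy
import random
from dataclasses import dataclass, field


@dataclass
class World:
    """Toy deterministic universe: all randomness comes from one seeded RNG."""
    rng: random.Random = field(compare=False, repr=False)
    tick: int = 0
    queue_depth: int = 0
    crashed: bool = False

    def step(self) -> None:
        if self.crashed:
            return
        self.tick += 1
        self.queue_depth += self.rng.randint(0, 2)       # 0-2 arrivals
        self.queue_depth = max(0, self.queue_depth - 1)  # serve one item
        if self.queue_depth >= 8:                        # the made-up "bug"
            self.crashed = True


def run(seed: int, ticks: int) -> list[World]:
    """Simulate from the start; history[i] is a snapshot after i ticks."""
    world = World(rng=random.Random(seed))
    history = [copy.deepcopy(world)]
    for _ in range(ticks):
        world.step()
        history.append(copy.deepcopy(world))
    return history


def find_crash(max_seed: int = 100, ticks: int = 200) -> tuple[int, list[World], int]:
    """Return (seed, history, crash_tick) for the first seed whose run crashes."""
    for seed in range(max_seed):
        history = run(seed, ticks)
        for i, snapshot in enumerate(history):
            if snapshot.crashed:
                return seed, history, i
    raise RuntimeError("no crashing run found; try more seeds or ticks")


def branch(snapshot: World, variation: int, ticks: int) -> World:
    """Fork a snapshot into an alternate universe by reseeding its future."""
    world = copy.deepcopy(snapshot)
    world.rng = random.Random(variation)
    for _ in range(ticks):
        world.step()
    return world


def bug_rate(history: list[World], crash_tick: int, rewind: int,
             variations: int = 1000) -> float:
    """Rewind `rewind` ticks before the crash; return the share of forks that crash."""
    if not 0 <= rewind <= crash_tick:
        raise ValueError("cannot rewind past the start of the run")
    snapshot = history[crash_tick - rewind]
    hits = sum(branch(snapshot, v, rewind).crashed for v in range(variations))
    return hits / variations


if __name__ == "__main__":
    seed, history, crash_tick = find_crash()
    print(f"seed {seed} crashes at tick {crash_tick}")
    for rewind in (1, 2, 4, 8):
        rate = bug_rate(history, crash_tick, rewind)
        print(f"rewind {rewind} ticks: {rate:.1%} of universes still crash")
```

Rewinding zero ticks gives a rate of 1.0 by construction, since the crash has already happened. How the rate moves as you rewind further depends on the particular run, which is exactly the kind of question this setup lets you ask. Because every fork is seeded, asking the same question twice gives the same answer.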