[HN Gopher] Debugging operating systems with time-traveling virt...
       ___________________________________________________________________
        
       Debugging operating systems with time-traveling virtual machines
       [pdf]
        
       Author : Intralexical
       Score  : 41 points
       Date   : 2024-08-18 18:28 UTC (4 hours ago)
        
 (HTM) web link (www.usenix.org)
        
       | Intralexical wrote:
       | Abstract: Operating systems are difficult to
       | debug with traditional cyclic debugging. They are non-
       | deterministic; they run for long periods of time; they interact
       | directly with hardware devices; and their state is easily
       | perturbed by the act of debugging. This paper describes a time-
       | traveling virtual machine that overcomes many of the difficulties
       | associated with debugging operating systems. Time travel enables
       | a programmer to navigate backward and forward arbitrarily through
       | the execution history of a particular run and to replay arbitrary
       | segments of the past execution. We integrate time travel into a
       | general-purpose debugger to enable a programmer to debug an OS in
       | reverse, implementing commands such as reverse breakpoint,
       | reverse watchpoint, and reverse single step. The space and time
       | overheads needed to support time travel are reasonable for
       | debugging, and movements in time are fast enough to support
       | interactive debugging. We demonstrate the value of our time-
       | traveling virtual machine by using it to understand and fix
       | several OS bugs that are difficult to find with standard
       | debugging tools. Reverse debugging is especially helpful in
       | finding bugs that are fragile due to non-determinism, bugs in
       | device drivers, bugs that require long runs to trigger, bugs that
       | corrupt the stack, and bugs that are detected after the relevant
       | stack frame is popped.
       | 
       | Related work:
       | 
       | "ConSnap: Taking continuous snapshots for running state
       | protection of virtual machines"
       | 
       | https://web.archive.org/web/20151014000129id_/http://act.bua...
       | ...In this paper, we present ConSnap, a system designed to enable
       | taking fine-grained continuous snapshots of virtual machines...
       | decrease the snapshot interval to dozens of milliseconds. We have
       | implemented ConSnap on QEMU/KVM.... Compared with the stop-and-
       | copy based incremental snapshots, ConSnap reduces the performance
       | loss by 71.1% ~ 10.2% under Compilation workload, and 14.5% ~
       | 4.7% for the Ftp workload, when the interval varies from 1s to
       | 60s.
       | 
       | "ReVirt: Enabling Intrusion Analysis Through Virtual-Machine
       | Logging and Replay"
       | 
       | https://www.usenix.org/legacy/events/osdi02/tech/full_papers...
       | ...ReVirt logs enough information to replay a long-term execution
       | of the virtual machine instruction-by-instruction. This enables
       | it to provide arbitrarily detailed observations about what
       | transpired on the system, even in the presence of non-
       | deterministic attacks and executions. ReVirt adds reasonable time
       | and space overhead. Overheads due to virtualization are
       | imperceptible for interactive use and CPU-bound workloads, and
       | 13-58% for kernel-intensive workloads. Logging adds 0-8%
       | overhead, and logging traffic for our workloads can be stored on
       | a single disk for several months.
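
To make the logging-and-replay idea in the ReVirt abstract above concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not ReVirt's implementation: `ToyMachine`, `record`, and `replay` are hypothetical names, and the only non-determinism modeled is an external input that arrives at some instruction count. The point it illustrates is that logging just those inputs, tagged with when they arrived, is enough to reproduce the whole run.

```python
import random


class ToyMachine:
    """Hypothetical machine whose state changes only via step()."""

    def __init__(self):
        self.icount = 0  # instructions executed so far
        self.acc = 0     # one register standing in for all machine state

    def step(self, external_input=None):
        # Deterministic part: depends only on the current state.
        self.acc = (self.acc * 31 + self.icount) % 1_000_003
        # Non-deterministic part: an external event, if one arrived now.
        if external_input is not None:
            self.acc ^= external_input
        self.icount += 1


def record(steps, seed=None):
    """Run 'live' and log each external input with its instruction count."""
    rng = random.Random(seed)
    machine = ToyMachine()
    log = []
    for _ in range(steps):
        # Assumption: roughly 10% of steps see an external event.
        event = rng.randrange(1 << 16) if rng.random() < 0.1 else None
        if event is not None:
            log.append((machine.icount, event))
        machine.step(event)
    return machine.acc, log


def replay(steps, log):
    """Re-execute from the initial state, injecting logged inputs at the same points."""
    pending = dict(log)
    machine = ToyMachine()
    for _ in range(steps):
        machine.step(pending.get(machine.icount))
    return machine.acc


final_state, log = record(1000, seed=1)
assert replay(1000, log) == final_state
```

The log holds only the external events, not the machine state, which is why the logging overhead in a scheme like this can stay small relative to the run itself.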
        
       | drewg123 wrote:
       | This needs a [2005] qualifier
        
       | Veserv wrote:
       | A history of other time traveling debugging papers and products
       | (including this one):
       | 
       | https://jakob.engbloms.se/archives/1554
       | 
       | https://jakob.engbloms.se/archives/1564
        
       | fatcunt wrote:
       | Microsoft has, or had, a similar technology they use internally,
       | called TKO:
       | https://www.microsoft.com/security/blog/2020/05/04/mitigatin...
       | 
       | It's written in Rust and is based around a version of Bochs
       | modified for deterministic execution. It's got time-travel
       | debugging (with WinDbg), which works by replaying forward from
       | the nearest snapshot to the point at which the user is asking to
       | move backwards to.
       | 
       | The primary author of this software wanted to open source it, but
       | the higher-ups at MSFT refused. He's been working on similar
       | projects in a personal capacity though, e.g.
       | https://gamozolabs.github.io/fuzzing/2020/12/06/fuzzos.html
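
The "replay forward from the nearest snapshot" approach described in the comment above can be sketched in a few lines of Python. This is only an illustration under assumptions, not how TKO or any real debugger is implemented: `ToyCPU` and `TimeTravel` are hypothetical names, execution is assumed fully deterministic, and a snapshot is just a deep copy of the toy state.

```python
import copy


class ToyCPU:
    """Hypothetical deterministic CPU: each step is a Fibonacci-style update."""

    def __init__(self):
        self.steps = 0
        self.regs = [0, 1]

    def step(self):
        a, b = self.regs
        self.regs = [b, a + b]
        self.steps += 1


class TimeTravel:
    """Moves backward by restoring a snapshot and re-executing forward."""

    def __init__(self, interval=100):
        self.cpu = ToyCPU()
        self.interval = interval
        # Always keep a snapshot of the initial state.
        self.snapshots = {0: copy.deepcopy(self.cpu)}

    def run_to(self, target):
        """Run forward to `target` steps, snapshotting every `interval` steps."""
        while self.cpu.steps < target:
            self.cpu.step()
            if self.cpu.steps % self.interval == 0:
                self.snapshots.setdefault(self.cpu.steps, copy.deepcopy(self.cpu))

    def goto(self, target):
        """Jump to any point in the run, earlier or later."""
        if target < 0:
            raise ValueError("cannot travel before the start of the run")
        if target < self.cpu.steps:
            # Going back: restore the nearest snapshot at or before the target.
            base = max(s for s in self.snapshots if s <= target)
            self.cpu = copy.deepcopy(self.snapshots[base])
        # Then replay forward (at most interval - 1 steps after a restore).
        self.run_to(target)

    def reverse_step(self):
        self.goto(self.cpu.steps - 1)


tt = TimeTravel(interval=10)
tt.run_to(57)
tt.reverse_step()
assert tt.cpu.steps == 56
```

The snapshot interval is the usual trade-off: shorter intervals cost more storage but bound how much has to be re-executed on each backward move.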
        
         | Cyph0n wrote:
         | Watched some of his streams a while back. One of the most
         | talented/productive devs I have ever seen tbh.
        
         | grepfru_it wrote:
         | The trick at Microsoft is to start working on your project in
         | your spare time. Then incorporate it into your project at MSFT.
         | You get the clout associated with having an open source project
         | but then you also get to use it internally as a sanctioned tool.
        
       | roca wrote:
       | I wonder if this inspired the VMware VM record-and-replay
       | functionality that came out in 2008. They discontinued it in
       | 2011, but it's important to me because we used it at Mozilla to
       | great effect and that made it easier for me to get Mozilla to
       | support the development of rr, which started in 2011.
        
         | icholy wrote:
         | I don't get how rr isn't more popular.
        
       | userbinator wrote:
       | Debuggers have had history tracing functionality for a long time,
       | but being extremely slow and consuming a lot of storage meant it
       | was rarely used except for very specific cases. Now that CPUs are
       | faster and the average machine has a lot more RAM, it becomes
       | more feasible to do this.
        
       ___________________________________________________________________
       (page generated 2024-08-18 23:00 UTC)