[HN Gopher] Debugging operating systems with time-traveling virt...
       ___________________________________________________________________
        
       Debugging operating systems with time-traveling virtual machines
       (2005) [pdf]
        
       Author : Intralexical
       Score  : 140 points
       Date   : 2024-08-18 18:28 UTC (1 days ago)
        
 (HTM) web link (www.usenix.org)
 (TXT) w3m dump (www.usenix.org)
        
       | drewg123 wrote:
       | This needs a [2005] qualifier
        
       | Veserv wrote:
       | A history of other time traveling debugging papers and products
       | (including this one):
       | 
       | https://jakob.engbloms.se/archives/1554
       | 
       | https://jakob.engbloms.se/archives/1564
        
       | fatcunt wrote:
       | Microsoft has, or had, a similar technology they use internally,
       | called TKO:
       | https://www.microsoft.com/security/blog/2020/05/04/mitigatin...
       | 
       | It's written in Rust and is based around a version of Bochs
       | modified for deterministic execution. It's got time-travel
       | debugging (with WinDbg), which works by replaying forward from
       | the nearest snapshot to the point at which the user is asking to
       | move backwards to.
       | 
       | The primary author of this software wanted to open source it, but
       | the higher-ups at MSFT refused. He's been working on similar
       | projects in a personal capacity though, e.g.
       | https://gamozolabs.github.io/fuzzing/2020/12/06/fuzzos.html
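
The "replay forward from the nearest snapshot" idea in the comment above can be sketched in a few lines of Python. This is not TKO's code (which is not public). `ReplayDebugger`, `step`, and the snapshot interval are hypothetical names and values, and the "machine" is a toy whose whole state is one integer advanced by a pure function.

```python
from bisect import bisect_right


def step(state, t):
    """Toy deterministic transition: the same (state, t) always gives the same result."""
    return (state * 31 + t) % 1_000_003


class ReplayDebugger:
    """Moves backwards in time by restoring a snapshot and re-executing forward."""

    def __init__(self, initial_state, snapshot_interval=100):
        self.snapshot_interval = snapshot_interval
        self.snapshots = {0: initial_state}  # time -> saved state
        self.time = 0
        self.state = initial_state
        self.last_replay_len = 0  # steps re-executed by the most recent seek

    def _run_forward(self, target):
        while self.time < target:
            self.state = step(self.state, self.time)
            self.time += 1
            if self.time % self.snapshot_interval == 0:
                # State is an immutable int here; a real machine state
                # would need a copy (or copy-on-write pages).
                self.snapshots[self.time] = self.state

    def seek(self, target):
        """Go to any point in time, forwards or backwards."""
        if target < 0:
            raise ValueError("target time must be non-negative")
        if target < self.time:
            # "Backwards" means: restore the nearest snapshot at or before target.
            times = sorted(self.snapshots)
            base = times[bisect_right(times, target) - 1]
            self.time, self.state = base, self.snapshots[base]
        start = self.time
        self._run_forward(target)
        self.last_replay_len = target - start
```

With a snapshot every 100 steps, seeking from time 1000 back to time 457 restores the snapshot at 400 and re-executes 57 steps. The approach only works because `step` is deterministic, which is why the comment stresses a Bochs modified for deterministic execution.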
        
         | Cyph0n wrote:
         | Watched some of his streams a while back. One of the most
         | talented/productive devs I have ever seen tbh.
        
         | grepfru_it wrote:
         | The trick at Microsoft is to start working on your project in
         | your spare time. Then incorporate it into your project at MSFT.
         | You get the clout associated with having an open source project
         | but then you also get to use it internally as a sanctioned tool
        
           | DaiPlusPlus wrote:
           | > The trick at Microsoft is to start working on your project
           | in your spare time. Then incorporate it into your project at
           | MSFT.
           | 
           | (Ex-msftie here)
           | 
           | Microsoft is great for being amongst the tiny number of
           | software companies that not only allow their FTEs to have
           | their own software projects without attempting to claim any
           | kind of ownership (compare with Apple where I'm told having
           | even a public GitHub account without manager-approval is
           | grounds for termination...) - but MS even actively encourages
           | it too (though I cynically note that was when the Windows
           | Phone App Store's app shortage was thought of as a serious
           | problem).
           | 
           | BUUUUT - they do require you to strictly avoid cross-over
           | with their own products: you can't just re-use your own code
           | in a company project, or even reimplement it from memory.
           | Yes, there are ways of doing this properly (involving weeks
           | of meetings with LCA arranging a permanent, irrevocable - and
           | possibly exclusive - rights transfer, wasting time better
           | spent on other things) - so your project had better be
           | something special _and_ you'd need to give up hope of getting
           | rich from it (save a very modest bonus the year you do the
           | LCA paperwork, if your skip-level sees any value in it).
        
             | pjmlp wrote:
             | Yeah, and that is how stuff like C++/CX gets killed,
             | replaced by a side project that was never at the same
             | tooling level in Visual Studio, even though there was a
             | CppCon talk promising otherwise, and currently is in
             | maintenance state because "it achieved its goals", who cares
             | about those paying customers that had to rewrite their code
             | to a less capable tool, now abandoned.
        
         | ttecho2021 wrote:
         | Wow I stumbled on Gamozolabs' youtube when looking into
         | determinism and hypervisors. Didn't know he also worked on TKO
         | for MSFT. Glad he is making a similar project!
        
         | ashkankiani wrote:
         | I wrote a similar snapshot system for our deterministic trading
         | engine. It's hard to imagine a system that doesn't do that
         | unless you actually enforce every single event in your log to
         | be reversible/non-destructive/non-aliasing. Even then, you
         | still want snapshots to jump to a point in time. The only
         | annoying thing is the case where you want to step back one,
         | meaning a naive implementation would jump back to the last
         | snapshot and play forward.
         | 
         | An 80% solution is to keep the last N states in memory.
         | Snapshots compress well within a small time frame, so whenever
         | we "paused" the playback, we could stash deltas from the pause
         | point to reconstruct stuff (I sadly never got around to
         | implementing this part before I left since it wasn't high
         | enough priority).
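
The "keep the last N states in memory" idea above can be sketched as a bounded cache sitting in front of snapshot replay. This is an illustration under stated assumptions, not the commenter's trading engine: `RecentStateCache`, `apply_event`, and the parameter names are hypothetical, the event log is assumed fixed, and the toy event (add, clamped at zero) is deliberately destructive so it cannot simply be undone.

```python
from collections import deque


def apply_event(state, event):
    """Toy destructive event: clamping at zero loses information."""
    return max(0, state + event)


class RecentStateCache:
    def __init__(self, initial_state, snapshot_every, keep_last):
        self.snapshot_every = snapshot_every
        self.snapshots = {0: initial_state}  # sparse: index -> state
        # Dense: the last `keep_last` (index, state) pairs, newest last.
        self.recent = deque([(0, initial_state)], maxlen=keep_last)
        self.index = 0
        self.state = initial_state

    def advance(self, event):
        self.state = apply_event(self.state, event)
        self.index += 1
        self.recent.append((self.index, self.state))
        if self.index % self.snapshot_every == 0:
            self.snapshots[self.index] = self.state

    def step_back(self, log):
        """Move back one event; returns which path was used."""
        if self.index == 0:
            raise IndexError("already at the start of history")
        target = self.index - 1
        # Fast path: the previous state is still in the in-memory window.
        if len(self.recent) >= 2 and self.recent[-2][0] == target:
            self.recent.pop()
            self.index, self.state = self.recent[-1]
            return "cache"
        # Slow path: jump to the last snapshot and play forward.
        base = max(i for i in self.snapshots if i <= target)
        state = self.snapshots[base]
        for event in log[base:target]:
            state = apply_event(state, event)
        self.index, self.state = target, state
        self.recent.clear()
        self.recent.append((target, state))
        return "replay"
```

Single steps backwards are served from the window until it runs out, and only then does the naive snapshot-and-replay path kick in. Storing deltas instead of full states, as the comment suggests, would shrink the window's memory cost but is not shown here.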
        
           | mark_undoio wrote:
           | > Even then, you still want snapshots to jump to a point in
           | time. The only annoying thing is the case where you want to
           | step back one, meaning a naive implementation would jump back
           | to the last snapshot and play forward.
           | 
           | At Undo.io we use similar techniques to time travel debug
           | C/C++ (and other languages).
           | 
           | > An 80% solution is to keep the last N states in memory.
           | 
           | We use a slightly more specialised version of what you
           | describe - we maintain a set of snapshots at wide spacings
           | throughout history (for big time jumps the user might want to
           | do) and ones just a small gap before the "current time" the
           | user is viewing, so small steps backwards can be fast too.
           | 
           | It's been a complex process figuring out an algorithm that
           | balances these concerns without using up unreasonable amounts
           | of RAM.
           | 
           | The _other_ way to tackle slowness - though this is for
           | larger jumps in history - is to search the timeline in
           | parallel. Since you've got deterministic re-execution you
           | can play several chunks of history at once while looking for
           | a key event. It can't help for small jumps in time but if you
           | are looking far into the past it can be a significant
           | speed-up.
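
The parallel timeline search described above can be sketched as follows. This is not Undo's algorithm, only a toy under stated assumptions: `step`, `take_snapshots`, `last_match_in_chunk`, and `find_last_match` are hypothetical names, the state is a single integer, and the "key event" is any state satisfying a predicate.

```python
from concurrent.futures import ThreadPoolExecutor


def step(state, t):
    """Toy deterministic transition used for both recording and replay."""
    return (state * 31 + t) % 1_000_003


def take_snapshots(initial_state, total_steps, interval):
    """Run once, saving the state every `interval` steps (time -> state)."""
    snapshots = {0: initial_state}
    state = initial_state
    for t in range(total_steps):
        state = step(state, t)
        if (t + 1) % interval == 0:
            snapshots[t + 1] = state
    return snapshots


def last_match_in_chunk(start_time, start_state, end_time, predicate):
    """Replay one chunk; return the latest time in (start, end] that matches."""
    state, found = start_state, None
    for t in range(start_time, end_time):
        state = step(state, t)
        if predicate(state):
            found = t + 1
    return found


def find_last_match(snapshots, total_steps, predicate, workers=4):
    """Most recent matching time in the whole history, or None."""
    starts = sorted(snapshots)
    ends = starts[1:] + [total_steps]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk starts from its own snapshot, so chunks are independent.
        results = pool.map(
            lambda bounds: last_match_in_chunk(
                bounds[0], snapshots[bounds[0]], bounds[1], predicate
            ),
            zip(starts, ends),
        )
        hits = [r for r in results if r is not None]
    return max(hits) if hits else None
```

Chunks are independent only because re-execution is deterministic: each worker reproduces exactly the states the original run saw. In CPython, threads will not speed up CPU-bound replay like this toy; the sketch shows the structure, and a real tool would presumably use separate processes or native workers.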
        
       | roca wrote:
       | I wonder if this inspired the VMware VM record-and-replay
       | functionality that came out in 2008. They discontinued it in
       | 2011, but it's important to me because we used it at Mozilla to
       | great effect and that made it easier for me to get Mozilla to
       | support the development of rr, which started in 2011.
        
         | icholy wrote:
         | I don't get how rr isn't more popular.
        
         | soamv wrote:
         | It might have, I remember attending a talk by Peter Chen when I
         | was at VMware around that time, and I know there was some kind
         | of collaboration. (I wasn't involved in record-replay at
         | VMware, but I was interested in it due to some ancient work I
         | did with userspace record-replay debugging,
         | https://lizard.sf.net).
         | 
         | rr is fantastic work, mad props! And the multiprocess stuff in
         | pernosco looks super neat.
        
         | purple-dragon wrote:
         | Yes. There's a whole family of papers by Chen, his students,
         | and other VMware folks in this vein.
        
         | gwd wrote:
         | Yes, at some point when I was a student I went out to VMware
         | and gave a presentation to a team of their engineers.
         | 
         | The thing about execution replay is that, in the general case,
         | it's _incredibly fragile_. I happened to chat to some people
         | from VMware in 2008 and they said that there was a team of 10
         | people whose only job it was to fix execution replay when it
         | broke. The main thing I learned from my PhD on execution replay
         | was how to debug insane bugs.
         | 
         | Presumably rr, focusing on processes, offers a more constrained
         | environment that's easier to log & replay.
        
           | mark_undoio wrote:
           | > Presumably rr, focusing on processes, offers a more
           | constrained environment that's easier to log & replay.
           | 
           | As someone who's worked elsewhere on time travel debug, I'm
           | really curious on @roca's take on this - because I'd have
           | expected a full-VM solution to be easier to make reliable.
           | 
           | Hardware-level behaviour _sounds_ harder but it's
           | well-constrained. The behaviour an OS can rely on from the
           | hardware is large but well-documented and slow to change.
           | 
           | In contrast, the process boundary is really ill-defined and
           | permeable. Also, when you need things to be _precisely_ the
           | same, you notice bits of kernel behaviour leaking into the
           | user space ABI in unexpected ways.
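
The fragility discussed in this subthread can be illustrated with a toy record/replay layer. It is a sketch only, not how rr, Undo, or VMware work: `Recorder`, `Replayer`, `ReplayDivergence`, and `program` are hypothetical names, and the "boundary" is just a wrapper that the program must route every nondeterministic input through.

```python
import random
import time


class ReplayDivergence(Exception):
    """Raised when replay asks for an input the recording did not contain."""


class Recorder:
    """Performs the real call and logs its result."""

    def __init__(self):
        self.log = []

    def call(self, name, fn, *args):
        result = fn(*args)
        self.log.append((name, result))
        return result


class Replayer:
    """Never performs the real call; feeds back the logged results in order."""

    def __init__(self, log):
        self.log = list(log)
        self.pos = 0

    def call(self, name, fn, *args):
        if self.pos >= len(self.log):
            raise ReplayDivergence(f"no recorded input left for {name!r}")
        recorded_name, result = self.log[self.pos]
        if recorded_name != name:
            raise ReplayDivergence(f"expected {recorded_name!r}, got {name!r}")
        self.pos += 1
        return result


def program(env):
    """Toy program whose only nondeterminism goes through env.call."""
    now = env.call("time", time.time_ns)
    roll = env.call("random", random.randint, 1, 6)
    return (now % 1000) * 10 + roll
```

Running `program(Recorder())` and then `program(Replayer(log))` gives the same result because every nondeterministic input crossed the wrapper. Any input that leaks around it (here, a direct `time.time_ns()` call) would make the replay silently differ, which is the kind of permeability the comment attributes to the process boundary.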
        
       | userbinator wrote:
       | Debuggers have had history tracing functionality for a long time,
       | but being extremely slow and consuming a lot of storage meant it
       | was rarely used except for very specific cases. Now that CPUs are
       | faster and the average machine has a lot more RAM, it becomes
       | more feasible to do this.
        
       ___________________________________________________________________
       (page generated 2024-08-19 23:01 UTC)