[HN Gopher] Debugging operating systems with time-traveling virt...
___________________________________________________________________
Debugging operating systems with time-traveling virtual machines
(2005) [pdf]
Author : Intralexical
Score : 140 points
Date : 2024-08-18 18:28 UTC (1 day ago)
(HTM) web link (www.usenix.org)
(TXT) w3m dump (www.usenix.org)
| drewg123 wrote:
| This needs a [2005] qualifier
| Veserv wrote:
| A history of other time traveling debugging papers and products
| (including this one):
|
| https://jakob.engbloms.se/archives/1554
|
| https://jakob.engbloms.se/archives/1564
| fatcunt wrote:
| Microsoft has, or had, a similar technology they use internally,
| called TKO:
| https://www.microsoft.com/security/blog/2020/05/04/mitigatin...
|
| It's written in Rust and is based around a version of Bochs
| modified for deterministic execution. It's got time-travel
| debugging (with WinDbg), which works by replaying forward from
| the nearest snapshot to the point at which the user is asking to
| move backwards to.
|
| The primary author of this software wanted to open source it, but
| the higher-ups at MSFT refused. He's been working on similar
| projects in a personal capacity though, e.g.
| https://gamozolabs.github.io/fuzzing/2020/12/06/fuzzos.html
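A minimal Python sketch of the replay-forward idea described in the comment above: moving backwards is done by restoring the nearest earlier snapshot and deterministically re-executing up to the target. This is not TKO's code. The toy state, the lossy `step` update, and the names `record` and `goto` are all assumptions made for illustration.

```python
import copy


def step(state, event):
    """Apply one logged event deterministically. The update discards
    information, so it cannot simply be run in reverse."""
    state["acc"] = (state["acc"] * 10 + event) % 1000
    state["steps"] += 1


def record(events, interval):
    """Run forward once, saving a full snapshot every `interval` events."""
    state = {"acc": 0, "steps": 0}
    snapshots = {0: copy.deepcopy(state)}
    for i, event in enumerate(events, start=1):
        step(state, event)
        if i % interval == 0:
            snapshots[i] = copy.deepcopy(state)
    return snapshots


def goto(events, snapshots, target):
    """Reach time `target` by restoring the nearest snapshot at or before
    it and replaying forward. Returns the state and the replay length."""
    base = max(t for t in snapshots if t <= target)
    state = copy.deepcopy(snapshots[base])
    for event in events[base:target]:
        step(state, event)
    return state, target - base


events = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
snapshots = record(events, interval=4)

# "Step back" from time 7 to time 6: restore a snapshot, then replay.
state_at_7, replayed = goto(events, snapshots, 7)
state_at_6, replayed_back = goto(events, snapshots, 6)
```

The cost of one backward step is bounded by the snapshot interval, which is the trade-off between memory used for snapshots and replay latency.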
| Cyph0n wrote:
| Watched some of his streams a while back. One of the most
| talented/productive devs I have ever seen tbh.
| grepfru_it wrote:
| The trick at Microsoft is to start working on your project in
| your spare time. Then incorporate it into your project at MSFT.
| You get the clout associated with having an open source project
| but then you also get to use it internally as a sanctioned tool
| DaiPlusPlus wrote:
| > The trick at Microsoft is to start working on your project
| in your spare time. Then incorporate it into your project at
| MSFT.
|
| (Ex-msftie here)
|
| Microsoft is great for being amongst the tiny number of
| software companies that not only allow their FTEs to have
| their own software projects without attempting to claim any
| kind of ownership (compare with Apple where I'm told having
| even a public GitHub account without manager-approval is
| grounds for termination...) - but MS even actively encourages
| it too (though I cynically note that was when the Windows
| Phone App Store's app shortage was thought of as a serious
| problem)
|
| BUUUUT - they do require you to strictly avoid cross-over
| with their own products: you can't just re-use your own code
| in a company project, or even reimplement it from-memory.
| Yes, there are ways of doing this properly (involving weeks
| of meetings with LCA arranging a permanent irrevocable - and
| possibly exclusive - rights transfer, wasting time better
| spent on other things) - so your project had better be
| something special _and_ you'd need to give up hope of getting
| rich from it (spare a very modest bonus the year you do the
| LCA paperwork if your skiplevel sees any value in it).
| pjmlp wrote:
| Yeah, and that is how stuff like C++/CX gets killed,
| replaced by a side project that was never at the same
| tooling level in Visual Studio, even though there was a
| CppCon talk promising otherwise, and currently is in
| maintenance state because "it achieved its goals", who cares
| about those paying customers that had to rewrite their code
| to a less capable tool, now abandoned.
| ttecho2021 wrote:
| Wow I stumbled on Gamozolabs' youtube when looking into
| determinism and hypervisors. Didn't know he also worked on TKO
| for MSFT. Glad he is making a similar project!
| ashkankiani wrote:
| I wrote a similar snapshot system for our deterministic trading
| engine. It's hard to imagine a system that doesn't do that
| unless you actually enforce every single event in your log to
| be reversible/non-destructive/non-aliasing. Even then, you
| still want snapshots to jump to a point in time. The only
| annoying thing is the case where you want to step back one,
| meaning a naive implementation would jump back to the last
| snapshot and play forward.
|
| An 80% solution is to keep the last N states in memory.
| Snapshots compress well within a small time frame, so whenever
| we "paused" the playback, we could stash deltas from the pause
| point to reconstruct stuff (I sadly never got around to
| implementing this part before I left since it wasn't high
| enough priority).
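The "keep the last N states in memory" idea above could look roughly like the following Python sketch. It is an illustration under assumptions, not the trading engine's implementation: the `Player` class, its fields, and the toy `step` function are hypothetical, and full copies are stored rather than the compressed deltas the comment mentions.

```python
import copy
from collections import deque


def step(state, event):
    """Deterministic but lossy update, so events cannot be undone directly."""
    state["acc"] = (state["acc"] * 10 + event) % 1000


class Player:
    def __init__(self, events, snapshot_interval=100, recent=8):
        self.events = events
        self.interval = snapshot_interval
        self.time = 0
        self.state = {"acc": 0}
        self.snapshots = {0: copy.deepcopy(self.state)}
        # Ring of (time, state) pairs for the steps just before `self.time`.
        self.recent = deque(maxlen=recent)
        self.replayed = 0  # events re-executed by slow backward steps

    def forward(self):
        self.recent.append((self.time, copy.deepcopy(self.state)))
        step(self.state, self.events[self.time])
        self.time += 1
        if self.time % self.interval == 0:
            self.snapshots.setdefault(self.time, copy.deepcopy(self.state))

    def back(self):
        if self.time == 0:
            raise ValueError("already at the start of history")
        target = self.time - 1
        if self.recent and self.recent[-1][0] == target:
            # Fast path: the previous state is still held in memory.
            self.time, self.state = self.recent.pop()
            return
        # Slow path: restore the nearest snapshot and replay forward.
        base = max(t for t in self.snapshots if t <= target)
        self.state = copy.deepcopy(self.snapshots[base])
        self.recent.clear()
        for event in self.events[base:target]:
            step(self.state, event)
            self.replayed += 1
        self.time = target


events = [(i * 7) % 10 for i in range(250)]
player = Player(events, snapshot_interval=100, recent=8)
for _ in range(250):
    player.forward()
for _ in range(8):
    player.back()  # served from the in-memory ring
fast_path_replays = player.replayed
player.back()  # ring exhausted: falls back to snapshot + replay
```

Only once the ring is exhausted does a backward step pay the full replay cost from the last snapshot.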
| mark_undoio wrote:
| > Even then, you still want snapshots to jump to a point in
| time. The only annoying thing is the case where you want to
| step back one, meaning a naive implementation would jump back
| to the last snapshot and play forward.
|
| At Undo.io we use similar techniques to time travel debug
| C/C++ (and other languages).
|
| > An 80% solution is to keep the last N states in memory.
|
| We use a slightly more specialised version of what you
| describe - we maintain a set of snapshots at wide spacings
| throughout history (for big time jumps the user might want to
| do) and ones just a small gap before the "current time" the
| user is viewing, so small steps backwards can be fast too.
|
| It's been a complex process figuring out an algorithm that
| balances these concerns without using up unreasonable amounts
| of RAM.
|
| The _other_ way to tackle slowness - though this is for
| larger jumps in history - is to search the timeline in
| parallel. Since you've got deterministic re-execution you
| can play several chunks of history at once while looking for
| a key event. It can't help for small jumps in time but if you
| are looking far into the past it can be a significant
| speed-up.
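The parallel timeline search described above can be sketched in Python as follows. This is a toy under stated assumptions, not Undo's algorithm: `step`, `record`, `search_chunk`, and `find_last` are hypothetical names, and the state is a single lossy counter. Each chunk between two snapshots is replayed independently, which is only valid because re-execution is deterministic.

```python
import copy
from concurrent.futures import ThreadPoolExecutor


def step(state, event):
    """Deterministic, lossy update standing in for real execution."""
    state["acc"] = (state["acc"] * 10 + event) % 1000


def record(events, interval):
    """Run forward once, saving a snapshot every `interval` events."""
    state = {"acc": 0}
    snapshots = {0: copy.deepcopy(state)}
    for i, event in enumerate(events, start=1):
        step(state, event)
        if i % interval == 0:
            snapshots[i] = copy.deepcopy(state)
    return snapshots


def search_chunk(events, start, start_state, end, predicate):
    """Replay events[start:end] from a snapshot and return the latest
    time in (start, end] at which `predicate` held, or None."""
    state = copy.deepcopy(start_state)
    last_hit = None
    for t in range(start, end):
        step(state, events[t])
        if predicate(state):
            last_hit = t + 1
    return last_hit


def find_last(events, snapshots, now, predicate, workers=4):
    """Latest time in (0, now] at which `predicate` held, searching
    every snapshot-delimited chunk concurrently."""
    starts = sorted(t for t in snapshots if t < now)
    bounds = starts[1:] + [now]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(search_chunk, events, s, snapshots[s], e, predicate)
            for s, e in zip(starts, bounds)
        ]
        hits = [f.result() for f in futures]
    hits = [h for h in hits if h is not None]
    return max(hits) if hits else None


events = [(i * 7) % 10 for i in range(1000)]
snapshots = record(events, interval=100)
last_seen = find_last(events, snapshots, 1000, lambda s: s["acc"] == 741)
```

Threads are used here only to keep the sketch short. In CPython they will not speed up CPU-bound replay, so a real implementation would more plausibly run the chunks in separate processes or machines.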
| roca wrote:
| I wonder if this inspired the VMware VM record-and-replay
| functionality that came out in 2008. They discontinued it in
| 2011, but it's important to me because we used it at Mozilla to
| great effect and that made it easier for me to get Mozilla to
| support the development of rr, which started in 2011.
| icholy wrote:
| I don't get how rr isn't more popular.
| soamv wrote:
| It might have, I remember attending a talk by Peter Chen when I
| was at VMware around that time, and I know there was some kind
| of collaboration. (I wasn't involved in record-replay at
| VMware, but I was interested in it due to some ancient work I
| did with userspace record-replay debugging,
| https://lizard.sf.net).
|
| rr is fantastic work, mad props! And the multiprocess stuff in
| pernosco looks super neat.
| purple-dragon wrote:
| Yes. There's a whole family of papers by Chen, his students,
| and other VMware folks in this vein.
| gwd wrote:
| Yes, at some point when I was a student I went out to VMware
| and gave a presentation to a team of their engineers.
|
| The thing about execution replay is that, in the general case,
| it's _incredibly fragile_. I happened to chat to some people
| from VMware in 2008 and they said that there was a team of 10
| people whose only job it was to fix execution replay when it
| broke. The main thing I learned from my PhD on execution replay
| was how to debug insane bugs.
|
| Presumably rr, focusing on processes, offers a more constrained
| environment that's easier to log & replay.
| mark_undoio wrote:
| > Presumably rr, focusing on processes, offers a more
| constrained environment that's easier to log & replay.
|
| As someone who's worked elsewhere on time travel debug, I'm
| really curious on @roca's take on this - because I'd have
| expected a full-VM solution to be easier to make reliable.
|
| Hardware-level behaviour _sounds_ harder but it's
| well-constrained. The behaviour an OS can rely on from the
| hardware is large but well-documented and slow to change.
|
| In contrast, the process boundary is really ill-defined and
| permeable. Also, when you need things to be _precisely_ the
| same, you notice bits of kernel behaviour leaking into the
| user space ABI in unexpected ways.
| userbinator wrote:
| Debuggers have had history tracing functionality for a long time,
| but being extremely slow and consuming a lot of storage meant it
| was rarely used except for very specific cases. Now that CPUs are
| faster and the average machine has a lot more RAM, it becomes
| more feasible to do this.
___________________________________________________________________
(page generated 2024-08-19 23:01 UTC)