[HN Gopher] Deterministic Linux for controlled testing and softw...
___________________________________________________________________
Deterministic Linux for controlled testing and software bug-finding
Author : rrnewton
Score : 82 points
Date : 2022-11-22 18:12 UTC (4 hours ago)
(HTM) web link (developers.facebook.com)
(TXT) w3m dump (developers.facebook.com)
| jasonwhite wrote:
| TL;DR: This is a Rust project that forces deterministic execution
| of arbitrary programs and acts like a reproducible container.
| That is, it _hermetically_ isolates the program from sources of
| non-determinism such as time, thread interleavings, random number
| generation, etc. Guaranteed determinism is a powerful tool and it
| serves as a basis for a number of applications, including
| concurrency stress testing, record/replay, reproducible builds,
| automatic diagnosis of concurrency bugs, and more.
|
| I've been on the team working on this project over the past ~2
| years. AMA!
|
| Here is the GitHub repository:
| https://github.com/facebookexperimental/hermit
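|
| As a minimal sketch of what that hermetic isolation means in
| practice (the program below is hypothetical and only exercises the
| nondeterminism sources listed above):
|
|     // nondet.rs -- touches several common nondeterminism sources.
|     use std::io::Read;
|     use std::time::{SystemTime, UNIX_EPOCH};
|
|     fn main() {
|         // Wall-clock time (virtualized under Hermit).
|         let now = SystemTime::now()
|             .duration_since(UNIX_EPOCH)
|             .unwrap();
|         println!("time:   {}ns", now.as_nanos());
|
|         // Address-space layout (e.g. a stack address under ASLR).
|         let x = 42;
|         println!("addr:   {:p}", &x);
|
|         // OS-provided randomness.
|         let mut buf = [0u8; 8];
|         std::fs::File::open("/dev/urandom")
|             .unwrap()
|             .read_exact(&mut buf)
|             .unwrap();
|         println!("random: {:?}", buf);
|     }
|
| Under plain Linux each run prints different values; the claim above
| is that under `hermit run ./nondet` every run prints the same
| output, byte for byte.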
| mtlmtlmtlmtl wrote:
| This is exactly the type of thing I've been wanting for testing
| my chess engine. Parallelism is based on emergent pseudorandom
| effects of things like interleaving causing searcher threads to
| mostly end up in non-overlapping parts of the search tree.
|
| One question: How do you avoid the program being affected by
| things like overall system load and memory pressure?
| jasonwhite wrote:
| Since CPU is a "compressible" resource, system load doesn't
| affect the determinism. It'll make it slower of course. Since
| memory is a non-compressible resource, things can start
| getting killed by the OOM-killer and there's nothing we can
| do about it. There are also certainly things like external
| network communication and file system access that are non-
| deterministic that must be handled at a higher level (e.g.,
| with a reproducible FS image or by recording & replaying
| network traffic).
| mtlmtlmtlmtl wrote:
| The idea of CPU being compressible is very insightful,
| thanks.
|
| I'm curious about what happens with time control though.
| Engines can be given a time constraint and will use various
| heuristics to allocate that time. How does Hermit intercept
| gettimeofday()?
| rrnewton wrote:
| Well, gettimeofday is a syscall, and we do intercept it
| (along with clock_gettime, clock_getres, time,
| nanosleep, and the rdtsc instruction, even though that
| last one is not a syscall). When we intercept it, we
| report virtual time back to the guest. We make sure that
| virtual time is deterministic, across all threads in the
| container, irrespective of what the wall clock time is on
| the host machine.
|
| So for instance, if there are multiple threads in a chess
| engine, and they are racing to write search results to a
| data structure, these threads will interleave in a
| reproducible order under Hermit, and the races will
| resolve consistently.
|
| But the downside is that Hermit does sequentialize
| execution onto one core. So in the current version, a
| multithreaded program doesn't get actual wall-clock
| speedup from its parallelism. (The earlier dettrace did
| allow some limited guest parallelism, and we plan to
| bring that back.) For this reason, Hermit is good for
| consistently testing multithreaded software, but you
| wouldn't want to run parallel software under it outside
| of testing.
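|
| As a hypothetical illustration of the kind of race described above
| (two threads publishing results to a shared structure):
|
|     // race.rs -- two "searcher" threads race to publish results.
|     use std::sync::{Arc, Mutex};
|     use std::thread;
|
|     fn main() {
|         let results = Arc::new(Mutex::new(Vec::new()));
|         let mut handles = Vec::new();
|         for id in 0..2 {
|             let results = Arc::clone(&results);
|             handles.push(thread::spawn(move || {
|                 for depth in 0..5 {
|                     // Which thread appends first at each step is
|                     // a race under normal Linux scheduling.
|                     results.lock().unwrap().push((id, depth));
|                 }
|             }));
|         }
|         for h in handles {
|             h.join().unwrap();
|         }
|         println!("{:?}", results.lock().unwrap());
|     }
|
| Under plain Linux the printed order varies from run to run; per the
| description above, under `hermit run` the interleaving, and thus
| this output, should repeat exactly (at the cost of running on a
| single core).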
| comex wrote:
| Nice. Some questions:
|
| - How does this compare with rr?
|
| - Out of curiosity, have you ever looked at libTAS? (I realize
| it has a very different intended use case.)
|
| - Have you had any issues with behavior differences between
| CPUs? I know there is architecture-level undefined behavior
| that can differ between CPUs; on the other hand, it sounds like
| you're primarily interested in running well-behaved executables
| that wouldn't intentionally invoke such behavior.
| jasonwhite wrote:
| - We've taken a lot of inspiration from RR and it is very
| good at what it does (record/replay + randomizing thread
| schedules). This project diverges a bit from RR in that we
| focus more on the concurrency testing application of
| determinism. I wrote the toy record/replay system that sits
| on top of the determinism layer (so that we don't have to
| record as much stuff), but it's a long way from being
| complete.
|
| - I haven't seen libTAS until now, so I can't comment on it
| much. At first glance, it does look similar!
|
| - See @rrnewton's reply on this one.
| rrnewton wrote:
| (1) rr [formerly Mozilla rr]
|
| We're big fans of rr!
|
| Hermit is different in that it creates a deterministic OS
| semantics, rather than recording whatever nondeterministic
| behavior occurs under normal Linux. BUT,
| there's a lot of overlap. And indeed `hermit record` is
| straight up RnR (record & replay).
|
| But Hermit's RnR is not nearly as developed as rr. We
| integrate with gdb/lldb as an (RSP) debugger backend, just
| like rr. Any failing execution you can create with hermit,
| you can attach a debugger to. But our support is very
| preliminary, and you'll probably find rough edges. Also, we
| don't support backwards stepping yet (except by running
| again).
|
| If we invest more in using Hermit as a debugger (rather than
| for finding and analyzing concurrency bugs), then there
| should be _some_ advantages over traditional RnR. These would
| relate to the fact that deterministically executing is
| different than recording. For example, process and thread
| IDs, and memory addresses all stay the same across multiple
| runs of the program, even as you begin adding printfs and
| modifying the program to fix the bug. With traditional RnR,
| you can play the same recording as many times as you like,
| but as soon as you take a second recording all bets are off
| wrt what is the same or different compared to the prior
| recording. (That includes losing the "mental state" of
| things like tids & memory addresses, which is a good point
| Robert O'Callahan makes about the benefits of RnR when
| accessing the same recording multiple times.)
|
| (2) libTAS - no we haven't! Checking it out now.
|
| (3) Yes, definitely issues with CPU portability.
|
| In general, we are interested in not just determinism on the
| same machine, but portability between machines in our fleet.
| As with any tech company that uses the cloud, at Meta people
| are usually trying to debug an issue on a different machine
| than where the problem occurred. I.e. taking a crash from a
| production or CI machine to a local dev machine.
|
| The way we do this is that we mostly report a fairly old CPU
| to the guest, which disables certain features IF the guest is
| well behaved.
|
| With the current processor tech, I don't think there's any
| way we can stop an adversarial program, which, for example,
| would execute CPUID, find that RDRAND is not supported on the
| processor, but then execute RDRAND anyway. We could build a
| much more invasive binary-instrumentation based emulator that
| _would_ be able to enforce these kinds of rules at the
| instruction granularity, but it would have higher overhead,
| especially startup overhead. The nice thing about Reverie
| though is that we (or others) can add different
| instrumentation backends while keeping the same
| instrumentation programming API. So we could have a "hardened" backend
| that was more about sandboxing and reverse-engineering
| adversarial software, making a different tradeoff with
| respect to performance overhead.
| CJefferson wrote:
| This looks cool.
|
| Personally I'd also be interested in this for academic work --
| anything which makes it easier to be sure an experiment can be
| reproduced later (a week, year, or decade later, in increasing
| order of difficulty), is good to have.
| rrnewton wrote:
| Yes, I'm very interested in that as well. I've been involved
| with the ACM Artifact Evaluation process which has been going
| on in several conferences for a while.
|
| https://www.acm.org/publications/policies/artifact-review-
| ba...
|
| But it's been pretty frustrating. As an author, my PLDI 2014
| artifact stopped working less than 5 years later (Docker
| image binary incompatibility). And when I was co-chair of an
| Artifact Evaluation Committee in 2017, there was not great
| reproducibility of the artifacts that were submitted either.
|
| If you package a VM (freezing the Linux kernel), and are
| pretty sure that VM will run in 10 years, PLUS you
| determinize the execution itself... that should allow durable
| bitwise reproducibility. Maybe Hermit could be one ingredient
| of that.
|
| For scientific reproducibility, there is a lot of other
| tooling to build too, and I know some folks have been working
| in that area:
|
| https://ctuning.org/
| gmartres wrote:
| Thanks for open-sourcing this! Roughly, what's the performance
| overhead from running code under hermit? I'm wondering if this
| could be used for doing benchmarking with less variance on non-
| deterministic platforms such as the JVM (I assume hermit is
| "deterministic enough" that the JIT and GC threads of the JVM
| will run the same code on every execution?)
| rrnewton wrote:
| Alas the performance overhead in _realtime_ is not great yet.
| It still uses ptrace currently, which often results in a
| multiple-X slowdown (but at least it doesn't "subscribe" to
| every syscall like strace does, because some are naturally
| deterministic). Reverie's whole design is to make it support
| swappable backends, and this ptrace backend is just the
| reference implementation. The `experimental/reverie-sabre`
| directory in the Reverie repo contains our high performance
| backend, but it's still work-in-progress. It uses binary
| instrumentation and in our early experiments is 10X faster
| than our current backend in the worst case (i.e. strace is
| >10X faster when rewritten with reverie-sabre and run on a
| program that does nothing but syscalls).
|
| But to the second part of your question about deterministic
| benchmarking, that is really a separate question. Hermit
| defines a deterministic notion of virtual time, which is
| based on the branches retired and system calls executed by
| all threads. When you run hermit with `--summary`, it reports
| a total "Elasped virtual global time", which is completely
| deterministic:
|
| $ hermit run --summary /bin/date
|
| ...
|
| Elapsed virtual global (cpu) time: 5_039_700ns
|
| Therefore, any program that runs under hermit can get this
| deterministic notion of performance. We figured that could be
| useful for setting performance regression tests with very
| small regression margins (<1%), which you can't do on normal
| noisy systems. Compilers are one place I've worked where we
| wanted smaller performance regression alarms (for generated
| code) than we could achieve in practice. We haven't actually
| explored this application yet though. There's a whole small
| field of people studying performance modeling and prediction,
| and if one wanted to try this deterministic benchmarking
| approach, they might want to take some of that knowledge and
| build a more accurate performance model (better correlated
| with wall time) than Hermit's current virtual time.
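|
| A rough sketch of the regression-gate idea mentioned above,
| assuming only the `--summary` output format shown earlier (whether
| that line lands on stdout or stderr isn't specified here, so both
| are searched; the baseline value and test binary are placeholders):
|
|     // virt_time_gate.rs -- performance gate on deterministic
|     // virtual time rather than noisy wall-clock time.
|     use std::process::Command;
|
|     fn virtual_time_ns(program: &str) -> Option<u128> {
|         let out = Command::new("hermit")
|             .args(["run", "--summary", program])
|             .output()
|             .ok()?;
|         let text = format!(
|             "{}{}",
|             String::from_utf8_lossy(&out.stdout),
|             String::from_utf8_lossy(&out.stderr)
|         );
|         let line = text.lines().find(|l| {
|             l.contains("Elapsed virtual global (cpu) time:")
|         })?;
|         // Strip the `_` separators and the "ns" suffix.
|         let digits: String =
|             line.chars().filter(|c| c.is_ascii_digit()).collect();
|         digits.parse().ok()
|     }
|
|     fn main() {
|         let baseline_ns: u128 = 5_039_700; // from a known-good build
|         let now_ns = virtual_time_ns("./target/release/bench")
|             .expect("no summary line found");
|         // Fail on a >1% regression -- the tight margin described
|         // above that wall-clock benchmarking can't support.
|         assert!(
|             now_ns as f64 <= baseline_ns as f64 * 1.01,
|             "virtual-time regression: {} vs {}", now_ns, baseline_ns
|         );
|     }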
| dekhn wrote:
| Does running /bin/date under hermit always return the same
| time? Or does it just follow the same codepath to retrieve
| the actual time?
| [deleted]
| wyldfire wrote:
| > AMA!
|
| Eager to try it but encountering the build error here -
| https://github.com/facebookexperimental/hermit/issues/11
|
| Do you have a reference build log / environment you can share?
| Last known good commit sha and/or output from "rustup show"?
| jasonwhite wrote:
| We're working on it! Should be fixed soon.
| rrnewton wrote:
| Reverted a badly-timed breaking change that came through
| the sync system. Will fix it properly shortly (and add a
| Dockerfile and release tag). But for now you may have
| better luck on the main branch after that reversion, which
| yielded 6cb5575ffd287289769144ec82e2900cbf6cd1ad.
|
| Let's discuss further on that issue #11.
| rrnewton wrote:
| Note that this is a follow-on project from the earlier Dettrace
| system, which was applied mainly to reproducible builds (as in
| the academic paper,
| https://dl.acm.org/doi/10.1145/3373376.3378519, and presented
| to the Debian Reproducible Builds summit):
|
| - https://github.com/dettrace/dettrace
|
| And one cool part of it is this Rust program instrumentation
| layer:
|
| - https://github.com/facebookexperimental/reverie
|
| It's good for building OS-emulator style projects or tracing
| tools.
| oulipo wrote:
| Would something like this work for embedded code, like ESP32?
| rrnewton wrote:
| Well, probably not _on device_ ;-).
|
| The underlying Reverie instrumentation layer works on ARM,
| but Hermit isn't ported yet, and we haven't touched RISC-V
| yet at all. (Contributions welcome!)
|
| One thing we haven't tried yet is just putting a whole
| emulator (qemu etc) underneath Hermit. That would address any
| sources of irreproducibility that the emulator lets through
| from the host (threads, RNG, etc).
| wyldfire wrote:
| > thread interleavings
|
| I was going to ask if it could do the flip side - instead of
| stabilizing the scheduler, make it less predictable.
|
| AFAICT, it can! Awesome, looking forward to giving it a try.
| hermit run --chaos --seed-from=SystemRandom
| ./target/debug/hello_race;
| rrnewton wrote:
| Yes indeed.
|
| That concurrency testing capability is a pretty well-studied
| area and we implement a couple existing algorithms. The first
| is our adaptation of the PCT algorithm (ASPLOS'10
| https://www.microsoft.com/en-us/research/wp-
| content/uploads/...). That's what you get by default with
| `--chaos`.
|
| But we also have variations on a straight-up randomized
| scheduler (random thread selection at each time step).
|
| rr chaos mode has its own take on this:
| https://robert.ocallahan.org/2016/02/introducing-rr-chaos-
| mo...
|
| This study compares a few approaches - http://www.doc.ic.ac.u
| k/~afd/homepages/papers/pdfs/2016/TOPC....
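|
| In practice that means re-running a test under many random
| schedules. A hedged sketch of such a loop, using only the flags
| quoted in this thread (`--chaos`, `--seed-from=SystemRandom`);
| recovering a failing seed for replay is left to Hermit's own
| logging:
|
|     // chaos_loop.rs -- torture a racy test under chaos scheduling
|     // and count how many schedules make it fail.
|     use std::process::Command;
|
|     fn main() {
|         let mut failures = 0;
|         for i in 0..1000 {
|             let status = Command::new("hermit")
|                 .args([
|                     "run",
|                     "--chaos",
|                     "--seed-from=SystemRandom",
|                     "./target/debug/hello_race",
|                 ])
|                 .status()
|                 .expect("failed to launch hermit");
|             if !status.success() {
|                 failures += 1;
|                 eprintln!("iteration {} failed: {}", i, status);
|             }
|         }
|         println!("{} failing schedules out of 1000", failures);
|     }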
| daniel-levin wrote:
| Neat! This is the direction I'd hoped to see gvisor go in. What's
| the reasoning for building from scratch and not piggybacking off
| gvisor?
| jasonwhite wrote:
| We certainly looked into gVisor and Firecracker when we started
| this project a few years ago. These systems use KVM and gVisor
| in particular uses the Model Specific Registers (MSRs) to
| intercept system calls before forwarding them to the host
| kernel. Intercepting syscalls this way has less overhead than
| ptrace and we would have complete control over the system
| environment. I think it's a good approach and worth exploring
| more, but ultimately the deal breaker was that KVM requires
| root privileges to run and it wouldn't run on our already-
| virtualized dev machines. We also wanted to allow the guest
| program to interact with the host's file system. So, we went
| with good ol' ptrace. Last I checked gVisor also has a ptrace
| backend, but it wasn't very far along at the time. When going
| the ptrace route, there is less reason to depend on another
| project. Another reason of course is that we'd be beholden to a
| Google project. ;)
| srosenberg wrote:
| Great work and thanks for making it OSS! I was familiar with the
| prior (academic) work and its limitations, specifically TCP/IP.
| Could you elaborate on how you solved that problem?
| rrnewton wrote:
| Sure! So it really breaks down into two cases: internal and
| external networking relative to the container Hermit creates.
|
| (1) internal networking
|
| If you run a test like `rust/network_hello_world.rs` under
| Hermit, then the communication between threads is part of the
| "deterministic bubble" that we're running inside of. When one
| thread blocks on a network call, the Hermit scheduler takes the
| thread out of the run pool, and it has to deterministically
| decide when it is ready to rejoin the run pool by waking up.
| The scheduler proceeds in linear turns (labeled "COMMIT" in the
| logs), and if thread 5 unblocks from a network read at turn 100
| in one run, it must unblock at that same point in time in all
| other runs.
|
| Sometimes we use a precise model of the blocking operation
| (like with futexes) and other times we depend on sending Linux
| a non-blocking version of the syscall as a way to poll the IO
| and see if it is ready to complete (given the history of every
| operation that has committed on turns 1..N-1).
|
| (2) external networking
|
| This is impossible to determinize, of course. Unless you suck
| the whole network including both hosts into the deterministic
| bubble, as the DDOS fork of Linux experimented with in ~2013.
| That was kind of a negative result IMO because performance was
| pretty bad, but the paper is here:
| https://www.dcc.fc.up.pt/~ines/aulas/1314/SDM/papers/DDOS.pdf
|
| That's where record-replay comes in. `hermit record` can record
| network calls, but is in a pretty early state and doesn't
| support many programs. `hermit run` can just allow networking
| through and hope for the best, but in the future we plan to add
| features to record _just_ network calls (and no other
| syscalls), so that you can mix and match different external-
| network-responses with different thread schedules. That is, you
| could "pin" the network responses with network-only recording,
| and then mess around with other parameters or even modify the
| program.
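|
| As a hypothetical stand-in for that kind of internal-networking
| test (not the repo's own network_hello_world, just a sketch of the
| same shape): both endpoints live inside the container, so every
| block/unblock is decided by the deterministic scheduler turns
| described above.
|
|     // tcp_hello.rs -- loopback client/server inside one process.
|     use std::io::{Read, Write};
|     use std::net::{TcpListener, TcpStream};
|     use std::thread;
|
|     fn main() {
|         let listener = TcpListener::bind("127.0.0.1:0").unwrap();
|         let addr = listener.local_addr().unwrap();
|
|         let server = thread::spawn(move || {
|             // accept() blocks; the scheduler decides on which turn
|             // this thread rejoins the run pool.
|             let (mut conn, _) = listener.accept().unwrap();
|             let mut buf = [0u8; 5];
|             conn.read_exact(&mut buf).unwrap();
|             conn.write_all(&buf).unwrap(); // echo back
|         });
|
|         let mut client = TcpStream::connect(addr).unwrap();
|         client.write_all(b"hello").unwrap();
|         let mut reply = [0u8; 5];
|         client.read_exact(&mut reply).unwrap();
|         println!("got {:?}", std::str::from_utf8(&reply).unwrap());
|
|         server.join().unwrap();
|     }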
| teknopaul wrote:
| Can you explain how making flakey tests not flakey helps find
| bugs? I would have thought these differences are essentially
| free fuzzing, and desirable.
| rrnewton wrote:
| Sure! I think there's a really subtle point underpinning your
| question. And I think the answer is in the different
| purposes of regression testing and bug finding. In regression
| testing (CI), you're testing if the code introduced new
| problems. You don't at that point in time really want to know
| that someone else's test downstream from your component fails
| when given a new thread schedule that it has not previously
| seen. Whereas if you're stress testing (including fuzzing and
| concurrency testing) you probably want to torture the program
| overnight to see if you can turn up new failures.
|
| The Coyote project at Microsoft is a concurrency testing
| project with some similarities to Hermit. For the reasons
| above, they say in their docs to use a constant seed for CI
| regression testing, but use random exploration for bug finding:
| https://www.microsoft.com/en-us/research/project/coyote/
|
| Still, it does feel like wasted resources to test the same
| points in the (exponentially large) schedule space again and
| again. Kind of like some exploration/exploitation tradeoff.
|
| We don't do it yet, but I would consider doing randomized
| exploration during CI while making the observable semantics
| the fixed version. If the randomized one fails, send that over to
| the "bug finding" component for further study, while quickly
| retrying with the known-good seed for the CI visible regression
| test results.
|
| I don't think there's one right policy here. But having control
| over these knobs lets us be intentional about it.
|
| P.S. Taking the random schedules the OS gives us is kind of
| "free fuzzing", but it is very BAD free fuzzing. It over-
| samples the probable, boring schedules and under-samples the
| more extreme corner cases. Hence concurrency bugs lurk until
| the machine is under load in production and edge cases emerge.
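|
| A rough sketch of that CI policy (random exploration feeds the
| bug-finding side; the pinned seed gates the build). The `--seed=42`
| flag here is a hypothetical counterpart to the `--seed-from` option
| quoted earlier in the thread:
|
|     // ci_policy.rs -- explore in CI, but gate on the pinned seed.
|     use std::process::Command;
|
|     fn run(args: &[&str]) -> bool {
|         Command::new("hermit")
|             .args(args)
|             .status()
|             .map(|s| s.success())
|             .unwrap_or(false)
|     }
|
|     fn main() {
|         let test = "./target/debug/integration_test";
|
|         // 1. Exploration: a fresh random schedule on every run.
|         let explore_ok =
|             run(&["run", "--chaos", "--seed-from=SystemRandom", test]);
|         if !explore_ok {
|             // Hand off to the bug-finding pipeline rather than
|             // flaking the build (placeholder for a real handoff).
|             eprintln!("exploration found a new failing schedule");
|         }
|
|         // 2. Gate: the known-good pinned seed decides pass/fail.
|         let ok = run(&["run", "--chaos", "--seed=42", test]);
|         std::process::exit(if ok { 0 } else { 1 });
|     }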
| jasonwhite wrote:
| Once we have complete control over the determinism of a test,
| we can start to play with tweaking the non-deterministic inputs
| in a controlled way. For example, we can tweak the RNG seed
| used for thread scheduling to explore schedules that wouldn't
| normally happen under the Linux scheduler.
| stonemetal12 wrote:
| How do you know if a flakey test has been fixed? A
| deterministic environment can turn a flakey failure into a
| repeatable one, which can then be known to be fixed.
| rrnewton wrote:
| This has been the culmination of several years of work
| intercepting and sanitizing the Linux system call API. It's now
| open source.
| aseipp wrote:
| Hey Ryan, just wanted to say I hope you're doing great these
| days! Really glad to see that the work originally done by you
| and Joe at Cloudseal evolving and becoming more readily
| available to all of us. I still thought back every once in a
| while about what y'all were up to. :) Super excited that it's
| kept on chuggin' and now is something we can all enjoy!
| hoosieree wrote:
| I guess FB poaching you from IU in the middle of teaching your
| compiler course has a silver lining! Nice to see this being
| shared with the open source community.
| alex_suzuki wrote:
| Is it just me or are we experiencing an uptick in high-quality,
| sophisticated software projects being open-sourced by FAANG
| companies?
| crmd wrote:
| It's great how much open source infrastructure software has
| come out of FAANG in the past decade. Between Kubernetes,
| react, Kafka, etc it's wild how much of my tech stack is open
| source software developed by people at large tech companies.
| It's also great PR for the companies.
| Enginerrrd wrote:
| Facebook/Meta in particular seem to be hitting the front page
| with nerd-bait research. I suspect it is indeed part of a
| public relations strategy.
|
| And I hate to say it, but it's working on me. I've seen some
| really cool projects come out of Facebook lately!
| rrnewton wrote:
| Heh, I implore you to consider the engineer's-eye view. We're
| tech geeks inside or outside of FAANG, with all the usual
| incentives. I've been the tech lead on this project for 5
| years, through 2 companies, and of course I hope someone
| finds it useful and chooses to contribute.
|
| I'd like to think we could help someone somehow with public
| relations, but I don't think we can ;-). Actually, I don't
| think any of the big techs are leaning all that much into
| _recruiting_ right now though....
| crmd wrote:
| Hey, I totally rewrote my comment after reading your
| response. Thanks for engaging - hearing from the project
| author made me realize I was acting like a cynical asshole.
| Congrats on Hermit and thanks for supporting free software.
| umanwizard wrote:
| Are we? PyTorch, React, etc. have been around for ages.
| tobyhinloopen wrote:
| React? He mentioned high quality. React is a mess, especially
| since hooks became dominant.
| topazas wrote:
| Maybe symbolic execution could also be included here?
| rrnewton wrote:
| It's a good question. We would like to make it usable as a
| platform for dynamic analysis. The idea being that you can
| control all these external factors (like thread scheduling),
| find a crashing run, and then ask introspective questions
| about what the code is doing in that run.
|
| In practice, one challenge we have is bridging between the
| runtime view of the software (as a debugger sees it: raw
| machine instructions and system calls) and the static view
| that you would get from analyzing the source code.
|
| Sanitizers, for example (ASAN, TSAN, etc), are typically
| implemented in the compiler as program instrumentations. If we
| integrated binary instrumentation tools like Intel Pin or
| DynamoRIO, we could perform additional dynamic analysis, but
| still at the machine code rather than source code level, which
| is a big gap from how symbolic execution normally happens, at
| the source code / AST level.
___________________________________________________________________
(page generated 2022-11-22 23:00 UTC)