[HN Gopher] Why Rust nextest is process-per-test
___________________________________________________________________
Why Rust nextest is process-per-test
Author : jicea
Score : 88 points
Date : 2025-01-09 19:40 UTC (3 days ago)
(HTM) web link (sunshowers.io)
(TXT) w3m dump (sunshowers.io)
| marky1991 wrote:
| I don't understand why he jumps straight from 'one test per
| process' to 'one test per thread' as the alternative.
|
| I'm not actually clear what he means by 'test' to be honest, but
| I assume he means 'a single test function that can either pass or
| fail'
|
| E.g. in Python (nose):
|
|   class TestSomething:
|       def test_A(): ...
|       def test_B(): ...
|
| I'm assuming he means test_A. But why not run all of
| TestSomething in a process?
|
| Honestly, I think the idea of having tests have shared state is
| bad to begin with (for things that truly matter, eg if the
| outcome of your test depends on the state of sys.modules,
| something else is horribly wrong), so I would never make this
| tradeoff to benefit a scenario that I never think should be done.
|
| Even if we were being absolute purists, this still hasn't solved
| the problem, the second your process communicates with any other
| process (or server). And that problem seems largely unsolvable,
| short of mocking.
|
| Basically, I'm not convinced this is a good tradeoff, because the
| idea of creating thousands and thousands of processes to run a
| test suite, even on linux, sounds like a bad idea. (And at work,
| would definitely be a bad idea, for performance reasons)
| cbarrick wrote:
| > I'm not actually clear what he means by 'test' to be honest,
| but I assume he means 'a single test function that can either
| pass or fail'
|
| I assume so as well.
|
| Unit testing in Rust is based around functions annotated with
| #[test], so it's safe to assume that when the author says
| "test" they are referring to one such function.
|
| It's up to the user to decide what they do in each function.
| For example, you could do a Go-style table-driven test, but the
| entire function would be a single "test", _not_ one "test" per
| table entry.
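|
| For illustration, a minimal sketch (my example, not from the
| article) of a Go-style table-driven test in Rust: however many
| table entries it checks, the harness sees and reports a single
| "test".
|
|   #[test]
|   fn parses_integers() {
|       // Each tuple is one table entry, but all of them live
|       // inside the one function the harness knows about.
|       let cases = [("0", 0), ("42", 42), ("-7", -7)];
|       for (input, expected) in cases {
|           assert_eq!(input.parse::<i32>().unwrap(), expected);
|       }
|   }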
| saghm wrote:
| I think most of the context that might explain your confusion
| is the way that tests work out of the box in Rust. The default
| test harness when invoking `cargo test` runs one test per
| thread (and by default parallelizes based on the number of
| cores available, although this is configurable with the
| `--test-threads` flag). In
| Rust, there isn't any equivalent to the `TestSomething` class
| you give; each test is always a top-level function. Since
| `cargo nextest` is a mostly drop-in replacement for `cargo
| test`, I imagine the author is using one test per thread as an
| alternative because that's the paradigm that users will be
| switching from if they start using cargo nextest.
|
| While enforcing no shared state in tests might be useful, that
| wouldn't be feasible in Rust without adding quite a lot of
| constraints that would be tough if not impossible to enforce in
| a drop-in replacement for cargo test. There's certainly room
| for alternatives in the testing ecosystem in Rust that don't
| try to maintain compatibility with the built-in test harness,
| but I don't think the intention of cargo nextest is to try to
| do that.
|
| One other point that might not be obvious is that right now,
| there's no stable way to hook into Rust's libtest. The only
| options to provide an alternative testing harness in Rust are
| to either only support nightly rather than stable, break
| compatibility with tests written for the built-in test harness,
| or provide a separate harness that still supports existing
| tests. I'm sure there are arguments to be made for each of the
| other alternatives, but personally, I don't think there's any
| super realistic chance for adoption of anything that picks the
| first two options, so the approach cargo nextest is taking is the
| most viable one available (at least until libtest
| stabilizes, but it's not obvious when that will happen).
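|
| (For the curious, the "separate harness" escape hatch looks
| roughly like this. This is a hand-rolled sketch of the mechanism,
| not how nextest itself works: as I understand it, nextest keeps
| the standard libtest harness and drives it from outside.)
|
|   // Cargo.toml (opting one test target out of libtest):
|   //
|   //   [[test]]
|   //   name = "custom"
|   //   harness = false
|
|   // tests/custom.rs: with the default harness disabled, this is
|   // an ordinary binary that has to do its own reporting.
|   fn main() {
|       let tests: &[(&str, fn())] = &[
|           ("addition", || assert_eq!(2 + 2, 4)),
|           ("strings", || assert!("nextest".contains("test"))),
|       ];
|       for (name, test) in tests {
|           test();
|           println!("test {name} ... ok");
|       }
|   }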
| jclulow wrote:
| NB: the author's pronouns are they/she, not he.
| sunshowers wrote:
| FYI I use they/she pronouns (thanks jclulow!)
|
| > Honestly, I think the idea of having tests have shared state
| is bad to begin with (for things that truly matter, eg if the
| outcome of your test depends on the state of sys.modules,
| something else is horribly wrong), so I would never make this
| tradeoff to benefit a scenario that I never think should be
| done.
|
| I don't disagree as a matter of principle, but the reality
| really is different. Some of the first nextest users outside of
| myself and my workplace were graphical libraries and engines.
|
| > Basically, I'm not convinced this is a good tradeoff, because
| the idea of creating thousands and thousands of processes to
| run a test suite, even on linux, sounds like a bad idea. (And
| at work, would definitely be a bad idea, for performance
| reasons)
|
| With Rust or other native languages it really is quite fast.
| With Python, I agree, not as much, so this tradeoff wouldn't
| make as much sense there.
|
| But note that things like test cancellation are a little easier
| to do in an interpreted model.
| hinkley wrote:
| > Honestly, I think the idea of having tests have shared state
| is bad to begin with
|
| I blame this partially on our notions of code reuse. We
| conflate it with several other things, and in the case of tests
| we conflate it with state reuse.
|
| And the availability of state reuse leads to people writing
| fakes when they should be using mocks, and people not being
| able to tell the difference between mocks and fakes and thus
| being incapable of having a rational discussion about them.
|
| To my thinking, and the thinking of pretty much all of the test
| experts I've called Mentor (or even Brother), beforeEaches
| should be repeatable. Add a test, it repeats one more time.
| Delete a test, one less. And if they're repeatable, they don't
| have to share the same heap. One heap is as good as another.
|
| Lots of languages can only do that segregation at the process
| level. In NodeJS it would be isolates (workers). If you're very
| careful about global state you could do it per thread. But that
| doesn't happen very often because "you" includes library
| writers, language designers, and your coworker Steve who is
| addicted to in-memory caching. I can say, "don't be Steve"
| until I'm blue in the face but nearly every team hires at least
| one Steve, and some are rotten with them.
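|
| (To make that concrete, a tiny Rust sketch with a made-up Fixture
| type: each test calls setup() itself, so adding or deleting a
| test adds or removes exactly one run of the setup, and no state
| is shared between tests.)
|
|   struct Fixture {
|       counter: u32,
|   }
|
|   // The "beforeEach": every test gets a fresh Fixture, built
|   // from scratch.
|   fn setup() -> Fixture {
|       Fixture { counter: 0 }
|   }
|
|   #[test]
|   fn increments_once() {
|       let mut fx = setup();
|       fx.counter += 1;
|       assert_eq!(fx.counter, 1);
|   }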
| grayhatter wrote:
| Restating the exact same thing 4 different times in the first few
| paragraphs is an LLM feature, right?
| diggan wrote:
| Sounds more like someone is writing "Inverted pyramid" style,
| where you repeat information somewhat but go deeper and deeper
| every time.
| sunshowers wrote:
| I had Claude do a mild review but this was entirely written by
| me. The sibling is right: the goal is to present information in
| increasing levels of detail. Repetition is a powerful tool to
| make ideas stick.
|
| I do try and present a decent level of detail in the post.
| OptionOfT wrote:
| I prefer per process over the alternatives.
|
| When you write code you have the choice to do per process, per
| thread, or sequential.
|
| The problem is that doing multiple tests in a shared space
| doesn't necessarily match the world in which this code is run.
|
| Per process testing allows you to design a test that matches the
| usage of your codebase. Per thread already constrains you.
|
| For example: we might elect to write a job as a process that runs
| on demand, and the library we use has a memory leak, but it can't
| be fixed in a reasonable time. Since we write it as a process
| that gets restarted, we manage to constrain the behavior.
|
| Doing multiple tests in multiple threads might not work here as
| there is a shared space that is retained and isn't representative
| of real world usage.
|
| Concurrency is a feature of your software that you need to code
| for. So if you have multiple things happening, then that should
| be part of your test harness.
|
| The test harness forcing you to think of it isn't always a
| desirable trait.
|
| That said, I have worked on a codebase where we discovered bugs
| because the tests were run in parallel, in a shared space.
| sunshowers wrote:
| There is definitely some value in shaking out bugs by running
| code in parallel in the same process space -- someone on
| Lobsters brought this up too. I've been wondering if there's
| some kind of optional feature that can be enabled here.
| NewJazz wrote:
| Can you do it deliberately in the test?
| sunshowers wrote:
| Yes, you can write your own meta-test that runs all of the
| actual tests in separate threads. But it's a bit
| inconvenient and you won't necessarily get separate
| reporting.
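|
| (Roughly like this, where check_a and check_b stand in for the
| real tests; note the harness only sees one pass/fail for the
| whole batch:)
|
|   use std::thread;
|
|   fn check_a() { assert_eq!(1 + 1, 2); }
|   fn check_b() { assert!("abc".contains('b')); }
|
|   #[test]
|   fn meta_test() {
|       let tests: Vec<(&str, fn())> =
|           vec![("check_a", check_a), ("check_b", check_b)];
|       // A panic in one thread fails only its own join(), so the
|       // other tests still run to completion.
|       let handles: Vec<_> = tests
|           .into_iter()
|           .map(|(name, f)| (name, thread::spawn(f)))
|           .collect();
|       for (name, handle) in handles {
|           assert!(handle.join().is_ok(), "{name} failed");
|       }
|   }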
| amelius wrote:
| According to some of these reasons, every library should run in
| its own process too.
| sujayakar wrote:
| that's roughly what the wasm component model is aiming for!
|
| https://hacks.mozilla.org/2019/11/announcing-the-bytecode-al...
| sunshowers wrote:
| I think the difference is that misbehavior isn't as expected of
| libraries as it is of tests. But yes if your libraries are
| prone to misbehavior, it's common to compartmentalize them in
| separate processes.
| sedatk wrote:
| > Memory corruption in one test doesn't cause others to behave
| erratically. One test segfaulting does not take down a bunch of
| other tests.
|
| Is "memory corruption" an issue with Rust? Also, if one test
| segfaults, isn't it a reason to halt the run because something
| got seriously broken?
| codetrotter wrote:
| > Is "memory corruption" an issue with Rust?
|
| You can cause memory corruption if you opt out of memory safety
| guarantees by using Unsafe Rust.
|
| https://doc.rust-lang.org/book/ch19-01-unsafe-rust.html
|
| Sometimes unsafe is necessary, and the idea then is that the
| "dangerous" parts of the code remain isolated in explicitly
| marked "unsafe" blocks, where they can be closely reviewed.
|
| Also, even if your own Rust code is doing nothing unsafe you
| might be using external libraries written in other languages
| and things might go wrong.
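|
| A contrived sketch of the failure mode: this compiles, but the
| write is undefined behavior and will typically segfault.
|
|   #[test]
|   fn corrupts_memory() {
|       let ptr: *mut u8 = std::ptr::null_mut();
|       unsafe {
|           // UB: writing through a null pointer. Run in-process,
|           // the crash takes every other test down with it; run
|           // process-per-test, only this test dies.
|           *ptr = 42;
|       }
|   }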
|
| > if one test segfaults, isn't it a reason to halt the run
| because something got seriously broken?
|
| Sometimes it's still interesting and helpful to continue
| running other tests even if one fails. If several of them fail,
| that might even help you pinpoint what's going wrong better than
| a single failure would. (Although having a bunch of
| failing tests can also be more noise.)
| sunshowers wrote:
| In local runs it makes sense to halt the run, but in CI not as
| much -- nextest makes this entirely configurable.
| Ericson2314 wrote:
| This is good for an entirely different reason, which is running
| cross-compiled tests in an emulator.
|
| That is especially good for bare metal. If you don't have a
| global allocator, have limited RAM, etc., you might not be able
| to write the test harness as part of the guest program at all!
| So you want to move as much logic to the host program as
| possible, and then run as little as a few instructions (!) in
| the guest program.
|
| See https://github.com/gz/rust-x86 for an example of doing some
| of this.
| sunshowers wrote:
| This is why Miri also works well with nextest, yeah!
| pjc50 wrote:
| This will be _horrendously_ slow on Windows.
| sunshowers wrote:
| It's certainly slower than on Linux, but horrendous is a little
| bit of an overstatement. For many real-world test suites,
| process creation is a small part of overall time, and nextest
| still ends up being overall faster due to better handling of
| long-pole tests. The issues around thread termination and stuff
| are quite relevant on Windows. (And maybe posts like mine are
| going to get Microsoft and AV vendors to pay more attention to
| process creation performance! It really doesn't have to be
| slow, it just is in practice.)
|
| Measured by weekly downloads (around 120k a week total last I
| checked), Windows is actually the number two platform nextest
| is used on, ahead of macOS. It's mostly CI, but clearly a lot
| of people are getting value out of nextest on Windows.
| cortesi wrote:
| Nextest is one of the very small handful of tools I use dozens or
| hundreds of times a day. Parallelism can reduce test suite
| execution time significantly, depending on your project, and has
| saved me cumulative days of my life. The output is nicer, test
| filtering is nicer, leak detection is great, and the developer is
| friendly and responsive. Thanks sunshowers!
|
| The one thing we've had to be aware of is that the execution
| model means there can sometimes be differences in behaviour
| between nextest and cargo test. Very occasionally there are tests
| that fail in cargo test but succeed in nextest due to better
| isolation. In practice this just means that we run cargo test in
| CI.
| zbentley wrote:
| This article is a good primer on why process isolation is more
| robust/separated than threads/coroutines _in general_, though
| ironically I don't think it fully justifies why process isolation
| is better _for tests_ as a specific use case benefitting from that
| isolation.
|
| For tests specifically, some considerations I found to be
| missing:
|
| - Given speed requirements for tests, and representativeness
| requirements, it's often beneficial to refrain from too much
| isolation so that multiple tests can use/exercise paths that use
| pre-primed in-memory state (caches, open sockets, etc.). It's odd
| that the article calls out isolation from global-ish state
| mutation as a specific benefit of process isolation, given that
| it's often substantially faster and more representative of real
| production environments to run tests in the presence of already-
| primed global state. Other commenters have pointed this out.
|
| - I wish the article were clearer about threads as an alternative
| isolation mechanism _for sequential tests_ versus threads as a
| _means of parallelizing tests_. If tests really do need to be run
| in parallel, processes are indeed the way to go in many cases,
| since thread-parallel tests often test a more stringent
| requirement than production would. Consider, for example, a
| global connection pool which is primed sequentially on webserver
| start, before the webserver begins (maybe parallel) request
| servicing. That setup code doesn't need to be thread-safe, so
| using threads to test it _in parallel_ may surface concurrency
| issues that are not realistic.
|
| - On the other hand, enough benefits are lost when running clean-
| slate test-per-process that it's sometimes more appropriate to
| have the test harness orchestrate a series of parallel executors
| and schedule multiple tests to each one. Many testing frameworks
| support this on other platforms; I'm not as sure about Rust--my
| testing needs tend to be very simple (and, shamefully, my
| coverage of fragile code lower than it should be), so take this
| with a grain of salt.
|
| - Many testing scenarios want to abort testing on the first
| failure, in which case processes vs. threads is largely moot. If
| you run your tests with a thread or otherwise-backgrounded
| routine that can observe a timeout, it doesn't matter whether
| your test harness can reliably kill the test and keep going;
| aborting the entire test harness (including all processes/threads
| involved) is sufficient in those cases.
|
| - Debugging tools are often friendlier to in-process test code.
| It's usually possible to get debuggers to understand process-
| based test harnesses, but this isn't usually set up by default.
| If you want to breakpoint/debug during testing, running your
| tests in-process and on the main thread (with a background thread
| aborting the harness or auto-starting a debugger on timeout) is
| generally the most debugger-friendly practice. This is true on
| most platforms, not just Rust.
|
| - fork() is a middle ground here as well: it can be slow (though
| mitigations exist), but it can also speed things up
| considerably by sharing e.g. primed in-memory caches and socket
| state with tests when they run. Given fork()'s sharp edges re:
| filehandle sharing, this, too, works best with sequential rather
| than parallel test execution. Depending on the libraries in use
| in code-under-test, though, this is often more trouble than it's
| worth. Dealing with a mixture of fork-aware and fork-unaware code
| is miserable; better to do as the article suggests if you find
| yourself in that situation. How to set up library/reusable code
| to hit the right balance between fork-awareness/fork-safety and
| environment-agnosticism is a big and complicated question with no
| easy answers (and also excludes the easy rejoinder of "fork is
| obsolete/bad/harmful; don't bother supporting it and don't use
| it, just read Baumann et al.!").
|
| - In many ways, this article makes a good case for something it
| doesn't explicitly mention: a means of annotating/interrogating
| in-memory global state, like caches/lazy_static/connections, used
| by code under test. With such an annotation, it's relatively easy
| to let invocations of the test harness choose how they want to
| work: reuse a process for testing and re-set global state before
| each test, have _the harness itself_ (rather than tests by side-
| effect) set up the global state, run each test with and/or
| without pre-primed global state and see if behavior differs, etc.
| Annotating such global state interactions isn't trivial, though,
| if third-party code is in the mix. A robust combination of
| annotations in first-party code and a clear place to manually
| observe/prime/reset-if-possible state that isn't annotated is a
| good harness feature to strive for. Even if you don't get 100% of
| the way there, incremental progress in this direction yields
| considerable rewards.
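|
| (To sketch that last point in Rust, using std::sync::OnceLock
| for a hypothetical "annotated" global; the names here are made
| up. The harness, or a test, decides whether the cache is
| pre-primed:)
|
|   use std::sync::{Mutex, OnceLock};
|
|   // A hypothetical piece of annotated global state: a cache the
|   // harness can prime up front or let the first test fill in.
|   static CACHE: OnceLock<Mutex<Vec<String>>> = OnceLock::new();
|
|   fn cache() -> &'static Mutex<Vec<String>> {
|       CACHE.get_or_init(|| Mutex::new(Vec::new()))
|   }
|
|   // Called by the harness (or by a test) to prime or reset it.
|   fn prime_cache(entries: &[&str]) {
|       let mut guard = cache().lock().unwrap();
|       guard.clear();
|       guard.extend(entries.iter().map(|s| s.to_string()));
|   }
|
|   #[test]
|   fn works_with_primed_state() {
|       prime_cache(&["warm"]);
|       assert_eq!(cache().lock().unwrap().len(), 1);
|   }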
| sunshowers wrote:
| But that's exactly it, right -- everything you've listed is a
| valid point in the design space, but they all require a lot of
| coordination between various actors in the system. Process per
| test solves a whole swath of coordination issues.
___________________________________________________________________
(page generated 2025-01-12 23:01 UTC)