FR: Add a `--fail-fast` option to libtest #142859

sourcefrog (Contributor) opened this issue

I'd like to add an option to the CLI of the standard test runner so that it stops running tests after the first one fails. This would be useful in situations where you just want to find whether any test fails. I don't propose to change the default.

My immediate motivation is that cargo-mutants just needs to see whether any test fails, and it's a waste of time to continue running tests after one failure has been found.

More generally, this is a feature many test runners have and that people seem to find useful. For example, failing fast is the default in nextest (https://nexte.st/docs/running/#failing-fast), Bazel has --test_runner_fail_fast, and pytest has --exitfirst. And of course cargo test fails fast by default at the test target level.

Today, cargo test fails fast by default at the test target level: if any tests in one target fail, it won't run any further targets. However, within a test target, there's no way to fail fast. This can be confusing, but it would be disruptive to change now.

I discovered that the logic for this actually already exists, it's just not exposed in the CLI. #142807 adds an option. With that change, you can run cargo test -- --fail-fast -Zunstable-options and it will stop after the first test fails.

When multiple threads are used, the tests run in nondeterministic order, and so in a tree with multiple failing tests, with this option on, it's nondeterministic which tests will run before the process stops. I don't think that's surprising to people who just want to know of any one failure, and the order can be made predictable by running on a single thread.
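For example:

```console
# stop the test binary after the first failing test (nightly only)
$ cargo test -- --fail-fast -Zunstable-options

# on a single thread the run order, and hence which test fails first, is deterministic
$ cargo test -- --fail-fast --test-threads=1 -Zunstable-options
```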

I have read that people would like to move away from the current libtest architecture, and so apparently there has been a soft feature freeze for some months or years. However, since this is a small change to the implementation and shouldn't introduce any compatibility concerns, I hope it can still be considered.

Crates can work around the absence of this feature in libtest by setting `harness = false` and using Nextest or some other harness, but that's a large transition, and I think it would be nice to have it in the standard library: at least, that would help cargo-mutants get better performance on most crates.

cc @rust-lang/testing-devex

Activity

  • added the needs-triage label on Jun 22, 2025
  • changed the title from "PR: Add a `--fail-fast` option to libtest" to "FR: Add a `--fail-fast` option to libtest" on Jun 22, 2025
  • added the T-libs-api, C-feature-request, A-libtest, and T-testing-devex labels and removed C-bug and needs-triage on Jun 22, 2025

epage (Contributor) commented on Jun 23, 2025

@oli-obk and @compiler-errors, in #105153, you both recognized that the fail-fast mode was a hack. While testing-devex and libs-api decide what should be part of the stable API, do you have any input on user-facing problems or limitations from that hack? If we move forward with this, I'd like for us to understand what might block stabilization.

compiler-errors (Member) commented on Jun 23, 2025

No user-facing problems; I think the only reason we called it a hack was how it was implemented (via a rustc-specific env var), not the general idea itself.

epage (Contributor) commented on Jul 1, 2025

@sourcefrog sorry for the delays, the testing-devex team hasn't been able to meet in a bit. I'm going to go ahead and try to prime the conversation here with my own thoughts to hopefully streamline things for when we do meet.

Focusing on the question of what should be in the API / CLI, we've overall been working to shrink the surface area of libtest. Right now, this has mostly been us deprecating (but not removing) functionality. We may remove some unstable functionality. This is part of our effort to flesh out custom test harnesses, including the inter-process API that cargo test and other test runners would interact with. To reduce that surface and to improve some aspects of usability, we also want to shift some responsibilities from libtest to cargo test.

So from my perspective, the questions that would be relevant to testing-devex in discussing this:

  • Is this part of the minimal API needed for a harness? My gut says yes. There isn't really another way to work around this. This is common and useful enough to be expected of all harnesses (without too much of a burden put on them). I suspect we might be able to augment the UX, with cargo test knowing every "modern" harness supports this, by dropping our weird "keep going within a binary but fail fast across binaries" behavior in favor of "keep going across binaries, and --fail-fast fails fast across binaries".
  • How might this feature evolve over time? It might be good to examine prior art to see how other test libraries deal with failures: whether just the flag is sufficient, whether there is a common enough short flag for us to offer, etc. You mentioned a couple of related flags, but it would be good to summarize them and related features in a single place, rather than linking out to them, so it's easier to analyze when the team gets a chance. For example, nextest doesn't just have --fail-fast but also --no-fail-fast and --max-fail. Could we dig into the motivations to see if they apply here?

sourcefrog (Contributor, Author) commented on Jul 1, 2025

Thanks! I'll follow up with a survey of what other frameworks do.

sourcefrog (Contributor, Author) commented on Jul 13, 2025

In short: Adding --fail-fast into libtest seems to me to align with the common name for a common practice and to fill a worthwhile gap that can't reasonably be worked around at the cargo level.

> Focusing on the question of what should be in the API / CLI, we've overall been working to shrink the surface area of libtest. Right now, this has mostly been us deprecating (but not removing) functionality. We may remove some unstable functionality. This is part of our effort to flesh out custom test harnesses, including the inter-process API that cargo test and other test runners would interact with. To reduce that surface and to improve some aspects of usability, we also want to shift some responsibilities from libtest to cargo test.

I'm assuming the split here continues to be that cargo test runs various test target binaries, each of which uses a library/harness that's fairly opaque to cargo test.

Are you thinking we might have --fail-fast as a standard argument that test processes should expect?

> So from my perspective, the questions that would be relevant to testing-devex in discussing this:
>
> • Is this part of the minimal API needed for a harness? My gut says yes. There isn't really another way to work around this. This is common and useful enough to be expected of all harnesses (without too much of a burden put on them).

Right, this seems inherently tied to how the individual tests are executed, which is very much the business of the individual harness implementation.

All the test runners I've seen have some kind of loop over a work queue, possibly with multiple workers. They may run the tests in process, on threads, in subprocesses, in containers, or remotely, but there's still some kind of queue. It's easy to exit early when one or more tests have failed.
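As a sketch of what I mean (hypothetical types and names, not the real libtest internals):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Hypothetical test representation, for illustration only.
struct TestCase {
    name: &'static str,
    run: fn() -> bool, // true = pass
}

/// Run tests from a shared queue on `workers` threads; with `fail_fast`,
/// workers stop pulling new tests once any test has failed.
fn run_tests(tests: Vec<TestCase>, fail_fast: bool, workers: usize) -> bool {
    let queue = Arc::new(Mutex::new(tests.into_iter()));
    let any_failed = Arc::new(AtomicBool::new(false));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let queue = Arc::clone(&queue);
            let any_failed = Arc::clone(&any_failed);
            thread::spawn(move || loop {
                // Exit early once any worker has seen a failure.
                if fail_fast && any_failed.load(Ordering::SeqCst) {
                    break;
                }
                // Take the next test, releasing the queue lock immediately.
                let next = queue.lock().unwrap().next();
                match next {
                    Some(test) => {
                        if !(test.run)() {
                            eprintln!("test {} ... FAILED", test.name);
                            any_failed.store(true, Ordering::SeqCst);
                        }
                    }
                    None => break, // queue drained
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    !any_failed.load(Ordering::SeqCst)
}
```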

> I suspect we might be able to augment the UX, with cargo test knowing every "modern" harness supports this, by dropping our weird "keep going within a binary but fail fast across binaries" behavior in favor of "keep going across binaries, and --fail-fast fails fast across binaries".

Right, I think that would be a less confusing experience, and this is something that essentially every Rust user will hit when they write a failing test. If we weren't constrained by previous behavior, I think that might be a better default, but it would be a change in command-line behavior. Personally I would welcome it, but I also prize Rust's stability commitments.

> • How might this feature evolve over time? It might be good to examine prior art to see how other test libraries deal with failures: whether just the flag is sufficient, whether there is a common enough short flag for us to offer, etc. You mentioned a couple of related flags, but it would be good to summarize them and related features in a single place, rather than linking out to them, so it's easier to analyze when the team gets a chance. For example, nextest doesn't just have --fail-fast but also --no-fail-fast and --max-fail. Could we dig into the motivations to see if they apply here?

Other test libraries

  • Nextest: has --max-fail=N (or =all), --fail-fast (the default), and --no-fail-fast.
  • HUnit (Haskell): apparently doesn't have a fail-fast feature.
  • cargo-maelstrom: has --stop-after=N.
  • go test: has -failfast.
  • Python unittest (stdlib): has -f, --failfast.
  • pytest: has -x / --exitfirst, and --maxfail=N.
  • jest (JS): has --bail or --bail=N.
  • Boost.Test (C++): doesn't seem to have an option.
  • JUnit: has --fail-fast.
  • Rake (Ruby): has --fail-fast[=N].

Overall, adding --fail-fast and optionally --max-fail=N seems to align with common practice. Since we already have --no-fail-fast (in cargo test), the option style aligns with Nextest's rather than the alternative --fail-fast=false style. It could reasonably be abbreviated to -f, with -F for --no-fail-fast.

Motivations

I think the motivations for this feature come from two distinct scenarios:

  1. Interactive edit/run/fix loops:
  • Simpler and smaller output: rather than potentially hundreds or thousands of lines of tracebacks from failures, you get one failure to fix next and less text to manage.
  • Less cognitive load: the process steers you toward fixing one thing at a time, rather than guessing which other tests are related.
  • Shorter time to usable output: test harnesses like cargo test don't emit any details until all the tests have finished, because of concurrency; with --fail-fast you'll see the error earlier.
  • The test process stops earlier without being interrupted and is easier to run again. This also helps with tools that run tests automatically when files are saved.
  • During operations like git bisect, the user may only want to know whether any tests fail.
  2. CI (see https://www.software.ac.uk/blog/continuous-integration-fail-fast-and-fail-first):
  • If developers are expected to have all tests passing before submission, then a single failure is enough to ask them to check again.
  • It saves CI CPU time.
  • As a somewhat niche case, mutation testing tools expect there to be failures when code is mutated, and any work after the first failure is wasted.

There are certainly situations where users would rather make a throughput/latency tradeoff and get many errors in a batch, for example if the test suite is very slow or (as a special case) if some errors are hard to reproduce outside CI and CI takes a while. I've also seen, less than once a year in my experience, test failures so incomprehensible that I need to skim many failures to work out where to begin. But users will still have the option to run all tests when they need it.

Generalizations and evolution of this feature

This feature has existed in other languages for many years without apparently growing a lot of complexity, so it doesn't seem likely to lead to many follow-on features in Rust. But I will mention two:

Stop after N failures

The most common generalization is from "stop after a failure" to "stop after N failures", allowing people to adjust the tradeoff between getting short usable output faster versus the cost of running the test suite up to the point something fails.

--max-fail=N could make sense for harnesses to add. A straightforward implementation within the harness would mean "stop after N failures in one binary". Since it's relatively rare and I'd say less important than stopping after one failure, perhaps this is reasonable to add as a target-specific option? On the other hand it's unlikely to be difficult for any harness to implement this, so it could be part of a standard protocol.
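As a hypothetical sketch (the names are mine, not libtest's), the fail-fast check generalizes to a shared counter, with --fail-fast then just being --max-fail=1:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Record one failure and report whether workers should stop pulling new
/// tests. `failures` would be shared between worker threads, replacing the
/// boolean "any failure" flag of a plain fail-fast implementation.
fn record_failure(failures: &AtomicUsize, max_fail: usize) -> bool {
    // fetch_add returns the previous count, so add 1 for the new total.
    failures.fetch_add(1, Ordering::SeqCst) + 1 >= max_fail
}
```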

Run tests more than once

A related area is to retry failing tests, or all tests. I've used Bazel's --runs_per_test=N and --flaky_test_attempts=N which are quite useful when you suspect a test is flaky.

Maelstrom also has --repeat=N.

Workarounds

The main workaround I can think of is for cargo-mutants to kill the test subprocess when it notices that a test has failed; I have thought about doing this. (It would be clunky to do from the text output, but more reliable if the protocol looked more like subunit or junit.) That has some drawbacks (see the sketch after this list):

  • Output from tests that have already finished might be lost.
  • Resources are perhaps less likely to be reliably cleaned up if the process is signalled: for example, making sure that any spawned grandchild processes are killed can be complex. Obviously users do interrupt test subprocesses and it generally works, but doing it as part of the normal flow of cargo test would make any problems more prominent.
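A minimal sketch of that workaround, assuming the runner scrapes libtest's human-readable output (the function is hypothetical; the "... FAILED" marker is what libtest prints today):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Run a test binary and kill it at the first failing test.
/// Returns Ok(true) if no failure was seen before the process exited.
fn run_until_first_failure(test_binary: &str) -> std::io::Result<bool> {
    let mut child = Command::new(test_binary)
        .stdout(Stdio::piped())
        .spawn()?;

    let stdout = child.stdout.take().expect("stdout was piped");
    let mut saw_failure = false;
    for line in BufReader::new(stdout).lines() {
        let line = line?;
        println!("{line}");
        // libtest prints "test <name> ... FAILED" for each failure;
        // scraping this text is exactly the fragility noted above.
        if line.ends_with("... FAILED") {
            saw_failure = true;
            child.kill()?; // stop running the remaining tests
            break;
        }
    }
    child.wait()?; // reap the child either way
    Ok(!saw_failure)
}
```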

Alternatively, I can imagine adding an interactive protocol between cargo test and the harness, where the harness reports incremental results (perhaps over subunit) and cargo test can ask it to stop gracefully. That doesn't seem worth it for only this feature, and seems likely to complicate and constrain harness implementations, but perhaps there are other features that would want it.

epage (Contributor) commented on Jul 14, 2025

Thanks for that write up!

I guess if cargo test uses the cargo nextest model, this technically wouldn't be needed.

I'm surprised so many have a "first N" variant. I wonder what use cases motivated that, in case it impacts the design here, especially since we're shifting focus from humans passing flags to machines.

I was wondering about what workflows we might want to offer from cargo. I've been particularly eyeing --last-failed:

  • --lf, --last-failed: rerun only the tests that failed at the last run (or all if none failed)
  • --ff, --failed-first: run all tests, but run the last failures first. This may re-order tests and thus lead to repeated fixture setup/teardown.
  • --nf, --new-first: run tests from new files first, then the rest of the tests sorted by file mtime

The other two are about sort order, and I've prototyped a solution in libtest2 that will allow cargo to do those.

For --last-failed, I think --fail-fast becomes important. I'd probably have the "or all if none failed" case imply fail-fast, since that fits the iterative development mode. Unless the "find any failure" case for CI is important enough on its own, I wonder whether --fail-fast in cargo test would be worth it.

sourcefrog (Contributor, Author) commented on Jul 14, 2025

> Thanks for that write up!

> I guess if cargo test uses the cargo nextest model, this technically wouldn't be needed.

Right, if it ran each test function in one process it would be totally in control of when to stop. Also, this would remove the need to finish all the tests in one target before starting the next.

However, there are downsides to that approach: launching a process can be significantly slower than running a small unit test, so the overall test time can be much longer under Nextest on some trees.

So I guess I would be inclined to leave this up to the harnesses to experiment with, but I haven't read all the history of how the testing-devex team conceives of this interface.

> I'm surprised so many have a "first N" variant. I wonder what use cases motivated that, in case it impacts the design here, especially since we're shifting focus from humans passing flags to machines.

I think it's essentially splitting the difference between the motivations I described above: I don't want to be spammed by dozens of failures, but I also want to get more data out of a slow test run than a single failure. My guess is these would be rarely used, but they're easy to add.

> I was wondering about what workflows we might want to offer from cargo. I've been particularly eyeing --last-failed:
>
> • --lf, --last-failed: rerun only the tests that failed at the last run (or all if none failed)
> • --ff, --failed-first: run all tests, but run the last failures first. This may re-order tests and thus lead to repeated fixture setup/teardown.
> • --nf, --new-first: run tests from new files first, then the rest of the tests sorted by file mtime

> The other two are about sort order, and I've prototyped a solution in libtest2 that will allow cargo to do those.
>
> For --last-failed, I think --fail-fast becomes important. I'd probably have the "or all if none failed" case imply fail-fast, since that fits the iterative development mode. Unless the "find any failure" case for CI is important enough on its own, I wonder whether --fail-fast in cargo test would be worth it.

If you're going to add those, I'd suggest also adding options to run tests in random or seeded pseudorandom order, as people will discover some nondeterminism. Also perhaps the Bazel thing of repeating failed or all tests.
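For the random-order part, libtest already has unstable shuffle flags, if I'm remembering the current nightly behavior correctly:

```console
$ cargo test -- --shuffle -Zunstable-options            # random order; prints the seed used
$ cargo test -- --shuffle-seed=12345 -Zunstable-options # reproduce a particular order
```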

As additional inspiration cargo-mutants has --iterate which basically re-runs failed meta-tests by looking at the previous failures.

These features seem pretty good. I guess there is a question of approach between allowing harnesses to add them vs having batteries included in the standard tool.

Also, I would rather like to land this in the existing harness even if a large 2.0 is in the pipeline. It doesn't seem like it would constrain future changes too much.

epage (Contributor) commented on Aug 14, 2025

This was discussed in a testing-devex meeting on 2025-07-29 (sorry for the delay in reporting this) and we had unanimous agreement among attendees (@epage, @calebcartwright, @Muscraft, @weihanglo).

We'd then endorse this to t-libs-api to have the final say, but in 9 days the FCP closes on t-testing-devex having delegated authority to make these decisions on our own (rust-lang/libs-team#633). Maybe we should just wait until then?

added a commit that references this issue on Aug 14, 2025
added a commit that references this issue on Aug 14, 2025

sourcefrog (Contributor, Author) commented on Aug 18, 2025

Great, let me know if I can help!
