CHORE: Explicit random seed for tests #3048

Merged: connortann merged 19 commits into master from chore/seeds on Jul 4, 2023
Conversation

@connortann (Collaborator) commented Jun 27, 2023

Improves the use of randomness in the test suite for reproducibility, and to mitigate the occurrence of flaky tests.

Key changes are in conftest.py:

  • Adds a fixture to provide a changing random seed for tests. If a test fails, the random seed is printed by pytest.
  • Uses a local RandomState in each test rather than the global random state.
  • Adds a CLI argument to fix the random state, for easy local reproduction of any failures.
  • Resets the global random state to zero before all tests.

Should help address #2960, improving the reproducibility of any test failures.
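
For illustration, a minimal sketch of what such a conftest.py could look like, assuming the fixture and flag names described in this PR (random_seed, --random-state); the bodies are assumptions, not the repository's actual implementation:

import numpy as np
import pytest


def pytest_addoption(parser):
    # CLI flag to pin the seed when reproducing a failure locally.
    parser.addoption("--random-state", action="store", default=None,
                     help="Fix the seed returned by the random_seed fixture.")


@pytest.fixture
def random_seed(request):
    """Return a seed that varies per run; pytest prints it on failure."""
    pinned = request.config.getoption("--random-state")
    if pinned is not None:
        return int(pinned)
    # Draw from OS entropy so the autouse reset below does not make
    # every run identical.
    return np.random.SeedSequence().entropy % 1000


@pytest.fixture(autouse=True)
def _reset_global_random_state():
    # Reset the global state to zero before every test.
    np.random.seed(0)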

@codecov bot commented Jun 27, 2023

Codecov Report

Merging #3048 (99e1a62) into master (974d996) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3048   +/-   ##
=======================================
  Coverage   54.92%   54.92%           
=======================================
  Files          90       90           
  Lines       12862    12862           
=======================================
  Hits         7064     7064           
  Misses       5798     5798           

see 2 files with indirect coverage changes


@connortann added the enhancement and ci labels Jun 27, 2023
@connortann self-assigned this Jun 27, 2023
@connortann requested a review from thatlittleboy June 27, 2023 10:05
@connortann marked this pull request as ready for review June 27, 2023 10:06
@thatlittleboy (Collaborator)

Thanks @connortann for the PR. I think setting the seeds is a good idea in general.

I do have one concern (which I raised in #2960 as well), which is that the failures we are encountering are not of the random() > 0.5 kind; they are mostly additivity checks failing.

Right now, it's not entirely clear to me why this is happening, so it's particularly concerning: it could mean there are some edge cases that our algorithms are not considering, thus resulting in additivity failures. (I'm leaning towards this conclusion because there are many open issues here talking about this very problem.)

In some ways, it's a good thing that we're getting the occasional errors: they actually give us examples upon which we can debug.

@connortann (Collaborator, Author)

I completely agree. So, I suggest we don't close issue #2960, but keep it open until we can diagnose the root cause. However, in the meantime it's probably preferable for this issue not to affect other unrelated PRs.

Perhaps setting the random seed will also be helpful for diagnosing the issue: we could determine a value of the random seed that reliably causes the tests to fail.

@thatlittleboy (Collaborator) commented Jun 27, 2023

But the only way for us to produce these failures at the moment is via the tests running in CI (where we don't fix the random seed), right?

If we fix the random seed now, then it would make it harder for us to identify which tests we need to focus our attention on (to fix the additivity issues).

> in the meantime it's probably preferable for this issue not to affect other unrelated PRs.

The impact isn't that bad; we just need to re-run the CI. It's annoying, but I see the aforementioned issue as a bigger problem than the annoyance of having to re-run CI.

@connortann (Collaborator, Author) commented Jun 27, 2023

> But the only way for us to produce these failures at the moment is via the tests running in CI (where we don't fix the random seed), right?

I don't think that's quite right: if you wanted to work on the additivity issue, you could create a PR that sets the random seed to a number that does reliably cause a failure. That's an improvement on the current state, as you will be able to determine when the issue is actually fixed.

@thatlittleboy (Collaborator)

That's not quite what I meant. Let me rephrase: are we able to confidently list all of the existing tests that fail additivity checks for particular random seeds, along with the random seeds that generate the failures? If we can't, I don't think #2960 should be closed.

So far, I've listed one test (the xgboost test) with one random seed in the original issue to reproduce the problem.
Are there any more? And what seeds cause the additivity failures in those?

My point is that we don't have full answers to the above questions, and the only way to get them is to leave the tests running as they are in CI (how else would we know which tests are "flaky"?), unless we introduce the hypothesis library into our testing.

@connortann (Collaborator, Author) commented Jun 28, 2023

I see what you mean; I'm with you.

I'll rejig this PR with a slightly different aim then: to ensure that each test with randomness accepts a random seed, and that the seed is printed to the pytest logs if the test fails.

@connortann (Collaborator, Author) commented Jun 28, 2023

I had a go implementing the above. Hopefully that is the best of both worlds: a different seed will be used in each run by default, but it will be easy to fix the seed to reproduce a given failure.

My suggestion is to use the new random_seed fixture:

def test_foobar(random_seed):
    assert False

If the test fails, the seed will be printed by pytest:

tests/explainers/test_deep.py F                                 [100%]

============================== FAILURES ===============================
_____________________________ test_foobar _____________________________

random_seed = 736

    def test_foobar(random_seed):
>       assert False
E       assert False

tests/explainers/test_deep.py:622: AssertionError
======================= short test summary info =======================
FAILED tests/explainers/test_deep.py::test_foobar - assert False
================== 1 failed, 261 deselected in 2.56s ==================
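
To reproduce such a failure locally, the printed seed can then be passed back in via the new CLI argument, e.g.:

pytest tests/explainers/test_deep.py::test_foobar --random-state 736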

@connortann changed the title from "CHORE: Fix random seed for tests" to "CHORE: Explicit random seed for tests" Jun 28, 2023
@connortann (Collaborator, Author) commented Jun 28, 2023

Found a flaky failure: test_tf_keras_linear, with random_seed = 896

@thatlittleboy (Collaborator) left a review comment

EDIT: This approach I can get on board with :)


There are a few more test files that I think we should also cover in this PR. (I just did a grep for np.random.seed in the project)

  • tests/explainers/test_linear.py
  • tests/explainers/test_tree.py
  • and a couple more tests in tests/explainers/test_kernel.py (like test_linear() in that file)

@thatlittleboy (Collaborator)

> Found a flaky failure: test_tf_keras_linear, with random_seed = 896

It failed again with random_seed = 823, so there's clearly something off with the implementation here. We'll need to look into this at some point.

@connortann (Collaborator, Author)

> There are a few more test files that I think we should also cover in this PR

I noticed in test_linear.py that the global random seed was reset, but randomness doesn't seem to be used explicitly in the test. I think adding fuzzing here wouldn't really make sense, as it isn't clear what, if anything, is being fuzzed.

However, it's probably wise to set the global seed explicitly for reproducibility, as the implicit default expectation is that unit tests are deterministic and reproducible. I added a global_random_seed fixture to handle this, which has autouse=True.

Then, for tests that explicitly wish to use fuzzing, the random_seed fixture is used with a new numpy random Generator. That should make the application of fuzzing explicit and obvious to future readers of the tests.
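
For illustration, a test using that pattern might look like this (a sketch; the test body is a hypothetical example):

import numpy as np


def test_with_fuzzing(random_seed):
    # Fuzzing is explicit: data comes from a local Generator seeded by
    # the fixture, not from the global numpy random state.
    rng = np.random.default_rng(random_seed)
    X = rng.normal(size=(10, 3))
    assert np.isfinite(X).all()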

@connortann requested a review from thatlittleboy July 2, 2023 18:10
@connortann marked this pull request as draft July 2, 2023 20:56
@connortann (Collaborator, Author) commented Jul 3, 2023

I made a few further updates after examining some failures:

  • Suggested the use of RandomState rather than default_rng(), as it has stricter compatibility guarantees across versions and platforms (see the sketch after this list).
  • Added more PyTorch, TensorFlow and XGBoost random state seeds.
  • Did not change tests for plotting functions: both the test data generation and the plotting function itself use the global random state, so changing the way the test data is generated would lead to a different output image.
  • Added a pytest CLI argument to set the random state, for ease of local debugging. Example call:
    pytest -k my_function --random-state 123
  • Pinned a passing seed for the tests that we've already identified as being flaky, as tracked in "Random floating-point errors in GitHub Actions" (#2960). Once we've identified that a given test is flaky, I don't see any further benefit in having it fail on other unrelated PRs.
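
A sketch of the RandomState-based pattern suggested above (the test body is a hypothetical example):

import numpy as np


def test_additivity_example(random_seed):
    # np.random.RandomState has stricter stream-compatibility guarantees
    # across numpy versions and platforms than np.random.default_rng().
    rs = np.random.RandomState(random_seed)
    X = rs.normal(size=(20, 5))
    assert np.isfinite(X).all()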

I've re-run the tests a few times; I'll keep noting any flaky issues in #2960.

@connortann marked this pull request as ready for review July 3, 2023 13:05
@connortann added this to the 0.42.0 milestone Jul 3, 2023
@connortann (Collaborator, Author) commented Jul 4, 2023

FYI I've re-run the suite a few times to try to identify other flaky failures & seeds. Things seem to be passing consistently now: I've run 4 sets (of 8 parallel runs) without a failure 🎉

@thatlittleboy (Collaborator) left a review comment

Just one last clarification. I'll pre-approve since it's a minor one. Thanks for the good work!

[GIF: domino-effect-self-high-five]

@connortann merged commit c1a2264 into master Jul 4, 2023
@connortann deleted the chore/seeds branch July 4, 2023 18:49
@connortann mentioned this pull request Jul 3, 2023
@thatlittleboy mentioned this pull request Jul 8, 2023