CHORE: Explicit random seed for tests #3048

Merged: connortann merged 19 commits into master from chore/seeds on Jul 4, 2023
Conversation

@connortann (Collaborator) commented Jun 27, 2023

Improves the use of randomness in the test suite for reproducibility, and to mitigate the occurrence of flaky tests.

Key changes are in conftest.py:

  • Adds a fixture to provide a changing random seed for tests. If a test fails, the random seed is printed by pytest.
  • Uses a local RandomState in each test rather than the global random state.
  • Adds a CLI argument to fix the random state, for easy local reproduction of any failures.
  • Resets the global random state to zero before all tests.

Should help address #2960, improving the reproducibility of any test failures.
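
For illustration, a minimal sketch of what such a conftest.py could look like, assuming the fixture and flag names described in this PR (random_seed, --random-state); the bodies are assumptions, not the repository's actual implementation:

import numpy as np
import pytest


def pytest_addoption(parser):
    # CLI flag to pin the seed when reproducing a failure locally.
    parser.addoption("--random-state", action="store", default=None,
                     help="Fix the seed returned by the random_seed fixture.")


@pytest.fixture
def random_seed(request):
    """Return a seed that varies per run; pytest prints it on failure."""
    pinned = request.config.getoption("--random-state")
    if pinned is not None:
        return int(pinned)
    # Draw from OS entropy so the autouse reset below does not make
    # every run identical.
    return np.random.SeedSequence().entropy % 1000


@pytest.fixture(autouse=True)
def _reset_global_random_state():
    # Reset the global state to zero before every test.
    np.random.seed(0)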

@codecov bot commented Jun 27, 2023

Codecov Report

Merging #3048 (99e1a62) into master (974d996) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #3048   +/-   ##
=======================================
  Coverage   54.92%   54.92%           
=======================================
  Files          90       90           
  Lines       12862    12862           
=======================================
  Hits         7064     7064           
  Misses       5798     5798           

see 2 files with indirect coverage changes


@connortann added the enhancement and ci labels Jun 27, 2023
@connortann self-assigned this Jun 27, 2023
@connortann requested a review from thatlittleboy June 27, 2023 10:05
@connortann marked this pull request as ready for review June 27, 2023 10:06
@thatlittleboy (Collaborator)

Thanks @connortann for the PR. I think setting the seeds is a good idea in general.

I do have one concern (which I raised in #2960 as well), which is that the failures we are encountering are not of the random() > 0.5 kind; they are mostly additivity checks failing.

Right now, it's not entirely clear to me why this is happening, so it's particularly concerning: it could mean there are some edge cases that our algorithms are not considering, thus resulting in additivity failures. (I'm leaning towards this conclusion because there are many open issues here talking about this very problem.)

In some ways, it's a good thing that we're getting the occasional errors: they actually give us examples upon which we can debug.

@connortann (Collaborator, Author)

I completely agree. So, I suggest we don't close issue #2960, but keep it open until we can diagnose the root cause. However, in the meantime it's probably preferable for this issue not to affect other unrelated PRs.

Perhaps setting the random seed will also be helpful for diagnosing the issue: we could determine a value of the random seed that reliably causes the tests to fail.

@thatlittleboy (Collaborator) commented Jun 27, 2023

But the only way for us to produce these failures at the moment is via the tests running in CI (where we don't fix the random seed), right?

If we fix the random seed now, then it would make it harder for us to identify which tests we need to focus our attention on (to fix the additivity issues).

> in the meantime it's probably preferable for this issue not to affect other unrelated PRs.

The impact isn't that bad; we just need to re-run the CI. It's annoying, but I see the aforementioned issue as a bigger problem than the annoyance of having to re-run CI.

@connortann (Collaborator, Author) commented Jun 27, 2023

> But the only way for us to produce these failures at the moment is via the tests running in CI (where we don't fix the random seed), right?

I don't think that's quite right: if you wanted to work on the additivity issue, you could create a PR that sets the random seed to a number that does reliably cause a failure. That's an improvement on the current state, as you will be able to determine when the issue is actually fixed.

@thatlittleboy (Collaborator)

That's not quite what I meant. Let me rephrase: are we able to confidently list all of the existing tests that fail additivity checks for particular random seeds, along with the random seeds that generate the failures? If we can't, I don't think #2960 should be closed.

So far, I've listed one test (the xgboost test) with one random seed in the original issue to reproduce the problem.
Are there any more? And what seeds cause the additivity failures in those?

My point is that we don't have full answers to the above questions, and the only way to get them is to leave the tests running as they are in CI (how else would we know which tests are "flaky"?), unless we introduce the hypothesis library into our testing.

@connortann (Collaborator, Author) commented Jun 28, 2023

I see what you mean; I'm with you.

I'll rejig this PR with a slightly different aim then: to ensure that each test with randomness accepts a random seed, and that the seed is printed to the pytest logs if the test fails.

@connortann (Collaborator, Author) commented Jun 28, 2023

I had a go implementing the above. Hopefully that is the best of both worlds: a different seed will be used in each run by default, but it will be easy to fix the seed to reproduce a given failure.

My suggestion is to use the new random_seed fixture:

def test_foobar(random_seed):
    assert False

If the test fails, the seed will be printed by pytest:

tests/explainers/test_deep.py F                                 [100%]

============================== FAILURES ===============================
_____________________________ test_foobar _____________________________

random_seed = 736

    def test_foobar(random_seed):
>       assert False
E       assert False

tests/explainers/test_deep.py:622: AssertionError
======================= short test summary info =======================
FAILED tests/explainers/test_deep.py::test_foobar - assert False
================== 1 failed, 261 deselected in 2.56s ==================
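
To reproduce such a failure locally, the printed seed can then be passed back in via the new CLI argument, e.g.:

pytest tests/explainers/test_deep.py::test_foobar --random-state 736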

@connortann changed the title from "CHORE: Fix random seed for tests" to "CHORE: Explicit random seed for tests" Jun 28, 2023
@connortann (Collaborator, Author) commented Jun 28, 2023

Found a flaky failure: test_tf_keras_linear, with random_seed = 896

@thatlittleboy (Collaborator) left a review comment

EDIT: This approach I can get on board with :)


There are a few more test files that I think we should also cover in this PR. (I just did a grep for np.random.seed in the project)

  • tests/explainers/test_linear.py
  • tests/explainers/test_tree.py
  • and a couple more tests in tests/explainers/test_kernel.py (like test_linear() in that file)

@thatlittleboy (Collaborator)

> Found a flaky failure: test_tf_keras_linear, with random_seed = 896

It failed again with random_seed = 823, so there's clearly something off with the implementation here. We'll need to look into this at some point.

@connortann (Collaborator, Author)

> There are a few more test files that I think we should also cover in this PR

I noticed in test_linear.py that the global random seed was reset, but randomness doesn't seem to be used explicitly in the test. I think adding fuzzing here wouldn't really make sense, as it isn't clear what, if anything, is being fuzzed.

However, it's probably wise to set the global seed explicitly for reproducibility, as the implicit default expectation is that unit tests are deterministic and reproducible. I added a global_random_seed fixture to handle this, which has autouse=True.

Then, for tests that explicitly wish to use fuzzing, the random_seed fixture is used with a new numpy random Generator. That should make the application of fuzzing explicit and obvious to future readers of the tests.
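
For illustration, a test using that pattern might look like this (a sketch; the test body is a hypothetical example):

import numpy as np


def test_with_fuzzing(random_seed):
    # Fuzzing is explicit: data comes from a local Generator seeded by
    # the fixture, not from the global numpy random state.
    rng = np.random.default_rng(random_seed)
    X = rng.normal(size=(10, 3))
    assert np.isfinite(X).all()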

@connortann requested a review from thatlittleboy July 2, 2023 18:10
@connortann marked this pull request as draft July 2, 2023 20:56
@connortann (Collaborator, Author) commented Jul 3, 2023

I made a few further updates after examining some failures:

  • Suggested the use of RandomState rather than default_rng(), as it has stricter compatibility guarantees across versions and platforms (see the sketch after this list).
  • Added more PyTorch, TensorFlow and XGBoost random state seeds.
  • Did not change tests for plotting functions: both the test data generation and the plotting function itself use the global random state, so changing the way the test data is generated would lead to a different output image.
  • Added a pytest CLI argument to set the random state, for ease of local debugging. Example call:
    pytest -k my_function --random-state 123
  • Pinned a passing seed for the tests that we've already identified as being flaky, as tracked in "Random floating-point errors in GitHub Actions" (#2960). Once we've identified that a given test is flaky, I don't see any further benefit in having it fail on other unrelated PRs.
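
A sketch of the RandomState-based pattern suggested above (the test body is a hypothetical example):

import numpy as np


def test_additivity_example(random_seed):
    # np.random.RandomState has stricter stream-compatibility guarantees
    # across numpy versions and platforms than np.random.default_rng().
    rs = np.random.RandomState(random_seed)
    X = rs.normal(size=(20, 5))
    assert np.isfinite(X).all()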

I've re-run the tests a few times; I'll keep noting any flaky issues in #2960.

@connortann marked this pull request as ready for review July 3, 2023 13:05
@connortann added this to the 0.42.0 milestone Jul 3, 2023
@connortann (Collaborator, Author) commented Jul 4, 2023

FYI I've re-run the suite a few times to try to identify other flaky failures & seeds. Things seem to be passing consistently now: I've run 4 sets (of 8 parallel runs) without a failure 🎉

@thatlittleboy (Collaborator) left a review comment

Just one last clarification. I'll pre-approve since it's a minor one. Thanks for the good work!

[GIF: domino-effect-self-high-five]

@connortann merged commit c1a2264 into master Jul 4, 2023
@connortann deleted the chore/seeds branch July 4, 2023 18:49
@connortann mentioned this pull request Jul 3, 2023
@thatlittleboy mentioned this pull request Jul 8, 2023