[DEBUG] flaky gaussian blur #6755
I was able to "reproduce" the flakiness in this PR: https://github.com/pytorch/vision/actions/runs/3235448345/jobs/5299895095 You can download the inputs to the test here: https://github.com/pytorch/vision/suites/8739042153/artifacts/395530798 Of these,
fail CI, but pass for me locally. I'm not sure how this can happen. The only thing I can imagine so far is some kind of non-determinism inside the eager or scripted kernel. Gaussian blurring features a
but AFAIK that is only non-deterministic on CUDA.
The eager execution is the one that exhibits non-determinism. I've attached inputs and outputs that were generated by CI in this PR: debug-6755.zip. The files inside the archive can be loaded with input_args, input_kwargs, output_scripted, output_eager = torch.load(...). So far I have been unable to reproduce the non-determinism locally; 32ffeb6 tries to do so in CI.
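To pin down how the two outputs differ, a small helper along these lines can report where the mismatches are. This is only a sketch: `summarize_mismatch` is a hypothetical name, and the tensors below are synthetic stand-ins for the `output_scripted` and `output_eager` values that `torch.load(...)` would return from the files in the archive.

```python
import torch


def summarize_mismatch(output_scripted, output_eager):
    """Report how many elements differ, where, and by how much."""
    if output_scripted.shape != output_eager.shape:
        raise ValueError("outputs have different shapes")
    mismatch = output_scripted != output_eager
    # Cast to float64 before subtracting so that uint8 values cannot wrap around.
    diff = (output_scripted.to(torch.float64) - output_eager.to(torch.float64)).abs()
    return {
        "num_mismatched": int(mismatch.sum()),
        "indices": torch.nonzero(mismatch).tolist(),
        "max_abs_diff": float(diff.max()) if diff.numel() > 0 else 0.0,
    }


# Synthetic stand-ins (assumption): two uint8 images that differ in one value.
scripted = torch.zeros(1, 3, 4, 4, dtype=torch.uint8)
eager = scripted.clone()
eager[0, 1, 2, 3] = 1
print(summarize_mismatch(scripted, eager))
```

An exact comparison is used instead of a tolerance-based one because the point here is to locate every differing element, not to decide whether the difference is acceptable.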
It seems that this non-determinism happens at the setup / run level: if one call in a run exhibits this behavior, all calls in that run will.
This becomes more apparent when spawning multiple CI jobs in parallel: https://github.com/pytorch/vision/actions/runs/3240555909/jobs/5311321738 4 of the 10 jobs failed again, each with the same 100% failure behavior.
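The distinction between per-call and per-run non-determinism can be checked by calling the same kernel repeatedly inside one process and counting how often the result deviates from a reference. The following is a minimal sketch of that idea, not the check that was used in CI; `classify_run` and its parameters are hypothetical names.

```python
def classify_run(fn, reference, num_calls=20, equal=lambda a, b: a == b):
    """Call fn() repeatedly within one process and classify the outcome.

    Returns "all-pass" if every call matches the reference, "all-fail" if
    none does, and "mixed" if the result changes from call to call.
    """
    failures = sum(0 if equal(fn(), reference) else 1 for _ in range(num_calls))
    if failures == 0:
        return "all-pass"
    if failures == num_calls:
        return "all-fail"
    return "mixed"


# Stand-in kernels (assumption): plain functions instead of a real blur call.
print(classify_run(lambda: 42, reference=42))
print(classify_run(lambda: 41, reference=42))
```

For tensor outputs, `torch.equal` can be passed as `equal`. A run-level problem as described above would show up as "all-fail" in some jobs and "all-pass" in others, never as "mixed".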
@pmeier can you try running on float input?
Are we 100% sure that the input is the same for all matrix workers even though we set the seed? Can you compute
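One possible way to check that every matrix worker sees the same input is to log a digest of the tensor contents in each job and compare the values across jobs. This is a sketch under assumptions: `tensor_digest` is a hypothetical helper, and the seeded random image is only a stand-in for the real test input.

```python
import hashlib

import torch


def tensor_digest(tensor):
    """Return a SHA-256 hex digest over a tensor's dtype, shape, and raw bytes."""
    h = hashlib.sha256()
    h.update(str(tensor.dtype).encode())
    h.update(str(tuple(tensor.shape)).encode())
    h.update(tensor.detach().cpu().contiguous().numpy().tobytes())
    return h.hexdigest()


# Stand-in input (assumption): a seeded random uint8 image.
torch.manual_seed(0)
image = torch.randint(0, 256, (3, 8, 8), dtype=torch.uint8)
print(tensor_digest(image))
```

Including the dtype and shape in the digest means that two tensors with identical bytes but a different layout are still reported as different.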
We have never had any flakiness on float inputs, so this will probably not detect anything.
Yes. You can download the archive from #6755 (comment), whose content was generated by separate runs. Inputs as well as the scripted output match exactly. Only the eager output differs in exactly one value.
No longer needed since we merged #6762.