[DEBUG] flaky gaussian blur #6755


Closed
pmeier wants to merge 16 commits from the debug-flaky-gaussian-blur branch

Conversation

@pmeier (Collaborator) commented Oct 12, 2022

No description provided.

@pmeier (Collaborator, Author) commented Oct 12, 2022

I was able to "reproduce" the flakiness in this PR: https://github.com/pytorch/vision/actions/runs/3235448345/jobs/5299895095

You can download the inputs to the test here: https://github.com/pytorch/vision/suites/8739042153/artifacts/395530798

Of these,

  • test_scripted_vs_eager[cpu-gaussian_blur_video-07]
  • test_scripted_vs_eager[cpu-gaussian_blur_video-08]

fail CI, but pass for me locally. I'm not sure how this can happen. The only thing I can imagine so far is some kind of non-determinism inside the eager or scripted kernel. Gaussian blurring features a conv2d call

img = conv2d(img, kernel, groups=img.shape[-3])

but AFAIK that is only non-deterministic on CUDA.
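
For reference, a minimal sketch of how that determinism could be checked on CPU; the kernel construction below is only illustrative, not the actual torchvision Gaussian kernel:

```python
import torch
from torch.nn.functional import conv2d

torch.manual_seed(0)
img = torch.randint(0, 256, (3, 32, 32), dtype=torch.uint8).float()  # C, H, W
# Illustrative per-channel 3x3 kernel; torchvision builds a proper Gaussian kernel instead.
kernel = torch.full((img.shape[-3], 1, 3, 3), 1.0 / 9.0)

out1 = conv2d(img.unsqueeze(0), kernel, groups=img.shape[-3])
out2 = conv2d(img.unsqueeze(0), kernel, groups=img.shape[-3])
# Expected True on CPU; only CUDA convolutions are documented as potentially non-deterministic.
print(torch.equal(out1, out2))
```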

@pmeier (Collaborator, Author) commented Oct 13, 2022

The eager execution is the one that exhibits non-determinism. I've attached inputs and outputs that were generated by CI in this PR: debug-6755.zip

The files inside the archive can be loaded with

input_args, input_kwargs, output_scripted, output_eager = torch.load(...)

So far I have been unable to reproduce the non-determinism locally. 32ffeb6 tries to do so in CI.
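
For anyone inspecting the attachment, a minimal sketch of how a single file could be checked; "sample.pt" is a placeholder name for one of the extracted files:

```python
import torch

input_args, input_kwargs, output_scripted, output_eager = torch.load("sample.pt")

# Locate the elements on which the eager and scripted kernels disagree.
mismatch = (output_scripted != output_eager).nonzero()
print(f"{mismatch.shape[0]} mismatching element(s):\n{mismatch}")

# Raises with a detailed diff if the outputs are not (close to) equal.
torch.testing.assert_close(output_eager, output_scripted)
```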

@pmeier (Collaborator, Author) commented Oct 13, 2022

It seems that this non-determinism happens at a setup / run level. That is, if one call in a run exhibits this behavior, all calls in that run do.

@pmeier (Collaborator, Author) commented Oct 13, 2022

This becomes more apparent when spawning multiple CI jobs in parallel: https://github.com/pytorch/vision/actions/runs/3240555909/jobs/5311321738

4 of the 10 jobs failed again, each with the same 100% failure behavior.

@vfdev-5 (Collaborator) commented Oct 13, 2022

@pmeier can you try to run on a float input?

F.gaussian_blur_video(video.float(), kernel_size=3)

Are we 100% sure that the input is the same for all matrix workers even though we set the seed? Can you compute video.float().mean() as an ID?
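
A minimal sketch of this check; the import path is an assumption based on the prototype namespace used at the time, and the input shape is a placeholder for the real test input:

```python
import torch
from torchvision.prototype.transforms import functional as F

torch.manual_seed(0)
video = torch.randint(0, 256, (4, 3, 32, 32), dtype=torch.uint8)  # placeholder T, C, H, W input

# Fingerprint the input so every matrix worker can confirm it sees the same data.
print("input id:", video.float().mean().item())

# Run the kernel on a float input, where no flakiness has been observed so far.
out = F.gaussian_blur_video(video.float(), kernel_size=3)
print("output id:", out.mean().item())
```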

@pmeier (Collaborator, Author) commented Oct 13, 2022

> can you try to run on a float input?

We have never had any flakiness on float inputs, so this will probably not detect anything.

> Are we 100% sure that the input is the same for all matrix workers even though we set the seed?

Yes. You can download the archive from #6755 (comment); its contents were generated by separate runs. The inputs as well as the scripted outputs match exactly. Only the eager output differs, in exactly one value.
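
A minimal sketch of that cross-run comparison; "run_a.pt" and "run_b.pt" are placeholder names for files extracted from the archives of two separate runs:

```python
import torch

args_a, kwargs_a, scripted_a, eager_a = torch.load("run_a.pt")
args_b, kwargs_b, scripted_b, eager_b = torch.load("run_b.pt")

# Inputs and scripted outputs are expected to match bit-for-bit across runs.
for a, b in zip(args_a, args_b):
    if isinstance(a, torch.Tensor):
        assert torch.equal(a, b)
    else:
        assert a == b
assert torch.equal(scripted_a, scripted_b)

# Only the eager outputs differ; locate the offending value.
print((eager_a != eager_b).nonzero())
```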

@pmeier (Collaborator, Author) commented Oct 17, 2022

No longer needed since we merged #6762.

@pmeier closed this Oct 17, 2022
@pmeier deleted the debug-flaky-gaussian-blur branch on October 17, 2022 at 18:18