[FBcode->GH] Fix DeformConvTester::test_backward_cuda #3942
Conversation
Summary: `test_backward_cuda_contiguous` and `test_backward_cuda_non_contiguous` have been failing on fbcode for a while with the error `too many resources requested for launch`, which suggests that too many threads per block are requested. This issue already caused problems in the original PR pytorch#2791 (comment), where the author decided that CC >= 6 was a good threshold because GPUs with CC >= 6 have more registers (CC = Compute Capability).

However, I'm not certain that this is actually true: if we look at https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications, it's clear that 6.2 has fewer registers per thread block than 6.0. So I'm not sure this threshold completely makes sense. Moreover, let's note that the current tests (as on `master`):

- **pass** on the OSS Linux CI, which relied on a P4 GPU (up to last week), i.e. **CC = 6.1**
- **pass** on the OSS Windows CI, which relies on a T4 GPU, i.e. **CC = 7.5**
- **fail** on the AWS cluster, which relies on a V100 GPU, i.e. **CC = 7.0**

It is quite unclear to me what kind of resource is "enough" for the tests to pass on both 6.1 and 7.5 but not on 7.0. As a result, I think it's safer to just reduce the number of threads per block, irrespective of the CC.

@ngimel, fmassa suggested that I tag you here since you could have some valuable insight for us. Thanks!

Reviewed By: fmassa

Differential Revision: D28641626

fbshipit-source-id: 2618c366c5d18bbb7ebafc33032e7ac6c0404d0b
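For illustration, lowering the per-block thread count reduces each block's register and shared-memory demand (which is what `too many resources requested for launch` complains about), at the cost of launching more blocks. Here is a minimal Python sketch of the launch-size arithmetic; the constants and the `grid_size` helper are hypothetical, not the actual torchvision code:

```python
def grid_size(total_work: int, threads_per_block: int, max_blocks: int = 4096) -> int:
    # Number of blocks needed to cover `total_work` items with
    # `threads_per_block` threads each, capped at `max_blocks`
    # (a common pattern in CUDA launch-configuration helpers).
    blocks = (total_work + threads_per_block - 1) // threads_per_block
    return min(max_blocks, max(blocks, 1))

# Hypothetical values: halving the threads per block doubles the block
# count, but each block now needs roughly half the registers, so the
# launch fits on GPUs with tighter per-block resource limits.
OLD_THREADS = 1024  # assumed previous value, for illustration only
NEW_THREADS = 512   # assumed reduced value, for illustration only

work = 100_000
print(grid_size(work, OLD_THREADS))  # -> 98 blocks
print(grid_size(work, NEW_THREADS))  # -> 196 blocks
```

The total number of threads covering the work stays (roughly) the same either way; only the per-block resource footprint changes, which is why this fix does not need to depend on the compute capability.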
Clang is failing for a minor issue, and I'm not sure what we should do in this case. I guess it's fine to ignore the failure?
The code change looks good but the linter fails.
After fixing the styling issue, we will still need to pass this change to FBcode.
Edit: Sorry, my browser showed a cached version of your PR, without your above comment. We have two options:
- Merge now and fix the styling issue ASAP in a follow-up PR. This has the unfortunate side-effect of breaking master for 1 commit, but it does not cause conflicts on FBcode.
- Close this PR, send a new PR that is not marked with [FBcode->GH] and properly adds the change (including styles), and then move this single PR to FBcode and resolve conflicts.
I don't have a strong opinion on which we do. To avoid similar issues in the future, we should always export our Diffs to GH before merging on FBcode and ensure all CIs pass.
@datumbox so how do you suggest we proceed? As I mentioned above, fixing the formatting in this PR will create conflicts I think, whether we prefix with [FBcode->GH] or not. We can merge as-is and fix the formatting in another PR, but that might be overkill, as the formatting issue should only be present in the PR that introduces it (so clang will stay green in master starting from the next merged PR). At least that's how flake8 works. @fmassa what would you recommend?
OK, sorry @datumbox, I replied before I saw your EDIT. Let's go with option 1 then, as it has the lowest risk of conflicts, which is always a pain when syncing. I'll merge now and provide the formatting fix right away.
@fmassa you're the oncall this week and I forgot to include the tag in the commit message for this PR, sorry.