[FBcode->GH] Fix DeformConvTester::test_backward_cuda #3942
Conversation
Summary: `test_backward_cuda_contiguous` and `test_backward_cuda_non_contiguous` have been failing on fbcode for a while with the error `too many resources requested for launch`, which suggests that too many threads per block are requested. This issue already caused problems in the original PR pytorch#2791 (comment), where the author decided that CC >= 6 was a good threshold because GPUs with CC >= 6 have more registers (CC = Compute Capability).

However, I'm not certain that this is actually true: if we look at https://en.wikipedia.org/wiki/CUDA#Version_features_and_specifications, it's clear that 6.2 has fewer registers per thread block than 6.0. So I'm not sure this threshold completely makes sense. Moreover, let's note that the current tests (as on `master`):

- **pass** on the OSS Linux CI, which relied on a P4 GPU (up to last week), i.e. **CC = 6.1**
- **pass** on the OSS Windows CI, which relies on a T4 GPU, i.e. **CC = 7.5**
- **fail** on the AWS cluster, which relies on a V100 GPU, i.e. **CC = 7.0**

It is quite unclear to me what kind of resource is "enough" for the tests to pass on both 6.1 and 7.5 but not on 7.0. As a result, I think it's safer to just reduce the number of threads per block, irrespective of the CC.

@ngimel, fmassa suggested that I tag you here since you could have some valuable insight for us. Thanks!

Reviewed By: fmassa

Differential Revision: D28641626

fbshipit-source-id: 2618c366c5d18bbb7ebafc33032e7ac6c0404d0b
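For illustration, lowering the per-block thread count reduces each block's register and shared-memory demand (which is what `too many resources requested for launch` complains about), at the cost of launching more blocks. Here is a minimal Python sketch of the launch-size arithmetic; the constants and the `grid_size` helper are hypothetical, not the actual torchvision code:

```python
def grid_size(total_work: int, threads_per_block: int, max_blocks: int = 4096) -> int:
    # Number of blocks needed to cover `total_work` items with
    # `threads_per_block` threads each, capped at `max_blocks`
    # (a common pattern in CUDA launch-configuration helpers).
    blocks = (total_work + threads_per_block - 1) // threads_per_block
    return min(max_blocks, max(blocks, 1))

# Hypothetical values: halving the threads per block doubles the block
# count, but each block now needs roughly half the registers, so the
# launch fits on GPUs with tighter per-block resource limits.
OLD_THREADS = 1024  # assumed previous value, for illustration only
NEW_THREADS = 512   # assumed reduced value, for illustration only

work = 100_000
print(grid_size(work, OLD_THREADS))  # -> 98 blocks
print(grid_size(work, NEW_THREADS))  # -> 196 blocks
```

The total number of threads covering the work stays (roughly) the same either way; only the per-block resource footprint changes, which is why this fix does not need to depend on the compute capability.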
Clang is failing for a minor issue, and I'm not sure what we should do in this case. I guess it's fine to ignore the failure?
The code change looks good but the linter fails.
After fixing the styling issue, we will still need to pass this change to FBcode.
Edit: Sorry, my browser showed a cached version of your PR, without your above comment. We have two options:
- Merge now and fix the styling issue ASAP in a follow-up PR. This has the unfortunate side-effect of breaking master for 1 commit, but it does not cause conflicts on FBcode.
- Close this PR, send a new PR that is not marked with [FBcode->GH] and properly adds the change (including styles), and then move this single PR to FBcode and resolve conflicts.
I don't have a strong opinion on which we do. To avoid similar issues in the future, we should always export our Diffs to GH before merging on FBcode and ensure all CIs pass.
@datumbox so how do you suggest we proceed? As I mentioned above, fixing the formatting in this PR will create conflicts I think, whether we prefix with [FBcode->GH] or not. We can merge as-is and fix the formatting in another PR, but that might be overkill, as the formatting issue should only be present in the PR that introduces it (so clang will stay green in master starting from the next merged PR). At least that's how flake8 works. @fmassa what would you recommend?
OK, sorry @datumbox, I replied before I saw your EDIT. Let's go with option 1 then, as it has the lowest risk of conflicts, which is always a pain when syncing. I'll merge now and provide the formatting fix right away.
@fmassa you're the oncall this week and I forgot to include the tag in the commit message for this PR, sorry.