[ATen][Native][CUDA][SCALED_MM] limit f8f8bf16 rowwise scaled matmul to sm_90 #145728
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/145728
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit d8acae8 with merge base 71caac2.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -790,6 +790,9 @@ void check_inputs(
     const at::Tensor& scale_b,
     const std::optional<at::Tensor>& bias,
     const at::Tensor& out) {
+  auto dprops = at::cuda::getCurrentDeviceProperties();
+  TORCH_CHECK(dprops->major == 9, "f8f8bf16_rowwise is sm_90 specific.");
Shouldn't there be another change to not call into this at all?
We should fall back to another implementation for sm10+ right?
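For context, a dispatch-time fallback of the kind suggested here might look roughly like the sketch below. This is hypothetical only: the function and argument names (f8f8bf16_rowwise_sm90, f8f8bf16_rowwise_generic, XQ, WQ) are placeholders and do not exist in ATen today.

#include <ATen/core/Tensor.h>
#include <ATen/cuda/CUDAContext.h>

// Hypothetical sketch: keep the sm_90 CUTLASS path and route every other
// architecture to some other implementation instead of hitting a CUTLASS abort.
// Both callee names are placeholders, not real ATen functions.
void f8f8bf16_rowwise_sm90(at::Tensor XQ, at::Tensor WQ, at::Tensor out);
void f8f8bf16_rowwise_generic(at::Tensor XQ, at::Tensor WQ, at::Tensor out);

void f8f8bf16_rowwise_dispatch(at::Tensor XQ, at::Tensor WQ, at::Tensor out) {
  auto dprops = at::cuda::getCurrentDeviceProperties();
  if (dprops->major == 9) {
    f8f8bf16_rowwise_sm90(XQ, WQ, out);     // Hopper: existing wgmma-based kernel
  } else {
    f8f8bf16_rowwise_generic(XQ, WQ, out);  // sm_100+ / older: would need new kernels
  }
}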
I also think, at the very least, we should have a tracker for the functionality/features we skip on new hardware, so that they stay on our radar and support can be added in full later.
What about SM_90 minor version? Not relevant at all here?
What about SM_90 minor version? Not relevant at all here?
It is not relevant. But, I missed SM_89, I will include it as well.
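For reference, cudaDeviceProp::major/minor map onto the sm_XY names used in this thread: sm_90 is major 9 / minor 0, sm_89 (Ada) is major 8 / minor 9, and sm_100 (Blackwell) is major 10, so a check on major alone cannot single out sm_89 among the other sm_8x parts. A minimal sketch of such a query (the helper names and the supported-architecture set are illustrative assumptions, not the PR's final logic):

#include <ATen/cuda/CUDAContext.h>

// Minimal sketch: translate the current device's compute capability into the
// sm_XY naming used above (sm_90 -> 9.0, sm_89 -> 8.9, sm_100 -> 10.0).
inline bool is_sm90(const cudaDeviceProp* p) { return p->major == 9 && p->minor == 0; }
inline bool is_sm89(const cudaDeviceProp* p) { return p->major == 8 && p->minor == 9; }

inline bool device_supported_for_rowwise_fp8() {
  const cudaDeviceProp* dprops = at::cuda::getCurrentDeviceProperties();
  // Assumption for illustration only: Hopper and Ada are the supported paths.
  return is_sm90(dprops) || is_sm89(dprops);
}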
We should fall back to another implementation for sm10+ right?
Correct, the current approach/kernel is not compatible with SM_100+. Since there are no kernels for the Blackwell machines yet, I propose to just throw an exception. Otherwise, it will fail with a CUTLASS error, which is not elegant behavior.
@Aidyn-A do they still fail with CUTLASS 3.7 btw? Or do we need to wait for 3.8?
That is a good question. I will need to check it. Thanks for reminding me about the CUTLASS update!
@Aidyn-A We also need a cuDNN update (only for the versions we started the ManyLinux upgrade on, so 2.6/2.8).
Thanks @Skylion007 for the PR! I just ran the test with the latest CUTLASS 3.8 on SM_100 and got the errors:
FAILED [0.3029s] test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_rowwise_scaling_sanity_use_fast_accum_False_cuda - AssertionError: Tensor-likes are not close!
FAILED [2.5473s] test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_float8_rowwise_scaling_sanity_use_fast_accum_True_cuda - AssertionError: Tensor-likes are not close!
FAILED [0.0025s] test/test_matmul_cuda.py::TestFP8MatmulCudaCUDA::test_scaled_mm_vs_emulated_row_wise_bfloat16_cuda - AssertionError: Tensor-likes are not close!
The reason it failed with numerical mismatches is that the kernel was simply aborted with the following message:
ERROR : Arch conditional MMA instruction used without targeting sm90a compute capability. Aborting.
Seems like __CUDA_ARCH_FEAT_SM90_ALL is not defined; not sure if this is a CMake or CUTLASS bug.
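For reference, nvcc only defines __CUDA_ARCH_FEAT_SM90_ALL when a translation unit is compiled for the feature-enabled sm_90a target (e.g. arch=compute_90a,code=sm_90a, or "9.0a" in TORCH_CUDA_ARCH_LIST). CUTLASS's arch-conditional code follows roughly the pattern below; this is a simplified sketch, not the exact CUTLASS source, but it shows why the failure surfaces as numerical mismatches rather than an exception.

#include <cstdio>

// Simplified sketch of the guard pattern: if the TU is not built for sm_90a,
// the device-side branch only prints and traps, so the host never sees an
// error status, the output buffer is never written, and the tests above fail
// later with "Tensor-likes are not close!" instead of a proper exception.
__device__ void wgmma_guarded_example() {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900) && defined(__CUDA_ARCH_FEAT_SM90_ALL)
  // the real code would emit wgmma.mma_async / wgmma.commit_group / wgmma.wait_group here
#else
  printf("ERROR : Arch conditional MMA instruction used without targeting sm90a compute capability. Aborting.\n");
  __trap();
#endif
}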
@@ -790,6 +790,9 @@ void check_inputs(
     const at::Tensor& scale_b,
     const std::optional<at::Tensor>& bias,
     const at::Tensor& out) {
+  auto dprops = at::cuda::getCurrentDeviceProperties();
+  TORCH_CHECK(dprops->major <= 9, "f8f8bf16_rowwise is sm_90 specific.");
The real bug here is that CUTLASS just emits a print statement if the arch isn't supported and doesn't propagate an error up the stack, right? Seems like an API design failure over there, but I want to know if there is an easy fix that can be made for 3.8.
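For comparison, the host-side statuses CUTLASS does return (from can_implement/initialize/run) can be turned into exceptions with a tiny helper like the sketch below; the problem is that the arch-conditional abort above happens on the device, after those host-side checks have already passed, so nothing like this ever fires.

#include <cutlass/cutlass.h>
#include <c10/util/Exception.h>

// Sketch: turn a cutlass::Status into a C++ exception on the host. A
// device-side "print and trap" path bypasses this kind of propagation entirely.
inline void throw_on_cutlass_error(cutlass::Status status, const char* what) {
  TORCH_CHECK(status == cutlass::Status::kSuccess,
              what, ": ", cutlass::cutlassGetStatusString(status));
}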
I do not think there is an easy solution for that, as a bunch of PTX-level instructions needed for RowwiseScaledMM are not supported on Blackwell:
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208010; error : Instruction 'wgmma.mma_async with FP8 types' not supported on .target 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208010; error : Instruction 'wgmma.mma_async with FP8 types' cannot be compiled for architecture 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208015; error : Instruction 'wgmma.commit_group' not supported on .target 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208015; error : Instruction 'wgmma.commit_group' cannot be compiled for architecture 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208025; error : Instruction 'wgmma.wait_group' not supported on .target 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208025; error : Instruction 'wgmma.wait_group' cannot be compiled for architecture 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208389; error : Instruction 'wgmma.fence' not supported on .target 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 208389; error : Instruction 'wgmma.fence' cannot be compiled for architecture 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 209850; error : Instruction 'setmaxnreg.dec' not supported on .target 'sm_100'
ptxas /tmp/tmpxft_000138f8_00000000-6_RowwiseScaledMM.ptx, line 210122; error : Instruction 'setmaxnreg.inc' not supported on .target 'sm_100'
@Aidyn-A Okay, I found the issue: wgmma support was dropped completely in Blackwell and replaced with a new instruction with a different call signature/calling convention, so yeah, this isn't an easy fix and probably requires a deep refactor of CUTLASS ;-;
@Aidyn-A Don't add a new exception here, just reuse the current exception logic on line 712 and expand the if conditional
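A hypothetical illustration of that suggestion is sketched below; the existing check at line 712 is not reproduced here, and its condition is stood in for by a placeholder, so the condition, helper name, and message are all made up for illustration.

#include <ATen/cuda/CUDAContext.h>
#include <c10/util/Exception.h>

// Hypothetical sketch of folding the arch requirement into an existing guard
// instead of adding a second TORCH_CHECK; `existing_rowwise_checks_ok` stands
// in for whatever the conditional at line 712 already verifies.
inline void check_rowwise_supported(bool existing_rowwise_checks_ok) {
  auto dprops = at::cuda::getCurrentDeviceProperties();
  TORCH_CHECK(
      existing_rowwise_checks_ok && dprops->major == 9,
      "f8f8bf16_rowwise requires an sm_90 (Hopper) device.");
}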
cc @eqy do we have any tracking for sm100 updates needed? I think the proper fix is to have another instantiation w/
More so, I'm curious if we have an issue or somewhere we can backlog this stuff.
sm100 doesn't have any PTXAS instructions for wgmma emulation, so we need a CUTLASS update with new kernels beyond 3.8 (updating to 3.8 will not fix this, sadly).
OK with merging this temporarily before the CUTLASS upgrade to abate noisy failures on Blackwell.
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
…to sm_90 (pytorch#145728)
The CUTLASS-based kernel for f8f8bf16 rowwise scaled matmul is specific to Hopper devices only. It is not reusable on newer devices without modifications. This PR adds a guard so that this matmul is sm_90 specific. Once a kernel for newer architectures is available, the guard may be removed.
Pull Request resolved: pytorch#145728
Approved by: https://github.com/Skylion007, https://github.com/eqy
cc @ptrblck @msaroufim @eqy @yanbing-j @vkuzo @albanD @kadeng @penguinwu @manuelcandales @SherlockNoMad @angelayi