Skip to content

[AMDGPU][Clang] Allow amdgpu-waves-per-eu attribute to lower target occupancy range #138284

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

lucas-rami
Copy link
Contributor

@lucas-rami lucas-rami commented May 2, 2025

The amdgpu-waves-per-eu attribute currently does not allow users to lower the target occupancy range that the AMDGPU backend will try to achieve. For example, for a kernel targeting gfx942 with default flat workgroup sizes and no LDS usage, the AMDGPUSubtarget::getWavesPerEU method will by default produce the range [4,10]. Setting "amdgpu-waves-per-eu=M,N" (N being optional) will only have an effect if $4 \leq M \leq N \leq 10$, in which case the [M,N] range will be produced instead. Advanced developers may in some cases know that a specifc kernel would not benefit from higher occupancies and wish to communicate that to the backend. It in turns could make better codegen decisions if it knows that increasing occupancy is not a priority.

This modifies the interpretation of the amdgpu-waves-per-eu attribute to enable this behavior. User-provided minimum/maximum number of waves/EU are now able to change the default waves/EU range almost arbitrarily, with only subtarget's specifications and the maximum occupancy induced by workgroup size and LDS usage limiting the target occupancy range. In the previous example, the target range will be [M,N] as long as $1 \leq M \leq N \leq 10$.

To allow users to specify only a maximum desired number of waves per EU, the attribute now accepts a minimum value of 0 and understands it as a wish to not restrict the mininum number of waves per EU (a minimum value of 1 indicates instead that the user is fine if the final occupancy is 1, hence it cannot be used as the default). The Clang attribute now additionally accepts ranges of the form [0,M] to support the latter use case.

@llvmbot
Copy link
Member

llvmbot commented May 2, 2025

@llvm/pr-subscribers-llvm-ir
@llvm/pr-subscribers-clang

@llvm/pr-subscribers-backend-amdgpu

Author: Lucas Ramirez (lucas-rami)

Changes

Clang's attribute reference specifies, for the "amdgpu-waves-per-eu" attribute, that "Passing 0, 0 as <min>, <max> implies the default behavior". However, the backend currently treats a minimum or maximum of 0 as an invalid bound that makes the whole attribute invalid/ignored.

This makes the backend treat a 0 in the range as per the documentation. In particular, this allows a user to specify a maximum desired number of waves/EU without specifying a minimum (e.g., "amdgpu-waves-per-eu"="0,4"), which is not currently feasible.

The following equivalences hold ($$0<N<getMaxWavesPerEU()$$):

  • "amdgpu-waves-per-eu"="N,0" and "amdgpu-waves-per-eu"="N".
  • "amdgpu-waves-per-eu"="0,0" and no attribute.

Full diff: https://github.com/llvm/llvm-project/pull/138284.diff

5 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp (+1-2)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp (+16-8)
  • (modified) llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h (+3-1)
  • (modified) llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll (+38)
  • (modified) llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll (+14-14)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index b9ce8dc0c5cdb..0bbbe766968fc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -1125,8 +1125,7 @@ struct AAAMDWavesPerEU : public AAAMDSizeRangeAttribute {
       indicateOptimisticFixpoint();
     };
 
-    std::pair<unsigned, unsigned> MaxWavesPerEURange{
-        1U, InfoCache.getMaxWavesPerEU(*F)};
+    std::pair<unsigned, unsigned> MaxWavesPerEURange{0, 0};
 
     // If the attribute exists, we will honor it if it is not the default.
     if (auto Attr = InfoCache.getWavesPerEUAttr(*F)) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 563605f964cc6..4212d97eb9404 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -191,17 +191,25 @@ std::pair<unsigned, unsigned> AMDGPUSubtarget::getEffectiveWavesPerEU(
       getOccupancyWithWorkGroupSizes(LDSBytes, FlatWorkGroupSizes).second};
   Default.first = std::min(Default.first, Default.second);
 
-  // Make sure requested minimum is less than requested maximum.
+  if (RequestedWavesPerEU.first) {
+    // Requested minimum must not violate subtarget's specifications.
+    if (RequestedWavesPerEU.first < Default.first)
+      return Default;
+    // Requested maximum must be no lesser than minimum.
+    if (RequestedWavesPerEU.second &&
+        RequestedWavesPerEU.first > RequestedWavesPerEU.second)
+      return Default;
+  }
+  // Requested maximum must not violate subtarget's specifications.
   if (RequestedWavesPerEU.second &&
-      RequestedWavesPerEU.first > RequestedWavesPerEU.second)
-    return Default;
-
-  // Make sure requested values do not violate subtarget's specifications and
-  // are compatible with values implied by minimum/maximum flat workgroup sizes.
-  if (RequestedWavesPerEU.first < Default.first ||
       RequestedWavesPerEU.second > Default.second)
     return Default;
 
+  // Replace unspecified bounds in the request with the default bounds.
+  if (!RequestedWavesPerEU.first)
+    RequestedWavesPerEU.first = Default.first;
+  if (!RequestedWavesPerEU.second)
+    RequestedWavesPerEU.second = Default.second;
   return RequestedWavesPerEU;
 }
 
@@ -220,7 +228,7 @@ std::pair<unsigned, unsigned>
 AMDGPUSubtarget::getWavesPerEU(std::pair<unsigned, unsigned> FlatWorkGroupSizes,
                                unsigned LDSBytes, const Function &F) const {
   // Default minimum/maximum number of waves per execution unit.
-  std::pair<unsigned, unsigned> Default(1, getMaxWavesPerEU());
+  std::pair<unsigned, unsigned> Default(0, 0);
 
   // Requested minimum/maximum number of waves per execution unit.
   std::pair<unsigned, unsigned> Requested =
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 91fe2a69bc0b7..1cbb6a7b1ad43 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -119,7 +119,9 @@ class AMDGPUSubtarget {
   /// Returns the target minimum/maximum number of waves per EU. This is based
   /// on the minimum/maximum number of \p RequestedWavesPerEU and further
   /// limited by the maximum achievable occupancy derived from the range of \p
-  /// FlatWorkGroupSizes and number of \p LDSBytes per workgroup.
+  /// FlatWorkGroupSizes and number of \p LDSBytes per workgroup. A
+  /// minimum/maximum requested waves/EU value of 0 indicates an intent to not
+  /// restrict the corresponding bound.
   std::pair<unsigned, unsigned>
   getEffectiveWavesPerEU(std::pair<unsigned, unsigned> RequestedWavesPerEU,
                          std::pair<unsigned, unsigned> FlatWorkGroupSizes,
diff --git a/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll b/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll
index 4507fd5865989..d8827d0405295 100644
--- a/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll
+++ b/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll
@@ -200,3 +200,41 @@ entry:
   ret void
 }
 attributes #10 = {"amdgpu-flat-work-group-size"="256,256" "amdgpu-waves-per-eu"="2,2"}
+
+; At most 2 waves per execution unit.
+; CHECK-LABEL: {{^}}empty_at_most_2:
+; CHECK: SGPRBlocks: 12
+; CHECK: VGPRBlocks: 21
+; CHECK: NumSGPRsForWavesPerEU: 102
+; CHECK: NumVGPRsForWavesPerEU: 85
+define amdgpu_kernel void @empty_at_most_2() #11 {
+entry:
+  ret void
+}
+attributes #11 = {"amdgpu-waves-per-eu"="0,2"}
+
+; Exactly 1024 workitems (limits occupancy to 8) and at least 5 waves per execution unit.
+; "amdgpu-waves-per-eu"="5,0" should have the same effect as "amdgpu-waves-per-eu"="5".
+; CHECK-LABEL: {{^}}empty_workitems_exactly_1024_waves_at_least_5:
+; CHECK: SGPRBlocks: 8
+; CHECK: VGPRBlocks: 7
+; CHECK: NumSGPRsForWavesPerEU: 65
+; CHECK: NumVGPRsForWavesPerEU: 29
+define amdgpu_kernel void @empty_workitems_exactly_1024_waves_at_least_5() #12 {
+entry:
+  ret void
+}
+attributes #12 = {"amdgpu-waves-per-eu"="5,0" "amdgpu-flat-work-group-size"="1024,1024"}
+
+; Unrestricted number of waves per execution unit.
+; "amdgpu-waves-per-eu"="0,0" should have the same effect as not providing the attribute.
+; CHECK-LABEL: {{^}}empty_default_waves:
+; CHECK: SGPRBlocks: 0
+; CHECK: VGPRBlocks: 0
+; CHECK: NumSGPRsForWavesPerEU: 1
+; CHECK: NumVGPRsForWavesPerEU: 1
+define amdgpu_kernel void @empty_default_waves() #13 {
+entry:
+  ret void
+}
+attributes #13 = {"amdgpu-waves-per-eu"="0,0"}
diff --git a/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll b/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll
index ae114f3213d8f..967cc764ea19c 100644
--- a/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll
+++ b/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll
@@ -1,7 +1,7 @@
 ; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-attributes --check-globals --version 2
 ; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-attributor %s | FileCheck %s
 
-; Check propagation of amdgpu-flat-work-group-size attribute.
+; Check propagation of amdgpu-waves-per-eu attribute.
 
 ; Called from a single kernel with 1,8
 define internal void @default_to_1_8_a() {
@@ -216,30 +216,30 @@ define internal i32 @bitcasted_function() {
   ret i32 0
 }
 
-define internal void @called_from_invalid_bounds_0() {
-; CHECK-LABEL: define internal void @called_from_invalid_bounds_0
-; CHECK-SAME: () #[[ATTR10:[0-9]+]] {
+define internal void @called_without_min_waves() {
+; CHECK-LABEL: define internal void @called_without_min_waves
+; CHECK-SAME: () #[[ATTR0]] {
 ; CHECK-NEXT:    ret void
 ;
   ret void
 }
 
-define internal void @called_from_invalid_bounds_1() {
-; CHECK-LABEL: define internal void @called_from_invalid_bounds_1
-; CHECK-SAME: () #[[ATTR10]] {
+define internal void @called_from_invalid_bounds() {
+; CHECK-LABEL: define internal void @called_from_invalid_bounds
+; CHECK-SAME: () #[[ATTR10:[0-9]+]] {
 ; CHECK-NEXT:    ret void
 ;
   ret void
 }
 
-; Invalid range for amdgpu-waves-per-eu
-define amdgpu_kernel void @kernel_invalid_bounds_0_8() #9 {
-; CHECK-LABEL: define amdgpu_kernel void @kernel_invalid_bounds_0_8
+; amdgpu-waves-per-eu range only provides a maximum.
+define amdgpu_kernel void @kernel_0_8() #9 {
+; CHECK-LABEL: define amdgpu_kernel void @kernel_0_8
 ; CHECK-SAME: () #[[ATTR0]] {
-; CHECK-NEXT:    call void @called_from_invalid_bounds_0()
+; CHECK-NEXT:    call void @called_without_min_waves()
 ; CHECK-NEXT:    ret void
 ;
-  call void @called_from_invalid_bounds_0()
+  call void @called_without_min_waves()
   ret void
 }
 
@@ -247,10 +247,10 @@ define amdgpu_kernel void @kernel_invalid_bounds_0_8() #9 {
 define amdgpu_kernel void @kernel_invalid_bounds_1_123() #10 {
 ; CHECK-LABEL: define amdgpu_kernel void @kernel_invalid_bounds_1_123
 ; CHECK-SAME: () #[[ATTR11:[0-9]+]] {
-; CHECK-NEXT:    call void @called_from_invalid_bounds_1()
+; CHECK-NEXT:    call void @called_from_invalid_bounds()
 ; CHECK-NEXT:    ret void
 ;
-  call void @called_from_invalid_bounds_1()
+  call void @called_from_invalid_bounds()
   ret void
 }
 

Copy link

github-actions bot commented May 2, 2025

✅ With the latest revision this PR passed the C/C++ code formatter.

@shiltian
Copy link
Contributor

shiltian commented May 2, 2025

Can we instead disallow using just one value? I find it super annoying to have something like std::optional<int, std::optional<int>> in the parser code.

@lucas-rami
Copy link
Contributor Author

Can we instead disallow using just one value?

That is orthogonal to allowing 0 as min/max right? One may still want to specify only a minimum or a maximum, so we need a way to say "don't care" for the other bound. I could disallow specifying only one value in a follow-up PR if there is agreement there, though I am not sure of what additional precaution to take for changes that will make some existing user code fail to compile.

@shiltian
Copy link
Contributor

shiltian commented May 2, 2025

That is orthogonal to allowing 0 as min/max right?

That's right.

One may still want to specify only a minimum or a maximum, so we need a way to say "don't care" for the other bound.

At the moment only the upper bound can be skipped. 0 is a good value, as what you are doing right here.

@arsenm
Copy link
Contributor

arsenm commented May 2, 2025

The clang behavior doesn't necessarily map 1:1 to the IR attribute behavior. Isn't there some processing of the input values in clang?

@lucas-rami
Copy link
Contributor Author

Isn't there some processing of the input values in clang?

Not sure where to look, but the following snippet supports 0 as min/max and writes them without processing. It's only the subtarget methods that seem to consider them as invalid.

void CodeGenModule::handleAMDGPUWavesPerEUAttr(
llvm::Function *F, const AMDGPUWavesPerEUAttr *Attr) {
unsigned Min =
Attr->getMin()->EvaluateKnownConstInt(getContext()).getExtValue();
unsigned Max =
Attr->getMax()
? Attr->getMax()->EvaluateKnownConstInt(getContext()).getExtValue()
: 0;
if (Min != 0) {
assert((Max == 0 || Min <= Max) && "Min must be less than or equal Max");
std::string AttrVal = llvm::utostr(Min);
if (Max != 0)
AttrVal = AttrVal + "," + llvm::utostr(Max);
F->addFnAttr("amdgpu-waves-per-eu", AttrVal);
} else
assert(Max == 0 && "Max must be zero");
}

@arsenm
Copy link
Contributor

arsenm commented May 2, 2025

It doesn't add the attribute if it's 0?

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a reason to support this on the IR level. We should just make it a verifier error to use 0. Supporting 0 is only a shortcut to the default behavior without knowing the target limits in the source

@llvmbot llvmbot added clang Clang issues not falling into any other category clang:frontend Language frontend issues, e.g. anything involving "Sema" clang:codegen IR generation bugs: mangling, exceptions, etc. llvm:ir labels May 8, 2025
@lucas-rami
Copy link
Contributor Author

It doesn't add the attribute if it's 0?

Indeed I missed this, sorry.

I don't see a reason to support this on the IR level. We should just make it a verifier error to use 0.

After further investigation and from what I can understand sema already rejects a minimum of 0 if the maximum is not 0 i.e., it will only accept amdgpu_waves_per_eu(0, 0) or amdgpu_waves_per_eu(N) or amdgpu_waves_per_eu(N, M) where N>0 and M>=N.

Supporting 0 is only a shortcut to the default behavior without knowing the target limits in the source

I see your point. With the current way the attribute affects the range of waves/EU returned by the subtarget method a user may set the range to [1,N] to only limit the maximum number of waves/EU, so accepting 0 is indeed useless as-is. I think it only makes sense with what I intended to be the followup change to this, which I integrated here now (PR description updated to reflect the change). Hopefully letting the minimum be 0 makes more sense now.

@lucas-rami lucas-rami changed the title [AMDGPU] Allow 0 as min./max. of amdgpu-waves-per-eu [AMDGPU][Clang] Allow amdgpu-waves-per-eu attribute to lower target occupancy range May 8, 2025
@lucas-rami lucas-rami requested a review from jrbyrnes May 8, 2025 00:07
@lucas-rami lucas-rami force-pushed the waves-per-eu-zero-is-allowed branch from df06f67 to 9906334 Compare May 22, 2025 13:27
@@ -279,7 +268,7 @@ define amdgpu_kernel void @kernel_3_6() #12 {
; 3,6 -> 6,9
define internal void @refine_upper_func_3_6() #13 {
; CHECK-LABEL: define internal void @refine_upper_func_3_6
; CHECK-SAME: () #[[ATTR9]] {
; CHECK-SAME: () #[[ATTR14:[0-9]+]] {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the last "real" change in the file. The waves/EU range for this function goes from [4,10] to [3,6]. All attribute changes below are just renames.

@lucas-rami lucas-rami requested a review from srpande May 27, 2025 16:30
@lucas-rami lucas-rami force-pushed the waves-per-eu-zero-is-allowed branch from 580f9a9 to 1bb023b Compare June 26, 2025 13:22
@lucas-rami
Copy link
Contributor Author

ping

@shiltian
Copy link
Contributor

The amdgpu-waves-per-eu attribute currently does not allow users to lower the target occupancy range that the AMDGPU backend will try to achieve. For example, for a kernel targeting gfx942 with default flat workgroup sizes and no LDS usage, the AMDGPUSubtarget::getWavesPerEU method will by default produce the range [4,10].

IIUC that is because the flat workgroup size. Waves per EU must yield to the value computed from flat workgroup size, and if it is absent, we must assume it can be 1024.

@lucas-rami
Copy link
Contributor Author

IIUC that is because the flat workgroup size. Waves per EU must yield to the value computed from flat workgroup size, and if it is absent, we must assume it can be 1024.

AFAIU the intent of the existing implementation is that the default minimum waves/EU is set so that all the waves of a workgroup of maximum size can fit concurrently on a single CU. I am not proposing we change that, what I would like the "amdgpu-waves-per-eu" attribute to do is be able to lower than minimum at the user's request in cases where higher occupancies are not thought to be beneficial.

Comment on lines 2522 to 2544

if (TT.isAMDGPU()) {
if (auto A = Attrs.getFnAttr("amdgpu-waves-per-eu"); A.isValid()) {
std::pair<StringRef, StringRef> Strs = A.getValueAsString().split(',');
unsigned Min = 0;
StringRef MinStr = Strs.first.trim();
Check(!MinStr.getAsInteger(0, Min),
"minimum for 'amdgpu-waves-per-eu' must be integer: " + MinStr);
if (!Strs.second.empty()) {
unsigned Max = 0;
StringRef MaxStr = Strs.second.trim();
Check(!MaxStr.getAsInteger(0, Max),
"maximum for 'amdgpu-waves-per-eu' must be integer: " + MaxStr);
Check(Max, "maximum for 'amdgpu-waves-per-eu' must be non-zero");
Check(Min <= Max, "minimum must be less than or equal to maximum for "
"'amdgpu-waves-per-eu': " +
MinStr + " > " + MaxStr);
} else {
Check(Min, "minimum for 'amdgpu-waves-per-eu' must be non-zero when "
"maximum is not provided");
}
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should do this in separate patch. We could also drop some of the error messages we emit in the backend on parse failures of these

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted this for now and will prepare a patch to go on top of this one.

Comment on lines 5 to 7
define void @valid_amdgpu_waves_per_eu_range() "amdgpu-waves-per-eu"="2,4" {
ret void
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid cases are tested in test/Assembler or just implicitly from their other uses

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will keep that in mind, thanks.

@@ -46,7 +46,6 @@ __attribute__((amdgpu_num_sgpr(4294967296))) kernel void kernel_num_sgpr_L() {}
__attribute__((amdgpu_num_vgpr(4294967296))) kernel void kernel_num_vgpr_L() {} // expected-error {{integer constant expression evaluates to value 4294967296 that cannot be represented in a 32-bit unsigned integer type}}

__attribute__((amdgpu_flat_work_group_size(0, 64))) kernel void kernel_flat_work_group_size_0_64() {} // expected-error {{'amdgpu_flat_work_group_size' attribute argument is invalid: max must be 0 since min is 0}}
__attribute__((amdgpu_waves_per_eu(0, 4))) kernel void kernel_waves_per_eu_0_4() {} // expected-error {{'amdgpu_waves_per_eu' attribute argument is invalid: max must be 0 since min is 0}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing clang codegen test changes that show the new accepted values. This is still not emitting minimums of 0 though, so this is just losing a test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[0,4] is now a valid range (i.e., no minimum requested, at most 4) so I moved it below instead of deleting it. I also added some HIP codegen tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend:AMDGPU clang:codegen IR generation bugs: mangling, exceptions, etc. clang:frontend Language frontend issues, e.g. anything involving "Sema" clang Clang issues not falling into any other category llvm:ir
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants