-
Notifications
You must be signed in to change notification settings - Fork 14.4k
[AMDGPU][Clang] Allow amdgpu-waves-per-eu attribute to lower target occupancy range #138284
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@llvm/pr-subscribers-llvm-ir @llvm/pr-subscribers-backend-amdgpu Author: Lucas Ramirez (lucas-rami) ChangesClang's attribute reference specifies, for the "amdgpu-waves-per-eu" attribute, that "Passing 0, 0 as <min>, <max> implies the default behavior". However, the backend currently treats a minimum or maximum of 0 as an invalid bound that makes the whole attribute invalid/ignored. This makes the backend treat a 0 in the range as per the documentation. In particular, this allows a user to specify a maximum desired number of waves/EU without specifying a minimum (e.g., "amdgpu-waves-per-eu"="0,4"), which is not currently feasible. The following equivalences hold (
Full diff: https://github.com/llvm/llvm-project/pull/138284.diff 5 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
index b9ce8dc0c5cdb..0bbbe766968fc 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUAttributor.cpp
@@ -1125,8 +1125,7 @@ struct AAAMDWavesPerEU : public AAAMDSizeRangeAttribute {
indicateOptimisticFixpoint();
};
- std::pair<unsigned, unsigned> MaxWavesPerEURange{
- 1U, InfoCache.getMaxWavesPerEU(*F)};
+ std::pair<unsigned, unsigned> MaxWavesPerEURange{0, 0};
// If the attribute exists, we will honor it if it is not the default.
if (auto Attr = InfoCache.getWavesPerEUAttr(*F)) {
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
index 563605f964cc6..4212d97eb9404 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.cpp
@@ -191,17 +191,25 @@ std::pair<unsigned, unsigned> AMDGPUSubtarget::getEffectiveWavesPerEU(
getOccupancyWithWorkGroupSizes(LDSBytes, FlatWorkGroupSizes).second};
Default.first = std::min(Default.first, Default.second);
- // Make sure requested minimum is less than requested maximum.
+ if (RequestedWavesPerEU.first) {
+ // Requested minimum must not violate subtarget's specifications.
+ if (RequestedWavesPerEU.first < Default.first)
+ return Default;
+ // Requested maximum must be no lesser than minimum.
+ if (RequestedWavesPerEU.second &&
+ RequestedWavesPerEU.first > RequestedWavesPerEU.second)
+ return Default;
+ }
+ // Requested maximum must not violate subtarget's specifications.
if (RequestedWavesPerEU.second &&
- RequestedWavesPerEU.first > RequestedWavesPerEU.second)
- return Default;
-
- // Make sure requested values do not violate subtarget's specifications and
- // are compatible with values implied by minimum/maximum flat workgroup sizes.
- if (RequestedWavesPerEU.first < Default.first ||
RequestedWavesPerEU.second > Default.second)
return Default;
+ // Replace unspecified bounds in the request with the default bounds.
+ if (!RequestedWavesPerEU.first)
+ RequestedWavesPerEU.first = Default.first;
+ if (!RequestedWavesPerEU.second)
+ RequestedWavesPerEU.second = Default.second;
return RequestedWavesPerEU;
}
@@ -220,7 +228,7 @@ std::pair<unsigned, unsigned>
AMDGPUSubtarget::getWavesPerEU(std::pair<unsigned, unsigned> FlatWorkGroupSizes,
unsigned LDSBytes, const Function &F) const {
// Default minimum/maximum number of waves per execution unit.
- std::pair<unsigned, unsigned> Default(1, getMaxWavesPerEU());
+ std::pair<unsigned, unsigned> Default(0, 0);
// Requested minimum/maximum number of waves per execution unit.
std::pair<unsigned, unsigned> Requested =
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
index 91fe2a69bc0b7..1cbb6a7b1ad43 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
+++ b/llvm/lib/Target/AMDGPU/AMDGPUSubtarget.h
@@ -119,7 +119,9 @@ class AMDGPUSubtarget {
/// Returns the target minimum/maximum number of waves per EU. This is based
/// on the minimum/maximum number of \p RequestedWavesPerEU and further
/// limited by the maximum achievable occupancy derived from the range of \p
- /// FlatWorkGroupSizes and number of \p LDSBytes per workgroup.
+ /// FlatWorkGroupSizes and number of \p LDSBytes per workgroup. A
+ /// minimum/maximum requested waves/EU value of 0 indicates an intent to not
+ /// restrict the corresponding bound.
std::pair<unsigned, unsigned>
getEffectiveWavesPerEU(std::pair<unsigned, unsigned> RequestedWavesPerEU,
std::pair<unsigned, unsigned> FlatWorkGroupSizes,
diff --git a/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll b/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll
index 4507fd5865989..d8827d0405295 100644
--- a/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll
+++ b/llvm/test/CodeGen/AMDGPU/attr-amdgpu-waves-per-eu.ll
@@ -200,3 +200,41 @@ entry:
ret void
}
attributes #10 = {"amdgpu-flat-work-group-size"="256,256" "amdgpu-waves-per-eu"="2,2"}
+
+; At most 2 waves per execution unit.
+; CHECK-LABEL: {{^}}empty_at_most_2:
+; CHECK: SGPRBlocks: 12
+; CHECK: VGPRBlocks: 21
+; CHECK: NumSGPRsForWavesPerEU: 102
+; CHECK: NumVGPRsForWavesPerEU: 85
+define amdgpu_kernel void @empty_at_most_2() #11 {
+entry:
+ ret void
+}
+attributes #11 = {"amdgpu-waves-per-eu"="0,2"}
+
+; Exactly 1024 workitems (limits occupancy to 8) and at least 5 waves per execution unit.
+; "amdgpu-waves-per-eu"="5,0" should have the same effect as "amdgpu-waves-per-eu"="5".
+; CHECK-LABEL: {{^}}empty_workitems_exactly_1024_waves_at_least_5:
+; CHECK: SGPRBlocks: 8
+; CHECK: VGPRBlocks: 7
+; CHECK: NumSGPRsForWavesPerEU: 65
+; CHECK: NumVGPRsForWavesPerEU: 29
+define amdgpu_kernel void @empty_workitems_exactly_1024_waves_at_least_5() #12 {
+entry:
+ ret void
+}
+attributes #12 = {"amdgpu-waves-per-eu"="5,0" "amdgpu-flat-work-group-size"="1024,1024"}
+
+; Unrestricted number of waves per execution unit.
+; "amdgpu-waves-per-eu"="0,0" should have the same effect as not providing the attribute.
+; CHECK-LABEL: {{^}}empty_default_waves:
+; CHECK: SGPRBlocks: 0
+; CHECK: VGPRBlocks: 0
+; CHECK: NumSGPRsForWavesPerEU: 1
+; CHECK: NumVGPRsForWavesPerEU: 1
+define amdgpu_kernel void @empty_default_waves() #13 {
+entry:
+ ret void
+}
+attributes #13 = {"amdgpu-waves-per-eu"="0,0"}
diff --git a/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll b/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll
index ae114f3213d8f..967cc764ea19c 100644
--- a/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll
+++ b/llvm/test/CodeGen/AMDGPU/propagate-waves-per-eu.ll
@@ -1,7 +1,7 @@
; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --check-attributes --check-globals --version 2
; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -passes=amdgpu-attributor %s | FileCheck %s
-; Check propagation of amdgpu-flat-work-group-size attribute.
+; Check propagation of amdgpu-waves-per-eu attribute.
; Called from a single kernel with 1,8
define internal void @default_to_1_8_a() {
@@ -216,30 +216,30 @@ define internal i32 @bitcasted_function() {
ret i32 0
}
-define internal void @called_from_invalid_bounds_0() {
-; CHECK-LABEL: define internal void @called_from_invalid_bounds_0
-; CHECK-SAME: () #[[ATTR10:[0-9]+]] {
+define internal void @called_without_min_waves() {
+; CHECK-LABEL: define internal void @called_without_min_waves
+; CHECK-SAME: () #[[ATTR0]] {
; CHECK-NEXT: ret void
;
ret void
}
-define internal void @called_from_invalid_bounds_1() {
-; CHECK-LABEL: define internal void @called_from_invalid_bounds_1
-; CHECK-SAME: () #[[ATTR10]] {
+define internal void @called_from_invalid_bounds() {
+; CHECK-LABEL: define internal void @called_from_invalid_bounds
+; CHECK-SAME: () #[[ATTR10:[0-9]+]] {
; CHECK-NEXT: ret void
;
ret void
}
-; Invalid range for amdgpu-waves-per-eu
-define amdgpu_kernel void @kernel_invalid_bounds_0_8() #9 {
-; CHECK-LABEL: define amdgpu_kernel void @kernel_invalid_bounds_0_8
+; amdgpu-waves-per-eu range only provides a maximum.
+define amdgpu_kernel void @kernel_0_8() #9 {
+; CHECK-LABEL: define amdgpu_kernel void @kernel_0_8
; CHECK-SAME: () #[[ATTR0]] {
-; CHECK-NEXT: call void @called_from_invalid_bounds_0()
+; CHECK-NEXT: call void @called_without_min_waves()
; CHECK-NEXT: ret void
;
- call void @called_from_invalid_bounds_0()
+ call void @called_without_min_waves()
ret void
}
@@ -247,10 +247,10 @@ define amdgpu_kernel void @kernel_invalid_bounds_0_8() #9 {
define amdgpu_kernel void @kernel_invalid_bounds_1_123() #10 {
; CHECK-LABEL: define amdgpu_kernel void @kernel_invalid_bounds_1_123
; CHECK-SAME: () #[[ATTR11:[0-9]+]] {
-; CHECK-NEXT: call void @called_from_invalid_bounds_1()
+; CHECK-NEXT: call void @called_from_invalid_bounds()
; CHECK-NEXT: ret void
;
- call void @called_from_invalid_bounds_1()
+ call void @called_from_invalid_bounds()
ret void
}
|
✅ With the latest revision this PR passed the C/C++ code formatter. |
Can we instead disallow using just one value? I find it super annoying to have something like |
That is orthogonal to allowing 0 as min/max right? One may still want to specify only a minimum or a maximum, so we need a way to say "don't care" for the other bound. I could disallow specifying only one value in a follow-up PR if there is agreement there, though I am not sure of what additional precaution to take for changes that will make some existing user code fail to compile. |
That's right.
At the moment only the upper bound can be skipped. |
The clang behavior doesn't necessarily map 1:1 to the IR attribute behavior. Isn't there some processing of the input values in clang? |
Not sure where to look, but the following snippet supports 0 as min/max and writes them without processing. It's only the subtarget methods that seem to consider them as invalid. llvm-project/clang/lib/CodeGen/Targets/AMDGPU.cpp Lines 741 to 759 in abd2c07
|
It doesn't add the attribute if it's 0? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a reason to support this on the IR level. We should just make it a verifier error to use 0. Supporting 0 is only a shortcut to the default behavior without knowing the target limits in the source
Indeed I missed this, sorry.
After further investigation and from what I can understand sema already rejects a minimum of 0 if the maximum is not 0 i.e., it will only accept amdgpu_waves_per_eu(0, 0) or amdgpu_waves_per_eu(N) or amdgpu_waves_per_eu(N, M) where N>0 and M>=N.
I see your point. With the current way the attribute affects the range of waves/EU returned by the subtarget method a user may set the range to [1,N] to only limit the maximum number of waves/EU, so accepting 0 is indeed useless as-is. I think it only makes sense with what I intended to be the followup change to this, which I integrated here now (PR description updated to reflect the change). Hopefully letting the minimum be 0 makes more sense now. |
df06f67
to
9906334
Compare
@@ -279,7 +268,7 @@ define amdgpu_kernel void @kernel_3_6() #12 { | |||
; 3,6 -> 6,9 | |||
define internal void @refine_upper_func_3_6() #13 { | |||
; CHECK-LABEL: define internal void @refine_upper_func_3_6 | |||
; CHECK-SAME: () #[[ATTR9]] { | |||
; CHECK-SAME: () #[[ATTR14:[0-9]+]] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the last "real" change in the file. The waves/EU range for this function goes from [4,10] to [3,6]. All attribute changes below are just renames.
580f9a9
to
1bb023b
Compare
ping |
IIUC that is because the flat workgroup size. Waves per EU must yield to the value computed from flat workgroup size, and if it is absent, we must assume it can be 1024. |
AFAIU the intent of the existing implementation is that the default minimum waves/EU is set so that all the waves of a workgroup of maximum size can fit concurrently on a single CU. I am not proposing we change that, what I would like the "amdgpu-waves-per-eu" attribute to do is be able to lower than minimum at the user's request in cases where higher occupancies are not thought to be beneficial. |
llvm/lib/IR/Verifier.cpp
Outdated
|
||
if (TT.isAMDGPU()) { | ||
if (auto A = Attrs.getFnAttr("amdgpu-waves-per-eu"); A.isValid()) { | ||
std::pair<StringRef, StringRef> Strs = A.getValueAsString().split(','); | ||
unsigned Min = 0; | ||
StringRef MinStr = Strs.first.trim(); | ||
Check(!MinStr.getAsInteger(0, Min), | ||
"minimum for 'amdgpu-waves-per-eu' must be integer: " + MinStr); | ||
if (!Strs.second.empty()) { | ||
unsigned Max = 0; | ||
StringRef MaxStr = Strs.second.trim(); | ||
Check(!MaxStr.getAsInteger(0, Max), | ||
"maximum for 'amdgpu-waves-per-eu' must be integer: " + MaxStr); | ||
Check(Max, "maximum for 'amdgpu-waves-per-eu' must be non-zero"); | ||
Check(Min <= Max, "minimum must be less than or equal to maximum for " | ||
"'amdgpu-waves-per-eu': " + | ||
MinStr + " > " + MaxStr); | ||
} else { | ||
Check(Min, "minimum for 'amdgpu-waves-per-eu' must be non-zero when " | ||
"maximum is not provided"); | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should do this in separate patch. We could also drop some of the error messages we emit in the backend on parse failures of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted this for now and will prepare a patch to go on top of this one.
define void @valid_amdgpu_waves_per_eu_range() "amdgpu-waves-per-eu"="2,4" { | ||
ret void | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
valid cases are tested in test/Assembler or just implicitly from their other uses
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will keep that in mind, thanks.
@@ -46,7 +46,6 @@ __attribute__((amdgpu_num_sgpr(4294967296))) kernel void kernel_num_sgpr_L() {} | |||
__attribute__((amdgpu_num_vgpr(4294967296))) kernel void kernel_num_vgpr_L() {} // expected-error {{integer constant expression evaluates to value 4294967296 that cannot be represented in a 32-bit unsigned integer type}} | |||
|
|||
__attribute__((amdgpu_flat_work_group_size(0, 64))) kernel void kernel_flat_work_group_size_0_64() {} // expected-error {{'amdgpu_flat_work_group_size' attribute argument is invalid: max must be 0 since min is 0}} | |||
__attribute__((amdgpu_waves_per_eu(0, 4))) kernel void kernel_waves_per_eu_0_4() {} // expected-error {{'amdgpu_waves_per_eu' attribute argument is invalid: max must be 0 since min is 0}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing clang codegen test changes that show the new accepted values. This is still not emitting minimums of 0 though, so this is just losing a test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[0,4] is now a valid range (i.e., no minimum requested, at most 4) so I moved it below instead of deleting it. I also added some HIP codegen tests.
The$4 \leq M \leq N \leq 10$ , in which case the [M,N] range will be produced instead. Advanced developers may in some cases know that a specifc kernel would not benefit from higher occupancies and wish to communicate that to the backend. It in turns could make better codegen decisions if it knows that increasing occupancy is not a priority.
amdgpu-waves-per-eu
attribute currently does not allow users to lower the target occupancy range that the AMDGPU backend will try to achieve. For example, for a kernel targeting gfx942 with default flat workgroup sizes and no LDS usage, theAMDGPUSubtarget::getWavesPerEU
method will by default produce the range [4,10]. Setting"amdgpu-waves-per-eu=M,N"
(N being optional) will only have an effect ifThis modifies the interpretation of the$1 \leq M \leq N \leq 10$ .
amdgpu-waves-per-eu
attribute to enable this behavior. User-provided minimum/maximum number of waves/EU are now able to change the default waves/EU range almost arbitrarily, with only subtarget's specifications and the maximum occupancy induced by workgroup size and LDS usage limiting the target occupancy range. In the previous example, the target range will be [M,N] as long asTo allow users to specify only a maximum desired number of waves per EU, the attribute now accepts a minimum value of 0 and understands it as a wish to not restrict the mininum number of waves per EU (a minimum value of 1 indicates instead that the user is fine if the final occupancy is 1, hence it cannot be used as the default). The Clang attribute now additionally accepts ranges of the form [0,M] to support the latter use case.