[mlir][gpu] Skip address space checks for memrefs between launchFuncOp and kernel func #102925
Conversation
@llvm/pr-subscribers-mlir-gpu @llvm/pr-subscribers-mlir

Author: Petr Kurapov (kurapov-peter)

Changes

Hi! I've been trying to use the new signature lowering (#101664 and #102621) with gpu kernel outlining, gpu binary generation, and the OpenCL runtime. The gpu-to-llvm-spv pass can handle memrefs with address space attributes; however, when lowering an arbitrary function (say, a function that accepts a tensor and then runs bufferization) and then using its argument as a kernel argument, the address space is lacking. Consider the following example function that accepts some device-host shared memory:

func.func @foo(%mem : memref<5xf32>) {
  gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32>)
  return
}
gpu.module @gpu_kernels {
  gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
    gpu.return
  }
}

The correct address space for a kernel argument is 1 for global (OpenCL's requirement), but it doesn't make any sense on the host side, for which 0 is the right one (say, we rely on some runtime mechanism to deliver the data to the device). The two don't match, and validation fails on the type check even though the code is still valid.

The easiest workaround we discussed with @victor-eds is to allow this discrepancy on the validation side. It can be even more specific and check the target to ensure this is the right case.

Full diff: https://github.com/llvm/llvm-project/pull/102925.diff

2 Files Affected:
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index a1f87a637a6141..8c3391c8d92936 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -401,8 +401,29 @@ LogicalResult GPUDialect::verifyOperationAttribute(Operation *op,
<< expectedNumArguments;
auto functionType = kernelGPUFunction.getFunctionType();
+ auto typesMatch = [&](Type launchOpArgType, Type gpuFuncArgType) {
+ auto launchOpMemref = dyn_cast<MemRefType>(launchOpArgType);
+ auto kernelMemref = dyn_cast<MemRefType>(gpuFuncArgType);
+ // Allow address space incompatibility for OpenCL kernels: `gpu.launch`'s
+ // argument memref without address space attribute will match a kernel
+ // function's memref argument with address space `Global`.
+ if (launchOpMemref && kernelMemref) {
+ auto launchAS = llvm::dyn_cast_or_null<gpu::AddressSpaceAttr>(
+ launchOpMemref.getMemorySpace());
+ auto kernelAS = llvm::dyn_cast_or_null<gpu::AddressSpaceAttr>(
+ kernelMemref.getMemorySpace());
+ if (!launchAS && kernelAS &&
+ kernelAS.getValue() == gpu::AddressSpace::Global)
+ return launchOpMemref.getShape() == kernelMemref.getShape() &&
+ launchOpMemref.getLayout() == kernelMemref.getLayout() &&
+ launchOpMemref.getElementType() ==
+ kernelMemref.getElementType();
+ }
+ return launchOpArgType == gpuFuncArgType;
+ };
for (unsigned i = 0; i < expectedNumArguments; ++i) {
- if (launchOp.getKernelOperand(i).getType() != functionType.getInput(i)) {
+ if (!typesMatch(launchOp.getKernelOperand(i).getType(),
+ functionType.getInput(i))) {
return launchOp.emitOpError("type of function argument ")
<< i << " does not match";
}
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index ba7897f4e80cb5..fdfd9fcc8b1853 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -441,3 +441,17 @@ gpu.module @module_with_two_target [#nvvm.target, #rocdl.target<chip = "gfx90a">
gpu.module @module_with_offload_handler <#gpu.select_object<0>> [#nvvm.target] {
}
+
+// Check kernel memref args are valid even if the address space differs
+module attributes {gpu.container_module} {
+ func.func @foo(%mem : memref<5xf32>) {
+ %c0 = arith.constant 0 : i32
+ gpu.launch_func @gpu_kernels::@kernel blocks in (%c0, %c0, %c0) threads in (%c0, %c0, %c0) : i32 args(%mem : memref<5xf32>)
+ return
+ }
+ gpu.module @gpu_kernels {
+ gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
+ gpu.return
+ }
+ }
+}
Also, can we get a test for this in …?
LGTM. I'll add more reviewers
I'm blocking preemptively before discussing the changes, as I'm not that familiar with spv design.
Why don't you pass the correct address space from the beginning? For example:
func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
  gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
  return
}
gpu.module @gpu_kernels {
  gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
    gpu.return
  }
}
Also, you need to change the description and title; this is not LaunchOp, but LaunchFuncOp.
I tried to explain that in the description: basically, for memory that is shared between host and device (the runtime controls the necessary transfers), the host memref should have address space zero, like a regular memory buffer. It's not a gpu buffer, so the attribute doesn't make any sense.
Yup, fixed.
My issue here is that you're changing the semantics of the op for a very particular case. The semantics of … To me, the correct solution is having a new operation for your runtime.
Do you mean something like …?

There was also an idea to leave the memrefs intact (no attributes) and then attach the address spaces during the lowering to LLVM. The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.
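To make the ambiguity concrete, here is a minimal sketch (reusing the @gpu_kernels / @kernel names from the example above; this snippet is illustrative, not taken from the original comment) of a kernel with one annotated and one unannotated memref argument:

gpu.module @gpu_kernels {
  // If address spaces were only attached during lowering, it is unclear
  // whether %arg1 should also become #gpu.address_space<global> or stay
  // in the default (host) address space.
  gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>,
                   %arg1 : memref<5xf32>) kernel {
    gpu.return
  }
}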
Yeah, I thought maybe making it very specific to the use case would be tolerable, to keep all the other cases intact.
The issue is not unified memory. For example, CUDA and HIP also have it, but their models take the generic address space as the global one, hence no issues arise.
The address space is part of the type; making exceptions is not the best solution.

You know that you can attach attributes to individual arguments? One option is adding: …
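The snippet that followed was lost in extraction; purely as an illustration of the argument-attribute idea, here is a sketch where a made-up per-argument attribute records the device-side address space without changing the memref type (the name gpu.kernel_arg_address_space is hypothetical, not an existing GPU dialect attribute):

// Hypothetical sketch only: `gpu.kernel_arg_address_space` is an invented
// attribute name used to show where a per-argument annotation would live.
func.func @foo(%mem : memref<5xf32> {gpu.kernel_arg_address_space = #gpu.address_space<global>}) {
  return
}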
After taking a look at the MLIR Vulkan runner (also SPIR-V-based): it faces a similar issue, and its solution is to run a pass that sets the expected address spaces on both the host and the device side as part of the conversion pipeline. I haven't dived deep enough into this, but my first perception is that this works "by chance" (or by good planning in advance!). The address space they assume as the default (…)

This would be changed to: …

And, when lowered to LLVM dialect (pseudocode): …

It works fine because the … In our case, we'd need to use the … IMO, being explicit with the memory spaces is better than any other tricks we come up with to avoid this issue. I am not that familiar with other programming models and I don't know how CUDA/HIP pipelines handle this (EDIT: thanks @fabianmcg for clarifying!), but, as there are examples of address space mismatches between host and device in LLVM, I would reconsider the approach proposed by @kurapov-peter here.

EDIT: The generic address space in our case would be mapped to LLVM address space …
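The MLIR snippets from this comment were dropped during extraction; the following is only a rough sketch of the described before/after rewriting, with the global address space and the simplified launch syntax (grid/block sizes omitted for brevity) borrowed from the examples earlier in this thread rather than from the original snippets:

// Before the address-space assignment pass: the host-side memref carries no
// address space, so it cannot match the kernel's #gpu.address_space<global>.
func.func @foo(%mem : memref<5xf32>) {
  gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32>)
  return
}

// After the pass rewrites both the host and the kernel signatures to the same
// space, the argument types match and the verifier accepts the launch.
func.func @foo_rewritten(%mem : memref<5xf32, #gpu.address_space<global>>) {
  gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
  return
}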
Yeah... somebody decided …
Oh, interesting, thanks Victor! Isn't that weird though? I suppose it also relies on the implicit knowledge that on the host this would yield the correct address space just because the default one is zero?
Sure, I mean it is unclear whether the user's intention actually was to have one of the arguments without the attribute for whatever reason. I suspect this shouldn't be a problem in practice though, since you need …
That's interesting. Probably not in gpu too? Like, for all the devices on a platform?
Those mismatches actually have a …
What I'm saying is, if we add a 4th GPU address space: …
I'm inclined to just patch the signature in the pass. Sounds cleanest so far. Will give it some more thought.
I meant between host and device in LLVM IR. Anyway, I think we can just give this a thought and come back if we find further issues, as @kurapov-peter says.
Closing this as we have a cleaner solution in #105664.