
[mlir][gpu] Skip address space checks for memrefs between launchFuncOp and kernel func #102925

Closed

Conversation

kurapov-peter
Contributor

Hi! I've been trying to use the new signature lowering (#101664 and #102621) with GPU kernel outlining, GPU binary generation, and an OpenCL runtime. The gpu-to-llvm-spv pass can handle memrefs with address space attributes; however, when lowering an arbitrary function (say, a function that accepts a tensor and then goes through bufferization) and using its argument as a kernel argument, the address space is missing.

Consider the following example function that accepts some device-host shared memory:

  func.func @foo(%mem : memref<5xf32>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
      gpu.return
    }
  }

The correct address space for a kernel argument is 1, i.e. global (OpenCL's requirement), but it doesn't make any sense on the host side, for which 0 is the right one (say, we rely on some runtime mechanism to deliver the data to the device). The two don't match, and validation fails on the type check even though the code is still valid.

The easiest workaround, discussed with @victor-eds, is to allow this discrepancy on the validation side. It could be made even more specific by checking the target to ensure this is the right case.

@llvmbot
Member

llvmbot commented Aug 12, 2024

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Petr Kurapov (kurapov-peter)

Full diff: https://github.com/llvm/llvm-project/pull/102925.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+22-1)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+14)
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index a1f87a637a6141..8c3391c8d92936 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -401,8 +401,29 @@ LogicalResult GPUDialect::verifyOperationAttribute(Operation *op,
              << expectedNumArguments;
 
     auto functionType = kernelGPUFunction.getFunctionType();
+    auto typesMatch = [&](Type launchOpArgType, Type gpuFuncArgType) {
+      auto launchOpMemref = dyn_cast<MemRefType>(launchOpArgType);
+      auto kernelMemref = dyn_cast<MemRefType>(gpuFuncArgType);
+      // Allow address space incompatibility for OpenCL kernels: `gpu.launch`'s
+      // argument memref without address space attribute will match a kernel
+      // function's memref argument with address space `Global`.
+      if (launchOpMemref && kernelMemref) {
+        auto launchAS = llvm::dyn_cast_or_null<gpu::AddressSpaceAttr>(
+            launchOpMemref.getMemorySpace());
+        auto kernelAS = llvm::dyn_cast_or_null<gpu::AddressSpaceAttr>(
+            kernelMemref.getMemorySpace());
+        if (!launchAS && kernelAS &&
+            kernelAS.getValue() == gpu::AddressSpace::Global)
+          return launchOpMemref.getShape() == kernelMemref.getShape() &&
+                 launchOpMemref.getLayout() == kernelMemref.getLayout() &&
+                 launchOpMemref.getElementType() ==
+                     kernelMemref.getElementType();
+      }
+      return launchOpArgType == gpuFuncArgType;
+    };
     for (unsigned i = 0; i < expectedNumArguments; ++i) {
-      if (launchOp.getKernelOperand(i).getType() != functionType.getInput(i)) {
+      if (!typesMatch(launchOp.getKernelOperand(i).getType(),
+                      functionType.getInput(i))) {
         return launchOp.emitOpError("type of function argument ")
                << i << " does not match";
       }
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index ba7897f4e80cb5..fdfd9fcc8b1853 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -441,3 +441,17 @@ gpu.module @module_with_two_target [#nvvm.target, #rocdl.target<chip = "gfx90a">
 
 gpu.module @module_with_offload_handler <#gpu.select_object<0>> [#nvvm.target] {
 }
+
+// Check kernel memref args are valid even if the address space differs
+module attributes {gpu.container_module} {
+  func.func @foo(%mem : memref<5xf32>) {
+    %c0 = arith.constant 0 : i32
+    gpu.launch_func @gpu_kernels::@kernel blocks in (%c0, %c0, %c0) threads in (%c0, %c0, %c0) : i32 args(%mem : memref<5xf32>)
+    return
+  }
+  gpu.module @gpu_kernels {
+    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
+      gpu.return
+    }
+  }
+}

@victor-eds
Contributor

Also, can we get a test for this in mlir/test/Dialect/GPU/ops.mlir?

Contributor

@victor-eds victor-eds left a comment

LGTM. I'll add more reviewers

Contributor

@fabianmcg fabianmcg left a comment

I'm blocking preemptively before discussing the changes, as I'm not that familiar with the SPIR-V design.

Why don't you pass the correct address space from the beginning? For example:

  func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
      gpu.return
    }
  }

Also, you need to change the description and title: this is not LaunchOp, but LaunchFuncOp.

@kurapov-peter kurapov-peter changed the title Skip address space checks for memrefs between launchOp and kernel func Skip address space checks for memrefs between launchFuncOp and kernel func Aug 14, 2024
@kurapov-peter
Contributor Author

Why don't you pass the correct address space from the beginning? For example:

  func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
      gpu.return
    }
  }

I tried to explain that in the description: basically, for memory that is shared between host and device (the runtime controls the necessary transfers), the host memref should have address space zero, as a regular memory buffer. It's not a GPU buffer, so the attribute doesn't make any sense there.

Also, you need to change the description and title: this is not LaunchOp, but LaunchFuncOp.

Yup, fixed.

@fabianmcg
Contributor

for memory that is shared between host and device (the runtime controls the necessary transfers), the host memref should have address space zero, as a regular memory buffer. It's not a GPU buffer, so the attribute doesn't make any sense there.

My issue here is that you're changing the semantics of the op for a very particular case. The semantics of gpu.launch_func are similar to those of a call operation, where the signature is expected to be an exact match.

To me the correct solution is having a new operation for your runtime.

@kuhar kuhar changed the title Skip address space checks for memrefs between launchFuncOp and kernel func [mlir][gpu] Skip address space checks for memrefs between launchFuncOp and kernel func Aug 14, 2024
@kurapov-peter
Contributor Author

To me the correct solution is having a new operation for your runtime.

Do you mean something like gpu.launch_opencl_func? It's not specific to the OpenCL runtime though; SYCL has unified shared memory (see, for example, https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:memory.model) and would have the same problem. The semantics are the same too, except for the address space, so the operation would essentially be a copy of gpu.launch_func with the changed validation.

There was also an idea to leave the memrefs intact (no attributes) and then attach address spaces during the lowering to LLVM. The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.
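
For illustration, a minimal sketch of that ambiguous case (a hypothetical kernel, not from this patch):

  gpu.module @gpu_kernels {
    // %arg0 explicitly requests the global address space, while %arg1 carries no
    // memory space attribute. A pass that attaches address spaces automatically
    // cannot tell whether %arg1 was deliberately left without one.
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>,
                     %arg1 : memref<5xf32>) kernel {
      gpu.return
    }
  }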

My issue here is that you're changing the semantics of the op for a very particular case.

Yeah, I thought maybe making it very specific to the use case would be tolerable, to keep all the other cases intact.

@fabianmcg
Contributor

It's not specific to the OpenCL runtime though; SYCL has unified shared memory (see, for example, https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:memory.model) and would have the same problem.

The issue is not unified memory. For example, CUDA and HIP also have it, but their models take the generic address space as the global one, hence no issues arise.

The semantics are the same too, except for the address space, so the operation would essentially be a copy of gpu.launch_func with the changed validation.

The address space is part of the type; making exceptions is not the best solution.

The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.

You know that you can attach attributes to individual arguments?

One option is adding: #gpu.address_space<unified>.
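
A rough sketch of one way to read that, assuming a new #gpu.address_space<unified> attribute were added (it does not exist in the gpu dialect today):

  // Hypothetical IR: host and kernel use the same unified address space, so the
  // verifier sees an exact type match, and each target lowering decides what
  // unified maps to (e.g. the global/CrossWorkgroup space for SPIR-V-like models).
  func.func @foo(%mem : memref<5xf32, #gpu.address_space<unified>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<unified>>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<unified>>) kernel {
      gpu.return
    }
  }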

@victor-eds
Contributor

victor-eds commented Aug 14, 2024

I took a look at the MLIR Vulkan runner (also SPIR-V-based): it faces a similar issue, and its solution is to run a pass that sets the expected address spaces on both the host and the device side as part of the conversion pipeline. I haven't dug deep enough into this, but my first impression is that this works "by chance" (or by good planning in advance!).

The address space they assume for the default memory space (memref<f32>) is StorageBuffer. So, if they have:

gpu.func @kernel(%arg0: memref<f32>) ...
...
gpu.launch @kernel(%arg0: memref<f32>)

This would be changed to:

gpu.func @kernel(%arg0: memref<f32, #spirv.StorageClass<StorageBuffer>>) ...
...
gpu.launch @kernel(%arg0: memref<f32, #spirv.StorageClass<StorageBuffer>>)

And, when lowered to LLVM dialect (pseudocode):

llvm.func @kernel(%arg0: !llvm.ptr) ...
...
"launch_op" @kernel(%arg0: !llvm.ptr)

It works fine because the StorageBuffer storage class is converted to address space 0 in LLVM. On the host, these programming models (AFAIK) use the default (0) address space both for host-side allocations and for the device side (needed for this parameter passing). It happens to be the case that the device side uses LLVM address space 0 to represent this particular storage class.

In our case, we'd need to use the CrossWorkgroup storage class there, mapping to LLVM address space 1, and, well, we'd be using address space 1 on the host (something these programming models do not do at all).
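
For comparison, a sketch of the CrossWorkgroup variant in the same pseudocode notation as the snippets above (not taken from any existing pipeline):

// Hypothetical CrossWorkgroup analogue of the StorageBuffer example:
gpu.func @kernel(%arg0: memref<f32, #spirv.StorageClass<CrossWorkgroup>>) ...
...
gpu.launch @kernel(%arg0: memref<f32, #spirv.StorageClass<CrossWorkgroup>>)

// After lowering to the LLVM dialect, the device-side pointer would carry
// address space 1, while the host keeps the default address space 0:
llvm.func @kernel(%arg0: !llvm.ptr<1>) ...
...
"launch_op" @kernel(%arg0: !llvm.ptr)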

IMO, being explicit with the memory spaces is better than any other trick we come up with to avoid this issue. I am not that familiar with other programming models and don't know how CUDA/HIP pipelines handle this (EDIT: thanks @fabianmcg for clarifying!), but, as there are examples of address space mismatches between host and device in LLVM, I would reconsider the approach proposed by @kurapov-peter here.

EDIT: The generic address space in our case would be mapped to LLVM address space 4, so that's not an option either. Also, generic pointers as kernel arguments are not allowed in our environment.

@kurapov-peter
Contributor Author

The issue is not unified memory. For example, CUDA and HIP also have it, but their models take the generic address space as the global one, hence no issues arise.

Yeah... somebody decided 1 is a great default value, and now I'm suffering. One thought that comes to mind is whether this should be part of some kind of host-device calling convention (which is runtime-specific). That would make the validation strict by adhering to the convention rather than relying on the default values.

I took a look at the MLIR Vulkan runner (also SPIR-V-based): it faces a similar issue, and its solution is to run a pass that sets the expected address spaces on both the host and the device side as part of the conversion pipeline.

Oh, interesting, thanks Victor! Isn't that weird though? I suppose it also relies on the implicit knowledge that, on the host, this would yield the correct address space just because the default one is zero?

@kurapov-peter
Contributor Author

The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.

You know that you can attach attributes to individual arguments?

Sure, I mean it is unclear whether the user's intention actually was to have one of the arguments without the attribute for whatever reason. I suspect this shouldn't be a problem in practice though, since you need addrspace(1) in any case.

One option is adding: #gpu.address_space<unified>.

That's interesting. Probably not in gpu though? Like, for all the devices on a platform?

@fabianmcg
Contributor

but, as there are examples of address space mismatches between host and device in LLVM, I would reconsider the approach proposed by @kurapov-peter here.

Those mismatches actually have a TODO attached; see https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp#L389-L394.
We shouldn't aggravate the situation.

That's interesting. Probably not in gpu though? Like, for all the devices on a platform?

What I'm saying is: if we add a fourth GPU address space, unified, that could fit your needs for the SPIR-V-like model.

@kurapov-peter
Contributor Author

I'm inclined to just patch the signature in the pass. Sounds cleanest so far. Will give it some more thought.
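
A minimal sketch of that idea, assuming the pass rewrites the host-side memref types so that the launch and the kernel signatures match exactly (hypothetical output; the exact mechanism is still open):

  // Host IR after such a pass: the memref now carries the address space the
  // kernel expects, so the verifier check stays an exact type match.
  func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
    return
  }

This is essentially the IR @fabianmcg wrote by hand above, but produced by the conversion instead of being required from the user.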

@victor-eds
Contributor

Those mismatches actually have a TODO attached; see https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp#L389-L394.
We shouldn't aggravate the situation.

I meant between host and device in LLVM IR. Anyway, I think we can just give this a thought and come back if we find further issues as @kurapov-peter says.

@kurapov-peter
Contributor Author

Closing this as we have a cleaner solution in #105664.

@kurapov-peter kurapov-peter deleted the ocl-kernel-signature-validation branch October 10, 2024 12:40