
[mlir][gpu] Skip address space checks for memrefs between launchFuncOp and kernel func #102925

Closed

Conversation

kurapov-peter
Contributor

Hi! I've been trying to use the new signature lowering (#101664 and #102621) with GPU kernel outlining, GPU binary generation, and an OpenCL runtime. The gpu-to-llvm-spv pass can handle memrefs with address space attributes; however, when lowering an arbitrary function (say, a function that accepts a tensor and then goes through bufferization) and using its argument as a kernel argument, the address space is missing.

Consider the following example function that accepts some device-host shared memory:

  func.func @foo(%mem : memref<5xf32>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
      gpu.return
    }
  }

The correct address space for a kernel argument is 1, i.e. global (OpenCL's requirement), but it doesn't make any sense on the host side, for which 0 is the right one (say, we rely on some runtime mechanism to deliver the data to the device). The two don't match, and validation fails on the type check even though the code is still valid.

The easiest workaround, discussed with @victor-eds, is to allow this discrepancy on the validation side. It could be made even more specific by checking the target to ensure this is the right case.

@llvmbot
Member

llvmbot commented Aug 12, 2024

@llvm/pr-subscribers-mlir-gpu

@llvm/pr-subscribers-mlir

Author: Petr Kurapov (kurapov-peter)

Full diff: https://github.com/llvm/llvm-project/pull/102925.diff

2 Files Affected:

  • (modified) mlir/lib/Dialect/GPU/IR/GPUDialect.cpp (+22-1)
  • (modified) mlir/test/Dialect/GPU/ops.mlir (+14)
diff --git a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
index a1f87a637a6141..8c3391c8d92936 100644
--- a/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
+++ b/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp
@@ -401,8 +401,29 @@ LogicalResult GPUDialect::verifyOperationAttribute(Operation *op,
              << expectedNumArguments;
 
     auto functionType = kernelGPUFunction.getFunctionType();
+    auto typesMatch = [&](Type launchOpArgType, Type gpuFuncArgType) {
+      auto launchOpMemref = dyn_cast<MemRefType>(launchOpArgType);
+      auto kernelMemref = dyn_cast<MemRefType>(gpuFuncArgType);
+      // Allow address space incompatibility for OpenCL kernels: `gpu.launch`'s
+      // argument memref without address space attribute will match a kernel
+      // function's memref argument with address space `Global`.
+      if (launchOpMemref && kernelMemref) {
+        auto launchAS = llvm::dyn_cast_or_null<gpu::AddressSpaceAttr>(
+            launchOpMemref.getMemorySpace());
+        auto kernelAS = llvm::dyn_cast_or_null<gpu::AddressSpaceAttr>(
+            kernelMemref.getMemorySpace());
+        if (!launchAS && kernelAS &&
+            kernelAS.getValue() == gpu::AddressSpace::Global)
+          return launchOpMemref.getShape() == kernelMemref.getShape() &&
+                 launchOpMemref.getLayout() == kernelMemref.getLayout() &&
+                 launchOpMemref.getElementType() ==
+                     kernelMemref.getElementType();
+      }
+      return launchOpArgType == gpuFuncArgType;
+    };
     for (unsigned i = 0; i < expectedNumArguments; ++i) {
-      if (launchOp.getKernelOperand(i).getType() != functionType.getInput(i)) {
+      if (!typesMatch(launchOp.getKernelOperand(i).getType(),
+                      functionType.getInput(i))) {
         return launchOp.emitOpError("type of function argument ")
                << i << " does not match";
       }
diff --git a/mlir/test/Dialect/GPU/ops.mlir b/mlir/test/Dialect/GPU/ops.mlir
index ba7897f4e80cb5..fdfd9fcc8b1853 100644
--- a/mlir/test/Dialect/GPU/ops.mlir
+++ b/mlir/test/Dialect/GPU/ops.mlir
@@ -441,3 +441,17 @@ gpu.module @module_with_two_target [#nvvm.target, #rocdl.target<chip = "gfx90a">
 
 gpu.module @module_with_offload_handler <#gpu.select_object<0>> [#nvvm.target] {
 }
+
+// Check kernel memref args are valid even if the address space differs
+module attributes {gpu.container_module} {
+  func.func @foo(%mem : memref<5xf32>) {
+    %c0 = arith.constant 0 : i32
+    gpu.launch_func @gpu_kernels::@kernel blocks in (%c0, %c0, %c0) threads in (%c0, %c0, %c0) : i32 args(%mem : memref<5xf32>)
+    return
+  }
+  gpu.module @gpu_kernels {
+    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
+      gpu.return
+    }
+  }
+}

@victor-eds
Contributor

Also, can we get a test for this in mlir/test/Dialect/GPU/ops.mlir?

Contributor

@victor-eds victor-eds left a comment

LGTM. I'll add more reviewers

Contributor

@fabianmcg fabianmcg left a comment

I'm blocking preemptively before discussing the changes, as I'm not that familiar with the SPIR-V design.

Why don't you pass the correct address space from the beginning? For example:

  func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
      gpu.return
    }
  }

Also, you need to change the description and title: this is not LaunchOp, but LaunchFuncOp.

@kurapov-peter kurapov-peter changed the title Skip address space checks for memrefs between launchOp and kernel func Skip address space checks for memrefs between launchFuncOp and kernel func Aug 14, 2024
@kurapov-peter
Contributor Author

Why don't you pass the correct address space from the beginning? For example:

  func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>) kernel {
      gpu.return
    }
  }

I tried to explain that in the description: basically, for memory that is shared between host and device (the runtime controls the necessary transfers), the host memref should have address space zero, as a regular memory buffer. It's not a GPU buffer, so the attribute doesn't make any sense there.

Also, you need to change the description and title: this is not LaunchOp, but LaunchFuncOp.

Yup, fixed.

@fabianmcg
Contributor

for memory that is shared between host and device (the runtime controls the necessary transfers), the host memref should have address space zero, as a regular memory buffer. It's not a GPU buffer, so the attribute doesn't make any sense there.

My issue here is that you're changing the semantics of the op for a very particular case. The semantics of gpu.launch_func are similar to those of a call operation, where the signature is expected to be an exact match.

To me the correct solution is having a new operation for your runtime.

@kuhar kuhar changed the title Skip address space checks for memrefs between launchFuncOp and kernel func [mlir][gpu] Skip address space checks for memrefs between launchFuncOp and kernel func Aug 14, 2024
@kurapov-peter
Contributor Author

To me the correct solution is having a new operation for your runtime.

Do you mean something like gpu.launch_opencl_func? It's not specific to the OpenCL runtime though; SYCL has unified shared memory (see, for example, https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:memory.model) and would have the same problem. The semantics are the same too, except for the address space, so the operation would essentially be a copy of gpu.launch_func with the changed validation.

There was also an idea to leave the memrefs intact (no attributes) and then attach address spaces during the lowering to LLVM. The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.
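
For illustration, a minimal sketch of that ambiguous case (a hypothetical kernel, not from this patch):

  gpu.module @gpu_kernels {
    // %arg0 explicitly requests the global address space, while %arg1 carries no
    // memory space attribute. A pass that attaches address spaces automatically
    // cannot tell whether %arg1 was deliberately left without one.
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<global>>,
                     %arg1 : memref<5xf32>) kernel {
      gpu.return
    }
  }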

My issue here is that you're changing the semantics of the op for a very particular case.

Yeah, I thought maybe making it very specific to the use case would be tolerable, to keep all the other cases intact.

@fabianmcg
Contributor

It's not specific to the OpenCL runtime though; SYCL has unified shared memory (see, for example, https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:memory.model) and would have the same problem.

The issue is not unified memory. For example, CUDA and HIP also have it, but their models take the generic address space as the global one, hence no issues arise.

The semantics are the same too, except for the address space, so the operation would essentially be a copy of gpu.launch_func with the changed validation.

The address space is part of the type; making exceptions is not the best solution.

The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.

You know that you can attach attributes to individual arguments?

One option is adding: #gpu.address_space<unified>.
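
A rough sketch of one way to read that, assuming a new #gpu.address_space<unified> attribute were added (it does not exist in the gpu dialect today):

  // Hypothetical IR: host and kernel use the same unified address space, so the
  // verifier sees an exact type match, and each target lowering decides what
  // unified maps to (e.g. the global/CrossWorkgroup space for SPIR-V-like models).
  func.func @foo(%mem : memref<5xf32, #gpu.address_space<unified>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<unified>>)
    return
  }
  gpu.module @gpu_kernels {
    gpu.func @kernel(%arg0 : memref<5xf32, #gpu.address_space<unified>>) kernel {
      gpu.return
    }
  }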

@victor-eds
Contributor

victor-eds commented Aug 14, 2024

I took a look at the MLIR Vulkan runner (also SPIR-V-based): it faces a similar issue, and its solution is to run a pass that sets the expected address spaces on both the host and the device side as part of the conversion pipeline. I haven't dug deep enough into this, but my first impression is that this works "by chance" (or by good planning in advance!).

The address space they assume for the default memory space (memref<f32>) is StorageBuffer. So, if they have:

gpu.func @kernel(%arg0: memref<f32>) ...
...
gpu.launch @kernel(%arg0: memref<f32>)

This would be changed to:

gpu.func @kernel(%arg0: memref<f32, #spirv.StorageClass<StorageBuffer>>) ...
...
gpu.launch @kernel(%arg0: memref<f32, #spirv.StorageClass<StorageBuffer>>)

And, when lowered to LLVM dialect (pseudocode):

llvm.func @kernel(%arg0: !llvm.ptr) ...
...
"launch_op" @kernel(%arg0: !llvm.ptr)

It works fine because the StorageBuffer storage class is converted to address space 0 in LLVM. On the host, these programming models (AFAIK) use the default (0) address space both for host-side allocations and for the device side (needed for this parameter passing). It happens to be the case that the device side uses LLVM address space 0 to represent this particular storage class.

In our case, we'd need to use the CrossWorkgroup storage class there, mapping to LLVM address space 1, and, well, we'd be using address space 1 on the host (something these programming models do not do at all).
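
For comparison, a sketch of the CrossWorkgroup variant in the same pseudocode notation as the snippets above (not taken from any existing pipeline):

// Hypothetical CrossWorkgroup analogue of the StorageBuffer example:
gpu.func @kernel(%arg0: memref<f32, #spirv.StorageClass<CrossWorkgroup>>) ...
...
gpu.launch @kernel(%arg0: memref<f32, #spirv.StorageClass<CrossWorkgroup>>)

// After lowering to the LLVM dialect, the device-side pointer would carry
// address space 1, while the host keeps the default address space 0:
llvm.func @kernel(%arg0: !llvm.ptr<1>) ...
...
"launch_op" @kernel(%arg0: !llvm.ptr)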

IMO, being explicit with the memory spaces is better than any other trick we come up with to avoid this issue. I am not that familiar with other programming models and don't know how CUDA/HIP pipelines handle this (EDIT: thanks @fabianmcg for clarifying!), but, as there are examples of address space mismatches between host and device in LLVM, I would reconsider the approach proposed by @kurapov-peter here.

EDIT: The generic address space in our case would be mapped to LLVM address space 4, so that's not an option either. Also, generic pointers as kernel arguments are not allowed in our environment.

@kurapov-peter
Contributor Author

The issue is not unified memory. For example, CUDA and HIP also have it, but their models take the generic address space as the global one, hence no issues arise.

Yeah... somebody decided 1 is a great default value, and now I'm suffering. One thought that comes to mind is whether this should be part of some kind of host-device calling convention (which is runtime-specific). That would make the validation strict by adhering to the convention rather than relying on the default values.

I took a look at the MLIR Vulkan runner (also SPIR-V-based): it faces a similar issue, and its solution is to run a pass that sets the expected address spaces on both the host and the device side as part of the conversion pipeline.

Oh, interesting, thanks Victor! Isn't that weird though? I suppose it also relies on the implicit knowledge that, on the host, this would yield the correct address space just because the default one is zero?

@kurapov-peter
Contributor Author

The problem is that it is unclear what should happen to a kernel that has two pointers as arguments: one with an attribute and one without.

You know that you can attach attributes to individual arguments?

Sure, I mean it is unclear whether the user's intention actually was to have one of the arguments without the attribute for whatever reason. I suspect this shouldn't be a problem in practice though, since you need addrspace(1) in any case.

One option is adding: #gpu.address_space<unified>.

That's interesting. Probably not in gpu though? Like, for all the devices on a platform?

@fabianmcg
Contributor

but, as there are examples of address space mismatches between host and device in LLVM, I would reconsider the approach proposed by @kurapov-peter here.

Those mismatches actually have a TODO attached; see https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp#L389-L394.
We shouldn't aggravate the situation.

That's interesting. Probably not in gpu though? Like, for all the devices on a platform?

What I'm saying is: if we add a fourth GPU address space, unified, that could fit your needs for the SPIR-V-like model.

@kurapov-peter
Contributor Author

I'm inclined to just patch the signature in the pass. Sounds cleanest so far. Will give it some more thought.
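
A minimal sketch of that idea, assuming the pass rewrites the host-side memref types so that the launch and the kernel signatures match exactly (hypothetical output; the exact mechanism is still open):

  // Host IR after such a pass: the memref now carries the address space the
  // kernel expects, so the verifier check stays an exact type match.
  func.func @foo(%mem : memref<5xf32, #gpu.address_space<global>>) {
    gpu.launch_func @gpu_kernels::@kernel args(%mem : memref<5xf32, #gpu.address_space<global>>)
    return
  }

This is essentially the IR @fabianmcg wrote by hand above, but produced by the conversion instead of being required from the user.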

@victor-eds
Contributor

Those mismatches actually have a TODO attached; see https://github.com/llvm/llvm-project/blob/main/mlir/lib/Dialect/GPU/IR/GPUDialect.cpp#L389-L394.
We shouldn't aggravate the situation.

I meant between host and device in LLVM IR. Anyway, I think we can just give this a thought and come back if we find further issues as @kurapov-peter says.

@kurapov-peter
Contributor Author

Closing this as we have a cleaner solution in #105664.

@kurapov-peter kurapov-peter deleted the ocl-kernel-signature-validation branch October 10, 2024 12:40