[amdgpu][openmp] Avoiding writing to packet header twice #70695

JonChesterfield · 2023-10-30T17:42:54Z

I think it follows from the HSA spec that only a write to the first byte of a packet is significant to the GPU, in which case writing to the second short (packet->setup) early and reading it back later would be safe. However, the examples of this all use an atomic write to the first 32 bits, and it seems a credible risk that the occasional CI errors about invalid packets are caused by the firmware noticing the early write to packet->setup and treating it as a sign that the packet is ready to go.

That concern was probably overly paranoid. However, in passing I noticed that the code in libc is genuinely invalid. The memset writes a zero to the header byte, changing it from type_invalid (1) to type_vendor (0), at which point the GPU is free to read the 64-byte packet and interpret it as a vendor packet. That is probably why libc CI periodically reports errors about invalid packets.
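A minimal sketch of the memset hazard described above. `MockPacket` is a hypothetical 64-byte stand-in for `hsa_kernel_dispatch_packet_t` (it is not the real HSA type), the helper names are invented for illustration, and the type values 0 and 1 are taken from the description. It assumes the packet type lives in the low 8 bits of the header.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for hsa_kernel_dispatch_packet_t: same 64-byte size,
// with the 16-bit header and setup fields first. Not the real HSA type.
struct MockPacket {
  uint16_t header;
  uint16_t setup;
  unsigned char body[60];
};
static_assert(sizeof(MockPacket) == 64, "AQL packets are 64 bytes");

// Packet type values as quoted in the PR description.
constexpr uint16_t kTypeVendor = 0;
constexpr uint16_t kTypeInvalid = 1;

// Assumption: the packet type occupies the low 8 bits of the header.
inline uint16_t packet_type(const MockPacket &p) {
  return static_cast<uint16_t>(p.header & 0xFF);
}

// Models a free queue slot: stale contents, header marked invalid.
inline MockPacket make_free_slot() {
  MockPacket p;
  std::memset(&p, 0xAB, sizeof(p)); // leftover data from an earlier dispatch
  p.header = kTypeInvalid;
  return p;
}

// The pattern removed from Loader.cpp: clear the whole packet, header included.
// Returns the packet type a concurrent reader could observe afterwards.
inline uint16_t type_after_full_memset() {
  MockPacket p = make_free_slot();
  assert(packet_type(p) == kTypeInvalid);
  std::memset(&p, 0, sizeof(p)); // also overwrites the header with zero
  return packet_type(p);
}
```

With these definitions, `type_after_full_memset()` returns `kTypeVendor`: the slot stops being marked invalid before any of the real fields have been filled in.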

Also a drive-by change to do the atomic store on a uint32_t consistently. I'm not sure offhand what __atomic_store_n on a uint16_t* with an int value resolves to, so it seems better to be unambiguous there.
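A hedged sketch of why the 32-bit form is less ambiguous. The helper names and the sample values are hypothetical. `truncate_to_header` models the assumption that a store through a 16-bit pointer keeps only the low 16 bits of the value; that is my understanding of how the GCC/Clang builtin treats its value operand, not something confirmed in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Combine the 16-bit header and setup fields into the 32-bit word that
// occupies the first four bytes of the packet (header in the low half).
inline uint32_t pack_header_word(uint16_t header, uint16_t setup) {
  return static_cast<uint32_t>(header) | (static_cast<uint32_t>(setup) << 16u);
}

// What a 16-bit destination can hold of the same value: the conversion to
// uint16_t drops the setup bits entirely.
inline uint16_t truncate_to_header(uint16_t header, uint16_t setup) {
  return static_cast<uint16_t>(pack_header_word(header, setup));
}

// Publish header and setup together with a single release store.
// __atomic_store_n is a GCC/Clang builtin, as used in the diff.
inline void publish(uint32_t *first_word, uint16_t header, uint16_t setup) {
  assert(first_word != nullptr);
  __atomic_store_n(first_word, pack_header_word(header, setup),
                   __ATOMIC_RELEASE);
}

// Read the published word back with acquire ordering.
inline uint32_t read_published(uint32_t *first_word) {
  return __atomic_load_n(first_word, __ATOMIC_ACQUIRE);
}
```

For example, with the illustrative values `header = 0x1402` and `setup = 0x0003`, `pack_header_word` yields `0x00031402`, while `truncate_to_header` yields only `0x1402`. Storing through a `uint32_t *` makes it explicit that both halves are written in one operation.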

llvmbot · 2023-10-30T17:44:32Z

@llvm/pr-subscribers-backend-amdgpu

@llvm/pr-subscribers-libc

Author: Jon Chesterfield (JonChesterfield)

Changes

Full diff: https://github.com/llvm/llvm-project/pull/70695.diff

2 Files Affected:

(modified) libc/utils/gpu/loader/amdgpu/Loader.cpp (+6-5)
(modified) openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp (+8-6)

diff --git a/libc/utils/gpu/loader/amdgpu/Loader.cpp b/libc/utils/gpu/loader/amdgpu/Loader.cpp
index 1d0247a6dc5dca0..80f513bb80f8859 100644
--- a/libc/utils/gpu/loader/amdgpu/Loader.cpp
+++ b/libc/utils/gpu/loader/amdgpu/Loader.cpp
@@ -222,13 +222,13 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   // Set up the packet for exeuction on the device. We currently only launch
   // with one thread on the device, forcing the rest of the wavefront to be
   // masked off.
-  std::memset(packet, 0, sizeof(hsa_kernel_dispatch_packet_t));
-  packet->setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
+  uint16_t setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
                    (params.num_blocks_z * params.num_threads_z != 1))
                   << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
   packet->workgroup_size_x = params.num_threads_x;
   packet->workgroup_size_y = params.num_threads_y;
   packet->workgroup_size_z = params.num_threads_z;
+  packet->reserved0 = 0;
   packet->grid_size_x = params.num_blocks_x * params.num_threads_x;
   packet->grid_size_y = params.num_blocks_y * params.num_threads_y;
   packet->grid_size_z = params.num_blocks_z * params.num_threads_z;
@@ -236,7 +236,7 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
   packet->group_segment_size = group_size;
   packet->kernel_object = kernel;
   packet->kernarg_address = args;
-
+  packet->reserved2 = 0;
   // Create a signal to indicate when this packet has been completed.
   if (hsa_status_t err =
           hsa_signal_create(1, 0, nullptr, &packet->completion_signal))
@@ -244,12 +244,13 @@ hsa_status_t launch_kernel(hsa_agent_t dev_agent, hsa_executable_t executable,
 
   // Initialize the packet header and set the doorbell signal to begin execution
   // by the HSA runtime.
-  uint16_t setup = packet->setup;
+
   uint16_t header =
       (HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE) |
       (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCACQUIRE_FENCE_SCOPE) |
       (HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_SCRELEASE_FENCE_SCOPE);
-  __atomic_store_n(&packet->header, header | (setup << 16), __ATOMIC_RELEASE);
+  uint32_t header_word = header | (setup << 16u);
+  __atomic_store_n((uint32_t*)&packet->header, header_word, __ATOMIC_RELEASE);
   hsa_signal_store_relaxed(queue->doorbell_signal, packet_id);
 
   // Wait until the kernel has completed execution on the device. Periodically
diff --git a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
index ab24856f9bc78e4..b763c6371cc363b 100644
--- a/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
+++ b/openmp/libomptarget/plugins-nextgen/amdgpu/src/rtl.cpp
@@ -650,8 +650,8 @@ struct AMDGPUQueueTy {
     hsa_kernel_dispatch_packet_t *Packet = acquirePacket(PacketId);
     assert(Packet && "Invalid packet");
 
-    // The header of the packet is written in the last moment.
-    Packet->setup = UINT16_C(1) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+    // The first 32 bits of the packet are written after the other fields
+    uint16_t Setup = UINT16_C(1) << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
     Packet->workgroup_size_x = NumThreads;
     Packet->workgroup_size_y = 1;
     Packet->workgroup_size_z = 1;
@@ -667,7 +667,7 @@ struct AMDGPUQueueTy {
     Packet->completion_signal = OutputSignal->get();
 
     // Publish the packet. Do not modify the packet after this point.
-    publishKernelPacket(PacketId, Packet);
+    publishKernelPacket(PacketId, Setup, Packet);
 
     return Plugin::success();
   }
@@ -745,16 +745,17 @@ struct AMDGPUQueueTy {
   /// the kernel launch. Do not modify the packet once this function is called.
   /// Assumes the queue lock is acquired.
   void publishKernelPacket(uint64_t PacketId,
+                           uint16_t Setup,
                            hsa_kernel_dispatch_packet_t *Packet) {
     uint32_t *PacketPtr = reinterpret_cast<uint32_t *>(Packet);
 
-    uint16_t Setup = Packet->setup;
     uint16_t Header = HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE;
     Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_ACQUIRE_FENCE_SCOPE;
     Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
 
     // Publish the packet. Do not modify the package after this point.
-    __atomic_store_n(PacketPtr, Header | (Setup << 16), __ATOMIC_RELEASE);
+    uint32_t HeaderWord = Header | (Setup << 16u);
+    __atomic_store_n(PacketPtr, HeaderWord, __ATOMIC_RELEASE);
 
     // Signal the doorbell about the published packet.
     hsa_signal_store_relaxed(Queue->doorbell_signal, PacketId);
@@ -773,7 +774,8 @@ struct AMDGPUQueueTy {
     Header |= HSA_FENCE_SCOPE_SYSTEM << HSA_PACKET_HEADER_RELEASE_FENCE_SCOPE;
 
     // Publish the packet. Do not modify the package after this point.
-    __atomic_store_n(PacketPtr, Header | (Setup << 16), __ATOMIC_RELEASE);
+    uint32_t HeaderWord = Header | (Setup << 16u);
+    __atomic_store_n(PacketPtr, HeaderWord, __ATOMIC_RELEASE);
 
     // Signal the doorbell about the published packet.
     hsa_signal_store_relaxed(Queue->doorbell_signal, PacketId);

github-actions · 2023-10-30T17:55:02Z

✅ With the latest revision this PR passed the C/C++ code formatter.

jhuber6 · 2023-10-30T18:11:01Z

libc/utils/gpu/loader/amdgpu/Loader.cpp

-                  << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
+  uint16_t setup = (1 + (params.num_blocks_y * params.num_threads_y != 1) +
+                    (params.num_blocks_z * params.num_threads_z != 1))
+                   << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;
  packet->workgroup_size_x = params.num_threads_x;


I think we should keep the memset but start it four bytes in, past the header and setup fields, so we don't need to worry about the packet's reserved fields. Unless it's very important that we write to these only once.

JonChesterfield

To what end? The packet field assignments that follow are complete, and if they weren't, that would be a problem. It's probably only the first byte that is critical, but it might be the first four bytes in practice.
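For reference, the alternative raised in this review thread (zero everything except the first four bytes) could look roughly like the sketch below. This is not what the PR merged. `MockPacket` is a hypothetical 64-byte stand-in for `hsa_kernel_dispatch_packet_t`, and `zero_packet_body` and `body_is_zero` are names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for hsa_kernel_dispatch_packet_t: same 64-byte size,
// with the 16-bit header and setup fields first. Not the real HSA type.
struct MockPacket {
  uint16_t header;
  uint16_t setup;
  unsigned char body[60];
};
static_assert(sizeof(MockPacket) == 64, "AQL packets are 64 bytes");

// Header (2 bytes) plus setup (2 bytes) form the first 32-bit word.
constexpr std::size_t kHeaderWordBytes = 4;

// Zero every byte after the first 32-bit word, leaving header and setup
// untouched so the slot stays marked invalid until it is published.
inline void zero_packet_body(MockPacket *p) {
  assert(p != nullptr);
  unsigned char *bytes = reinterpret_cast<unsigned char *>(p);
  std::memset(bytes + kHeaderWordBytes, 0,
              sizeof(MockPacket) - kHeaderWordBytes);
}

// True if every byte after the header word is zero.
inline bool body_is_zero(const MockPacket &p) {
  for (unsigned char b : p.body)
    if (b != 0)
      return false;
  return true;
}
```

The merged change instead assigns each field explicitly, including `reserved0` and `reserved2`, which avoids depending on the byte offset of the header word.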

jhuber6

LG

jhuber6 · 2023-10-30T18:28:19Z

libc/utils/gpu/loader/amdgpu/Loader.cpp

  // Create a signal to indicate when this packet has been completed.
  if (hsa_status_t err =
          hsa_signal_create(1, 0, nullptr, &packet->completion_signal))
    handle_error(err);

  // Initialize the packet header and set the doorbell signal to begin execution
  // by the HSA runtime.
-  uint16_t setup = packet->setup;
+


Extra whitespace?

JonChesterfield requested review from carlobertolli, jhuber6 and dhruvachak October 30, 2023 17:42

llvmbot added backend:AMDGPU libc openmp:libomptarget OpenMP offload runtime labels Oct 30, 2023

JonChesterfield force-pushed the jc_hsa_ordering branch from cddc917 to 53fafac Compare October 30, 2023 18:09

jhuber6 reviewed Oct 30, 2023

jhuber6 approved these changes Oct 30, 2023

[amdgpu][openmp] Avoiding writing to packet header twice

d55e389

JonChesterfield force-pushed the jc_hsa_ordering branch from 53fafac to d55e389 Compare October 30, 2023 18:34

JonChesterfield merged commit 896749a into llvm:main Oct 30, 2023

JonChesterfield deleted the jc_hsa_ordering branch October 30, 2023 18:36

jplehr added a commit to jplehr/llvm-zorg that referenced this pull request Dec 28, 2023

[libc][AMDGPU] Enable parallel tests for libc

349af61

It appears that llvm/llvm-project#70695 fixed an issue that prevented libc unit tests from being run in parallel.

jplehr mentioned this pull request Dec 28, 2023

[libc][AMDGPU] Enable parallel tests for libc on AMDGPU llvm/llvm-zorg#86

Merged

jplehr added a commit to llvm/llvm-zorg that referenced this pull request Dec 28, 2023

[libc][AMDGPU] Enable parallel tests for libc on AMDGPU (#86)

2fe8180

It appears that llvm/llvm-project#70695 fixed an issue that prevented libc unit tests from being run in parallel.
