Conversation

@shmsong shmsong commented May 9, 2022

This PR is a quick patch for redundant predicate sync insertion.

A sync is needed for a redundant parallel type unless every use chain of the redundantly written value in smem/gmem arrives at consumers that are themselves redundant writes of the same parallel type.

This PR patches the insertion so that all redundant writes are sync'ed, avoiding race conditions that could happen on devel TOT.

Detection of the cases where a sync is not needed for redundant types will come in a follow-up.
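
As a rough illustration of that rule (a hand-written CUDA sketch, not fuser-generated output; the kernel and buffer names are made up for this example), a smem write that is redundant in TIDy, i.e. performed only by the threadIdx.y == 0 slice of the block, needs a block sync before any consumer that is not redundant in the same parallel type:

__global__ void redundant_write_sketch(const float* in, float* out) {
  // Assumes a (32, N) thread block, as in the generated kernels below.
  __shared__ float smem[32];
  // Redundant write: only the threadIdx.y == 0 slice produces the value.
  if (threadIdx.y == 0) {
    smem[threadIdx.x] = in[threadIdx.x];
  }
  // Sync required: the consumer below reads across all threadIdx.y, so threads
  // outside the writing slice would otherwise race with the write above.
  __syncthreads();
  out[threadIdx.y * 32 + threadIdx.x] = smem[threadIdx.x];
}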

@shmsong shmsong changed the base branch from master to devel May 9, 2022 21:01
@shmsong shmsong mentioned this pull request May 9, 2022
@shmsong shmsong changed the title from "WIP: Patch sync insertion for redundant predicated writes" to "Patch sync insertion for redundant predicated writes" May 9, 2022
@shmsong shmsong requested review from naoyam and csarofeen May 9, 2022 22:11
launch_params_.gdimy() * launch_params_.gdimz(),
"Wanted to launch a cooperative kernel, however the number of blocks is greater than ",
"what can be resident on the GPU at once. Need: ",
launch_params_.gdimx() * launch_params_.gdimy() * launch_params_.gdimz(),
Author (shmsong) commented on the diff lines above:
Unrelated formatting.

Comment on lines +22715 to +22716
tv0->computeAt(tv3, 0);
tv1->computeAt(tv3, 0);
Collaborator:
Are these meant to do something?

Author (shmsong):
Not really. Just making sure all the CA parameters have a value. I vaguely remember we didn't have a default behavior without any CA setting, but that was a long while ago.


naoyam commented May 9, 2022

What I mentioned in the MMA PR was that, when we have a chain of redundant exprs, I was wondering whether each one would be synchronized. I added a variation of the test to see what happens, and here's the generated code:

__global__ void kernel1(Tensor<float, 1> T0, Tensor<float, 2> T1, Tensor<float, 2> T3) {
  alignas(16) extern __shared__ char array[];
  unsigned offset = 0;
  offset = alignBufferSize(offset, 16);
  float* T4 = reinterpret_cast<float*>(array + offset);
  offset += (32 * sizeof(float));
  // Alias Allocation - shared
  auto& T2 = T4;
  if ((((nvfuser_index_t)threadIdx.y) == 0)) {
    T4[((nvfuser_index_t)threadIdx.x)]
       = T0[(((nvfuser_index_t)threadIdx.x) * T0.stride[0])];
  }
  __barrier_sync(0);
  if ((((nvfuser_index_t)threadIdx.y) == 0)) {
    T2[((nvfuser_index_t)threadIdx.x)]
       = T4[((nvfuser_index_t)threadIdx.x)];
  }
  __barrier_sync(0);
  T3[(((nvfuser_index_t)threadIdx.y) * 32) + ((nvfuser_index_t)threadIdx.x)]
    = T2[((nvfuser_index_t)threadIdx.x)]
    + T1[(((nvfuser_index_t)threadIdx.y) * T1.stride[0]) + (((nvfuser_index_t)threadIdx.x) * T1.stride[1])];
}

The point I was making is that the first sync is redundant.

I guess this is not common, so I think this is fine for now, but I wanted to clarify my concern for future optimization.


naoyam commented May 9, 2022

Please remove or merge the added test as you'd like. I just wanted to demonstrate the case.

@naoyam naoyam left a comment

Thanks for the fix.


shmsong commented May 9, 2022

> What I mentioned in the MMA PR was that, when we have a chain of redundant exprs, I was wondering whether each one would be synchronized. [...] The point I was making is that the first sync is redundant. I guess this is not common, so I think this is fine for now, but I wanted to clarify my concern for future optimization.

Yes. I was planning on handling this in a follow-up, i.e. the case where a redundant write has a use chain containing other redundant writes. So T0 -> T2 is a redundant chain and T2 -> T3 isn't.

These vertical redundant chains seem not too bad to handle. I'd need to think a bit more about whether there are pathological horizontal patterns; in the worst case we probably end up with sub-optimal code unless we reconsider expr ordering.
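
For reference, here is a hand-written sketch of the vertical-chain case (not fuser output; names are illustrative only). When the only consumer of a redundant write is another write guarded by the same parallel-type predicate, the intermediate sync adds nothing, because each reading thread reads exactly the element it wrote; the sync is still needed before the first consumer that is not redundant in that type:

__global__ void redundant_chain_sketch(const float* in, float* out) {
  __shared__ float buf_a[32];
  __shared__ float buf_b[32];
  if (threadIdx.y == 0) {
    // First redundant write (T0 -> T4 in the generated code above).
    buf_a[threadIdx.x] = in[threadIdx.x];
  }
  // A sync here would be redundant: the only reader of buf_a is the write below,
  // guarded by the same TIDy predicate, and each thread reads its own element.
  if (threadIdx.y == 0) {
    // Second redundant write (T4 -> T2 above).
    buf_b[threadIdx.x] = buf_a[threadIdx.x];
  }
  // This sync is needed: the consumer below is not redundant in TIDy.
  __syncthreads();
  // Non-redundant consumer (T2 -> T3 above), read by every threadIdx.y.
  out[threadIdx.y * 32 + threadIdx.x] = buf_b[threadIdx.x];
}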

@shmsong shmsong force-pushed the patch_sync_insertion branch from 38b27fa to 9637c58 on May 9, 2022 23:58

shmsong commented May 9, 2022

> Please remove or merge the added test as you'd like. I just wanted to demonstrate the case.

@naoyam Thanks for the repro. The new test case was moved to #1687 and made into a failing case for the redundant sync insertion.

@shmsong shmsong merged commit f9132b7 into devel May 10, 2022
@shmsong shmsong deleted the patch_sync_insertion branch May 10, 2022 17:28