[MatMul] loop interleaving pass to interleave double buffered unrolled loops #1975

shmsong · 2022-09-13T20:10:43Z

The loop interleaving optimization in this PR is needed on Ampere with cp.async.

The transform this pass enables is the following:

Original code:

for i0 in 0..4
  expr1

for i1 in 0..8
  expr2

for i2 in 0..4
  expr3

After a simple, conservative check that expr1, expr2, and expr3 have no direct dependencies on each other, the pass transforms the above into:

for i0 in {0}
  expr1

for i1 in {0,1}
  expr2

for i2 in {0}
  expr3

for i0 in {1}
  expr1

for i1 in {2,3}
  expr2

for i2 in {1}
  expr3

for i0 in {2}
  expr1

for i1 in {4,5}
  expr2

for i2 in {2}
  expr3

...
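
As a rough illustration of the schedule above, here is a minimal standalone sketch that splits each unrolled loop into equal slices and emits them round-robin. It is not the pass in this PR: `UnrolledLoop`, `LoopSlice`, and `interleave` are hypothetical names made up for this example, and it assumes every loop extent is evenly divisible by the interleave factor.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of one unrolled loop: a label and a trip count.
struct UnrolledLoop {
  std::string name;
  int extent;
};

// One emitted slice: loop `name` restricted to iterations [begin, end).
struct LoopSlice {
  std::string name;
  int begin;
  int end;
};

// Split every loop into `factor` equal slices and emit the slices
// round-robin: slice 0 of each loop, then slice 1 of each loop, etc.
// Assumption for this sketch: each extent is divisible by `factor`.
std::vector<LoopSlice> interleave(
    const std::vector<UnrolledLoop>& loops,
    int factor) {
  assert(factor > 0);
  std::vector<LoopSlice> schedule;
  for (int step = 0; step < factor; ++step) {
    for (const auto& loop : loops) {
      assert(loop.extent % factor == 0);
      const int chunk = loop.extent / factor;
      schedule.push_back({loop.name, step * chunk, (step + 1) * chunk});
    }
  }
  return schedule;
}
```

Calling `interleave` on loops with extents 4, 8, and 4 and a factor of 4 yields the slices expr1 [0,1), expr2 [0,2), expr3 [0,1), expr1 [1,2), expr2 [2,4), and so on, which matches the ordering listed above. The sketch does not model the dependency check; it only shows the reordering.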

The particular use case is the following:

for i0 in 0..4
  cp.async

for i1 in 0..8
  load.shared

// Here we accumulate many instructions that either read or
//  write shared memory, and we see a slowdown due to
//  congestion on the hardware

for i2 in 0..4
  mma

The interleaving essentially removes the congestion described in the comment above.

naoyam · 2022-09-22T05:53:22Z

torch/csrc/jit/codegen/cuda/ir_interface_nodes.h

@@ -627,6 +637,9 @@ class TORCH_CUDA_CU_API TensorView : public Val {
  //! Indicates if the prolog of the double buffer loop of double
  //!  buffer tensor will be lifted out of the main loop.
  bool skew_double_buffer_loop_ = false;
+
+  // Loop where the next level of unrolled loops are interleaved.
+  c10::optional<std::pair<int, int>> maybe_interleave_axis_and_factor_;


Add comments on the pair

naoyam · 2022-09-22T16:03:34Z

torch/csrc/jit/codegen/cuda/lower_interleaved_loop.cpp

+          // If we see main loop before seeing the double buffer axis,
+          //  it cannot be proven safe to interleave by double buffering
+          //  but the other two points might apply.
+          can_interleave = false;


Shouldn't a break be added here?

naoyam · 2022-09-22T16:04:31Z

torch/csrc/jit/codegen/cuda/lower_interleaved_loop.cpp

+      continue;
+    }
+
+    // Double buffered tv doesn't need to be checked, see Point 2 above:


Typo: Point 1

naoyam · 2022-09-22T16:16:25Z

torch/csrc/jit/codegen/cuda/lower_interleaved_loop.cpp

+  // [Supported Interleaving Cases]
+  // All the expressions that are inside the main loop or subloop can
+  //  only be 3 cases:
+  // 1. It's double/circular buffered across a loop that's either at or on the


Can't follow this case.

What I can't yet figure out is the underlying generic condition this transformation must satisfy. Generally speaking, it seems safe if there's no data dependency between the subloop TVs, which basically corresponds to Point 3. In the case of Point 2, it is also safe despite the data dependency because the dependency is constrained inside the subloop, right? I can't wrap my head around Point 1 yet, though.

naoyam · 2022-09-22T18:28:35Z

torch/csrc/jit/codegen/cuda/lower_interleaved_loop.cpp

+    if (concrete_main_loop_ == concrete_loop_id &&
+        fl->doubleBufferLoopStage() == DoubleBufferLoopStage::Main) {
+      handleMainLoop(fl);
+    } else {
+      kir::ExprMutator::handle(fl);


Does this mean the interleave main loop must also be a main loop of double buffering?

naoyam · 2022-09-22T19:17:56Z

torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp

+    // Need to insert commits for multi-stage circular buffering
+    //  on the prologs, but do not need to wait for them until
+    //  the main loop.
+    if (stage_depth > 2 && loop_type_ == DoubleBufferLoopStage::Prolog) {


Is this a generic bug fix or is it related to the interleaving transformation?

naoyam · 2022-09-22T19:29:13Z

torch/csrc/jit/codegen/cuda/lower_double_buffer.cpp

+      if (need_insert_commit) {
+        main_loop->body().insert_before(
+            *block_sync_it, IrBuilder::create<kir::CpAsyncCommit>());
+      }


I'm not completely following what should be done here. The comment on need_insert_commit above indicates that a commit should be inserted before the wait, but this seems to insert a commit after the wait inserted above. Am I missing something?

shmsong added 7 commits August 29, 2022 12:00

add loop interleaving pass

5a6aa34

use interleaving in matmul scheduler

68ce333

Merge remote-tracking branch 'origin/index_codegen' into loop_interle…

9918f3d

…aving

[MOVE] circular buffer fix

222e053

make interleaving factor an option

bfc41f4

comments ; clean up

e648eea

optionally reorder the tiles to support legacy test (FIXME)

a98f510

shmsong changed the title ~~WIP: [Not ready for review] loop interleaving pass to interleave double buffered unrolled loops~~ loop interleaving pass to interleave double buffered unrolled loops Sep 21, 2022

naoyam reviewed Sep 22, 2022

naoyam and others added 4 commits September 28, 2022 15:11

Quick cleanup of things I noticed while reviewing

8a262b0

Merge branch 'index_codegen-rebase' into loop_interleaving-rebase

531d3e2

fixes

2d66bd3

Merge branch 'index_codegen' into loop_interleaving

77fdf76

csarofeen changed the title ~~loop interleaving pass to interleave double buffered unrolled loops~~ [MatMul] loop interleaving pass to interleave double buffered unrolled loops Oct 19, 2022
