Skip to content

Conversation

shmsong
Copy link

@shmsong shmsong commented Sep 13, 2022

The loop interleaving optimization in this PR is needed on ampere with cp.async.

The transform this pass enables is the following:

original code:

for i0 in 0..4
  expr1

for i1 in 0..8
  expr2

for i2 in 0..4
  expr3

with some simple conservative checking that expr 1-3 have no direct dependencies, the pass transforms the above into:

for i0 in {0}
  expr1

for i1 in {0,1}
  expr2

for i2 in {0}
  expr3

for i0 in {1}
  expr1

for i1 in {2,3}
  expr2

for i2 in {1}
  expr3

for i0 in {2}
  expr1

for i1 in {4,5}
  expr2

for i2 in {2}
  expr3

...

The particular use case is the following:

for i0 in 0..4
  cp.async

for i1 in 0..8
  load.shared

// In here we are accumulating a lot of instructions
//  that either read or write shared memory and we will
//  see slow down due to congestions on the hardware

for i2 in 0..4
  mma

The interleaving essentially optimizes away the congestion mentioned above on the comment.

@shmsong shmsong changed the title WIP: [Not ready for review] loop interleaving pass to interleave double buffered unrolled loops loop interleaving pass to interleave double buffered unrolled loops Sep 21, 2022
@@ -627,6 +637,9 @@ class TORCH_CUDA_CU_API TensorView : public Val {
//! Indicates if the prolog of the double buffer loop of double
//! buffer tensor will be lifted out of the main loop.
bool skew_double_buffer_loop_ = false;

// Loop where the next level of unrolled loops are interleaved.
c10::optional<std::pair<int, int>> maybe_interleave_axis_and_factor_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add comments on the pair

// If we see main loop before seeing the double buffer axis,
// it cannot be proven safe to interleave by double buffering
// but the other two points might apply.
can_interleave = false;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't a break be added here?

continue;
}

// Double buffered tv doesn't need to be checked, see Point 2 above:
Copy link
Collaborator

@naoyam naoyam Sep 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: Point 1

// [Supported Interleaving Cases]
// All the expressions that are inside the main loop or subloop can
// only be 3 cases:
// 1. It's double/circular buffered across a loop that's either at or on the
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't follow this case.

What I can't yet figure out is what the underlying generic condition this transformation must satisfy. Generally speaking, it seems safe if there's no data dependency between the subloop TVs, which basically corresponds to the Point 3. In the case of Point 2, it is also safe despite the data dependency because the dependency is constrained inside the sub loop, right? I can't wrap my head around the Point 1 yet, though.

Comment on lines +425 to +429
if (concrete_main_loop_ == concrete_loop_id &&
fl->doubleBufferLoopStage() == DoubleBufferLoopStage::Main) {
handleMainLoop(fl);
} else {
kir::ExprMutator::handle(fl);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean the interleave main loop must also be a main loop of double buffering?

// Need to insert commits for multi-stage circular buffering
// on the prologs, but do not need to wait for them until
// the main loop.
if (stage_depth > 2 && loop_type_ == DoubleBufferLoopStage::Prolog) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a generic bug fix or is it related to the interleaving transformation?

Comment on lines +849 to +852
if (need_insert_commit) {
main_loop->body().insert_before(
*block_sync_it, IrBuilder::create<kir::CpAsyncCommit>());
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not completely following what should be done here, but the above comment on need_insert_commit indicates a commit should be inserted before the wait, but this seems to insert a commit after the wait inserted above. Am I missing something?

@csarofeen csarofeen changed the title loop interleaving pass to interleave double buffered unrolled loops [MatMul] loop interleaving pass to interleave double buffered unrolled loops Oct 19, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants