@zasdfgbnm zasdfgbnm commented Oct 29, 2022

The biggest change in this PR is to the index lowering of swizzle. Currently, on the devel branch, the swizzle lowering in index_compute.cpp still creates a swizzle op, which codegen.cpp then replaces with strings like "Xor", "ZShape", etc. These strings name functions defined in swizzle.cu. This PR changes the lowering of swizzle as follows:

  • The definition of swizzle is pulled from swizzle.cu into ops/swizzle.{h, cpp} as composite operators.
    • Swizzle is no longer different from other composite ops like dropout, layer_norm, etc.
      • We can just apply swizzle to TensorViews to obtain new TensorViews, which makes swizzle more flexible to test and play with. See FusionSwizzleExample*, which uses PyTorch's advanced indexing to visualize the memory layout of the swizzled tensor.
      • The bank conflict checker is now able to work with swizzle.
    • swizzle.cu will be cleaned up in a later PR.
    • The special handling of the swizzle op in codegen.cpp is also deprecated and will be removed in a follow-up PR.
  • In index_compute.cpp, instead of creating a swizzle op in the index math, the lowering now calls these composite operators from swizzle.h to build the index math.
  • In the future, I will add swizzle support to the transpose scheduler. getTransposeHeuristics will then use the bank conflict checker to pick the best swizzle strategy for each shared memory buffer.
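
To make the index math in the list above concrete, here is a minimal host-side sketch of the three swizzle patterns named by the example tests (ZShape, Xor, CyclicShift), written as plain C++ functions, together with a toy bank conflict count. All names in it (`Index2D`, `xorSwizzle`, `cyclicShiftSwizzle`, `zShapeSwizzle`, `columnReadConflictWays`) are hypothetical and are not the operators in ops/swizzle.{h, cpp}. The formulas are the commonly used forms and may differ from the actual definitions. The bank model (32 banks of 4-byte words, a 32x32 tile of 4-byte elements) is also an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>

// Hypothetical helper names for illustration; none of these are nvfuser APIs.
using Index2D = std::pair<int, int>;

// Xor swizzle: row x is kept, column y becomes x ^ y.
// The result stays in range when the tile width is a power of two.
inline Index2D xorSwizzle(int x, int y) {
  return {x, x ^ y};
}

// Cyclic shift: row x is rotated by x columns.
inline Index2D cyclicShiftSwizzle(int x, int y, int size_y) {
  assert(size_y > 0);
  return {x, (x + y) % size_y};
}

// Z-shape: odd rows are traversed in reverse column order.
inline Index2D zShapeSwizzle(int x, int y, int size_y) {
  return {x, x % 2 == 0 ? y : size_y - 1 - y};
}

// Worst-case number of rows that land in the same bank when 32 threads read
// one logical column of a 32x32 tile (assumed: 32 banks, 4-byte elements).
template <typename Swizzle>
int columnReadConflictWays(int column, Swizzle swizzle) {
  constexpr int kTile = 32;
  constexpr int kBanks = 32;
  std::array<int, kBanks> hits{};
  for (int row = 0; row < kTile; ++row) {
    const Index2D p = swizzle(row, column);
    const int linear = p.first * kTile + p.second;  // element offset in the tile
    ++hits[linear % kBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, reading a column of the unswizzled tile puts all 32 rows in one bank, while the Xor and cyclic-shift layouts spread them over 32 distinct banks. This is the kind of comparison a bank conflict checker could use to rank swizzle strategies.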

Besides, I also:

Performance checked against #2022, no perf regression.

@zasdfgbnm zasdfgbnm marked this pull request as ready for review October 29, 2022 07:06
@zasdfgbnm zasdfgbnm changed the title from "[WIP][Not ready for review] Swizzle changes" to "Some misc swizzle changes" Oct 29, 2022
"TV4 is not redundantly used but not detected.");
}

// Test a basic swizzle pattern
Collaborator Author

All swizzle tests in this file are moved to test_gpu_swizzle.cpp

using namespace torch::jit::fuser::cuda;

// Test a basic swizzle pattern
TEST_F(NVFuserTest, FusionSimpleSwizzle0_CUDA) {
Collaborator Author

@zasdfgbnm zasdfgbnm Oct 29, 2022

Moved from test_gpu3.cpp with trivial modification.

}

// Test swizzle inlining
TEST_F(NVFuserTest, FusionSimpleSwizzle1_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

// Test sync insertion and memory check in parallelized swizzles.
// In this test, data is parallel written into smem in zcurve
// pattern and then read out and output to global mem unswizzled.
TEST_F(NVFuserTest, FusionSimpleSwizzle2_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

}

// Test BestEffortReplay behavior with swizzle op
TEST_F(NVFuserTest, FusionSwizzleMapping_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

}

// Test a basic loop swizzle pattern
TEST_F(NVFuserTest, FusionLoopSwizzle0_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

}

// Outer block zshape pattern
TEST_F(NVFuserTest, FusionLoopSwizzle1_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

}

// Test assertion in unsupported pattern: non-leaf loop swizzle.
TEST_F(NVFuserTest, FusionLoopSwizzleCheck0_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

}

// Test assertion in unsupported pattern: half-inlined loop swizzle.
TEST_F(NVFuserTest, FusionLoopSwizzleCheck1_CUDA) {
Collaborator Author

Moved from test_gpu3.cpp without modification.

ASSERT_ANY_THROW(fe.compileFusion(&fusion));
}

TEST_F(NVFuserTest, FusionSwizzleVectorize_CUDA) {
Collaborator Author

This is a new test, please review.

ASSERT_ANY_THROW(GpuLower lower(&fusion));
}

TEST_F(NVFuserTest, FusionTransposeBankConflictSwizzle1_CUDA) {
Collaborator Author

This is a new test, please review.

}
}

TEST_F(NVFuserTest, FusionDataSwizzleGlobal_CUDA) {
Collaborator Author

This is a new test, please review.


} // namespace

TEST_F(NVFuserTest, FusionSwizzleExampleZShape_CUDA) {
Collaborator Author

This is a new test, please review.

Collaborator

Nice way to test the swizzle ops!

TORCH_CHECK(at::allclose(input, unswizzled));
}

TEST_F(NVFuserTest, FusionSwizzleExampleXor_CUDA) {
Collaborator Author

This is a new test, please review.

TORCH_CHECK(at::allclose(input, unswizzled));
}

TEST_F(NVFuserTest, FusionSwizzleExampleCyclicShift_CUDA) {
Collaborator Author

This is a new test, please review.

@zasdfgbnm zasdfgbnm requested review from csarofeen and naoyam October 29, 2022 07:42
Collaborator

@naoyam naoyam left a comment

Overall looks very good. One question is whether we need to expose the new swizzle functions, since they seem to be used only as part of the lowering. Would you expect them to be used directly by users as well?

@zasdfgbnm zasdfgbnm requested a review from naoyam October 30, 2022 00:04
@zasdfgbnm zasdfgbnm mentioned this pull request Oct 30, 2022
Collaborator

@naoyam naoyam left a comment

LGTM

@zasdfgbnm zasdfgbnm merged commit 292ebef into devel Oct 30, 2022
@zasdfgbnm zasdfgbnm deleted the swizzle-changes branch October 30, 2022 16:59