Some misc swizzle changes #2138
"TV4 is not redundantly used but not detected."); | ||
}

// Test a basic swizzle pattern

All swizzle tests in this file are moved to test_gpu_swizzle.cpp
using namespace torch::jit::fuser::cuda;

// Test a basic swizzle pattern
TEST_F(NVFuserTest, FusionSimpleSwizzle0_CUDA) {

Moved from test_gpu3.cpp with trivial modification.
}

// Test swizzle inlining
TEST_F(NVFuserTest, FusionSimpleSwizzle1_CUDA) {

Moved from test_gpu3.cpp without modification.
// Test sync insertion and memory check in parallelized swizzles.
// In this test, data is parallel written into smem in zcurve
// pattern and then read out and output to global mem unswizzled.
TEST_F(NVFuserTest, FusionSimpleSwizzle2_CUDA) {

Moved from test_gpu3.cpp without modification.
}

// Test BestEffortReplay behavior with swizzle op
TEST_F(NVFuserTest, FusionSwizzleMapping_CUDA) {

Moved from test_gpu3.cpp without modification.
}

// Test a basic loop swizzle pattern
TEST_F(NVFuserTest, FusionLoopSwizzle0_CUDA) {

Moved from test_gpu3.cpp without modification.
}

// Outer block zshape pattern
TEST_F(NVFuserTest, FusionLoopSwizzle1_CUDA) {

Moved from test_gpu3.cpp without modification.
}

// Test assertion in unsupported pattern: non-leaf loop swizzle.
TEST_F(NVFuserTest, FusionLoopSwizzleCheck0_CUDA) {

Moved from test_gpu3.cpp without modification.
}

// Test assertion in unsupported pattern: half-inlined loop swizzle.
TEST_F(NVFuserTest, FusionLoopSwizzleCheck1_CUDA) {

Moved from test_gpu3.cpp without modification.
  ASSERT_ANY_THROW(fe.compileFusion(&fusion));
}

TEST_F(NVFuserTest, FusionSwizzleVectorize_CUDA) {

This is a new test, please review.
  ASSERT_ANY_THROW(GpuLower lower(&fusion));
}

TEST_F(NVFuserTest, FusionTransposeBankConflictSwizzle1_CUDA) {

This is a new test, please review.
  }
}

TEST_F(NVFuserTest, FusionDataSwizzleGlobal_CUDA) {

This is a new test, please review.
} // namespace

TEST_F(NVFuserTest, FusionSwizzleExampleZShape_CUDA) {

This is a new test, please review.

Nice way to test the swizzle ops!
  TORCH_CHECK(at::allclose(input, unswizzled));
}

TEST_F(NVFuserTest, FusionSwizzleExampleXor_CUDA) {

This is a new test, please review.
  TORCH_CHECK(at::allclose(input, unswizzled));
}

TEST_F(NVFuserTest, FusionSwizzleExampleCyclicShift_CUDA) {

This is a new test, please review.

Overall looks very good. One question is whether we would need to expose the new swizzle functions as they seem to be only used as part of the lowering. Would you expect they could be directly used by the user as well?
LGTM
The biggest change in this PR is to the index lowering of swizzle. Currently, in the `devel` branch, the lowering of swizzle in `index_compute.cpp` still creates a swizzle op, which `codegen.cpp` replaces with strings like `"Xor"`, `"ZShape"`, etc. These strings name functions defined in `swizzle.cu`. This PR changes the lowering of swizzle as follows:

- Move the swizzle functions from `swizzle.cu` into `ops/swizzle.{h, cpp}` as composite operators, like `dropout`, `layer_norm`, etc.
- Add tests `FusionSwizzleExample*`, which use PyTorch's advanced indexing to visualize the memory layout of a swizzled tensor.
- `swizzle.cu` will be cleaned up in a later PR.
- The swizzle handling in `codegen.cpp` is also deprecated and will be removed in a follow-up PR.
- In `index_compute.cpp`, instead of creating a swizzle op in the index math, the lowering just calls these composite operators in `swizzle.h` to create the index math.
- `getTransposeHeuristics` will take advantage of the bank conflict checker to pick the best swizzle strategy for each shared memory buffer.

Besides, I also:

- Changed `VectorizeValidator` to check both `X` and `Y` to reject vectorized swizzled IDs. See "[MatMul] Prolog build out, adding automatic swizzle generator for a few tile sizes" #1900 (comment).

Performance checked against #2022, no perf regression.
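
For readers unfamiliar with the patterns named above, here is a minimal host-side sketch of what swizzle index math can look like, and why a swizzle can help with shared memory bank conflicts. Everything in it is an assumption made for illustration: the names `zShape`, `xorSwizzle`, `cyclicShift`, `applySwizzle`, and `maxBankConflict` are hypothetical and are not the functions in `ops/swizzle.{h, cpp}`; the formulas are one common definition of each pattern; and the 32x32 tile with 32 four-byte banks is an assumed configuration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical stand-ins for swizzle index maps; NOT the nvfuser API.
using Index2D = std::pair<int, int>;

constexpr int kTile = 32;  // assumed tile edge, one row per thread of a warp
constexpr int kBanks = 32; // assumed number of 4-byte shared memory banks

// Reverse the column order on odd rows.
inline Index2D zShape(int x, int y, int size_y) {
  return {x, x % 2 == 0 ? y : size_y - 1 - y};
}

// XOR the column with the row. The result stays in range when the tile
// edge is a power of two and both x and y are below it.
inline Index2D xorSwizzle(int x, int y) {
  return {x, x ^ y};
}

// Rotate the columns of row x by x positions.
inline Index2D cyclicShift(int x, int y, int size_y) {
  return {x, (x + y) % size_y};
}

// Scatter a row-major rows x cols tile through `swizzle` and return the
// resulting physical layout.
template <typename Swizzle>
std::vector<int> applySwizzle(
    const std::vector<int>& in, int rows, int cols, Swizzle swizzle) {
  std::vector<int> out(in.size(), -1);
  for (int x = 0; x < rows; ++x) {
    for (int y = 0; y < cols; ++y) {
      const Index2D p = swizzle(x, y);
      out[p.first * cols + p.second] = in[x * cols + y];
    }
  }
  return out;
}

// Largest number of threads that hit the same bank when thread x of a warp
// reads logical element (x, col) of a kTile x kTile tile of 4-byte values.
template <typename Swizzle>
int maxBankConflict(int col, Swizzle swizzle) {
  std::array<int, kBanks> hits{};
  for (int x = 0; x < kTile; ++x) {
    const Index2D p = swizzle(x, col);
    ++hits[(p.first * kTile + p.second) % kBanks];
  }
  return *std::max_element(hits.begin(), hits.end());
}
```

Under these assumptions, applying `zShape` to a 3x3 tile holding 1 through 9 gives the rows `1 2 3`, `6 5 4`, `7 8 9`. A column read of the unswizzled 32x32 tile puts all 32 threads on one bank, while the same read through `xorSwizzle` or `cyclicShift` touches each bank once.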
to reject vectorized swizzled ID. See [MatMul] Prolog build out, adding automatic swizzle generator for a few tile sizes #1900 (comment)Performance checked against #2022, no perf regression.