Rewrite reducePredicateRegisterUsage #2533


Merged: 7 commits, Mar 6, 2023
Conversation

@zasdfgbnm (Collaborator) commented Mar 1, 2023

The former approach does not make sense because it does a lot of reordering even when there are no register-usage savings. This reordering can be annoying because it makes the code very hard to read. I am rewriting this pass so that it only reorders terms when there is a register saving.

@zasdfgbnm (Collaborator, Author)

Marking this as ready, but I would like to wait for #2500 because I don't want this PR to conflict with the new assertCUDAKernels in the loop-rotation tests.

@zasdfgbnm zasdfgbnm marked this pull request as ready for review March 1, 2023 09:19
@zasdfgbnm zasdfgbnm requested a review from naoyam March 1, 2023 09:20
@@ -337,7 +337,7 @@ if(BUILD_TEST)
list(APPEND JIT_TEST_SRCS ${NVFUSER_ROOT}/test/test_gpu2.cpp)
list(APPEND JIT_TEST_SRCS ${NVFUSER_ROOT}/test/test_gpu3.cpp)
list(APPEND JIT_TEST_SRCS ${NVFUSER_ROOT}/test/test_gpu_compute_with.cpp)
list(APPEND JIT_TEST_SRCS ${NVFUSER_ROOT}/test/test_gpu_expr_simplifier.cpp)
list(APPEND JIT_TEST_SRCS ${NVFUSER_ROOT}/test/test_expr_simplifier.cpp)
@zasdfgbnm (Collaborator, Author)

I think we prefixed our tests with test_gpu_ because we were inside torchscript's test directory and needed to distinguish our tests from the other torch JIT tests. I don't think this prefix makes sense anymore.

@@ -47,7 +47,9 @@ void assertSimplifiedMod(Val* x, Val* y, Val* z) {

} // namespace

TEST_F(NVFuserTest, FusionAssociativeAndCommutativeReordering_CUDA) {
@zasdfgbnm (Collaborator, Author) Mar 1, 2023

We used this naming convention back when our tests were linked with the other JIT tests, and it helped us tell our tests apart from them. Now our tests have a standalone executable, so I don't think this convention provides any benefit anymore. I am removing it to get shorter test names.

@naoyam (Collaborator)

I remember there was some CI setting that relies on this naming convention. CC: @jjsjann123

@zasdfgbnm (Collaborator, Author)

Checked with @jjsjann123 offline: we should keep the _CUDA suffix, but we can change the rest. Updated this PR.

Comment on lines 374 to 375
// This is failing ?!
// setAssertOutOfBound(true);
@zasdfgbnm (Collaborator, Author)

It is OK for this to fail due to a limitation of this feature.

@naoyam (Collaborator) commented Mar 6, 2023

> when there is a register saving.

I haven't looked into the PR yet, but how do you know if there's a register saving?

@zasdfgbnm (Collaborator, Author) commented Mar 6, 2023

> > when there is a register saving.
>
> I haven't looked into the PR yet, but how do you know if there's a register saving?

I changed this pass to only consider register savings on unrolled loops. For example, in threadIdx.x + 3 < T0.size[0] there is no unrolled loop, so changing it to threadIdx.x - T0.size[0] < -3 does not save anything; now we will not move terms across the < boundary in such cases. This pass works by finding all terms that depend on an unrolled loop index, computing their register type, and comparing it with the register type of the remaining terms. If there is a saving (that is, the remaining terms use a general-purpose register while the unrolled terms use a uniform register or an immediate, or the remaining terms use a uniform register while the unrolled terms use an immediate), then it moves terms.

For example, if I have

#pragma unroll
for i = 0..8:
  threadIdx.x / 128 + i * 32 == 256 + blockIdx.y * i

then the register types are:

gp,no_unroll + imm,unroll == imm,no_unroll + uniform,unroll

so I will need 8 general purpose registers for the left and 8 uniform registers for the right.

If I instead do

gp,no_unroll - imm,no_unroll == uniform,unroll - imm,unroll

then I will need 1 general purpose register for the left and 8 uniform registers for the right, which saves 7 general purpose registers.

@naoyam (Collaborator) commented Mar 6, 2023


Thanks for the explanation. Yeah, I found this part (https://github.com/csarofeen/pytorch/pull/2533/files#diff-7853cbfc8ac2e2e18643fb0ba06777e2ee9f4d43065973e9ffb38ebc0fcc0f68R1697), and it makes sense.

@naoyam (Collaborator) left a review

LGTM. Please just make sure the test name change doesn't invalidate anything around CI.


} else {
redist_lhs({bop->lhs()});

auto [lhs_unroll, lhs_unroll_rtype, lhs_other, lhs_other_rtype] = classify(
@naoyam (Collaborator)

I didn't know about this syntax. I'm assuming it has the same effect as using std::tie, right?

@zasdfgbnm (Collaborator, Author)

It is similar, but not the same. This is called a structured binding. My understanding is that a structured binding is mostly used to declare and initialize new variables, while std::tie is mostly for assignment to variables that already exist.

@naoyam (Collaborator)

Oh, so this can be used for variable declarations. That sounds handy.

@zasdfgbnm zasdfgbnm merged commit 5a69c1b into devel Mar 6, 2023
@zasdfgbnm zasdfgbnm deleted the reducePredicateRegisterUsage branch March 6, 2023 22:54