Expr simplifier: simplification passes for matmul #2275

zasdfgbnm · 2022-12-17T06:27:40Z

Warning: this PR contains #2258 and #2273. Please review it only after I have merged those two PRs and rebased this one.

This PR adds a few more passes that simplify matmul indexing. The newly added passes are cancelDivMod, distributeDivisibleDivMod, and distributeMul. The most helpful pass for matmul is distributeDivisibleDivMod, which simplifies indices like:

(threadIdx.x + 16 * i1) % 8

into

threadIdx.x % 8

which removes the data dependency on i1, so the index can be hoisted out of the i1 loop.
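
As a rough illustration of why the two forms agree, the sketch below compares them by brute force over a small range of non-negative values. The function names are hypothetical and are not part of nvfuser; the only claim checked is that 16 * i1 is a multiple of 8 and therefore does not change the remainder when both threadIdx.x and i1 are non-negative.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; none of these names come from nvfuser.

// Index as written before simplification: depends on both tid and i1.
std::int64_t indexBefore(std::int64_t tid, std::int64_t i1) {
  return (tid + 16 * i1) % 8;
}

// Index after the rewrite: the i1 term has been dropped entirely.
std::int64_t indexAfter(std::int64_t tid) {
  return tid % 8;
}

// Compares the two forms for every non-negative tid < maxTid and i1 < maxI1.
bool formsAgree(std::int64_t maxTid, std::int64_t maxI1) {
  for (std::int64_t tid = 0; tid < maxTid; ++tid) {
    for (std::int64_t i1 = 0; i1 < maxI1; ++i1) {
      if (indexBefore(tid, i1) != indexAfter(tid)) {
        return false;
      }
    }
  }
  return true;
}
```

Because indexAfter takes no i1 argument, a compiler (or a hand-written kernel) can evaluate it once before the i1 loop instead of on every iteration.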

Example matmul kernel code

Command:

CUDA_VISIBLE_DEVICES=1 PYTORCH_NVFUSER_DUMP="cuda_kernel,expr_simplify,ptxas_verbose" ./build/bin/nvfuser_bench --benchmark_filter=Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_NT_Legacy/2048/3456/4096/manual_time

Kernel diff compare (this PR + #1900) vs #1900 alone
https://www.diffchecker.com/pcxGCQkn

Matmul perf benchmark

Compare (this PR + #1900) vs #1900 alone

Command:

CUDA_VISIBLE_DEVICES=1 ./build/bin/nvfuser_bench --benchmark_filter='.*Matmul.*Legacy/2048/3456/4096.*'

Before:

---------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
Nvfuser_Matmul_4warp3stage/no_quant_nvfuser_4warp_TT_Legacy/2048/3456/4096/manual_time       1209 us         1387 us          527
Nvfuser_Matmul_4warp3stage/no_quant_nvfuser_4warp_TN_Legacy/2048/3456/4096/manual_time       1204 us         1382 us          504
Nvfuser_Matmul_4warp3stage/no_quant_nvfuser_4warp_NT_Legacy/2048/3456/4096/manual_time       1126 us         1304 us          538
Nvfuser_Matmul_4warp4stage/no_quant_nvfuser_4warp_TT_Legacy/2048/3456/4096/manual_time       2018 us         2200 us          312
Nvfuser_Matmul_4warp4stage/no_quant_nvfuser_4warp_TN_Legacy/2048/3456/4096/manual_time       2082 us         2271 us          273
Nvfuser_Matmul_4warp4stage/no_quant_nvfuser_4warp_NT_Legacy/2048/3456/4096/manual_time       1957 us         2138 us          305
Nvfuser_Matmul_8warp3stage/no_quant_nvfuser_8warp_TT_Legacy/2048/3456/4096/manual_time       1297 us         1491 us          460
Nvfuser_Matmul_8warp3stage/no_quant_nvfuser_8warp_TN_Legacy/2048/3456/4096/manual_time       1358 us         1580 us          460
Nvfuser_Matmul_8warp3stage/no_quant_nvfuser_8warp_NT_Legacy/2048/3456/4096/manual_time       1314 us         1493 us          454
Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_TT_Legacy/2048/3456/4096/manual_time       1411 us         1590 us          423
Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_TN_Legacy/2048/3456/4096/manual_time       1407 us         1587 us          424
Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_NT_Legacy/2048/3456/4096/manual_time       1318 us         1497 us          453
EagerModeMatmul/no_quant_eagermode_TT_Legacy/2048/3456/4096/manual_time                       878 us          942 us          806
EagerModeMatmul/no_quant_eagermode_TN_Legacy/2048/3456/4096/manual_time                       904 us          971 us          747
EagerModeMatmul/no_quant_eagermode_NT_Legacy/2048/3456/4096/manual_time                       834 us          903 us          856

After:

---------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                       Time             CPU   Iterations
---------------------------------------------------------------------------------------------------------------------------------
Nvfuser_Matmul_4warp3stage/no_quant_nvfuser_4warp_TT_Legacy/2048/3456/4096/manual_time        929 us         1109 us          678
Nvfuser_Matmul_4warp3stage/no_quant_nvfuser_4warp_TN_Legacy/2048/3456/4096/manual_time        951 us         1138 us          655
Nvfuser_Matmul_4warp3stage/no_quant_nvfuser_4warp_NT_Legacy/2048/3456/4096/manual_time        893 us         1083 us          680
Nvfuser_Matmul_4warp4stage/no_quant_nvfuser_4warp_TT_Legacy/2048/3456/4096/manual_time       1066 us         1249 us          595
Nvfuser_Matmul_4warp4stage/no_quant_nvfuser_4warp_TN_Legacy/2048/3456/4096/manual_time       1177 us         1358 us          529
Nvfuser_Matmul_4warp4stage/no_quant_nvfuser_4warp_NT_Legacy/2048/3456/4096/manual_time       1032 us         1217 us          604
Nvfuser_Matmul_8warp3stage/no_quant_nvfuser_8warp_TT_Legacy/2048/3456/4096/manual_time        936 us         1117 us          669
Nvfuser_Matmul_8warp3stage/no_quant_nvfuser_8warp_TN_Legacy/2048/3456/4096/manual_time        911 us         1182 us          670
Nvfuser_Matmul_8warp3stage/no_quant_nvfuser_8warp_NT_Legacy/2048/3456/4096/manual_time        928 us         1109 us          670
Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_TT_Legacy/2048/3456/4096/manual_time        914 us         1117 us          671
Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_TN_Legacy/2048/3456/4096/manual_time        936 us         1115 us          668
Nvfuser_Matmul_8warp4stage/no_quant_nvfuser_8warp_NT_Legacy/2048/3456/4096/manual_time        926 us         1108 us          669
EagerModeMatmul/no_quant_eagermode_TT_Legacy/2048/3456/4096/manual_time                       876 us          957 us          811
EagerModeMatmul/no_quant_eagermode_TN_Legacy/2048/3456/4096/manual_time                       903 us          977 us          755
EagerModeMatmul/no_quant_eagermode_NT_Legacy/2048/3456/4096/manual_time                       829 us          905 us          840

zasdfgbnm · 2023-01-17T00:48:13Z

@naoyam This is ready for review

naoyam · 2023-01-18T03:41:53Z

This PR adds a few more passes that are capable of simplifying matmul indexing well. The newly added passes are: cancelDivMod, distributeDivisibleDivMod, and distributeMul. The most helpful pass for matmul is distributeDivisibleDivMod. It simplifies indices like:
(threadIdx.x + 16 * i1) % 8
into
threadIdx.x

Why is this legal? Is threadIdx.x assumed to be less than 8?

naoyam · 2023-01-18T04:04:27Z

third_party/nvfuser/csrc/expr_simplifier.cpp

 }

+BinaryOp* toDivModOp(Expr* expr) {
+  if (auto bop = dynamic_cast<BinaryOp*>(expr)) {


nit: Most of the functions follow this pattern of nested conditional branches. We could reduce the indentation level by negating the condition and returning early.

changed most of them
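
The early-exit style suggested in this review comment can be sketched as follows. Only the first two lines of toDivModOp appear in the diff above, so the rest of the body here is an assumption: the sketch supposes the function returns the BinaryOp when it is a division or modulo and nullptr otherwise. The Expr, BinaryOp, UnaryOp, and BinaryOpType definitions are minimal stand-ins, not the real nvfuser classes.

```cpp
// Stand-in types for illustration only; the real nvfuser classes differ.
enum class BinaryOpType { Add, Mul, Div, Mod };

struct Expr {
  virtual ~Expr() = default;  // polymorphic base so dynamic_cast works
};

struct BinaryOp : Expr {
  explicit BinaryOp(BinaryOpType t) : type(t) {}
  BinaryOpType type;
};

struct UnaryOp : Expr {};

// Nested form: each condition adds one level of indentation.
BinaryOp* toDivModOpNested(Expr* expr) {
  if (auto bop = dynamic_cast<BinaryOp*>(expr)) {
    if (bop->type == BinaryOpType::Div || bop->type == BinaryOpType::Mod) {
      return bop;
    }
  }
  return nullptr;
}

// Early-exit form: negate each condition and return immediately.
BinaryOp* toDivModOpEarlyExit(Expr* expr) {
  auto bop = dynamic_cast<BinaryOp*>(expr);
  if (bop == nullptr) {
    return nullptr;
  }
  if (bop->type != BinaryOpType::Div && bop->type != BinaryOpType::Mod) {
    return nullptr;
  }
  return bop;
}
```

Both versions return the same result for every input; the second keeps the main path at a single indentation level, which matters more as the number of checks grows.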

zasdfgbnm · 2023-01-18T05:10:09Z

Oh, sorry. I meant to say threadIdx.x % 8...
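
On the legality question: in C++ the % operator truncates toward zero, so dropping a term that is a multiple of the divisor is exact for non-negative operands but can change the result when signs differ. The sketch below shows one case of each using a hypothetical helper, dropIsExact, that is not part of nvfuser. This is presumably why the PR builds on the sign proofs from #2273, although the thread does not state the exact precondition the pass checks.

```cpp
#include <cstdint>

// Hypothetical helper for illustration; not an nvfuser function.
// Returns true when removing `multiple` (assumed divisible by d) from
// (a + multiple) % d leaves the result unchanged under C++ truncating %.
bool dropIsExact(std::int64_t a, std::int64_t multiple, std::int64_t d) {
  return (a + multiple) % d == a % d;
}
```

For example, dropIsExact(3, 16, 8) holds because 19 % 8 and 3 % 8 are both 3, while dropIsExact(-3, 16, 8) does not, because 13 % 8 is 5 but -3 % 8 is -3.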

naoyam

LGTM

zasdfgbnm added 30 commits December 12, 2022 19:16

trivial mod

df5f5fe

simplifyZeroMod

7ab0654

recurseDown doc

110b160

fix

be528a2

save

0cfa8cf

update

73e1cad

save

5f67a3d

Merge branch 'devel' of github.com:csarofeen/pytorch into simplify-tr…

1adf8a4

…ivial-mod

format

7ae1981

save

975b950

move

2e039b1

rename

604ac82

fix

ee8867b

Merge branch 'devel' of github.com:csarofeen/pytorch into simplify-tr…

6d9726f

…ivial-mod

save

222f884

fix

024f6c2

fix

9aa3ab3

save

9530702

save

2794f43

non zero check

3a9c899

prove::isPositive, prove::isNonNegative, prove::isNonZero

0d739cd

save

6bcff97

more

6655ea3

Merge branch 'devel' of github.com:csarofeen/pytorch into compatible-…

fd5c2f4

…sign-check

save

659df58

more fix

1053cb1

Merge branch 'devel' of github.com:csarofeen/pytorch into simplify-tr…

52a1bce

…ivial-mod

fix

769cbc5

save

b8f9c08

save

1d9d93a

zasdfgbnm added 11 commits January 16, 2023 15:33

Merge branch 'compatible-sign-check' of github.com:csarofeen/pytorch …

6400b7d

…into distribute-divmod

fix

9ffa5a0

Merge branch 'compatible-sign-check' of github.com:csarofeen/pytorch …

39f0e02

…into distribute-divmod

cleanup

c824a1b

cleanup

eec6b5a

rename

6b35c43

save

a31cc24

VarInfoMap comment

f6c1522

Merge branch 'compatible-sign-check' of github.com:csarofeen/pytorch …

43c5ec5

…into distribute-divmod

fix

77252a1

fix

55b7f7b

zasdfgbnm requested a review from naoyam January 17, 2023 00:46

zasdfgbnm added 5 commits January 16, 2023 21:22

distributeMul restrict

b4539d0

distribute all

8f7afbb

fix

26b3c92

FusionAssociativeAndCommutativeReordering_CUDA skip

07097f1

FusionAssociativeAndCommutativeReordering_CUDA

0cb7472

Base automatically changed from compatible-sign-check to devel January 17, 2023 23:58

zasdfgbnm mentioned this pull request Jan 17, 2023

Expr simplifier: implement prove::isPositive, prove::isNonNegative, prove::isNonZero #2273

Merged

zasdfgbnm added 2 commits January 17, 2023 16:03

Merge branch 'devel' of github.com:csarofeen/pytorch into distribute-…

72b4e9c

…divmod

comment

dbe16cd

naoyam reviewed Jan 18, 2023

naoyam approved these changes Jan 18, 2023

zasdfgbnm added 2 commits January 18, 2023 11:57

Merge branch 'devel' of github.com:csarofeen/pytorch into distribute-…

7aef8da

…divmod

indent

6250809

zasdfgbnm merged commit cead7ad into devel Jan 18, 2023

zasdfgbnm deleted the distribute-divmod branch January 18, 2023 20:47
