
Conversation


@shmsong commented on Jul 15, 2022

(The majority of the changes in this PR are test cases.) Conceptually, this is just a generalization of the double buffering logic to handle circular buffering.

There are other usage patterns of cp.async as well; this one mainly focuses on the use case in the matmul main loop.

Note:

Conceptually, a circular buffer is a generalization of a double buffer: a double buffer is just a circular buffer with a depth of 2. In terms of use cases, however, circular buffering only really makes sense with cp.async on Ampere, while double buffering is much more widely applicable.

For more details, see the comment section [Circular buffer]. A rough sketch of the idea follows below.
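For intuition only, here is a minimal hand-written sketch of the pattern, not nvfuser-generated code: a circular-buffered gmem-to-smem pipeline of depth D built on the CUDA cp.async primitives from `cuda_pipeline_primitives.h`. The depth, tile size, kernel name, and the trivial "compute" are made-up placeholders; the real matmul main loop feeds mma ops instead.

```
// Minimal sketch (not nvfuser output): circular buffering of gmem->smem loads
// with cp.async primitives. With D == 2 this degenerates to double buffering.
#include <cuda_pipeline_primitives.h>

constexpr int D    = 4;   // stage depth (placeholder value)
constexpr int TILE = 128; // elements per tile; launch with blockDim.x == TILE

__global__ void circular_buffer_sketch(
    const float* gmem, float* out, int num_tiles /* assumed >= D */) {
  __shared__ float smem[D][TILE];

  // Prolog: issue async loads for the first D - 1 tiles.
  for (int stage = 0; stage < D - 1; ++stage) {
    __pipeline_memcpy_async(&smem[stage][threadIdx.x],
                            &gmem[stage * TILE + threadIdx.x], sizeof(float));
    __pipeline_commit();
  }

  float acc = 0.f;
  // Main loop: consume tile i while prefetching tile i + D - 1.
  for (int i = 0; i < num_tiles; ++i) {
    const int prefetch = i + D - 1;
    if (prefetch < num_tiles) { // predicate: the last D - 1 iterations load nothing
      __pipeline_memcpy_async(&smem[prefetch % D][threadIdx.x],
                              &gmem[prefetch * TILE + threadIdx.x], sizeof(float));
    }
    __pipeline_commit();          // one (possibly empty) group per iteration
    __pipeline_wait_prior(D - 1); // by now tile i has landed in shared memory
    __syncthreads();
    acc += smem[i % D][threadIdx.x]; // stand-in for the mma math on tile i
    __syncthreads(); // buffer i % D gets overwritten by the next prefetch
  }
  out[threadIdx.x] = acc;
}
```

With D == 2 the same structure is the familiar double-buffer pattern, which is the sense in which this PR treats circular buffering as a pure generalization; the asynchronous cp.async loads are what make depths greater than 2 worthwhile on Ampere.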

Base automatically changed from matmul_propagator to devel July 23, 2022 21:41
@shmsong changed the base branch from devel to fragment_iter July 25, 2022 23:00
@shmsong changed the title from "WIP: [Not ready for review] Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load" to "Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load" Jul 26, 2022
@shmsong requested a review from csarofeen July 26, 2022 06:56
Base automatically changed from fragment_iter to devel July 29, 2022 18:29
@csarofeen (Owner) left a comment:

LGTM, just minor comments.

}

// MMA unit test on Ampere
TEST_F(NVFuserTest, FusionAmpereMMATT_CUDA) {
Owner:

Is this test case using circular or double buffering?

Owner:

Never mind, blaming how git is rendering the changes.

Author:

Yes, it's just rebasing noise. Will sort them out and clean up all the tests in follow-up PRs.

TORCH_CHECK(cg_outputs[0].allclose(tref, 0.0001, 0.0001));
}

// Matmul test on Ampere
Owner:

Maybe I'm missing what's new that you're testing in the Ampere tests before this?

Owner:

Never mind, it looks like it's just the way GitHub decided to mark the new tests; it makes these tests look like they're being modified below.

}

// Matmul test on Turing
TEST_F(NVFuserTest, FusionTuringMatmulTNRegDoubleBuffer_CUDA) {
Owner:

Is double buffering on by default? Don't see it turned on here.

Author:

No, it's not. There is an option in the matmul schedule params that controls it. This was a git rendering issue before rebasing.

//
// for i in 0..N: // main loop
// for j in ...
// if pred:
Owner:

The pred protects the first iteration and the last D-1 iterations, right?

Author:

Yes, that's correct. The last D-1 iterations of the main loop would not load new data.
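A quick worked example of that claim (purely illustrative, with arbitrary D and N, and assuming as in the sketch above that main-loop iteration i prefetches tile i + D - 1):

```
// Illustrative only: show which main-loop iterations issue a new load for an
// assumed stage depth D and trip count N; the last D - 1 iterations do not.
#include <cstdio>

int main() {
  constexpr int D = 3; // stage depth (arbitrary)
  constexpr int N = 8; // main-loop trip count (arbitrary)
  for (int i = 0; i < N; ++i) {
    const bool loads = (i + D - 1) < N; // load predicate
    std::printf("iter %d: %s\n", i, loads ? "cp.async tile" : "no load");
  }
  return 0;
}
```

For D = 3 and N = 8 this prints loads for iterations 0 through 5 and "no load" for iterations 6 and 7, i.e. the last D - 1 iterations.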

//
// (Epilog omitted since this only makes sense in using
// cp.async, where producer will be in global mem and consumer will
// be in shared mem).
Owner:

I'm not sure I follow this, isn't it just not necessary based on how you set the loops/predicates up?

Author:

The current double buffer/circular buffer infra would generate an epilog if the double-buffered tensor is loading from shared mem. This comment is just saying that cp.async only reads from gmem, so in this case there wouldn't be an epilog.

It could change over time.
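For illustration, the two loop shapes being contrasted look roughly like the pseudocode below (written in the style of the kernel comment quoted above; the exact structure nvfuser generates may differ, and the epilog shape here is an assumption based on this discussion):

```
// Producer in shared mem (regular double/circular buffering): an epilog is generated.
//
//   prolog : load stage 0 .. D-2
//   main   : for i in 0 .. N-D    : load stage i+D-1 ; compute stage i
//   epilog : for i in N-D+1 .. N-1:                    compute stage i   // no loads
//
// Producer in global mem via cp.async: no epilog; the main loop runs all N
// iterations and the load is simply predicated off for the last D-1 of them.
//
//   prolog : cp.async stage 0 .. D-2
//   main   : for i in 0 .. N-1    : if (i+D-1 < N) cp.async stage i+D-1 ; compute stage i
```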

// prolog loop as well covering the first N-1 iterations, N being the
// stage depth.
if (!is_prolog || is_circular_buffer_loop) {
if (is_circular_buffer_loop && is_prolog) {
Owner:

Nit: Just looks strange that you switched the position of these in subsequent if statements.

Author:

Updated. Thanks!

@shmsong merged commit d863d69 into devel Jul 31, 2022
@shmsong deleted the circular_buffering branch July 31, 2022 05:36
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. removes unnecessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch : fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR github API
Commits that are actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: pytorch#83067
Approved by: https://github.com/davidberard98