
Conversation

@zasdfgbnm (Collaborator) commented Jul 21, 2022

This PR adds a transpose scheduler, including both the scheduler and its heuristics. However, in the scheduler entry I just return false in canScheduleCompileTime, because I want to do more validation before turning it on by default.

Remaining things:

  • Add many more unit tests covering different shapes and dtypes of inputs/outputs, different DAG structures, and different transpose patterns, as well as transpose with broadcasting, transpose with view, etc.
  • Investigate shared memory usage: vectorization, tile size, bank conflicts, swizzle, etc.
  • Fill in canScheduleCompileTime
  • Benchmark on more devices, A100 and V100

I think the scheduling logic here is fine, and further tuning should not change it much; only parameters like tile size need more tuning. I also think this PR can be merged on its own, because it should not change any existing behavior, but I am also OK with doing the remaining items in this PR.
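
As a point of reference, the gating described above amounts to something like the following minimal sketch; the class name and signature are only assumed to mirror the existing scheduler entries and are not copied from this PR:

```cpp
// Minimal sketch, assuming a scheduler-entry shape like the existing ones
// (class name and signature are illustrative, not the actual PR code).
class Fusion;  // forward declaration of the nvfuser IR container

class TransposeScheduler {
 public:
  static bool canScheduleCompileTime(Fusion* fusion) {
    // TODO: real compile-time validation (transpose pattern detection,
    // shape/dtype restrictions, broadcast/view handling, ...).
    (void)fusion;  // unused until the checks are filled in
    return false;  // reject everything so the scheduler stays off by default
  }
};
```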

@zasdfgbnm changed the title from "[Not ready for review] Transpose scheduler" to "Transpose scheduler, step 1" on Jul 22, 2022
@zasdfgbnm marked this pull request as ready for review on July 22, 2022 22:01

@zasdfgbnm (Collaborator Author) commented:

Marking this as ready for review to start getting feedback. cc @csarofeen @shmsong

namespace fuser {
namespace cuda {

// Note [Transpose scheduling]

zasdfgbnm (Collaborator Author):

This note could be a good starting point for review


// x->transpose--add->z
// y->broadcast-/
TEST_F(NVFuserTest, FusionScheduleTransposeBroadcast_CUDA) {

@zasdfgbnm (Collaborator Author) commented Jul 28, 2022:

Silent wrong result, working on a minimum repro

zasdfgbnm (Collaborator Author):

#1880 to track


// x->broadcast--add->z
// y->broadcast-/
TEST_F(NVFuserTest, FusionScheduleBroadcastOnly_CUDA) {

zasdfgbnm (Collaborator Author):

Indexing error, working on a minimum repro


#1873 to track this one.

zasdfgbnm (Collaborator Author):

Thanks, @shmsong, for the help and fast response on this. I think both the scheduler and the indexing are doing something wrong. I will push new commits to my scheduler to detect this and error out early, and #1873 should be sufficient to track the indexing issue.

@zasdfgbnm (Collaborator Author) commented Jul 29, 2022

Remaining things

  • Fix FusionScheduleTransposeBroadcast_CUDA (silent wrong result)
  • Add many tests with broadcasting
  • Investigate trivial reduction support
  • Add many tests on view
  • Investigate small tile sizes to support the case where the transposed dims are small
  • Investigate shared memory usage, bank conflicts, etc.
  • Fill in canScheduleRunTime
  • Benchmark on more devices, A100 and V100

@csarofeen (Owner) left a comment:

LGTM

// t5 = cos(t1)
// outputs: t3, t4, t5
//
// Then we should have group {t0, t3, t4} and {t1, t5}

csarofeen (Owner):

Are you sure? I was thinking about a case where we had:

tv0[i0, i1,  i2]
tv1[i0, i2, i1]
tv2[i0]

If we're simply broadcasting tv2 on dimensions not related to the inner dimensions of tv0 and tv1, wouldn't it be fine to just ignore that tv2 is even there? It seems it would give decent perf if we just schedule based on tv0 and tv1.
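
For illustration only, here is a small standalone sketch (not nvfuser code; all names are hypothetical) of the kind of grouping being discussed: tensors are bucketed by whichever axis ends up innermost, and a tensor like tv2 whose axes never reach either inner dimension lands in a bucket of its own, which the scheduler could then choose to ignore.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical illustration (not nvfuser code): describe each tensor by the
// order of its logical axes and bucket tensors by the innermost one.
struct TensorDesc {
  std::string name;
  std::vector<int> axisOrder;  // last entry = innermost axis
};

std::map<int, std::vector<std::string>> groupByInnerAxis(
    const std::vector<TensorDesc>& tensors) {
  std::map<int, std::vector<std::string>> groups;
  for (const auto& t : tensors) {
    groups[t.axisOrder.back()].push_back(t.name);  // key on the inner axis
  }
  return groups;
}

// groupByInnerAxis({{"tv0", {0, 1, 2}}, {"tv1", {0, 2, 1}}, {"tv2", {0}}})
// yields {0: {"tv2"}, 1: {"tv1"}, 2: {"tv0"}}: tv0 and tv1 each anchor a
// group, while tv2 ends up alone and could simply follow either reference.
```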

SetSelector selector({all_tvs_except1.begin(), all_tvs_except1.end()});
MaxRootDomainInfoSpanningTree entire_dag_except1(reference2, &selector);
TransformPropagator propagator(reference2);
entire_dag_except1.traverse(&propagator);

csarofeen (Owner):

I think what I was missing (which makes a lot of sense) is that you're overwriting the major portion of the DAG, since you want to schedule things similarly (based on common transformations) that in the end won't be connected. So what you're doing is: transform everything consistently with one reference (blocking out the other reference), reschedule the other reference, then transform everything (except what you want to keep transformed as the first reference) to match the other reference.

I just conceptually missed the part where the middle section of the graph is re-transformed.
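
To make the ordering concrete, here is a rough sketch of the two passes being described, reusing the propagation utilities from the quoted snippet. Only the second pass is quoted from the PR; the first pass and names such as reference1, all_tvs_except2, and the group sets are assumed for illustration.

```cpp
// Pass 1 (illustrative): propagate reference1's transformations to the whole
// DAG except the tensors that will later follow reference2.
SetSelector selector1({all_tvs_except2.begin(), all_tvs_except2.end()});
MaxRootDomainInfoSpanningTree entire_dag_except2(reference1, &selector1);
TransformPropagator propagator1(reference1);
entire_dag_except2.traverse(&propagator1);

// ... reference2 is then rescheduled around its own inner dimension ...

// Pass 2 (the quoted code above): propagate reference2's transformations
// everywhere except the tensors that must keep reference1's schedule.
SetSelector selector2({all_tvs_except1.begin(), all_tvs_except1.end()});
MaxRootDomainInfoSpanningTree entire_dag_except1(reference2, &selector2);
TransformPropagator propagator2(reference2);
entire_dag_except1.traverse(&propagator2);
```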


// Only unroll the axes that exactly map to the unrolled axes
// on the reference, as support for permissively mapped axes is
// not yet clearly defined.

csarofeen (Owner):

I'm a little surprised: is it producing invalid results if you unroll non-exactly-mapped axes?

zasdfgbnm (Collaborator Author):

Yes, exactly. This is the workaround @shmsong pushed.

@zasdfgbnm merged commit b7435af into devel on Aug 11, 2022
@zasdfgbnm deleted the transpose-schedule branch on August 11, 2022 04:03
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. removes unnecessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch: fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to work around (WAR) the GitHub API.
Commits that are actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: pytorch#83067
Approved by: https://github.com/davidberard98
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes include:

- codegen improvements:
  1. double support in expression evaluator
- bug fixes:
  1. dropout fix - rework RNG to support broadcasted dropout (Fixes pytorch#82784)
  2. expand fix - Patch expand+reduction, expand+view, rework view analysis and guard
- scheduler:
  1. manual transpose schedule example
  2. WIP transpose scheduler

Commits that are in this PR from the devel branch:

```
b7435af Transpose scheduler, step 1 (#1854)
8a45dbf Add an example on how to manually schedule transpose (#1889)
83dbf56 Patch dropout fix (#1898)
69d3519 Expand+Reduction, Expand+View support, rework View analysis and guards (#1883)
15091c4 Rework RNG to correctly support broadcasted dropout (#1888)
aafe2d0 Make ExpressionEvaluator support Double (#1885)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38657074](https://our.internmc.facebook.com/intern/diff/D38657074)
Pull Request resolved: pytorch#83239
Approved by: https://github.com/davidberard98