
Conversation


@shmsong commented on Jul 15, 2022

(The majority of the changes in this PR are test cases.) Conceptually, this is just a generalization of the double buffering logic to handle circular buffering.

There are other usage patterns of cp.async as well; this one mainly focuses on the use case in the matmul main loop.

Note:

Conceptually, a circular buffer is a generalization of a double buffer: a double buffer is just a circular buffer with a depth of 2. In terms of use cases, however, circular buffering only really makes sense with cp.async on Ampere, while double buffering is much more widely applicable.

For more details, see the comment section [Circular buffer]. A rough sketch of the idea follows below.
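For intuition only, here is a minimal hand-written sketch of the pattern, not nvfuser-generated code: a circular-buffered gmem-to-smem pipeline of depth D built on the CUDA cp.async primitives from `cuda_pipeline_primitives.h`. The depth, tile size, kernel name, and the trivial "compute" are made-up placeholders; the real matmul main loop feeds mma ops instead.

```
// Minimal sketch (not nvfuser output): circular buffering of gmem->smem loads
// with cp.async primitives. With D == 2 this degenerates to double buffering.
#include <cuda_pipeline_primitives.h>

constexpr int D    = 4;   // stage depth (placeholder value)
constexpr int TILE = 128; // elements per tile; launch with blockDim.x == TILE

__global__ void circular_buffer_sketch(
    const float* gmem, float* out, int num_tiles /* assumed >= D */) {
  __shared__ float smem[D][TILE];

  // Prolog: issue async loads for the first D - 1 tiles.
  for (int stage = 0; stage < D - 1; ++stage) {
    __pipeline_memcpy_async(&smem[stage][threadIdx.x],
                            &gmem[stage * TILE + threadIdx.x], sizeof(float));
    __pipeline_commit();
  }

  float acc = 0.f;
  // Main loop: consume tile i while prefetching tile i + D - 1.
  for (int i = 0; i < num_tiles; ++i) {
    const int prefetch = i + D - 1;
    if (prefetch < num_tiles) { // predicate: the last D - 1 iterations load nothing
      __pipeline_memcpy_async(&smem[prefetch % D][threadIdx.x],
                              &gmem[prefetch * TILE + threadIdx.x], sizeof(float));
    }
    __pipeline_commit();          // one (possibly empty) group per iteration
    __pipeline_wait_prior(D - 1); // by now tile i has landed in shared memory
    __syncthreads();
    acc += smem[i % D][threadIdx.x]; // stand-in for the mma math on tile i
    __syncthreads(); // buffer i % D gets overwritten by the next prefetch
  }
  out[threadIdx.x] = acc;
}
```

With D == 2 the same structure is the familiar double-buffer pattern, which is the sense in which this PR treats circular buffering as a pure generalization; the asynchronous cp.async loads are what make depths greater than 2 worthwhile on Ampere.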

Base automatically changed from matmul_propagator to devel July 23, 2022 21:41
@shmsong changed the base branch from devel to fragment_iter July 25, 2022 23:00
@shmsong changed the title from "WIP: [Not ready for review] Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load" to "Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load" Jul 26, 2022
@shmsong requested a review from csarofeen July 26, 2022 06:56
Base automatically changed from fragment_iter to devel July 29, 2022 18:29
@csarofeen (Owner) left a comment:

LGTM, just minor comments.

}

// MMA unit test on Ampere
TEST_F(NVFuserTest, FusionAmpereMMATT_CUDA) {
Owner:

Is this test case using circular or double buffering?

Owner:

Never mind, blaming how git is rendering the changes.

Author:

Yes, it's just rebasing noise. Will sort them out and clean up all the tests in follow-up PRs.

TORCH_CHECK(cg_outputs[0].allclose(tref, 0.0001, 0.0001));
}

// Matmul test on Ampere
Owner:

Maybe I'm missing what's new that you're testing in the Ampere tests before this?

Owner:

Never mind, it looks like it's just the way GitHub decided to mark the new tests; it makes these tests look like they're being modified below.

}

// Matmul test on Turing
TEST_F(NVFuserTest, FusionTuringMatmulTNRegDoubleBuffer_CUDA) {
Owner:

Is double buffering on by default? Don't see it turned on here.

Author:

No, it's not. There is an option in the matmul schedule params that controls it. This was a git rendering issue before rebasing.

//
// for i in 0..N: // main loop
// for j in ...
// if pred:
Owner:

The pred protects the first iteration and the last D-1 iterations, right?

Author:

Yes, that's correct. The last D-1 iterations of the main loop would not load new data.
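A quick worked example of that claim (purely illustrative, with arbitrary D and N, and assuming as in the sketch above that main-loop iteration i prefetches tile i + D - 1):

```
// Illustrative only: show which main-loop iterations issue a new load for an
// assumed stage depth D and trip count N; the last D - 1 iterations do not.
#include <cstdio>

int main() {
  constexpr int D = 3; // stage depth (arbitrary)
  constexpr int N = 8; // main-loop trip count (arbitrary)
  for (int i = 0; i < N; ++i) {
    const bool loads = (i + D - 1) < N; // load predicate
    std::printf("iter %d: %s\n", i, loads ? "cp.async tile" : "no load");
  }
  return 0;
}
```

For D = 3 and N = 8 this prints loads for iterations 0 through 5 and "no load" for iterations 6 and 7, i.e. the last D - 1 iterations.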

//
// (Epilog omitted since this only makes sense in using
// cp.async, where producer will be in global mem and consumer will
// be in shared mem).
Owner:

I'm not sure I follow this, isn't it just not necessary based on how you set the loops/predicates up?

Author:

The current double buffer/circular buffer infra would generate an epilog if the double-buffered tensor is loading from shared mem. This comment is just saying that cp.async only reads from gmem, so in this case there wouldn't be an epilog.

It could change over time.
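For illustration, the two loop shapes being contrasted look roughly like the pseudocode below (written in the style of the kernel comment quoted above; the exact structure nvfuser generates may differ, and the epilog shape here is an assumption based on this discussion):

```
// Producer in shared mem (regular double/circular buffering): an epilog is generated.
//
//   prolog : load stage 0 .. D-2
//   main   : for i in 0 .. N-D    : load stage i+D-1 ; compute stage i
//   epilog : for i in N-D+1 .. N-1:                    compute stage i   // no loads
//
// Producer in global mem via cp.async: no epilog; the main loop runs all N
// iterations and the load is simply predicated off for the last D-1 of them.
//
//   prolog : cp.async stage 0 .. D-2
//   main   : for i in 0 .. N-1    : if (i+D-1 < N) cp.async stage i+D-1 ; compute stage i
```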

// prolog loop as well covering the first N-1 iterations, N being the
// stage depth.
if (!is_prolog || is_circular_buffer_loop) {
if (is_circular_buffer_loop && is_prolog) {
Owner:

Nit: Just looks strange that you switched the position of these in subsequent if statements.

Author:

Updated. Thanks!

@shmsong merged commit d863d69 into devel Jul 31, 2022
@shmsong deleted the circular_buffering branch July 31, 2022 05:36
jjsjann123 added a commit that referenced this pull request Aug 29, 2022
Syncing nvfuser devel branch to upstream master. https://github.com/csarofeen/pytorch/

Code changes includes:

- codegen improvements:
  1. removes unnecessary sync from redundant thread compute analysis
  2. symmetric API for BestEffortReplay
  3. support merge on trivial reductions
  4. Ampere async copy improvements
- bug fixes:
  1. vectorization bug fixes
  2. type inference patch : fixes upstream pytorch#81725
  3. segmenter bug fix with deterministic iteration ordering
- parser update
  1. added leaky_relu
- scheduler
  1. normalization scheduler clean up.
  2. simplifies matmul scheduling with new transform propagator
  3. merge all dimensions in PW scheduler
  4. various gemm related improvements
- debuggability
  1. nsight compute support
  2. debug dump for InlinePropagator
  3. Add `UnaryOpType::Print`

Squashed commits to WAR github API
Commits that are actually in this PR from the devel branch:

```
dfe02f3 Merge remote-tracking branch 'csarofeen/devel' into HEAD
1617373 Add `TensorViewBuilder::shape(std::vector<Val*> shape)` (#1884)
7cfb779 Merge pull request #1887 from csarofeen/upstream_merge_0803
3399f6d Merge remote-tracking branch 'origin/viable/strict' into HEAD
01208f5 Add `UnaryOpType::Print` which can be helpful for debugging (#1878)
0646522 Remove redundant TORCH_INTERNAL_ASSERT in lower_magic_zero.cpp (#1881)
7bc76aa Fix most inlined propagator for mismatched dims (#1875)
501f4aa Nonaffine swizzle formulation ep.2: Loop swizzle variant. (#1826)
d863d69 Ampere async copy ep.2: circular buffering extension to support pipelined matmul operand load (#1827)
e0ae11a Larger sized mma instructions to support full vectorization (#1824)
9bb4cf7 fragment iteration to support fully unrolled mma ops (#1823)
a48270a Merge all dims in pointwise scheduler (#1872)
172fb36 Make MostInlined and BestEffort inline propagation no longer assert replayed (#1868)
a64462a Allow trivial reduction to be merged (#1871)
440102b Symmetric API for BestEffortReplay (#1870)
d1caf33 Some misc cleanups/refactor split out from #1854 (#1867)
1013eda Remove some welford specific logic. (#1864)
51589d3 Some cleanups on tests and heuristics params (#1866)
a6b3e70 Segmenter bug fix, and deterministic iteration ordering.  (#1865)
1b665b9 Add nullptr checks to IrBuilder (#1861)
1cd9451 Simplify matmul scheduling with the new transform propagator.  (#1817)
bbc1fb9 Add leaky_relu operation (#1852)
e842a9b Minor cleanup in pointwise scheduler (#1858)
9ee850c Fix stringstream usage (#1857)
20a36c1 Improve nsight compute support (#1855)
4059103 Remove debugging `true ||` from getPointwiseHeuristics (#1822)
01117bf Misc cleanup (#1853)
5cc6494 Apply the magic-zero protection to each indexed domain individually for predicate indexing (#1846)
92e6f02 Cleanup normalization scheduler (#1845)
db89c65 Type inference patch (#1848)
102fe93 Add debug dump for InlinePropagator (#1847)
b7a4d93 Redundant thread compute analysis to avoid un-necessary sync insertion (#1687)
942be5b Upstream ci build fixes (#1842)
0b83645 Fix vectorization bug introduced in #1831 (#1840)
63630f1 Move MaxProducerPosUpdater into InlinePropagator::tearDown (#1825)
9135a96 Fix transpose benchmark dtype (#1839)
2c9a6c0 Add extra configurability to `parallelizeAllLike` (#1831)
```

RUN_TORCHBENCH: nvfuser

Differential Revision: [D38543000](https://our.internmc.facebook.com/intern/diff/D38543000)
Pull Request resolved: pytorch#83067
Approved by: https://github.com/davidberard98