
Nvfuser code bump 030122 #73627


Closed
wants to merge 690 commits

Conversation

@jjsjann123 (Collaborator) commented Mar 2, 2022

Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR; hence the compatibility list has been updated.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

naoyam and others added 30 commits September 28, 2021 16:38
* Use WARP_SIZE instead of 32
* Fix computation of thread predicate with broadcast

Previously, a broadcast input would reset the thread predicate of any other input.
Channels Last support in nvfuser

Background:
To support channels last in nvfuser with optimal performance, we want to allow dimension collapsing in generated code on channels-last tensors, which greatly simplifies indexing.
The current codegen API only allows dimension collapsing on neighboring axes. The unfortunate thing is that memory format in PyTorch is implicitly marked by strides, while the semantic meaning of axes remains unchanged. E.g., a 4D tensor with axes [N, C, H, W] has the same shape in both formats, while a contiguous tensor carries strides [CHW, HW, W, 1] and a channels-last tensor carries [HWC, 1, WC, C].
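
As a concrete illustration of the stride layouts described above, here is a minimal PyTorch sketch (the sizes are arbitrary and not from this PR):

```
import torch

n, c, h, w = 2, 3, 4, 5
x = torch.randn(n, c, h, w)                  # contiguous NCHW tensor
y = x.to(memory_format=torch.channels_last)  # same shape, channels-last strides

print(x.shape == y.shape)  # True: axes keep their [N, C, H, W] meaning
print(x.stride())          # (60, 20, 5, 1)  == (C*H*W, H*W, W, 1)
print(y.stride())          # (60, 1, 15, 3)  == (H*W*C, 1, W*C, C)
```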

Approach:
We identify input tensors in channels-last format and permute them to NHWC. This creates an inconsistency between codegen tensors and TorchScript tensors, so our parser handles and propagates memory format accordingly: it consumes and produces channels-last tensors when it can, and otherwise transposes inputs back to the original format and emits non-permuted outputs.
Fusion inputs/outputs in channels-last format are marked and permuted before/after fusion execution to ensure correctness at the interface between nvfuser and TorchScript.

Add a simple C++ test to ensure simplified indexing in the generated code.
Add Python tests to verify that NHWC fp16 inputs are handled properly (this was already handled in a recent bfloat PR).
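
A minimal sketch of why the NHWC permutation enables neighboring-axis collapsing, using only public PyTorch APIs (this is illustrative, not the parser code itself):

```
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(x.is_contiguous())      # False: NCHW view over NHWC memory
nhwc = x.permute(0, 2, 3, 1)  # no copy; just reorders the axes to NHWC
print(nhwc.is_contiguous())   # True: neighboring axes can now collapse
print(nhwc.stride())          # (60, 15, 3, 1)
```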
… (pytorch#1170)

* Revert "Revert D30752939: [pytorch][PR] nvfuser update" (pytorch#65137)

Summary:
This reverts commit 03389dc.

Attempt again for PR: pytorch#63745
Fixes the windows build failure.

Pull Request resolved: pytorch#65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d

* review comments addressed

* clang-tidy non-private member variables

* clang-format

* quick fix on skipping logic
Fixes pytorch#1129

Thread predicates were missing when generating unswitch conditions. This PR collects thread predicates from unswitched expressions, merges them, and appends the merged predicate to the generated unswitch Bool val.

The main new logic is the merging of thread predicates at: ThreadPredicateMap::mergeForUnswitch. Other changes are mostly minor cosmetic ones.

Co-authored-by: Naoya Maruyama <[email protected]>
Co-authored-by: jiej <[email protected]>
…xing (pytorch#1152)

Indices of unused loops are mapped to zero, so that fact is currently
used to find which loops are not used. This is fine for now, but not
if shift and unswitch are combined. With shift, a lower bound
position may need to be predicated as well, so that loop would get zero
as its index, even though the loop is used. To disambiguate this,
zero_loops and zero_domains are explicitly managed starting from
indexMapFromTV.
…ions (pytorch#1131)

Fixes pytorch#1102 

This PR implements the second approach mentioned in pytorch#1102. For example, indexing and predicates are changed from:

```
      = T0[(((((nvfuser_index_t)blockIdx.x) * ((nvfuser_index_t)blockDim.y)) + ((nvfuser_index_t)threadIdx.y)) * T0.stride[0])]
```

to:

```
 = T0[(((((nvfuser_index_t)blockIdx.x) * 4) + ((nvfuser_index_t)threadIdx.y)) * T0.stride[0])]
```

The use of `blockDim.y` is replaced by the extent of the second axis of `T0`, which is `4` in this case. This change only matters when a parallel type is not exact (in this case `TIDy`). 

The indexing change only required modifying `getExtent` in index_compute.cpp. However, we also need to predicate `threadIdx` and `blockIdx` to be smaller than the IterDomain extents; that is implemented as `ParallelizedDomainPredicate` in predicate_compute.h.
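
As a rough illustration, the following pure-Python sketch simulates the difference (the loop variables are hypothetical stand-ins for the generated CUDA; only `ParallelizedDomainPredicate` is named in the PR):

```
blockDim_y = 8  # launch size, taken from the largest TIDy extent in the fusion
T0_extent = 4   # extent of T0's axis bound to TIDy (the "4" in the snippet above)

for blockIdx_x in range(2):
    for threadIdx_y in range(blockDim_y):
        if threadIdx_y >= T0_extent:
            continue  # ParallelizedDomainPredicate: threadIdx.y < T0's extent
        old = blockIdx_x * blockDim_y + threadIdx_y  # blockDim.y-based: strides past T0's rows
        new = blockIdx_x * T0_extent + threadIdx_y   # extent-based: dense, correct index
        print(f"bid={blockIdx_x} tid={threadIdx_y}: old={old} new={new}")
```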
* Allow setting contiguity of tensors
…ins (pytorch#1160)

* Place unswitched shared memory allocations outside of unswitched domains

In lower allocations, the positions of allocation and initialization are tracked separately. They are the same except for unswitched shared memory allocations.
code cleaning for clang-tidy
…ces" (pytorch#66176) (pytorch#1178)

Summary:
enabling pytorch#63940

Pull Request resolved: pytorch#66176

Reviewed By: ngimel

Differential Revision: D31423920

Pulled By: dzhulgakov

fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
* Thread predicate map must be created before validating parallelization

* Use the loop map to find corresponding axes in validating
parallelization between producers and consumers
…65064) (pytorch#1179)

Summary:
Pull Request resolved: pytorch#65064

The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for first-time
compilation initializes the CUDA context properly, but later
cached execution may use a different thread that does not have
a proper CUDA context.

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31269691

Pulled By: desertfire

fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2

Co-authored-by: Bin Bao <[email protected]>
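
A hedged sketch of the access pattern that exposed the bug; the scripted function and sizes are made up, and this requires a CUDA-enabled build:

```
import threading
import torch

@torch.jit.script
def f(a, b):
    return (a + b).relu()

def run():
    x = torch.randn(1024, device="cuda")
    y = torch.randn(1024, device="cuda")
    return f(x, y)

if torch.cuda.is_available():
    run()                                  # compile + CUDA context init on this thread
    worker = threading.Thread(target=run)  # cached execution on a fresh thread,
    worker.start()                         # which starts with no CUDA context bound
    worker.join()
```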
* Inline thread predicates even when unswitched
Support the dtype argument in softmax and softmax backward to accommodate the no-fusion issue with the updated LTC IR.
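
For reference, the public API surface this targets is the `dtype` argument of `torch.nn.functional.softmax` (a minimal sketch; the LTC integration itself is not shown):

```
import torch
import torch.nn.functional as F

x = torch.randn(8, 16)
# dtype performs the cast inside the op, so the cast stays part of the softmax
# node instead of becoming a separate op that breaks the fusion pattern.
y = F.softmax(x, dim=-1, dtype=torch.float64)
print(y.dtype)  # torch.float64
```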
…predicates (pytorch#1182)

* Predicating threadIdx/blockIdx at unswitch isn't necessary

When generating unswitch predicates, maximum index values are used to
generate predicates at root domains, so it's redundant to predicate
threadIdx/blockIdx at leaf domains even for non-exact threading
dimensions.
…gic (pytorch#1145)

* Merge the predication logic for shift/gather into the main one

One of the major changes needed to integrate the predicate logic for
shift/gather is to support predication at the start position of an
IterDomain. Because of that, there's a lot of "start_xyz" and
"stop_xyz".

Another complexity comes from the extension to support unswitching with
shift/gather. In addition to start and stop predicates, the existence of
halo means that the expressions unswitched at a domain may have
different halo sizes (or none at all), so picking an arbitrary
predicate per predicated root domain (and its parallelization) does
not work. The most naive approach would be to gather all of the
predicates for halo-extended root domains, but that is not efficient
since some would be redundant.

What's done in this PR is to select the most restrictive
predicate by comparing the deviation from the baseline predicate.
Suppose one stop predicate is composed as "x < extent". With halo, it
would look like "x + a < extent", where "a" varies based on the halo
width of the predicated domain. When "a" is a static constant, we find
the maximum value and use only that predicate, since it is the most
restrictive one. Start predicates are analyzed similarly.
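
A small sketch of that selection rule (the halo constants and extent here are hypothetical; the actual analysis operates on nvfuser IR):

```
# Stop predicates share the form "x + a < extent"; with static halo widths, the
# predicate with the maximum "a" implies all the others, so only it is kept.
halo_offsets = [0, 1, 3]   # hypothetical per-expression halo constants "a"
a_max = max(halo_offsets)  # most restrictive deviation from the baseline

extent = 10
valid = [x for x in range(extent) if x + a_max < extent]
print(valid)               # [0, 1, ..., 6]: safe for every unswitched expression
```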
* Keep NVTX on by default. Use PYTORCH_NVFUSER_DISABLE_NVTX to disable it
Create type_promotion tests for unary, binary, and ternary ops
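
A minimal example of the promotion behavior such tests exercise, using the public `torch.result_type` API (the actual nvfuser test contents are not reproduced here):

```
import torch

i = torch.ones(4, dtype=torch.int32)
h = torch.ones(4, dtype=torch.float16)

# A fused kernel must reproduce eager-mode type promotion exactly.
print(torch.result_type(i, h))  # torch.float16
print((i + h).dtype)            # torch.float16, matching result_type
```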
* Rename test_data_compatibility to test_unary_ops
Make sure benchmark sizes are built out and as consistent as possible. Add backwards benchmarks for BatchNorm, LayerNorm, and Softmax.
@facebook-github-bot (Contributor)

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123 (Collaborator, Author)

Wondering if there's any update~

@eellison (Contributor)

@jjsjann123 looks like a merge conflict :/

@facebook-github-bot (Contributor)

@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@eellison (Contributor)

Getting an error internally:

//caffe2/test/cpp/jit:jit references non-existing file or directory 'caffe2/test/cpp/jit/test_gpu_validator.h'

@eellison (Contributor)

Maybe

#include "test_gpu_validator.h"
should use the full include path?

@jjsjann123 (Collaborator, Author) commented Mar 28, 2022

Maybe

#include "test_gpu_validator.h"

should use the full include path?

I thought I had fixed that already... Let me take another pass.

Forwarding the question from Slack in case someone else knows the answer:

Looks a little strange here. I think that file has already been renamed and moved to the nvfuser directory, and the include has been updated to use the full path.

A stale build somehow?

@jjsjann123 (Collaborator, Author)

Per the README here https://github.com/pytorch/pytorch/blob/4537f6c9eba4d016f31f352f97056a7594d83142/test/cpp/jit/README.md, cpp JIT tests should be placed under that folder. Is that a hard requirement? I don't see anything requiring it in the CMakeLists.txt.

I'm trying to move our specific test files to a different folder, which works for me locally (relevant changes are here):
29693a6
I've also changed the include of test_gpu_validator.h to use the full path
#include <torch/csrc/jit/codegen/cuda/test/test_gpu_validator.h>.

But somehow this still fails in the internal build; the error is:

references non-existing file or directory 'caffe2/test/cpp/jit/test_gpu_validator.h'

Wondering if there's some internal build file that should be updated? cc'ing @malfet @eellison

@facebook-github-bot (Contributor)

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 31, 2022
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR; hence the compatibility list has been updated.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

Pull Request resolved: #73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
@github-actions (Contributor)

Hey @jjsjann123.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

davidberard98 added commits to pytorch/benchmark that referenced this pull request (three on Apr 5 and three on Apr 15, 2022; ghstack, resolved as #801), each noting:

Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it.
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Apr 19, 2022
Summary:
Pull Request resolved: #801

Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D35732497

Pulled By: davidberard98

fbshipit-source-id: 6dae74e628def71344958c93a8907f48539ae2fb
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR; hence the compatibility list has been updated.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

Pull Request resolved: pytorch/pytorch#73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary: same as the Oct 29, 2022 commit above (Differential Revision: D34765458; cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e).
Labels
cla signed · oncall: jit · open source · triaged