
Nvfuser code bump 030122 #73627


Closed
wants to merge 690 commits

Conversation

@jjsjann123 (Collaborator) commented Mar 2, 2022

Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR; hence the compatibility list has been updated.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

naoyam and others added 30 commits September 28, 2021 16:38
* Use WARP_SIZE instead of 32
* Fix computation of thread predicate with broadcast

Previously, a broadcast input would reset the thread predicate of any other input.
Channels Last support in nvfuser

Background:
To support channels last in nvfuser with optimal performance, we want to allow dimension collapsing in generated code on channels-last tensors, which greatly simplifies indexing.
The current codegen API only allows dimension collapsing on neighboring axes. The unfortunate thing is that memory format in PyTorch is implicitly marked by strides, while the semantic meaning of axes remains unchanged. E.g., a 4D tensor with axes [N, C, H, W] has the same shape in both formats, while a contiguous tensor carries strides [CHW, HW, W, 1] and a channels-last tensor carries [HWC, 1, WC, C].
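
As a concrete illustration of the stride layouts described above, here is a minimal PyTorch sketch (the sizes are arbitrary and not from this PR):

```
import torch

n, c, h, w = 2, 3, 4, 5
x = torch.randn(n, c, h, w)                  # contiguous NCHW tensor
y = x.to(memory_format=torch.channels_last)  # same shape, channels-last strides

print(x.shape == y.shape)  # True: axes keep their [N, C, H, W] meaning
print(x.stride())          # (60, 20, 5, 1)  == (C*H*W, H*W, W, 1)
print(y.stride())          # (60, 1, 15, 3)  == (H*W*C, 1, W*C, C)
```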

Approach:
We identify input tensors in channels-last format and permute them to NHWC. This creates an inconsistency between codegen tensors and TorchScript tensors, so our parser handles and propagates memory format accordingly: it consumes and produces channels-last tensors when it can, and otherwise transposes inputs back to the original format and emits non-permuted outputs.
Fusion inputs/outputs in channels-last format are marked and permuted before/after fusion execution to ensure correctness at the interface between nvfuser and TorchScript.

Add a simple C++ test to ensure simplified indexing in the generated code.
Add Python tests to verify that NHWC fp16 inputs are handled properly (this was already handled in a recent bfloat PR).
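
A minimal sketch of why the NHWC permutation enables neighboring-axis collapsing, using only public PyTorch APIs (this is illustrative, not the parser code itself):

```
import torch

x = torch.randn(2, 3, 4, 5).to(memory_format=torch.channels_last)
print(x.is_contiguous())      # False: NCHW view over NHWC memory
nhwc = x.permute(0, 2, 3, 1)  # no copy; just reorders the axes to NHWC
print(nhwc.is_contiguous())   # True: neighboring axes can now collapse
print(nhwc.stride())          # (60, 15, 3, 1)
```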
… (pytorch#1170)

* Revert "Revert D30752939: [pytorch][PR] nvfuser update" (pytorch#65137)

Summary:
This reverts commit 03389dc.

Attempt again for PR: pytorch#63745
Fixes the windows build failure.

Pull Request resolved: pytorch#65137

Reviewed By: seemethere, dzhulgakov, heitorschueroff

Differential Revision: D30994556

Pulled By: malfet

fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d

* review comments addressed

* clang-tidy non-private member variables

* clang-format

* quick fix on skipping logic
Fixes pytorch#1129

Thread predicates were missing when generating unswitch conditions. This PR collects thread predicates from unswitched expressions, merges them, and appends the merged predicate to the generated unswitch Bool val.

The main new logic is the merging of thread predicates at: ThreadPredicateMap::mergeForUnswitch. Other changes are mostly minor cosmetic ones.

Co-authored-by: Naoya Maruyama <[email protected]>
Co-authored-by: jiej <[email protected]>
…xing (pytorch#1152)

Indices of unused loops are mapped to zero, so that fact is currently
used to find which loops are not used. This is fine for now, but not
if shift and unswitch are combined. With shift, a lower bound
position may need to be predicated as well, so that loop would get zero
as its index, even though the loop is used. To disambiguate this,
zero_loops and zero_domains are explicitly managed starting from
indexMapFromTV.
…ions (pytorch#1131)

Fixes pytorch#1102 

This PR implements the second approach mentioned in pytorch#1102. For example, indexing and predicates are changed from:

```
      = T0[(((((nvfuser_index_t)blockIdx.x) * ((nvfuser_index_t)blockDim.y)) + ((nvfuser_index_t)threadIdx.y)) * T0.stride[0])]
```

to:

```
 = T0[(((((nvfuser_index_t)blockIdx.x) * 4) + ((nvfuser_index_t)threadIdx.y)) * T0.stride[0])]
```

The use of `blockDim.y` is replaced by the extent of the second axis of `T0`, which is `4` in this case. This change only matters when a parallel type is not exact (in this case `TIDy`). 

The indexing change only required modifying `getExtent` in index_compute.cpp. However, we also need to predicate `threadIdx` and `blockIdx` to be smaller than the IterDomain extents; that is implemented as `ParallelizedDomainPredicate` in predicate_compute.h.
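
As a rough illustration, the following pure-Python sketch simulates the difference (the loop variables are hypothetical stand-ins for the generated CUDA; only `ParallelizedDomainPredicate` is named in the PR):

```
blockDim_y = 8  # launch size, taken from the largest TIDy extent in the fusion
T0_extent = 4   # extent of T0's axis bound to TIDy (the "4" in the snippet above)

for blockIdx_x in range(2):
    for threadIdx_y in range(blockDim_y):
        if threadIdx_y >= T0_extent:
            continue  # ParallelizedDomainPredicate: threadIdx.y < T0's extent
        old = blockIdx_x * blockDim_y + threadIdx_y  # blockDim.y-based: strides past T0's rows
        new = blockIdx_x * T0_extent + threadIdx_y   # extent-based: dense, correct index
        print(f"bid={blockIdx_x} tid={threadIdx_y}: old={old} new={new}")
```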
* Allow setting contiguity of tensors
…ins (pytorch#1160)

* Place unswitched shared memory allocations outside of unswitched domains

In lower allocations, the positions of allocation and initialization are tracked separately. They are the same except for unswitched shared memory allocations.
code cleaning for clang-tidy
…ces" (pytorch#66176) (pytorch#1178)

Summary:
enabling pytorch#63940

Pull Request resolved: pytorch#66176

Reviewed By: ngimel

Differential Revision: D31423920

Pulled By: dzhulgakov

fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
* Thread predicate map must be created before validating parallelization

* Use the loop map to find corresponding axes in validating
parallelization between producers and consumers
…65064) (pytorch#1179)

Summary:
Pull Request resolved: pytorch#65064

The problem appears when nvfuser is triggered from LazyTensor.
Because LT maintains its own thread pool, the thread used for first-time
compilation initializes the CUDA context properly, but later
cached execution may use a different thread that does not have
a proper CUDA context.

Test Plan: Imported from OSS

Reviewed By: saketh-are

Differential Revision: D31269691

Pulled By: desertfire

fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2

Co-authored-by: Bin Bao <[email protected]>
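
A hedged sketch of the access pattern that exposed the bug; the scripted function and sizes are made up, and this requires a CUDA-enabled build:

```
import threading
import torch

@torch.jit.script
def f(a, b):
    return (a + b).relu()

def run():
    x = torch.randn(1024, device="cuda")
    y = torch.randn(1024, device="cuda")
    return f(x, y)

if torch.cuda.is_available():
    run()                                  # compile + CUDA context init on this thread
    worker = threading.Thread(target=run)  # cached execution on a fresh thread,
    worker.start()                         # which starts with no CUDA context bound
    worker.join()
```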
* Inline thread predicates even when unswitched
Support the dtype argument in softmax and softmax backward to accommodate the no-fusion issue with the updated LTC IR.
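
For reference, the public API surface this targets is the `dtype` argument of `torch.nn.functional.softmax` (a minimal sketch; the LTC integration itself is not shown):

```
import torch
import torch.nn.functional as F

x = torch.randn(8, 16)
# dtype performs the cast inside the op, so the cast stays part of the softmax
# node instead of becoming a separate op that breaks the fusion pattern.
y = F.softmax(x, dim=-1, dtype=torch.float64)
print(y.dtype)  # torch.float64
```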
…predicates (pytorch#1182)

* Predicating threadIdx/blockIdx at unswitch isn't necessary

When generating unswitch predicates, maximum index values are used to
generate predicates at root domains, so it's redundant to predicate
threadIdx/blockIdx at leaf domains even for non-exact threading
dimensions.
…gic (pytorch#1145)

* Merge the predication logic for shift/gather into the main one

One of the major changes needed to integrate the predicate logic for
shift/gather is to support predication at the start position of an
IterDomain. Because of that, there's a lot of "start_xyz" and
"stop_xyz".

Another complexity comes from the extension to support unswitching with
shift/gather. In addition to start and stop predicates, the existence of
halo means that the expressions unswitched at a domain may have
different halo sizes (or none at all), so picking an arbitrary
predicate per predicated root domain (and its parallelization) does
not work. The most naive approach would be to gather all of the
predicates for halo-extended root domains, but that is not efficient
since some would be redundant.

What's done in this PR is to select the most restrictive
predicate by comparing the deviation from the baseline predicate.
Suppose one stop predicate is composed as "x < extent". With halo, it
would look like "x + a < extent", where "a" varies based on the halo
width of the predicated domain. When "a" is a static constant, we find
the maximum value and use only that predicate, since it is the most
restrictive one. Start predicates are analyzed similarly.
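
A small sketch of that selection rule (the halo constants and extent here are hypothetical; the actual analysis operates on nvfuser IR):

```
# Stop predicates share the form "x + a < extent"; with static halo widths, the
# predicate with the maximum "a" implies all the others, so only it is kept.
halo_offsets = [0, 1, 3]   # hypothetical per-expression halo constants "a"
a_max = max(halo_offsets)  # most restrictive deviation from the baseline

extent = 10
valid = [x for x in range(extent) if x + a_max < extent]
print(valid)               # [0, 1, ..., 6]: safe for every unswitched expression
```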
* Keep NVTX on by default. Use PYTORCH_NVFUSER_DISABLE_NVTX to disable it
Create type_promotion tests for unary, binary, and ternary ops
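
A minimal example of the promotion behavior such tests exercise, using the public `torch.result_type` API (the actual nvfuser test contents are not reproduced here):

```
import torch

i = torch.ones(4, dtype=torch.int32)
h = torch.ones(4, dtype=torch.float16)

# A fused kernel must reproduce eager-mode type promotion exactly.
print(torch.result_type(i, h))  # torch.float16
print((i + h).dtype)            # torch.float16, matching result_type
```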
* Rename test_data_compatibility to test_unary_ops
Make sure benchmark sizes are built out and as consistent as possible. Add backwards benchmarks for BatchNorm, LayerNorm, and Softmax.
@facebook-github-bot (Contributor)

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@jjsjann123 (Collaborator, Author)

Wondering if there's any update~

@eellison (Contributor)

@jjsjann123 looks like a merge conflict :/

@facebook-github-bot (Contributor)

@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@eellison (Contributor)

Getting an error internally:

//caffe2/test/cpp/jit:jit references non-existing file or directory 'caffe2/test/cpp/jit/test_gpu_validator.h'

@eellison (Contributor)

Maybe

#include "test_gpu_validator.h"
should use the full include path?

@jjsjann123 (Collaborator, Author) commented Mar 28, 2022

Maybe

#include "test_gpu_validator.h"

should use the full include path?

I thought I had fixed that already... Let me take another pass.

Forwarding the question from Slack in case someone else knows the answer:

Looks a little strange here. I think that file has already been renamed and moved to the nvfuser directory, and the include has been updated to use the full path.

A stale build somehow?

@jjsjann123 (Collaborator, Author)

Per the README here https://github.com/pytorch/pytorch/blob/4537f6c9eba4d016f31f352f97056a7594d83142/test/cpp/jit/README.md, cpp JIT tests should be placed under that folder. Is that a hard requirement? I don't see anything requiring it in the CMakeLists.txt.

I'm trying to move our specific test files to a different folder, which works for me locally (relevant changes are here):
29693a6
I've also changed the include of test_gpu_validator.h to use the full path
#include <torch/csrc/jit/codegen/cuda/test/test_gpu_validator.h>.

But somehow this still fails in the internal build; the error is:

references non-existing file or directory 'caffe2/test/cpp/jit/test_gpu_validator.h'

Wondering if there's some internal build file that should be updated? cc'ing @malfet @eellison

@facebook-github-bot (Contributor)

@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot pushed a commit that referenced this pull request Mar 31, 2022
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR; hence the compatibility list has been updated.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

Pull Request resolved: #73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
@github-actions (Contributor)

Hey @jjsjann123.
You've committed this PR, but it does not have both a 'release notes: ...' and 'topics: ...' label. Please add one of each to the PR. The 'release notes: ...' label should represent the part of PyTorch that this PR changes (fx, autograd, distributed, etc) and the 'topics: ...' label should represent the kind of PR it is (not user facing, new feature, bug fix, perf improvement, etc). The list of valid labels can be found here for the 'release notes: ...' and here for the 'topics: ...'.
For changes that are 'topic: not user facing' there is no need for a release notes label.

davidberard98 added commits to pytorch/benchmark that referenced this pull request (three on Apr 5 and three on Apr 15, 2022; ghstack, resolved as #801), each noting:

Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it.
facebook-github-bot pushed a commit to pytorch/benchmark that referenced this pull request Apr 19, 2022
Summary:
Pull Request resolved: #801

Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it.

Test Plan: Imported from OSS

Reviewed By: eellison

Differential Revision: D35732497

Pulled By: davidberard98

fbshipit-source-id: 6dae74e628def71344958c93a8907f48539ae2fb
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Oct 29, 2022
Summary:
Things changed in this PR that require review:

test/forward_backward_compatibility/check_forward_backward_compatibility.py

Our previous function overload extension names were wrong and have been updated in this PR; hence the compatibility list has been updated.

nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.

Pull Request resolved: pytorch/pytorch#73627

Reviewed By: Chillee

Differential Revision: D34765458

Pulled By: davidberard98

fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
(cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
jjsjann123 added a commit to jjsjann123/nvfuser that referenced this pull request Nov 10, 2022
Summary: same as the Oct 29, 2022 commit above (Differential Revision: D34765458; cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e).
Labels
cla signed · oncall: jit · open source · triaged