Nvfuser code bump 030122 #73627
Conversation
* Use WARP_SIZE instead of 32
* Fix computation of thread predicates with broadcast. Previously, a broadcasted input would reset the thread predicate of any other input.
Channels Last support in nvfuser. Background: To support channels last in nvfuser with optimal performance, we want to allow dimension collapsing in generated code on channels-last tensors, which greatly simplifies indexing. The current API in codegen only allows dimension collapsing on neighboring axes. The unfortunate thing is that memory format in PyTorch is implicitly marked by strides, while the semantic meaning of the axes remains unchanged: a 4d tensor with axes [N, C, H, W] has the same shape in both formats, while a contiguous tensor carries strides [CHW, HW, W, 1] and a channels-last tensor carries [HWC, 1, WC, C] (see the stride example below). Approach: We identify input tensors in channels-last format and permute them to NHWC. This creates an inconsistency between codegen tensors and TorchScript tensors, so our parser handles and propagates memory format accordingly, i.e. it consumes and produces channels-last tensors when it can, and otherwise transposes inputs back to the original format and emits non-permuted outputs. Fusion inputs/outputs in channels-last format are marked and permuted before/after fusion execution to ensure correctness at the interface between nvfuser and TorchScript. Added a simple cpp test to ensure simplified indexing in the generated code, and python tests to verify that NHWC fp16 inputs are handled properly (this was already handled in a recent bfloat PR).
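As a quick illustration of the stride behavior described above (plain PyTorch, not part of this PR; the tensor sizes are arbitrary):

```python
import torch

# Same logical shape [N, C, H, W]; only the strides differ between the formats.
x = torch.empty(2, 3, 4, 5)                         # contiguous (NCHW layout)
y = x.to(memory_format=torch.channels_last)         # channels-last (NHWC layout)

print(x.shape, x.stride())  # torch.Size([2, 3, 4, 5]) (60, 20, 5, 1)
print(y.shape, y.stride())  # torch.Size([2, 3, 4, 5]) (60, 1, 15, 3)
```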
… (pytorch#1170) * Revert "Revert D30752939: [pytorch][PR] nvfuser update" (pytorch#65137) Summary: This reverts commit 03389dc. Attempt again for PR: pytorch#63745 Fixes the windows build failure. Pull Request resolved: pytorch#65137 Reviewed By: seemethere, dzhulgakov, heitorschueroff Differential Revision: D30994556 Pulled By: malfet fbshipit-source-id: f1925b6c5cc1a1a441a96499667c91e8dfc1b53d * review comments addressed * clang-tidy non-private member variables * clang-format * quick fix on skipping logic
Fixes pytorch#1129. Thread predicates were missing when generating unswitch conditions. This PR collects thread predicates from unswitched expressions, merges them, and appends the merged predicate to the generated unswitch Bool val (see the sketch below). The main new logic is the merging of thread predicates in ThreadPredicateMap::mergeForUnswitch. Other changes are mostly minor cosmetic ones. Co-authored-by: Naoya Maruyama <[email protected]> Co-authored-by: jiej <[email protected]>
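A rough sketch of the merge idea (assumed semantics only, not the actual ThreadPredicateMap implementation; all names below are made up for illustration):

```python
# Each unswitched expression may carry a thread predicate, e.g. "threadIdx.x == 0"
# when only one thread holds a valid value (such as after a block reduction).
# Merging for unswitch unions those parallel dimensions and appends the
# corresponding "== 0" terms to the unswitch condition.

def merge_for_unswitch(thread_preds_per_expr):
    """thread_preds_per_expr: one set of parallel dims per unswitched expression."""
    merged = set()
    for preds in thread_preds_per_expr:
        merged |= preds
    return merged

def unswitch_condition(index_preds, merged_thread_preds):
    terms = list(index_preds) + [f"{dim} == 0" for dim in sorted(merged_thread_preds)]
    return " && ".join(terms)

merged = merge_for_unswitch([{"threadIdx.x"}, {"threadIdx.y"}])
print(unswitch_condition(["i < T0.size[0]"], merged))
# i < T0.size[0] && threadIdx.x == 0 && threadIdx.y == 0
```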
…xing (pytorch#1152) Indices of unused loops are mapped to zero, so that fact is currently used to find which loops are not used. This is fine for now, but not if shift and unswitch are combined. With shift, a lower bound position may need to be predicated as well, so that loop would get zero as its index, even though the loop is used. To disambiguate this, zero_loops and zero_domains are explicitly managed starting from indexMapFromTV.
…ions (pytorch#1131) Fixes pytorch#1102. This PR implements the second approach mentioned in pytorch#1102. For example, indexing and predicates are changed from:

```
= T0[(((((nvfuser_index_t)blockIdx.x) * ((nvfuser_index_t)blockDim.y)) + ((nvfuser_index_t)threadIdx.y)) * T0.stride[0])]
```

to:

```
= T0[(((((nvfuser_index_t)blockIdx.x) * 4) + ((nvfuser_index_t)threadIdx.y)) * T0.stride[0])]
```

The use of `blockDim.y` is replaced by the extent of the second axis of `T0`, which is `4` in this case. This change only matters when a parallel type is not exact (in this case `TIDy`). On the indexing side, only `getExtent` in index_compute.cpp needed to change. However, we also need to predicate `threadIdx` and `blockIdx` to be smaller than the IterDomain extents. That is implemented as `ParallelizedDomainPredicate` in predicate_compute.h.
* Allow setting contiguity of tensors
…ins (pytorch#1160) * Place unswitched shared memory allocations outside of unswitched domains. In lower allocations, the positions of allocation and initialization are tracked separately; they are the same except for unswitched shared memory allocations.
code cleaning for clang-tidy
…ces" (pytorch#66176) (pytorch#1178) Summary: enabling pytorch#63940 Pull Request resolved: pytorch#66176 Reviewed By: ngimel Differential Revision: D31423920 Pulled By: dzhulgakov fbshipit-source-id: 06b1e0f757f4fb5b31ee1fa464bcd689df919b9c
* Thread predicate map must be created before validating parallelization
* Use the loop map to find corresponding axes in validating parallelization between producers and consumers
…65064) (pytorch#1179) Summary: Pull Request resolved: pytorch#65064 The problem appears when nvfuser is triggered from LazyTensor. Because LT maintains its own thread pool, the thread used for the first-time compilation does CUDA context initialization properly, but later cached execution may use a different thread which does not have a proper CUDA context. Test Plan: Imported from OSS Reviewed By: saketh-are Differential Revision: D31269691 Pulled By: desertfire fbshipit-source-id: 384362025c087d61e8b625ff938379df283ef8b2 Co-authored-by: Bin Bao <[email protected]>
* Inline thread predicates even when unswitched
Support the dtype argument in softmax and softmax backward to accommodate the no-fusion issue with the updated LTC IR.
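For context, the frontend form of the op involved, shown with plain PyTorch (the sizes and the CUDA device here are placeholder assumptions, not taken from the PR):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 16, dtype=torch.half, device="cuda")
# The optional dtype argument casts the input to the given type before the
# softmax is computed (documented PyTorch behavior), so the graph carries the
# target dtype on the softmax op itself rather than as a separate cast.
y = F.softmax(x, dim=-1, dtype=torch.float)
print(y.dtype)  # torch.float32
```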
…predicates (pytorch#1182) * Predicating threadIdx/blockIdx at unswitch isn't necessary. When generating unswitch predicates, maximum index values are used to generate predicates at root domains, so it is redundant to also predicate threadIdx/blockIdx at leaf domains, even for non-exact threading dimensions.
…l. (pytorch#1184) Co-authored-by: Naoya Maruyama <[email protected]>
…gic (pytorch#1145) * Merge the predication logic for shift/gather into the main one. One of the major changes needed to integrate the predicate logic for shift/gather is support for predication at the start position of an IterDomain; hence all the "start_xyz" and "stop_xyz" names. Another complexity comes from the extension to support unswitching with shift/gather. In addition to start and stop predicates, the existence of halo means that the expressions unswitched at a domain may have different halo sizes (or none at all), so picking just any predicate per predicated root domain (and how it's parallelized) does not work. The most naive approach would be to gather all of the predicates for halo-extended root domains, but that's not efficient since some would be redundant. What's done in this PR is to select the most restrictive predicate by comparing the deviation from the baseline predicate. Suppose one stop predicate is composed as "x < extent". With halo, it would look like "x + a < extent", where "a" varies based on the halo width of the predicated domain. When "a" is a static constant, we find the maximum value and use only that predicate, since it is the most restrictive one (see the example below). Start predicates are analyzed similarly.
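A small illustrative sketch of the "most restrictive predicate" selection (assumed, simplified logic; not the actual nvfuser implementation, and the extent expression is a placeholder):

```python
# Stop predicates with halo take the form "x + a < extent", where "a" is the
# static halo width of the predicated domain. Since "x + a_max < extent"
# implies "x + a < extent" for every a <= a_max, only the largest halo width
# needs to be kept.

def most_restrictive_stop_predicate(halo_widths, extent_expr="T0.size[0]"):
    a_max = max(halo_widths)
    return f"x + {a_max} < {extent_expr}"

print(most_restrictive_stop_predicate([0, 1, 3]))
# x + 3 < T0.size[0]
```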
* Keep NVTX on by default. Use PYTORCH_NVFUSER_DISABLE_NVTX to disable it
Create type_promotion tests for unary, binary, and ternary ops
* Rename test_data_compatibility to test_unary_ops
Make sure benchmark sizes are built out and as consistent as possible. Add backwards benchmarks for BatchNorm, LayerNorm, and Softmax.
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Wondering if there's any update?
@jjsjann123 looks like merge conflict :/
@eellison has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Getting an error internally:
Maybe pytorch/test/cpp/jit/test_gpu_shift.cpp, line 31 in 2d110d5
Thought I had fixed that already... Let me take another pass. Forwarding the question from Slack in case someone else knows the answer: this looks a little strange. I think that file has already been renamed and moved to the nvfuser directory, and the include has been updated to use the full path. A stale build somehow?
Per the README file here https://github.com/pytorch/pytorch/blob/4537f6c9eba4d016f31f352f97056a7594d83142/test/cpp/jit/README.md, cpp jit tests should be placed under that given folder. Is that a hard requirement? I don't see anything that requires it in the
I'm trying to move our specific test files to a different folder, which works for me locally (relevant changes are here): But somehow this still fails on the internal build with
Wondering if there's some internal build file that should be updated? cc'ing @malfet @eellison
@malfet has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: Things changed in this PR that require review: test/forward_backward_compatibility/check_forward_backward_compatibility.py: our previous function overload extension names were wrong and have been updated in this PR, hence the compatibility list was updated. nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team. Pull Request resolved: #73627 Reviewed By: Chillee Differential Revision: D34765458 Pulled By: davidberard98 fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7
Hey @jjsjann123. |
Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it. [ghstack-poisoned]
Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it. ghstack-source-id: a3028c6 Pull Request resolved: #801
Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it. ghstack-source-id: 13d2a15 Pull Request resolved: #801
Summary: Pull Request resolved: #801 Waiting on pytorch/pytorch#73627 to land, because some of these don't pass without it. Test Plan: Imported from OSS Reviewed By: eellison Differential Revision: D35732497 Pulled By: davidberard98 fbshipit-source-id: 6dae74e628def71344958c93a8907f48539ae2fb
Summary: Things changed in this PR that require review: test/forward_backward_compatibility/check_forward_backward_compatibility.py: our previous function overload extension names were wrong and have been updated in this PR, hence the compatibility list was updated. nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team. Pull Request resolved: pytorch/pytorch#73627 Reviewed By: Chillee Differential Revision: D34765458 Pulled By: davidberard98 fbshipit-source-id: c81f3d6a1b723fb3a8ba419b7f82227f70440ca7 (cherry picked from commit b6a2c362c37051e44fac31687b2fe272f776551e)
Things changed in this PR that require review:
test/forward_backward_compatibility/check_forward_backward_compatibility.py
Our previous function overload extension names were wrong and have been updated in this PR, hence the compatibility list was updated.
nvfuser code updates with bug fixes for failures we encountered in OpInfo tests, as well as failures reported by the AOTAutograd team.