Add nvFuser support for torch.native_batch_norm #85562
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/85562
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 Failures
As of commit d523e29, the following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
nvfuser::Tensor bias,
nvfuser::Tensor running_mean,
nvfuser::Tensor running_var,
bool kTraining,
Nit: Why is the training parameter prefixed with a k? I thought the k prefix usually denoted an enum value.
Probably I copied it from
const bool kTraining,
I'll rename it.
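For reference, the convention being alluded to (the Google-style naming that PyTorch's C++ code loosely follows) reserves the k prefix for constants and enumerators. A small self-contained illustration of that distinction, not code from this PR:

```cpp
// Illustrative only: where the `k` prefix conventionally belongs.
constexpr double kDefaultMomentum = 0.1;  // compile-time constant: k prefix fits
enum class Mode { kTrain, kEval };        // enumerators: k prefix fits

// An ordinary runtime parameter reads better without the prefix:
double effective_momentum(bool training) {
  return training ? kDefaultMomentum : 0.0;
}

int main() { return effective_momentum(true) > 0.0 ? 0 : 1; }
```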
{},
std::move(_outputs),
"null_tensor",
RecordType::Tensor) {}
I would add a new Op Type since you have a unique Record, like RecordType::NullTensor.
Generally looks fine, I think a new Record type should be added for NullTensor.
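A simplified, self-contained sketch of that suggestion; RecordType and the record classes below are stand-ins mirroring the snippet above, not the actual nvFuser python-frontend definitions:

```cpp
#include <iostream>

// Stand-in for the frontend's record tag enum, with a dedicated value for
// the null-tensor record instead of reusing RecordType::Tensor.
enum class RecordType { Tensor, NullTensor };

struct RecordBase {
  explicit RecordBase(RecordType type) : type_(type) {}
  RecordType type() const { return type_; }
 private:
  RecordType type_;
};

struct NullTensorRecord : RecordBase {
  // Tagging with its own type lets caching and printing distinguish it
  // from a regular tensor record.
  NullTensorRecord() : RecordBase(RecordType::NullTensor) {}
};

int main() {
  NullTensorRecord rec;
  std::cout << (rec.type() == RecordType::NullTensor) << "\n";  // prints 1
}
```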
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
Merge failed
Reason: The following mandatory check(s) failed (Rule …). Dig deeper by viewing the failures on hud.
Details for Dev Infra team: Raised by workflow job …
Fixes BN inference. I'm stealing Ivan's changes from pytorch#85562. We are returning mini-batch stats during inference runs in ATen; this is not the right behavior and we should have changed that instead. But for the time being, let's change nvFuser behavior just to get CI green. Also, the extra set here to avoid trivial forwarding should be removed once #1995 is merged.
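For context on the inference-stats point: in eval mode the op should normalize with the running statistics, not statistics computed from the current mini-batch. A minimal libtorch sketch checking that expectation against at::native_batch_norm (assumes a local libtorch build; not code from this PR):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::manual_seed(0);
  auto input = torch::randn({4, 3, 8, 8});
  auto weight = torch::ones({3});
  auto bias = torch::zeros({3});
  auto running_mean = torch::zeros({3});
  auto running_var = torch::ones({3});

  // training=false: the output must be normalized with running_mean/var.
  auto result = at::native_batch_norm(
      input, weight, bias, running_mean, running_var,
      /*training=*/false, /*momentum=*/0.1, /*eps=*/1e-5);
  auto output = std::get<0>(result);

  auto expected = (input - running_mean.view({1, 3, 1, 1})) /
                  torch::sqrt(running_var.view({1, 3, 1, 1}) + 1e-5);
  std::cout << torch::allclose(output, expected) << "\n";  // expect 1
}
```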
OOM failure: 2022-09-30T15:29:35.8006856Z test_ops.py::TestCommonCUDA::test_python_ref__refs_diag_embed_cuda_complex32 FAILED [ 23%]
2022-09-30T15:29:35.8006880Z
2022-09-30T15:29:35.8007041Z =================================== FAILURES ===================================
2022-09-30T15:29:35.8007247Z ________ TestCommonCUDA.test_python_ref__refs_diag_embed_cuda_complex32 ________
2022-09-30T15:29:35.8007389Z Traceback (most recent call last):
2022-09-30T15:29:35.8007772Z File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1073, in assert_equal
2022-09-30T15:29:35.8007888Z pair.compare()
2022-09-30T15:29:35.8008229Z File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 620, in compare
2022-09-30T15:29:35.8008381Z self._compare_values(actual, expected)
2022-09-30T15:29:35.8008715Z File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 721, in _compare_values
2022-09-30T15:29:35.8008940Z compare_fn(actual, expected, rtol=self.rtol, atol=self.atol, equal_nan=self.equal_nan)
2022-09-30T15:29:35.8009316Z File "/opt/conda/lib/python3.10/site-packages/torch/testing/_comparison.py", line 853, in _compare_regular_values_close
2022-09-30T15:29:35.8009536Z matches = torch.isclose(actual, expected, rtol=rtol, atol=atol, equal_nan=equal_nan)
2022-09-30T15:29:35.8010173Z torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 34.00 MiB (GPU 0; 7.44 GiB total capacity; 402.93 MiB already allocated; 29.19 MiB free; 1.64 GiB allowed; 818.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@pytorchbot merge -f "XLA failure is unrelated, see 86093"
@pytorchbot successfully started a merge job. Check the current status here.
Hey @IvanYashchuk.
@pytorchbot revert -m "Periodic tests failing test_nvfuser_extremal_values_native_batch_norm_cuda_float64 (main.TestCudaFuserOpInfoCUDA)" -c nosignal
@pytorchbot successfully started a revert job. Check the current status here.
Reverting PR 85562 failed
Reason: The following mandatory check(s) failed (Rule …). Dig deeper by viewing the failures on hud.
Details for Dev Infra team: Raised by workflow job …
I'm unable to revert this PR (you'll need to sign the new CLA @IvanYashchuk), disabling the failing tests for now.
/easycla As part of the transition to the PyTorch Foundation, this project now requires contributions be covered under the new CLA. See #85559 for additional details. This comment will trigger a new check of this PR. If you are already covered, you will simply see a new "EasyCLA" check that passes. If you are not covered, a bot will leave a new comment with a link to sign.
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (pytorch/pytorch#81191) and no in-place copy support (pytorch/pytorch#84545). Pull Request resolved: pytorch/pytorch#85562 Approved by: https://github.com/kevinstephano, https://github.com/ngimel
This PR adds nvFuser's implementation for batch_norm as there's no reference yet (#81191) and no in-place copy support (#84545).
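One reason the missing in-place copy support matters here: in training mode native_batch_norm mutates running_mean and running_var in place, which a purely functional reference cannot express without copy_. A small libtorch sketch illustrating that mutation (assumes a local libtorch build; not code from this PR):

```cpp
#include <torch/torch.h>
#include <iostream>

int main() {
  torch::manual_seed(0);
  auto input = torch::randn({4, 3, 8, 8});
  auto running_mean = torch::zeros({3});
  auto running_var = torch::ones({3});
  auto mean_before = running_mean.clone();

  // training=true: running statistics are updated in place using the
  // current batch statistics and the momentum.
  at::native_batch_norm(
      input, /*weight=*/{}, /*bias=*/{}, running_mean, running_var,
      /*training=*/true, /*momentum=*/0.1, /*eps=*/1e-5);

  std::cout << torch::allclose(mean_before, running_mean) << "\n";  // expect 0
}
```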