Add variance_mean composite function using Welford #1907

Conversation
Fixes #1801.

Currently complex, half, and bfloat16 types give compilation errors with Welford, so these types raise an explicit error for now.

I'm not sure how to include this function in parser.cpp to be tied with aten::var_mean.correction correctly, so integration with TorchScript is omitted from this PR.
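For background, here is a minimal single-pass Welford sketch in plain C++ (illustrative only, not the nvfuser WelfordOp; the function name is hypothetical). It shows why a single reduction can produce both the mean and the variance:

```
#include <utility>
#include <vector>

// Welford's algorithm: maintain a running mean and M2 (sum of squared
// deviations from the current mean) in one pass over the data.
// Assumes data.size() > correction; correction = 1 gives the unbiased
// (Bessel-corrected) estimator.
std::pair<double, double> welford_variance_mean(
    const std::vector<double>& data, long correction) {
  double mean = 0.0;
  double m2 = 0.0;  // running sum of squared deviations
  long n = 0;
  for (double x : data) {
    ++n;
    double delta = x - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (x - mean);  // second factor uses the updated mean
  }
  double var = m2 / static_cast<double>(n - correction);
  return {var, mean};
}
```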
LGTM. Don't worry about TS support at this moment. I can patch a PR for that after this one is merged.
TORCH_CHECK(
    correction >= 0, "correction must be non-negative, but got ", correction);

// There are compilation errors for half precision
Should we consider just promoting the math to float for reduced precision then? That seems to be the universal approach used by everything else that computes variance.
I handle type promotion on the Python side.
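For illustration, a hedged sketch of the in-kernel alternative floated above — promoting reduced-precision inputs to float before the reduction. The castOp call and DataType enum values are assumptions about nvfuser's arith API of this era; the merged PR instead raises an error for these types and handles promotion in Python:

```
// Sketch (fragment): promote half/bfloat16 inputs to float before Welford,
// so the reduction math runs in fp32. Not the code merged in this PR.
auto dtype = x->getDataType().value();
if (dtype == DataType::Half || dtype == DataType::BFloat16) {
  x = castOp(DataType::Float, x);
}
auto welford_out = Welford(x, dims);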
if (isComplexType(x->getDataType().value())) {
  // There are compilation errors:
  // __tmp_kernel1.cu(6727): error: namespace "CudaCodeGen::std" has no member
Doesn't sound very right to me. cc'ing @zasdfgbnm
This is definitely not the correct behavior, but I am not surprised at all 😜. I got moved to more important tasks before finishing complex support, so currently, complex support in nvfuser is very broken. Please consider complex as not supported when doing other work.
Sorry I forgot to stamp earlier.
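For context, a hedged sketch of the kind of explicit error the PR description mentions for unsupported types (the exact check and message in the merged code may differ):

```
// Sketch (fragment): reject complex inputs up front rather than letting
// codegen fail with the CudaCodeGen::std error quoted above.
TORCH_CHECK(
    !isComplexType(x->getDataType().value()),
    "variance_mean is not yet supported for complex types");
```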
    sub(num_features, IrBuilder::create<Int>(x->container(), correction));
}

auto welford_out = Welford(x, dims);
I would hope that we can have `variance` also use Welford, so we can unify the interface for `variance` and `variance_mean`. But WelfordOp appears to be a single op in codegen; does this mean the generated kernel would still write `mean` to a register somewhere, even though it is not used at all? That we probably don't want.
We can of course try using

TensorView* variance(
    TensorView* x,
    const std::vector<int>& dims,
    int64_t correction,
    bool keepdim) {
  auto var_mean = variance_mean(x, dims, correction, keepdim);
  return var_mean.var;
}

if it positively impacts the performance. That's how ATen implements this function, and it would justify `var` being a prim in PrimTorch.
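For reference, a hedged sketch of how the tail of the composite presumably derives both outputs from the single Welford reduction. The `var_sum`/`avg` field names and the `div` helper are assumptions about nvfuser's arith API, not quotes of the merged code:

```
// Sketch (fragment): var_sum accumulates the sum of squared deviations
// (M2), so dividing by (N - correction) yields the corrected variance,
// while avg is already the mean. num_features is N with the correction
// subtracted, as built by the sub(...) call in the context above.
auto welford_out = Welford(x, dims);
auto var = div(welford_out.var_sum, num_features);
auto mean = welford_out.avg;
return {var, mean};  // result field names assumed
```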
What's the procedure for moving this further? Who runs the tests and clicks "merge"?
We don't have CI, so you'll need to run tests locally.

^^^ These two are the usual ones that we run before merging each PR. You've already got my stamp; once the tests are verified to pass, you can click squash and merge.
I see two test failures that seem to be unrelated:

[ RUN      ] NVFuserTest.FusionBiasGeluBwd_CUDA
unknown file: Failure
C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first) INTERNAL ASSERT FAILED at "/home/iyashchuk/dev/pytorch/jit-devel/torch/csrc/jit/codegen/cuda/test/test_gpu_validator.h":416, please report a bug to PyTorch.
Validation error in output 0 on line 11691 in file /home/iyashchuk/dev/pytorch/jit-devel/torch/csrc/jit/codegen/cuda/test/test_gpu.cpp.
  Detected abs error of: 1.66893e-06
    absolute tolerance was set to 1.51992e-06
    and relative tolerance set to 2.23704e-06
Exception raised from testValidate at /home/iyashchuk/dev/pytorch/jit-devel/torch/csrc/jit/codegen/cuda/test/test_gpu_validator.h:416 (most recent call first):

[ RUN      ] NVFuserTest.FusionScheduleTransposeMultipleInputOutput_CUDA
unknown file: Failure
C++ exception with description "CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 11.77 GiB total capacity; 8.25 GiB already allocated; 800.44 MiB free; 9.77 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Exception raised from malloc at /home/iyashchuk/dev/pytorch/jit-devel/c10/cuda/CUDACachingAllocator.cpp:639 (most recent call first):