reduction on complex numbers, added volatile copy and assignment #2453

liqiangxl · 2023-02-13T20:54:34Z

A copy constructor must be declared inside the class, so I need to patch aten/src/ATen/cuda/llvm_complex.cpp directly to allow volatile complex values in reductions.
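To illustrate the constraint, here is a minimal host-side sketch of the kind of members the patch has to add. `SketchComplex` is a hypothetical stand-in, not the actual code in llvm_complex.cpp; the point is only that constructors and `operator=` cannot be written as free functions, so volatile variants have to live in the class body.

```cpp
// Hypothetical stand-in for a complex type (not the real llvm_complex.cpp code).
template <typename T>
struct SketchComplex {
  T re;
  T im;

  constexpr SketchComplex(T r = T(), T i = T()) : re(r), im(i) {}

  // Ordinary copy operations.
  SketchComplex(const SketchComplex& other) = default;
  SketchComplex& operator=(const SketchComplex& other) = default;

  // Copy construction from a volatile object. Constructors can only be
  // declared inside the class, hence the need to patch the class itself.
  SketchComplex(const volatile SketchComplex& other)
      : re(other.re), im(other.im) {}

  // Assignment into a volatile object. operator= must also be a member.
  // Returns void in this sketch to avoid handing back a volatile reference.
  void operator=(const SketchComplex& other) volatile {
    re = other.re;
    im = other.im;
  }

  // Assignment from a volatile object into a non-volatile one.
  SketchComplex& operator=(const volatile SketchComplex& other) {
    re = other.re;
    im = other.im;
    return *this;
  }
};

// Round trip through a volatile slot, as a reduction work buffer would do.
inline float demo_volatile_roundtrip() {
  volatile SketchComplex<float> shared;        // e.g. a slot in a work buffer
  shared = SketchComplex<float>(1.5f, -2.0f);  // non-volatile -> volatile
  SketchComplex<float> local = shared;         // volatile -> non-volatile
  return local.re + local.im;
}
```

Without the three volatile-aware members, neither the assignment into `shared` nor the copy out of it would compile.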

Fixes #2369

naoyam · 2023-02-14T19:05:54Z

While not ideal, it should be possible to make the reductions work without modifying the LLVM header file. For serial reductions, they are dynamically generated, so we could tweak our codegen to emit different code for volatile complex types. For block/grid reductions, we are using device functions provided as part of the nvfuser "runtime". For those cases, we could provide a copy functor to the device functions so that the overload cases for volatile types aren't necessary.
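As a rough host-side sketch of the copy-functor idea (all names here are hypothetical, and this is not the nvfuser runtime API): the reduction helper takes a functor that moves values across the volatile boundary one scalar at a time, so it never needs a volatile overload of `std::complex::operator=`. It relies on the array-oriented access that the standard guarantees for `std::complex`.

```cpp
#include <complex>

// Hypothetical copy functor: copies a complex value across a volatile
// boundary component by component.
struct ComplexCopy {
  // local -> volatile
  template <typename T>
  void operator()(volatile std::complex<T>& dst,
                  const std::complex<T>& src) const {
    volatile T* d = reinterpret_cast<volatile T*>(&dst);
    d[0] = src.real();
    d[1] = src.imag();
  }

  // volatile -> local
  template <typename T>
  void operator()(std::complex<T>& dst,
                  const volatile std::complex<T>& src) const {
    const volatile T* s = reinterpret_cast<const volatile T*>(&src);
    dst = std::complex<T>(s[0], s[1]);
  }
};

// Hypothetical serial reduction over a volatile work buffer. It is
// parameterized on the copy functor instead of relying on operator=.
template <typename T, typename Copy>
T sumVolatile(const volatile T* buffer, int n, Copy copy) {
  T acc{};
  for (int i = 0; i < n; ++i) {
    T tmp;
    copy(tmp, buffer[i]);  // volatile -> local
    acc += tmp;
  }
  return acc;
}

// Fills a volatile buffer with (0,1), (1,1), (2,1) and sums it.
inline std::complex<float> demo_copy_functor() {
  volatile std::complex<float> buffer[3];
  ComplexCopy copy;
  for (int i = 0; i < 3; ++i) {
    copy(buffer[i], std::complex<float>(static_cast<float>(i), 1.0f));
  }
  return sumVolatile<std::complex<float>>(buffer, 3, copy);
}
```

The trade-off is that every runtime helper touching volatile memory has to accept and use the functor, which is more invasive on the runtime side than adding the overloads to the complex class.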

It's certainly easier from the codegen side if we could just modify the header file as done in the PR. However, if we don't want to modify the external code, the approach above is an alternative. Another option would be to insert the overload definitions programmatically into the strings returned from at::cuda::get_complex_body_string(), etc.
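A very rough sketch of what the programmatic insertion could look like, assuming the class body is available as a plain string. `injectMembers` and the marker strings are hypothetical; the search for the closing `};` is deliberately naive and would need to be made robust against nested braces before being used on the real complex source.

```cpp
#include <string>

// Inserts `extra_members` just before the first "};" found after
// `class_marker`. Returns the input unchanged if either is missing.
// Naive on purpose: it does not handle nested classes or initializers.
inline std::string injectMembers(std::string source,
                                 const std::string& class_marker,
                                 const std::string& extra_members) {
  const std::size_t start = source.find(class_marker);
  if (start == std::string::npos) {
    return source;
  }
  const std::size_t close = source.find("};", start);
  if (close == std::string::npos) {
    return source;
  }
  source.insert(close, extra_members);
  return source;
}
```

For example, calling it with a toy body `"struct complex {\n  float re, im;\n};\n"`, the marker `"struct complex"`, and a volatile member definition places that member right before the closing brace and leaves the rest of the string untouched.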

liqiangxl · 2023-02-15T14:47:34Z

aten/src/ATen/cuda/llvm_complex.cpp in PyTorch is copy-pasted (with modifications) from the original LLVM file. If we don't want to modify that file, can we just copy it into our own runtime folder and put everything we add for complex number support into that copy?

zasdfgbnm

Could you skim through our tests in third_party/nvfuser/test to find the tests that check reductions with different dtypes, and add complex to the dtype list? I do this by searching for DataType::Double and reading through all matches manually to decide whether a test is relevant. Here is a list of tests I found: FusionGroupAllreduce5_CUDA, FusionReductionSchedulerNoODimShmoo_CUDA, FusionReductionSchedulerDimShmoo_CUDA, FusionWelfordShmoo_CUDA

third_party/nvfuser/runtime/complex_number.cu

assignment

csarofeen · 2023-02-27T16:35:09Z

Is this ready for another review?

third_party/nvfuser/csrc/codegen.cpp

third_party/nvfuser/csrc/lower_utils.cpp

zasdfgbnm · 2023-02-27T21:52:16Z

third_party/nvfuser/runtime/basic_type_traits.cu

@@ -0,0 +1,172 @@
+namespace std {


Should we protect this file with

#ifdef __NVCC__
#include <type_traits>
#else
....
#endif

Please try a PYTORCH_NVFUSER_DUMP="cuda_to_file" run on something, and use nvcc to compile the dumped file to check if it compiles.

Nice catch! We need this protection. However, the generated code still doesn't compile with nvcc, because <complex> lacks a non-volatile-to-volatile copy for complex numbers. Error message:

__tmp_kernel1.cu(3426): error: no operator "=" matches these operands
    operand types are: volatile std::complex<float> = const std::complex<float>
    detected during:
        instantiation of "void CudaCodeGen::TupleCopy<DstType, SrcType, num_vals>::copy(DstType &, CudaCodeGen::nvfuser_index_t, const SrcType &, CudaCodeGen::nvfuser_index_t) [with DstType=CudaCodeGen::PtrTupleBase<true, float, double, CudaCodeGen::int64_t, std::complex<float>>, SrcType=CudaCodeGen::LocalTuple<float, double, CudaCodeGen::int64_t, std::complex<float>>, num_vals=4]"
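The same limitation can be checked on the host with a type trait, without going through nvcc. This is only a sketch with hypothetical constant names; it assumes a standard library whose `std::complex` declares only the non-volatile assignment operators that the C++ standard specifies.

```cpp
#include <complex>
#include <type_traits>

// std::complex declares only non-volatile assignment operators, so a
// volatile destination has no matching operator=.
constexpr bool kVolatileComplexAssignable =
    std::is_assignable<volatile std::complex<float>&,
                       const std::complex<float>&>::value;

// Built-in scalars have no such restriction, which is why the other tuple
// element types in the error above (float, double, int64_t) are fine.
constexpr bool kVolatileFloatAssignable =
    std::is_assignable<volatile float&, const float&>::value;

static_assert(!kVolatileComplexAssignable,
              "volatile std::complex<float> should not be assignable");
static_assert(kVolatileFloatAssignable,
              "volatile float should be assignable");
```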

Are you saying that a complex kernel still can't be compiled with nvcc, but a non-complex kernel works? If so, that is fine for now. I believe @mmigdal-nv uses nvcc with matmul quite often, so we need to make sure this PR doesn't break the non-complex matmul use case.

It only affects reductions of complex numbers.

third_party/nvfuser/runtime/basic_type_traits.cu

third_party/nvfuser/runtime/complex_number.cu

third_party/nvfuser/test/test_gpu2.cpp

third_party/nvfuser/runtime/helpers.cu

zasdfgbnm · 2023-02-27T21:58:44Z

third_party/nvfuser/test/test_gpu_fused_reduction.cpp


+  auto outFDI = add(
+      add(castOp(DataType::Double, tv3), tv7), castOp(DataType::Double, tv11));
+  auto out = add(outFDI, castOp(DataType::Double, tv15));


Should we cast to complex double here?

This just picks the real part and casts it to double. Results from all the other data types are cast to double, and so are the reference results from torch.

Should we cast the results from the other data types and the reference results to complex double as well? Casting to double discards the imaginary part, so we are not checking its correctness.
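A small host-side illustration of the concern, using `std::complex` and `.real()` as a stand-in for the cast in the test (the helper names are hypothetical): a result whose imaginary part is wrong still passes a comparison that only looks at the real part.

```cpp
#include <complex>

// Comparison after "casting to double": only the real part is checked.
inline bool matchesRealOnly(std::complex<double> result,
                            std::complex<double> reference) {
  return result.real() == reference.real();
}

// Comparison in complex double: both parts are checked.
inline bool matchesComplex(std::complex<double> result,
                           std::complex<double> reference) {
  return result == reference;
}
```

With a reference of (3, 4) and a result of (3, -4), `matchesRealOnly` returns true while `matchesComplex` returns false, so only the complex comparison would catch a bug in the imaginary part of the reduction.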

Updated this test so that all results are converted to complex.

zasdfgbnm

LGTM, thanks for adding reduction support for complex numbers!

liqiangxl requested a review from kevinstephano February 13, 2023 20:54

liqiangxl force-pushed the llu/complex_number_reduction branch 2 times, most recently from cc457bf to 3032f57 February 23, 2023 13:58

liqiangxl requested a review from zasdfgbnm February 23, 2023 16:50

zasdfgbnm reviewed Feb 23, 2023

liqiangxl force-pushed the llu/complex_number_reduction branch from 3032f57 to 59cbbb0 February 24, 2023 18:54

liqiangxl added 3 commits February 25, 2023 20:38

reduction on complex numbers, added volatile copy and assignment

52bf589

add complex number runtime

919e278

move ops on complex numbers from helpers.cu to complex_number.cu

b6882a1

add cpp test

e4a9055

liqiangxl force-pushed the llu/complex_number_reduction branch from 59cbbb0 to e4a9055 February 27, 2023 18:42

liqiangxl added 2 commits February 27, 2023 18:58

fix

ecfb67b

fix

2230e26