-
Notifications
You must be signed in to change notification settings - Fork 7
reduction on complex numbers, added volatile copy and assignment #2453
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
While not ideal, it should be possible to make the reductions work without modifying the LLVM header file. For serial reductions, they are dynamically generated, so we could tweak our codegen to emit different code for volatile complex types. For block/grid reductions, we are using device functions provided as part of the nvfuser "runtime". For those cases, we could provide a copy functor to the device functions so that the overload cases for volatile types aren't necessary. It's certainly easier from the codegen side if we could just modify the header file as done in the PR. However, if we don't want to make a modification of the external code, there's an alternative option. Another option would be inserting the overload definitions programmatically into the strings returned from |
|
cc457bf
to
3032f57
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you skim through our tests in third_party/nvfuser/test
to find the tests that check reductions with different dtypes, and add complex to the dtype list? I do this by searching DataType::Double
and read all matches manually to determine if a test is interesting to me. Here is a list of tests I found: FusionGroupAllreduce5_CUDA
, FusionReductionSchedulerNoODimShmoo_CUDA
, FusionReductionSchedulerDimShmoo_CUDA
, FusionWelfordShmoo_CUDA
3032f57
to
59cbbb0
Compare
Is this ready for another review? |
59cbbb0
to
e4a9055
Compare
@@ -0,0 +1,172 @@ | |||
namespace std { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we protect this file with
#ifdef __NVCC__
#include <type_traits>
#else
....
#endif
Please try a PYTORCH_NVFUSER_DUMP="cuda_to_file"
run on something, and use nvcc to compile the dumped file to check if it compiles.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice catch! we need this protection. However, the generated code still can't compile with nvcc due to the lack of non-volatile to volatile copy of complex numbers in <complex>
. error msg:
__tmp_kernel1.cu(3426): error: no operator "=" matches these operands
operand types are: volatile std::complex<float> = const std::complex<float>
detected during:
instantiation of "void CudaCodeGen::TupleCopy<DstType, SrcType, num_vals>::copy(DstType &, CudaCodeGen::nvfuser_index_t, const SrcType &, CudaCodeGen::nvfuser_index_t) [with DstType=CudaCodeGen::PtrTupleBase<true, float, double, CudaCodeGen::int64_t, std::complex<float>>, SrcType=CudaCodeGen::LocalTuple<float, double, CudaCodeGen::int64_t, std::complex<float>>, num_vals=4]"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you saying that, if the kernel is complex, then we are still unable to compile with nvcc. But if it is not complex then it is working? If so, then it is fine for now. I believe @mmigdal-nv is using nvcc with matmul quite often, and we need to make sure this PR doesn't break non-complex matmul use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's only influence reduction of complex numbers.
|
||
auto outFDI = add( | ||
add(castOp(DataType::Double, tv3), tv7), castOp(DataType::Double, tv11)); | ||
auto out = add(outFDI, castOp(DataType::Double, tv15)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we cast to complex double here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is just pick the real part and cast to double. All results from other data types are casted to double, so does the reference results from torch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we cast all results from other data types and reference results to complex double as well? Casting to double discard imag so we are not checking the correctness of imag.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated this test with all results converted to complex
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for adding reduction support for complex numbers!
copy constructor must be declared within class. So I need to directly patch
aten/src/ATen/cuda/llvm_complex.cpp
to allow the use of volatile in reduction.Fixes #2369