rocblas alt impl during backward pass only #978

Merged

hubertlu-tw

Cherry-picked the commit to resolve Apex compilation issues related to the rocblas_gemm_flags_fp16_alt_impl implementations in the MLP and MHA extensions. The error message looks like this:

/opt/rocm-5.0.1/bin/hipcc -I/var/lib/jenkins/apex/csrc -I/opt/conda/lib/python3.7/site-packages/torch/include -I/opt/conda/lib/python3.7/site-packages/torch/include/torch/csrc/api/include -I/opt/conda/lib/python3.7/site-packages/torch/include/TH -I/opt/conda/lib/python3.7/site-packages/torch/include/THC -I/opt/conda/lib/python3.7/site-packages/torch/include/THH -I/opt/rocm-5.0.1/include -I/opt/rocm-5.0.1/miopen/include -I/opt/conda/include/python3.7m -c /var/lib/jenkins/apex/csrc/mlp_hip.hip -o build/temp.linux-x86_64-3.7/var/lib/jenkins/apex/csrc/mlp_hip.o -fPIC -D__HIP_PLATFORM_HCC__=1 -DUSE_ROCM=1 -DCUDA_HAS_FP16=1 -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -DTORCH_EXTENSION_NAME=mlp_cuda -D_GLIBCXX_USE_CXX11_ABI=1 --amdgpu-target=gfx906 -fno-gpu-rdc -std=c++14
/var/lib/jenkins/apex/csrc/mlp_hip.hip:1518:20: error: no member named 'BackwardPassGuard' in namespace 'at'
        flag = at::BackwardPassGuard::is_backward_pass() ? rocblas_gemm_flags_fp16_alt_impl : 0;
               ~~~~^
1 error generated when compiling for gfx906.
error: command '/opt/rocm-5.0.1/bin/hipcc' failed with exit status 1

Collaborator

@pruthvistony pruthvistony left a comment

The object in_backward doesn't seem to be used anywhere. Please check.

@@ -151,6 +151,9 @@ struct TORCH_API Node : std::enable_shared_from_this<Node> {
// probably operate with names.
at::NoNamesGuard no_names_guard;

// Keep track of backward pass for rocblas.
at::BackwardPassGuard in_backward;
Collaborator

Is this required?

Collaborator

@pruthvistony pruthvistony Mar 28, 2022

It is in the Node object. Please ignore previous comment.
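
For readers unfamiliar with the pattern under discussion: a guard object like in_backward can look unused because it does its work in its constructor and destructor. The sketch below is a minimal illustration of that idea, not the actual PyTorch code. The class only borrows the BackwardPassGuard name from the diff, the thread-local storage and the save/restore behavior are assumptions, and kFp16AltImplFlag is a hypothetical stand-in for rocblas_gemm_flags_fp16_alt_impl with a placeholder value.

```cpp
#include <cstdint>

// Hypothetical stand-in for rocblas_gemm_flags_fp16_alt_impl. The value is a
// placeholder for illustration, not taken from the rocBLAS headers.
constexpr std::uint32_t kFp16AltImplFlag = 0x4;

// Illustrative RAII guard, loosely modeled on the at::BackwardPassGuard name
// in the diff. This is an assumption-based sketch, not PyTorch's code.
class BackwardPassGuard {
 public:
  // Remember the previous state, then mark this thread as being in backward.
  BackwardPassGuard() : previous_(is_backward_pass_) { is_backward_pass_ = true; }

  // Restore the previous state when the guard goes out of scope.
  ~BackwardPassGuard() { is_backward_pass_ = previous_; }

  BackwardPassGuard(const BackwardPassGuard&) = delete;
  BackwardPassGuard& operator=(const BackwardPassGuard&) = delete;

  // Queried by kernels that need to know whether backward is running.
  static bool is_backward_pass() { return is_backward_pass_; }

 private:
  bool previous_;
  static thread_local bool is_backward_pass_;
};

thread_local bool BackwardPassGuard::is_backward_pass_ = false;

// Mirrors the flag selection on the line from the build error: use the
// alternate FP16 implementation only while a backward pass is active.
std::uint32_t select_gemm_flags() {
  return BackwardPassGuard::is_backward_pass() ? kFp16AltImplFlag : 0;
}
```

In this sketch, declaring a BackwardPassGuard local in a scope is enough to change what select_gemm_flags() returns for every call made inside that scope on the same thread, which is why the variable itself never needs to be referenced.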

@jithunnair-amd
Collaborator

CI failed some unit tests, but the failures are not related to this PR afaict. Merging to unblock QA: https://ontrack-internal.amd.com/browse/SWDEV-329776

@jithunnair-amd jithunnair-amd merged commit c68cb3c into rocm5.2_internal_testing Mar 29, 2022
@jithunnair-amd
Collaborator

Kicking off CI to get a cleaner CI report (w/ updated Magma and --continue-through-error):
test pytorch please

4 participants