
Conversation

@IMbackK (Collaborator) commented Sep 24, 2025

rocwmma 2.0.0 includes a bug in the code faking fp16 accumulation on CDNA

The current rocwmma, as released with ROCm 7.0.0 and 7.0.1, includes an embarrassing compile-time bug in the code that emulates fp16 accumulation via downcast on devices that do not support it in hardware.

This PR redesigns the conditions under which the WMMA fattn kernel is selected, and avoids compiling and using the kernel in the following broken configurations (a gating sketch follows the list):

CDNA with rocWMMA 2.0.0
RDNA4 with rocWMMA < 2.0.0
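
A hypothetical preprocessor sketch of such a gate, assuming rocwmma-version.hpp exposes ROCWMMA_VERSION_MAJOR/MINOR/PATCH: GGML_HIP_ROCWMMA_FATTN is the existing opt-in define, but ARCH_IS_CDNA and ARCH_IS_RDNA4 are invented stand-ins for whatever the build actually defines, and this is not the PR's real code:

    #if defined(GGML_HIP_ROCWMMA_FATTN)
    #include <rocwmma/internal/rocwmma-version.hpp>

    #define ROCWMMA_IS_2_0_0 (ROCWMMA_VERSION_MAJOR == 2 && \
        ROCWMMA_VERSION_MINOR == 0 && ROCWMMA_VERSION_PATCH == 0)

    // Broken pairings per the description above:
    //   CDNA  + rocWMMA 2.0.0
    //   RDNA4 + rocWMMA < 2.0.0
    #if (defined(ARCH_IS_CDNA) && ROCWMMA_IS_2_0_0) || \
        (defined(ARCH_IS_RDNA4) && ROCWMMA_VERSION_MAJOR < 2)
    #undef GGML_HIP_ROCWMMA_FATTN // fall back to the non-WMMA fattn path
    #endif
    #endif // defined(GGML_HIP_ROCWMMA_FATTN)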

if (NOT ${FOUND_ROCWMMA})
    message(FATAL_ERROR "rocwmma has not been found")
endif()
endif()
@IMbackK (Collaborator, Author) commented on this snippet:

This condition never worked. For CHECK_INCLUDE_FILE_CXX, CMake generates a .cpp file that includes the header, compiles it with the C++ compiler, and checks whether the compilation succeeds. In this case the compilation can never succeed, as rocwmma.hpp uses HIP extensions to C++; therefore FOUND_ROCWMMA was never set.
This was then masked by the condition NOT ${FOUND_ROCWMMA} being wrong: it should be NOT FOUND_ROCWMMA, as NOT ${FOUND_ROCWMMA} expands to NOT "" when FOUND_ROCWMMA is not set, which evaluates to TRUE.
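
A minimal sketch of a check that avoids both pitfalls (the ROCM_PATH variable and the hint paths are assumptions for illustration, not the PR's actual change): probe for the header on disk instead of compiling it, and test the variable by name rather than by expansion:

    # rocwmma.hpp uses HIP extensions to C++, so the plain C++ test compile
    # done by CHECK_INCLUDE_FILE_CXX can never succeed; find_path only
    # checks that the file exists on disk.
    find_path(ROCWMMA_INCLUDE_DIR
        NAMES rocwmma/rocwmma.hpp
        HINTS ${ROCM_PATH}/include /opt/rocm/include)

    # Test the unexpanded variable name; find_path leaves a -NOTFOUND
    # value (falsy) on failure, so this fires exactly when it should.
    if (NOT ROCWMMA_INCLUDE_DIR)
        message(FATAL_ERROR "rocwmma has not been found")
    endif()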

@IMbackK IMbackK changed the title HIP: Disable ROCWMMA fatt on CDNA when compiled against ROCWMMA 2.0.0 HIP: Disable ROCWMMA fattn on CDNA when compiled against ROCWMMA 2.0.0 Sep 24, 2025
@IMbackK (Collaborator, Author) commented Sep 24, 2025

fixes #16153

@IMbackK (Collaborator, Author) commented Sep 24, 2025

Unfortunately we can't just accumulate at fp32 in the WMMA kernel on CDNA to avoid this bug, even though that would be more performant, as we don't have enough shared memory for it.
We could explore opportunistically doing so for the shapes where shared memory suffices; a rough illustration of the constraint follows.
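
Back-of-the-envelope, with purely hypothetical tile sizes (the real kernel's tiles are shape-dependent): widening the accumulator from fp16 to fp32 doubles its footprint, against the 64 KiB of LDS a CDNA workgroup can use.

    #include <cstddef>
    #include <cstdio>

    int main() {
        // Hypothetical accumulator tile; illustration only.
        const std::size_t elems = 64 * 64 * 4; // rows * cols * column tiles
        std::printf("fp16 accumulators: %zu KiB\n", elems * 2 / 1024); // 32 KiB
        std::printf("fp32 accumulators: %zu KiB\n", elems * 4 / 1024); // 64 KiB
        // At fp32 the accumulators alone would consume the entire 64 KiB
        // LDS budget of a CDNA workgroup, leaving nothing for the K/Q/V
        // tiles the kernel also stages in shared memory.
        return 0;
    }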

@github-actions bot added the labels Nvidia GPU (Issues specific to Nvidia GPUs) and ggml (changes relating to the ggml tensor library for machine learning) on Sep 24, 2025
@IMbackK (Collaborator, Author) commented Sep 24, 2025

Currently this can't build on CI, as the rocwmma installation on CI is incorrect.
At the moment we simply clone the rocwmma repo:

git clone https://github.com/rocm/rocwmma --branch rocm-${{ env.ROCM_VERSION }} --depth 1

and then use rocwmma, which is header-only, from there.
Since this way we don't run rocwmma's CMake build system, rocwmma-version.hpp never gets generated from https://github.com/ROCm/rocWMMA/blob/develop/library/include/rocwmma/internal/rocwmma-version.hpp.in, and thus this header is missing. One way to install it properly is sketched below.
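
A sketch of a fix (not the follow-up PR; the ROCWMMA_BUILD_* option names are assumptions to verify against rocWMMA's CMakeLists): run rocWMMA's CMake configure step, which is what generates rocwmma-version.hpp from the template, then install the headers:

    git clone https://github.com/ROCm/rocWMMA --branch rocm-${{ env.ROCM_VERSION }} --depth 1
    # Configuring runs configure_file() on rocwmma-version.hpp.in.
    cmake -S rocWMMA -B rocwmma-build \
        -DCMAKE_BUILD_TYPE=Release \
        -DROCWMMA_BUILD_TESTS=OFF \
        -DROCWMMA_BUILD_SAMPLES=OFF
    # rocwmma is header-only, so installing just copies the headers,
    # including the generated version header, into the prefix.
    cmake --install rocwmma-build --prefix /opt/rocm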

@JohannesGaessler (Collaborator) left a comment:

FYI one of my long-term goals is to remove the WMMA kernel by expanding support for the mma kernel. The instructions that are still missing support are Volta tensor cores, AMD WMMA, and AMD MFMA. I'll need to think about how to organize my hardware; I'll definitely procure an RDNA GPU. For V100/Mi100 I'm not yet sure how to best obtain access.

@IMbackK (Collaborator, Author) commented Sep 24, 2025

Do not merge this until CI is fixed by properly installing rocwmma there (a PR for this will follow).

@IMbackK (Collaborator, Author) commented Sep 24, 2025

> FYI one of my long-term goals is to remove the WMMA kernel by expanding support for the mma kernel. The instructions that are still missing support are Volta tensor cores, AMD WMMA, and AMD MFMA. I'll need to think about how to organize my hardware; I'll definitely procure an RDNA GPU. For V100/Mi100 I'm not yet sure how to best obtain access.

@deepsek You have previously expressed interest in adding MFMA support to the fattn mma path; it would be helpful if you could share your current plans in this direction, if any.

@deepsek (Contributor) commented Sep 24, 2025

@IMbackK, I was targeting a November PR to address fattn for MMA along with some other changes. But I'm currently stretched thin with other open-source projects. We might be delayed until 2026. If anyone in the community is taking up this effort, I would be happy to assist with issues!

@IMbackK (Collaborator, Author) commented Oct 1, 2025

@slaren It seems @JohannesGaessler's approval is no longer sufficient, I believe due to the recent changes to CODEOWNERS, or perhaps some other configuration change I'm not aware of.

@slaren (Member) commented Oct 1, 2025

The number of people with write access has been reduced, see #16113 for more details. Merging based on Johannes' approval.

@slaren merged commit e95fec6 into ggml-org:master on Oct 1, 2025; 60 of 68 checks passed.
Nexesenex added commits referencing this pull request to Nexesenex/croco.cpp between Oct 2 and Oct 9, 2025.