Multi-thread Performance Improvement of GEMM with DIVIDE_RATE=1 for A64FX #5353

Open
wants to merge 1 commit into
base: develop

nakagawa-fj
Contributor

Closes #5347
This PR improves the multi-threaded performance of GEMM on A64FX by setting DIVIDE_RATE to 1.
The thread control in GEMM currently uses the default value of DIVIDE_RATE=2, which always splits the N dimension of the matrix into two parts for computation. This split occurs even when N is small (e.g., N=2), which reduces computational efficiency.
For GEMM on A64FX, I tried DIVIDE_RATE=1 and confirmed performance improvements, as shown in the graphs below.
Improvements were expected for narrow matrices with a small N dimension, but performance gains were also observed for square matrices.
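
To make the effect of the split on small N concrete, here is a minimal sketch. It assumes the N range is cut into DIVIDE_RATE pieces of equal width using ceiling division; the actual thread-control code in GEMM may differ in detail. The helper names `chunk_width` and `chunk_count` are hypothetical and are not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a library function): width of each piece when
 * a range of n columns is cut into divide_rate parts, rounding up. */
static size_t chunk_width(size_t n, size_t divide_rate)
{
    assert(divide_rate > 0);
    return (n + divide_rate - 1) / divide_rate;
}

/* Hypothetical helper: number of non-empty pieces that result from
 * walking the n columns in steps of chunk_width(). */
static size_t chunk_count(size_t n, size_t divide_rate)
{
    size_t width = chunk_width(n, divide_rate);
    return width == 0 ? 0 : (n + width - 1) / width;
}
```

Under this assumption, `chunk_width(2, 2)` is 1 and `chunk_count(2, 2)` is 2, so N=2 is processed as two single-column pieces, while `chunk_width(2, 1)` is 2 and `chunk_count(2, 1)` is 1, so the same range is processed in one pass.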

[Figures: gemm_divide_rate_1, gemm_divide_rate_2, gemm_divide_rate_3]

Successfully merging this pull request may close these issues.

Inefficiency of thread control with DIVIDE_RATE in GEMM