
sgemm/dgemm: add a way for an arch kernel to specify preferred sizes #1846


Merged
2 commits merged into OpenMathLib:develop on Nov 2, 2018

Conversation

fenrus75
Contributor

@fenrus75 fenrus75 commented Nov 1, 2018

sgemm/dgemm: add a way for an arch kernel to specify preferred sizes

The current gemm threading code can make very unfortunate choices. For
example, on my 10 core system a 1024x1024x1024 matrix multiply ends up
chunking into blocks of 102... which is not a vector-friendly size,
and performance ends up horrible.

This patch adds a helper define where an architecture can specify
a preference for size multiples.
This is different from the existing defines, which are minimum sizes and the like.

The performance increase with this patch for the 1024x1024x1024 sgemm
is 2.3x (!!)

In the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash.

During the audit for the crash, a few edge conditions in the if statements
were found and fixed as well.
@martin-frbg
Collaborator

Interesting to say the least... but I expect the net effect is smaller on more mundane AVX2 hardware?

@martin-frbg martin-frbg merged commit f1c0227 into OpenMathLib:develop Nov 2, 2018
@fenrus75 fenrus75 deleted the threadsize branch November 2, 2018 12:31
@fenrus75
Contributor Author

fenrus75 commented Nov 2, 2018

Only sort of. Let's say a 10 core (20 thread) system and a 1024x1024x1024 matrix:
it makes blocks 51 wide...

So that is a 32-wide stride, then a 16 (both can still be performant), then a 2 and a 1.
The 2 and the 1 each do a full "K loop" down and have a large number of memory accesses, but no real parallel math to speak of, so the 2 and the 1 each cost about the same as the 16... even on AVX2.

By rounding up to nice multiples, this same thread does 2 "K loops" of 32 each instead of 4 "K loops".

@martin-frbg martin-frbg added this to the 0.3.4 milestone Nov 3, 2018
@zerothi

zerothi commented Nov 5, 2018

A small nit-pick: it should be "preferred", no?
I can make a PR if desired.
