
sgemm/dgemm: add a way for an arch kernel to specify preferred sizes #1846


Merged
2 commits merged into OpenMathLib:develop on Nov 2, 2018

Conversation

fenrus75
Contributor

@fenrus75 fenrus75 commented Nov 1, 2018

sgemm/dgemm: add a way for an arch kernel to specify preferred sizes

The current gemm threading code can make very unfortunate choices. For
example, on my 10 core system a 1024x1024x1024 matrix multiply ends up
chunking into blocks of 102... which is not a vector-friendly size,
and performance ends up horrible.

This patch adds a helper define where an architecture can specify
a preference for size multiples.
This is different from the existing defines, which are minimum sizes and the like.

The performance increase with this patch for the 1024x1024x1024 sgemm
is 2.3x (!!)

In the threading code there are cases where N or M can become 0,
and the optimized beta code did not handle this well, leading
to a crash.

During the audit for the crash, a few edge conditions in the if statements
were found and fixed as well.
@martin-frbg
Collaborator

Interesting to say the least... but I expect the net effect is smaller on more mundane AVX2 hardware?

@martin-frbg martin-frbg merged commit f1c0227 into OpenMathLib:develop Nov 2, 2018
@fenrus75 fenrus75 deleted the threadsize branch November 2, 2018 12:31
@fenrus75
Contributor Author

fenrus75 commented Nov 2, 2018

Only sort of. Let's say a 10 core (20 thread) system and a 1024x1024x1024 matrix:
it makes blocks 51 wide...

So that is a 32-wide stride, then a 16 (both can still be performant), then a 2 and a 1.
The 2 and the 1 each do a full "K loop" down and have a large number of memory accesses, but no real parallel math to speak of, so the 2 and the 1 each cost about the same as the 16... even on AVX2.

By rounding up to nice multiples, this same thread does 2 "K loops" of 32 each instead of 4 "K loops".

@martin-frbg martin-frbg added this to the 0.3.4 milestone Nov 3, 2018
@zerothi

zerothi commented Nov 5, 2018

A small nit-pick: it should be "preferred", no?
I can make a PR if desired.
