
Significant performance increase for gemm, but not uniform (v0.30.0 develop) #1360


Closed
chrisics opened this issue Nov 16, 2017 · 5 comments

Comments

@chrisics

chrisics commented Nov 16, 2017

Hello,

We see that for single-threaded sgemm, the code from the develop branch (from early November 2017) shows a significant performance increase for most values of M, N, and K.

Our development environment is Windows, with clang, on a Haswell laptop, but we see the same behaviour on Linux.

See the attached graphs.
The results were obtained by timing a single call to OpenBLAS with randomized data, which explains why the measured throughput is low for small matrices: most of the time is spent waiting for the data to be loaded into the cache.
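For context, a timing harness of this kind might look like the following minimal sketch. This is an assumption about the benchmark setup, not the code used for the attached graphs, and it goes through SciPy's `scipy.linalg.blas.sgemm` wrapper rather than calling OpenBLAS directly from C:

```python
import time
import numpy as np
from scipy.linalg.blas import sgemm

def time_single_sgemm(m, n, k, seed=0):
    """Time one sgemm call on freshly randomized (cold-cache) data."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((m, k)).astype(np.float32)
    b = rng.standard_normal((k, n)).astype(np.float32)
    t0 = time.perf_counter()
    c = sgemm(1.0, a, b)  # C = 1.0 * A @ B, single precision
    elapsed = time.perf_counter() - t0
    # sgemm performs roughly 2*m*n*k floating-point operations
    gflops = 2.0 * m * n * k / elapsed / 1e9
    return c, gflops

c, gflops = time_single_sgemm(256, 256, 256)
print(f"256x256x256 sgemm: {gflops:.2f} GFLOP/s")
```

Because the inputs are freshly generated for each call, small sizes are dominated by cache-fill time, which matches the low throughput seen at the left edge of the graphs.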

We also see that for some sizes of N, M and K, there is a performance decrease compared to v0.20.0.
We are therefore reluctant to move to this version.

The spikes in the graphs for sgemv seem to indicate that the performance of gemv could be further improved.

I've read the Goto paper and I've tried to tune this by forcing a number of parameters in config.h via cpuid_x86.c, but to no avail.

Is there something I can try?
Do you expect a more uniform performance increase before this code is released in v0.30.0?

Thank you,

[Attached graphs: performancev30]

@martin-frbg
Collaborator

I believe the performance improvements in GEMM come almost exclusively from @timmoon10's work on the multithreading behaviour in PR #1320; the discussion there and in his initial PR #1316 has some pointers. (Another change since 0.2.20 fixed the L1 cache size detection on Haswell, but I suspect that is not an issue here.) I do not remember GEMV being specifically looked at in the same context, so the shifting of the peak performance may be collateral damage: one change from v1 to v2 of his patch was to eliminate peaks at hardware-dependent optimum thread counts in favor of higher and more predictable overall performance.

@martin-frbg
Collaborator

Ahem, it seems I overlooked the "single threaded" in your post. Are you sure of that?

@chrisics
Author

Yes, unless there is a bug in the cmake build.

This is the cmake command on Windows:

cmake -G "NMake Makefiles" .. -DARCH="x86_64" -DCMAKE_BUILD_TYPE=Release -DNUM_THREADS=1 -DBUILD_WITHOUT_LAPACK=TRUE -DGEMM_MULTITHREAD_THRESHOLD=1 -DBUILD_COMPLEX=FALSE -DBUILD_SINGLE=TRUE -DCMAKE_C_FLAGS="-DMAX_STACK_ALLOC=2048 -DFORCE_OPENBLAS_COMPLEX_STRUCT" -DCMAKE_ASM_FLAGS="-DMAX_STACK_ALLOC=2048 -DFORCE_OPENBLAS_COMPLEX_STRUCT"

@brada4
Contributor

brada4 commented Nov 16, 2017

The MN picture looks like a shaky timer (or core migration on SMP).
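If timer jitter or core migration is the suspect, repeating each measurement and reporting the minimum (or median) usually smooths the curves. A generic sketch of that idea (my own illustration, not code from this thread):

```python
import time
import statistics

def stable_time(fn, repeats=7):
    """Run fn several times and report the minimum and median wall time.

    Taking the minimum damps one-off spikes caused by timer noise or the
    thread being migrated to another core mid-measurement; the median is a
    more conservative summary of typical performance.
    """
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - t0)
    return min(samples), statistics.median(samples)

best, typical = stable_time(lambda: sum(range(10000)))
print(f"best={best:.6f}s median={typical:.6f}s")
```

Pinning the benchmark process to one core (e.g. with `taskset` on Linux or `start /affinity` on Windows) would address the core-migration hypothesis directly.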

@isuruf
Contributor

isuruf commented Dec 1, 2017

cmake -DNUM_THREADS=1 doesn't work as intended; #1377 fixes that.
