My app performs many small dgemms, each invoked by a separate thread (via a task pool). As recommended, I compiled OpenBLAS 0.3.10 with USE_THREAD=0 and USE_LOCKING=1. This is on a Cavium ThunderX2 with gcc 9.2.0.
Thread scaling is really bad, presumably because of the locking. Replacing the OpenBLAS dgemm call with a hand-written kernel using NEON intrinsics gives comparable single-thread performance, but with 30 threads pinned to the cores of a single socket (which has 32 physical cores) the hand-written code is about 6x faster, entirely due to superior thread scaling.
The dominant dgemm sizes are (12,144)^T (12,12) and (16,256)^T (16,16).
I wonder if there is a lower-level interface to your optimized small-matrix kernels that I could invoke, one that bypasses the static memory buffers that need lock protection?
Also, please note that the default Red Hat EL8.3 openblas_serial package is not compiled with USE_LOCKING, so it produces incorrect results in this use case.
Finally, many thanks for OpenBLAS ... it is a tremendously valuable tool, and I appreciate the effort it takes to make it happen.
Thanks
Robert
This was implemented for SkylakeX SGEMM fairly recently (see interface/gemm.c and the x86_64 sgemm_kernel_direct it calls), but it has not been ported to other architectures or functions yet.