performance degrades with more threads on NUMA machine #611

Closed
nudles opened this issue Aug 3, 2015 · 14 comments · Fixed by #4585

@nudles

nudles commented Aug 3, 2015

I am using OpenBLAS in my program.
My machine has 4 CPUs, each with 6 cores. Hyper-threading is turned on.
I ran a test varying OPENBLAS_NUM_THREADS over 1, 2, 4, 8, 16, 32.
The running time decreases from 1 to 8 threads.
But the performance with 16 threads is slightly worse than with 8 threads.
The performance with 32 threads is worse still.
I tested with both small and large matrix multiplications, e.g., 100x3000 and 3000x300 matrices.

Has anyone observed a similar phenomenon?
Do I need any specific configuration when compiling OpenBLAS for a NUMA machine?

Thanks.

@jeromerobert
Contributor

I did see problems with small matrices (see #478), but not with large ones. For small matrices the problem is that the cost of blas_memory_alloc increases sharply with the number of threads. This should be fixed for gemv and ger, but remains for all the other functions that use blas_memory_alloc.

Could you tell us:

  • the version you are using
  • the compilation options you used
  • the functions you are using (or better, give the source code of your test)

@xianyi
Collaborator

xianyi commented Aug 17, 2015

@nudles, did you enable affinity when compiling?

Edit Makefile.rule and delete or comment out the NO_AFFINITY=1 line.
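
A minimal shell sketch of that edit and rebuild, run from the OpenBLAS source root (the sed command is just one way to comment the line out; editing by hand works too):

    # comment out NO_AFFINITY in Makefile.rule, then rebuild
    sed -i 's/^NO_AFFINITY *= *1/# NO_AFFINITY = 1/' Makefile.rule
    make clean && make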

@brada4
Contributor

brada4 commented Nov 28, 2015

The result is fairly normal, and is what NUMA stands for: local memory close to a processor is 2-20x faster than memory accessed across the NUMA interconnect.
Peak efficiency is the 6-core (single-socket) configuration, and you can run 4 copies of that in parallel on your machine.
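
A hedged sketch of that scheme with numactl (the node number and program name are assumptions; repeat with nodes 1-3 for the other copies):

    # pin one 6-thread copy to NUMA node 0, both CPUs and memory
    OPENBLAS_NUM_THREADS=6 numactl --cpunodebind=0 --membind=0 ./my_program &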

@xianyi
Collaborator

xianyi commented Nov 30, 2015

Please delete NO_AFFINITY=1 from Makefile.rule, which will enable OpenBLAS's NUMA mode.

@brada4
Contributor

brada4 commented Nov 30, 2015

That's going to work if the input matrix covers many NUMA strides, so that it cooperates with the kernel's NUMA memory mover and/or numad. With a small input matrix we get 3/4 of the threads accessing it over the slow QPI link while battling the system's NUMA balancing...

@xianyi
Collaborator

xianyi commented Dec 2, 2015

@brada4, oh, the small matrix may be a problem.

@brada4
Contributor

brada4 commented Dec 2, 2015

100x3000 and 3000x300 matrices are 2.4 MB and 7.2 MB in doubles (100 * 3000 * 8 B = 2.4 MB; 3000 * 300 * 8 B = 7.2 MB)...

@nudles
Author

nudles commented Dec 3, 2015

We (@dbxinj) compiled OpenBLAS without NO_AFFINITY=1, but the performance did not change: with more than 8 threads, performance still degraded.
The program trains a deep learning model with Caffe (https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_quick_train_test.prototxt).
We set BLAS_LIB and BLAS_INCLUDE in Makefile.config.example to point to the OpenBLAS folder.
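
For reference, the relevant lines in Caffe's Makefile.config look roughly like this (the paths are placeholders, not our actual ones):

    BLAS := open
    # point these at the OpenBLAS headers and libraries
    BLAS_INCLUDE := /path/to/OpenBLAS/include
    BLAS_LIB := /path/to/OpenBLAS/lib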

@brada4
Contributor

brada4 commented Dec 3, 2015

You need at least 4 times the NUMA stride's worth of data, the stride being 2 MB .. 256 MB depending on your hardware.
6 threads will be faster than 8 on your small data and small CPU.

@dbxinj

dbxinj commented Dec 7, 2015

We (@nudles) used OpenBLAS to run tests on the CaffeNet model.
We used a 24-core server with 500 GB of memory. The 24 cores are distributed across 4 NUMA nodes (Intel Xeon 7540).
OPENBLAS_NUM_THREADS varies over 1, 2, 4, 8, 12, 16, 20, 24, 28, 32.
The running time per iteration drops from 96 s to 34 s as the thread count goes from 1 to 20, but it starts to increase slightly and steadily once the thread count exceeds 20. The running time per iteration with 32 threads is about 48 s.

@xianyi
Collaborator

xianyi commented Dec 7, 2015

@nudles, what's the input size for sgemm?

@brada4
Contributor

brada4 commented Dec 7, 2015

It is a 6-core processor; hyperthreads don't count for HPC.

@ijingo

ijingo commented Dec 8, 2015

The sizes of the operand matrices we (@nudles @dbxinj) used in the CaffeNet model (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet) as inputs to sgemm are as follows:

size of A     size of B
96x363        363x3025
256x1200      1200x729
384x2304      2304x169
384x1728      1728x169
256x1728      1728x169
256x43264     43264x4096
256x4096      4096x4096
256x4096      4096x1000
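
For concreteness, a minimal C sketch of the first of these calls through the CBLAS interface (the dummy fill values and the build line are assumptions):

    /* build: gcc sgemm_case.c -o sgemm_case -lopenblas */
    #include <stdlib.h>
    #include <cblas.h>

    int main(void) {
        const int M = 96, K = 363, N = 3025;         /* C(MxN) = A(MxK) * B(KxN) */
        float *A = malloc(sizeof(float) * M * K);
        float *B = malloc(sizeof(float) * K * N);
        float *C = malloc(sizeof(float) * M * N);
        if (!A || !B || !C) return 1;
        for (int i = 0; i < M * K; i++) A[i] = 1.0f; /* dummy data */
        for (int i = 0; i < K * N; i++) B[i] = 1.0f;

        /* C = 1.0*A*B + 0.0*C, row-major, no transposes */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

        free(A); free(B); free(C);
        return 0;
    }

The thread count is then controlled at run time via OPENBLAS_NUM_THREADS.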

@brada4
Contributor

brada4 commented Dec 8, 2015

Only the biggest is a good candidate for multiple CPUs; the second biggest is 2MB + 32MB and will not gain from multiple CPUs (the Nehalem NUMA stride is 64MB in Intel's reference design).
The good news is that you can schedule 4 sgemm-s, each with 6 CPUs, in parallel using make or xargs, as sketched below.
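
A hedged shell sketch of that kind of scheduling (run_case.sh and the case names are hypothetical placeholders):

    # run up to 4 cases at a time, each with 6 OpenBLAS threads
    printf '%s\n' case1 case2 case3 case4 case5 case6 \
        | OPENBLAS_NUM_THREADS=6 xargs -n1 -P4 ./run_case.sh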
