performance degrades with more threads on NUMA machine #611

Closed
nudles opened this issue Aug 3, 2015 · 14 comments · Fixed by #4585

@nudles

nudles commented Aug 3, 2015

I am using OpenBLAS in my program.
My machine has 4 CPUs, each with 6 cores. Hyper-threading is turned on.
I ran a test varying OPENBLAS_NUM_THREADS over 1, 2, 4, 8, 16, 32.
The running time decreases from 1 to 8 threads.
But the performance with 16 threads is slightly worse than with 8 threads.
The performance with 32 threads is worse still.
I tested with both small and large matrix multiplications, e.g., 100x3000 and 3000x300 matrices.

Has anyone observed a similar phenomenon?
Do I need any specific configuration when compiling OpenBLAS for a NUMA machine?

Thanks.

@jeromerobert
Contributor

I did see problems with small matrices (see #478), but not with large ones. For small matrices the problem is that the cost of blas_memory_alloc increases sharply with the number of threads. This should be fixed for gemv and ger, but remains for all the other functions that use blas_memory_alloc.

Could you tell us:

  • the version you are using
  • the compilation options you used
  • the functions you are using (or better, give the source code of your test)

@xianyi
Collaborator

xianyi commented Aug 17, 2015

@nudles, did you enable affinity when compiling?

Edit Makefile.rule and delete or comment out the NO_AFFINITY=1 line.
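
A minimal shell sketch of that edit and rebuild, run from the OpenBLAS source root (the sed command is just one way to comment the line out; editing by hand works too):

    # comment out NO_AFFINITY in Makefile.rule, then rebuild
    sed -i 's/^NO_AFFINITY *= *1/# NO_AFFINITY = 1/' Makefile.rule
    make clean && make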

@brada4
Contributor

brada4 commented Nov 28, 2015

The result is fairly normal, and is what NUMA stands for: local memory close to a processor is 2-20x faster than memory accessed across the NUMA interconnect.
Peak efficiency is the 6-core (single-socket) configuration, and you can run 4 copies of that in parallel on your machine.
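
A hedged sketch of that scheme with numactl (the node number and program name are assumptions; repeat with nodes 1-3 for the other copies):

    # pin one 6-thread copy to NUMA node 0, both CPUs and memory
    OPENBLAS_NUM_THREADS=6 numactl --cpunodebind=0 --membind=0 ./my_program &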

@xianyi
Collaborator

xianyi commented Nov 30, 2015

Please delete NO_AFFINITY=1 from Makefile.rule, which will enable OpenBLAS's NUMA mode.

@brada4
Contributor

brada4 commented Nov 30, 2015

That's going to work if the input matrix covers many NUMA strides, so that it cooperates with the kernel's NUMA memory mover and/or numad. With a small input matrix we get 3/4 of the threads accessing it over the slow QPI link while battling the system's NUMA balancing...

@xianyi
Collaborator

xianyi commented Dec 2, 2015

@brada4, oh, the small matrix may be a problem.

@brada4
Contributor

brada4 commented Dec 2, 2015

100x3000 and 3000x300 matrices are 2.4 MB and 7.2 MB in doubles (100 * 3000 * 8 B = 2.4 MB; 3000 * 300 * 8 B = 7.2 MB)...

@nudles
Author

nudles commented Dec 3, 2015

We (@dbxinj) compiled OpenBLAS without NO_AFFINITY=1, but the performance did not change: with more than 8 threads, performance still degraded.
The program trains a deep learning model with Caffe (https://github.com/BVLC/caffe/blob/master/examples/cifar10/cifar10_quick_train_test.prototxt).
We set BLAS_LIB and BLAS_INCLUDE in Makefile.config.example to point to the OpenBLAS folder.
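
For reference, the relevant lines in Caffe's Makefile.config look roughly like this (the paths are placeholders, not our actual ones):

    BLAS := open
    # point these at the OpenBLAS headers and libraries
    BLAS_INCLUDE := /path/to/OpenBLAS/include
    BLAS_LIB := /path/to/OpenBLAS/lib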

@brada4
Contributor

brada4 commented Dec 3, 2015

You need at least 4 times the NUMA stride's worth of data, the stride being 2 MB .. 256 MB depending on your hardware.
6 threads will be faster than 8 on your small data and small CPU.

@dbxinj

dbxinj commented Dec 7, 2015

We (@nudles) used OpenBLAS to run tests on the CaffeNet model.
We used a 24-core server with 500 GB of memory. The 24 cores are distributed across 4 NUMA nodes (Intel Xeon 7540).
OPENBLAS_NUM_THREADS varies over 1, 2, 4, 8, 12, 16, 20, 24, 28, 32.
The running time per iteration drops from 96 s to 34 s as the thread count goes from 1 to 20, but it starts to increase slightly and steadily once the thread count exceeds 20. The running time per iteration with 32 threads is about 48 s.

@xianyi
Collaborator

xianyi commented Dec 7, 2015

@nudles, what's the input size for sgemm?

@brada4
Contributor

brada4 commented Dec 7, 2015

It is a 6-core processor; hyperthreads don't count for HPC.

@ijingo

ijingo commented Dec 8, 2015

The sizes of the operand matrices we (@nudles @dbxinj) used in the CaffeNet model (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet) as inputs to sgemm are as follows:

size of A     size of B
96x363        363x3025
256x1200      1200x729
384x2304      2304x169
384x1728      1728x169
256x1728      1728x169
256x43264     43264x4096
256x4096      4096x4096
256x4096      4096x1000
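
For concreteness, a minimal C sketch of the first of these calls through the CBLAS interface (the dummy fill values and the build line are assumptions):

    /* build: gcc sgemm_case.c -o sgemm_case -lopenblas */
    #include <stdlib.h>
    #include <cblas.h>

    int main(void) {
        const int M = 96, K = 363, N = 3025;         /* C(MxN) = A(MxK) * B(KxN) */
        float *A = malloc(sizeof(float) * M * K);
        float *B = malloc(sizeof(float) * K * N);
        float *C = malloc(sizeof(float) * M * N);
        if (!A || !B || !C) return 1;
        for (int i = 0; i < M * K; i++) A[i] = 1.0f; /* dummy data */
        for (int i = 0; i < K * N; i++) B[i] = 1.0f;

        /* C = 1.0*A*B + 0.0*C, row-major, no transposes */
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

        free(A); free(B); free(C);
        return 0;
    }

The thread count is then controlled at run time via OPENBLAS_NUM_THREADS.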

@brada4
Contributor

brada4 commented Dec 8, 2015

Only the biggest is a good candidate for multiple CPUs; the second biggest is 2MB + 32MB and will not gain from multiple CPUs (the Nehalem NUMA stride is 64MB in Intel's reference design).
The good news is that you can schedule 4 sgemm-s, each with 6 CPUs, in parallel using make or xargs, as sketched below.
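
A hedged shell sketch of that kind of scheduling (run_case.sh and the case names are hypothetical placeholders):

    # run up to 4 cases at a time, each with 6 OpenBLAS threads
    printf '%s\n' case1 case2 case3 case4 case5 case6 \
        | OPENBLAS_NUM_THREADS=6 xargs -n1 -P4 ./run_case.sh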
