performance degrades with more threads on NUMA machine #611
I did see problems with small matrices (see #478) but not with large matrices. For small matrices the problem is that the cost of spawning and synchronizing threads outweighs the computation itself. Could you tell us:
@nudles, did you enable affinity when compiling?
The result is fairly normal, and is what NUMA implies: local memory close to the processor is 2-20x faster than memory accessed across the NUMA interconnect.
That's going to work if the input matrix covers many NUMA strides, so that it plays well with the kernel's NUMA memory mover and/or numad. With a small input matrix we get 3/4 of the threads accessing it over the slow QPI link and battling the system's NUMA balancing...
@brada4, oh, the small matrix may be a problem.
100x3000 .. 3000x300 is 2.4 MB .. 7.2 MB in doubles...
We (@dbxinj) compiled OpenBLAS without NO_AFFINITY=1, but the performance did not change: with more than 8 threads, the performance still degraded.
You need data at least 4 times the NUMA stride size, with the latter being 2 MB .. 256 MB depending on your hardware.
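As a quick sanity check on these numbers, here is a small Python sketch of that heuristic. The 64 MB stride below is the Nehalem reference figure quoted later in the thread; it is an assumption, and other hardware ranges from 2 MB to 256 MB:

```python
# Rough check: does a double-precision matrix span enough NUMA strides
# to benefit from threads on remote nodes? The stride size is hardware-
# dependent; 64 MB is the Nehalem reference figure assumed here.

NUMA_STRIDE = 64 * 1024 * 1024  # bytes, assumed Nehalem reference design

def footprint(rows, cols, elem_size=8):
    """Memory footprint in bytes of a dense matrix of 8-byte doubles."""
    return rows * cols * elem_size

def worth_spreading(rows, cols, stride=NUMA_STRIDE):
    """Heuristic from the thread: want at least 4x the stride size."""
    return footprint(rows, cols) >= 4 * stride

# The two shapes from the report: 2.4 MB and 7.2 MB, both far below
# 4 * 64 MB, so they fit comfortably on a single NUMA node.
print(footprint(100, 3000))        # 2400000
print(footprint(3000, 300))        # 7200000
print(worth_spreading(100, 3000))  # False
```

By this rule of thumb, neither of the reported matrices justifies spilling threads onto remote NUMA nodes.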
We (@nudles) used openblas to do tests on CaffeNet model. |
@nudles, what's the input size for sgemm?
It is a 6-core processor; hyper-threads don't count for HPC.
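To see the distinction in practice, a small sketch that counts logical CPUs versus physical cores. The `/proc/cpuinfo` parsing is a Linux-only assumption; on the 4-socket, 6-core machine described here it should report 48 logical CPUs but only 24 physical cores:

```python
import os

def logical_cpus():
    """Logical CPUs as the OS sees them (includes hyper-threads)."""
    return os.cpu_count()

def physical_cores():
    """Count unique (physical id, core id) pairs from /proc/cpuinfo.
    Linux-only; returns None where that file is unavailable."""
    try:
        with open("/proc/cpuinfo") as f:
            cores = set()
            phys = None
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":")[1].strip()
                elif line.startswith("core id"):
                    cores.add((phys, line.split(":")[1].strip()))
            return len(cores) or None
    except OSError:
        return None

print(logical_cpus(), physical_cores())
```

Capping OPENBLAS_NUM_THREADS at the physical core count (or at the cores of one socket, for small inputs) is the usual starting point.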
The sizes of the operator matrices we (@nudles @dbxinj) used in the CaffeNet model (https://github.com/BVLC/caffe/tree/master/models/bvlc_reference_caffenet) as input to sgemm are as follows:
Only the biggest is a good candidate for multiple CPUs; the second biggest is 2 MB + 32 MB and will not gain from multiple CPUs (the Nehalem NUMA stride is 64 MB in Intel's reference design).
I am using openblas in my program.
My machine has 4 CPUs each with 6 cores. Hyper-threading is turned on.
I did a test varying the OPENBLAS_NUM_THREADS from 1,2,4,8,16,32.
The running time is reduced from 1 to 8 threads.
But the performance of 16 threads is slightly worse than 8 threads.
The performance for 32 threads is even worse.
I tested with both small and large matrix multiplication, e.g., 100x3000 and 3000x300 matrices.
Did anyone observe a similar phenomenon?
Do I need to do any specific configuration when compiling openblas for NUMA machine?
Thanks.