OpenMP thread placement and affinity #1653
Comments
There is no one-size-fits-all guide... In effect you did the equivalent of disabling hyperthreading using environment variables.
@brada4 can you then suggest a better solution for this case? E.g. would OMP_NUM_THREADS=8 work with the given definition of OMP_PLACES to add one hyperthread on each core, or are things not that simple? I agree that documentation on this, in the GitHub wiki or elsewhere, would be helpful.
I believe that if I set OMP_NUM_THREADS=8 with OMP_PLACES="{0,1,2,3}" then it would run 2 threads on each core with no hyperthreading, since places 4-7 are the hyperthreaded siblings of cores 0-3.
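As a way to check what place list the runtime actually builds from OMP_PLACES, here is a minimal diagnostic sketch (not from this thread; the file name is my own choice) that prints each place and the logical CPUs it covers:

```c
/* places_dump.c - print the OpenMP place list as the runtime sees it,
 * so you can check whether the hyperthread siblings (e.g. CPUs 4-7)
 * are included in any place.
 * Build with something like: gcc -fopenmp places_dump.c -o places_dump */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int nplaces = omp_get_num_places();
    printf("num_places = %d\n", nplaces);
    for (int p = 0; p < nplaces; p++) {
        int n = omp_get_place_num_procs(p);
        int *ids = malloc((size_t)n * sizeof(int));
        if (!ids) return 1;
        omp_get_place_proc_ids(p, ids);   /* logical CPU ids in this place */
        printf("place %d:", p);
        for (int i = 0; i < n; i++)
            printf(" %d", ids[i]);
        printf("\n");
        free(ids);
    }
    return 0;
}
```

Running it with OMP_PLACES="{0,1,2,3}" versus OMP_PLACES=cores should show directly whether CPUs 4-7 appear in the place list.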
I'll add that MKL seems to get this right without any intervention from the user.
I'll comment that it would be kind of sad if it did not, with a team of paid professionals behind it.
From http://forum.openmp.org/forum/viewtopic.php?f=3&t=1731 you could try setting (only)
Interestingly, I don't see MKL using more than 4 threads on this system, even on fairly large tasks and even with OMP_PLACES left unset and OMP_NUM_THREADS set to 8. The OMP default environment is: OPENMP DISPLAY ENVIRONMENT BEGIN
I tried OMP_PLACES=cores, but with OMP_NUM_THREADS unset it used 8 threads and performed poorly; it appears that "cores" includes all 8 of the virtual cores that GOMP sees.
I also tried OMP_PLACES=cores with OMP_NUM_THREADS=4. This ran at about the same speed as specifying OMP_PLACES="{0,1,2,3}", but htop showed that it was shifting work from (e.g.) core 1 to its sibling core 5 (core 1's hyperthreaded sibling) and back.
I also tried OMP_PLACES="{0,1,2,3}" and OMP_NUM_THREADS=8. With this configuration the performance was poor, and htop showed that only cores 0-3 were in use even though there were 8 threads, so it wasn't using hyperthreading.
I conclude that OMP_PLACES="{0,1,2,3}" is effective at stopping the system from using hyperthreading, while OMP_PLACES=cores doesn't stop it. OMP_NUM_THREADS=4 is consistently better than 8, and OMP_PROC_BIND=spread seems to keep the threads pinned to the cores they started on.
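To observe the same migration from inside the program rather than via htop, a small diagnostic along these lines could be used. This is my own sketch (thread_map.c is a hypothetical name), and it assumes Linux for sched_getcpu():

```c
/* thread_map.c - each OpenMP thread reports its place and the logical CPU
 * it is currently running on, which shows whether threads stay where
 * OMP_PLACES / OMP_PROC_BIND put them.
 * Build with something like: gcc -fopenmp thread_map.c -o thread_map */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* output order is not guaranteed; this is just a diagnostic */
        printf("thread %d of %d: place %d, cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), sched_getcpu());
    }
    return 0;
}
```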
Is there any performance impact? In principle the cache is shared, so it should be close to none... Can you modify the Intel perf bias register and confirm that, in general, one of the two hyperthreads gives the same or better numeric performance as both? That is a documented limitation of the old HT Atom, for example.
There was no apparent performance impact from the switching between sibling hyperthreaded virtual cores, so that probably isn't hurting; I agree that there's no theoretical reason it should hurt much, since the cache is shared. However, the OS does have to do some bookkeeping to move a thread between cores, even if the move is just to a sibling hyperthreaded virtual core. I don't know what the "Intel perf bias register" is.
This one: The accounting done for such a process move is minimal, because the memory context is "hot" in the shared L3 cache, so it is not really much more work than the normal context switches for timer/stats interrupts.
In my testing, on a 4-core two-way hyperthreaded Xeon-W Skylake machine, I've found that the following environment variable settings produce consistently high performance:
OMP_NUM_THREADS=4
OMP_PLACES="{0,1,2,3}"
OMP_PROC_BIND=spread
This tells the OpenMP library to allow up to 4 threads, to start threads only on cores 0-3 (and not on the hyperthreaded siblings 4-7), and to spread the threads out over those cores as they're started. I believe it also implies thread affinity, so threads won't move between cores.
I find that if I don't set these environment variables, the performance is generally worse and can also be much more variable. For example, on a simple test of matrix multiplication with OMP_NUM_THREADS=4, the run times varied from 6.23 to 8.94 seconds over four tests. After setting OMP_PROC_BIND and OMP_PLACES, the run times varied from 5.44 to 5.49 seconds over four tests.
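For illustration, a sketch of the kind of matrix-multiplication test meant here (not my exact program; the size n = 4000, the dgemm_bench name, and the use of cblas_dgemm through OpenBLAS are assumptions) is:

```c
/* dgemm_bench.c - time a single large DGEMM so the effect of the
 * OMP_* settings above can be compared.
 * Link against OpenBLAS, e.g.:
 *   gcc -O2 -fopenmp dgemm_bench.c -lopenblas -o dgemm_bench */
#include <cblas.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 4000;                       /* assumed problem size */
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < (long)n * n; i++) {  /* arbitrary test data */
        a[i] = 1.0 / (double)(i + 1);
        b[i] = 2.0 / (double)(i + 2);
        c[i] = 0.0;
    }

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double t1 = omp_get_wtime();
    printf("dgemm %dx%d: %.2f s\n", n, n, t1 - t0);

    free(a); free(b); free(c);
    return 0;
}
```

Running it as `OMP_NUM_THREADS=4 OMP_PLACES="{0,1,2,3}" OMP_PROC_BIND=spread ./dgemm_bench`, and again with the variables unset, makes it easy to compare the two configurations described above.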
Is there any more general advice on how to control thread placement and affinity for the best performance with OpenBLAS? What about systems with more cores and multiple sockets? Could information about this be added to the documentation?