OpenMP thread placement and affinity #1653

Open
brianborchers opened this issue Jun 30, 2018 · 10 comments


@brianborchers

In my testing, on a 4-core two-way hyperthreaded Xeon-W Skylake machine, I've found that the following environment variable settings produce consistently high performance:

OMP_NUM_THREADS=4
OMP_PLACES="{0,1,2,3}"
OMP_PROC_BIND=spread

This tells the OpenMP runtime to allow up to 4 threads, restricts those threads to logical CPUs 0-3 (excluding the hyperthreaded siblings 4-7), and spreads the threads out over the cores as they are started. I believe it also implies thread affinity, so threads won't migrate between cores.
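As a concrete sketch (assuming a Bourne-style shell; `./my_blas_app` is a placeholder for any program linked against an OpenMP build of OpenBLAS):

```shell
# Allow up to 4 OpenMP threads, confine them to logical CPUs 0-3
# (the physical cores on this box; 4-7 are their hyperthread siblings),
# and spread the threads over the place list as they start.
export OMP_NUM_THREADS=4
export OMP_PLACES="{0,1,2,3}"
export OMP_PROC_BIND=spread
# ./my_blas_app    # placeholder for your OpenBLAS-linked program
```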

I find that if I don't set these environment variables, performance is generally worse and can also be much more variable. For example, on a simple matrix-multiplication test with only OMP_NUM_THREADS=4 set, the run times varied from 6.23 to 8.94 seconds over four runs. After also setting OMP_PROC_BIND and OMP_PLACES, the run times varied from 5.44 to 5.49 seconds over four runs.

Is there any more general advice on how to control thread placement and affinity for the best performance with OpenBLAS? What about systems with more cores and multiple sockets? Could information about this be added to the documentation?

@brada4
Contributor

brada4 commented Jun 30, 2018

There is no one-size-fits-all guide... In effect, you did the equivalent of disabling hyperthreading via environment variables.

@martin-frbg
Collaborator

@brada4 can you then suggest a better solution for this case? E.g. would OMP_NUM_THREADS=8 work with the given definition of OMP_PLACES to add one hyperthread on each core, or are things not that simple? I agree that documentation on this, in the GitHub wiki or elsewhere, would be helpful.

@brianborchers
Author

I believe that if I set OMP_NUM_THREADS=8 with OMP_PLACES="{0,1,2,3}", then it would run 2 threads on each core without using hyperthreading; places 4-7 are the hyperthreaded siblings of cores 0-3.

@brianborchers
Author

I'll add that MKL seems to get this right without any intervention from the user.

@martin-frbg
Collaborator

I'll comment that it would be kind of sad if it did not, with a team of paid professionals behind it.

@martin-frbg
Collaborator

From http://forum.openmp.org/forum/viewtopic.php?f=3&t=1731 you could try setting (only) OMP_DISPLAY_ENV=TRUE to see what the libgomp default behaviour is, and OMP_PLACES=cores to get (probably) the same behaviour as from your explicit list of cores. (And does MKL make use of hyperthreading at all on your system?)
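A minimal way to try this (sketch; `./my_app` stands in for any binary linked against libgomp — the runtime prints the report to stderr at startup):

```shell
# Ask libgomp to dump its effective OpenMP settings when the next
# OpenMP program starts; the report goes to stderr, so redirect it
# if you want to grep through it.
export OMP_DISPLAY_ENV=TRUE
# ./my_app 2>&1 | grep OMP_PROC_BIND    # placeholder binary
```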

@brianborchers
Author

Interestingly, I don't see MKL using more than 4 threads on this system, even on fairly large tasks and even with OMP_PLACES left unset and OMP_NUM_THREADS set to 8.

The OMP default environment is:

OPENMP DISPLAY ENVIRONMENT BEGIN
_OPENMP = '201511'
OMP_DYNAMIC = 'FALSE'
OMP_NESTED = 'FALSE'
OMP_NUM_THREADS = '8'
OMP_SCHEDULE = 'DYNAMIC'
OMP_PROC_BIND = 'FALSE'
OMP_PLACES = ''
OMP_STACKSIZE = '0'
OMP_WAIT_POLICY = 'PASSIVE'
OMP_THREAD_LIMIT = '4294967295'
OMP_MAX_ACTIVE_LEVELS = '2147483647'
OMP_CANCELLATION = 'FALSE'
OMP_DEFAULT_DEVICE = '0'
OMP_MAX_TASK_PRIORITY = '0'
OPENMP DISPLAY ENVIRONMENT END

I tried OMP_PLACES=cores, but with OMP_NUM_THREADS unset it used 8 threads and performed poorly; it appears that "cores" includes all 8 of the virtual cores that GOMP sees.

I also tried OMP_PLACES=cores, with OMP_NUM_THREADS=4. This ran at about the same speed as specifying OMP_PLACES="{0,1,2,3}" but htop showed that it was shifting work from (e.g.) core 1 to its sibling core 5 (core 1's hyperthreaded sibling) and back.

I also tried using OMP_PLACES="{0,1,2,3}" and OMP_NUM_THREADS=8. With this configuration the performance was poor and htop showed that only cores 0-3 were in use even though there were 8 threads. Thus it wasn't using hyperthreading.

I conclude that

OMP_PLACES="{0,1,2,3}" is effective at stopping the system from using hyperthreading.

OMP_PLACES=cores doesn't stop hyperthreading.

OMP_NUM_THREADS=4 is consistently better than 8.

OMP_PROC_BIND=spread seems to keep the threads pinned to the cores they started on.
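For larger or multi-socket systems, one hedged way to build such an explicit place list is to read the sibling topology from Linux sysfs first (paths assume the standard Linux layout; on the machine above, core 0's sibling list would be something like "0,4"), keeping only the first CPU of each sibling pair:

```shell
# Count logical CPUs, then list which ones are hyperthread siblings
# sharing core 0; a no-hyperthreading OMP_PLACES list keeps only the
# first CPU from each sibling group.
nproc
cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list 2>/dev/null \
  || echo "no sysfs topology information available"
```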

@brada4
Contributor

brada4 commented Jul 1, 2018

"It was shifting work from (e.g.) core 1 to its sibling core 5 (core 1's hyperthreaded sibling) and back."

Is there any performance impact? In principle the cache is shared, so the impact should be close to none...

Can you modify the Intel perf bias register and confirm that, in general, one hyperthread of 2 gives the same or better numeric performance as both? It is a documented limitation of the old HT Atom, for example.

@brianborchers
Author

There was no apparent performance impact from the switching between sibling hyperthreaded virtual cores, so that probably isn't hurting. I agree that there's no theoretical reason it should hurt much, since the cache is shared. However, the OS does have to do some bookkeeping to move a thread between cores, even if the move is just to a sibling hyperthreaded virtual core.

I don't know what the "intel perf bias register" is.

@brada4
Contributor

brada4 commented Jul 1, 2018

This one:
$ man 8 x86_energy_perf_policy
It is specific to Intel processors; it reprograms the processor for speed/power efficiency, but it also levels the resources available to hyperthreaded cores.

The accounting done for such a process move is minimal, because all the memory context is "hot" in the shared L3 cache; it is not really much more work than the normal context switches for timer/stats interrupts.
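A hedged sketch of how one might inspect and change that bias (the tool ships with linux-tools, needs root, and only applies to Intel CPUs; the exact policy names vary between versions, so check the man page first):

```shell
# Locate the man page if installed, then (as root) read or set the
# energy-performance bias. The sudo lines are commented out because
# they modify MSRs on real hardware.
man -w x86_energy_perf_policy 2>/dev/null || echo "man page not installed"
# sudo x86_energy_perf_policy              # show current policy
# sudo x86_energy_perf_policy performance  # bias toward performance
```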
