OpenMP thread placement and affinity #1653
Comments
There is no one-size-fits-all guide... In effect you did the equivalent of disabling hyperthreading using environment variables.
@brada4 can you then suggest a better solution for this case? E.g. would OMP_NUM_THREADS=8 work with the given definition of OMP_PLACES to add one hyperthread on each core, or are things not that simple? I agree that documentation on this, in the GitHub wiki or elsewhere, would be helpful.
I believe that if I set OMP_NUM_THREADS=8 with OMP_PLACES="{0,1,2,3}" then it would run 2 threads on each core with no hyperthreading, since places 4-7 are the hyperthreaded siblings of cores 0-3.
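As a way to check what place list the runtime actually builds from OMP_PLACES, here is a minimal diagnostic sketch (not from this thread; the file name is my own choice) that prints each place and the logical CPUs it covers:

```c
/* places_dump.c - print the OpenMP place list as the runtime sees it,
 * so you can check whether the hyperthread siblings (e.g. CPUs 4-7)
 * are included in any place.
 * Build with something like: gcc -fopenmp places_dump.c -o places_dump */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int nplaces = omp_get_num_places();
    printf("num_places = %d\n", nplaces);
    for (int p = 0; p < nplaces; p++) {
        int n = omp_get_place_num_procs(p);
        int *ids = malloc((size_t)n * sizeof(int));
        if (!ids) return 1;
        omp_get_place_proc_ids(p, ids);   /* logical CPU ids in this place */
        printf("place %d:", p);
        for (int i = 0; i < n; i++)
            printf(" %d", ids[i]);
        printf("\n");
        free(ids);
    }
    return 0;
}
```

Running it with OMP_PLACES="{0,1,2,3}" versus OMP_PLACES=cores should show directly whether CPUs 4-7 appear in the place list.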
I'll add that MKL seems to get this right without any intervention from the user.
I'll comment that it would be kind of sad if it did not, with a team of paid professionals behind it.
From http://forum.openmp.org/forum/viewtopic.php?f=3&t=1731 you could try setting (only)
Interestingly, I don't see MKL using more than 4 threads on this system, even on fairly large tasks and even with OMP_PLACES left unset and OMP_NUM_THREADS set to 8. The OMP default environment is: OPENMP DISPLAY ENVIRONMENT BEGIN
I tried OMP_PLACES=cores, but with OMP_NUM_THREADS unset it used 8 threads and performed poorly; it appears that "cores" includes all 8 of the virtual cores that GOMP sees.
I also tried OMP_PLACES=cores with OMP_NUM_THREADS=4. This ran at about the same speed as specifying OMP_PLACES="{0,1,2,3}", but htop showed that it was shifting work from (e.g.) core 1 to its sibling core 5 (core 1's hyperthreaded sibling) and back.
I also tried OMP_PLACES="{0,1,2,3}" and OMP_NUM_THREADS=8. With this configuration the performance was poor, and htop showed that only cores 0-3 were in use even though there were 8 threads, so it wasn't using hyperthreading.
I conclude that OMP_PLACES="{0,1,2,3}" is effective at stopping the system from using hyperthreading, while OMP_PLACES=cores doesn't stop it. OMP_NUM_THREADS=4 is consistently better than 8, and OMP_PROC_BIND=spread seems to keep the threads pinned to the cores they started on.
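To observe the same migration from inside the program rather than via htop, a small diagnostic along these lines could be used. This is my own sketch (thread_map.c is a hypothetical name), and it assumes Linux for sched_getcpu():

```c
/* thread_map.c - each OpenMP thread reports its place and the logical CPU
 * it is currently running on, which shows whether threads stay where
 * OMP_PLACES / OMP_PROC_BIND put them.
 * Build with something like: gcc -fopenmp thread_map.c -o thread_map */
#define _GNU_SOURCE
#include <omp.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    {
        /* output order is not guaranteed; this is just a diagnostic */
        printf("thread %d of %d: place %d, cpu %d\n",
               omp_get_thread_num(), omp_get_num_threads(),
               omp_get_place_num(), sched_getcpu());
    }
    return 0;
}
```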
Is there any performance impact? In principle the cache is shared, so it should be close to none... Can you modify the Intel perf bias register and confirm that, in general, one of the two hyperthreads gives the same or better numeric performance as both? That is a documented limitation of the old HT Atom, for example.
There was no apparent performance impact from the switching between sibling hyperthreaded virtual cores, so that probably isn't hurting; I agree that there's no theoretical reason it should hurt much, since the cache is shared. However, the OS does have to do some bookkeeping to move a thread between cores, even if the move is just to a sibling hyperthreaded virtual core. I don't know what the "Intel perf bias register" is.
This one: The accounting done for such a process move is minimal, because the memory context is "hot" in the shared L3 cache, so it is not really much more work than the normal context switches for timer/stats interrupts.
In my testing, on a 4-core two-way hyperthreaded Xeon-W Skylake machine, I've found that the following environment variable settings produce consistently high performance:
OMP_NUM_THREADS=4
OMP_PLACES="{0,1,2,3}"
OMP_PROC_BIND=spread
This tells the OpenMP library to allow up to 4 threads, to start threads only on cores 0-3 (and not on the hyperthreaded siblings 4-7), and to spread the threads out over those cores as they're started. I believe it also implies thread affinity, so threads won't move between cores.
I find that if I don't set these environment variables, the performance is generally worse and can also be much more variable. For example, on a simple test of matrix multiplication with OMP_NUM_THREADS=4, the run times varied from 6.23 to 8.94 seconds over four tests. After setting OMP_PROC_BIND and OMP_PLACES, the run times varied from 5.44 to 5.49 seconds over four tests.
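For illustration, a sketch of the kind of matrix-multiplication test meant here (not my exact program; the size n = 4000, the dgemm_bench name, and the use of cblas_dgemm through OpenBLAS are assumptions) is:

```c
/* dgemm_bench.c - time a single large DGEMM so the effect of the
 * OMP_* settings above can be compared.
 * Link against OpenBLAS, e.g.:
 *   gcc -O2 -fopenmp dgemm_bench.c -lopenblas -o dgemm_bench */
#include <cblas.h>
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 4000;                       /* assumed problem size */
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    double *c = malloc((size_t)n * n * sizeof(double));
    if (!a || !b || !c) return 1;

    for (long i = 0; i < (long)n * n; i++) {  /* arbitrary test data */
        a[i] = 1.0 / (double)(i + 1);
        b[i] = 2.0 / (double)(i + 2);
        c[i] = 0.0;
    }

    double t0 = omp_get_wtime();
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, a, n, b, n, 0.0, c, n);
    double t1 = omp_get_wtime();
    printf("dgemm %dx%d: %.2f s\n", n, n, t1 - t0);

    free(a); free(b); free(c);
    return 0;
}
```

Running it as `OMP_NUM_THREADS=4 OMP_PLACES="{0,1,2,3}" OMP_PROC_BIND=spread ./dgemm_bench`, and again with the variables unset, makes it easy to compare the two configurations described above.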
Is there any more general advice on how to control thread placement and affinity for the best performance with OpenBLAS? What about systems with more cores and multiple sockets? Could information about this be added to the documentation?