Document binding behavior (especially w.r.t. threads) #4845

Open · jsquyres opened this issue Feb 20, 2018 · 10 comments

@jsquyres
Member

Per discussion on the 2018-02-20 webex, and per #4799:

The general issue appears to be that since Open MPI binds to socket by default (for np>2), progress threads may not be located on the same core as the "main" thread(s). #4799 talks about this in the context of PMIx, but the issue actually exists for all progress threads in the MPI process.

The short version is that we agreed that the best way to move forward is to document the current behavior and provide information for people who want different behavior (e.g., enable binding to core). This probably entails:

  • Adding something to README
  • Adding one or more questions to the FAQ (which tends to be more Google-able than the README)

Points made during the discussion:

  • Forever ago, we used to bind-to-core by default. We changed to bind-to-socket for a few reasons, one of which was that we wanted to embrace an MPI_THREAD_MULTIPLE world. I.e., if we bind to core by default and an app launches a bunch of threads, all of those threads end up confined to that single core, and performance will hurt. If we bind to socket by default (at least for np>2), then apps that launch a bunch of threads will likely hurt less.
  • This is a "no right answer" kind of scenario -- if we change the binding defaults, we're going to anger some users while appeasing others. As such, the only winning move may be to not play: document the current behavior, and explain how to change it for those who want to (e.g., via the mpirun options sketched below).
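
For reference, a minimal sketch of the kind of recipe such a README/FAQ entry could give, using the standard mpirun binding options (the exact defaults depend on the Open MPI version and on np):

```sh
# Show where each rank is bound (and therefore where its threads may float):
mpirun --report-bindings -np 4 ./a.out

# Override the default and bind each rank to a single core:
mpirun --bind-to core -np 4 ./a.out

# Or disable binding entirely, e.g. for heavily threaded applications:
mpirun --bind-to none -np 4 ./a.out
```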
@ggouaillardet
Contributor

@jsquyres @artpol84 What if we bind the progress threads to cores by default (and hence "oversubscribe" most of the time) but keep binding the MPI tasks to NUMA nodes (NUMA=socket most of the time, though)? Would that mitigate the issue (e.g. progress thread migration) described in #4799?
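
Something along these lines (a minimal hwloc sketch, not Open MPI's actual internal code; the helper name is made up) could pin a progress thread to a single core while leaving the process-level NUMA/package binding untouched:

```c
#include <hwloc.h>
#include <pthread.h>

/* Hypothetical helper: pin the calling (progress) thread to the first core of
 * the cpuset this process is already bound to, without changing the
 * process-level binding (e.g. the NUMA node or package chosen by the launcher). */
static void pin_progress_thread_to_first_core(void)
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    hwloc_cpuset_t proc_set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, proc_set, HWLOC_CPUBIND_PROCESS);

    /* First core object that intersects the process cpuset. */
    hwloc_obj_t core = hwloc_get_obj_inside_cpuset_by_type(
        topo, proc_set, HWLOC_OBJ_CORE, 0);
    if (core != NULL) {
        hwloc_set_thread_cpubind(topo, pthread_self(), core->cpuset,
                                 HWLOC_CPUBIND_THREAD);
    }

    hwloc_bitmap_free(proc_set);
    hwloc_topology_destroy(topo);
}
```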

@rhc54
Contributor

rhc54 commented Feb 21, 2018

??? how do you intend to do that?

@artpol84
Contributor

@ggouaillardet You need to make sure that the main and progress threads are on the same core to get the best performance in the single-threaded case.
So you would need to bind main() + the progress thread to a single core.

@rhc54
Contributor

rhc54 commented Feb 21, 2018

@artpol84 Do we really know that it has to be the same core? I'm wondering if it is possible that it only needs to be (for example) the same L3 or L2 cache, or some other level above core. I believe your numbers indicated that the same NUMA wasn't sufficient - yes? But that leaves some room in-between.

@artpol84
Contributor

@rhc54 "I believe your numbers indicated that the same NUMA wasn't sufficient - yes?"
That is true. I haven't experimented with different types of bindings.
It also might be that just preventing the progress and main threads from migrating would help.

@ggouaillardet
Contributor

@rhc54 If an MPI task is bound to cores [n, m], then we can start and bind the first progress thread to core n, the second progress thread to core n+1, and so on, wrapping around modulo the number of assigned cores. Since PMIx starts its own progress thread, we might have to add a new info key or an environment variable for that. Does this have to be more complicated?
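
A rough illustration of that round-robin idea, again as a hypothetical hwloc sketch rather than what Open MPI actually does (it assumes each progress thread somehow knows its index k, which is exactly the coordination question raised below):

```c
#include <hwloc.h>
#include <pthread.h>

/* Hypothetical helper: bind the k-th progress thread to the (k mod ncores)-th
 * core of the set of cores this MPI task is bound to.  'topo' must already be
 * initialized and loaded. */
static void bind_kth_progress_thread(hwloc_topology_t topo, unsigned k)
{
    hwloc_cpuset_t proc_set = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, proc_set, HWLOC_CPUBIND_PROCESS);

    unsigned ncores = hwloc_get_nbobjs_inside_cpuset_by_type(
        topo, proc_set, HWLOC_OBJ_CORE);
    if (ncores > 0) {
        hwloc_obj_t core = hwloc_get_obj_inside_cpuset_by_type(
            topo, proc_set, HWLOC_OBJ_CORE, k % ncores);
        hwloc_set_thread_cpubind(topo, pthread_self(), core->cpuset,
                                 HWLOC_CPUBIND_THREAD);
    }
    hwloc_bitmap_free(proc_set);
}
```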

@artpol84 well, in that case what I suggested has to be improved.
Assuming most of the PMIx_Get() time is spent in MPI_Init(), a possible mitigation would be to bind both the "main" thread (i.e. the one that calls MPI_Init()) and the PMIx progress thread to the first assigned core when MPI_Init() is invoked, and then "restore" the binding of the "main" thread (and possibly the PMIx progress thread too) when leaving MPI_Init().
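
The save/narrow/restore pattern being described could look roughly like the following Linux-specific sketch (shown at user level purely for illustration; the actual proposal is to do the equivalent inside MPI_Init() itself, and the PMIx progress thread would need the same treatment from within the library):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    cpu_set_t saved, one;
    pthread_t self = pthread_self();

    /* Save whatever binding the launcher gave us. */
    pthread_getaffinity_np(self, sizeof(saved), &saved);

    /* Narrow to the first CPU of that binding for the duration of MPI_Init(). */
    CPU_ZERO(&one);
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &saved)) { CPU_SET(cpu, &one); break; }
    }
    pthread_setaffinity_np(self, sizeof(one), &one);

    MPI_Init(&argc, &argv);

    /* Restore the original (e.g. NUMA/package-wide) binding. */
    pthread_setaffinity_np(self, sizeof(saved), &saved);

    /* ... application ... */
    MPI_Finalize();
    return 0;
}
```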

As @jsquyres pointed out, there is "no right answer" here; I am just suggesting an idea to improve the out-of-the-box performance.

@artpol84
Contributor

@ggouaillardet PMIx_Get() will not necessarily be called in MPI_Init(), since we now have a lazy add_proc.

@rhc54
Contributor

rhc54 commented Feb 21, 2018

@ggouaillardet I don't think it has to be more complicated. My concern was that you are implying that there is some global knowledge regarding progress threads, and I don't believe it exists. So I'm still a little puzzled as to how you know you are the nth progress thread, and therefore should go on a specific core.

@jsquyres
Member Author

jsquyres commented Feb 21, 2018

This is precisely the problem, and one of the reasons why there is no good answer here: we're currently binding to package, so the "main" application thread(s) (MATs) can float anywhere in the package. So if binding the progress thread(s) in some kind of proximity to the MATs (smaller than a package) is necessary for performance, we have no way of knowing where the MATs will be. And even if we did, the MATs may move.

And if we don't let the MATs move -- by binding them to something smaller than the package -- then we're going against the reason we expanded to bind-to-package (what used to be called "socket") in the first place: being friendly to MPI+OpenMP / MPI_THREAD_MULTIPLE applications.

@rhc54
Contributor

rhc54 commented Feb 21, 2018

One possible resolution might be through the PMIx OpenMP/MPI working group. We now have a method by which the MPI layer learns of the OpenMP layer becoming "active", and vice versa. So we will know that the app is multi-threaded, how many threads it intends to use, and what each side's desired binding looks like. When we get that info (which is when either side calls "init"), then we could perhaps determine a binding pattern within the envelope given to us by the RM.

There has even been discussion about making the worker thread pool "common" between the two sides, though that is strictly at the head scratching phase.


4 participants