Lazy wait v2.x #2181
565b011
to
8653bac
Compare
@jladd-mlnx @jsquyres @hppritcha this recent fix is important for SLURM. I removed this PR's dependency on #2176. Please review. We definitely want this in v2.1.0, but I'm not sure we'll have time to provide all the info requested in #2176.
Good to go.
One question: is there ever a reason not to do the lazy wait? I'm wondering if we really need another parameter here, or whether we should just change the default. Note that changing this in no way impacts the speed of the underlying fence operation, as that takes place in the daemons.
@rhc54,
I asked for it in master since it was an experimental feature. What I want to know before we bring it to a release branch is the result of those tests. Did people find that it made a difference in general? Is it a detriment to anyone? Etc. We don't just move experimental things to release branches without at least first discussing the results of the experiments. That's how we get into trouble.
@rhc54 it should be the default.
@rhc54 Has Intel run any experiments here? If so, could you share the results?
@matcabral Have you tested this? IIRC, the psm MTL did some progressing during MPI_Init, but I don't know if it matters how aggressively we do it. @jsquyres How about usnic? @hjelmn How about ugni and vader? Anything they need to do?
Can someone explain what this PR is about? (I don't know offhand if usNIC needs to do anything for this PR)
Is this specific for direct launch using srun?
@hppritcha The question is what's a reasonable heuristic for time sharing the CPU between runtime and
On Mon, Oct 17, 2016 at 2:18 PM, Howard Pritchard [email protected]
Consensus, per the developer conference call on 10/18/16, seems to be to do more testing in master with lazy wait enabled by default. So park this PR and see if it can be reworked to not include the new MCA parameter.
Possibly just need to add 7910aa2 to this PR.
@rhc54, I have not tried this with the PSM2 MTL.
@matcabral We have now turned it "on" by default on master - can you just run a quick smoke test on it? I don't believe it will impact anything, but worth a quick check before it comes to the 2.x series |
Please add a Signed-off-by line to this PR's commit.
Apologies for the delayed answer. I finally got to run some osu_ tests and saw no impact in comparison to 2.0.1.
8653bac
to
f1733ee
Compare
@jsquyres, done, the Signed-off-by line is added.
@artpol84 I have confirmed that you can remove the parameters and just hard-code use of the lazy wait. Would you please update this PR?
@rhc54 mentioned on the call today that he's checked with everyone, and it seems like no transports are adversely affected by using the lazy wait. Hence, this PR can likely be changed to just always use the lazy wait, i.e., no need for an MCA param.
Ah, I was expecting that you would update master, and then PR the appropriate changes back to the release branch.
So we accept this one and then do a separate PR which removes the MCA param?
No, I'd just close this one out, update master, and then create a new PR that does the right thing.
In my opinion, the MCA parameter should be removed in master as well.
Josh
On Tue, Nov 1, 2016 at 4:47 PM, Artem Polyakov [email protected]
Definitely.
Ok, I'll do that |
Relax CPU usage pressure from the application processes when doing modex and barrier in ompi_mpi_init. We see significant latencies in SLURM/pmix plugin barrier progress because app processes aggressively call opal_progress, pushing away the daemon process that is doing collective progress. (cherry-picked from 0861884) Signed-off-by: Artem Polyakov <[email protected]>
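For readers new to the thread, the difference between an aggressive wait and a lazy wait, as described in the commit message above, can be sketched roughly as follows. This is a standalone illustration and not the code in this PR: `busy_wait`, `lazy_wait`, `fake_fence`, and `fake_progress` are hypothetical names, the callback only stands in for a progress call such as `opal_progress`, and the sleep interval is an arbitrary parameter rather than a value taken from Open MPI.

```c
#define _POSIX_C_SOURCE 200809L  /* for nanosleep() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical progress callback; stands in for a call like opal_progress(). */
typedef void (*progress_fn)(void *ctx);

/* Aggressive wait: spin on the progress function until *active is cleared.
 * The caller never yields the CPU. Returns the number of progress calls. */
unsigned long busy_wait(volatile bool *active, progress_fn progress, void *ctx)
{
    unsigned long calls = 0;
    assert(active != NULL && progress != NULL);
    while (*active) {
        progress(ctx);
        ++calls;
    }
    return calls;
}

/* Lazy wait: same loop, but sleep between progress calls so that another
 * process on the same core (e.g. a daemon driving the collective) can run.
 * sleep_usec is an illustrative knob, not an Open MPI setting. */
unsigned long lazy_wait(volatile bool *active, progress_fn progress,
                        void *ctx, long sleep_usec)
{
    struct timespec pause;
    unsigned long calls = 0;
    assert(active != NULL && progress != NULL && sleep_usec >= 0);
    pause.tv_sec = sleep_usec / 1000000L;
    pause.tv_nsec = (sleep_usec % 1000000L) * 1000L;
    while (*active) {
        progress(ctx);
        ++calls;
        if (*active) {
            nanosleep(&pause, NULL);  /* give up the CPU briefly */
        }
    }
    return calls;
}

/* Toy stand-in for a pending fence: completes after a fixed number of
 * progress calls. Purely for demonstration. */
struct fake_fence {
    volatile bool active;
    int calls_until_done;
};

void fake_progress(void *ctx)
{
    struct fake_fence *f = ctx;
    if (--f->calls_until_done <= 0) {
        f->active = false;
    }
}
```

Both loops finish as soon as the flag is cleared. The lazy variant trades a small amount of wake-up latency in the application process for CPU time that the daemon can use, which is the effect this PR is after.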
f1733ee
to
53e1e9d
Compare
According to the discussion in open-mpi#2181, we no longer need the MCA parameter. Signed-off-by: Artem Polyakov <[email protected]>
Closed per discussion above.
@jladd-mlnx @jsquyres @hppritcha this recent fix is important for SLURM.
Needs to be rebased once #2176 is merged.