Lazy wait v2.x #2181

Closed · wants to merge 1 commit

Conversation

artpol84
Contributor

@artpol84 artpol84 commented Oct 5, 2016

@jladd-mlnx @jsquyres @hppritcha this recent fix is important for SLURM.

Needs to be rebased once #2176 is merged.

@artpol84
Contributor Author

@jladd-mlnx @jsquyres @hppritcha this recent fix is important for SLURM.

I removed this PR's dependency on #2176. Please review. We definitely want this in v2.1.0, and I'm not sure we'll have time to provide all the info requested in #2176.

@artpol84 artpol84 changed the title Lazy wait v2.x (waits for #2176) Lazy wait v2.x Oct 17, 2016
Member

@jladd-mlnx jladd-mlnx left a comment


Good to go.

@rhc54
Contributor

rhc54 commented Oct 17, 2016

One question: is there ever a reason not to do the lazy wait? I'm wondering if we really need another parameter here, or should just change the default. Note that changing this in no way impacts the speed of the underlying fence operation, as that takes place in the daemons.
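For context, the two waiting strategies under discussion, in a minimal sketch (illustrative C only, not Open MPI source; `progress()` and `barrier_done` are stand-ins for opal_progress() and the fence-completion state):

```c
/* Illustrative sketch only -- not Open MPI code.  progress() stands in for
 * opal_progress(), and barrier_done is a hypothetical flag set once the
 * daemons have completed the fence. */
#include <stdbool.h>
#include <time.h>

extern int progress(void);           /* stand-in for opal_progress() */
extern volatile bool barrier_done;   /* hypothetical fence-completion flag */

/* Aggressive wait: the app process spins on the progress engine and
 * competes for the core with the local daemon actually driving the fence. */
static void busy_wait(void)
{
    while (!barrier_done) {
        progress();
    }
}

/* Lazy wait: still polls the progress engine, but naps between polls so the
 * daemon gets CPU time.  Since the fence itself runs in the daemons, ceding
 * the CPU here does not slow the underlying collective. */
static void lazy_wait(void)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 100 * 1000 }; /* 100 us */
    while (!barrier_done) {
        progress();
        nanosleep(&nap, NULL);
    }
}
```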

@artpol84
Contributor Author

@rhc54,
If I understood your question correctly, then I'm confused. It was you who originally requested this parameter when I was PRing to master: #2089 (in particular, #2089 (comment)).

@jladd-mlnx
Member

#2089

@rhc54
Contributor

rhc54 commented Oct 17, 2016

I asked for it in master since it was an experimental feature. What I want to know before we bring it to a release branch is the result of those tests. Did people find that it made a difference in general? Is it a detriment to anyone? Etc.

We don't just move experimental things to release branches without at least first discussing the results of the experiments. That's how we get into trouble.

@jladd-mlnx
Member

@rhc54 it should be the default.

@jladd-mlnx
Member

@rhc54 Has Intel run any experiments here? If so, could you share the results?

@jladd-mlnx jladd-mlnx added this to the v2.1.0 milestone Oct 17, 2016
@rhc54
Contributor

rhc54 commented Oct 17, 2016

@matcabral Have you tested this? IIRC, the psm MTL did some progressing during MPI_Init, but I don't know if it matters how aggressively we do it.

@jsquyres How about usnic?

@hjelmn How about ugni and vader? Anything they need to do?

@jsquyres
Member

Can someone explain what this PR is about? (I don't know offhand if usNIC needs to do anything for this PR)

@hppritcha
Member

Is this specific for direct launch using srun?

@jladd-mlnx
Member

jladd-mlnx commented Oct 17, 2016

@hppritcha
It is not specific to a SLURM direct launch; mpirun launches could benefit as well. Any runtime agent that shares CPU resources with the application to progress its operations could benefit.

The question is what a reasonable heuristic is for time-sharing the CPU between the runtime and the application; this is both transport dependent AND runtime dependent (e.g. if a runtime implements true offloaded nonblocking collectives over a supporting fabric, then it's reasonable to allocate a larger fraction of the CPU resources to the application).
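One possible shape for such a heuristic is sketched below, with assumed names (`runtime_share`, `progress()`, and `operation_complete` are placeholders for illustration, not Open MPI APIs): poll the progress engine for a short slice, then sleep for a slice sized by the share of CPU to be ceded to the runtime agent.

```c
/* Hypothetical time-sharing heuristic -- a sketch, not Open MPI code.
 * runtime_share is the fraction of CPU to leave to the runtime agent. */
#include <time.h>

extern int progress(void);               /* stand-in for the MPI progress engine */
extern volatile int operation_complete;  /* hypothetical completion flag */

static void shared_cpu_wait(double runtime_share)
{
    if (runtime_share < 0.0) runtime_share = 0.0;
    if (runtime_share > 0.9) runtime_share = 0.9;    /* keep the ratio sane */

    const long poll_ns  = 200 * 1000;                /* poll for 200 us ...        */
    const long sleep_ns =                            /* ... then nap in proportion */
        (long)(poll_ns * runtime_share / (1.0 - runtime_share));
    struct timespec nap = { .tv_sec = 0, .tv_nsec = sleep_ns };

    while (!operation_complete) {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        do {                                         /* polling slice */
            progress();
            clock_gettime(CLOCK_MONOTONIC, &now);
        } while (!operation_complete &&
                 (now.tv_sec - start.tv_sec) * 1000000000L +
                 (now.tv_nsec - start.tv_nsec) < poll_ns);

        if (!operation_complete && sleep_ns > 0) {
            nanosleep(&nap, NULL);                   /* cede the CPU to the runtime */
        }
    }
}
```

With `runtime_share = 0` this degenerates to the aggressive wait; larger values hand correspondingly more of each period to the runtime agent.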


@hppritcha
Member

Consensus from the dev con call on 10/18/16 seems to be to do more testing in master with lazy wait enabled by default. So park this PR and see if it can be reworked to not include the new MCA parameter.

@jsquyres
Member

Possibly just need to add 7910aa2 to this PR.

@matcabral
Contributor

@rhc54, I have not played with this and the PSM2 MTL.

@rhc54
Contributor

rhc54 commented Oct 18, 2016

@matcabral We have now turned it "on" by default on master. Can you just run a quick smoke test on it? I don't believe it will impact anything, but it's worth a quick check before it comes to the 2.x series.

Member

@jsquyres jsquyres left a comment


Please add a Signed-off-by line to this PR's commit.

@matcabral
Contributor

Apologies for the delayed answer. I finally got to run some osu_ tests and saw no impact compared to 2.0.1.

@artpol84
Contributor Author

artpol84 commented Nov 1, 2016

@jsquyres, done with the sign-off.

@rhc54
Contributor

rhc54 commented Nov 1, 2016

@artpol84 I have confirmed that you can remove the parameters and just hard-code use of the lazy wait. Would you please update this PR?

@jsquyres
Member

jsquyres commented Nov 1, 2016

@rhc54 mentioned on the call today that he's checked with everyone, and it seems like no transports are adversely affected by using the lazy wait. Hence, this PR can likely be changed to just always use the lazy wait -- i.e., no need for an MCA param.

@artpol84
Contributor Author

artpol84 commented Nov 1, 2016

@rhc54 @jsquyres This will put us out of sync with master, which currently has this option. Is that fine?

@rhc54
Contributor

rhc54 commented Nov 1, 2016

Ah, I was expecting that you would update master, and then PR the appropriate changes back to the release branch.

@artpol84
Contributor Author

artpol84 commented Nov 1, 2016

So we accept this one and then do a separate PR which removes the MCA param?

@rhc54
Contributor

rhc54 commented Nov 1, 2016

No, I'd just close this one out, update master, and then create a new PR that does the right thing.

@jladd-mlnx
Member

In my opinion, the MCA parameter should be removed in master as well.


@rhc54
Contributor

rhc54 commented Nov 1, 2016

definitely

@artpol84
Contributor Author

artpol84 commented Nov 1, 2016

Ok, I'll do that

Relax CPU usage pressure from the application processes when doing
modex and barrier in ompi_mpi_init.

We see significant latencies in the SLURM/PMIx plugin barrier progress because application processes aggressively call opal_progress, pushing away the daemon process that is doing the collective progress.

(cherry-ported from 0861884)

Signed-off-by: Artem Polyakov <[email protected]>
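Roughly, the change amounts to the pattern below. This is a standalone sketch under stated assumptions, not the actual Open MPI source: `fence_nb()`, `progress()`, and the callback are hypothetical stand-ins for the non-blocking PMIx fence, opal_progress(), and its completion callback.

```c
/* Standalone sketch of the lazy-wait pattern described in the commit message
 * above -- not the actual Open MPI implementation. */
#include <stdbool.h>
#include <unistd.h>

extern int progress(void);                           /* stand-in for opal_progress() */
extern int fence_nb(void (*cb)(void *), void *ctx);  /* hypothetical non-blocking fence */

static volatile bool fence_active;

static void fence_complete_cb(void *ctx)
{
    (void)ctx;
    fence_active = false;          /* fired once the daemons finish the collective */
}

static int lazy_fence(void)
{
    fence_active = true;
    if (0 != fence_nb(fence_complete_cb, NULL)) {
        return -1;
    }
    /* Lazy wait: keep the local progress engine moving, but yield the CPU so
     * the daemon driving the modex/barrier is not starved by the app process. */
    while (fence_active) {
        progress();
        usleep(100);               /* ~100 us nap per poll */
    }
    return 0;
}
```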
artpol84 added a commit to artpol84/ompi that referenced this pull request Nov 12, 2016
According to the discussion in open-mpi#2181, we don't need the MCA parameter any more.

Signed-off-by: Artem Polyakov <[email protected]>
@artpol84
Contributor Author

Closed per discussion above

@artpol84 artpol84 closed this Nov 12, 2016
artpol84 added a commit to artpol84/ompi that referenced this pull request Nov 12, 2016
According to the discussion in open-mpi#2181, we don't need the MCA parameter any more.

Signed-off-by: Artem Polyakov <[email protected]>
@artpol84 artpol84 deleted the lazy_wait_v2.x branch November 12, 2016 00:56