Add an experimental ability to skip the RTE barriers #2176
Conversation
…at the end of MPI_Init and the beginning of MPI_Finalize (cherry-picked from 2c086e5)
@jladd-mlnx @jsquyres @hppritcha
This is ready to go.
@hppritcha please merge.
Is this intended to help debug startup/shutdown issues?
No. It's a performance enhancement that some (though not all) transports can take advantage of.
Hold on a second - this PR unfortunately mixes two things. I added two params to master a few months back that would allow developers to experiment with skipping RTE barriers during init and finalize. There are no transports today that can do this - the intent was to provide a mechanism so that network providers could develop and test such capabilities, thereby hopefully encouraging them to do so. At this time, nobody has released support for such a capability. I would therefore suggest that the above change not go into a production release at this time.

The lazy wait is a separate change that helps Slurm users when an RTE barrier is performed. That change needs to go into 2.x, but should be done separately.
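For readers unfamiliar with the two pieces being discussed here, the following is a minimal, self-contained sketch (not Open MPI source) of both ideas: an experimental flag that lets init skip the RTE barrier entirely, and a "lazy" wait that sleeps between polls instead of spinning while the barrier completes. Every name in it (skip_rte_barrier, rte_barrier_complete, lazy_wait_for_completion) is an illustrative stand-in, not an actual MCA parameter or internal symbol.

```c
/* Sketch only: illustrative stand-ins, not Open MPI internals. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static bool skip_rte_barrier = false;            /* would come from an MCA param */
static volatile bool rte_barrier_complete = false;

/* stand-in for the asynchronous fence-release callback from the RTE */
static void *rte_release_cb(void *arg)
{
    (void)arg;
    struct timespec work = { .tv_sec = 0, .tv_nsec = 50000000 };
    nanosleep(&work, NULL);                      /* pretend the fence takes a while */
    rte_barrier_complete = true;
    return NULL;
}

/* "lazy" wait: yield the CPU between polls rather than spinning at 100% */
static void lazy_wait_for_completion(volatile bool *flag)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };   /* 1 ms */
    while (!*flag) {
        /* real code would also drive the progress engine here */
        nanosleep(&nap, NULL);
    }
}

int main(void)
{
    if (!skip_rte_barrier) {                     /* default path: do the barrier */
        pthread_t tid;
        pthread_create(&tid, NULL, rte_release_cb, NULL);
        lazy_wait_for_completion(&rte_barrier_complete);
        pthread_join(tid, NULL);
        printf("RTE barrier released\n");
    } else {
        printf("experimental: barrier skipped; the transport must tolerate early traffic\n");
    }
    return 0;
}
```

The lazy variant is the piece that helps Slurm users, presumably because sleeping between polls avoids pinning a core while the fence completes.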
Ralph, UCX is capable of supporting this. We want this in 2.1.
Best regards, Artem Polyakov
Is it that UCX doesn't need an RTE barrier (i.e., the underlying transports can deal with messages arriving for processes not yet started), or that UCX has its own internal RTE barrier?
I guess the first - it was designed with current limitations in mind and …
Best regards, Artem Polyakov
Best to check to be sure - I don't believe IB knows how to handle that situation. Remember, it is a race condition, so it is easy to fool yourself into thinking this works, especially at small scale.
Note, by the way, that this has nothing to do with UCX. The problem lies below that level, in the NIC firmware itself. The NIC is going to receive messages for processes that it knows nothing about. When that happens, it either has to buffer those messages until the process can start and register itself with the NIC, or it has to send an error back to the originator, who then must cache the message and resend it until the receiving NIC can deal with it.
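To make the failure mode concrete, here is a small self-contained sketch of the sender-side cache-and-retransmit pattern described above: if the target process has not yet registered with its NIC, delivery fails, and the sender must hold the message and retry until it succeeds. The functions below are hypothetical stand-ins, not a real NIC, verbs, or UCX API.

```c
/* Sketch only: hypothetical names, not a real NIC or UCX interface. */
#include <stdbool.h>
#include <stdio.h>

static bool peer_registered = false;   /* becomes true once the remote process starts */

/* pretend "NIC": rejects traffic for processes it does not yet know about */
static bool nic_try_deliver(const char *msg)
{
    if (!peer_registered) {
        return false;                  /* receiver NACKs: target process unknown */
    }
    printf("delivered: %s\n", msg);
    return true;
}

int main(void)
{
    const char *pending = "eager send issued before the peer started";
    int attempts = 0;

    /* sender side: cache the message and retransmit until the receiver is ready */
    while (!nic_try_deliver(pending)) {
        attempts++;
        if (3 == attempts) {
            peer_registered = true;    /* simulate the late process finally starting */
        }
    }
    printf("succeeded after %d retransmissions\n", attempts);
    return 0;
}
```

The alternative mentioned above (the receiving NIC buffering unsolicited messages until the process registers) trades sender-side retries for receiver-side memory; either way, something has to absorb the race.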
Sure, let's double-check. @yosefe @jladd-mlnx please advise.
As Artem pointed out, UCX is able to support this feature.
I was actually asking for a little more reassurance than that 🤕 There are several scenarios to consider here, and I'm really concerned about whether you have adequately reviewed the code to ensure we are covered, or just tested on a small cluster and said "it works" for some simple test case.

If the user allows the app to perform a modex (i.e., they don't specify the async modex param), then there is at least a barrier in the modex itself that ensures procs have been started on every node. I assume that in this case, the process of opening the UCX component allows the NIC to know that the process exists and sets up any necessary registers so the NIC can handle an ensuing connection request. Thus, a subsequent barrier might not be necessary, assuming no further setup is required.

However, if the user specifies the async modex param and specifies that we not perform a barrier at the end of MPI_Init, then it is possible that the remote process will not even have started yet when the NIC receives a connection request. What happens in this case has nothing to do with UCX, as that library is only functioning on one side of the problem - it hasn't been instantiated yet on the receiving end. You might be protected here if one assumes that you require a modex of some kind to be completed prior to connection - i.e., if someone specifies the async modex param and the no-barrier option, you are still protected by the need for a direct modex to succeed in order for the sender to get the necessary connection info. Thus, you are relying on the direct modex to act as your effective barrier.

IF the direct modex is adequate, then effectively every transport could use this "feature", and we could consider making it the default. However, in the absence of any analysis from you and the others in the community, we cannot know, and I would oppose putting this feature in a production release.

The finalize barrier is more concerning, as there is no way to know if all MPI messages have been progressed to send completion prior to the process exiting. We have seen hangs due to unsent messages, and I have yet to hear of someone resolving that problem in a reliable manner. Again, in the absence of some analysis indicating how this has been solved, I would oppose putting this feature in a production release.

I didn't write those two features with any intent of them going into 2.x, nor have I seen any discussion around the problems I described. I'm therefore really reluctant to release them into 2.x without some better understanding of how those problems were addressed.
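The "direct modex acts as your effective barrier" argument can be pictured with a short sketch: the sender cannot even attempt a connection until a lookup of the peer's endpoint data succeeds, and that lookup cannot succeed before the peer has started and published its info, so the modex round-trip orders the connection attempt after peer startup. All names below are hypothetical illustrations, not the Open MPI or PMIx API.

```c
/* Sketch only: hypothetical names, not Open MPI or PMIx calls. */
#include <stdbool.h>
#include <stdio.h>

static const char *published_endpoint = NULL;   /* set once the peer starts and publishes */

/* stand-in for a direct-modex lookup of a peer's endpoint info */
static const char *direct_modex_lookup(void)
{
    /* a real implementation would ask the local daemon, which forwards the
     * request to the peer's daemon and returns the data once published */
    return published_endpoint;
}

static bool try_connect(void)
{
    const char *ep = direct_modex_lookup();
    if (NULL == ep) {
        return false;       /* peer not started yet: no way to even address it */
    }
    printf("connecting using: %s\n", ep);
    return true;
}

int main(void)
{
    if (!try_connect()) {
        printf("lookup pending: the modex round-trip orders us after peer startup\n");
    }
    published_endpoint = "peer endpoint data";  /* the peer has now started */
    try_connect();
    return 0;
}
```

Note this only covers the init side; as the comment above points out, no comparable handshake guarantees that outstanding sends have completed before a process exits, which is why the finalize barrier is the harder case.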
Per the call today:
Please add a Signed-off-by line to this PR's commit.
bot:mellanox:retest |
bot:mellanox:retest |
Per discussion at the Open MPI devel F2F, @artpol84 says this issue can be closed.
Add an experimental ability to skip the RTE barriers at the end of MPI_Init and the beginning of MPI_Finalize (cherry-picked from 2c086e5)