Skip to content

Add an experimental ability to skip the RTE barriers #2176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

artpol84
Copy link
Contributor

@artpol84 artpol84 commented Oct 5, 2016

at the end of MPI_Init and the beginning of MPI_Finalize

(cherry-picked from 2c086e5)

…I_Init and the beginning of MPI_Finalize

(cherry-picked from 2c086e5)
@artpol84
Copy link
Contributor Author

artpol84 commented Oct 5, 2016

@jladd-mlnx @jsquyres @hppritcha
this is fairly simple change that we really need in v2.1

Copy link
Member

@jladd-mlnx jladd-mlnx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is ready to go.

@jladd-mlnx jladd-mlnx added this to the v2.1.0 milestone Oct 5, 2016
@jladd-mlnx
Copy link
Member

@hppritcha please merge.

@artpol84 artpol84 mentioned this pull request Oct 5, 2016
@hppritcha
Copy link
Member

Is this intended to help debug startup/shutdown issues?

@jladd-mlnx
Copy link
Member

No. It's a performance enhancement that some (though not all) transports
can leverage.

On Thu, Oct 6, 2016 at 11:21 AM, Howard Pritchard [email protected]
wrote:

Is this intended to help debug startup/shutdown issues?


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#2176 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AHqGfLO36W66VbWQYrMWnDcfHURSWesSks5qxRIBgaJpZM4KOtqy
.

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2016

Hold on a second - this PR unfortunately mixes two things. I added two params to master a few months back that would allow developers to experiment with skipping RTE barriers during init and finalize. There are no transports today that can do this - the intent was to provide a mechanism so that network providers could develop and test such capabilities, thereby hopefully encouraging them to do so. At this time, nobody has released support for such a capability.

I would therefore suggest that the above change not go into a production release at this time.

The lazy wait is a separate change that helps Slurm users when an RTE barrier is performed. That change needs to go into 2.x, but should be done separately.

@artpol84
Copy link
Contributor Author

artpol84 commented Oct 7, 2016

Ralph, UCX is capable to support this. We want this in 2.1

пятница, 7 октября 2016 г. пользователь rhc54 написал:

Hold on a second - this PR unfortunately mixes two things. I added two
params to master a few months back that would allow developers to
experiment with skipping RTE barriers during init and finalize. There are
no transports today that can do this - the intent was to provide a
mechanism so that network providers could develop and test such
capabilities, thereby hopefully encouraging them to do so. At this time,
nobody has released support for such a capability.

I would therefore suggest that the above change not go into a
production release at this time.

The lazy wait is a separate change that helps Slurm users when an RTE
barrier is performed. That change needs to go into 2.x, but should be done
separately.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#2176 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AHL5Pr2VMXjuNOfrCOLzuUKEdifxG_abks5qxbWbgaJpZM4KOtqy
.


Best regards, Artem Polyakov
(Mobile mail)

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2016

Is it that UCX doesn't need an RTE barrier (i.e., the underlying transports can deal with messages arriving for processes not yet started), or that UCX has its own internal RTE barrier?

@artpol84
Copy link
Contributor Author

artpol84 commented Oct 7, 2016

I guess the first- it was designed with current limitations in mind and
tries to eliminate them.

пятница, 7 октября 2016 г. пользователь rhc54 написал:

Is it that UCX doesn't need an RTE barrier (i.e., the underlying
transports can deal with messages arriving for processes not yet started),
or that UCX has its own internal RTE barrier?


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#2176 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AHL5Pj8SAU3dQlf19S4vznuvkVY1qkP6ks5qxbabgaJpZM4KOtqy
.


Best regards, Artem Polyakov
(Mobile mail)

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2016

Best check to be sure - I don't believe IB knows how to handle that situation. Remember, it is a race condition, so it is easy to fool yourself into thinking this works, especially at small scale.

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2016

Note, by the way, that this has nothing to do with UCX. The problem lies below that level, in the NIC firmware itself. The NIC is going to receive messages for processes that it knows nothing about. When that happens, it either has to buffer those messages until the process can start and register itself with the NIC, or it has to send an error back to the originator, who then must cache the message and resend it until the receiving NIC can deal with it.

@artpol84
Copy link
Contributor Author

artpol84 commented Oct 7, 2016

Sure, let's double check.

@yosefe @jladd-mlnx please advise.

@jladd-mlnx
Copy link
Member

As Artem pointed out, UCX is able to support this feature.

On Thursday, October 6, 2016, Artem Polyakov [email protected]
wrote:

Ralph, UCX is capable to support this. We want this in 2.1

пятница, 7 октября 2016 г. пользователь rhc54 написал:

Hold on a second - this PR unfortunately mixes two things. I added two
params to master a few months back that would allow developers to
experiment with skipping RTE barriers during init and finalize. There are
no transports today that can do this - the intent was to provide a
mechanism so that network providers could develop and test such
capabilities, thereby hopefully encouraging them to do so. At this time,
nobody has released support for such a capability.

I would therefore suggest that the above change not go into a
production release at this time.

The lazy wait is a separate change that helps Slurm users when an RTE
barrier is performed. That change needs to go into 2.x, but should be
done
separately.


You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
#2176 (comment), or
mute
the thread
<https://github.com/notifications/unsubscribe-auth/
AHL5Pr2VMXjuNOfrCOLzuUKEdifxG_abks5qxbWbgaJpZM4KOtqy>
.


Best regards, Artem Polyakov
(Mobile mail)


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#2176 (comment), or mute
the thread
https://github.com/notifications/unsubscribe-auth/AHqGfPgr0dQN5qDRjfoTgm9RTCpLm-uDks5qxbXhgaJpZM4KOtqy
.

@rhc54
Copy link
Contributor

rhc54 commented Oct 7, 2016

I was actually asking for a little more reassurance than that 🤕

There are several scenarios to consider here, and I'm really concerned about whether or not you have adequately reviewed code to ensure we are covered, or just tested on a small cluster and said "it works" for some simple test case.

If the user allows the app to perform a modex (i.e., they don't specify the async modex param), then there is at least a barrier in the modex itself that ensures procs have been started on every node. I assume that in this case, the process of opening the UCX component allows the NIC to know that the process exists and sets up any necessary registers so the NIC can handle an ensuing connection request. Thus, a subsequent barrier might not be necessary assuming no further setup is required.

However, if the user specifies the async modex param and specifies that we not perform a barrier at the end of MPI_Init, then it is possible that the remote process will not even have started yet when the NIC receives a connection request. What happens in this case has nothing to do with UCX as that library is only functioning on one side of the problem - it hasn't been instanced yet on the receiving end. You might be protected here if one assumes that you require a modex of some kind to be completed prior to connection - i.e., if someone specifies the async modex param and the no-barrier option, you are still protected by the need for a direct modex to succeed in order for the sender to get the necessary connection info. Thus, you are relying on direct modex to act as your effective barrier.

IF the direct modex is adequate, then effectively every transport could use this "feature", and we could consider making it the default. However, in the absence of any analysis from you and the others in the community, we cannot know, and I would oppose putting this feature in a production release.

The finalize barrier is more concerning as there is no way to know if all MPI messages have been progressed to send completion prior to the process exiting. We have seen hangs due to unsent messages, and I have yet to hear of someone resolving that problem in a reliable manner.

Again, in the absence of some analysis indicating how this has been solved, I would oppose putting this feature in a production release.

I didn't write those two features with any intent of them going into 2.x, nor have I seen any discussion around the problems I described. I'm therefore really reluctant to release them into 2.x without some better understanding of how problems were addressed.

@jsquyres
Copy link
Member

Per the call today:

  1. Mellanox will provide some additional information about how UCX handles these cases, and some data showing that it works.
  2. Additional infrastructure is requested to detect if an underlying transport/runtime does not support all the required features, then disable this capability in Open MPI (and if a user specifically requested, abort under the "a human asked for X and we can't deliver X, so abort" philosophy).

Copy link
Member

@jsquyres jsquyres left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a Signed-off-by line to this PR's commit.

@Di0gen
Copy link

Di0gen commented Jan 4, 2017

bot:mellanox:retest

1 similar comment
@Di0gen
Copy link

Di0gen commented Jan 4, 2017

bot:mellanox:retest

@hppritcha
Copy link
Member

per discussion at the Open MPI devel F2F, @artpol84 says this issue can be closed.

@hppritcha hppritcha closed this Jan 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants