Add an experimental ability to skip the RTE barriers #2176
Conversation
…at the end of MPI_Init and the beginning of MPI_Finalize (cherry-picked from 2c086e5)
@jladd-mlnx @jsquyres @hppritcha
This is ready to go.
@hppritcha please merge.
Is this intended to help debug startup/shutdown issues?
No. It's a performance enhancement that some (though not all) transports can take advantage of.
Hold on a second - this PR unfortunately mixes two things. I added two params to master a few months back that would allow developers to experiment with skipping RTE barriers during init and finalize. There are no transports today that can do this - the intent was to provide a mechanism so that network providers could develop and test such capabilities, thereby hopefully encouraging them to do so. At this time, nobody has released support for such a capability. I would therefore suggest that the above change not go into a production release at this time.

The lazy wait is a separate change that helps Slurm users when an RTE barrier is performed. That change needs to go into 2.x, but should be done separately.
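For readers unfamiliar with the two pieces being discussed here, the following is a minimal, self-contained sketch (not Open MPI source) of both ideas: an experimental flag that lets init skip the RTE barrier entirely, and a "lazy" wait that sleeps between polls instead of spinning while the barrier completes. Every name in it (skip_rte_barrier, rte_barrier_complete, lazy_wait_for_completion) is an illustrative stand-in, not an actual MCA parameter or internal symbol.

```c
/* Sketch only: illustrative stand-ins, not Open MPI internals. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

static bool skip_rte_barrier = false;            /* would come from an MCA param */
static volatile bool rte_barrier_complete = false;

/* stand-in for the asynchronous fence-release callback from the RTE */
static void *rte_release_cb(void *arg)
{
    (void)arg;
    struct timespec work = { .tv_sec = 0, .tv_nsec = 50000000 };
    nanosleep(&work, NULL);                      /* pretend the fence takes a while */
    rte_barrier_complete = true;
    return NULL;
}

/* "lazy" wait: yield the CPU between polls rather than spinning at 100% */
static void lazy_wait_for_completion(volatile bool *flag)
{
    struct timespec nap = { .tv_sec = 0, .tv_nsec = 1000000 };   /* 1 ms */
    while (!*flag) {
        /* real code would also drive the progress engine here */
        nanosleep(&nap, NULL);
    }
}

int main(void)
{
    if (!skip_rte_barrier) {                     /* default path: do the barrier */
        pthread_t tid;
        pthread_create(&tid, NULL, rte_release_cb, NULL);
        lazy_wait_for_completion(&rte_barrier_complete);
        pthread_join(tid, NULL);
        printf("RTE barrier released\n");
    } else {
        printf("experimental: barrier skipped; the transport must tolerate early traffic\n");
    }
    return 0;
}
```

The lazy variant is the piece that helps Slurm users, presumably because sleeping between polls avoids pinning a core while the fence completes.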
Ralph, UCX is capable of supporting this. We want this in 2.1.
Best regards, Artem Polyakov
Is it that UCX doesn't need an RTE barrier (i.e., the underlying transports can deal with messages arriving for processes not yet started), or that UCX has its own internal RTE barrier?
I guess the first - it was designed with current limitations in mind and …
Best regards, Artem Polyakov
Best to check to be sure - I don't believe IB knows how to handle that situation. Remember, it is a race condition, so it is easy to fool yourself into thinking this works, especially at small scale.
Note, by the way, that this has nothing to do with UCX. The problem lies below that level, in the NIC firmware itself. The NIC is going to receive messages for processes that it knows nothing about. When that happens, it either has to buffer those messages until the process can start and register itself with the NIC, or it has to send an error back to the originator, who then must cache the message and resend it until the receiving NIC can deal with it.
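To make the failure mode concrete, here is a small self-contained sketch of the sender-side cache-and-retransmit pattern described above: if the target process has not yet registered with its NIC, delivery fails, and the sender must hold the message and retry until it succeeds. The functions below are hypothetical stand-ins, not a real NIC, verbs, or UCX API.

```c
/* Sketch only: hypothetical names, not a real NIC or UCX interface. */
#include <stdbool.h>
#include <stdio.h>

static bool peer_registered = false;   /* becomes true once the remote process starts */

/* pretend "NIC": rejects traffic for processes it does not yet know about */
static bool nic_try_deliver(const char *msg)
{
    if (!peer_registered) {
        return false;                  /* receiver NACKs: target process unknown */
    }
    printf("delivered: %s\n", msg);
    return true;
}

int main(void)
{
    const char *pending = "eager send issued before the peer started";
    int attempts = 0;

    /* sender side: cache the message and retransmit until the receiver is ready */
    while (!nic_try_deliver(pending)) {
        attempts++;
        if (3 == attempts) {
            peer_registered = true;    /* simulate the late process finally starting */
        }
    }
    printf("succeeded after %d retransmissions\n", attempts);
    return 0;
}
```

The alternative mentioned above (the receiving NIC buffering unsolicited messages until the process registers) trades sender-side retries for receiver-side memory; either way, something has to absorb the race.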
Sure, let's double-check. @yosefe @jladd-mlnx please advise.
As Artem pointed out, UCX is able to support this feature.
I was actually asking for a little more reassurance than that 🤕 There are several scenarios to consider here, and I'm really concerned about whether you have adequately reviewed the code to ensure we are covered, or just tested on a small cluster and said "it works" for some simple test case.

If the user allows the app to perform a modex (i.e., they don't specify the async modex param), then there is at least a barrier in the modex itself that ensures procs have been started on every node. I assume that in this case, the process of opening the UCX component allows the NIC to know that the process exists and sets up any necessary registers so the NIC can handle an ensuing connection request. Thus, a subsequent barrier might not be necessary, assuming no further setup is required.

However, if the user specifies the async modex param and specifies that we not perform a barrier at the end of MPI_Init, then it is possible that the remote process will not even have started yet when the NIC receives a connection request. What happens in this case has nothing to do with UCX, as that library is only functioning on one side of the problem - it hasn't been instantiated yet on the receiving end. You might be protected here if one assumes that you require a modex of some kind to be completed prior to connection - i.e., if someone specifies the async modex param and the no-barrier option, you are still protected by the need for a direct modex to succeed in order for the sender to get the necessary connection info. Thus, you are relying on the direct modex to act as your effective barrier.

IF the direct modex is adequate, then effectively every transport could use this "feature", and we could consider making it the default. However, in the absence of any analysis from you and the others in the community, we cannot know, and I would oppose putting this feature in a production release.

The finalize barrier is more concerning, as there is no way to know if all MPI messages have been progressed to send completion prior to the process exiting. We have seen hangs due to unsent messages, and I have yet to hear of someone resolving that problem in a reliable manner. Again, in the absence of some analysis indicating how this has been solved, I would oppose putting this feature in a production release.

I didn't write those two features with any intent of them going into 2.x, nor have I seen any discussion around the problems I described. I'm therefore really reluctant to release them into 2.x without some better understanding of how those problems were addressed.
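The "direct modex acts as your effective barrier" argument can be pictured with a short sketch: the sender cannot even attempt a connection until a lookup of the peer's endpoint data succeeds, and that lookup cannot succeed before the peer has started and published its info, so the modex round-trip orders the connection attempt after peer startup. All names below are hypothetical illustrations, not the Open MPI or PMIx API.

```c
/* Sketch only: hypothetical names, not Open MPI or PMIx calls. */
#include <stdbool.h>
#include <stdio.h>

static const char *published_endpoint = NULL;   /* set once the peer starts and publishes */

/* stand-in for a direct-modex lookup of a peer's endpoint info */
static const char *direct_modex_lookup(void)
{
    /* a real implementation would ask the local daemon, which forwards the
     * request to the peer's daemon and returns the data once published */
    return published_endpoint;
}

static bool try_connect(void)
{
    const char *ep = direct_modex_lookup();
    if (NULL == ep) {
        return false;       /* peer not started yet: no way to even address it */
    }
    printf("connecting using: %s\n", ep);
    return true;
}

int main(void)
{
    if (!try_connect()) {
        printf("lookup pending: the modex round-trip orders us after peer startup\n");
    }
    published_endpoint = "peer endpoint data";  /* the peer has now started */
    try_connect();
    return 0;
}
```

Note this only covers the init side; as the comment above points out, no comparable handshake guarantees that outstanding sends have completed before a process exits, which is why the finalize barrier is the harder case.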
Per the call today:
Please add a Signed-off-by line to this PR's commit.
bot:mellanox:retest |
bot:mellanox:retest |
Per discussion at the Open MPI devel F2F, @artpol84 says this issue can be closed.
Add an experimental ability to skip the RTE barriers at the end of MPI_Init and the beginning of MPI_Finalize (cherry-picked from 2c086e5)