-
Notifications
You must be signed in to change notification settings - Fork 900
ofi nic selection bugfix and document update #12144
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* provider if the selection succeeds | ||
* if the selection fails, returns the fi_info | ||
* object that was initially provided. | ||
* The selection is based on the following priority: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsquyres Does this explain the NIC selection well enough? What else do you think will make it better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have means for testing this patch, but just reading through the code and comments, and looks ok to me. I hope that @hppritcha or somebody else can however actually test the patch.
i'll test this tomorrow. |
Hey @jsquyres could you give the change a read through and see if the document around nic selection makes sense to you? |
This change fixes current round-robin selection logic: - Only providers of the same type should be considered, i.e. providers that match the head of the list. This deviates from the documented behavior. - For unbound process the selection should be based on its local rank, i.e. rank among processes on the same node. Currently only the first NIC will be selected. Signed-off-by: Wenduo Wang <[email protected]>
The documentation needs an update to reflect latest implementation. The original cpuset matching logic has been replaced with a new distance calculation algorithm. This change also clarifies the round-robin selection process when we need to break a tie. Signed-off-by: Wenduo Wang <[email protected]>
779e2be
to
3aba0bb
Compare
@jsquyres Since this PR contains a bugfix, I will merge this change first. We can sync up later and address your concerns separately. |
@wenduwan So sorry for the delay. Yes, I agree, the comments describe how the code works. You were correct to merge. |
Related to #12135. I got tips from Jeff that the NIC selection logic has changed a lot from 4.x, and the community could use a better document to understand the issue given the many conversations recently.
I went ahead and revisited the code. Indeed we should update the documentation - while doing that I also realized that the new code is not doing the desired thing with the round-robin selection and that is a bug I introduced in the other PR:
--bind-to none
, always choose the first NIC. VERY BAD!