Skip to content

ofi nic selection bugfix and document update #12144

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Jan 17, 2024

Conversation

wenduwan
Copy link
Contributor

@wenduwan wenduwan commented Dec 2, 2023

Related to #12135. I got tips from Jeff that the NIC selection logic has changed a lot from 4.x, and the community could use a better document to understand the issue given the many conversations recently.

I went ahead and revisited the code. Indeed we should update the documentation - while doing that I also realized that the new code is not doing the desired thing with the round-robin selection and that is a bug I introduced in the other PR:

  • Unbound processes, i.e. --bind-to none, always choose the first NIC. VERY BAD!
  • The selection process does not filter out NICs/providers of different types from the first provider. That is a latent bug causing processes to use different types of NICs, also VERY BAD!

* provider if the selection succeeds
* if the selection fails, returns the fi_info
* object that was initially provided.
* The selection is based on the following priority:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsquyres Does this explain the NIC selection well enough? What else do you think will make it better?

Copy link
Member

@edgargabriel edgargabriel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have means for testing this patch, but just reading through the code and comments, and looks ok to me. I hope that @hppritcha or somebody else can however actually test the patch.

@hppritcha
Copy link
Member

i'll test this tomorrow.

@wenduwan
Copy link
Contributor Author

Hey @jsquyres could you give the change a read through and see if the document around nic selection makes sense to you?

@wenduwan wenduwan self-assigned this Dec 19, 2023
This change fixes current round-robin selection logic:
- Only providers of the same type should be considered, i.e. providers that
match the head of the list. This deviates from the documented behavior.
- For unbound process the selection should be based on its local rank, i.e.
rank among processes on the same node. Currently only the first NIC will be
selected.

Signed-off-by: Wenduo Wang <[email protected]>
The documentation needs an update to reflect latest implementation.
The original cpuset matching logic has been replaced with a new distance
calculation algorithm.
This change also clarifies the round-robin selection process when we need to
break a tie.

Signed-off-by: Wenduo Wang <[email protected]>
@wenduwan
Copy link
Contributor Author

@jsquyres Since this PR contains a bugfix, I will merge this change first. We can sync up later and address your concerns separately.

@wenduwan wenduwan merged commit 223ed58 into open-mpi:main Jan 17, 2024
@jsquyres
Copy link
Member

@wenduwan So sorry for the delay. Yes, I agree, the comments describe how the code works. You were correct to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants