-
Notifications
You must be signed in to change notification settings - Fork 900
v4.0.x: usnic fixes and optimizations #7043
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
v4.0.x: usnic fixes and optimizations #7043
Conversation
Rename "get_nsec()" to "get_ticks()" to more accurately reflect that this function has no correlation to wall clock time at all. Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit ce2910a)
New MCA parameter: btl_usnic_ack_iteration_delay. Set this to the number of times through the usNIC component progress function before sending a standalone ACK (vs. piggy-backing the ACK on any other send going to the target peer). Use "ticks" language to clarify that we're really counting the number of times through the usNIC component DATA_CHANNEL completion check (to check for incoming messages) -- it has no relation to wall clock time whatsoever. Also slightly change the channel-checking scheme in usNIC component progress: only check the PRIORITY channel once (vs. checking it once, not finding anything, and then falling through the progress_2() where we check PRIORITY again and then check the DATA channel). As before, if our "progress" libevent fires, increment the tick counter enough to guarantee that all endpoints that need an ACK will get triggered to send standalone ACKs the next time through progress, if necessary. Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit 968b1a5)
Significantly increase the default retrans timeout. If the retrans timeout is too soon, we can end up in a retransmission storm where the logic will continually re-transmit the same frames during a single run through the usNIC progress function (because the timer for a single frame expires before we have run through re-transmitting all the frames pending re-transmission). Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit 3cc95d8)
New MCA param: btl_usnic_max_resends_per_iteration. This is the max number of resends we'll do in a single pass through usNIC component progress. This prevents progress from getting stuck in an endless loop of retransmissions (i.e., if more retransmissions are triggered during the sending of retransmissions). Specifically: we need to leave the resend loop to allow receives to happen (which may ACK messages we have sent previously, and therefore cause pending resends to be moot). Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit 27e3040)
Move the prefix area from the head to the body in relevant size computations. This fixes a problem in high traffic situations where usNIC may have sent from unregistered memory. Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit fe7f772)
Sorry, too late for v4.0.2, moving to v4.0.3 |
It was previously accidentally set to 0. Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit 132e4ca)
Signed-off-by: Jeff Squyres <[email protected]> (cherry picked from commit 3080033)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes look good.
Note: I am a coworker of Jeff's at Cisco.
@gpaulsen @hppritcha Can you approve and merge? I have no one left in the OMPI community who can review usNIC stuff, so I got a co-worker to review it for me (but he doesn't have write perms to the OMPI repo, so his review "doesn't count" in terms of github's requirements). Thanks. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little curious about the long description of making the timeout longer, and then the lack of description on making it shorter, but I assume you have your reasons and it's not wrong.
Yes, I probably could/should have made that explanation better. The real reason was that I merged the commits on master before getting the code review. In the code review, it came out that it really wasn't necessary to increase the timeout -- 5 miliseconds is long enough already. The real reason for (effectively) that timeout being too short was the lack of a cap on retransmitting during a single iteration of usNIC progress. I.e., when sending many large messages to a very large number of peers, we could get into a situation where the retransmit queue got so long that it effectively took longer to retransmit everything than the retrans timers on the sends. Hence, by the time we finished retransmitting everything, at least some of the timers had already expired, which would then lead to immediately retransmitting again, ...etc. Meaning: the retrans timer wasn't the issue. The amount of work we allowed to happen during a single iteration of usNIC progress was the problem. |
Thanks @bwbarrett for reviewing. |
Ready to merge imho |
See individual commit messages for more details.