v2.1.0: Fix OOS issues in openib BTL with single-threaded scenarios #2161
Comments
@jsquyres - looking at mca_btl_openib_sendi(), it returns OPAL_ERR_RESOURCE_BUSY if various resources aren't available. I don't see it queuing the message anywhere for later retry that could result in OOS. Am I all wet?
@hjelmn Can you comment? I haven't looked in the openib BTL in years.
I think this is largely due to the send coalescing code. I asked @thananon to verify.
@larrystevenwise @bharatpotnuri See the thread starting around here #2067 (comment) for some details.
@larrystevenwise @bharatpotnuri Does Chelsio intend to look at this? At this point, this issue will need to be re-milestoned to v3.0.0.
OK, so I measured OOS messages on current master with MPI_THREAD_SINGLE and the openib BTL. I would say this problem is not yet solved. Unfortunately, my work is taking a hit from this problem.
We'll look at it at some point, but we're swamped right now.
@larrystevenwise @bharatpotnuri Any chance someone will look at this soon? Otherwise we'll move it to a future milestone.
@hppritcha We are still stuck with other work; we may need to move it to the next milestone.
Per #2067, the openib BTL has some cases where out-of-sequence (OOS) messages can occur, even in single-threaded scenarios. This affects master, v2.0.x, and v2.1.x. It presumably affects v1.10.x as well, but I don't know if anyone cares.
(note that this issue specifically does not address the OOS issues in multi-threaded scenarios -- that is a much larger issue, and is covered by #2159)
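To make "out-of-sequence" concrete, here is a minimal Python sketch of how a receiver might count OOS arrivals, assuming each sender stamps its messages with a per-peer sequence number starting at 0. It is only a model of the idea, not Open MPI code: the function name `count_out_of_sequence` and the `(peer, seq)` tuple layout are hypothetical.

```python
from collections import defaultdict


def count_out_of_sequence(arrivals):
    """Count messages that arrive ahead of an earlier message from the same peer.

    arrivals: iterable of (peer, seq) tuples in arrival order (hypothetical layout).
    """
    expected = defaultdict(int)  # next sequence number expected from each peer
    pending = defaultdict(set)   # early arrivals parked until the gap is filled
    oos = 0
    for peer, seq in arrivals:
        if seq == expected[peer]:
            expected[peer] += 1
            # Deliver any parked messages that are now in order.
            while expected[peer] in pending[peer]:
                pending[peer].remove(expected[peer])
                expected[peer] += 1
        else:
            # Arrived before its predecessor: count it and park it.
            oos += 1
            pending[peer].add(seq)
    return oos


if __name__ == "__main__":
    # Message 2 overtakes message 1 from peer 0.
    print(count_out_of_sequence([(0, 0), (0, 2), (0, 1), (0, 3)]))
```

In this model a nonzero count under MPI_THREAD_SINGLE would correspond to the kind of reordering this issue describes. The sketch says nothing about where in the openib BTL the reordering is introduced.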
@larrystevenwise @bharatpotnuri Can you guys have a look at this? See #2067 for reference.