v2.1.0: Fix OOS issues in openib BTL with single-threaded scenarios #2161

Open
jsquyres opened this issue Oct 4, 2016 · 9 comments
@jsquyres
Member

jsquyres commented Oct 4, 2016

Per #2067, the openib BTL has some cases where out-of-sequence (OOS) messages can occur, even in single-threaded scenarios. This affects master, v2.0.x, and v2.1.x. It presumably affects v1.10.x as well, but I don't know if anyone cares.

(Note that this issue specifically does not address the OOS issues in multi-threaded scenarios. That is a much larger issue, and is covered by #2159.)

@larrystevenwise @bharatpotnuri Can you guys have a look at this? See #2067 for reference.

@jsquyres jsquyres added the bug label Oct 4, 2016
@jsquyres jsquyres added this to the v2.1.0 milestone Oct 4, 2016
@larrystevenwise

@jsquyres - looking at mca_btl_openib_sendi(), it returns OPAL_ERR_RESOURCE_BUSY if various resources aren't available. I don't see it queuing the message anywhere for later retry that could result in OOS. Am I all wet?
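
To make the question above concrete, here is a toy model of the pattern being described. It is not Open MPI code: `toy_link`, `toy_sendi`, `toy_send`, `toy_progress`, and `toy_scenario` are all hypothetical names, and the credit counter is only a stand-in for "various resources". The sketch assumes an immediate-send routine that either puts a message on the wire or reports busy without queuing anything, and a caller that defers busy messages to a FIFO. Under those assumptions, the immediate path alone cannot reorder traffic; reordering only appears if a later message is allowed to take the immediate path while an older one is still deferred.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only. None of these names exist in Open MPI. */
enum { TOY_CAP = 16, TOY_SUCCESS = 0, TOY_ERR_RESOURCE_BUSY = -1 };

struct toy_link {
    int      credits;            /* stand-in for available send resources   */
    unsigned wire[TOY_CAP];      /* sequence numbers in on-the-wire order   */
    size_t   wire_len;
    unsigned pending[TOY_CAP];   /* FIFO of messages deferred by the caller */
    size_t   pending_len;
};

/* Immediate send: goes on the wire now or reports busy. It never queues. */
int toy_sendi(struct toy_link *l, unsigned seq)
{
    if (l->credits == 0 || l->wire_len == TOY_CAP) {
        return TOY_ERR_RESOURCE_BUSY;
    }
    l->credits--;
    l->wire[l->wire_len++] = seq;
    return TOY_SUCCESS;
}

/* Caller policy. With respect_pending set, a new message is never sent
 * ahead of older messages that are still deferred. */
void toy_send(struct toy_link *l, unsigned seq, bool respect_pending)
{
    bool must_wait = respect_pending && l->pending_len > 0;
    if (must_wait || toy_sendi(l, seq) != TOY_SUCCESS) {
        assert(l->pending_len < TOY_CAP);
        l->pending[l->pending_len++] = seq;
    }
}

/* Flush deferred messages in FIFO order for as long as resources last. */
void toy_progress(struct toy_link *l)
{
    size_t sent = 0;
    while (sent < l->pending_len &&
           toy_sendi(l, l->pending[sent]) == TOY_SUCCESS) {
        sent++;
    }
    for (size_t i = sent; i < l->pending_len; i++) {
        l->pending[i - sent] = l->pending[i];
    }
    l->pending_len -= sent;
}

/* Three messages, one initial credit, resources returning late. */
void toy_scenario(struct toy_link *l, bool respect_pending)
{
    *l = (struct toy_link){ .credits = 1 };
    toy_send(l, 0, respect_pending);   /* consumes the only credit        */
    toy_send(l, 1, respect_pending);   /* busy: deferred by the caller    */
    l->credits += 2;                   /* resources return before a flush */
    toy_send(l, 2, respect_pending);
    toy_progress(l);
}
```

Calling `toy_scenario(&l, false)` leaves `l.wire` holding 0, 2, 1, while `toy_scenario(&l, true)` leaves it holding 0, 1, 2. Whether anything in the openib BTL or the layers above it behaves like the first case is exactly what this issue asks someone to verify.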

@jsquyres
Member Author

jsquyres commented Oct 5, 2016

@hjelmn Can you comment? I haven't looked in the openib BTL in years.

@hjelmn
Member

hjelmn commented Oct 5, 2016

I think this is largely due to the send coalescing code. I asked @thananon to verify.

@jsquyres
Member Author

@larrystevenwise @bharatpotnuri See the thread starting around here #2067 (comment) for some details.

@jsquyres
Member Author

@larrystevenwise @bharatpotnuri Does Chelsio intend to look at this? At this point, this issue will need to be re-milestoned to v3.0.0.

@jsquyres jsquyres modified the milestones: v3.0.0, v2.1.0 Feb 22, 2017
@thananon
Member

thananon commented Feb 23, 2017

OK, so I measured OOS messages on current master with MPI_THREAD_SINGLE and the openib BTL. I would say this problem is not yet solved. Unfortunately, my work is taking a hit from this problem.

| Number of messages | OOS count | Percentage |
|--------------------|-----------|------------|
| 256000             | 54146     | 21%        |
| 512000             | 62007     | 12%        |
| 1024000            | 64378     | 8%         |
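
The thread does not say how these counts were collected. As a minimal sketch of one way such a count could be defined, assume each message from a peer carries a sequence number 0..n-1 and that a message counts as OOS when it arrives while an earlier sequence number is still outstanding. `count_oos` and `oos_percent` are hypothetical helpers written for illustration, not Open MPI functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper, not Open MPI code.
 * arrivals[i] is the sequence number of the i-th message to arrive.
 * Assumes the arrivals are a permutation of 0..n-1 (no loss, no duplicates).
 * Returns how many messages arrived while an earlier one was still missing,
 * or (size_t)-1 if memory could not be allocated. */
size_t count_oos(const unsigned *arrivals, size_t n)
{
    assert(arrivals != NULL || n == 0);

    /* held[s] is true once s has arrived early and is waiting its turn. */
    bool *held = calloc(n + 1, sizeof *held);
    if (held == NULL) {
        return (size_t)-1;
    }

    size_t expected = 0;   /* next in-order sequence number */
    size_t oos = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned s = arrivals[i];
        assert(s < n);
        if (s != expected) {
            held[s] = true;            /* arrived early: park it and count it */
            oos++;
        } else {
            expected++;                /* in order: deliver it ...            */
            while (expected < n && held[expected]) {
                expected++;            /* ... and drain parked successors     */
            }
        }
    }
    free(held);
    return oos;
}

/* OOS count as a percentage of all messages. */
double oos_percent(size_t oos, size_t total)
{
    return total != 0 ? 100.0 * (double)oos / (double)total : 0.0;
}
```

With this definition, the arrival order 0, 2, 1, 3 yields one OOS message (sequence number 2). Applied to the first row of the measurement, `oos_percent(54146, 256000)` comes out at roughly 21.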

@larrystevenwise

We'll look at it at some point, but we're swamped right now.

@hppritcha
Member

@larrystevenwise @bharatpotnuri Any chance someone will look at this soon? Otherwise we'll move it to a future milestone.

@bharatpotnuri
Contributor

@hppritcha We are still stuck with other work, so we may need to move it to the next milestone.

8 participants