ibm/onesided/c_strided_acc_onelock is failing on master with openib BTL #640
Comments
Can you give the attached patch a try? (It works for me ...)
diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
index 85ba474..6d29f97 100644
--- a/opal/mca/btl/openib/btl_openib_component.c
+++ b/opal/mca/btl/openib/btl_openib_component.c
@@ -3359,12 +3359,15 @@ progress_pending_frags_wqe(mca_btl_base_endpoint_t *ep, const int qpn)
         frag = opal_list_remove_first(&ep->qps[qpn].no_wqe_pending_frags[i]);
         if(NULL == frag)
             break;
+        assert(0 == frag->opal_list_item_refcount);
         tmp_ep = to_com_frag(frag)->endpoint;
         ret = mca_btl_openib_endpoint_post_send(tmp_ep, to_send_frag(frag));
         if (OPAL_SUCCESS != ret) {
             /* NTH: this handles retrying if we are out of credits but other errors are not
              * handled (maybe abort?). */
-            opal_list_prepend (&ep->qps[qpn].no_wqe_pending_frags[i], (opal_list_item_t *) frag);
+            if (OPAL_ERR_RESOURCE_BUSY != ret) {
+                opal_list_prepend (&ep->qps[qpn].no_wqe_pending_frags[i], (opal_list_item_t *) frag);
+            }
             break;
         }
     }
The crash occurs in progress_pending_frags_wqe. For some reason, mca_btl_openib_endpoint_post_send fails. Note that there is a comment suggesting the error might not be handled correctly.
This patch works for me and I think it is basically correct. If we get back OPAL_ERR_RESOURCE_BUSY, it means that somewhere within the post_send function the fragment was already queued up, so there is no need to add it back to the no_wqe_pending_frags list. In the 1.8 series the return value from mca_btl_openib_post_send was never checked and the opal_list_prepend was not there. Maybe if we can get Nathan to agree, we can make this change along with some good comments as well. @hjelmn - Any thoughts on this?
I think this is the correct change. I thought I had made the same change myself but it might not have been pushed.
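For readers following the reasoning in the comments above, here is a minimal, self-contained sketch of why the fragment must not be prepended again when the send path reports "busy". It is not the openib BTL code: `frag_t`, `list_t`, `post_send`, `progress_pending`, and the `SKETCH_*` codes are hypothetical stand-ins, and the mock `post_send` simply assumes, as described above, that a busy result means the fragment was already queued on another list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for OPAL return codes; the values are illustrative only. */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_RESOURCE_BUSY = -1, SKETCH_ERR_OTHER = -2 };

/* Hypothetical list item: refcount counts the lists it is on (must be 0 or 1). */
typedef struct frag {
    struct frag *next;
    int refcount;
} frag_t;

typedef struct {
    frag_t *head;
    size_t len;
} list_t;

static void list_prepend(list_t *l, frag_t *f)
{
    /* Same invariant the patch asserts: an item may sit on only one list. */
    assert(0 == f->refcount);
    f->refcount = 1;
    f->next = l->head;
    l->head = f;
    l->len++;
}

static frag_t *list_remove_first(list_t *l)
{
    frag_t *f = l->head;
    if (NULL == f) {
        return NULL;
    }
    l->head = f->next;
    f->next = NULL;
    f->refcount = 0;
    l->len--;
    return f;
}

/* Mock send path. Assumption: on "busy" it queues the fragment itself. */
static int post_send(list_t *internal_queue, frag_t *f, int forced_result)
{
    if (SKETCH_ERR_RESOURCE_BUSY == forced_result) {
        list_prepend(internal_queue, f);
    }
    return forced_result;
}

/* Retry one pending fragment; re-queue it here only if nobody else did. */
static int progress_pending(list_t *pending, list_t *internal_queue, int forced_result)
{
    frag_t *f = list_remove_first(pending);
    if (NULL == f) {
        return SKETCH_SUCCESS;
    }
    int ret = post_send(internal_queue, f, forced_result);
    if (SKETCH_SUCCESS != ret && SKETCH_ERR_RESOURCE_BUSY != ret) {
        /* Other errors: the fragment is on no list, so put it back. */
        list_prepend(pending, f);
    }
    return ret;
}
```

Without the `SKETCH_ERR_RESOURCE_BUSY` check, the busy case would call `list_prepend` on an item whose refcount is already 1, tripping the assertion in this sketch; that is the double-insertion the patch avoids.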
Fixes open-mpi/ompi#640 (cherry picked from commit open-mpi/ompi@9f171de)
I noticed the following failures in my nightly tests. I re-ran the tests without CUDA-aware support configured in and still see the failure. This only seems to happen with the openib BTL. I do not see this in the 1.8 series.