ibm/onesided/c_strided_acc_onelock is failing on master with openib BTL #640

Closed

rolfv opened this issue Jun 12, 2015 · 3 comments
rolfv commented Jun 12, 2015

I noticed the following failures in my nightly tests. I re-ran the tests without CUDA-aware support configured in and still see the failure. This only seems to happen with the openib BTL. I do not see this in the 1.8 series.

[rvandevaart@-ivy4 onesided]$ mpirun --host ivy4,ivy5, -np 2 --mca btl self,sm,openib c_strided_acc_onelock
c_strided_acc_onelock: ../../../../../opal/class/opal_list.h:599: opal_list_prepend: Assertion `0 == item->opal_list_item_refcount' failed.
c_strided_acc_onelock: ../../../../../opal/class/opal_list.h:599: opal_list_prepend: Assertion `0 == item->opal_list_item_refcount' failed.
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ivy4 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
[rvandevaart@ivy4 onesided]$ 
ggouaillardet (Contributor) commented:

Can you give the attached patch a try? (It works for me ...)

diff --git a/opal/mca/btl/openib/btl_openib_component.c b/opal/mca/btl/openib/btl_openib_component.c
index 85ba474..6d29f97 100644
--- a/opal/mca/btl/openib/btl_openib_component.c
+++ b/opal/mca/btl/openib/btl_openib_component.c
@@ -3359,12 +3359,15 @@ progress_pending_frags_wqe(mca_btl_base_endpoint_t *ep, const int qpn)
             frag = opal_list_remove_first(&ep->qps[qpn].no_wqe_pending_frags[i]);
             if(NULL == frag)
                 break;
+            assert(0 == frag->opal_list_item_refcount);
             tmp_ep = to_com_frag(frag)->endpoint;
             ret = mca_btl_openib_endpoint_post_send(tmp_ep, to_send_frag(frag));
             if (OPAL_SUCCESS != ret) {
                 /* NTH: this handles retrying if we are out of credits but other errors are not
                  * handled (maybe abort?). */
-                opal_list_prepend (&ep->qps[qpn].no_wqe_pending_frags[i], (opal_list_item_t *) frag);
+                if (OPAL_ERR_RESOURCE_BUSY != ret) {
+                    opal_list_prepend (&ep->qps[qpn].no_wqe_pending_frags[i], (opal_list_item_t *) frag);
+                }
                 break;
             }
        }

The crash occurs in progress_pending_frags_wqe.

For some reason, mca_btl_openib_endpoint_post_send fails (acquire_wqe fails in qp_get_wqe, which puts the fragment on a list). Since a fragment cannot be part of two lists at once, you get a crash when progress_pending_frags_wqe handles the error and invokes opal_list_prepend(..., frag).

Note there is a comment suggesting the error might not be handled correctly. I make no claim that this patch is the correct fix, even though it works for me.

rolfv (Author) commented Jun 17, 2015

This patch works for me and I think it is basically correct. Getting back OPAL_ERR_RESOURCE_BUSY means that somewhere within the post_send function the fragment was already queued up, so there is no need to add it back to the no_wqe_pending_frags list. In the 1.8 series the return value from mca_btl_openib_endpoint_post_send was never checked and the opal_list_prepend was not there. If we can get Nathan to agree, we can make this change along with some good comments as well.

@hjelmn - Any thoughts on this?

hjelmn (Member) commented Jun 26, 2015

I think this is the correct change. I thought I had made the same change myself but it might not have been pushed.
