Hangs in mca_btl_vader_component_progress on multiple archs #5638

Closed
amckinstry opened this issue Aug 31, 2018 · 24 comments

@amckinstry

Background information

OpenMPI 3.1.2
PMIX 3.0.1

Installed on Debian/sid.

Testing across our suite of MPI programs, we're seeing hangs on some apps; offhand, the common factor looks like 32-bit systems: i386, mipsel.

Debian bugs:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905418
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907267
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907407

This is on simple 2-core systems for the most part. I have a straightforward reproducer in an i386 VM here with ARPACK.
The backtraces look like:

#0  0xb7f43d09 in __kernel_vsyscall ()
#1  0xb7f43986 in __vdso_clock_gettime ()
#2  0xb7111af1 in __GI___clock_gettime (clock_id=1, tp=0xbffc7f54) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6e9edd7 in ?? () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#4  0xb6e4d3f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb76474a5 in ompi_request_default_wait () from /usr/lib/i386-linux-gnu/libmpi.so.40
#6  0xb769d0a5 in ompi_coll_base_sendrecv_actual () from /usr/lib/i386-linux-gnu/libmpi.so.40
#7  0xb769d4a9 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/lib/i386-linux-gnu/libmpi.so.40
#8  0xb4a39860 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#9  0xb765cf9b in PMPI_Allreduce () from /usr/lib/i386-linux-gnu/libmpi.so.40
#10 0xb774de78 in pmpi_allreduce__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
#11 0xb7f0acc3 in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:79
#12 0xb7f0614d in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:776
#13 0xb7f0749d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391
#14 0xb7f0813e in pdsaupd (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, tol=1.1102230246251565e-16, resid=..., ncv=20, v=..., ldv=256, iparam=...,
    ipntr=..., workd=..., workl=..., lworkl=560, info=0, _bmat=1, _which=2) at pdsaupd.f:630
#15 0x004f6977 in parnoldi (comm=0) at issue46.f:200
#16 0x004f758f in issue46 () at issue46.f:21
#17 0x004f61ec in main (argc=1, argv=0xbffce86f) at issue46.f:23
#18 0xb70259a1 in __libc_start_main (main=0x4f61b0 <main>, argc=1, argv=0xbffcd7d4, init=0x4f75d0 <__libc_csu_init>, fini=0x4f7630 <__libc_csu_fini>,
    rtld_fini=0xb7f54f60 <_dl_fini>, stack_end=0xbffcd7cc) at ../csu/libc-start.c:310
#19 0x004f622c in _start ()

and


#0  0xb7fced09 in __kernel_vsyscall ()
#1  0xb7fce986 in __vdso_clock_gettime ()
#2  0xb719caf1 in __GI___clock_gettime (clock_id=1, tp=0xbf82ef84) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f29dd7 in ?? () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#4  0xb6ed83f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb4ae5b1d in mca_pml_ob1_recv () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#6  0xb77095a8 in PMPI_Recv () from /usr/lib/i386-linux-gnu/libmpi.so.40
#7  0xb77e20a5 in pmpi_recv__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
#8  0x0048e6a1 in av (comm=0, nloc=50, nx=10, mv_buf=..., v=..., w=...) at issue46.f:436
#9  0x0048e92a in parnoldi (comm=0) at issue46.f:215
#10 0x0048f58f in issue46 () at issue46.f:21
#11 0x0048e1ec in main (argc=1, argv=0xbf83586f) at issue46.f:23
#12 0xb70b09a1 in __libc_start_main (main=0x48e1b0 <main>, argc=1, argv=0xbf834264, init=0x48f5d0 <__libc_csu_init>, fini=0x48f630 <__libc_csu_fini>,
    rtld_fini=0xb7fdff60 <_dl_fini>, stack_end=0xbf83425c) at ../csu/libc-start.c:310
#13 0x0048e22c in _start ()
@amckinstry
Author

amckinstry commented Sep 1, 2018

With a debug build:

#0  mca_btl_vader_check_fboxes () at btl_vader_fbox.h:170
#1  mca_btl_vader_component_progress () at btl_vader_component.c:702
#2  0xb6df9380 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#3  0xb75f34a5 in ompi_request_wait_completion (req=0x13f4d00) at ../ompi/request/request.h:413
#4  ompi_request_default_wait (req_ptr=0xbf7fdda4, status=0xbf7fdda8) at request/req_wait.c:42
#5  0xb76490a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x13da6c0, scount=1, sdatatype=0xb76abf00 <ompi_mpi_dblprec>, dest=0, stag=-12, recvbuf=0xbf7fdfe0,
    rcount=1, rdatatype=0xb76abf00 <ompi_mpi_dblprec>, source=0, rtag=-12, comm=0xb76b3fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#6  0xb76494a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf7fdfd0, rbuf=0xbf7fdfe0, count=1, dtype=0xb76abf00 <ompi_mpi_dblprec>,
    op=0xb76c1a20 <ompi_mpi_op_sum>, comm=0xb76b3fe0 <ompi_mpi_comm_world>, module=0x13f62f0) at base/coll_base_allreduce.c:217
#7  0xb49e2860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf7fdfd0, rbuf=0xbf7fdfe0, count=1, dtype=0xb76abf00 <ompi_mpi_dblprec>,
    op=0xb76c1a20 <ompi_mpi_op_sum>, comm=0xb76b3fe0 <ompi_mpi_comm_world>, module=0x13f62f0) at coll_tuned_decision_fixed.c:66
#8  0xb7608f9b in PMPI_Allreduce (sendbuf=0xbf7fdfd0, recvbuf=0xbf7fdfe0, count=1, datatype=0xb76abf00 <ompi_mpi_dblprec>, op=0xb76c1a20 <ompi_mpi_op_sum>,
    comm=0xb76b3fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#9  0xb76f9e78 in ompi_allreduce_f (sendbuf=0xbf7fdfd0 "\030!甯A\353?\001먠\335\036\364?FLl\366̑\362?\220\347", <incomplete sequence \355\267>,
    recvbuf=0xbf7fdfe0 "FLl\366̑\362?\220\347", <incomplete sequence \355\267>, count=0xb7eca1c0, datatype=0xb7eca1c4, op=0xb7eca1c8, comm=0x43d2b4,
    ierr=0xbf7fdfcc) at pallreduce_f.c:87
#10 0xb7eabcc3 in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:79
#11 0xb7ea714d in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:776
#12 0xb7ea849d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391
#13 0xb7ea913e in pdsaupd (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, tol=1.1102230246251565e-16, resid=..., ncv=20, v=..., ldv=256, iparam=...,
    ipntr=..., workd=..., workl=..., lworkl=560, info=0, _bmat=1, _which=2) at pdsaupd.f:630
#14 0x0043b977 in parnoldi (comm=0) at issue46.f:200
#15 0x0043c58f in issue46 () at issue46.f:21
#16 0x0043b1ec in main (argc=1, argv=0xbf80401e) at issue46.f:23
#17 0xb6fd19a1 in __libc_start_main (main=0x43b1b0 <main>, argc=1, argv=0xbf803524, init=0x43c5d0 <__libc_csu_init>, fini=0x43c630 <__libc_csu_fini>,
    rtld_fini=0xb7ef5f60 <_dl_fini>, stack_end=0xbf80351c) at ../csu/libc-start.c:310
#18 0x0043b22c in _start ()

and

#0  0xb5a11b46 in mca_btl_vader_component_progress () at btl_vader_component.c:701
#1  0xb6ee2380 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#2  0xb4ae5b1d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:413
#3  mca_pml_ob1_recv (addr=<optimized out>, count=<optimized out>, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>,
    status=<optimized out>) at pml_ob1_irecv.c:135
#4  0xb77135a8 in PMPI_Recv (buf=0xbff0a9c0, count=10, type=0xb7794f00 <ompi_mpi_dblprec>, source=1, tag=0, comm=0xb779cfe0 <ompi_mpi_comm_world>,
    status=0xbff09d28) at precv.c:79
#5  0xb77ec0a5 in ompi_recv_f (
    buf=0xbff0a9c0 "\254g\027\033-ư?\362L,Lq\305\"?[\004\070\347\060\216ÿK\205tH\316lÿ'\265\276\231\037\360\223?\345\210\033]sf\251\277\240XW-\257'\303?-qz\263A;\245\277x\240|\022d*\254?\263\021\202\006\274\232\250\277", count=0xbff09e00, datatype=0x459018, source=0xbff09d18, tag=0xbff09d14, comm=0x4592b4,
    status=0xbff09d28, ierr=0xbff09d10) at precv_f.c:85
#6  0x004576a1 in av (comm=0, nloc=50, nx=10, mv_buf=..., v=..., w=...) at issue46.f:436
#7  0x0045792a in parnoldi (comm=0) at issue46.f:215
#8  0x0045858f in issue46 () at issue46.f:21
#9  0x004571ec in main (argc=1, argv=0xbff1001e) at issue46.f:23
#10 0xb70ba9a1 in __libc_start_main (main=0x4571b0 <main>, argc=1, argv=0xbff0ee24, init=0x4585d0 <__libc_csu_init>, fini=0x458630 <__libc_csu_fini>,
    rtld_fini=0xb7fdef60 <_dl_fini>, stack_end=0xbff0ee1c) at ../csu/libc-start.c:310

@amckinstry
Author

https://buildd.debian.org/status/package.php?p=elpa

Fails/hangs on mips64el and ppc64el as well, so it's not just 32-bit systems.
Tested with the internal PMIx / libevent, and it still occurs.

amckinstry changed the title from "Hangs on i386 (and other 32-bit systems?)" to "Hangs in mca_btl_vader_component_progress on multiple archs" on Sep 11, 2018
@amckinstry
Author

amckinstry commented Sep 11, 2018

This reliably happens on i386 and other archs (but not x86_64).
I've tested with OpenMPI 3.1.3 as of Sept. 11.

0xb5933b16 in mca_btl_vader_component_progress () at btl_vader_component.c:690
#1  0xb6e3a380 in opal_progress () at runtime/opal_progress.c:228
#2  0xb4a24b1d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:413

and on core 2:

 0xb7176af1 in __GI___clock_gettime (clock_id=1, tp=0xbfeaa794) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f02d57 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4  0xb6eb13f8 in opal_progress_events () at runtime/opal_progress.c:181
#5  opal_progress () at runtime/opal_progress.c:244
#6  0xb76ab5a5 in ompi_request_wait_completion (req=0xdced00) at ../ompi/request/request.h:413

The test case is issue46 from arpack-ng, an issue involving custom communicators:
https://github.com/opencollab/arpack-ng/blob/master/PARPACK/TESTS/MPI/issue46.f
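
For reference, the call that ends up spinning in the backtraces above is a small MPI_Allreduce on two ranks (reached from pdnorm2). Below is a minimal C sketch of that communication pattern; it is not the actual Fortran test, and a single call in isolation may not reproduce the race, it only shows the shape of the operation that hangs.

/* Minimal sketch of the pattern seen in the hung backtraces: a small
 * MPI_Allreduce over MPI_COMM_WORLD on two ranks.  This is NOT the ARPACK
 * issue46 test (which is Fortran and exercises custom communicators); it
 * only mimics the call that spins in opal_progress() /
 * mca_btl_vader_component_progress(). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)(rank + 1);
    double global = 0.0;

    /* On affected platforms the failing test hangs inside calls like this,
     * in the shared-memory (vader) fast-box progress loop. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: sum = %f\n", rank, global);

    MPI_Finalize();
    return 0;
}

Build with mpicc and run with something like mpirun --mca btl vader,self -np 2 ./a.out so the shared-memory BTL is exercised.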

@jsquyres
Member

This may be related to #5375.

@hjelmn
Member

hjelmn commented Sep 11, 2018

@amckinstry Please try this patch:

diff --git a/opal/mca/btl/vader/btl_vader_fbox.h b/opal/mca/btl/vader/btl_vader_fbox.h
index 17239ce8ef..b5526050e0 100644
--- a/opal/mca/btl/vader/btl_vader_fbox.h
+++ b/opal/mca/btl/vader/btl_vader_fbox.h
@@ -50,9 +50,10 @@ void mca_btl_vader_poll_handle_frag (mca_btl_vader_hdr_t *hdr, mca_btl_base_endp
 static inline void mca_btl_vader_fbox_set_header (mca_btl_vader_fbox_hdr_t *hdr, uint16_t tag,
                                                   uint16_t seq, uint32_t size)
 {
-    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = tag, .seq = seq, .size = size}};
-    opal_atomic_wmb ();
+    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = 0, .seq = seq, .size = size}};
     hdr->ival = tmp.ival;
+    opal_atomic_wmb ();
+    hdr->data.tag = tag;
 }
 
 /* attempt to reserve a contiguous segment from the remote ep */
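
For context, the ordering this enforces: the receiving side polls the tag field of the fast-box header, so every other field must be visible before a non-zero tag is published. A rough sketch of the idea, using C11 atomics as a stand-in for opal_atomic_wmb() and illustrative types rather than the real vader definitions:

/* Sketch of the publish/poll ordering the patch above enforces.  Types and
 * names here are illustrative, not the real vader structures, and
 * atomic_thread_fence() stands in for opal_atomic_wmb(). */
#include <stdatomic.h>
#include <stdint.h>

typedef union {
    struct {
        uint16_t tag;   /* receiver polls this; 0 means "nothing here yet" */
        uint16_t seq;
        uint32_t size;
    } data;
    uint64_t ival;
} fbox_hdr_t;

/* Sender: fill in seq/size with the tag still 0, fence, then publish the
 * tag last.  On 32-bit targets the 64-bit store to hdr->ival may be split
 * into two 32-bit stores, so writing the real tag as part of that store
 * (as the pre-patch code did) can make the tag visible before the size. */
static inline void fbox_set_header (fbox_hdr_t *hdr, uint16_t tag,
                                    uint16_t seq, uint32_t size)
{
    fbox_hdr_t tmp = {.data = {.tag = 0, .seq = seq, .size = size}};
    hdr->ival = tmp.ival;
    atomic_thread_fence (memory_order_release);  /* like opal_atomic_wmb() */
    hdr->data.tag = tag;                         /* tag written last */
}

/* Receiver (conceptually): spin until hdr->data.tag != 0, issue a read
 * barrier, then trust seq and size.  If the tag can become visible before
 * the size, the receiver reads a bogus length and the exchange can hang. */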

@amckinstry
Author

Unfortunately that's not the (full?) answer.
It's now hanging elsewhere:

0xb71afaf1 in __GI___clock_gettime (clock_id=1, tp=0xbf8e7034) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f3bdd7 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4  0xb6eea3f8 in opal_progress_events () at runtime/opal_progress.c:181
#5  opal_progress () at runtime/opal_progress.c:244
#6  0xb76e44a5 in ompi_request_wait_completion (req=0x2145d00) at ../ompi/request/request.h:413
#7  ompi_request_default_wait (req_ptr=0xbf8e7134, status=0xbf8e7138) at request/req_wait.c:42
#8  0xb773a0a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x212b670, scount=1, sdatatype=0xb779cf00 <ompi_mpi_dblprec>, dest=1, stag=-12, recvbuf=0xbf8e7368,
    rcount=1, rdatatype=0xb779cf00 <ompi_mpi_dblprec>, source=1, rtag=-12, comm=0xb77a4fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#9  0xb773a4a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf8e7360, rbuf=0xbf8e7368, count=1, dtype=0xb779cf00 <ompi_mpi_dblprec>,
    op=0xb77b3a20 <ompi_mpi_op_max>, comm=0xb77a4fe0 <ompi_mpi_comm_world>, module=0x21472f0) at base/coll_base_allreduce.c:217
#10 0xb491b860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf8e7360, rbuf=0xbf8e7368, count=1, dtype=0xb779cf00 <ompi_mpi_dblprec>,
    op=0xb77b3a20 <ompi_mpi_op_max>, comm=0xb77a4fe0 <ompi_mpi_comm_world>, module=0x21472f0) at coll_tuned_decision_fixed.c:66
#11 0xb76f9f9b in PMPI_Allreduce (sendbuf=0xbf8e7360, recvbuf=0xbf8e7368, count=1, datatype=0xb779cf00 <ompi_mpi_dblprec>, op=0xb77b3a20 <ompi_mpi_op_max>,
    comm=0xb77a4fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#12 0xb77eae78 in ompi_allreduce_f (sendbuf=0xbf8e7360 'ڼ\264\334\313\361\373?\246w\324\344\000', recvbuf=0xbf8e7368 '\246w\324\344\000', count=0xb7fad1c0,
    datatype=0xb7fad1c4, op=0xb7fad1c0, comm=0x4dd2b4, ierr=0xbf8e735c) at pallreduce_f.c:87
#13 0xb7f8ec6a in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:73

and

 0xb7184af1 in __GI___clock_gettime (clock_id=1, tp=0xbf9a8054) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f10dd7 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4  0xb6ebf3f8 in opal_progress_events () at runtime/opal_progress.c:181
#5  opal_progress () at runtime/opal_progress.c:244
#6  0xb76b94a5 in ompi_request_wait_completion (req=0x2581d00) at ../ompi/request/request.h:413
#7  ompi_request_default_wait (req_ptr=0xbf9a8154, status=0xbf9a8158) at request/req_wait.c:42
#8  0xb770f0a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x2583ca0, scount=2, sdatatype=0xb7771f00 <ompi_mpi_dblprec>, dest=0, stag=-12, recvbuf=0xbf9acb00,
    rcount=2, rdatatype=0xb7771f00 <ompi_mpi_dblprec>, source=0, rtag=-12, comm=0xb7779fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#9  0xb770f4a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf9acb10, rbuf=0xbf9acb00, count=2, dtype=0xb7771f00 <ompi_mpi_dblprec>,
    op=0xb7787a20 <ompi_mpi_op_sum>, comm=0xb7779fe0 <ompi_mpi_comm_world>, module=0x2583530) at base/coll_base_allreduce.c:217
#10 0xb48f4860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf9acb10, rbuf=0xbf9acb00, count=2, dtype=0xb7771f00 <ompi_mpi_dblprec>,
    op=0xb7787a20 <ompi_mpi_op_sum>, comm=0xb7779fe0 <ompi_mpi_comm_world>, module=0x2583530) at coll_tuned_decision_fixed.c:66
#11 0xb76cef9b in PMPI_Allreduce (sendbuf=0xbf9acb10, recvbuf=0xbf9acb00, count=2, datatype=0xb7771f00 <ompi_mpi_dblprec>, op=0xb7787a20 <ompi_mpi_op_sum>,
    comm=0xb7779fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#12 0xb77bfe78 in ompi_allreduce_f (sendbuf=0xbf9acb10 '\344}}\371Z\201\345?\306\022g\312\302\\\374?\000', recvbuf=0xbf9acb00 '\000',
    count=0xb7f96790 <j.3875>, datatype=0xb7f81044, op=0xb7f81040, comm=0x4d62b4, ierr=0xb7f9677c <ierr.3868>) at pallreduce_f.c:87
#13 0xb7f5e2b0 in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:610
#14 0xb7f6049d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391

hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 14, 2018
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.

Fixes open-mpi#5375, open-mpi#5638

Signed-off-by: Nathan Hjelm <[email protected]>
@jsquyres
Member

See PR #5696.

hjelmn added a commit to hjelmn/ompi that referenced this issue Sep 14, 2018
To ensure fast box entries are complete when processed by the
receiving process the tag must be written last. This includes a zero
header for the next fast box entry (in some cases). This commit fixes
two instances where the tag was written too early. In one case, on
32-bit systems it is possible for the tag part of the header to be
written before the size. The second instance is an ordering issue. The
zero header was being written after the fastbox header.

Fixes open-mpi#5375, open-mpi#5638

Signed-off-by: Nathan Hjelm <[email protected]>
(cherry picked from commit 850fbff)
Signed-off-by: Nathan Hjelm <[email protected]>
@amckinstry
Author

@hjelmn There is more to the commit in 850fbff than in the patch listed above; is that deliberate?

@jsquyres
Member

@amckinstry Yes. 850fbff should be everything that is needed (even though it is more than what was described on this issue).

@jsquyres
Member

@amckinstry Could you let us know if this fixes the issue on your platforms?

@amckinstry
Author

@jsquyres Just tested the full patch 850fbff. No joy.

@jsquyres
Member

Doh!

So just to be clear:

  • you tested Open MPI v3.1.2
  • with the patch from dca3516
  • and also with the patch from 850fbff

And you still get hangs on i386?

@amckinstry
Author

Yes.

@jsquyres
Member

@amckinstry Sorry -- one more clarification: does it only hang on i386, or does it still hang on other 32-bit platforms?

@amckinstry
Author

Multiple: i386, armhf, armel (arm64 seems fine), and powerpc. Also kfreebsd-i386 and kfreebsd-amd64 (the Debian port with the kFreeBSD kernel).
It works on all mips variants and ppc64el.

(Details here: https://buildd.debian.org/status/package.php?p=arpack, as the arpack test suite fails.)
Hard to see a pattern: it's not 32-bit-only or big-endian; some arch-sensitive code somewhere?

@hjelmn
Member

hjelmn commented Sep 26, 2018

Planning to build an i386 virtual machine to take a closer look. I'm really surprised this is still happening.

@hjelmn
Member

hjelmn commented Sep 26, 2018

For powerpc, are you using --disable-builtin-atomics?

@amckinstry
Author

amckinstry commented Sep 26, 2018

No, I'm not using --disable-builtin-atomics on any arch at the moment.
I am using --enable-builtin-atomics on s390x (which works).
Currently I'm only testing on a 2-core i386 VM on my laptop, for simplicity.
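
(For anyone following along: as I understand the configure option, --enable-builtin-atomics builds the opal atomics on the compiler's built-in atomic operations, while --disable-builtin-atomics falls back to Open MPI's hand-written assembly implementations. Roughly, the "builtin" flavor means primitives like the following; this is an illustration, not Open MPI code.)

/* Illustration only: the kind of compiler-builtin primitives selected by
 * --enable-builtin-atomics (GCC/Clang __atomic builtins), as opposed to
 * hand-written per-architecture assembly. */
#include <stdint.h>

static inline int32_t builtin_fetch_add_32 (volatile int32_t *addr, int32_t delta)
{
    return __atomic_fetch_add (addr, delta, __ATOMIC_SEQ_CST);
}

static inline void builtin_wmb (void)
{
    __atomic_thread_fence (__ATOMIC_RELEASE);  /* roughly opal_atomic_wmb() */
}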

@jsquyres
Member

@hjelmn We talked about this on the webex yesterday. It may be that you fixed one area of vader, but we're running into another problem. All we know is that it's hanging for @amckinstry -- not necessarily that it's the same exact problem.

@jsquyres
Member

jsquyres commented Oct 2, 2018

@amckinstry If you could test the patch from #5829 and see if that fixes the issue (in addition to the other 2 patches), we'd greatly appreciate it. Thanks!

@amckinstry
Author

That appears to fix it, thanks!

@gpaulsen
Member

gpaulsen commented Oct 3, 2018

If this is on v3.0.x and v3.1.x, it is probably also an issue for v4.0.x.

@jsquyres
Member

jsquyres commented Oct 3, 2018

PRs created (and linked to this issue) for all release branches.

@jsquyres
Member

jsquyres commented Oct 3, 2018

Merged into all release branches -- huzzah!
