Hangs in mca_btl_vader_component_progress on multiple archs #5638
With a debug build:
and
|
https://buildd.debian.org/status/package.php?p=elpa Fails/hangs on mips64el and ppc64el, so not just 32-bit systems. |
This reliably happens on i386 and other archs (but not x86_64).
and on core 2:
The test case is issue46 for arpack: an issue with custom communicators: |
This may be related to #5375. |
@amckinstry Please try this patch:
diff --git a/opal/mca/btl/vader/btl_vader_fbox.h b/opal/mca/btl/vader/btl_vader_fbox.h
index 17239ce8ef..b5526050e0 100644
--- a/opal/mca/btl/vader/btl_vader_fbox.h
+++ b/opal/mca/btl/vader/btl_vader_fbox.h
@@ -50,9 +50,10 @@ void mca_btl_vader_poll_handle_frag (mca_btl_vader_hdr_t *hdr, mca_btl_base_endp
 static inline void mca_btl_vader_fbox_set_header (mca_btl_vader_fbox_hdr_t *hdr, uint16_t tag,
                                                   uint16_t seq, uint32_t size)
 {
-    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = tag, .seq = seq, .size = size}};
-    opal_atomic_wmb ();
+    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = 0, .seq = seq, .size = size}};
     hdr->ival = tmp.ival;
+    opal_atomic_wmb ();
+    hdr->data.tag = tag;
 }

 /* attempt to reserve a contiguous segment from the remote ep */
|
Unfortunately not the (full ?) answer.
and
|
To ensure fast box entries are complete when processed by the receiving process the tag must be written last. This includes a zero header for the next fast box entry (in some cases). This commit fixes two instances where the tag was written too early. In one case, on 32-bit systems it is possible for the tag part of the header to be written before the size. The second instance is an ordering issue. The zero header was being written after the fastbox header. Fixes open-mpi#5375, open-mpi#5638 Signed-off-by: Nathan Hjelm <[email protected]>
See PR #5696. |
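The ordering rule in the commit message above (the payload fields first, the tag last) can be sketched in portable C11. This is only an illustration under stated assumptions, not the Open MPI code: `sketch_hdr_t`, `sketch_publish`, and `sketch_poll` are hypothetical names, the field layout is invented for the example, and C11 release/acquire atomics stand in for `opal_atomic_wmb ()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a fast box header (NOT the Open MPI layout).
 * A tag of 0 means "this entry is not ready yet". */
typedef struct {
    uint32_t size;
    uint16_t seq;
    _Atomic uint16_t tag;
} sketch_hdr_t;

/* Writer: fill in size and seq first, then publish the tag last.
 * The release store keeps the earlier plain stores from being
 * reordered after the tag store. */
static void sketch_publish(sketch_hdr_t *hdr, uint16_t tag,
                           uint16_t seq, uint32_t size)
{
    assert(tag != 0); /* 0 is reserved for "empty" in this sketch */
    hdr->size = size;
    hdr->seq = seq;
    atomic_store_explicit(&hdr->tag, tag, memory_order_release);
}

/* Reader: look at the tag first. Only when it is non-zero are size
 * and seq read; the acquire load pairs with the writer's release store. */
static bool sketch_poll(sketch_hdr_t *hdr, uint16_t *tag,
                        uint16_t *seq, uint32_t *size)
{
    uint16_t t = atomic_load_explicit(&hdr->tag, memory_order_acquire);
    if (0 == t) {
        return false; /* nothing published yet */
    }
    *tag = t;
    *seq = hdr->seq;
    *size = hdr->size;
    return true;
}
```

A receiver that polls with `sketch_poll` either sees no entry at all or a complete one. If the tag could become visible before the size, as the commit message describes for 32-bit systems, a reader might act on a stale size, which is consistent with the kind of hang reported here.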
To ensure fast box entries are complete when processed by the receiving process the tag must be written last. This includes a zero header for the next fast box entry (in some cases). This commit fixes two instances where the tag was written too early. In one case, on 32-bit systems it is possible for the tag part of the header to be written before the size. The second instance is an ordering issue. The zero header was being written after the fastbox header. Fixes open-mpi#5375, open-mpi#5638 Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit 850fbff) Signed-off-by: Nathan Hjelm <[email protected]>
@amckinstry Yes. 850fbff should be everything that is needed (even though it is more than what was described on this issue). |
@amckinstry Could you let us know if this fixes the issue on your platforms? |
Yes. |
@amckinstry Sorry -- one more clarification: does it only hang on i386, or does it still hang on other 32 bit platforms? |
Multiple: i386, armhf, armel (arm64 seems fine), and powerpc. Also kfreebsd-i386 and kfreebsd-amd64 (the Debian port with the kFreeBSD kernel). Details here: https://buildd.debian.org/status/package.php?p=arpack (the arpack test suite fails). |
Planning to build a 386 virtual machine to take a closer look. Really surprised this is still happening. |
For powerpc are you using --disable-builtin-atomics? |
No, not using --disable-builtin-atomics on any arch at the moment. |
@hjelmn We talked about this on the webex yesterday. It may be that you fixed one area of vader, but we're running into another problem. All we know is that it's hanging for @amckinstry -- not necessarily that it's the same exact problem. |
@amckinstry If you could test the patch from #5829 and see if that fixes the issue (in addition to the other 2 patches), we'd greatly appreciate it. Thanks! |
That appears to fix it, thanks! |
If this is on v3.0.x and v3.1.x, it is probably also an issue for v4.0.x. |
PRs created (and linked to this issue) for all release branches. |
Merged into all release branches -- huzzah! |
Background information
Open MPI 3.1.2
PMIx 3.0.1
Installed on Debian sid.
Testing across our suite of MPI programs, we're seeing hangs in some apps. Offhand, the common factor looks like 32-bit systems: i386, mipsel.
Debian bugs:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905418
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907267
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907407
This is on simple 2-core systems for the most part. I've a straightforward reproducible case in a VM here on i386 with ARPACK.
The backtraces look like
and