Hang in mca_mpool_hugepage_module_init() on ARM64 #3697
Comments
Does it hang forever in nanosleep() ?
On Wed, Jun 14, 2017 at 3:20 AM, Yossi Itigin wrote:
Assigned #3697 to @shamisp.
|
Unlikely, this is a fixed 100ns interval:
static inline void _opal_lifo_release_cpu (void)
{
    /* NTH: there are many ways to cause the current thread to be suspended. This one
     * should work well in most cases. Another approach would be to use poll (NULL, 0, ) but
     * the interval will be forced to be in ms (instead of ns or us). Note that there
     * is a performance improvement for the lifo test when this call is made on detection
     * of contention but it may not translate into actually MPI or application performance
     * improvements. */
    static struct timespec interval = { .tv_sec = 0, .tv_nsec = 100 };
    nanosleep (&interval, NULL);
} |
Can you provide more information about what exactly is hanging? I.e., this is clearly one stack trace of the hang, but it must be looping on something that never completes. |
@yosefe can I reproduce this with a single node ? Do I really need IB for this ? thanks |
You don't need IB for this, just force the load of the hugepage mpool. Also, this seems to happen in a very peculiar place, during a lifo_pop operation, which in this particular context uses the conditional load/store of ARM64. As our tests pass on ARM64, I don't think the issue is in the atomic itself, but instead in the way the rb tree is initialized or how the freelist is grown in this particular instance. |
is this a multi-thread run ? how many threads ? |
@shamisp it's looping in Line 224 in d7ebcca
it can be reproduced on 1 node. disabling hugepage mpool moves the hang to another place - so it's something more fundamental:
|
command line on single node:
|
I can reproduce the hang on my ARM64 machine. Open MPI: openmpi-v2.x-201706150321-b562082 (nightly snapshot tarball) I cannot reproduce the hang with Test programs in the Open MPI source tree also hang.
In the In the |
Does it run with --disable-builtin-atomics? |
@hjelmn Builtin atomics are disabled by default in v2.x branch. I enabled BUILTIN_GCC ( |
@yosefe - what compiler version is used ? |
|
@PHHargrove reported similar issues on the devel list. Regarding
A common point of ARM64 and PPC64(LE) is
|
Added a v3.0.0 milestone since @PHHargrove saw this on 3.0.0rc1, per the above comment. |
I will try to reproduce it on one of my systems. |
I can confirm that the problem only shows up in "-O0" mode. |
I know what is happening. With -O0 the opal_atomic_* functions are not inlined. That makes the LL/SC atomics a function call. This will likely cause livelock with the LL/SC fifo/lifo implementations as it increases the chance that a read will cancel the LL reservation. The correct fix is to force those atomics to always be inlined. I will make the change and see if it fixes the issue. |
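To make the retry-loop shape under discussion concrete, here is a minimal sketch of a lock-free LIFO whose atomic helper is marked always-inline. It is not Open MPI's code: every `sketch_*` name is hypothetical, the C11 compare-exchange stands in for the hand-written LL/SC atomics, and `__attribute__((always_inline))` assumes GCC or Clang. The sketch also ignores the ABA problem, so treat it as an illustration of the loop only.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical types and names, for illustration only. */
typedef struct sketch_item {
    struct sketch_item *next;
    int value;
} sketch_item_t;

typedef struct {
    _Atomic(sketch_item_t *) head;
} sketch_lifo_t;

/* GCC/Clang attribute: request inlining even when optimization is off (-O0). */
#define SKETCH_ALWAYS_INLINE static inline __attribute__((always_inline))

/* Atomic helper. C11 compare-exchange is used here as a portable stand-in
 * for a load-link/store-conditional pair. On failure, *expected is updated
 * to the value currently stored at addr. */
SKETCH_ALWAYS_INLINE int
sketch_cas_ptr(_Atomic(sketch_item_t *) *addr, sketch_item_t **expected,
               sketch_item_t *desired)
{
    return atomic_compare_exchange_weak(addr, expected, desired);
}

/* Push: link the item in front of the current head, retry on contention. */
SKETCH_ALWAYS_INLINE void
sketch_lifo_push(sketch_lifo_t *lifo, sketch_item_t *item)
{
    sketch_item_t *old = atomic_load(&lifo->head);
    do {
        item->next = old;
    } while (!sketch_cas_ptr(&lifo->head, &old, item));
}

/* Pop: swing head to head->next, retry on contention.
 * Returns NULL when the LIFO is empty. */
SKETCH_ALWAYS_INLINE sketch_item_t *
sketch_lifo_pop(sketch_lifo_t *lifo)
{
    sketch_item_t *old = atomic_load(&lifo->head);
    while (old != NULL && !sketch_cas_ptr(&lifo->head, &old, old->next)) {
        /* lost the race; old now holds the new head, try again */
    }
    return old;
}
```

A caller would initialize `head` with `atomic_init(&lifo.head, NULL)`, push items, and pop them back in reverse order. With a real LL/SC pair, everything executed between the load-link and the store-conditional counts, which is why an extra function call inside the loop matters in a way this portable version cannot show.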
@hjelmn I confirmed the current status on AArch64.
All hang-up occur in the As you said, bad condition is: Open MPI 2.1 or higher + Open MPI 2.0.x does not have this issue because it does not have I'll confirm performance difference of |
@hjelmn I run
Each value is a median value of 10 times runs. If you need more data, let me know. |
I have tried gcc/6.1.0 and gcc/7.1.0 and I still observe the same issue. |
Enabling debugging can cause the load-link store-conditional atomic operations to hit a live-lock condition. To prevent the live-lock always inline these atomics. Fixes open-mpi#3697 Signed-off-by: Nathan Hjelm <[email protected]>
@kawashima-fj #3988 should fix the hang. It's no surprise that the built-in atomics version is slower. The LL/SC lifo is significantly faster than the compare-and-swap version. |
This is the correct one. Think we have a fix. |
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and POWER platforms when the atomic fifo assembly isn't inlined, which manifests as a hang. Document the issue and the work-around until a proper fix is committed. Signed-off-by: Brian Barrett <[email protected]>
As documented in open-mpi#4563 and open-mpi#3697, there is an issue on ARM and POWER platforms when the atomic fifo assembly isn't inlined, which manifests as a hang. Document the issue and the work-around until a proper fix is committed. Signed-off-by: Brian Barrett <[email protected]> (cherry picked from commit 4658422)
Per 2018-03 Dallas face-to-face meeting, this is still happening to Fujitsu on ARMv8. It was discussed in the Dallas meeting; @hjelmn is looking into this. |
Can we have a bit more details on this ? What AMO is broken ? Does it happen only with built-in AMOs ? |
This commit fixes a hang that occurs with debug builds of Open MPI on aarch64 and power/powerpc systems. When the ll/sc atomics are inline functions the compiler emits load/store instructions for the function arguments with -O0. These extra load/store arguments can cause the ll reservation to be cancelled causing live-lock. Note that we did attempt to fix this with always_inline but the extra instructions are still emitted by the compiler (gcc). There may be another fix but this has been tested and is working well. References open-mpi#3697. Close when applied to v3.0.x and v3.1.x. Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a hang that occurs with debug builds of Open MPI on aarch64 and power/powerpc systems. When the ll/sc atomics are inline functions the compiler emits load/store instructions for the function arguments with -O0. These extra load/store arguments can cause the ll reservation to be cancelled causing live-lock. Note that we did attempt to fix this with always_inline but the extra instructions are still emitted by the compiler (gcc). There may be another fix but this has been tested and is working well. Back-port from master. References open-mpi#3697. Close when applied to v3.0.x and v3.1.x. Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit f8dbf62) Signed-off-by: Nathan Hjelm <[email protected]>
This commit fixes a hang that occurs with debug builds of Open MPI on aarch64 and power/powerpc systems. When the ll/sc atomics are inline functions the compiler emits load/store instructions for the function arguments with -O0. These extra load/store arguments can cause the ll reservation to be cancelled causing live-lock. Note that we did attempt to fix this with always_inline but the extra instructions are still emitted by the compiler (gcc). There may be another fix but this has been tested and is working well. Back-port from master. References open-mpi#3697. Close when applied to v3.0.x and v3.1.x. Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit f8dbf62) Signed-off-by: Nathan Hjelm <[email protected]> (cherry picked from commit b09f0b1) Signed-off-by: Nathan Hjelm <[email protected]>
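The commit message above says that an inline function's arguments get extra load/store instructions at -O0 even with always_inline. One general way to avoid passing operands as function arguments at all is a statement macro that expands in the caller. The sketch below contrasts the two forms. It is an assumption for illustration, not a statement of what the commit changed: the `sketch_*` and `SKETCH_*` names are hypothetical, and C11 atomics replace the real ll/sc assembly, so the snippet shows only the difference in form and cannot reproduce the hang.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Function form (hypothetical name). Per the commit message above, at -O0
 * gcc emits loads/stores for the function arguments even when inlined. */
static inline int
sketch_cas64_fn(_Atomic int64_t *addr, int64_t *expected, int64_t desired)
{
    return atomic_compare_exchange_strong(addr, expected, desired);
}

/* Macro form (hypothetical name). The body expands in the caller and
 * operates on the caller's own variables, so there are no function
 * arguments to set up. `expected` must be an lvalue; it is updated to the
 * current value when the exchange fails. `ok` receives 1 on success. */
#define SKETCH_CAS64(addr, expected, desired, ok)                     \
    do {                                                              \
        (ok) = atomic_compare_exchange_strong((addr), &(expected),    \
                                              (desired));             \
    } while (0)

/* Example retry loop built on the macro form: atomically add delta and
 * return the value seen before the addition. */
static inline int64_t
sketch_fetch_add(_Atomic int64_t *addr, int64_t delta)
{
    int64_t seen = atomic_load(addr);
    int ok;
    do {
        SKETCH_CAS64(addr, seen, seen + delta, ok);
    } while (!ok);
    return seen;
}
```

The trade-off with the macro form is the usual one: no type checking of the operands and care needed about evaluating them more than once.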
Background information
Details of the problem
A hang in ctxalloc test during MPI_Init.
Similar hang is observed in many other tests.
ctxalloc can be found here
stack trace: