Running osu_acc_latency errors out with 3.1.2 #5946

Open · Akshay-Venkatesh opened this issue Oct 17, 2018 · 3 comments

@Akshay-Venkatesh
Contributor

Background information

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

3.1.2 tar available here: https://www.open-mpi.org/software/ompi/v3.1/

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From tarball configured with:

  Configure command line: '--enable-mpirun-prefix-by-default' '--enable-debug' '--enable-mem-debug' '--enable-mpi-fortran=no'

Please describe the system on which you are running

  • Operating system/version: CentOS Linux release 7.5.1804 (Core)
  • Computer hardware: haswell
  • Network type: Infiniband controller: Mellanox Technologies MT27600 [Connect-IB]
  • C compiler family name: GNU
  • C compiler version: 4.8.5

Details of the problem

I'm running the intra-node accumulate latency benchmark from OSU micro-benchmarks 5.3; it works fine with Open MPI 3.1.0 but fails with 3.1.2.
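
For context, the benchmark exercises roughly this pattern (a simplified sketch, not the actual OSU source; datatypes, sizes, and iteration counts here are illustrative):

#include <mpi.h>
#include <stdlib.h>

/* Simplified sketch of the osu_acc_latency flush pattern:
 * MPI_Win_allocate, then rank 0 takes a passive-target lock on rank 1
 * and repeatedly issues MPI_Accumulate followed by MPI_Win_flush. */
int main(int argc, char **argv)
{
    int rank, count = 1024;                 /* illustrative element count */
    int *win_buf, *src_buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Collective window allocation over MPI_COMM_WORLD */
    MPI_Win_allocate(count * sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);
    src_buf = calloc(count, sizeof(int));

    if (rank == 0) {
        MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
        for (int i = 0; i < 100; i++) {     /* illustrative iteration count */
            MPI_Accumulate(src_buf, count, MPI_INT, 1, 0, count, MPI_INT,
                           MPI_SUM, win);
            MPI_Win_flush(1, win);          /* complete the op at the target */
        }
        MPI_Win_unlock(1, win);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    free(src_buf);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}

The failing run with the default settings: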

[akvenkatesh@hsw225 build]$ mpirun -np 2 ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency
# OSU MPI_Accumulate latency Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
0                       0.23
[hsw225:63377:0:63377] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace ====
    0  /home/akvenkatesh/ucx/build/lib/libucs.so.0(+0x232e9) [0x2b7dbd4922e9]
    1  /home/akvenkatesh/ucx/build/lib/libucs.so.0(+0x2342f) [0x2b7dbd49242f]
    2  /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_btl_openib.so(mca_btl_openib_get+0x144) [0x2b7dbb251a05]
    3  /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_osc_rdma.so(ompi_osc_get_data_blocking+0x2a8) [0x2b7dc6b6ba11]
    4  /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_osc_rdma.so(+0xe851) [0x2b7dc6b73851]
    5  /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_osc_rdma.so(+0x10575) [0x2b7dc6b75575]
    6  /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_osc_rdma.so(ompi_osc_rdma_accumulate+0x14a) [0x2b7dc6b75f08]
    7  /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(PMPI_Accumulate+0x438) [0x2b7da9cb5fa4]
    8  ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency() [0x401cc0]
    9  ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency() [0x4018d8]
   10  /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2b7daa1c6445]
   11  ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency() [0x401479]
===================
[hsw225:63377:0:63377] Process frozen...

If I use --mca osc ucx explicitly, I see a hang:

[akvenkatesh@hsw225 build]$ mpirun -np 2 --mca osc ucx ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency
# OSU MPI_Accumulate latency Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)

These are the backtraces (rank 0 first, then rank 1):

(gdb) bt
#0  0x00002ab3c3587976 in ucs_callbackq_leave (cbq=0x1b47290) at ../../../src/ucs/datastruct/callbackq.c:64
#1  0x00002ab3c3588b16 in ucs_callbackq_slow_proxy (arg=0x1b47290) at ../../../src/ucs/datastruct/callbackq.c:391
#2  0x00002ab3c2d42b9c in ucs_callbackq_dispatch (cbq=0x1b47290)
    at /home/akvenkatesh/ucx/build/../src/ucs/datastruct/callbackq.h:208
#3  0x00002ab3c2d47d70 in uct_worker_progress (worker=0x1b47290) at /home/akvenkatesh/ucx/build/../src/uct/api/uct.h:1644
#4  ucp_worker_progress (worker=0x1b5ef80) at ../../../src/ucp/core/ucp_worker.c:1381
#5  0x00002ab3c2d4bd0e in ucp_rma_wait (worker=0x1b5ef80, user_req=0x312a8e0, op_name=0x2ab3c2dbde9a "atomic_fadd64")
    at ../../../src/ucp/rma/rma.inl:49
#6  0x00002ab3c2d4e885 in ucp_atomic_fetch_b (ep=0x2ab3e2c0f0e0, opcode=UCP_ATOMIC_FETCH_OP_FADD, value=1, result=0x7ffde77b99f8, 
    size=8, remote_addr=53173648, rkey=0x3121b40, op_name=0x2ab3c2dbde9a "atomic_fadd64") at ../../../src/ucp/rma/amo_basic.c:263
#7  0x00002ab3c2d4ec55 in ucp_atomic_fadd64_inner (result=0x7ffde77b99f8, rkey=0x3121b40, remote_addr=53173648, add=1, 
    ep=0x2ab3e2c0f0e0) at ../../../src/ucp/rma/amo_basic.c:292
#8  ucp_atomic_fadd64 (ep=0x2ab3e2c0f0e0, add=1, remote_addr=53173648, rkey=0x3121b40, result=0x7ffde77b99f8)
    at ../../../src/ucp/rma/amo_basic.c:288
#9  0x00002ab3cc634ece in start_shared (module=0x1ef6980, target=1) at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:28
#10 0x00002ab3cc63545f in ompi_osc_ucx_lock (lock_type=2, target=1, assert=0, win=0x1b38310)
    at ../../../../../ompi/mca/osc/ucx/osc_ucx_passive_target.c:145
#11 0x00002ab3afd6b639 in PMPI_Win_lock (lock_type=2, rank=1, assert=0, win=0x1b38310) at pwin_lock.c:66
#12 0x0000000000401bf6 in run_acc_with_flush (rank=0, type=WIN_ALLOCATE) at ../../../mpi/one-sided/osu_acc_latency.c:207
#13 0x00000000004018d8 in main (argc=1, argv=0x7ffde77b9c88) at ../../../mpi/one-sided/osu_acc_latency.c:128
(gdb) bt
#0  progress_callback () at ../../../../../ompi/mca/osc/ucx/osc_ucx_component.c:107
#1  0x00002b66e1b65562 in opal_progress () at ../../opal/runtime/opal_progress.c:228
#2  0x00002b66e0edc326 in ompi_request_wait_completion (req=0x32cdde0) at ../../ompi/request/request.h:413
#3  0x00002b66e0edc360 in ompi_request_default_wait (req_ptr=0x7fffa3250a80, status=0x7fffa3250a60)
    at ../../ompi/request/req_wait.c:42
#4  0x00002b66e0f82efd in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, source=0, rtag=-16, comm=0x607800 <ompi_mpi_comm_world>)
    at ../../../../ompi/mca/coll/base/coll_base_barrier.c:64
#5  0x00002b66e0f835b8 in ompi_coll_base_barrier_intra_two_procs (comm=0x607800 <ompi_mpi_comm_world>, module=0x2f1ad30)
    at ../../../../ompi/mca/coll/base/coll_base_barrier.c:300
#6  0x00002b66fd4f7a3d in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x607800 <ompi_mpi_comm_world>, module=0x2f1ad30)
    at ../../../../../ompi/mca/coll/tuned/coll_tuned_decision_fixed.c:196
#7  0x00002b66e0efdf5e in PMPI_Barrier (comm=0x607800 <ompi_mpi_comm_world>) at pbarrier.c:63
#8  0x0000000000401e2d in run_acc_with_flush (rank=1, type=WIN_ALLOCATE) at ../../../mpi/one-sided/osu_acc_latency.c:219
#9  0x00000000004018d8 in main (argc=1, argv=0x7fffa3250cc8) at ../../../mpi/one-sided/osu_acc_latency.c:128

If I disable openib, the intra-node case works and I get this:

[akvenkatesh@hsw225 build]$ mpirun -np 2 --mca btl ^openib ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency
# OSU MPI_Accumulate latency Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
0                       0.11
1                       0.11
2                       0.11
4                       0.12
8                       0.12
16                      0.14
32                      0.17
64                      0.25
128                     0.39
256                     0.65
512                     1.18
1024                    2.19
2048                    4.29
4096                    8.36
8192                   16.63
16384                  34.97
32768                  68.19
65536                 133.99
131072                266.68
262144                533.38
524288               1057.40
1048576              2119.27
2097152              4233.57
4194304              8448.53

The same doesn't work for the inter-node case; it fails in a non-blocking allreduce stemming from MPI_Win_allocate:

[akvenkatesh@hsw225 build]$ mpirun -np 2 --hostfile $PWD/hostfile ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency
[hsw224:37186] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:296 Error: ucp_ep_create(proc=0) failed: Destination is unreachable
[hsw224:37186] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:362 Error: Failed to resolve UCX endpoint for rank 0
Error in MPI_Isend(22728420, 1, 0x2b4d32c1dfe0, 0, -26, 6322176) (-1)
osu_acc_latency: ../../../../../ompi/mca/coll/libnbc/nbc_iallreduce.c:185: ompi_coll_libnbc_iallreduce: Assertion `((0xdeafbeedULL << 32) + 0xdeafbeedULL) == ((opal_object_t *) (schedule))->obj_magic_id' failed.
[hsw224:37186] *** Process received signal ***
[hsw224:37186] Signal: Aborted (6)
[hsw224:37186] Signal code:  (-6)
[hsw224:37186] [ 0] /lib64/libpthread.so.0(+0xf6d0)[0x2b4d32c4d6d0]
[hsw224:37186] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x2b4d32e90277]
[hsw224:37186] [ 2] /lib64/libc.so.6(abort+0x148)[0x2b4d32e91968]
[hsw224:37186] [ 3] /lib64/libc.so.6(+0x2f096)[0x2b4d32e89096]
[hsw224:37186] [ 4] /lib64/libc.so.6(+0x2f142)[0x2b4d32e89142]
[hsw224:37186] [ 5] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_coll_libnbc.so(ompi_coll_libnbc_iallreduce+0x610)[0x2b4d460dc82d]
[hsw224:37186] [ 6] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(+0x34546)[0x2b4d328ca546]
[hsw224:37186] [ 7] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(+0x338e0)[0x2b4d328c98e0]
[hsw224:37186] [ 8] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(+0x37945)[0x2b4d328cd945]
[hsw224:37186] [ 9] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libopen-pal.so.40(opal_progress+0x30)[0x2b4d33578562]
[hsw224:37186] [10] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(+0x32f0b)[0x2b4d328c8f0b]
[hsw224:37186] [11] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(ompi_comm_nextcid+0x6c)[0x2b4d328c970b]
[hsw224:37186] [12] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(ompi_comm_dup_with_info+0x10b)[0x2b4d328c622c]
[hsw224:37186] [13] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(ompi_comm_dup+0x25)[0x2b4d328c611f]
[hsw224:37186] [14] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/openmpi/mca_osc_rdma.so(+0x154e8)[0x2b4d477a64e8]
[hsw224:37186] [15] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(ompi_osc_base_select+0x155)[0x2b4d329abaa5]
[hsw224:37186] [16] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(ompi_win_allocate+0x7b)[0x2b4d328f71d3]
[hsw224:37186] [17] /home/akvenkatesh/ompi-non-git/openmpi-3.1.2/build-vanilla/lib/libmpi.so.40(MPI_Win_allocate+0x256)[0x2b4d3296cfa4]
[hsw224:37186] [18] ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency[0x40448c]
[hsw224:37186] [19] ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency[0x401ba3]
[hsw224:37186] [20] ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency[0x4018d8]
[hsw224:37186] [21] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2b4d32e7c445]
[hsw224:37186] [22] ./libexec/osu-micro-benchmarks/mpi/one-sided/osu_acc_latency[0x401479]
[hsw224:37186] *** End of error message ***
[1539818461.353517] [hsw224:37186:0]         select.c:312  UCX  ERROR no active messages transport to <no debug data>: Unsupported operation
# OSU MPI_Accumulate latency Test v5.3
# Window creation: MPI_Win_allocate
# Synchronization: MPI_Win_flush
# Size          Latency (us)
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 37186 on node hsw224 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

Changing the window creation to MPI_Win_create or MPI_Win_create_dynamic doesn't change the problem, because all of these paths eventually go through ompi_comm_dup.
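
Roughly, the three window-creation variants boil down to the following (a minimal sketch with illustrative sizes, not the benchmark's actual code):

#include <mpi.h>
#include <stdlib.h>

/* The three window-creation paths; as noted above, each of these
 * eventually reaches ompi_comm_dup inside osc/rdma. */
void create_windows(MPI_Comm comm)
{
    MPI_Aint size = 4096;               /* illustrative window size */
    void *alloc_base, *user_base;
    MPI_Win win_alloc, win_create, win_dynamic;

    /* 1. MPI_Win_allocate: the library allocates the window memory. */
    MPI_Win_allocate(size, 1, MPI_INFO_NULL, comm, &alloc_base, &win_alloc);

    /* 2. MPI_Win_create: the user supplies the window memory. */
    user_base = malloc(size);
    MPI_Win_create(user_base, size, 1, MPI_INFO_NULL, comm, &win_create);

    /* 3. MPI_Win_create_dynamic: memory is attached after creation. */
    MPI_Win_create_dynamic(MPI_INFO_NULL, comm, &win_dynamic);
    MPI_Win_attach(win_dynamic, user_base, size);

    MPI_Win_detach(win_dynamic, user_base);
    MPI_Win_free(&win_dynamic);
    MPI_Win_free(&win_create);
    MPI_Win_free(&win_alloc);
    free(user_base);
}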

What am I missing?

@jsquyres
Member

@hjelmn @thananon Is this fixed by the UCT fixes you guys put in recently?

@thananon
Member

No, I'm pretty sure this is not btl/uct related.

@hjelmn
Member

hjelmn commented Oct 19, 2018

Yeah. Totally unrelated. btl/uct is a master/4.0.x component.
