v4.1.x: topo/treematch failure with distgraph_test_4 #8991

Closed
karasevb opened this issue May 21, 2021 · 3 comments

karasevb commented May 21, 2021

Background information

When debugging a hang in an application, I found that it comes from MPI_Dist_graph_create(). A similar issue is reproduced by distgraph_test_4 from the IBM tests. Adding --mca topo ^treematch avoids the issue.
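
For context, the call pattern visible in the backtraces below -- one iteration where a single rank supplies the whole edge list while every other rank passes n=0 -- can be sketched roughly as follows. This is a hedged, minimal illustration, not the actual distgraph_test_4 source; the ring-shaped graph and the file name are invented for illustration only.

/* minimal_distgraph.c -- hedged sketch of the MPI_Dist_graph_create()
 * pattern seen in the backtraces; not the actual IBM test.
 * Build (assumed): mpicc minimal_distgraph.c -o minimal_distgraph */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    MPI_Comm newcomm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* The root describes the whole graph (here: a simple ring). */
        int *sources      = malloc(size * sizeof(int));
        int *degrees      = malloc(size * sizeof(int));
        int *destinations = malloc(size * sizeof(int));
        int *weights      = malloc(size * sizeof(int));
        for (i = 0; i < size; i++) {
            sources[i]      = i;
            degrees[i]      = 1;
            destinations[i] = (i + 1) % size;
            weights[i]      = 1;
        }
        MPI_Dist_graph_create(MPI_COMM_WORLD, size, sources, degrees,
                              destinations, weights, MPI_INFO_NULL,
                              1 /* reorder */, &newcomm);
        free(sources); free(degrees); free(destinations); free(weights);
    } else {
        /* Every other rank contributes no edges, matching the n=0,
         * sources=0x0 arguments in the first backtrace; weights=0x3 there
         * is presumably Open MPI's MPI_WEIGHTS_EMPTY sentinel. */
        MPI_Dist_graph_create(MPI_COMM_WORLD, 0, NULL, NULL, NULL,
                              MPI_WEIGHTS_EMPTY, MPI_INFO_NULL,
                              1 /* reorder */, &newcomm);
    }

    MPI_Comm_free(&newcomm);
    MPI_Finalize();
    return 0;
}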

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.1.x

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone ...
Configured with: --enable-debug --with-ucx --disable-man-pages --enable-mpirun-prefix-by-default

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.6 (Maipo)
  • Computer hardware: Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz
  • Network type: Mellanox CX-5

Details of the problem

  • Running on full ppn with TCP:
$mpirun --map-by node --bind-to core -n 56 --mca pml ob1 --mca btl self,tcp --mca osc ^ucx ./distgraph_test_4

Fails with:

(gdb) bt
#0  0x00007f264a06859c in free () from /lib64/libc.so.6
#1  0x00007f263671c75a in free_list_child (tree=0x28d4b40) at treematch/tm_tree.c:118
#2  0x00007f263671c73a in free_list_child (tree=0x284e5e0) at treematch/tm_tree.c:116
#3  0x00007f263671c73a in free_list_child (tree=0x284e690) at treematch/tm_tree.c:116
#4  0x00007f263671c7d8 in free_non_constraint_tree (tree=0x284e690) at treematch/tm_tree.c:136
#5  0x00007f263671c887 in tm_free_tree (tree=0x284e690) at treematch/tm_tree.c:158
#6  0x00007f2636714957 in mca_topo_treematch_dist_graph_create (topo_module=0x108eab0, comm_old=0x607620 <ompi_mpi_comm_world>, n=0, nodes=0x0, degrees=0x0, targets=0x0, weights=0x3, info=0x606120 <ompi_mpi_info_null>, reorder=1, newcomm=0x7ffcc9cce4a8)
    at topo_treematch_dist_graph_create.c:686
#7  0x00007f264a658e5a in PMPI_Dist_graph_create (comm_old=0x607620 <ompi_mpi_comm_world>, n=0, sources=0x0, degrees=0x0, destinations=0x0, weights=0x3, info=0x606120 <ompi_mpi_info_null>, reorder=1, newcomm=0x7ffcc9cce4a8) at pdist_graph_create.c:91
#8  0x0000000000402916 in wrap_dist_graph_create (cnt=1, rank=0, n=0, sources=0x0, degrees=0x0, destinations=0x0, weights=0x3, newcomm=0x7ffcc9cce4a8) at distgraph_test_4.c:349
#9  0x0000000000403d55 in create_graph (cnt=1, root=55, newcomm=0x7ffcc9cce4a8) at distgraph_test_4.c:880
#10 0x0000000000404cb2 in main (argc=1, argv=0x7ffcc9cce5a8) at distgraph_test_4.c:1150
  • Running on full ppn with UCX:
    $mpirun --map-by node --bind-to core -n 50 --mca pml ucx --mca osc ^ucx ./distgraph_test_4
    This case can lead to a sporadic hang, with all processes stuck at the following backtrace:
Thread 1 (Thread 0x7f015b715740 (LWP 23042)):
#0  0x00007f0147cb4457 in ucp_worker_progress (worker=0xff91f0) at core/ucp_worker.c:2611
#1  0x00007f014d345c49 in mca_pml_ucx_progress () at pml_ucx.c:535
#2  0x00007f015a50e980 in opal_progress () at runtime/opal_progress.c:231
#3  0x00007f015b16e816 in ompi_request_wait_completion (req=0x1aee068) at ../ompi/request/request.h:415
#4  0x00007f015b16f06b in ompi_comm_nextcid (newcomm=0x1bb4570, comm=0x607620 <ompi_mpi_comm_world>, bridgecomm=0x0, arg0=0x0, arg1=0x0, send_first=false, mode=32) at communicator/comm_cid.c:293
#5  0x00007f015b169e67 in ompi_comm_create (comm=0x607620 <ompi_mpi_comm_world>, group=0x1189d40, newcomm=0x7fffdb8b5648) at communicator/comm.c:361
#6  0x00007f012629146f in mca_topo_treematch_dist_graph_create (topo_module=0x1370e60, comm_old=0x607620 <ompi_mpi_comm_world>, n=56, nodes=0x16b8ae0, degrees=0x1852de0, targets=0x1c1b000, weights=0x184a8f0, info=0x606120 <ompi_mpi_info_null>, reorder=0, newcomm=0x7fffdb8b5648) at topo_treematch_dist_graph_create.c:116
#7  0x00007f015b1c1e5a in PMPI_Dist_graph_create (comm_old=0x607620 <ompi_mpi_comm_world>, n=56, sources=0x16b8ae0, degrees=0x1852de0, destinations=0x1c1b000, weights=0x184a8f0, info=0x606120 <ompi_mpi_info_null>, reorder=0, newcomm=0x7fffdb8b5648) at pdist_graph_create.c:91
#8  0x0000000000402916 in wrap_dist_graph_create (cnt=9, rank=36, n=56, sources=0x16b8ae0, degrees=0x1852de0, destinations=0x1c1b000, weights=0x184a8f0, newcomm=0x7fffdb8b5648) at distgraph_test_4.c:349
#9  0x0000000000403d55 in create_graph (cnt=9, root=55, newcomm=0x7fffdb8b5648) at distgraph_test_4.c:880
#10 0x0000000000404cb2 in main (argc=1, argv=0x7fffdb8b5748) at distgraph_test_4.c:1150

or the same crash as above.
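
For reference, the workaround already mentioned in the background can be applied directly to the failing TCP invocation (this only disables the treematch module; it is not a fix):

$mpirun --map-by node --bind-to core -n 56 --mca pml ob1 --mca btl self,tcp --mca osc ^ucx --mca topo ^treematch ./distgraph_test_4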

karasevb changed the title from "topo/treematch failure with distgraph_test_4" to "v4.1.x: topo/treematch failure with distgraph_test_4" on May 23, 2021

jjhursey commented May 11, 2022

I'm seeing the same error with topology/distgraph1 and a hostfile with 3 nodes.

mpirun --npernode 2  -mca pml ob1 -mca osc ucx,sm -mca btl self,tcp,vader topology/distgraph1

Configured with:

./configure --prefix=$MPI_ROOT --without-hcoll --enable-debug --enable-mpirun-prefix-by-default \
    --disable-dlopen --enable-io-romio --disable-io-ompio --enable-mpi1-compatibility --with-ucx=/opt/ucx

From our MTT runs:

  • main and v5.0.x pass
  • v4.1.x fails with the issue seen above.

@gpaulsen @hppritcha is this worth trying to fix for v4.1.x or do we need a restriction?

jjhursey added a commit to jjhursey/ompi that referenced this issue May 11, 2022
jjhursey self-assigned this May 11, 2022

jjhursey commented

PR #10367 fixes this for me.

jjhursey commented Jun 7, 2022

Now that #10367 is merged, I think this can be closed. Please re-open if you still see the issue.

jjhursey closed this as completed Jun 7, 2022