Background information

When debugging a hang in an application, I found that it originates in MPI_Dist_graph_create(). The same issue is reproduced by distgraph_test_4 from the IBM test suite; adding --mca topo ^treematch avoids it.

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.x
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a git clone (git clone ...), configured with:
--enable-debug --with-ucx --disable-man-pages --enable-mpirun-prefix-by-default

Please describe the system on which you are running

Details of the problem

Running with the ob1 PML over TCP:
$ mpirun --map-by node --bind-to core -n 56 --mca pml ob1 --mca btl self,tcp --mca osc ^ucx ./distgraph_test_4

Fails with:
(gdb) bt
#0  0x00007f264a06859c in free () from /lib64/libc.so.6
#1  0x00007f263671c75a in free_list_child (tree=0x28d4b40) at treematch/tm_tree.c:118
#2  0x00007f263671c73a in free_list_child (tree=0x284e5e0) at treematch/tm_tree.c:116
#3  0x00007f263671c73a in free_list_child (tree=0x284e690) at treematch/tm_tree.c:116
#4  0x00007f263671c7d8 in free_non_constraint_tree (tree=0x284e690) at treematch/tm_tree.c:136
#5  0x00007f263671c887 in tm_free_tree (tree=0x284e690) at treematch/tm_tree.c:158
#6  0x00007f2636714957 in mca_topo_treematch_dist_graph_create (topo_module=0x108eab0, comm_old=0x607620 <ompi_mpi_comm_world>, n=0, nodes=0x0, degrees=0x0, targets=0x0, weights=0x3, info=0x606120 <ompi_mpi_info_null>, reorder=1, newcomm=0x7ffcc9cce4a8) at topo_treematch_dist_graph_create.c:686
#7  0x00007f264a658e5a in PMPI_Dist_graph_create (comm_old=0x607620 <ompi_mpi_comm_world>, n=0, sources=0x0, degrees=0x0, destinations=0x0, weights=0x3, info=0x606120 <ompi_mpi_info_null>, reorder=1, newcomm=0x7ffcc9cce4a8) at pdist_graph_create.c:91
#8  0x0000000000402916 in wrap_dist_graph_create (cnt=1, rank=0, n=0, sources=0x0, degrees=0x0, destinations=0x0, weights=0x3, newcomm=0x7ffcc9cce4a8) at distgraph_test_4.c:349
#9  0x0000000000403d55 in create_graph (cnt=1, root=55, newcomm=0x7ffcc9cce4a8) at distgraph_test_4.c:880
#10 0x0000000000404cb2 in main (argc=1, argv=0x7ffcc9cce5a8) at distgraph_test_4.c:1150
Running at full PPN (processes per node) with UCX:
$ mpirun --map-by node --bind-to core -n 50 --mca pml ucx --mca osc ^ucx ./distgraph_test_4
This case can lead to a sporadic hang, with all processes sitting in the following backtrace:
Thread 1 (Thread 0x7f015b715740 (LWP 23042)):
#0  0x00007f0147cb4457 in ucp_worker_progress (worker=0xff91f0) at core/ucp_worker.c:2611
#1  0x00007f014d345c49 in mca_pml_ucx_progress () at pml_ucx.c:535
#2  0x00007f015a50e980 in opal_progress () at runtime/opal_progress.c:231
#3  0x00007f015b16e816 in ompi_request_wait_completion (req=0x1aee068) at ../ompi/request/request.h:415
#4  0x00007f015b16f06b in ompi_comm_nextcid (newcomm=0x1bb4570, comm=0x607620 <ompi_mpi_comm_world>, bridgecomm=0x0, arg0=0x0, arg1=0x0, send_first=false, mode=32) at communicator/comm_cid.c:293
#5  0x00007f015b169e67 in ompi_comm_create (comm=0x607620 <ompi_mpi_comm_world>, group=0x1189d40, newcomm=0x7fffdb8b5648) at communicator/comm.c:361
#6  0x00007f012629146f in mca_topo_treematch_dist_graph_create (topo_module=0x1370e60, comm_old=0x607620 <ompi_mpi_comm_world>, n=56, nodes=0x16b8ae0, degrees=0x1852de0, targets=0x1c1b000, weights=0x184a8f0, info=0x606120 <ompi_mpi_info_null>, reorder=0, newcomm=0x7fffdb8b5648) at topo_treematch_dist_graph_create.c:116
#7  0x00007f015b1c1e5a in PMPI_Dist_graph_create (comm_old=0x607620 <ompi_mpi_comm_world>, n=56, sources=0x16b8ae0, degrees=0x1852de0, destinations=0x1c1b000, weights=0x184a8f0, info=0x606120 <ompi_mpi_info_null>, reorder=0, newcomm=0x7fffdb8b5648) at pdist_graph_create.c:91
#8  0x0000000000402916 in wrap_dist_graph_create (cnt=9, rank=36, n=56, sources=0x16b8ae0, degrees=0x1852de0, destinations=0x1c1b000, weights=0x184a8f0, newcomm=0x7fffdb8b5648) at distgraph_test_4.c:349
#9  0x0000000000403d55 in create_graph (cnt=9, root=55, newcomm=0x7fffdb8b5648) at distgraph_test_4.c:880
#10 0x0000000000404cb2 in main (argc=1, argv=0x7fffdb8b5748) at distgraph_test_4.c:1150
Or the same crash as above occurs.
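For reference, the workaround mentioned in the background section (excluding the treematch topo component so the default dist_graph implementation is used) can be applied in several equivalent ways. This is a sketch assuming a standard Open MPI installation; the mca-params.conf path is the conventional per-user MCA parameter file:

```
# Per invocation, on the command line:
#   mpirun --mca topo ^treematch ... ./distgraph_test_4
# Persistently, via the environment:
#   export OMPI_MCA_topo='^treematch'
# Or in $HOME/.openmpi/mca-params.conf:
topo = ^treematch
```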