Random freezes of InfiniBand #10432

Closed

@robertsawko

Description

Hello, I would appreciate some advice on the following issue.

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

Open MPI 4.1.1 and UCX 1.12.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI and UCX were built from source:

./contrib/configure-release \
    --prefix=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/ucx/1.12.1 \
    --enable-mt \
    --with-knem=${KNEM_DIR}
./configure \
  --prefix=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/openmpi/4.1.1-ucx \
  --enable-shared --disable-static \
  --enable-mpi-fortran=usempi \
  --disable-libompitrace \
  --enable-wrapper-rpath \
  --with-lsf=${LSF_LIBDIR%%linux*} \
  --with-lsf-libdir=${LSF_LIBDIR} \
  --with-knem=${KNEM_DIR} \
  --without-mxm \
  --with-ucx=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/ucx/1.12.1 \
  --without-verbs \
  --without-cuda \
  && make -j32
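
As a first sanity check on the build itself, it is worth confirming that this Open MPI installation actually picked up the intended UCX. The commands below are standard ompi_info/ucx_info queries, not cluster-specific:

ompi_info | grep -i ucx    # the pml/ucx component should be listed
ucx_info -v                # UCX version and the configure flags it was built with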

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.6 (Maipo)
  • Computer hardware: Skylake x86 cluster
  • Network type: Mellanox InfiniBand is present, and this is what I would like to use
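
For reference, the link state of the HCA on these nodes can be inspected with the standard InfiniBand diagnostics (output omitted here):

ibstat | grep -E 'State|Rate'              # expect "State: Active"
ibv_devinfo | grep -E 'state|link_layer'   # expect PORT_ACTIVE on an InfiniBand link layer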

Details of the problem

I am having issues at the MPI initialisation stage. As a sanity check, I started running the Intel MPI Benchmarks:

mpirun ./IMB-MPI Sendrecv

The code simply freezes when we reach the actual benchmark. Forcing TCP makes it work, which makes me think it's either a hardware problem or still some issue in my setup:

mpirun --mca btl tcp,self ./IMB-MPI Sendrecv
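
Note that with Open MPI 4.x the btl selection alone does not necessarily exclude pml/ucx; to be certain the TCP control run bypasses UCX entirely, the ob1 PML can be pinned as well (a minimal variant of the command above):

mpirun --mca pml ob1 --mca btl tcp,self ./IMB-MPI Sendrecv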

I've set OMPI_MCA_pml_ucx_verbose=100, following a similar problem I was having before, and here is the output for just two processes:

[sqg1cintr17.bullx:51367] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[sqg1cintr22.bullx:09180] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.12.1
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 posix/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 sysv/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 self/memory0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f1: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/lo: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 dc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 ud_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 ud_mlx5/mlx5_0:1: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 cma/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 knem/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:311 support level is transports only
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:289 mca_pml_ucx_init
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack remote worker address, size 249
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack local worker address, size 414
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:351 created ucp context 0x2249f90, worker 0x7fd174074010
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx_component.c:129 returning priority 19
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:367 mca_pml_ucx_cleanup
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.12.1
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 posix/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 sysv/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 self/memory0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f1: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/lo: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 dc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 ud_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 ud_mlx5/mlx5_0:1: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 cma/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 knem/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:311 support level is transports only
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:289 mca_pml_ucx_init
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack remote worker address, size 249
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack local worker address, size 414
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:351 created ucp context 0x2738040, worker 0x7fc570031010
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx_component.c:129 returning priority 19
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:367 mca_pml_ucx_cleanup
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
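
To isolate whether the stall is in UCX/the fabric rather than in Open MPI, the same point-to-point path can be exercised directly with ucx_perftest, bypassing MPI entirely (hostname taken from the log above):

ucx_perftest                          # on sqg1cintr17, waits for the peer
ucx_perftest sqg1cintr17 -t tag_lat   # on sqg1cintr22, tag-matching latency test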
