
Random freezes of Infiniband #10432

Closed
robertsawko opened this issue May 27, 2022 · 8 comments

@robertsawko

Hello, I would appreciate some advice on the following issue.

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OpenMPI 4.1.1 and UCX 1.12.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

Open MPI and UCX were installed from source:

# UCX 1.12.1 build
./contrib/configure-release \
    --prefix=/lustre/scafellpike/local/apps/hierarchy//compiler/gcc/6.5/ucx/1.12.1 \
    --enable-mt \
    --with-knem=${KNEM_DIR}

# Open MPI 4.1.1 build
./configure \
  --prefix=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/openmpi/4.1.1-ucx \
  --enable-shared --disable-static \
  --enable-mpi-fortran=usempi \
  --disable-libompitrace \
  --enable-wrapper-rpath \
  --with-lsf=${LSF_LIBDIR%%linux*} \
  --with-lsf-libdir=${LSF_LIBDIR} \
  --with-knem=${KNEM_DIR} \
  --without-mxm \
  --with-ucx=/lustre/scafellpike/local/apps/hierarchy/compiler/gcc/6.5/ucx/1.12.1 \
  --without-verbs \
  --without-cuda \
  && make -j32

Please describe the system on which you are running

  • Operating system/version: Red Hat Enterprise Linux Server release 7.6 (Maipo)
  • Computer hardware: Skylake x86 cluster
  • Network type: Mellanox Infiniband is present, and this is what I would like to use

Details of the problem

I am having issues at the MPI initialisation stage. As a sanity check, I started running the Intel MPI Benchmarks:

mpirun ./IMB-MPI Sendrecv

The code simply freezes when we reach the actual benchmark. Forcing TCP makes it work, which makes me think it's either a hardware problem or still some issue in my setup:

mpirun --mca btl tcp,self ./IMB-MPI Sendrecv

I've used OMPI_MCA_pml_ucx_verbose=100, following a similar problem I was having before, and here is the output for just two processes:

[sqg1cintr17.bullx:51367] MCW rank 0 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[sqg1cintr22.bullx:09180] MCW rank 1 bound to socket 0[core 0[hwt 0-1]]: [BB/../../../../../../../../../../../../../../..][../../../../../../../../../../../../../../../..]
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.12.1
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 posix/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 sysv/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 self/memory0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f1: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/lo: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 dc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 ud_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 ud_mlx5/mlx5_0:1: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 cma/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 knem/memory: did not match transport list
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:311 support level is transports only
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:289 mca_pml_ucx_init
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack remote worker address, size 249
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack local worker address, size 414
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:351 created ucp context 0x2249f90, worker 0x7fd174074010
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx_component.c:129 returning priority 19
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:367 mca_pml_ucx_cleanup
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:197 mca_pml_ucx_open: UCX version 1.12.1
[sqg1cintr17.bullx:51376] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 posix/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 sysv/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 self/memory0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f1: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ens1f0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/lo: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 tcp/ib0: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 rc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 dc_mlx5/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:299 ud_verbs/mlx5_0:1: matched transport list but not device list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 ud_mlx5/mlx5_0:1: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 cma/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:304 knem/memory: did not match transport list
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/opal/mca/common/ucx/common_ucx.c:311 support level is transports only
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:289 mca_pml_ucx_init
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack remote worker address, size 249
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:114 Pack local worker address, size 414
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:351 created ucp context 0x2738040, worker 0x7fc570031010
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx_component.c:129 returning priority 19
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:367 mca_pml_ucx_cleanup
[sqg1cintr22.bullx:09187] /lustre/scafellpike/local/package_build/build/rrs59-build/software/parallel/openmpi/openmpi-4.1.1/ompi/mca/pml/ucx/pml_ucx.c:268 mca_pml_ucx_close
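
For completeness, the verbose run above was launched along these lines. This is only a sketch: the exact mpirun options are not shown in the thread, and -np 2 plus the node mapping are assumptions based on the two ranks in the log.

# Sketch only: -np 2 and --map-by node are assumptions matching the two
# ranks (sqg1cintr17 and sqg1cintr22) seen in the verbose output above.
export OMPI_MCA_pml_ucx_verbose=100
mpirun -np 2 --map-by node ./IMB-MPI1 Sendrecv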

@jsquyres jsquyres added this to the v4.1.5 milestone May 27, 2022
@jsquyres
Member

@open-mpi/ucx FYI

@janjust
Contributor

janjust commented May 27, 2022

@robertsawko Which IB device is present on your system?

@robertsawko
Author

robertsawko commented May 27, 2022 via email

@janjust
Contributor

janjust commented May 27, 2022

Thanks. What happens if you specify -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1?

What do you mean by random freezes? Does this mean it happens sporadically, or does it simply not go past the first send/recv message? If the above runtime parameters don't help, what is the backtrace on both ranks when it freezes?
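
One way to collect such a backtrace is to attach gdb to the hung process on each node. This is only a sketch; the pgrep pattern assumes the IMB-MPI1 binary name, so substitute the actual binary or PID.

# On each node with a hung rank, find the benchmark PID and dump all thread stacks.
# The pgrep pattern is an assumption; replace it with the real binary name or PID.
pid=$(pgrep -f IMB-MPI1 | head -n 1)
gdb -p "$pid" -batch -ex "thread apply all bt"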

@bosilca
Member

bosilca commented May 28, 2022

@janjust is right: according to your logs, the UCX PML disqualified itself because the list of transports was empty.

@yosefe
Contributor

yosefe commented May 28, 2022

@robertsawko what is the output of ls -l /sys/class/infiniband/mlx5_0/device/driver?
Also, can you please try the latest v4.1.x branch? Perhaps f38878e fixes the issue.

@robertsawko
Author

robertsawko commented May 29, 2022

Hi! Thanks again to everyone for their commitment, and for responding over the weekend too.

@yosefe

ls -l /sys/class/infiniband/mlx5_0/device/driver
lrwxrwxrwx. 1 root root 0 May  8 22:08 /sys/class/infiniband/mlx5_0/device/driver -> ../../../../bus/pci/drivers/mlx5_core

Also, I am using the 4.1.1 stable release, but I am happy to recompile with the commit you specified.

@janjust, you are right: when I specify

mpirun -mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./IMB-MPI1 Sendrecv

the benchmark runs like a sprint runner on the last 10 m of the final day of an Olympic competition, with a fighting chance of breaking a world record... Sorry. So is that something I need to specify? Maybe include it in the Lmod file? Why is that list empty?
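
If these do turn out to be required, one interim workaround would be to export them in the environment the module sets up, so every mpirun picks them up without extra flags. This is only a sketch, mirroring the working command line above; whether it is needed at all is exactly the open question.

# Hypothetical workaround: set the PML and UCX device selection globally,
# e.g. from a wrapper script or the module's environment setup,
# instead of passing per-run flags.
export OMPI_MCA_pml=ucx
export UCX_NET_DEVICES=mlx5_0:1
mpirun ./IMB-MPI1 Sendrecv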

@robertsawko
Author

@yosefe, I can confirm that the problem is indeed fixed with 4.1.x: I no longer need to specify the variable, and Sendrecv produces the numbers I expect of our Infiniband. Many thanks for pointing this out.
