Warning: There was an error initializing an OpenFabrics device. #6517


Closed · ca-taylor opened this issue Mar 25, 2019 · 22 comments

@ca-taylor


Background information

Open MPI 4.0.0 prints the warning shown below, claiming there was an error initializing an OpenFabrics device.

What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)

4.0.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

openmpi-4.0.0.tar.gz (GA release)

Please describe the system on which you are running

  • Operating system/version: RHEL 7.6
  • Computer hardware: Intel Haswell-based Dell SOS6320s (E5-2698 v3 @ 2.30GHz)
  • Network type: InfiniBand (Mellanox Connectx-3 FDR, mlx4)

ucx-1.4.0-1.el7.x86_64
ucx-devel-1.4.0-1.el7.x86_64


Details of the problem

I'm encountering the following message despite having built with UCX library support, which works fine with Open MPI 3.1.2:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              c36a-s39
  Local adapter:           mlx4_0
  Local port:              1

--------------------------------------------------------------------------
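For context, a minimal sketch of the runtime knobs the help text refers to (hedged; `./app` is a placeholder and our jobs actually launch through srun rather than mpirun):

# Option 1: skip the deprecated openib BTL entirely and let the UCX PML
# handle the InfiniBand traffic.
mpirun --mca pml ucx --mca btl ^openib -np 4 ./app

# Option 2: allow the openib BTL to use IB ports, as the help message suggests.
mpirun --mca btl_openib_allow_ib true -np 4 ./app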

@ca-taylor (Author)

From my config.log...

$ ./configure --build=x86_64-redhat-linux-gnu --host=x86_64-redhat-linux-gnu --program-prefix= --disable-dependency-tracking --prefix=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0 --exec-prefix=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0 --bindir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/bin --sbindir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/sbin --sysconfdir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/etc --datadir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/share --includedir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/include --libdir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/lib64 --libexecdir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/libexec --localstatedir=/var --sharedstatedir=/var/lib --mandir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/share/man --infodir=/apps/mpi/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/share/info C=icc CXX=icpc FC=ifort FFLAGS=-O2 -g -warn -m64 LDFLAGS= --enable-static --enable-orterun-prefix-by-default --with-slurm=/opt/slurm --with-pmix=/opt/pmix/2.1.1 --with-pmi=/opt/slurm --with-libevent=external --with-hwloc=external --with-verbs --with-libfabric --with-ucx --with-mxm=no --with-cuda=/apps/compilers/cuda/10.0.130 --enable-openib-udcm --enable-openib-rdmacm

@ca-taylor (Author)

configure:5901: --- MCA component pml:ucx (m4 configuration macro)
configure:331882: checking for MCA component pml:ucx compile mode
configure:331888: result: static
configure:335683: checking if MCA component pml:ucx can compile
configure:335685: result: yes

@ca-taylor (Author)

From the verbose output, the warning may be spurious, or the job may just be falling back to Ethernet. I can't tell.

WARNING: There was an error initializing an OpenFabrics device.
Local host: c36a-s39
Local device: mlx4_0

[c36a-s39.ufhpc:245514] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:245514] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:245513] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:245515] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:245512] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:245515] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:245513] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:245512] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:245514] pml_ucx.c:255 created ucp context 0xf13dd0, worker 0x2b9af7829010
[c36a-s39.ufhpc:245513] pml_ucx.c:255 created ucp context 0x130ddd0, worker 0x2acbb7e42010
[c36a-s39.ufhpc:245515] pml_ucx.c:255 created ucp context 0xd5bdd0, worker 0x2b7e5ffa1010
[c36a-s39.ufhpc:245512] pml_ucx.c:255 created ucp context 0x25e0ea0, worker 0x2af3dfe24010
[c36a-s39.ufhpc:245512] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:245513] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:245514] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:245512] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:245513] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245514] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:245512] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245513] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:245514] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:245513] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:245512] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:245513] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:245514] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:245512] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:245514] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 3

@yosefe (Contributor) commented Mar 25, 2019

@ca-taylor is there any other issue besides this error message? E.g., is the test running successfully?

@ca-taylor (Author)

Then a bit later...

[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245515] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:245517] pml_ucx.c:313 connecting to proc. 3
[c36a-s39:245513:0:245513] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
==== backtrace ====
0 /lib64/libucs.so.0(+0x174b0) [0x2acbb246d4b0]
1 /lib64/libucs.so.0(+0x17662) [0x2acbb246d662]
2 /apps/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/relion/3.0.2/lib/librelion_lib.so(_ZN13BackProjector22setLowResDataAndWeightER13MultidimArrayI8tComplexIdEERS0_IdEi+0xff3) [0x2acba2269ba3]
3 /apps/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/relion/3.0.2/lib/librelion_lib.so(_ZN14MlOptimiserMpi28joinTwoHalvesAtLowResolutionEv+0x85f) [0x2acba249faaf]
4 /apps/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/relion/3.0.2/lib/librelion_lib.so(_ZN14MlOptimiserMpi7iterateEv+0x343) [0x2acba2491be3]
5 /apps/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/relion/3.0.2/bin/relion_refine_mpi(main+0x20a) [0x40575a]
6 /usr/lib64/libc.so.6(__libc_start_main+0xf5) [0x2acbaf1013d5]
7 /apps/cuda/10.0.130/intel/2018.1.163/openmpi/4.0.0/relion/3.0.2/bin/relion_refine_mpi() [0x405429]

srun: error: c36a-s39: task 1: Segmentation fault (core dumped)

@ca-taylor (Author)

Comment from yosefe:

can you pls try adding "--mca opal_common_ucx_opal_mem_hooks 1"?

@ca-taylor (Author)

With

export OMPI_MCA_opal_common_ucx_opal_mem_hooks=1
export OMPI_MCA_pml_ucx_verbose=100
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              c36a-s39
  Local adapter:           mlx4_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   c36a-s39
  Local device: mlx4_0
--------------------------------------------------------------------------
[c36a-s39.ufhpc:252551] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252551] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252552] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252552] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252546] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252546] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252548] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252548] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252550] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252550] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252547] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252547] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252551] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252552] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252545] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252545] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252553] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252553] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252549] common_ucx.c:108 using OPAL memory hooks as external events
[c36a-s39.ufhpc:252549] pml_ucx.c:134 mca_pml_ucx_open
[c36a-s39.ufhpc:252552] pml_ucx.c:255 created ucp context 0x2854a60, worker 0x2b9525d23010
[c36a-s39.ufhpc:252551] pml_ucx.c:255 created ucp context 0x2734a50, worker 0x2aecd7f0e010
[c36a-s39.ufhpc:252548] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252546] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252550] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252547] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252548] pml_ucx.c:255 created ucp context 0xda9a60, worker 0x2b1487fa6010
[c36a-s39.ufhpc:252546] pml_ucx.c:255 created ucp context 0x10b3a60, worker 0x2b7023e3b010
[c36a-s39.ufhpc:252549] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252550] pml_ucx.c:255 created ucp context 0x1c41a70, worker 0x2b2d3ff9e010
[c36a-s39.ufhpc:252545] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252547] pml_ucx.c:255 created ucp context 0x10c6a70, worker 0x2ab321665010
[c36a-s39.ufhpc:252549] pml_ucx.c:255 created ucp context 0x1c77a60, worker 0x2b8d0fec0010
[c36a-s39.ufhpc:252553] pml_ucx.c:198 mca_pml_ucx_init
[c36a-s39.ufhpc:252545] pml_ucx.c:255 created ucp context 0x15aab50, worker 0x2b60268cf010
[c36a-s39.ufhpc:252553] pml_ucx.c:255 created ucp context 0xdb6a60, worker 0x2ab4814ba010
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252553] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252547] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252551] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252552] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252546] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252548] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 8
[c36a-s39.ufhpc:252550] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 0
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252549] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 1
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 2
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 3
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 4
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 5
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 6
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 7
[c36a-s39.ufhpc:252545] pml_ucx.c:313 connecting to proc. 8

@ca-taylor (Author)

I think we may only be seeing these messages on our GPU nodes with CUDA-enabled builds of Open MPI, but I still need to verify that.

@ca-taylor (Author)

The most recent job ran to completion without errors from the app. Below is the stderr, which looks OK.

So I guess the question is, "Why the message about the error initializing the OpenFabrics device?"

[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252546] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252546] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252553] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252553] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252552] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252552] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252547] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252547] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252549] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 6
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 7
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 0
[c36a-s39.ufhpc:252549] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252545] common_ucx.c:190 disconnecting from rank 8
[c36a-s39.ufhpc:252545] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 1
[c36a-s39.ufhpc:252548] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252548] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 2
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252550] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252550] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 3
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 4
[c36a-s39.ufhpc:252551] common_ucx.c:190 disconnecting from rank 5
[c36a-s39.ufhpc:252551] common_ucx.c:156 waiting for 0 disconnect requests
[c36a-s39.ufhpc:252551] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252546] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252553] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252547] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252548] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252549] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252552] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252545] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252550] pml_ucx.c:269 mca_pml_ucx_cleanup
[c36a-s39.ufhpc:252553] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252546] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252552] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252551] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252548] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252545] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252550] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252549] pml_ucx.c:182 mca_pml_ucx_close
[c36a-s39.ufhpc:252547] pml_ucx.c:182 mca_pml_ucx_close

@yosefe (Contributor) commented Mar 25, 2019

@ca-taylor can you pls try configuring OpenMPI --without-verbs instead of --with-verbs?
These error messages are printed by openib BTL which is deprecated. --without-verbs should disable it altogether.
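A sketch of what the adjusted configure invocation might look like (the install prefix is abbreviated; every other site-specific option from the original configure line stays as it was):

./configure --prefix=<install-prefix> \
            --with-ucx \
            --without-verbs \
            ...   # remaining options unchanged from the original configure line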

@ca-taylor (Author)

@ca-taylor can you pls try configuring OpenMPI --without-verbs instead of --with-verbs?
These error messages are printed by openib BTL which is deprecated. --without-verbs should
disable it altogether.

That resolves the issue and is simple enough. Thank you.

Is there an easy way to determine which device and transport layer UCX has decided to use? I don't see any indication of that in the verbose output.

@jsquyres (Member)

@yosefe I'll bet we're going to get more questions like this. Can you guys make up an FAQ item or three about this so that when people google for it, they find the FAQ / don't need to ask on the mailing list / don't need to post an issue?

@yosefe (Contributor) commented Mar 26, 2019

@ca-taylor

Is there an easy way to determine which device and transport layer UCX has decided to use? I don't see any indication of that in the verbose output.

Currently, no. However, it's possible to set which device and transport to use: https://github.com/openucx/ucx/wiki/UCX-environment-parameters
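For example, a hedged sketch using those documented UCX environment variables (mlx4_0:1 is the adapter/port reported in the logs above; the transport list is only illustrative):

# Pin UCX to a specific HCA/port and set of transports
export UCX_NET_DEVICES=mlx4_0:1
export UCX_TLS=rc,sm,self

# List the devices and transports UCX can see on this node
ucx_info -d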

@devreal (Contributor) commented Mar 28, 2019

I am running into what seems to be the same issue on our IB cluster in an application that makes heavy use of MPI RMA calls. I followed the advice from @yosefe and configured Open MPI 4.0.0 with --without-verbs and UCX 1.5. I am still seeing the warning message:

--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              n091002
  Local adapter:           mlx4_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   n091002
  Local device: mlx4_0
--------------------------------------------------------------------------

I also get similar UCX debug output with OMPI_MCA_opal_common_ucx_opal_mem_hooks=1 and OMPI_MCA_pml_ucx_verbose=100:

[n091002:29554] ../../../../../opal/mca/common/ucx/common_ucx.c:108 using OPAL memory hooks as external events
[n091002:29554] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:134 mca_pml_ucx_open
[n091601:24026] ../../../../../opal/mca/common/ucx/common_ucx.c:108 using OPAL memory hooks as external events
[n091601:24026] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:134 mca_pml_ucx_open
[n091002:29554] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:198 mca_pml_ucx_init
[n091601:24026] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:198 mca_pml_ucx_init
[n091002:29554] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:255 created ucp context 0x1f9e210, worker 0x2b5943eb4010
[n091601:24026] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:255 created ucp context 0x12771e0, worker 0x2b8893f75010
[n091601:24026] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:313 connecting to proc. 0
[n091002:29554] ../../../../../ompi/mca/pml/ucx/pml_ucx.c:313 connecting to proc. 1

After that, the application hangs in a one-sided communication call. Interestingly, if I set osc_ucx_priority to 100, the same messages appear, but the application runs for a while before it segfaults somewhere:

    0  ucx-1.5.0/lib/libucs.so.0(+0x23bbb) [0x2af128a0abbb]
    1  ucx-1.5.0/lib/libucs.so.0(+0x236b6) [0x2af128a0a6b6]
    2  openmpi-4.0.0-ucx-intel/lib/openmpi/mca_osc_ucx.so(req_completion+0x84) [0x2af12a514874]
    3  ucx-1.5.0/lib/libucp.so.0(ucp_atomic_rep_handler+0x4a) [0x2af128317ffa]
    4  ucx-1.5.0/lib/libuct.so.0(+0x2b872) [0x2af128598872]
    5  ucx-1.5.0/lib/libucp.so.0(ucp_worker_progress+0x33) [0x2af128312463]
    6  openmpi-4.0.0-ucx-intel/lib/openmpi/mca_osc_ucx.so(ompi_osc_ucx_fetch_and_op+0x4ed) [0x2af12a50b73d]
    7  openmpi-4.0.0-ucx-intel/lib/libmpi.so.40(MPI_Fetch_and_op+0x5f) [0x2af112d0a44f]
    8  ./mpi_progress_ompi4.0.0ucx() [0x405a08]
    9  ./mpi_progress_ompi4.0.0ucx() [0x402cb3]
   10  /lib64/libc.so.6(__libc_start_main+0xf5) [0x2af1137273d5]
   11  ./mpi_progress_ompi4.0.0ucx() [0x401fa9]

There is no connection with CUDA here.

ompi_info shows no significant traces of verbs:

$ ompi_info --all --all | grep verbs
  Configure command line: 'CC=icc' 'CXX=icpc' 'FTN=ifort' '--with-ucx=/lustre/nec/ws2/ws/opt-vulcan/ucx-1.5.0/' '--prefix=/lustre/nec/ws2/ws/opt-vulcan/openmpi-4.0.0-ucx-intel' '--without-verbs'
                          GID index to use on verbs device ports

Setting btl_openib_allow_ib to true works as expected, but I guess it disables UCX.

Anything else I might try?
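One sanity check, as a hedged suggestion (`./app` stands in for the real binary): confirm which UCX components are actually installed, and force the UCX PML so a silent fallback shows up as an abort rather than a hang:

# List the UCX-related components that are installed
ompi_info | grep -i ucx

# Require the UCX PML; if it cannot be selected, the job aborts instead of
# silently falling back to another PML
mpirun --mca pml ucx -np 2 ./app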

@jsquyres (Member)

We have an FAQ item in the works -- does this help you?

open-mpi/ompi-www#249

@devreal (Contributor) commented Mar 28, 2019

@jsquyres It helped in that it prompted me to delete my Open MPI installation and rerun make install. That seemed to get rid of the openib components. Now the warning message is gone, but the application hangs:

One process waits in a barrier, another one has this stack trace:

(gdb) bt
#0  0x00002ae0d34c2a69 in opal_progress () from /lustre/nec/ws2/ws/opt-vulcan/openmpi-4.0.0-ucx-intel/lib/libopen-pal.so.40
#1  0x00002ae0e95f5a36 in ompi_osc_pt2pt_flush_lock () from /lustre/nec/ws2/ws/opt-vulcan/lib/openmpi/mca_osc_pt2pt.so
#2  0x00002ae0e95f7fa9 in ompi_osc_pt2pt_flush () from /lustre/nec/ws2/ws/opt-vulcan/openmpi-4.0.0-ucx-intel/lib/openmpi/mca_osc_pt2pt.so
#3  0x00002ae0d221c11d in PMPI_Win_flush () from /lustre/nec/ws2/ws/opt-vulcan/openmpi-4.0.0-ucx-intel/lib/libmpi.so.40
#4  0x00000000004059f7 in fetch_op ()
#5  0x000000000040220e in main ()

I am a bit puzzled that it's the pt2pt osc module, though. Setting osc_ucx_priority to 100 again seems to finally enable UCX for one-sided communication. The results seem fishy, though, and the application crashes with the same stack trace reported above. I will dig into that tomorrow with a debug build of Open MPI.
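For reference, a hedged sketch of selecting the UCX one-sided component explicitly instead of raising its priority (the binary name is the one from the trace above):

# Select the UCX PML and OSC components explicitly
mpirun --mca pml ucx --mca osc ucx -np 2 ./mpi_progress_ompi4.0.0ucx

# Equivalent via environment variables (useful under srun)
export OMPI_MCA_pml=ucx
export OMPI_MCA_osc=ucx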

@jsquyres (Member)

@devreal That seems like a new / different problem. You might want to open a new issue about that, and make sure Mellanox sees / replies to you about it.

@devreal (Contributor) commented Mar 29, 2019

@jsquyres I'm in the process of reporting some more issues I'm facing with UCX. I do not, however, see the pt2pt osc component being used over UCX; that seems to have been a glitch on my side. As stated above, purging the installation directory helped get rid of the openib component.

@yosefe (Contributor) commented Mar 31, 2019

@devreal pt2pt does not work over UCX; there is a separate OSC component called "ucx" instead.
@ca-taylor @devreal can we close this, and track the hangs in #6546 and #6549?

@devreal (Contributor) commented Apr 1, 2019

@yosefe From my perspective yes, sorry for taking over this issue ;)

@ca-taylor (Author) commented Apr 1, 2019 via email
