Warning: There was an error initializing an OpenFabrics device. #6517
Comments
From my config.log...
From the verbose output, it may be lying or just dropping back to ethernet. I can't tell.
@ca-taylor is there any other issue except this error message? E.g., is the test running successfully?
Then a bit later...
Comment from yosefe: can you pls try adding "--mca opal_common_ucx_opal_mem_hooks 1"?
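For readers who want to try the suggestion above, here is a minimal sketch of how the parameter could be passed on the command line. The MCA parameter is the one quoted in the comment; the binary name `./my_mpi_app` and the rank count are hypothetical placeholders, not taken from this thread.

```shell
#!/bin/sh
# Hypothetical application binary and rank count; substitute your own.
APP="./my_mpi_app"
NP=2

# The MCA parameter suggested in the comment above.
MCA_ARGS="--mca opal_common_ucx_opal_mem_hooks 1"

# Assemble the command line and show it.
CMD="mpirun -np $NP $MCA_ARGS $APP"
echo "$CMD"

# Launch only when mpirun and the binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $CMD
fi
```

The guard at the end only keeps the sketch harmless on machines without Open MPI installed; on a cluster node you would normally run the printed command directly.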
With
I need to verify this, but I think we may only be seeing these messages on our GPU nodes with CUDA-enabled builds of OpenMPI.
The most recent job ran to completion without errors from the app. Below is the stderr, which looks OK. So I guess the question is: why the message about the error initializing the OpenFabrics device?
@ca-taylor can you pls try configuring OpenMPI
That resolves the issue and is simple enough. Thank you. Is there an easy way to determine which device and transport layer UCX has decided to use? I don't see any indication of that in the verbose output.
@yosefe I'll bet we're going to get more questions like this. Can you guys make up an FAQ item or three about this so that when people google for it, they find the FAQ / don't need to ask on the mailing list / don't need to post an issue?
Currently - no. However, it's possible to set the device and transport to use: https://github.com/openucx/ucx/wiki/UCX-environment-parameters
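As a rough illustration of setting the device and transport explicitly, the sketch below uses the UCX environment variables `UCX_NET_DEVICES` and `UCX_TLS`. The values shown (`mlx5_0:1`, `rc,sm,self`) and the binary name are assumptions chosen for the example, not settings recommended anywhere in this thread. Note that this pins the choice rather than reporting what UCX picked on its own.

```shell
#!/bin/sh
# Illustrative values only: the device name and transport list below are
# assumptions, not taken from this thread. Adjust them to your cluster.
export UCX_NET_DEVICES="mlx5_0:1"   # hypothetical HCA device and port
export UCX_TLS="rc,sm,self"         # example transport list

# Hypothetical application binary.
APP="./my_mpi_app"

# -x forwards the named environment variables to the launched ranks.
CMD="mpirun -np 2 -x UCX_NET_DEVICES -x UCX_TLS $APP"
echo "$CMD"

# ucx_info -d prints the devices and transports UCX can see on this node.
if command -v ucx_info >/dev/null 2>&1; then
    ucx_info -d
fi

# Launch only when mpirun and the binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $CMD
fi
```

The `ucx_info -d` output lists what is available on the node, which helps when choosing values for the two variables; it does not show which transport a running job selected.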
I am running into what seems to be the same issue on our IB cluster in an application that makes heavy use of MPI RMA calls. I followed the advice from @yosefe and configured Open MPI 4.0.0 with
I also get similar UCX debug output with
After that, the application hangs in a one-sided communication call. Interestingly, if I set
There is no connection with cuda.
Setting Anything else I might try? |
We have an FAQ item in the works -- does this help you?
@jsquyres It helped in that it made me try to delete my installation of Open MPI and reissue One process waits in a barrier, another one has this stack trace:
I am a bit puzzled that it's the
@devreal That seems like a new / different problem. You might want to open a new issue about that, and make sure Mellanox sees / replies to you about it.
@jsquyres I'm in the process of reporting some more issues I'm facing with UCX. I do not, however, see the
@devreal pt2pt does not work over UCX, there is an OSC component called "ucx" instead.
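A minimal sketch of what requesting the UCX components explicitly might look like, assuming an Open MPI build that includes both the "ucx" PML and the "ucx" OSC component mentioned above. The binary name `./my_rma_app` is a hypothetical placeholder.

```shell
#!/bin/sh
# Hypothetical one-sided (RMA) application binary.
APP="./my_rma_app"

# Request the UCX PML for point-to-point and the UCX OSC component for RMA.
CMD="mpirun -np 2 --mca pml ucx --mca osc ucx $APP"
echo "$CMD"

# ompi_info lists the components this Open MPI build was compiled with.
if command -v ompi_info >/dev/null 2>&1; then
    ompi_info | grep -i ucx || echo "no UCX components reported"
fi

# Launch only when mpirun and the binary are actually present.
if command -v mpirun >/dev/null 2>&1 && [ -x "$APP" ]; then
    $CMD
fi
```

Checking `ompi_info` first is a cheap way to confirm that the components named on the command line exist in the installed build before debugging a hang.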
@yosefe From my perspective yes, sorry for taking over this issue ;)
@yosefe Ok with me too.
… On Apr 1, 2019, at 4:58 AM, Joseph Schuchart ***@***.***> wrote:
@yosefe From my perspective yes, sorry for taking over this issue ;)
Background information
OpenMPI 4.0.0 is reporting an error message (see below) and claiming that there is an error initializing an OpenFabrics device.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.0
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
openmpi-4.0.0.tar.gz (GA release)
Please describe the system on which you are running
ucx-1.4.0-1.el7.x86_64
ucx-devel-1.4.0-1.el7.x86_64
Details of the problem
I'm encountering the following message despite having built with UCX library support, which works fine with OpenMPI 3.1.2.