Commit b1a2f38

Merge pull request #249 from ca-taylor/faq-6517
FAQs for github issue #6517
2 parents b8c8f45 + 856f960 commit b1a2f38

1 file changed

+80
-0
lines changed

faq/openfabrics.inc

Lines changed: 80 additions & 0 deletions
@@ -2343,6 +2343,86 @@ shell$ mpirun --mca pml ucx --mca osc ucx --mca scoll ucx --mca atomic ucx ...
/////////////////////////////////////////////////////////////////////////

$q[] = "I'm getting errors about \"initializing an OpenFabrics device\" when running v4.0.0 with UCX support enabled. What should I do?";
$anchor[] = "ofa-device-error";

$a[] = "The short answer is that you should probably just disable
verbs support in Open MPI.

The messages below were observed by at least one site where Open MPI
v4.0.0 was built with support for InfiniBand verbs ([--with-verbs]),
OFA UCX ([--with-ucx]), and CUDA ([--with-cuda]) with applications
running on GPU-enabled hosts:

<geshi>
WARNING: There was an error initializing an OpenFabrics device.

Local host: c36a-s39
Local device: mlx4_0
</geshi>

and

<geshi>
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

Local host: c36a-s39
Local adapter: mlx4_0
Local port: 1
</geshi>

These messages are coming from the [openib] BTL. As noted in the
messages above, the [openib] BTL (enabled when Open MPI is configured
with [--with-verbs]) is deprecated in favor of the UCX PML, which
includes support for OpenFabrics devices. The [openib] BTL is
therefore not needed.

You can disable the [openib] BTL (and therefore avoid these messages)
in a few different ways:

<ul>
<li> Configure Open MPI with [--without-verbs]. This will prevent building
the [openib] BTL in the first place.</li>
<li> Disable the [openib] BTL via the [btl] MCA param (see <a
href=\"?category=tuning#setting-mca-params\">this FAQ item</a> for
information on how to set MCA params). For example,
<geshi bash>
shell$ mpirun --mca btl '^openib' ...
</geshi></li>
</ul>

Note that simply selecting a different PML (e.g., the UCX PML) is
*not* sufficient to avoid these messages. For example:

<geshi bash>
shell$ mpirun --mca pml ucx ...
</geshi>

You will still see these messages because the [openib] BTL is used not
only by the PML but also in other contexts internally in Open MPI.
Hence, it is not sufficient to simply choose a non-OB1 PML; you
need to actually disable the [openib] BTL to make the messages go
away.";

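The [btl] MCA parameter discussed in the answer above does not have to be passed on every mpirun command line. The following is a minimal sketch, assuming the usual Open MPI conventions of OMPI_MCA_<name> environment variables and a per-user $HOME/.openmpi/mca-params.conf file. PARAMS_FILE is a hypothetical helper variable introduced only for this example; it is not an Open MPI setting.

```shell
#!/bin/sh
# Sketch: two ways to exclude the openib BTL without editing every
# mpirun command line. Both rely on general Open MPI MCA parameter
# lookup and are assumptions about a default installation.

# 1. Environment variable: Open MPI reads OMPI_MCA_<param> at launch.
OMPI_MCA_btl='^openib'
export OMPI_MCA_btl

# 2. Per-user parameter file. PARAMS_FILE is a hypothetical variable
#    (not part of Open MPI) so the target file can be overridden.
PARAMS_FILE="${PARAMS_FILE:-$HOME/.openmpi/mca-params.conf}"
mkdir -p "$(dirname "$PARAMS_FILE")"

# Append the setting only if an identical line is not already present.
if ! grep -qxF 'btl = ^openib' "$PARAMS_FILE" 2>/dev/null; then
    echo 'btl = ^openib' >> "$PARAMS_FILE"
fi

echo "OMPI_MCA_btl=$OMPI_MCA_btl"
echo "params file: $PARAMS_FILE"
```

With either setting in place, a plain "mpirun --mca pml ucx ..." should no longer load the [openib] BTL. If the parameter file already sets [btl] to another value, edit that line rather than appending a second one.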
/////////////////////////////////////////////////////////////////////////

$q[] = "How can I find out what devices and transports are supported by UCX on my system?";
$anchor[] = "ucx-supported-devices";

$a[] = "You can use the [ucx_info] command (see the <a
href=\"http://www.openucx.org/documentation/\">UCX documentation</a>
for more information). For example:

<geshi bash>
shell$ ucx_info -d
</geshi>";

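The full output of "ucx_info -d" can be long on hosts with several adapters. The following is a rough sketch of one way to condense it, assuming the output marks each entry with "Transport:" and "Device:" labels (an assumption about the output format that may vary between UCX versions). The function name list_ucx_transports is hypothetical and not part of UCX.

```shell
#!/bin/sh
# Sketch: summarize the transports and devices reported by ucx_info -d.
# list_ucx_transports is a hypothetical helper name, not a UCX command.
list_ucx_transports() {
    # Bail out cleanly when the UCX tools are not installed.
    if ! command -v ucx_info >/dev/null 2>&1; then
        echo "ucx_info not found in PATH" >&2
        return 1
    fi
    # Keep only the Transport: and Device: lines, then de-duplicate.
    ucx_info -d | grep -E 'Transport:|Device:' | sort -u
}

list_ucx_transports || echo "UCX tools do not appear to be installed"
```

If the labels differ on your UCX version, adjust the grep pattern after inspecting the unfiltered output.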
/////////////////////////////////////////////////////////////////////////

$q[] = "What is <code>cpu-set</code>?";
$anchor[] = "cpu-set";
$a[] = "
