@@ -2343,6 +2343,86 @@ shell$ mpirun --mca pml ucx --mca osc ucx --mca scoll ucx --mca atomic ucx ...
/////////////////////////////////////////////////////////////////////////
+$q[] = "I'm getting errors about \"initializing an OpenFabrics device\" when running v4.0.0 with UCX support enabled. What should I do?";
+$anchor[] = "ofa-device-error";
+
+$a[] = "The short answer is that you should probably just disable
+verbs support in Open MPI.
+
+The messages below were observed by at least one site where Open MPI
+v4.0.0 was built with support for InfiniBand verbs ([--with-verbs]),
+OFA UCX ([--with-ucx]), and CUDA ([--with-cuda]) with applications
+running on GPU-enabled hosts:
+
+<geshi>
+WARNING: There was an error initializing an OpenFabrics device.
+
+Local host: c36a-s39
+Local device: mlx4_0
+</geshi>
+
+and
+
+<geshi>
+By default, for Open MPI 4.0 and later, infiniband ports on a device
+are not used by default. The intent is to use UCX for these devices.
+You can override this policy by setting the btl_openib_allow_ib MCA parameter
+to true.
+
+Local host: c36a-s39
+Local adapter: mlx4_0
+Local port: 1
+</geshi>
+
+These messages are coming from the [openib] BTL. As noted in the
+messages above, the [openib] BTL (enabled when Open MPI is configured
+[--with-verbs]) is deprecated in favor of the UCX PML, which includes
+support for OpenFabrics devices. The [openib] BTL is therefore not
+needed.
+
+You can disable the [openib] BTL (and therefore avoid these messages)
+in a few different ways:
+
+<ul>
+<li> Configure Open MPI with [--without-verbs]. This will prevent
+building the [openib] BTL in the first place.</li>
+<li> Disable the [openib] BTL via the [btl] MCA param (see <a
+href=\"?category=tuning#setting-mca-params\">this FAQ item</a> for
+information on how to set MCA params). For example:
+<geshi bash>
+shell$ mpirun --mca btl '^openib' ...
+</geshi></li>
+</ul>
+
+Note that simply selecting a different PML (e.g., the UCX PML) is
+*not* sufficient to avoid these messages. For example:
+
+<geshi bash>
+shell$ mpirun --mca pml ucx ...
+</geshi>
+
+You will still see these messages because the [openib] BTL is used not
+only by the PML but also in other contexts internally in Open MPI.
+Hence, it is not sufficient to simply choose a non-OB1 PML; you need
+to actually disable the [openib] BTL to make the messages go away.";
+
+/////////////////////////////////////////////////////////////////////////
+
+$q[] = "How can I find out what devices and transports are supported by UCX on my system?";
+$anchor[] = "ucx-supported-devices";
+
+$a[] = "You can use the [ucx_info] command; see the <a
+href=\"http://www.openucx.org/documentation/\">UCX documentation</a>
+for more information. For example:
+
+<geshi bash>
+shell$ ucx_info -d
+</geshi>";
+
+/////////////////////////////////////////////////////////////////////////
+
$q[] = "What is <code>cpu-set</code>?";
$anchor[] = "cpu-set";
$a[] = "
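For anyone who wants to try the workaround described in the new "ofa-device-error" FAQ entry, the following is a minimal sketch and is not part of the commit. It assumes that `ompi_info` prints one `MCA btl: <name> ...` line per built BTL component, and that an MCA parameter can also be set through an `OMPI_MCA_<param>` environment variable instead of `--mca` on the `mpirun` command line. `has_openib_btl` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Sketch only: detect whether the openib BTL was built, using text in the
# style of `ompi_info` output. The exact line layout is an assumption.

# Hypothetical helper (not part of Open MPI): reads a component listing on
# stdin and succeeds if it contains an "MCA btl: openib" line.
has_openib_btl() {
    grep -Eq 'MCA[[:space:]]+btl:[[:space:]]+openib'
}

# Typical use, on a host with Open MPI in PATH (left commented out so the
# sketch can be sourced without Open MPI installed):
#
#   if ompi_info | has_openib_btl; then
#       # Same effect as: mpirun --mca btl '^openib' ...
#       export OMPI_MCA_btl='^openib'
#   fi
```

Setting the parameter through the environment keeps existing `mpirun` command lines unchanged, which may be convenient in job scripts. Rebuilding with `--without-verbs` remains the option that removes the component entirely.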