@@ -2343,50 +2343,76 @@ shell$ mpirun --mca pml ucx --mca osc ucx --mca scoll ucx --mca atomic ucx ...

/////////////////////////////////////////////////////////////////////////

- $q[] = "I'm getting errors about \"initializing an OpenFabrics device\" when running
- 4.0.0 with UCX support enabled. What should I do?";
+ $q[] = "I'm getting errors about \"initializing an OpenFabrics device\" when running v4.0.0 with UCX support enabled. What should I do?";
$anchor[] = "ofa-device-error";
- $a[] = "Disable ib verbs support (--without-verbs).

- The messages below were observed by at least one site where OpenMPI 4.0.0
- was built with support for InfiniBand verbs (--with-verbs), OFA UCX (--with-ucx),
- and CUDA (--with-cuda) with applications running on gpu-enabled hosts.
+ $a[] = "The short answer is that you should probably just disable
+ verbs support in Open MPI.

- Since the openib BTL (configured via --with-verbs) is deprecated in favor
- of UCX and the UCX PML includes support for the OpenFabrics devices in question
- (Mellanox HCAs), the openib BTL is not needed.
+ The messages below were observed by at least one site where Open MPI v4.0.0
+ was built with support for InfiniBand verbs ([--with-verbs]), OFA UCX ([--with-ucx]),
+ and CUDA ([--with-cuda]) with applications running on GPU-enabled hosts.

- <pre>
+ <geshi>
+ -----------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

- Local host: c36a-s39
- Local device: mlx4_0
- </pre>
+ Local host: c36a-s39
+ Local device: mlx4_0
+ -----------------------------------------------------------------------------
+ </geshi>

and

- <pre>
+ <geshi>
-----------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default. The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

- Local host: c36a-s39
- Local adapter: mlx4_0
- Local port: 1
+ Local host: c36a-s39
+ Local adapter: mlx4_0
+ Local port: 1
-----------------------------------------------------------------------------
- </pre>";
+ </geshi>
+
+ As the messages above indicate, the [openib] BTL (which is built when
+ Open MPI is configured with [--with-verbs]) is deprecated in favor of
+ the UCX PML, which includes support for OpenFabrics devices.
+ The [openib] BTL is therefore not needed.
+
+ You can disable the [openib] BTL in a few different ways:
+
+ <ul>
+ <li> Configure Open MPI with [--without-verbs]. This will prevent building
+ the [openib] BTL in the first place.</li>
+ <li> Disable the [openib] BTL via the [btl] MCA param (see <a
+ href=\"?category=tuning#setting-mca-params\">this FAQ item</a> for
+ information on how to set MCA params). For example:
+ <geshi bash>
+ shell$ mpirun --mca btl '^openib' ...
+ </geshi></li>
+ <li> Force the use of the UCX PML via the [pml] MCA param. For example:
+ <geshi bash>
+ shell$ mpirun --mca pml ucx ...
+ </geshi></li>
+ </ul>";
+
/////////////////////////////////////////////////////////////////////////

$q[] = "How can I find out what devices and transports are supported by UCX on my system?";
$anchor[] = "ucx-supported-devices";
- $a[] = "The <code>ucx_info</code> command can be used.

- For example,
- <pre>
+ $a[] = "You can use the [ucx_info] command; see the <a
+ href=\"http://www.openucx.org/documentation/\">UCX documentation</a>
+ for more information. For example, its [-d] option shows
+ information about the devices and transports that UCX can use:
+
+ <geshi bash>
shell$ ucx_info -d
- </pre>";
+ </geshi>";
+

/////////////////////////////////////////////////////////////////////////

$q[] = "What is <code>cpu-set</code>?";