@ovidiusm commented Sep 22, 2025

What?

Fix nixlbench container build with CUDA 13.0

Why?

To unblock testing CUDA 13.0 in NIXL and UCX

How?

  • PyTorch 2.10 will be compatible with CUDA 13 but has not been released yet; added a uv flag to pull it from the nightly index.
  • Separated venv creation from the build.
  • Removed the uv run commands. They are not needed once the venv is activated, and they automatically pull packages, which we do not want since we cannot pass the nightly PyTorch index to them (letting uv resolve the venv would downgrade the PyTorch packages).
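The steps above can be sketched roughly as follows. This is a hedged illustration, not the exact build.sh change: the nightly index URL and the bare `torch` pin are assumptions for illustration.

```shell
# Create the venv up front (separate from the build) and activate it,
# so later commands run plain python/pip inside it instead of `uv run`.
uv venv /workspace/nixl/.venv
. /workspace/nixl/.venv/bin/activate

# Pull the prerelease PyTorch wheel from the nightly cu130 index.
# --prerelease=allow lets uv select the dev build (2.10.0.devYYYYMMDD+cu130).
uv pip install \
    --index-url https://download.pytorch.org/whl/nightly/cu130 \
    --prerelease=allow \
    torch
```

With the venv activated, subsequent build steps see the nightly torch and do not trigger a re-resolution that would downgrade it.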

Tested:

  • container build
  • nixlbench on GPU worker
  • python bindings (nixl_api_example)
  • kvbench (sequential test on GPU worker)

Issues:

  • The gpunetio plugin still links against a CUDA 12 binary (it links against both cudart 12 and 13):

Failed to load plugin from /workspace/nixl/.venv/lib/python3.12/site-packages/.nixl.mesonpy.libs/plugins/libplugin_GPUNETIO.so: libcudart.so.12: cannot open shared object file: No such file or directory

This seems to be a problem in the DOCA dependency; we should look into it separately.
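A quick way to confirm the dual linkage is to read the plugin's DT_NEEDED entries directly. This is a hedged diagnostic sketch (path taken from the error above); unlike ldd, readelf does not try to resolve the libraries, so it works even when libcudart.so.12 is absent from the container.

```shell
# List the libcudart entries the plugin declares as dependencies.
# If both .so.12 and .so.13 appear, the CUDA 12 requirement is baked
# into the binary itself (i.e. injected at link time by a dependency).
readelf -d /usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GPUNETIO.so \
    | grep NEEDED | grep libcudart
```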

Tests

Build:

./benchmark/nixlbench/contrib/build.sh --base-image-tag 25.09-cuda13.0-devel-ubuntu24.04

Nixlbench:

docker run --privileged --device=/dev/infiniband --net=host --ipc=host --pid=host --gpus all -e NVIDIA_VISIBLE_DEVICES=all --rm -ti $IMG nixlbench --etcd-endpoints http://$SERVER:2379 --backend UCX --initiator_seg_type VRAM
==========
== CUDA ==
==========

NVIDIA Release  (build )
CUDA Version 13.0.1.012
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 13.0 driver version 580.82.07 with kernel driver version 575.57.08.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

WARNING: Adjusting num_iter to 1008 to allow equal distribution to 1 threads
WARNING: Adjusting warmup_iter to 112 to allow equal distribution to 1 threads
Connecting to ETCD at http://soul05:2379
ETCD Runtime: Registered as rank 0 item 1 of 2
E1009 13:00:03.722458  150881 nixl_plugin_manager.cpp:122] Failed to load plugin from /usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GPUNETIO.so: libcudart.so.12: cannot open shared object file: No such file or directory
E1009 13:00:03.722516  150881 nixl_plugin_manager.cpp:288] Failed to load plugin 'GPUNETIO' from any directory
Init nixl worker, dev all rank 0, type initiator, hostname soul05
Waiting for all processes to start... (expecting 2 total: 1 initiators and 1 targets)
All processes are ready to proceed
****************************************************************************************************************************************************************
NIXLBench Configuration
****************************************************************************************************************************************************************
Runtime (--runtime_type=[etcd])                             : ETCD
ETCD Endpoint                                               : http://soul05:2379
Worker type (--worker_type=[nixl,nvshmem])                  : nixl
Backend (--backend=[UCX,UCX_MO,GDS,GDS_MT,POSIX,Mooncake,HF3FS,OBJ]): UCX
Enable pt (--enable_pt=[0,1])                               : 0
Progress threads (--progress_threads=N)                     : 0
Device list (--device_list=dev1,dev2,...)                   : all
Enable VMM (--enable_vmm=[0,1])                             : 0
Initiator seg type (--initiator_seg_type=[DRAM,VRAM])       : VRAM
Target seg type (--target_seg_type=[DRAM,VRAM])             : DRAM
Scheme (--scheme=[pairwise,manytoone,onetomany,tp])         : pairwise
Mode (--mode=[SG,MG])                                       : SG
Op type (--op_type=[READ,WRITE])                            : WRITE
Check consistency (--check_consistency=[0,1])               : 0
Total buffer size (--total_buffer_size=N)                   : 8589934592
Num initiator dev (--num_initiator_dev=N)                   : 1
Num target dev (--num_target_dev=N)                         : 1
Start block size (--start_block_size=N)                     : 4096
Max block size (--max_block_size=N)                         : 67108864
Start batch size (--start_batch_size=N)                     : 1
Max batch size (--max_batch_size=N)                         : 1
Num iter (--num_iter=N)                                     : 1008
Warmup iter (--warmup_iter=N)                               : 112
Large block iter factor (--large_blk_iter_ftr=N)            : 16
Num threads (--num_threads=N)                               : 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)  P99 Post (us)  Avg Tx (us)    P99 Tx (us)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4096                1              0.936229       4.4            11.0           11.0           0.7            1.0            3.7            5.0
8192                1              1.859805       4.4            12.0           12.0           0.9            1.0            3.5            4.0
16384               1              2.979447       5.5            12.0           12.0           0.9            1.0            4.5            5.0
32768               1              5.330882       6.1            12.0           12.0           0.9            1.0            5.3            6.0
65536               1              8.678440       7.6            12.0           12.0           0.9            1.0            6.6            7.0
131072              1              12.594907      10.4           13.0           13.0           0.9            1.0            9.5            11.0
262144              1              16.341444      16.0           12.0           12.0           0.9            1.0            15.2           17.0
524288              1              19.239226      27.3           12.0           12.0           0.9            1.0            26.4           30.0
1048576             1              21.213114      49.4           12.0           12.0           0.9            1.0            48.5           52.0
2097152             1              22.208871      94.4           12.0           12.0           1.0            9.0            93.1           97.0
4194304             1              22.856254      183.5          13.0           13.0           1.0            9.0            182.2          186.0
8388608             1              23.217745      361.3          13.0           13.0           1.0            9.0            360.0          366.0
16777216            1              23.385207      717.4          12.0           12.0           1.0            10.0           716.1          722.0
33554432            1              23.412143      1433.2         13.0           13.0           1.0            10.0           1431.9         1459.0
67108864            1              23.493715      2856.5         13.0           13.0

Plugin dependencies:

docker run --privileged --device=/dev/infiniband --net=host --ipc=host --pid=host --gpus all -e NVIDIA_VISIBLE_DEVICES=all --rm -ti $IMG sh -c "find /usr/local/nixl -name '*so' | xargs -IF sh -c 'ldd F | grep -q libcudart && echo F && ldd F | grep libcudart'"

/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_UCX_MO.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007f4d8bc00000)
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GPUNETIO.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007f7b79600000)
        libcudart.so.12 => not found
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GDS_MT.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007fea2ea00000)
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GDS.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007f42cd800000)
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_LIBFABRIC.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007fb794200000)

PyTorch:

python3
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.10.0.dev20251008+cu130
>>> print("CUDA available:", torch.cuda.is_available())
CUDA available: True
>>> print("CUDA version:", torch.version.cuda)
CUDA version: 13.0
>>> print("cuDNN version:", torch.backends.cudnn.version())
cuDNN version: 91300

Python example:

/workspace/nixl/examples/python# ./nixl_api_example.py
2025-10-09 13:18:32 NIXL INFO    nixl_api_example.py:35 Using NIXL Plugins from:
/workspace/nixl/.venv/lib/python3.12/site-packages/.nixl.mesonpy.libs/plugins/
E1009 13:18:32.498575  158786 nixl_plugin_manager.cpp:122] Failed to load plugin from /workspace/nixl/.venv/lib/python3.12/site-packages/.nixl.mesonpy.libs/plugins/libplugin_GPUNETIO.so: libcudart.so.12: cannot open shared object file: No such file or directory
E1009 13:18:32.498604  158786 nixl_plugin_manager.cpp:288] Failed to load plugin 'GPUNETIO' from any directory
2025-10-09 13:18:36 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-10-09 13:18:36 NIXL INFO    _api.py:251 Initialized NIXL agent: target
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:44 Plugin parameters:
['DRAM_SEG', 'VRAM_SEG']
{'ucx_error_handling_mode': 'peer', 'num_workers': '1', 'ucx_devices': ''}
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:50 Backend parameters:
['DRAM_SEG', 'VRAM_SEG']
{}
2025-10-09 13:18:36 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-10-09 13:18:36 NIXL INFO    _api.py:251 Initialized NIXL agent: initiator
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:100 Loaded name from metadata: b'target'
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:130 Initiator done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:135 Target done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:130 Initiator done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:135 Target done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:153 sent notif:
b'DESCS: \x80\x04\x95\xb5\x00\x00\x00\x00\x00\x00\x00\x8c\x0enixl._bindings\x94\x8c\x0cnixlRegDList\x94\x93\x94)\x81\x94C\x8bnixlSerDes|nixlDList\n\x00\x00\x00\x00\x00\x00\x00nixlSDList|t\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00|n\x08\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00|\x19\x00\x00\x00\x00\x00\x00\x00\xb0hD \x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00a|\x19\x00\x00\x00\x00\x00\x00\x00\xb0iD \x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00b|\x94b.'
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:160 received message from initiator
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:165 notif test complete, doing transfer 2
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:184 Transfer 2 started
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:194 Initiator done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:199 Target done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:212 Test Complete.

kvbench:

HOST=$(hostname | cut -d '.' -f 1)

etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://$HOST:2379 &

# Launch rank 0
export NIXL_ETCD_ENDPOINTS=$HOST:2379
export SLURM_PROCID=0
export SLURM_NTASKS=2
unset UCX_NET_DEVICES
export CUDA_VISIBLE_DEVICES=0,1
/workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml &

# Launch rank 1
export NIXL_ETCD_ENDPOINTS=$HOST:2379
export SLURM_PROCID=1
export SLURM_NTASKS=2
export CUDA_VISIBLE_DEVICES=0,1
unset UCX_NET_DEVICES
/workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml &

2025-10-09 13:24:55 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-10-09 13:24:55 NIXL INFO    _api.py:251 Initialized NIXL agent: 1
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:178 [Rank 0] Preparing TPs
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:178 [Rank 1] Preparing TPs
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:200 [Rank 0] Running isolated benchmark (to measure perf without noise)
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:200 [Rank 1] Running isolated benchmark (to measure perf without noise)
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:246 [Rank 1] Running workload benchmark
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:246 [Rank 0] Running workload benchmark
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:359 Iteration 1/3
  Transfer size (GB)    Latency (ms)    Isolated Latency (ms)    Num Senders    Mean BW (GB/s)
--------------------  --------------  -----------------------  -------------  ----------------
               0.365           1.176                    1.172              1           308.967
               1.046           3.287                    3.657              1           308.967
               1.321           4.550                    4.757              1           308.967
               0.758           2.609                    2.602              1           308.967
               1.170           3.827                    3.899              1           308.967
               0.716           2.778                    2.403              1           308.967
               0.783           2.480                    2.554              1           308.967
               0.354           1.179                    1.144              1           308.967
               0.643           2.034                    2.143              1           308.967
               0.854           2.765                    2.955              1           308.967
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:359 Iteration 2/3
  Transfer size (GB)    Latency (ms)    Isolated Latency (ms)    Num Senders    Mean BW (GB/s)
--------------------  --------------  -----------------------  -------------  ----------------
               0.365           1.187                    1.172              1           305.126
               1.046           3.285                    3.657              1           305.126
               1.321           4.519                    4.757              1           305.126
               0.758           2.905                    2.602              1           305.126
               1.170           3.832                    3.899              1           305.126
               0.716           2.759                    2.403              1           305.126
               0.783           2.596                    2.554              1           305.126
               0.354           1.364                    1.144              1           305.126
               0.643           2.033                    2.143              1           305.126
               0.854           2.800                    2.955              1           305.126
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:405 [Rank 1] Finished run, destroying objects
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:359 Iteration 3/3
  Transfer size (GB)    Latency (ms)    Isolated Latency (ms)    Num Senders    Mean BW (GB/s)
--------------------  --------------  -----------------------  -------------  ----------------
               0.365           1.179                    1.172              1           304.090
               1.046           3.280                    3.657              1           304.090
               1.321           4.554                    4.757              1           304.090
               0.758           3.105                    2.602              1           304.090
               1.170           3.864                    3.899              1           304.090
               0.716           2.567                    2.403              1           304.090
               0.783           2.533                    2.554              1           304.090
               0.354           1.142                    1.144              1           304.090
               0.643           2.028                    2.143              1           304.090
               0.854           2.810                    2.955              1           304.090
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:405 [Rank 0] Finished run, destroying objects

[1]-  Done                    /workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml
[2]+  Done                    /workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml

👋 Hi ovidiusm! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

@ovidiusm
/build

@ovidiusm commented Oct 9, 2025
/build

1 similar comment

@ovidiusm commented Oct 9, 2025
/build

Signed-off-by: Ovidiu Mara <[email protected]>

@ovidiusm
/build

@aranadive
/build

@aranadive aranadive merged commit 9ada51f into ai-dynamo:main Oct 13, 2025
21 checks passed
@ovidiusm ovidiusm deleted the nixlbench-cuda-13 branch October 13, 2025 07:35