@ovidiusm commented Sep 22, 2025

What?

Fix nixlbench container build with CUDA 13.0

Why?

To unblock testing CUDA 13.0 in NIXL and UCX

How?

  • PyTorch 2.10 will be compatible with CUDA 13 but has not been released yet; added a uv flag to pull it from the nightly index.
  • Separated venv creation from the build.
  • Removed the uv run commands. They are not needed once the venv is activated, and they automatically pull packages, which we do not want since we cannot pass the nightly PyTorch index to them (letting uv resolve the venv would downgrade the PyTorch packages).
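The steps above can be sketched roughly as follows. This is a hedged illustration, not the exact build.sh change: the nightly index URL and the bare `torch` pin are assumptions for illustration.

```shell
# Create the venv up front (separate from the build) and activate it,
# so later commands run plain python/pip inside it instead of `uv run`.
uv venv /workspace/nixl/.venv
. /workspace/nixl/.venv/bin/activate

# Pull the prerelease PyTorch wheel from the nightly cu130 index.
# --prerelease=allow lets uv select the dev build (2.10.0.devYYYYMMDD+cu130).
uv pip install \
    --index-url https://download.pytorch.org/whl/nightly/cu130 \
    --prerelease=allow \
    torch
```

With the venv activated, subsequent build steps see the nightly torch and do not trigger a re-resolution that would downgrade it.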

Tested:

  • container build
  • nixlbench on GPU worker
  • python bindings (nixl_api_example)
  • kvbench (sequential test on GPU worker)

Issues:

  • The gpunetio plugin still links against a CUDA 12 binary (it links against both cudart 12 and 13):

Failed to load plugin from /workspace/nixl/.venv/lib/python3.12/site-packages/.nixl.mesonpy.libs/plugins/libplugin_GPUNETIO.so: libcudart.so.12: cannot open shared object file: No such file or directory

This seems to be a problem in the DOCA dependency; we should look into it separately.
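A quick way to confirm the dual linkage is to read the plugin's DT_NEEDED entries directly. This is a hedged diagnostic sketch (path taken from the error above); unlike ldd, readelf does not try to resolve the libraries, so it works even when libcudart.so.12 is absent from the container.

```shell
# List the libcudart entries the plugin declares as dependencies.
# If both .so.12 and .so.13 appear, the CUDA 12 requirement is baked
# into the binary itself (i.e. injected at link time by a dependency).
readelf -d /usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GPUNETIO.so \
    | grep NEEDED | grep libcudart
```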

Tests

Build:

./benchmark/nixlbench/contrib/build.sh --base-image-tag 25.09-cuda13.0-devel-ubuntu24.04

Nixlbench:

docker run --privileged --device=/dev/infiniband --net=host --ipc=host --pid=host --gpus all -e NVIDIA_VISIBLE_DEVICES=all --rm -ti $IMG nixlbench --etcd-endpoints http://$SERVER:2379 --backend UCX --initiator_seg_type VRAM
==========
== CUDA ==
==========

NVIDIA Release  (build )
CUDA Version 13.0.1.012
Container image Copyright (c) 2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

GOVERNING TERMS: The software and materials are governed by the NVIDIA Software License Agreement
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/)
and the Product-Specific Terms for NVIDIA AI Products
(found at https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/).

NOTE: CUDA Forward Compatibility mode ENABLED.
  Using CUDA 13.0 driver version 580.82.07 with kernel driver version 575.57.08.
  See https://docs.nvidia.com/deploy/cuda-compatibility/ for details.

WARNING: Adjusting num_iter to 1008 to allow equal distribution to 1 threads
WARNING: Adjusting warmup_iter to 112 to allow equal distribution to 1 threads
Connecting to ETCD at http://soul05:2379
ETCD Runtime: Registered as rank 0 item 1 of 2
E1009 13:00:03.722458  150881 nixl_plugin_manager.cpp:122] Failed to load plugin from /usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GPUNETIO.so: libcudart.so.12: cannot open shared object file: No such file or directory
E1009 13:00:03.722516  150881 nixl_plugin_manager.cpp:288] Failed to load plugin 'GPUNETIO' from any directory
Init nixl worker, dev all rank 0, type initiator, hostname soul05
Waiting for all processes to start... (expecting 2 total: 1 initiators and 1 targets)
All processes are ready to proceed
****************************************************************************************************************************************************************
NIXLBench Configuration
****************************************************************************************************************************************************************
Runtime (--runtime_type=[etcd])                             : ETCD
ETCD Endpoint                                               : http://soul05:2379
Worker type (--worker_type=[nixl,nvshmem])                  : nixl
Backend (--backend=[UCX,UCX_MO,GDS,GDS_MT,POSIX,Mooncake,HF3FS,OBJ]): UCX
Enable pt (--enable_pt=[0,1])                               : 0
Progress threads (--progress_threads=N)                     : 0
Device list (--device_list=dev1,dev2,...)                   : all
Enable VMM (--enable_vmm=[0,1])                             : 0
Initiator seg type (--initiator_seg_type=[DRAM,VRAM])       : VRAM
Target seg type (--target_seg_type=[DRAM,VRAM])             : DRAM
Scheme (--scheme=[pairwise,manytoone,onetomany,tp])         : pairwise
Mode (--mode=[SG,MG])                                       : SG
Op type (--op_type=[READ,WRITE])                            : WRITE
Check consistency (--check_consistency=[0,1])               : 0
Total buffer size (--total_buffer_size=N)                   : 8589934592
Num initiator dev (--num_initiator_dev=N)                   : 1
Num target dev (--num_target_dev=N)                         : 1
Start block size (--start_block_size=N)                     : 4096
Max block size (--max_block_size=N)                         : 67108864
Start batch size (--start_batch_size=N)                     : 1
Max batch size (--max_batch_size=N)                         : 1
Num iter (--num_iter=N)                                     : 1008
Warmup iter (--warmup_iter=N)                               : 112
Large block iter factor (--large_blk_iter_ftr=N)            : 16
Num threads (--num_threads=N)                               : 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------

Block Size (B)      Batch Size     B/W (GB/Sec)   Avg Lat. (us)  Avg Prep (us)  P99 Prep (us)  Avg Post (us)  P99 Post (us)  Avg Tx (us)    P99 Tx (us)
----------------------------------------------------------------------------------------------------------------------------------------------------------------
4096                1              0.936229       4.4            11.0           11.0           0.7            1.0            3.7            5.0
8192                1              1.859805       4.4            12.0           12.0           0.9            1.0            3.5            4.0
16384               1              2.979447       5.5            12.0           12.0           0.9            1.0            4.5            5.0
32768               1              5.330882       6.1            12.0           12.0           0.9            1.0            5.3            6.0
65536               1              8.678440       7.6            12.0           12.0           0.9            1.0            6.6            7.0
131072              1              12.594907      10.4           13.0           13.0           0.9            1.0            9.5            11.0
262144              1              16.341444      16.0           12.0           12.0           0.9            1.0            15.2           17.0
524288              1              19.239226      27.3           12.0           12.0           0.9            1.0            26.4           30.0
1048576             1              21.213114      49.4           12.0           12.0           0.9            1.0            48.5           52.0
2097152             1              22.208871      94.4           12.0           12.0           1.0            9.0            93.1           97.0
4194304             1              22.856254      183.5          13.0           13.0           1.0            9.0            182.2          186.0
8388608             1              23.217745      361.3          13.0           13.0           1.0            9.0            360.0          366.0
16777216            1              23.385207      717.4          12.0           12.0           1.0            10.0           716.1          722.0
33554432            1              23.412143      1433.2         13.0           13.0           1.0            10.0           1431.9         1459.0
67108864            1              23.493715      2856.5         13.0           13.0

Plugin dependencies:

docker run --privileged --device=/dev/infiniband --net=host --ipc=host --pid=host --gpus all -e NVIDIA_VISIBLE_DEVICES=all --rm -ti $IMG sh -c "find /usr/local/nixl -name '*so' | xargs -IF sh -c 'ldd F | grep -q libcudart && echo F && ldd F | grep libcudart'"

/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_UCX_MO.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007f4d8bc00000)
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GPUNETIO.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007f7b79600000)
        libcudart.so.12 => not found
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GDS_MT.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007fea2ea00000)
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_GDS.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007f42cd800000)
/usr/local/nixl/lib/x86_64-linux-gnu/plugins/libplugin_LIBFABRIC.so
        libcudart.so.13 => /usr/local/cuda/targets/x86_64-linux/lib/libcudart.so.13 (0x00007fb794200000)

PyTorch:

python3
Python 3.12.3 (main, Aug 14 2025, 17:47:21) [GCC 13.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print("PyTorch version:", torch.__version__)
PyTorch version: 2.10.0.dev20251008+cu130
>>> print("CUDA available:", torch.cuda.is_available())
CUDA available: True
>>> print("CUDA version:", torch.version.cuda)
CUDA version: 13.0
>>> print("cuDNN version:", torch.backends.cudnn.version())
cuDNN version: 91300

Python example:

/workspace/nixl/examples/python# ./nixl_api_example.py
2025-10-09 13:18:32 NIXL INFO    nixl_api_example.py:35 Using NIXL Plugins from:
/workspace/nixl/.venv/lib/python3.12/site-packages/.nixl.mesonpy.libs/plugins/
E1009 13:18:32.498575  158786 nixl_plugin_manager.cpp:122] Failed to load plugin from /workspace/nixl/.venv/lib/python3.12/site-packages/.nixl.mesonpy.libs/plugins/libplugin_GPUNETIO.so: libcudart.so.12: cannot open shared object file: No such file or directory
E1009 13:18:32.498604  158786 nixl_plugin_manager.cpp:288] Failed to load plugin 'GPUNETIO' from any directory
2025-10-09 13:18:36 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-10-09 13:18:36 NIXL INFO    _api.py:251 Initialized NIXL agent: target
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:44 Plugin parameters:
['DRAM_SEG', 'VRAM_SEG']
{'ucx_error_handling_mode': 'peer', 'num_workers': '1', 'ucx_devices': ''}
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:50 Backend parameters:
['DRAM_SEG', 'VRAM_SEG']
{}
2025-10-09 13:18:36 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-10-09 13:18:36 NIXL INFO    _api.py:251 Initialized NIXL agent: initiator
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:100 Loaded name from metadata: b'target'
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:130 Initiator done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:135 Target done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:130 Initiator done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:135 Target done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:153 sent notif:
b'DESCS: \x80\x04\x95\xb5\x00\x00\x00\x00\x00\x00\x00\x8c\x0enixl._bindings\x94\x8c\x0cnixlRegDList\x94\x93\x94)\x81\x94C\x8bnixlSerDes|nixlDList\n\x00\x00\x00\x00\x00\x00\x00nixlSDList|t\x04\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00|n\x08\x00\x00\x00\x00\x00\x00\x00\x02\x00\x00\x00\x00\x00\x00\x00|\x19\x00\x00\x00\x00\x00\x00\x00\xb0hD \x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00a|\x19\x00\x00\x00\x00\x00\x00\x00\xb0iD \x00\x00\x00\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00b|\x94b.'
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:160 received message from initiator
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:165 notif test complete, doing transfer 2
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:184 Transfer 2 started
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:194 Initiator done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:199 Target done
2025-10-09 13:18:36 NIXL INFO    nixl_api_example.py:212 Test Complete.

kvbench:

HOST=$(hostname | cut -d '.' -f 1)

etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://$HOST:2379 &

# Launch rank 0
export NIXL_ETCD_ENDPOINTS=$HOST:2379
export SLURM_PROCID=0
export SLURM_NTASKS=2
unset UCX_NET_DEVICES
export CUDA_VISIBLE_DEVICES=0,1
/workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml &

# Launch rank 1
export NIXL_ETCD_ENDPOINTS=$HOST:2379
export SLURM_PROCID=1
export SLURM_NTASKS=2
export CUDA_VISIBLE_DEVICES=0,1
unset UCX_NET_DEVICES
/workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml &

2025-10-09 13:24:55 NIXL INFO    _api.py:361 Backend UCX was instantiated
2025-10-09 13:24:55 NIXL INFO    _api.py:251 Initialized NIXL agent: 1
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:178 [Rank 0] Preparing TPs
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:178 [Rank 1] Preparing TPs
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:200 [Rank 0] Running isolated benchmark (to measure perf without noise)
2025-10-09 13:24:55 NIXL INFO    sequential_custom_traffic_perftest.py:200 [Rank 1] Running isolated benchmark (to measure perf without noise)
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:246 [Rank 1] Running workload benchmark
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:246 [Rank 0] Running workload benchmark
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:359 Iteration 1/3
  Transfer size (GB)    Latency (ms)    Isolated Latency (ms)    Num Senders    Mean BW (GB/s)
--------------------  --------------  -----------------------  -------------  ----------------
               0.365           1.176                    1.172              1           308.967
               1.046           3.287                    3.657              1           308.967
               1.321           4.550                    4.757              1           308.967
               0.758           2.609                    2.602              1           308.967
               1.170           3.827                    3.899              1           308.967
               0.716           2.778                    2.403              1           308.967
               0.783           2.480                    2.554              1           308.967
               0.354           1.179                    1.144              1           308.967
               0.643           2.034                    2.143              1           308.967
               0.854           2.765                    2.955              1           308.967
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:359 Iteration 2/3
  Transfer size (GB)    Latency (ms)    Isolated Latency (ms)    Num Senders    Mean BW (GB/s)
--------------------  --------------  -----------------------  -------------  ----------------
               0.365           1.187                    1.172              1           305.126
               1.046           3.285                    3.657              1           305.126
               1.321           4.519                    4.757              1           305.126
               0.758           2.905                    2.602              1           305.126
               1.170           3.832                    3.899              1           305.126
               0.716           2.759                    2.403              1           305.126
               0.783           2.596                    2.554              1           305.126
               0.354           1.364                    1.144              1           305.126
               0.643           2.033                    2.143              1           305.126
               0.854           2.800                    2.955              1           305.126
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:405 [Rank 1] Finished run, destroying objects
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:359 Iteration 3/3
  Transfer size (GB)    Latency (ms)    Isolated Latency (ms)    Num Senders    Mean BW (GB/s)
--------------------  --------------  -----------------------  -------------  ----------------
               0.365           1.179                    1.172              1           304.090
               1.046           3.280                    3.657              1           304.090
               1.321           4.554                    4.757              1           304.090
               0.758           3.105                    2.602              1           304.090
               1.170           3.864                    3.899              1           304.090
               0.716           2.567                    2.403              1           304.090
               0.783           2.533                    2.554              1           304.090
               0.354           1.142                    1.144              1           304.090
               0.643           2.028                    2.143              1           304.090
               0.854           2.810                    2.955              1           304.090
2025-10-09 13:24:56 NIXL INFO    sequential_custom_traffic_perftest.py:405 [Rank 0] Finished run, destroying objects

[1]-  Done                    /workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml
[2]+  Done                    /workspace/nixl/.venv/bin/python /workspace/nixl/benchmark/kvbench/main.py sequential-ct-perftest matrices_2ranks/metadata.yaml

👋 Hi ovidiusm! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

@ovidiusm
/build

@ovidiusm commented Oct 9, 2025
/build

1 similar comment

@ovidiusm commented Oct 9, 2025
/build

Signed-off-by: Ovidiu Mara <[email protected]>

@ovidiusm
/build

@aranadive
/build

@aranadive aranadive merged commit 9ada51f into ai-dynamo:main Oct 13, 2025
21 checks passed
@ovidiusm ovidiusm deleted the nixlbench-cuda-13 branch October 13, 2025 07:35