@varun-sundar-rabindranath commented Sep 16, 2025

Intranode combine kernels fail for arbitrary hidden sizes.

The kernel already has all the support required to handle arbitrary hidden sizes. This PR simply fixes a spot that seemed to assume hidden_int4 % 32 == 0.

A for-loop inside the kernel invokes __syncwarp(). This is fine when every thread in the warp executes the __syncwarp() (the common case), but with arbitrary hidden sizes it does not always hold: there can be a residual tail in which only some threads of a warp enter the for-loop. To handle this, the PR builds appropriate warp masks for __syncwarp() to be called with, covering exactly the participating lanes (see the sketch below).
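
For illustration, a minimal sketch of the pattern in question, under hypothetical names (combine_chunk, lane_id, and hidden_int4 as the hidden size measured in int4 chunks) rather than DeepEP's exact kernel code:

__device__ void combine_chunk(int lane_id, int hidden_int4) {
    // The warp strides over hidden_int4 elements, 32 lanes at a time. When
    // hidden_int4 % 32 != 0, the final iteration is entered only by lanes
    // with lane_id < hidden_int4 % 32.
    for (int i = lane_id; i < hidden_int4; i += 32) {
        int base = i - lane_id;                        // common start of this iteration
        int num_active = min(32, hidden_int4 - base);  // < 32 only on the tail
        unsigned warp_mask = num_active >= 32 ? 0xffffffffu
                                              : ((1u << num_active) - 1u);

        // ... per-element combine work elided ...

        // Every lane named in warp_mask reaches this call with the same mask,
        // as __syncwarp(mask) requires; an unmasked __syncwarp() here would
        // name lanes that never arrive on the tail iteration.
        __syncwarp(warp_mask);
    }
}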

Test:
Tested with hidden size 2880 (the gpt-oss hidden dimension).

This PR

python3 test_intranode.py --num-processes=2 --hidden 2880 
[config] num_tokens=4096, hidden=2880, num_topk=8
[layout] Kernel performance: 0.033 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4: 97.48 GB/s (NVL), avg_t: 482.75 us
[tuning] SMs 24, NVL chunk 6: 124.54 GB/s (NVL), avg_t: 377.87 us
[tuning] SMs 24, NVL chunk 8: 139.62 GB/s (NVL), avg_t: 337.05 us
[tuning] SMs 24, NVL chunk 10: 150.22 GB/s (NVL), avg_t: 313.26 us
...

main

python3 test_intranode.py --num-processes=2 --hidden 2880  

[config] num_tokens=4096, hidden=2880, num_topk=8
[layout] Kernel performance: 0.034 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ...WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
[rank0]:[W916 00:21:13.202446982 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0916 00:21:14.463000 1862405 torch/multiprocessing/spawn.py:169] Terminating process 1862487 via signal SIGTERM
Traceback (most recent call last):
  File "/home/varun/code/deps/DeepEP/tests/test_intranode.py", line 277, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, args), nprocs=num_processes)
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/varun/code/deps/DeepEP/tests/test_intranode.py", line 247, in test_loop
    test_main(args, i, local_rank, num_ranks, rank, buffer, group)
  File "/home/varun/code/deps/DeepEP/tests/test_intranode.py", line 160, in test_main
    assert calc_diff(check_x, ref_x) < 5e-6
AssertionError

Varun Sundar Rabindranath added 2 commits September 16, 2025 15:09
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@varun-sundar-rabindranath (Author) commented:

Requesting review from @LyricZhao as this PR touches elect_one_sync. Thanks 🙌

out_dtypes[j] = static_cast<dtype_t>(values[j]);

#ifndef DISABLE_SM90_FEATURES
auto const warp_mask = __activemask();
@LyricZhao (Collaborator) commented on the diff above:

Sorry, __activemask cannot guarantee that the mask is right. For example, even when the hidden size makes all 32 lanes work together, some lanes may still be waiting for a global memory load to arrive, leading to divergence.

@varun-sundar-rabindranath (Author) replied:

I see, thanks @LyricZhao. I have updated the code to compute the active mask directly. PTAL!
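
A hedged sketch of the contrast (illustrative helper, not the PR's exact diff): the mask passed to __syncwarp() should describe the lanes that are supposed to participate in the iteration, which can be derived from the loop bounds, whereas __activemask() only reports the lanes that happen to be converged at that instruction and may omit lanes stalled on a global memory load.

// Hypothetical helper: lanes that should take part in the warp iteration
// starting at element index 'base' of a row of hidden_int4 elements.
__device__ __forceinline__ unsigned participating_lanes(int hidden_int4, int base) {
    int num_active = min(32, hidden_int4 - base);  // < 32 only on the tail
    return num_active >= 32 ? 0xffffffffu : ((1u << num_active) - 1u);
}

// __syncwarp(participating_lanes(hidden_int4, base)) is deterministic, while
// __syncwarp(__activemask()) can legally see only a subset of those lanes
// under divergence and is therefore not a safe substitute.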

Varun Sundar Rabindranath added 2 commits September 17, 2025 04:11
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@sphish (Collaborator) commented Sep 17, 2025:

We have fixed this issue in #413 using an alternative approach, which also resolved the FP8 tests. However, the current internode kernels still do not support cases where hidden_size % 128 != 0; we plan to address this in a future refactoring.

@sphish closed this Sep 17, 2025.