@varun-sundar-rabindranath commented Sep 16, 2025

Intranode combine kernels fail for arbitrary hidden sizes.

The kernel already has all the support required to handle arbitrary hidden sizes. This PR simply fixes a spot that seemed to assume hidden_int4 % 32 == 0.

A for-loop inside the kernel invokes __syncwarp(). This is fine when every thread in the warp executes the __syncwarp() (the common case), but with arbitrary hidden sizes it does not always hold: there can be a residual tail in which only some threads of a warp enter the for-loop. To handle this, the PR builds appropriate warp masks for __syncwarp() to be called with, covering exactly the participating lanes (see the sketch below).
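
For illustration, a minimal sketch of the pattern in question, under hypothetical names (combine_chunk, lane_id, and hidden_int4 as the hidden size measured in int4 chunks) rather than DeepEP's exact kernel code:

__device__ void combine_chunk(int lane_id, int hidden_int4) {
    // The warp strides over hidden_int4 elements, 32 lanes at a time. When
    // hidden_int4 % 32 != 0, the final iteration is entered only by lanes
    // with lane_id < hidden_int4 % 32.
    for (int i = lane_id; i < hidden_int4; i += 32) {
        int base = i - lane_id;                        // common start of this iteration
        int num_active = min(32, hidden_int4 - base);  // < 32 only on the tail
        unsigned warp_mask = num_active >= 32 ? 0xffffffffu
                                              : ((1u << num_active) - 1u);

        // ... per-element combine work elided ...

        // Every lane named in warp_mask reaches this call with the same mask,
        // as __syncwarp(mask) requires; an unmasked __syncwarp() here would
        // name lanes that never arrive on the tail iteration.
        __syncwarp(warp_mask);
    }
}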

Test:
Tested with hidden size 2880 (the gpt-oss hidden dimension).

This PR

python3 test_intranode.py --num-processes=2 --hidden 2880 
[config] num_tokens=4096, hidden=2880, num_topk=8
[layout] Kernel performance: 0.033 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=False) ... passed
[testing] Running with BF16, with top-k (async=False, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=True, previous=False) ... passed
[testing] Running with BF16, with top-k (async=True, previous=False) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=False, previous=True) ... passed
[testing] Running with BF16, with top-k (async=False, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed
[testing] Running with BF16, without top-k (async=True, previous=True) ... passed
[testing] Running with BF16, with top-k (async=True, previous=True) ... passed

[tuning] SMs 24, NVL chunk 4: 97.48 GB/s (NVL), avg_t: 482.75 us
[tuning] SMs 24, NVL chunk 6: 124.54 GB/s (NVL), avg_t: 377.87 us
[tuning] SMs 24, NVL chunk 8: 139.62 GB/s (NVL), avg_t: 337.05 us
[tuning] SMs 24, NVL chunk 10: 150.22 GB/s (NVL), avg_t: 313.26 us
...

main

python3 test_intranode.py --num-processes=2 --hidden 2880  

[config] num_tokens=4096, hidden=2880, num_topk=8
[layout] Kernel performance: 0.034 ms

[testing] Running with BF16, without top-k (async=False, previous=False) ...WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
WARNING: destroy() was not called before DeepEP buffer destruction, which can leak resources.
[rank0]:[W916 00:21:13.202446982 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0916 00:21:14.463000 1862405 torch/multiprocessing/spawn.py:169] Terminating process 1862487 via signal SIGTERM
Traceback (most recent call last):
  File "/home/varun/code/deps/DeepEP/tests/test_intranode.py", line 277, in <module>
    torch.multiprocessing.spawn(test_loop, args=(num_processes, args), nprocs=num_processes)
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 340, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 296, in start_processes
    while not context.join():
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 215, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException: 

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/varun/code/deps/deps-test/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 90, in _wrap
    fn(i, *args)
  File "/home/varun/code/deps/DeepEP/tests/test_intranode.py", line 247, in test_loop
    test_main(args, i, local_rank, num_ranks, rank, buffer, group)
  File "/home/varun/code/deps/DeepEP/tests/test_intranode.py", line 160, in test_main
    assert calc_diff(check_x, ref_x) < 5e-6
AssertionError

Varun Sundar Rabindranath added 2 commits September 16, 2025 15:09
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@varun-sundar-rabindranath (Author) commented:

Requesting review from @LyricZhao as this PR touches elect_one_sync. Thanks 🙌

out_dtypes[j] = static_cast<dtype_t>(values[j]);

#ifndef DISABLE_SM90_FEATURES
auto const warp_mask = __activemask();
@LyricZhao (Collaborator) commented on the diff above:

Sorry, __activemask cannot guarantee that the mask is right. For example, even when the hidden size makes all 32 lanes work together, some lanes may still be waiting for a global memory load to arrive, leading to divergence.

@varun-sundar-rabindranath (Author) replied:

I see, thanks @LyricZhao. I have updated the code to compute the active mask directly. PTAL!
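
A hedged sketch of the contrast (illustrative helper, not the PR's exact diff): the mask passed to __syncwarp() should describe the lanes that are supposed to participate in the iteration, which can be derived from the loop bounds, whereas __activemask() only reports the lanes that happen to be converged at that instruction and may omit lanes stalled on a global memory load.

// Hypothetical helper: lanes that should take part in the warp iteration
// starting at element index 'base' of a row of hidden_int4 elements.
__device__ __forceinline__ unsigned participating_lanes(int hidden_int4, int base) {
    int num_active = min(32, hidden_int4 - base);  // < 32 only on the tail
    return num_active >= 32 ? 0xffffffffu : ((1u << num_active) - 1u);
}

// __syncwarp(participating_lanes(hidden_int4, base)) is deterministic, while
// __syncwarp(__activemask()) can legally see only a subset of those lanes
// under divergence and is therefore not a safe substitute.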

Varun Sundar Rabindranath added 2 commits September 17, 2025 04:11
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
Signed-off-by: Varun Sundar Rabindranath <[email protected]>
@sphish (Collaborator) commented Sep 17, 2025:

We have fixed this issue in #413 using an alternative approach, which also resolved the FP8 tests. However, the current internode kernels still do not support cases where hidden_size % 128 != 0; we plan to address this in a future refactoring.

@sphish closed this Sep 17, 2025.