[SYCLCompat] Optimize/(fix?) permute_sub_group_by_xor if `logical_sub_group_size == 32` #16646

JackAKirk · 2025-01-15T16:30:59Z

`syclcompat::permute_sub_group_by_xor` was reported to fail intermittently on Level Zero (L0). Closer inspection revealed that the implementation of `permute_sub_group_by_xor` is incorrect when `logical_sub_group_size != 32`, which is one of the test cases. This implies that the test itself is wrong.

This PR first optimizes the part of the implementation that is valid, assuming the Intel SPIR-V builtins are correct: the case `logical_sub_group_size == 32`, which is also the only case a user will realistically program. The aim is to:

- Ensure the only useful case works via the correct, optimized route.
- Check that this improvement doesn't break the suspicious test.

A follow-on PR can fix the other cases, where `logical_sub_group_size != 32`. This is better done later, since:

- The only use case I know of for this is implementing non-uniform group algorithms, which we already have implemented (e.g. see [SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671); any user is advised to use those algorithms instead of reimplementing them.
- I think it will require a complete reworking of the test, which would otherwise delay the more important change here.

Signed-off-by: JackAKirk <[email protected]>

JackAKirk · 2025-01-15T16:38:16Z

SYCLomatic translates `__shfl_xor_sync()` to `permute_sub_group_by_xor`.
`__shfl_xor_sync()` is defined as follows (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-description):

"
__shfl_xor_sync() calculates a source lane ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
"
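To make the quoted rule concrete, here is a minimal host-side sketch of the source-lane selection it describes. It is plain C++, not device code. The function name `shfl_xor_source_lane`, the constant `kWarpSize`, and the restriction of `width` to powers of two are assumptions made for illustration, not part of CUDA or SYCLcompat.

```cpp
#include <cassert>

// Assumed warp size for this model (hypothetical constant name).
constexpr unsigned kWarpSize = 32;

// Host-side model of which lane the caller reads from, following the
// quoted description. `width` is assumed to be a power of two in [1, 32].
inline unsigned shfl_xor_source_lane(unsigned lane, unsigned lane_mask,
                                     unsigned width = kWarpSize) {
  assert(lane < kWarpSize);
  assert(width >= 1 && width <= kWarpSize && (width & (width - 1)) == 0);
  // Candidate source: bitwise XOR of the caller's lane ID with the mask.
  // Modelling assumption: only the low five bits of the mask are used.
  const unsigned src = lane ^ (lane_mask & (kWarpSize - 1));
  // Last lane of the caller's own group of `width` consecutive lanes.
  const unsigned group_last = (lane / width) * width + (width - 1);
  // Own or earlier group: read from `src`. Later group: keep own value.
  return src <= group_last ? src : lane;
}
```

Under this model, with `width == 16` and `laneMask == 16`, lane 19 reads from lane 3 (an earlier group), while lane 3 would have to read from lane 19 (a later group) and so keeps its own value.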

However, as per its own description at

https://github.com/intel/llvm/blob/sycl/sycl/include/syclcompat/util.hpp#L291

permute_sub_group_by_xor is implemented according to a different definition unless logical_sub_group_size == 32.
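The following host-side sketch illustrates how two definitions can agree at size 32 and diverge below it. It assumes, purely for illustration, that the alternative definition confines the XOR to the caller's own logical sub-group and returns the caller's value otherwise. That alternative (`group_relative_source_lane`) is a hypothetical model and is not copied from `util.hpp`; all names here are invented for the example.

```cpp
#include <cassert>

// Assumed number of lanes in the physical sub-group (hypothetical name).
constexpr unsigned kLanes = 32;

// Model of the CUDA rule: XOR on the absolute lane ID; lanes in the
// caller's own or an earlier group are readable, later groups are not.
inline unsigned cuda_source_lane(unsigned lane, unsigned mask,
                                 unsigned width) {
  const unsigned src = lane ^ mask;
  const unsigned group_last = (lane / width) * width + (width - 1);
  return src <= group_last ? src : lane;
}

// ASSUMED alternative rule (hypothetical): XOR on the offset inside the
// caller's logical group; anything outside that group yields the caller.
inline unsigned group_relative_source_lane(unsigned lane, unsigned mask,
                                           unsigned width) {
  const unsigned group_first = (lane / width) * width;
  const unsigned offset = (lane % width) ^ mask;
  return offset < width ? group_first + offset : lane;
}

// Counts (mask, lane) pairs on which the two models pick different lanes.
// `width` is assumed to be a power of two in [1, 32].
inline unsigned count_mismatches(unsigned width) {
  assert(width >= 1 && width <= kLanes && (width & (width - 1)) == 0);
  unsigned mismatches = 0;
  for (unsigned mask = 0; mask < kLanes; ++mask)
    for (unsigned lane = 0; lane < kLanes; ++lane)
      if (cuda_source_lane(lane, mask, width) !=
          group_relative_source_lane(lane, mask, width))
        ++mismatches;
  return mismatches;
}
```

In this model the two rules never differ when `width == 32`, and differ for `width == 16` only when the mask reaches into an earlier group (for example lane 19 with mask 16), which is consistent with the observation that only the `!= 32` cases are affected.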

Signed-off-by: JackAKirk <[email protected]>

JackAKirk · 2025-01-16T13:50:41Z

@intel/syclcompat-lib-reviewers this is ready for review now.

Signed-off-by: JackAKirk <[email protected]>

GeorgeWeb

Looks good

JackAKirk · 2025-01-20T12:00:02Z

@intel/llvm-gatekeepers this is ready for merge.

Thanks

Optimize/(fix?) permute_sub_group_by_xor

1b09219

Signed-off-by: JackAKirk <[email protected]>

JackAKirk requested a review from a team as a code owner January 15, 2025 16:31

JackAKirk temporarily deployed to WindowsCILock January 15, 2025 16:32 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock January 15, 2025 17:13 — with GitHub Actions Inactive

Split test into two test cases for easier debugging.

bf13d41

Signed-off-by: JackAKirk <[email protected]>

JackAKirk had a problem deploying to WindowsCILock January 16, 2025 11:30 — with GitHub Actions Error

Add missing host_dev_data_u

affc058

Signed-off-by: JackAKirk <[email protected]>

JackAKirk temporarily deployed to WindowsCILock January 16, 2025 12:11 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock January 16, 2025 12:39 — with GitHub Actions Inactive

Remove unnecessary else statement

bfe680e

Signed-off-by: JackAKirk <[email protected]>

JackAKirk temporarily deployed to WindowsCILock January 16, 2025 13:53 — with GitHub Actions Inactive

JackAKirk temporarily deployed to WindowsCILock January 16, 2025 14:22 — with GitHub Actions Inactive

GeorgeWeb approved these changes Jan 20, 2025

martygrant merged commit 291eeee into intel:sycl Jan 20, 2025
17 checks passed
