Skip to content

Conversation

JackAKirk
Copy link
Contributor

@JackAKirk JackAKirk commented Jan 15, 2025

syclcompat::permute_sub_group_by_xor was reported to flakily fail on L0. Closer inspection revealed that the implementation of permute_sub_group_by_xor is incorrect for cases where logical_sub_group_size != 32, which is one of the test cases. This implies that the test itself is wrong.

In this PR we first optimize the part of the implementation that is valid assuming that Intel spirv builtins are correct (which is also the only case realistically a user will program): case logical_sub_group_size == 32, in order to:

  • Ensure the only useful case is working via the correct optimized route.
  • Check that this improvement doesn't break the suspicious test.

A follow on PR can fix the other cases where logical_sub_group_size != 32: this is better to do later, since

  • the only use case I know of for this is to implement non-uniform group algorithms that we already have implemented (e.g. see [SYCL][CUDA] Non-uniform algorithm implementations for ext_oneapi_cuda. #9671) and any user is advised to use such algorithms instead of reimplementing them themselves.
  • This must I think require a complete reworking of the test and would otherwise delay the more important change here.

@JackAKirk JackAKirk requested a review from a team as a code owner January 15, 2025 16:31
@JackAKirk
Copy link
Contributor Author

JackAKirk commented Jan 15, 2025

syclomatic translates __shfl_xor_sync() to permute_sub_group_by_xor (
__shfl_xor_sync() is defined as (see https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#warp-shuffle-description):

"
__shfl_xor_sync() calculates a source lane ID by performing a bitwise XOR of the caller’s lane ID with laneMask: the value of var held by the resulting lane ID is returned. If width is less than warpSize then each group of width consecutive threads are able to access elements from earlier groups of threads, however if they attempt to access elements from later groups of threads their own value of var will be returned. This mode implements a butterfly addressing pattern such as is used in tree reduction and broadcast.
"

However as per its own description

https://github.com/intel/llvm/blob/sycl/sycl/include/syclcompat/util.hpp#L291

permute_sub_group_by_xor is implemented according to a different definition unless logical_sub_group_size == 32.

@JackAKirk
Copy link
Contributor Author

@intel/syclcompat-lib-reviewers this is ready for review now.

Copy link
Contributor

@GeorgeWeb GeorgeWeb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@JackAKirk
Copy link
Contributor Author

@intel/llvm-gatekeepers this is ready for merge.

Thanks

@martygrant martygrant merged commit 291eeee into intel:sycl Jan 20, 2025
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants