Packed shift #110

Open
penzn opened this issue Sep 23, 2019 · 2 comments

@penzn
Contributor

penzn commented Sep 23, 2019

I looked into the single-precision Mersenne Twister after @ngzhian did the double-precision port, and its core function relies on "shift bytes" behavior, represented by PSLLDQ/PSRLDQ on x86 and VEXT on Arm:

https://github.com/penzn/SFMT-wasm/blob/master/SFMT-sse2.h#L37
https://github.com/penzn/SFMT-wasm/blob/master/SFMT-neon.h#L36

In the current state of this proposal the only way to achieve this result is to use a shuffle, which lowers to hardware shuffle instructions. Using the hardware "shift bytes" instructions might yield better throughput, though it would not hurt to prototype this to know for sure. If they do, there are two options:

  • Weaken shuffle in the code generator if the indices are consecutive
  • Add a shift bytes instruction

I am not sure how easy the former is; I remember there was talk about doing the same thing to leverage shuffles with wider lanes, but I am not sure any implementation does that.

I have worked on a Mersenne Twister Wasm port which uses shuffle to get the same results; I will try to post it online, stay tuned.
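
To make the pattern concrete, here is a minimal sketch of the shuffle-based emulation (shift_right_bytes/shift_left_bytes are hypothetical helpers, not part of the proposal), assuming the wasm_simd128.h intrinsics and a constant byte count N: the byte shift becomes a shuffle of the input with a zero vector using consecutive indices.

#include <wasm_simd128.h>

/* Emulate PSRLDQ: result byte i = a byte (i + N), zero when i + N >= 16.
   Indices 16..31 select from the second operand, which is all zeros. */
#define shift_right_bytes(a, N)                                                       \
    wasm_v8x16_shuffle((a), wasm_i32x4_const(0, 0, 0, 0),                             \
        (N) + 0, (N) + 1, (N) + 2,  (N) + 3,  (N) + 4,  (N) + 5,  (N) + 6,  (N) + 7,  \
        (N) + 8, (N) + 9, (N) + 10, (N) + 11, (N) + 12, (N) + 13, (N) + 14, (N) + 15)

/* Emulate PSLLDQ: result byte i = a byte (i - N), zero when i < N.
   The zero vector comes first, so indices below 16 produce zeros. */
#define shift_left_bytes(a, N)                                                        \
    wasm_v8x16_shuffle(wasm_i32x4_const(0, 0, 0, 0), (a),                             \
        16 - (N) + 0,  16 - (N) + 1,  16 - (N) + 2,  16 - (N) + 3,                    \
        16 - (N) + 4,  16 - (N) + 5,  16 - (N) + 6,  16 - (N) + 7,                    \
        16 - (N) + 8,  16 - (N) + 9,  16 - (N) + 10, 16 - (N) + 11,                   \
        16 - (N) + 12, 16 - (N) + 13, 16 - (N) + 14, 16 - (N) + 15)

Since the indices are consecutive, a code generator that recognizes the pattern could emit PSRLDQ/PSLLDQ (or VEXT with a zero register) instead of a general shuffle.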

penzn changed the title from "Shift bytes" to "Packed shift" on Sep 24, 2019
@penzn
Contributor Author

penzn commented Oct 17, 2019

Here are the SSE, Neon, and Wasm SIMD implementations. Packed shift exists on both native platforms.

SSE:

inline static void mm_recursion(__m128i * r, __m128i a, __m128i b, __m128i c, __m128i d)
{
    __m128i v, x, y, z;

    y = _mm_srli_epi32(b, SFMT_SR1);
    z = _mm_srli_si128(c, SFMT_SR2);   /* byte shift right (PSRLDQ) */
    v = _mm_slli_epi32(d, SFMT_SL1);
    z = _mm_xor_si128(z, a);
    z = _mm_xor_si128(z, v);
    x = _mm_slli_si128(a, SFMT_SL2);   /* byte shift left (PSLLDQ) */
    y = _mm_and_si128(y, simd128_param_mask.si);
    z = _mm_xor_si128(z, x);
    z = _mm_xor_si128(z, y);
    *r = z;
}

Neon:

inline static void neon_recursion(uint32x4_t * r, uint32x4_t a, uint32x4_t b, uint32x4_t c, uint32x4_t d)
{
    uint32x4_t v, x, y, z;
    static const uint32x4_t vzero = {0,0,0,0};
    static const uint32x4_t vmask = {SFMT_MSK1, SFMT_MSK2, SFMT_MSK3, SFMT_MSK4};

/* vextq_u8 extracts 16 bytes starting at byte C from the concatenation of A and B. */
#define rotate_bytes(A, B, C) vreinterpretq_u32_u8(vextq_u8(vreinterpretq_u8_u32(A),vreinterpretq_u8_u32(B),(C)))

    y = vshrq_n_u32(b, SFMT_SR1);
    z = rotate_bytes(c, vzero, SFMT_SR2);
    v = vshlq_n_u32(d, SFMT_SL1);
    z = veorq_u32(z, a);
    z = veorq_u32(z, v);
    x = rotate_bytes(vzero, a, 16-SFMT_SL2);
    y = vandq_u32(y, vmask);
    z = veorq_u32(z, x);
    z = veorq_u32(z, y);
    *r = z;
}

Wasm SIMD:

inline static void v128_recursion(v128_t * r, v128_t a, v128_t b, v128_t c, v128_t d)
{
    v128_t v, x, y, z;
    static const v128_t vzero = wasm_i32x4_const(0,0,0,0);

/* Take 16 consecutive bytes starting at byte C of the concatenation A:B
   (the same pattern PSRLDQ/PSLLDQ/VEXT implement natively). */
#define rotate_bytes(A, B, C) wasm_v8x16_shuffle((A), (B), (C), ((C) + 1), ((C) + 2), ((C) + 3), ((C) + 4), ((C) + 5), ((C) + 6), ((C) + 7), ((C) + 8), ((C) + 9), ((C) + 10), ((C) + 11), ((C) + 12), ((C) + 13), ((C) + 14), ((C) + 15))

    y = wasm_u32x4_shr(b, SFMT_SR1);
    z = rotate_bytes(c, vzero, SFMT_SR2);
    v = wasm_i32x4_shl(d, SFMT_SL1);
    z = wasm_v128_xor(z, a);
    z = wasm_v128_xor(z, v);
    x = rotate_bytes(vzero, a, 16 - SFMT_SL2);
    y = wasm_v128_and(y, simd128_param_mask.si);
    z = wasm_v128_xor(z, x);
    z = wasm_v128_xor(z, y);
    *r = z;
}

@zeux
Contributor

zeux commented Mar 6, 2020

Just a note that v8 lowers the shuffles here to 1 instruction on both x64 and arm64: https://github.com/zeux/wasm-simd/blob/master/Shuffles.md#concats. Of course this is subject to LLVM not pessimizing the shuffle, but otherwise it seems to be in the general category of a shuffle that can be pattern-matched.
