Skip to content

VAES should not be restricted to AVX512 #1343

@SchrodingerZhu

Description

@SchrodingerZhu

pub unsafe fn _mm256_aesdec_epi128(a: __m256i, round_key: __m256i) -> __m256i {

It seems that the stdarch library unconditionally put VAES related intrinsics into the AVX512VL scope while VAES is actually available on platforms without AVX512 support (AMD Zen3).

I spot the issue when I was investigating VAES with ahash: tkaitchuck/aHash#85

When using _mm256_aesenc_epi128, instead of generating something like

   2cf7e:       c5 d1 6c ee             vpunpcklqdq %xmm6,%xmm5,%xmm5
   2cf82:       c4 c1 f9 6e f2          vmovq  %r10,%xmm6
   2cf87:       c5 c9 6c f7             vpunpcklqdq %xmm7,%xmm6,%xmm6
   2cf8b:       c4 e2 7d dc c4          vaesenc %ymm4,%ymm0,%ymm0
   2cf90:       c4 e2 75 dc cb          vaesenc %ymm3,%ymm1,%ymm1
   2cf95:       c4 e3 7d 38 f6 01       vinserti128 $0x1,%xmm6,%ymm0,%ymm6
   2cf9b:       c4 e3 55 02 ee f0       vpblendd $0xf0,%ymm6,%ymm5,%ymm5
   2cfa1:       c4 e2 55 00 ea          vpshufb %ymm2,%ymm5,%ymm5
   2cfa6:       c5 e5 d4 dd             vpaddq %ymm5,%ymm3,%ymm3
   2cfaa:       c4 e2 65 00 da          vpshufb %ymm2,%ymm3,%ymm3
   2cfaf:       c5 dd d4 db             vpaddq %ymm3,%ymm4,%ymm3
   2cfb3:       c4 e3 7d 39 dc 01       vextracti128 $0x1,%ymm3,%xmm4
   2cfb9:       c4 e3 f9 16 d8 01       vpextrq $0x1,%xmm3,%rax
   2cfbf:       c4 e1 f9 7e d9          vmovq  %xmm3,%rcx
   2cfc4:       c4 c3 f9 16 e1 01       vpextrq $0x1,%xmm4,%r9
   2cfca:       c4 c1 f9 7e e2          vmovq  %xmm4,%r10

the compiler generates

  2cf8d:       c5 fc 29 84 24 80 00    vmovaps %ymm0,0x80(%rsp)
   2cf94:       00 00 
   2cf96:       c5 fd 7f 94 24 20 01    vmovdqa %ymm2,0x120(%rsp)
   2cf9d:       00 00 
   2cf9f:       c5 fc 29 8c 24 40 01    vmovaps %ymm1,0x140(%rsp)
   2cfa6:       00 00 
   2cfa8:       c5 f8 77                vzeroupper
   2cfab:       e8 c0 db ff ff          call   2ab70 <core::core_arch::x86::avx512vaes::_mm256_aesenc_epi128::hdf0bd31f9011eecd>
   2cfb0:       c5 fd 6f 84 24 a0 00    vmovdqa 0xa0(%rsp),%ymm0
   2cfb7:       00 00 
   2cfb9:       c5 fd 6f 1c 24          vmovdqa (%rsp),%ymm3
   2cfbe:       c5 fd d4 44 24 60       vpaddq 0x60(%rsp),%ymm0,%ymm0
   2cfc4:       c4 e2 7d 00 05 d3 ad    vpshufb 0x8add3(%rip),%ymm0,%ymm0        # b7da0 <_fini+0x10a0>
   2cfcb:       08 00 
   2cfcd:       c5 fd d4 44 24 40       vpaddq 0x40(%rsp),%ymm0,%ymm0
   2cfd3:       c4 c3 f9 16 c7 01       vpextrq $0x1,%xmm0,%r15
   2cfd9:       c4 e1 f9 7e c3          vmovq  %xmm0,%rbx
   2cfde:       c4 e3 7d 39 c0 01       vextracti128 $0x1,%ymm0,%xmm0
   2cfe4:       c4 c3 f9 16 c6 01       vpextrq $0x1,%xmm0,%r14
   2cfea:       c4 e1 f9 7e c6          vmovq  %xmm0,%rsi

It is very strange that the compiler do not inline the function call under release profile even if target-cpu=native is set.
However, when I explicit write

    extern "C" {
        #[link_name = "llvm.x86.aesni.aesenc.256"]
        fn aesenc_256(a: __m256i, round_key: __m256i) -> __m256i;
    }

    unsafe {
        transmute(aesenc_256(transmute(value), transmute(xor)))
    }

The compiler will give the upper asm as expected.

I suspect that this is because of the intrinsic being marked as avx512vl instruction.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions