Conversation

tannergooding
Member

This continues the work on #49397, which started with #53450.

In particular, this moves Narrow and Widen to be implemented using the general SIMDAsHWIntrinsic logic and then has the new APIs in Vector64/128/256 use the same shared entry points (a usage sketch follows the list below).

There will be a few more PRs after this one covering:

  • Floating-point to Integer conversions
  • CompareAll and CompareAny for each comparison type (GT, GE, LT, LE, etc.)
  • Native Integer support
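
As a usage sketch of the shape these APIs take (hedged: signatures inferred from the in-progress Vector64/128/256 surface discussed below, not copied from this PR's diff):

using System.Runtime.Intrinsics;

Vector128<byte> bytes = Vector128.Create((byte)1);

// Widen splits each narrow element into a wider one, producing two vectors.
(Vector128<ushort> lower, Vector128<ushort> upper) = Vector128.Widen(bytes);

// Narrow is the inverse: two wide vectors fold back into one narrow vector,
// so this round-trips back to the original value.
Vector128<byte> roundTripped = Vector128.Narrow(lower, upper);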

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 6, 2021
@ghost

ghost commented Oct 6, 2021

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.


@tannergooding
Member Author

Will run outerloop jobs (azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop) after the initial CI results come back as successful.

Also plan on collecting PMI diffs before marking this ready-for-review.

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@tannergooding tannergooding marked this pull request as ready for review October 8, 2021 23:07
@azure-pipelines

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Member Author

No diff for benchmarks.

==============================

The Frameworks collection has the following improvements:

Top method improvements (bytes):
        -605 (-68.52% of base) : System.Private.CoreLib.dasm - Vector:Widen(Vector`1,byref,byref) (7 methods)
        -605 (-66.92% of base) : System.Private.CoreLib.dasm - Vector256:Widen(Vector256`1):ValueTuple`2 (7 methods)
        -545 (-69.25% of base) : System.Private.CoreLib.dasm - Vector128:Widen(Vector128`1):ValueTuple`2 (7 methods)
        -431 (-55.19% of base) : System.Private.CoreLib.dasm - Vector64:Widen(Vector64`1):ValueTuple`2 (7 methods)
         -24 (-20.34% of base) : System.Private.CoreLib.dasm - Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long)

Noting that the Vector:Widen improvements are due to the code being "vectorized" now, since it forwards to the WidenLower and WidenUpper helper intrinsics. The actual interesting change is to WidenLatin1ToUtf16_Fallback, which is a small improvement in codegen vs the previous:

; Latin1Utility:WidenLatin1ToUtf16_Fallback(long,long,long)
        vmovupd  ymm0, ymmword ptr[rcx+rax]
-       vpermq   ymm2, ymm0, -44
-       vxorps   ymm1, ymm1
-       vpunpcklbw ymm2, ymm1
-       vpermq   ymm0, ymm0, -24
-       vxorps   ymm1, ymm1
-       vpunpckhbw ymm0, ymm1
+       vmovaps  ymm1, ymm0
+       vpmovzxbw ymm1, ymm1
+       vextractf128 xmm0, xmm0, 1
+       vpmovzxbw ymm0, ymm0

and (no diff shown against the old software implementation; these are just vectorized examples):

; Vector:Widen(Vector`1,byref,byref)
       vmovdqu  ymm0, ymmword ptr[rcx]
       vpmovzxbw ymm0, ymm0
       vmovupd  ymmword ptr[rdx], ymm0
       vmovupd  ymm0, ymmword ptr[rcx]
       vextractf128 xmm0, xmm0, 1
       vpmovzxbw ymm0, ymm0
       vmovupd  ymmword ptr[r8], ymm0

; Vector256:Widen(Vector256`1):ValueTuple`2
       vmovupd  ymm0, ymmword ptr[rdx]
       vmovaps  ymm1, ymm0
       vpmovzxbw ymm1, ymm1
       vextractf128 xmm0, xmm0, 1
       vpmovzxbw ymm0, ymm0
       vmovupd  ymmword ptr[rcx], ymm1
       vmovupd  ymmword ptr[rcx+32], ymm0

; Vector128:Widen(Vector128`1):ValueTuple`2
       vmovupd  xmm0, xmmword ptr [rdx]
       vpmovzxbw xmm1, xmm0
       vpsrldq  xmm0, xmm0, 8
       vpmovzxbw xmm0, xmm0
       vmovupd  xmmword ptr [rcx], xmm1
       vmovupd  xmmword ptr [rcx+16], xmm0
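
For reference, a minimal sketch of the decomposition described above, assuming the WidenLower/WidenUpper helpers are directly callable (in this PR they may still be internal-only):

using System.Runtime.Intrinsics;

static (Vector256<ushort> Lower, Vector256<ushort> Upper) WidenViaHelpers(Vector256<byte> source)
    // On AVX2, WidenLower lowers to vpmovzxbw on the low half and WidenUpper
    // to vextractf128 + vpmovzxbw, matching the disassembly above.
    => (Vector256.WidenLower(source), Vector256.WidenUpper(source));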

==============================

The Tests collection has the following improvements:

Top file improvements (bytes):
       -3265 : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm (-0.13% of base)
       -2284 : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm (-0.09% of base)
        -922 : JIT\HardwareIntrinsics\General\Vector256\Vector256_r\Vector256_r.dasm (-0.03% of base)
        -714 : JIT\HardwareIntrinsics\General\Vector128\Vector128_r\Vector128_r.dasm (-0.02% of base)
        -343 : JIT\HardwareIntrinsics\General\Vector64\Vector64_ro\Vector64_ro.dasm (-0.02% of base)
        -208 : JIT\SIMD\VectorConvert_r_Target_64Bit\VectorConvert_r_Target_64Bit.dasm (-0.17% of base)
        -200 : JIT\SIMD\VectorConvert_ro_Target_64Bit\VectorConvert_ro_Target_64Bit.dasm (-0.84% of base)
         -20 : JIT\Regression\JitBlue\GitHub_13568\GitHub_13568\GitHub_13568.dasm (-2.00% of base)
          -8 : JIT\Regression\JitBlue\GitHub_23159\GitHub_23159\GitHub_23159.dasm (-1.17% of base)

Top method improvements (bytes):
         -58 (-13.91% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunStructFldScenario():this
         -58 (-13.91% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunStructLclFldScenario():this
         -56 (-15.01% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - TestStruct:RunStructFldScenario(VectorBinaryOpTest__NarrowDouble):this
         -55 (-11.88% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - TestStruct:RunStructFldScenario(VectorBinaryOpTest__NarrowDouble):this
         -52 (-13.72% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassFldScenario():this
         -52 (-12.75% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassLclFldScenario():this
         -52 (-12.01% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClsVarScenario():this
         -51 (-10.87% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassFldScenario():this
         -51 (-10.24% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClassLclFldScenario():this
         -51 (-9.75% of base) : JIT\HardwareIntrinsics\General\Vector256\Vector256_ro\Vector256_ro.dasm - VectorBinaryOpTest__NarrowDouble:RunClsVarScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt16:RunStructFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt16:RunStructLclFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt64:RunStructFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowInt64:RunStructLclFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowUInt16:RunStructFldScenario():this
         -50 (-11.99% of base) : JIT\HardwareIntrinsics\General\Vector128\Vector128_ro\Vector128_ro.dasm - VectorBinaryOpTest__NarrowUInt16:RunStructLclFldScenario():this

Noting that most of the improvements come from the Vector128/256:Narrow/Widen calls being intrinsified now, where they weren't before. The interesting diffs are for JIT\SIMD, where Vector<T> codegen is improved over the previous logic:

-       vmovupd  ymm6, ymmword ptr[rsi]
-       vmovupd  ymm0, ymmword ptr[rdi]
-       vcvtpd2ps ymm6, ymm6
-       vcvtpd2ps ymm1, ymm0
-       vinsertf128 xmm6, xmm1, 1
-       vcvtps2pd ymm7, ymm6
-       vextractf128 xmm8, xmm6, 1
-       vcvtps2pd ymm8, ymm8
+       vcvtpd2ps ymm0, ymmword ptr[rsi]
+       vcvtpd2ps ymm1, ymmword ptr[rdi]
+       vinsertf128 xmm6, xmm0, xmm1, 1
+       vmovaps  ymm0, ymm6
+       vcvtps2pd ymm7, ymm0
+       vextractf128 xmm0, xmm6, 1
+       vcvtps2pd ymm8, ymm0
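
For context, Vector<T> code that produces a diff like the one above is roughly this shape (my reconstruction of the VectorConvert test pattern, not the actual test source):

using System.Numerics;

static void RoundTrip(Vector<double> a, Vector<double> b,
                      out Vector<double> lo, out Vector<double> hi)
{
    // Narrow two double vectors into one float vector (vcvtpd2ps + vinsertf128).
    Vector<float> narrowed = Vector.Narrow(a, b);

    // Widen the float vector back into two double vectors (vcvtps2pd + vextractf128).
    Vector.Widen(narrowed, out lo, out hi);
}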

@tannergooding
Member Author

This should be ready for review now.

@tannergooding
Member Author

Rebased onto dotnet/main. This is still ready for review.

@JulieLeeMSFT JulieLeeMSFT added this to the 7.0.0 milestone Nov 11, 2021
@JulieLeeMSFT
Member

CC @echesakovMSFT PTAL.

[CLSCompliant(false)]
[Intrinsic]
public static unsafe (Vector64<ushort> Lower, Vector64<ushort> Upper) Widen(Vector64<byte> source)
=> (WidenLower(source), WidenUpper(source));
Contributor

@tannergooding Have you looked at the code produced for such a function call? Do we pass the return values properly in two registers from the callee to the caller, without copying to/from memory?

Member Author

The codegen looked good when I checked; let me rebuild and get a disasm example to share, however.

Contributor

Thanks, I remember seeing some strange artifacts with multi-reg returns while I was working on #52424.

Member Author

It's worth noting this isn't a multi-reg return; rather, it's a managed wrapper method that calls two helper intrinsics, since no platform implements this as a single instruction with a multi-reg result.

Contributor

Sure, it's not a multi-reg intrinsic, but it's still a multi-reg call returning a value in D0 and D1 registers, so I was wondering whether the codegen for such calls (incl. the method) is optimal.

Member Author

@echesakovMSFT, it looks like yes, there are still some oddities there: it doesn't preference q0/q1 as the target registers, nor does it directly do a mov; instead it does a store/load.

        4EA01C10          mov     v16.16b, v0.16b
        0E617A10          fcvtl   v16.2d, v16.2s
        4E617811          fcvtl2  v17.2d, v0.4s
        3D8007B0          str     q16, [fp,#16]
        3D800BB1          str     q17, [fp,#32]
        3DC007A0          ldr     q0, [fp,#16]
        3DC00BA1          ldr     q1, [fp,#32]

It looks like this gets imported initially as:

               [000014] I-CXG-------              *  CALL      void   ValueTuple`2..ctor (exactContextHnd=0x00007FFA49A115C9)
               [000013] ------------ this in x0   +--*  ADDR      byref 
               [000012] -------N----              |  \--*  LCL_VAR   struct<ValueTuple`2, 32> V02 tmp1         
               [000018] ---XG------- arg1         +--*  OBJ       simd16<Vector128`1>
               [000017] ------------              |  \--*  ADDR      byref 
               [000004] -------N----              |     \--*  HWINTRINSIC simd16 float ConvertToDouble
               [000003] ------------              |        \--*  HWINTRINSIC simd8  float GetLower
               [000002] n-----------              |           \--*  OBJ       simd16<Vector128`1>
               [000001] ------------              |              \--*  ADDR      byref 
               [000000] -------N----              |                 \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         
               [000016] ---XG------- arg2         \--*  OBJ       simd16<Vector128`1>
               [000015] ------------                 \--*  ADDR      byref 
               [000008] -------N----                    \--*  HWINTRINSIC simd16 float ConvertToDoubleUpper
               [000007] n-----------                       \--*  OBJ       simd16<Vector128`1>
               [000006] ------------                          \--*  ADDR      byref 
               [000005] -------N----                             \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         

However, there are locals introduced and indirections kept that never truly go away, even in the face of HVA/HFA handling. So before rationalization we still have:

***** BB01
STMT00000 (IL 0x000...0x016)
N003 (  5,  5) [000011] IA------R---              *  ASG       struct (init) $VN.Void
N002 (  3,  2) [000009] D------N----              +--*  LCL_VAR   struct<ValueTuple`2, 32> V02 tmp1         d:1
N001 (  1,  2) [000010] ------------              \--*  CNS_INT   int    0 $40

***** BB01
STMT00005 (IL   ???...  ???)
N007 (  9,  8) [000037] -A--G---R---              *  ASG       simd16 (copy) $200
N006 (  1,  1) [000035] D------N----              +--*  LCL_VAR   simd16<Vector128`1> V03 tmp2         d:1 $200
N005 (  9,  8) [000004] -------N----              \--*  HWINTRINSIC simd16 float ConvertToDouble $200
N004 (  8,  7) [000003] ------------                 \--*  HWINTRINSIC simd8  float GetLower $1c0
N003 (  7,  6) [000002] n-----------                    \--*  OBJ       simd16<Vector128`1> $81
N002 (  1,  2) [000001] ------------                       \--*  ADDR      byref  $140
N001 (  1,  1) [000000] -------N----                          \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         u:1 $80

***** BB01
STMT00006 (IL   ???...  ???)
N006 (  8,  7) [000040] -A--G---R---              *  ASG       simd16 (copy) $84
N005 (  1,  1) [000038] D------N----              +--*  LCL_VAR   simd16<Vector128`1> V04 tmp3         d:1 $84
N004 (  8,  7) [000008] -------N----              \--*  HWINTRINSIC simd16 float ConvertToDoubleUpper $84
N003 (  7,  6) [000007] n-----------                 \--*  OBJ       simd16<Vector128`1> $83
N002 (  1,  2) [000006] ------------                    \--*  ADDR      byref  $140
N001 (  1,  1) [000005] -------N----                       \--*  LCL_VAR   simd16<Vector128`1> V00 arg0         u:1 (last use) $80

***** BB01
STMT00003 (IL   ???...  ???)
N003 (  5,  6) [000027] -A------R---              *  ASG       simd16 (copy) $200
N002 (  3,  4) [000022] U------N----              +--*  LCL_FLD   simd16 V02 tmp1         ud:1->2[+0] Fseq[Item1] $300
N001 (  1,  1) [000023] ------------              \--*  LCL_VAR   simd16<Vector128`1> V03 tmp2         u:1 (last use) $200

***** BB01
STMT00004 (IL   ???...  ???)
N003 (  5,  6) [000034] -A------R---              *  ASG       simd16 (copy) $84
N002 (  3,  4) [000029] U------N----              +--*  LCL_FLD   simd16 V02 tmp1         ud:2->3[+16] Fseq[Item2] $301
N001 (  1,  1) [000030] ------------              \--*  LCL_VAR   simd16<Vector128`1> V04 tmp3         u:1 (last use) $84

***** BB01
STMT00002 (IL 0x016...  ???)
N002 (  4,  3) [000020] ------------              *  RETURN    struct $101
N001 (  3,  2) [000019] -------N----              \--*  LCL_VAR   struct<ValueTuple`2, 32> V02 tmp1         u:3 (last use) $301

Member Author

-- Noting this is without inlining. With inlining, the JIT sometimes does the right thing and other times does not.
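
For reference, a plausible shape for the method behind the AArch64 disassembly above (my reconstruction, not code from this PR):

using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

static class Repro
{
    // Without inlining, the two fcvtl results are spilled to the
    // ValueTuple`2 local and reloaded into q0/q1 for the multi-reg return
    // instead of being moved there directly.
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static (Vector128<double> Lower, Vector128<double> Upper) WidenToDouble(Vector128<float> value)
        => Vector128.Widen(value);
}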

Contributor

Okay, thanks for checking - this is what I suspected would happen - we should work on the issue in .NET 7.

Contributor

@echesakov echesakov left a comment

Changes LGTM.

@tannergooding tannergooding merged commit 5fa6dd3 into dotnet:main Nov 12, 2021
@EgorBo
Member

EgorBo commented Nov 16, 2021

Ubuntu-x64 improvements: dotnet/perf-autofiling-issues#2339
Alpine-x64 improvements: dotnet/perf-autofiling-issues#2362
windows-x64 improvements: dotnet/perf-autofiling-issues#2346
windows-amd-x64 improvements: dotnet/perf-autofiling-issues#2374

