Description
As part of the effort to convert target-dependent intrinsics in the .NET libraries to target-independent Vector* functions, I went through the intrinsics currently used in the libraries. The list is below, along with some possible options for switching to cross-platform vectors, if we either expand the Vector API or have the JIT optimize certain patterns where multiple Vector functions can achieve the same result.
- Base64 Encoder/Decoder
a. Has AVX512 path
b. runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Decoder.cs
Line 40 in f94bab0
private static unsafe OperationStatus DecodeFromUtf8(ReadOnlySpan<byte> utf8, Span<byte> bytes, out int bytesConsumed, out int bytesWritten, bool isFinalBlock, bool ignoreWhiteSpace)
c. runtime/src/libraries/System.Private.CoreLib/src/System/Buffers/Text/Base64Encoder.cs
Line 38 in f94bab0
public static unsafe OperationStatus EncodeToUtf8(ReadOnlySpan<byte> bytes, Span<byte> utf8, out int bytesConsumed, out int bytesWritten, bool isFinalBlock = true)
d. Cannot convert everything to Vector* without expanding the Vector surface area.
- ProbabilisticMap
a. Has AVX512 paths
b. runtime/src/libraries/System.Private.CoreLib/src/System/SearchValues/ProbabilisticMap.cs
Lines 107 to 265 in f94bab0
```csharp
[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Avx512Vbmi))]
private static Vector512<byte> ContainsMask64CharsAvx512(Vector512<byte> charMap, ref char searchSpace0, ref char searchSpace1)
{
    Vector512<ushort> source0 = Vector512.LoadUnsafe(ref searchSpace0);
    Vector512<ushort> source1 = Vector512.LoadUnsafe(ref searchSpace1);

    Vector512<byte> sourceLower = Avx512BW.PackUnsignedSaturate(
        (source0 & Vector512.Create((ushort)255)).AsInt16(),
        (source1 & Vector512.Create((ushort)255)).AsInt16());

    Vector512<byte> sourceUpper = Avx512BW.PackUnsignedSaturate(
        (source0 >>> 8).AsInt16(),
        (source1 >>> 8).AsInt16());

    Vector512<byte> resultLower = IsCharBitNotSetAvx512(charMap, sourceLower);
    Vector512<byte> resultUpper = IsCharBitNotSetAvx512(charMap, sourceUpper);

    return ~(resultLower | resultUpper);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Avx512Vbmi))]
private static Vector512<byte> IsCharBitNotSetAvx512(Vector512<byte> charMap, Vector512<byte> values)
{
    Vector512<byte> shifted = values >>> VectorizedIndexShift;
    Vector512<byte> bitPositions = Avx512BW.Shuffle(Vector512.Create(0x8040201008040201).AsByte(), shifted);

    Vector512<byte> index = values & Vector512.Create((byte)VectorizedIndexMask);
    Vector512<byte> bitMask = Avx512Vbmi.PermuteVar64x8(charMap, index);

    return Vector512.Equals(bitMask & bitPositions, Vector512<byte>.Zero);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Avx512Vbmi.VL))]
private static Vector256<byte> ContainsMask32CharsAvx512(Vector256<byte> charMap, ref char searchSpace0, ref char searchSpace1)
{
    Vector256<ushort> source0 = Vector256.LoadUnsafe(ref searchSpace0);
    Vector256<ushort> source1 = Vector256.LoadUnsafe(ref searchSpace1);

    Vector256<byte> sourceLower = Avx2.PackUnsignedSaturate(
        (source0 & Vector256.Create((ushort)255)).AsInt16(),
        (source1 & Vector256.Create((ushort)255)).AsInt16());

    Vector256<byte> sourceUpper = Avx2.PackUnsignedSaturate(
        (source0 >>> 8).AsInt16(),
        (source1 >>> 8).AsInt16());

    Vector256<byte> resultLower = IsCharBitNotSetAvx512(charMap, sourceLower);
    Vector256<byte> resultUpper = IsCharBitNotSetAvx512(charMap, sourceUpper);

    return ~(resultLower | resultUpper);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Avx512Vbmi.VL))]
private static Vector256<byte> IsCharBitNotSetAvx512(Vector256<byte> charMap, Vector256<byte> values)
{
    Vector256<byte> shifted = values >>> VectorizedIndexShift;
    Vector256<byte> bitPositions = Avx2.Shuffle(Vector256.Create(0x8040201008040201).AsByte(), shifted);

    Vector256<byte> index = values & Vector256.Create((byte)VectorizedIndexMask);
    Vector256<byte> bitMask = Avx512Vbmi.VL.PermuteVar32x8(charMap, index);

    return Vector256.Equals(bitMask & bitPositions, Vector256<byte>.Zero);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Avx2))]
private static Vector256<byte> ContainsMask32CharsAvx2(Vector256<byte> charMapLower, Vector256<byte> charMapUpper, ref char searchSpace)
{
    Vector256<ushort> source0 = Vector256.LoadUnsafe(ref searchSpace);
    Vector256<ushort> source1 = Vector256.LoadUnsafe(ref searchSpace, (nuint)Vector256<ushort>.Count);

    Vector256<byte> sourceLower = Avx2.PackUnsignedSaturate(
        (source0 & Vector256.Create((ushort)255)).AsInt16(),
        (source1 & Vector256.Create((ushort)255)).AsInt16());

    Vector256<byte> sourceUpper = Avx2.PackUnsignedSaturate(
        (source0 >>> 8).AsInt16(),
        (source1 >>> 8).AsInt16());

    Vector256<byte> resultLower = IsCharBitNotSetAvx2(charMapLower, charMapUpper, sourceLower);
    Vector256<byte> resultUpper = IsCharBitNotSetAvx2(charMapLower, charMapUpper, sourceUpper);

    return ~(resultLower | resultUpper);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Avx2))]
private static Vector256<byte> IsCharBitNotSetAvx2(Vector256<byte> charMapLower, Vector256<byte> charMapUpper, Vector256<byte> values)
{
    Vector256<byte> shifted = values >>> VectorizedIndexShift;
    Vector256<byte> bitPositions = Avx2.Shuffle(Vector256.Create(0x8040201008040201).AsByte(), shifted);

    Vector256<byte> index = values & Vector256.Create((byte)VectorizedIndexMask);
    Vector256<byte> bitMaskLower = Avx2.Shuffle(charMapLower, index);
    Vector256<byte> bitMaskUpper = Avx2.Shuffle(charMapUpper, index - Vector256.Create((byte)16));
    Vector256<byte> mask = Vector256.GreaterThan(index, Vector256.Create((byte)15));
    Vector256<byte> bitMask = Vector256.ConditionalSelect(mask, bitMaskUpper, bitMaskLower);

    return Vector256.Equals(bitMask & bitPositions, Vector256<byte>.Zero);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(AdvSimd.Arm64))]
[CompExactlyDependsOn(typeof(Sse2))]
private static Vector128<byte> ContainsMask16Chars(Vector128<byte> charMapLower, Vector128<byte> charMapUpper, ref char searchSpace)
{
    Vector128<ushort> source0 = Vector128.LoadUnsafe(ref searchSpace);
    Vector128<ushort> source1 = Vector128.LoadUnsafe(ref searchSpace, (nuint)Vector128<ushort>.Count);

    Vector128<byte> sourceLower = Sse2.IsSupported
        ? Sse2.PackUnsignedSaturate((source0 & Vector128.Create((ushort)255)).AsInt16(), (source1 & Vector128.Create((ushort)255)).AsInt16())
        : AdvSimd.Arm64.UnzipEven(source0.AsByte(), source1.AsByte());

    Vector128<byte> sourceUpper = Sse2.IsSupported
        ? Sse2.PackUnsignedSaturate((source0 >>> 8).AsInt16(), (source1 >>> 8).AsInt16())
        : AdvSimd.Arm64.UnzipOdd(source0.AsByte(), source1.AsByte());

    Vector128<byte> resultLower = IsCharBitNotSet(charMapLower, charMapUpper, sourceLower);
    Vector128<byte> resultUpper = IsCharBitNotSet(charMapLower, charMapUpper, sourceUpper);

    return ~(resultLower | resultUpper);
}

[MethodImpl(MethodImplOptions.AggressiveInlining)]
[CompExactlyDependsOn(typeof(Sse2))]
[CompExactlyDependsOn(typeof(Ssse3))]
[CompExactlyDependsOn(typeof(AdvSimd))]
[CompExactlyDependsOn(typeof(AdvSimd.Arm64))]
[CompExactlyDependsOn(typeof(PackedSimd))]
private static Vector128<byte> IsCharBitNotSet(Vector128<byte> charMapLower, Vector128<byte> charMapUpper, Vector128<byte> values)
{
    Vector128<byte> shifted = values >>> VectorizedIndexShift;
    Vector128<byte> bitPositions = Vector128.ShuffleUnsafe(Vector128.Create(0x8040201008040201).AsByte(), shifted);

    Vector128<byte> index = values & Vector128.Create((byte)VectorizedIndexMask);

    Vector128<byte> bitMask;

    if (AdvSimd.Arm64.IsSupported)
    {
        bitMask = AdvSimd.Arm64.VectorTableLookup((charMapLower, charMapUpper), index);
    }
    else
    {
        Vector128<byte> bitMaskLower = Vector128.ShuffleUnsafe(charMapLower, index);
        Vector128<byte> bitMaskUpper = Vector128.ShuffleUnsafe(charMapUpper, index - Vector128.Create((byte)16));
        Vector128<byte> mask = Vector128.GreaterThan(index, Vector128.Create((byte)15));
        bitMask = Vector128.ConditionalSelect(mask, bitMaskUpper, bitMaskLower);
    }

    return Vector128.Equals(bitMask & bitPositions, Vector128<byte>.Zero);
}
```
c. Uses the following
i. Avx512BW.PackUnsignedSaturate
ii. Avx512Vbmi.PermuteVar64x8
iii. Avx512BW.Shuffle
d. Cannot upgrade – there is no 1:1 way to switch PackUnsignedSaturate; a sketch of what an emulation would take is below.
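For reference, here is a minimal sketch of what replacing the PackUnsignedSaturate calls above with cross-platform APIs would take, assuming the inputs are already clamped to the 0–255 range (which they are in ProbabilisticMap, since the values are masked with 0xFF or shifted right by 8 before packing). It is not a drop-in swap: the element order differs, so a constant-index 64-bit shuffle is needed afterwards, and it only becomes attractive if the JIT folds that shuffle into a single vpermq (one of the "recognize the pattern" options mentioned above). The helper name is hypothetical, not an existing or proposed API.

```csharp
using System.Runtime.Intrinsics;

internal static class PackingSketch
{
    // Hypothetical helper: reproduces Avx2.PackUnsignedSaturate(a.AsInt16(), b.AsInt16())
    // for inputs that are already in byte range (so truncation equals saturation).
    internal static Vector256<byte> PackClampedToBytes(Vector256<ushort> a, Vector256<ushort> b)
    {
        // Element-order narrow; viewed as 64-bit chunks the layout is [a0..7, a8..15, b0..7, b8..15].
        Vector256<byte> narrowed = Vector256.Narrow(a, b);

        // vpackuswb produces [a0..7, b0..7, a8..15, b8..15]; restore that order with a
        // constant-index 64-bit shuffle, which would ideally be emitted as a single vpermq.
        return Vector256.Shuffle(narrowed.AsUInt64(), Vector256.Create(0ul, 2ul, 1ul, 3ul)).AsByte();
    }
}
```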
- XxHashShared.cs
a. No AVX512 path
b. if (Vector256.IsHardwareAccelerated && BitConverter.IsLittleEndian)
c. Uses Avx2.Multiply
d. Cannot switch the intrinsic Multiply to a Vector multiply (see the Sse2.Multiply pattern at the end).
- BitArray.cs
a. Has AVX512 path
b. runtime/src/libraries/System.Collections/src/System/Collections/BitArray.cs
Lines 840 to 888 in f94bab0
```csharp
if (Avx512F.IsSupported && (uint)m_length >= Vector512<byte>.Count)
{
    Vector256<byte> upperShuffleMask_CopyToBoolArray256 = Vector256.Create(0x04040404_04040404, 0x05050505_05050505,
                                                                           0x06060606_06060606, 0x07070707_07070707).AsByte();
    Vector256<byte> lowerShuffleMask_CopyToBoolArray256 = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
    Vector512<byte> shuffleMask = Vector512.Create(lowerShuffleMask_CopyToBoolArray256, upperShuffleMask_CopyToBoolArray256);
    Vector512<byte> bitMask = Vector512.Create(0x80402010_08040201).AsByte();
    Vector512<byte> ones = Vector512.Create((byte)1);

    fixed (bool* destination = &boolArray[index])
    {
        for (; (i + Vector512<byte>.Count) <= (uint)m_length; i += (uint)Vector512<byte>.Count)
        {
            ulong bits = (ulong)(uint)m_array[i / (uint)BitsPerInt32] + ((ulong)m_array[(i / (uint)BitsPerInt32) + 1] << BitsPerInt32);
            Vector512<ulong> scalar = Vector512.Create(bits);
            Vector512<byte> shuffled = Avx512BW.Shuffle(scalar.AsByte(), shuffleMask);
            Vector512<byte> extracted = Avx512F.And(shuffled, bitMask);

            // The extracted bits can be anywhere between 0 and 255, so we normalise the value to either 0 or 1
            // to ensure compatibility with "C# bool" (0 for false, 1 for true, rest undefined)
            Vector512<byte> normalized = Avx512BW.Min(extracted, ones);
            Avx512F.Store((byte*)destination + i, normalized);
        }
    }
}
else if (Avx2.IsSupported && (uint)m_length >= Vector256<byte>.Count)
{
    Vector256<byte> shuffleMask = Vector256.Create(lowerShuffleMask_CopyToBoolArray, upperShuffleMask_CopyToBoolArray);
    Vector256<byte> bitMask = Vector256.Create(0x80402010_08040201).AsByte();
    //Internal.Console.WriteLine(bitMask);
    Vector256<byte> ones = Vector256.Create((byte)1);

    fixed (bool* destination = &boolArray[index])
    {
        for (; (i + Vector256<byte>.Count) <= (uint)m_length; i += (uint)Vector256<byte>.Count)
        {
            int bits = m_array[i / (uint)BitsPerInt32];
            Vector256<int> scalar = Vector256.Create(bits);
            Vector256<byte> shuffled = Avx2.Shuffle(scalar.AsByte(), shuffleMask);
            Vector256<byte> extracted = Avx2.And(shuffled, bitMask);

            // The extracted bits can be anywhere between 0 and 255, so we normalise the value to either 0 or 1
            // to ensure compatibility with "C# bool" (0 for false, 1 for true, rest undefined)
            Vector256<byte> normalized = Avx2.Min(extracted, ones);
            Avx.Store((byte*)destination + i, normalized);
        }
    }
}
else if (Ssse3.IsSupported && ((uint)m_length >= Vector512<byte>.Count * 2u))
```
c. Uses the following
i. Avx2.Shuffle
ii. Avx2.And
iii. Avx2.Min
iv. Avx.Store
d. Shuffle with non-constant indices will be problematic to convert, but should be fine once ShuffleUnsafe is implemented; a sketch of the converted loop body is below.
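For illustration, here is a sketch of how the loop body of the Avx2 branch above could look with cross-platform APIs. `ShuffleUnsafe` stands for the proposed non-constant-index shuffle referred to in item d and does not exist as public API today; everything else maps 1:1.

```csharp
using System.Runtime.Intrinsics;

internal static class BitArraySketch
{
    // Illustrative only: parameters correspond to the locals of the Avx2 branch quoted above.
    internal static unsafe void Expand32BitsToBools(int bits, Vector256<byte> shuffleMask, Vector256<byte> bitMask,
                                                    Vector256<byte> ones, byte* destination)
    {
        Vector256<byte> scalar = Vector256.Create(bits).AsByte();

        // Because 'scalar' broadcasts the same 32-bit value and the mask only uses indices 0..3,
        // per-lane vpshufb and full-width cross-platform shuffle semantics give the same result here.
        Vector256<byte> shuffled = Vector256.ShuffleUnsafe(scalar, shuffleMask); // hypothetical / proposed API, was Avx2.Shuffle
        Vector256<byte> extracted = shuffled & bitMask;                          // was Avx2.And
        Vector256<byte> normalized = Vector256.Min(extracted, ones);             // was Avx2.Min
        normalized.Store(destination);                                           // was Avx.Store
    }
}
```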
- AsciiStringSearchValuesTeddyBase.cs / TeddyHelper.cs
a. Has AVX512F path
b. Lines 427 to 479 in f94bab0
```csharp
private int IndexOfAnyN3Avx2(ReadOnlySpan<char> span)
{
    // See comments in 'IndexOfAnyN3Vector128' above.
    // This method is the same, but operates on 32 input characters at a time.
    Debug.Assert(span.Length >= CharsPerIterationAvx2 + MatchStartOffsetN3);

    ref char searchSpace = ref MemoryMarshal.GetReference(span);
    ref char lastSearchSpaceStart = ref Unsafe.Add(ref searchSpace, span.Length - CharsPerIterationAvx2);

    searchSpace = ref Unsafe.Add(ref searchSpace, MatchStartOffsetN3);

    Vector256<byte> n0Low = _n0Low._lower, n0High = _n0High._lower;
    Vector256<byte> n1Low = _n1Low._lower, n1High = _n1High._lower;
    Vector256<byte> n2Low = _n2Low._lower, n2High = _n2High._lower;

    Vector256<byte> prev0 = Vector256<byte>.AllBitsSet;
    Vector256<byte> prev1 = Vector256<byte>.AllBitsSet;

Loop:
    ValidateReadPosition(span, ref searchSpace);

    Vector256<byte> input = TStartCaseSensitivity.TransformInput(LoadAndPack32AsciiChars(ref searchSpace));

    (Vector256<byte> result, prev0, prev1) = ProcessInputN3(input, prev0, prev1, n0Low, n0High, n1Low, n1High, n2Low, n2High);

    if (result != Vector256<byte>.Zero)
    {
        goto CandidateFound;
    }

ContinueLoop:
    searchSpace = ref Unsafe.Add(ref searchSpace, CharsPerIterationAvx2);

    if (Unsafe.IsAddressGreaterThan(ref searchSpace, ref lastSearchSpaceStart))
    {
        if (Unsafe.AreSame(ref searchSpace, ref Unsafe.Add(ref lastSearchSpaceStart, CharsPerIterationAvx2)))
        {
            return -1;
        }

        // We're switching which characters we will process in the next iteration.
        // prev0 and prev1 no longer point to the characters just before the current input, so we must reset them.
        prev0 = Vector256<byte>.AllBitsSet;
        prev1 = Vector256<byte>.AllBitsSet;
        searchSpace = ref lastSearchSpaceStart;
    }

    goto Loop;

CandidateFound:
    if (TryFindMatch(span, ref searchSpace, result, MatchStartOffsetN3, out int offset))
    {
        return offset;
    }

    goto ContinueLoop;
}
```
c. Related: TeddyHelper: runtime/src/libraries/System.Private.CoreLib/src/System/SearchValues/Strings/Helpers/TeddyHelper.cs
Line 47 in f94bab0
[CompExactlyDependsOn(typeof(Avx2))]
d. Uses the following
i. PackUnsignedSaturate: no 1:1 equivalent
ii. Shuffle – possible with ShuffleUnsafe
iii. Permute2x128
iv. AlignRight: no 1:1 equivalent
v. PermuteVar8x64x2
- SpanHelpers.cs: consider all Span helpers under this umbrella
a. Has AVX512F path
b. // Avx2 branch also operates on Sse2 sizes, so check is combined.
c. Uses the following (a mapping sketch follows the list)
i. Shuffle
ii. Avx2.Permute2x128
iii. PermuteVar8x32
iv. Permute4x64
v. Avx2.And
vi. Avx2.MultiplyHigh
vii. Avx2.MultiplyLow
viii. Avx2.Or
ix. Avx2.SubtractSaturate
x. Avx2.CompareGreaterThan
xi. Avx2.Subtract
xii. Avx2.Add
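Of the intrinsics in this list, several already have direct cross-platform spellings; the rest are the ones that would need new surface area or JIT pattern recognition. A quick illustration (the values and element type here are arbitrary):

```csharp
using System.Runtime.Intrinsics;

internal static class SpanHelpersMappingSketch
{
    internal static void OneToOneMappings()
    {
        Vector256<int> a = Vector256.Create(7), b = Vector256.Create(3);

        Vector256<int> and = a & b;                       // Avx2.And
        Vector256<int> or = a | b;                        // Avx2.Or
        Vector256<int> add = a + b;                       // Avx2.Add
        Vector256<int> sub = a - b;                       // Avx2.Subtract
        Vector256<int> mulLow = a * b;                    // Avx2.MultiplyLow (pmulld)
        Vector256<int> gt = Vector256.GreaterThan(a, b);  // Avx2.CompareGreaterThan

        // No 1:1 equivalents today: Avx2.MultiplyHigh, Avx2.SubtractSaturate, Avx2.Permute2x128,
        // Permute4x64 / PermuteVar8x32 (Vector256.Shuffle only covers constant-index, single-source
        // cases), and byte Shuffle (vpshufb) in general.
    }
}
```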
- IndexOfAnyAsciiSearcher
a. No AVX512F path – tried an implementation, had issues
b. runtime/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs
Line 237 in f94bab0
if (Avx2.IsSupported && searchSpaceLength > 2 * Vector128<short>.Count)
c. Uses the following
i. PackUnsignedSaturate
ii. Shuffle
- Matrix4x4.Impl
a. No AVX512 path; AVX paths in some cases
b. runtime/src/libraries/System.Private.CoreLib/src/System/Numerics/Matrix4x4.Impl.cs
Line 1422 in e9e33e1
return Avx.Permute(value, control);
c. Uses the following
i. Shuffle/Permute – constant indices, so possible? (see the sketch below)
ii. UnpackLow
iii. UnpackHigh
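For the constant-control cases (which is how Matrix4x4.Impl uses its Permute helpers), the cross-platform Shuffle with constant indices expresses the same thing. A small illustration with made-up values; UnpackLow/UnpackHigh have no such 1:1 spelling because Vector128.Shuffle only takes a single input:

```csharp
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

internal static class PermuteSketch
{
    internal static void ConstantControlPermute()
    {
        Vector128<float> value = Vector128.Create(1f, 2f, 3f, 4f);

        // Control 0b10_10_10_10 broadcasts element 2, producing (3, 3, 3, 3) where AVX is available.
        Vector128<float> viaIntrinsic = Avx.IsSupported ? Avx.Permute(value, 0b10_10_10_10) : default;

        // Cross-platform spelling with constant indices; same (3, 3, 3, 3) result on every platform.
        Vector128<float> viaVector = Vector128.Shuffle(value, Vector128.Create(2, 2, 2, 2));
    }
}
```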
- Ascii.Equality
a. AVX512 path added
b. else if (Avx.IsSupported && length >= (uint)Vector256<TLeft>.Count)
c. Already uses Vector operations – switch the check (e.g. to Vector256.IsHardwareAccelerated)?
- Ascii.Utility
a. Has AVX512 path
b. private static bool VectorContainsNonAsciiChar(Vector256<ushort> utf16Vector)
c. Uses Testz / PackUnsignedSaturate – can possibly move to more efficient patterns similar to 'HasMatch' (sketch below)
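As a sketch of the kind of cross-platform check this could move to (not the actual implementation): a non-ASCII UTF-16 code unit is anything >= 0x80, so the test can be a single masked compare instead of Testz/PackUnsignedSaturate. Whether this matches the 'HasMatch'-style codegen would need measuring.

```csharp
using System.Runtime.Intrinsics;

internal static class AsciiCheckSketch
{
    // Any bit in 0xFF80 being set means the char is >= 0x80, i.e. not ASCII.
    internal static bool VectorContainsNonAsciiChar(Vector256<ushort> utf16Vector) =>
        (utf16Vector & Vector256.Create((ushort)0xFF80)) != Vector256<ushort>.Zero;
}
```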
BitArray is the only one where conversion is feasible currently, and that is dependent on #99596.
Some patterns we can consider:
- Sse2.Multiply – the Vector multiply does not work the same way: the Vector version keeps only the lower half of each product, while the intrinsic version widens the element type (e.g. uint -> ulong for the SSE2/AVX2 variants). So Widen -> Multiply might work; a sketch is below.
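Here is a sketch of what that could look like for Sse2.Multiply(Vector128&lt;uint&gt;, Vector128&lt;uint&gt;) (pmuludq, which multiplies elements 0 and 2 into full 64-bit products). The first helper matches the intrinsic exactly on little-endian by masking each 64-bit lane down to its low 32 bits; the second shows the plain Widen -> Multiply shape, which instead produces the products of elements (0, 1) and (2, 3). Either way the JIT would need to recognize the pattern to emit a single pmuludq/vpmuludq, since 64-bit element multiplies are otherwise expensive. Helper names are illustrative only.

```csharp
using System.Runtime.Intrinsics;

internal static class WideningMultiplySketch
{
    // Matches Sse2.Multiply / Avx2.Multiply semantics: products of the even uint elements.
    internal static Vector128<ulong> MultiplyWideningEven(Vector128<uint> a, Vector128<uint> b)
    {
        Vector128<ulong> lowMask = Vector128.Create(0xFFFFFFFFul);
        return (a.AsUInt64() & lowMask) * (b.AsUInt64() & lowMask);
    }

    // The straightforward Widen -> Multiply shape: all four products, returned as two vectors.
    internal static (Vector128<ulong> Lower, Vector128<ulong> Upper) MultiplyWideningAll(Vector128<uint> a, Vector128<uint> b)
    {
        (Vector128<ulong> aLower, Vector128<ulong> aUpper) = Vector128.Widen(a);
        (Vector128<ulong> bLower, Vector128<ulong> bUpper) = Vector128.Widen(b);
        return (aLower * bLower, aUpper * bUpper);
    }
}
```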