Vectorize Convert.FromHexString #82521
Conversation
This is ready for review. arm64 is a bit slower but still shows a 3x improvement on my M1. PTAL @dotnet/area-system-memory
Tagging subscribers to this area: @dotnet/area-system-memory

Benchmark:
public static IEnumerable<string> HexTestData()
{
yield return "ABCD";
yield return "01234567";
yield return "0123456789ABCDEF";
yield return "0123456789ABCDEF0123456789ABCDEF";
yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
}
[Benchmark]
[ArgumentsSource(nameof(HexTestData))]
public byte[] FromHexString(string hex) => Convert.FromHexString(hex);
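For context, here is a minimal scalar model (a Python sketch, purely illustrative; `from_hex_scalar` is a hypothetical name, not the BCL implementation) of what `Convert.FromHexString` computes: each pair of hex digits becomes one byte, high nibble first.

```python
def from_hex_scalar(s: str) -> bytes:
    """Scalar model of Convert.FromHexString (illustrative sketch only)."""
    def nibble(c: str) -> int:
        if "0" <= c <= "9":
            return ord(c) - ord("0")
        if "A" <= c <= "F":
            return ord(c) - ord("A") + 10
        if "a" <= c <= "f":
            return ord(c) - ord("a") + 10
        raise ValueError(f"invalid hex digit: {c!r}")

    if len(s) % 2 != 0:
        raise ValueError("hex string must have even length")
    # High nibble * 16 + low nibble for each digit pair.
    return bytes(
        nibble(s[i]) * 16 + nibble(s[i + 1]) for i in range(0, len(s), 2)
    )
```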
The regression for small inputs is within noise; I assume it's hidden by allocation overhead.
Ping @dotnet/area-system-memory @stephentoub, any interest in this? The final goal was to use this work for Guid parsing (should be a small change).
Closing due to lack of interest after 3 pings.
I'm interested :)
src/libraries/System.Runtime.Extensions/tests/System/Convert.FromHexString.cs
if (Ssse3.IsSupported)
{
    output = Ssse3.MultiplyAddAdjacent(nibbles,
        Vector128.Create((short)0x0110).AsSByte()).AsByte();
}
else
{
    // Workaround for missing MultiplyAddAdjacent on ARM
    Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
    Vector128<short> vl = AdvSimd.Multiply(
        AdvSimd.ZeroExtendWideningLower(nibbles.GetLower()).AsInt16(), ones).AsInt16();
    Vector128<short> vu = AdvSimd.Multiply(
        AdvSimd.ZeroExtendWideningLower(nibbles.GetUpper()).AsInt16(), ones).AsInt16();
    output = AdvSimd.AddSaturate(
        AdvSimd.Arm64.UnzipEven(vl, vu),
        AdvSimd.Arm64.UnzipOdd(vl, vu)).AsByte();
}
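For readers following along, the arithmetic of the ARM fallback above can be checked against a scalar model (a Python sketch with hypothetical names; list slicing stands in for the unzip intrinsics, and the saturating add never saturates here because every sum fits in a byte):

```python
def model_arm_workaround(nibbles):
    """Scalar model of the ARM fallback: widen the 16 decoded nibble values
    (0..15) to 16-bit, multiply by the repeating [0x10, 0x01] pattern,
    unzip even/odd lanes, then add pairs to form the output byte values."""
    assert len(nibbles) == 16
    ones = [0x10, 0x01] * 4                            # one short per widened lane
    vl = [n * m for n, m in zip(nibbles[:8], ones)]    # lower half, widened + multiplied
    vu = [n * m for n, m in zip(nibbles[8:], ones)]    # upper half, widened + multiplied
    uzp_even = vl[0::2] + vu[0::2]                     # UnzipEven(vl, vu)
    uzp_odd = vl[1::2] + vu[1::2]                      # UnzipOdd(vl, vu)
    return [e + o for e, o in zip(uzp_even, uzp_odd)]  # AddSaturate (no overflow here)
```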
Can this be a helper function for easier readability of the loop?
I was not sure what to name that helper, because apparently it's not a cross-platform MultiplyAddAdjacent, so that's why I decided to keep it here. Maybe we'll simplify it for arm in #82521 (comment)
Sorry, this got missed in the general hectic period of recent area ownership changes. Looking now.
// Workaround for missing MultiplyAddAdjacent on ARM
Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
Vector128<short> vl = AdvSimd.Multiply(
    AdvSimd.ZeroExtendWideningLower(nibbles.GetLower()).AsInt16(), ones).AsInt16();
Vector128<short> vu = AdvSimd.Multiply(
    AdvSimd.ZeroExtendWideningLower(nibbles.GetUpper()).AsInt16(), ones).AsInt16();
output = AdvSimd.AddSaturate(
    AdvSimd.Arm64.UnzipEven(vl, vu),
    AdvSimd.Arm64.UnzipOdd(vl, vu)).AsByte();
Is there not a better way to do this either? Arm64 has many specialized multiply instructions, including those that simultaneously widen, or do addition, or saturation.
@TamarChristinaArm might have some insight...
For reference, `Vector128<short> MultiplyAddAdjacent(Vector128<byte> left, Vector128<sbyte> right)` is `PMADDUBSW`. This effectively does the following:
DEST[0] = SaturateToSignedWord((short)(SRC[1] * DEST[1]) + (short)(SRC[0] * DEST[0]));
DEST[1] = SaturateToSignedWord((short)(SRC[3] * DEST[3]) + (short)(SRC[2] * DEST[2]));
...
That is, it takes an unsigned byte (`left`) and a signed byte (`right`). It multiplies these to produce a signed short. It then does a pairwise addition.
In the case of xarch, we're taking an unsigned byte (`nibbles`) of unknown value and a signed byte of known value `0x0110`, which is effectively unsigned. Since for `x-bit * x-bit = y-bit` the sign is only relevant when `y-bit > x-bit`, we can effectively treat this as an unsigned multiplication and then reinterpret it back as signed.
That is, we can at least do:
Vector128<byte> multiplier = Vector128.Create((ushort)0x0110).AsByte();
Vector128<short> vl = AdvSimd.MultiplyWideningLower(nibbles.GetLower(), multiplier.GetLower()).AsInt16();
Vector128<short> vu = AdvSimd.MultiplyWideningUpper(nibbles, multiplier).AsInt16();
This will allow us to remove a couple of the operations. I'm not aware of a trivial way to do a pairwise saturation here, however, so we may need to keep that latter bit as-is.
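To see concretely that the `0x0110` multiplier recombines adjacent nibbles, here is a scalar model of the `PMADDUBSW` semantics described above (a Python sketch; `pmaddubsw` is a hypothetical name). The multiplier short `0x0110` is the little-endian byte pattern `[0x10, 0x01]`, so each even lane is scaled by 16 and added to its odd neighbor.

```python
def pmaddubsw(left, right):
    """Scalar model of PMADDUBSW: 16 unsigned bytes x 16 signed bytes ->
    8 pairwise sums, saturated to signed 16-bit."""
    out = []
    for i in range(0, 16, 2):
        s = left[i] * right[i] + left[i + 1] * right[i + 1]
        out.append(max(-0x8000, min(0x7FFF, s)))  # saturate to signed short
    return out

nibbles = [0xA, 0xB, 0xC, 0xD] + [0] * 12  # decoded hex digits of "ABCD"
multiplier = [0x10, 0x01] * 8              # little-endian bytes of 0x0110
packed = pmaddubsw(nibbles, multiplier)    # each short holds one byte value
```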
What I did here is take a general impl of `MultiplyAddAdjacent` and simplify it for the constant argument. I ran benchmarks on M1 and they overall demonstrated similar ratios as for x64. I also think it's possible to make it more efficient.
Hi :),
I'm on holiday till Monday, but Tanner's version is much better. Unfortunately AArch64 lacks a lot of pairwise operations.
However, from a quick look (so please double check the below before implementing) at the constant you used:
Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
This looks like it's multiplying even elements by `0x10` and odd elements by `0x1`.
So evens are just shift left by 4 and odds just zero extend?
Atm the sequence from Tanner is: movi + umull + umull + (uzp1 + uzp2) + sqadd, where the permutes run in parallel. So roughly 2 + 4 + 4 + 2 + 2 = 14 cycles on Neoverse-N1, 12 cycles assuming the movi got lifted.
Since you need the permute anyway and `nibbles` is unsigned, how about instead: create a zero vector outside the loop and do the permutes earlier to separate even/odd, but take advantage of this to get a zero extend for free.
zip1 (with zero vector) + uzp2 + ushll 4 + sqadd
Which would be 2 + 2 + 2 = 6 since the permutes again run in parallel.
Should be a 2x speedup.
@TamarChristinaArm @tannergooding thanks for the suggestion! Did I get it correctly:
var zip1 = AdvSimd.Arm64.ZipLow(nibbles, Vector128<byte>.Zero); // zip1
var uzp2 = AdvSimd.Arm64.UnzipOdd(nibbles, nibbles); // uzp2
var ushl = AdvSimd.ShiftLeftLogicalWideningLower(uzp2.GetLower(), 4); // ushll 4
output = AdvSimd.AddSaturate(zip1, ushl.AsByte()); // sqadd
Unfortunately it doesn't produce the expected results 🙁
Unsure about the `uzp2`, but I think the principle is that because the multiplier is effectively multiplying even elements by 16 and odd elements by 1, and we know both inputs are unsigned, we can do this:
var even = AdvSimd.Arm64.ZipLow(nibbles, Vector128<byte>.Zero).AsInt16();
var odd = AdvSimd.Arm64.ZipHigh(nibbles, Vector128<byte>.Zero).AsInt16();
even = AdvSimd.ShiftLeftLogical(even, 4);
output = AdvSimd.AddSaturate(even, odd);
This gives us a 4 instruction sequence: zip1, zip2, shl, sqadd.
This works because `ZipLow` and `ZipHigh` interleave the elements. That is, `ZipLow` is:
for (int p = 0; p < Vector128<T>.Count / 2; p++)
{
result[(2 * p) + 0] = left[p];
result[(2 * p) + 1] = right[p];
}
While `ZipHigh` takes the upper halves, i.e.:
for (int p = 0; p < Vector128<T>.Count / 2; p++)
{
    result[(2 * p) + 0] = left[p + (Vector128<T>.Count / 2)];
    result[(2 * p) + 1] = right[p + (Vector128<T>.Count / 2)];
}
This means we can get the effect of decoupling the even and odd elements while zero-extending at the same time.
We then just need to multiply the even results by 16, which is the same as shifting left by 4. We don't need to touch the odd results since they are just multiplied by 1.
Then we do the saturating add and are complete.
Ah, that's not quite right: `ZipLow` is doing lhs[0], rhs[0], lhs[1], rhs[1], ... and we want lhs[0], rhs[0], lhs[2], rhs[2], ...
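A scalar model (a Python sketch with hypothetical names) makes the mispairing visible: after reinterpreting the zipped bytes as shorts, the high and low nibbles of the same output byte land in different short lanes.

```python
def zip_low(a, b):
    """Scalar model of zip1: a[0], b[0], a[1], b[1], ..."""
    out = []
    for p in range(len(a) // 2):
        out += [a[p], b[p]]
    return out

def bytes_to_shorts(bs):
    """Little-endian reinterpret of byte lanes as 16-bit lanes."""
    return [bs[i] | (bs[i + 1] << 8) for i in range(0, len(bs), 2)]

nibbles = [0xA, 0xB, 0xC, 0xD] + [0] * 12  # decoded hex digits of "ABCD"
even = bytes_to_shorts(zip_low(nibbles, [0] * 16))
# even zero-extends the *adjacent* elements n0, n1, n2, n3 into short lanes,
# not the even-indexed ones, so (even << 4) + odd would pair wrong elements.
```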
This should work.. TransposeEven is the one that does lhs[0], rhs[0], lhs[2], rhs[2], ... (and odd indices for TransposeOdd):
var even = AdvSimd.Arm64.TransposeEven(nibbles, Vector128<byte>.Zero).AsInt16();
var odd = AdvSimd.Arm64.TransposeOdd(nibbles, even.AsByte()).AsInt16();
even = AdvSimd.ShiftLeftLogical(even, 4);
output = AdvSimd.AddSaturate(even, odd);
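Assuming the trn1/trn2 lane semantics described above, a scalar model (a Python sketch with hypothetical names) confirms that this sequence reassembles the bytes: trn1 with a zero vector zero-extends the even-indexed (high) nibbles into short lanes, trn2 does the same for the odd-indexed (low) nibbles, and shift+add combines them.

```python
def trn1(a, b):
    """Scalar model of TransposeEven: a[0], b[0], a[2], b[2], ..."""
    out = [0] * len(a)
    for p in range(len(a) // 2):
        out[2 * p] = a[2 * p]
        out[2 * p + 1] = b[2 * p]
    return out

def trn2(a, b):
    """Scalar model of TransposeOdd: a[1], b[1], a[3], b[3], ..."""
    out = [0] * len(a)
    for p in range(len(a) // 2):
        out[2 * p] = a[2 * p + 1]
        out[2 * p + 1] = b[2 * p + 1]
    return out

def bytes_to_shorts(bs):
    """Little-endian reinterpret of byte lanes as 16-bit lanes."""
    return [bs[i] | (bs[i + 1] << 8) for i in range(0, len(bs), 2)]

nibbles = [0xA, 0xB, 0xC, 0xD] + [0] * 12        # decoded digits of "ABCD"
even_bytes = trn1(nibbles, [0] * 16)             # [n0, 0, n2, 0, ...]
even = bytes_to_shorts(even_bytes)               # high nibbles, zero-extended
odd = bytes_to_shorts(trn2(nibbles, even_bytes)) # low nibbles, zero-extended
out = [(e << 4) + o for e, o in zip(even, odd)]  # shl 4 + add
```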
This should work.. TransposeEven is the one that does lhs[0], rhs[0], lhs[2], rhs[2], ...
Argh, yes sorry, I flipped the permute operations in my head. So it should be:
var zip1 = AdvSimd.Arm64.TransposeEven(nibbles, Vector128<byte>.Zero); // trn1
var uzp2 = AdvSimd.Arm64.UnzipOdd(nibbles, nibbles); // uzp2
var ushl = AdvSimd.ShiftLeftLogicalWideningLower(uzp2.GetLower(), 4); // ushll 4
output = AdvSimd.AddSaturate(zip1, ushl.AsByte()); // sqadd
or @tannergooding 's version above.
we don't get two movi reg, #0 emitted for Vector128.Zero
The loop doesn't look that big, and I assume there's no loop invariant motion? It might be worth lifting the 5 constants out of the loop. You'll definitely have enough left over on Advanced SIMD, but am unsure about AVX/SSE. But worth trying to see what comes out.
I assume there's no loop invariant motion?
We do loop invariant motion; it's just never done for simple things like zero vectors. Even if I do the hoisting by hand, it will be propagated back at its use. The main issue is that our RA is not perfect, so we try to minimize cheap cases occupying registers. We'll improve that eventually. Overall, most constant vectors are hoisted from this loop by the JIT (at least till the IsAscii check).
The main issue is that our RA is not perfect, so we try to minimize cheap cases occupying registers.
Ah, that's fine. `Zero` is typically free anyway; the `movi` idiom is handled at rename. So it should only really be an issue in cases where it prevents you from squeezing stuff inside a single fetch block. So no real priority, if the complicated ones are hoisted :)
Co-authored-by: Günther Foidl <[email protected]>
… vectorize-from-hex
No problem! I was just trying to minimize the number of open PRs. I quite often close my PRs and then re-open them after a while and finish them 🙂
@tannergooding anything else on this PR? I assume we could optimize more things for arm64 specifically here, but it's a trade-off between code readability/portability and perf?
I think what we have is good now, and makes up for the perf difference of xarch having 1 instruction vs Arm64 having 4.
Failure is #83655
Vectorize Convert.FromHexString (UTF-16) using "Algorithm 3" in http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html by Geoff Langdale and Wojciech Mula.

Motivation

Convert.ToHexString is accelerated while Convert.FromHexString is not. Also, this PR's UTF8-based impl can be re-used for other things like Guid parsing (both UTF16 and UTF8) to e.g. speed up JSON parsing with GUIDs inside.

Benchmark

The regression for small inputs is within noise; I assume it's hidden by allocation overhead.
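On the scalar side, the per-digit decode can be done branchlessly with a well-known ASCII trick (a Python sketch; this is illustrative and not necessarily the exact algorithm from the linked note): digits '0'..'9' sit at 0x30..0x39 with bit 6 clear, while 'A'..'F'/'a'..'f' sit at 0x41+/0x61+ with bit 6 set, so adding 9 when bit 6 is set turns the letters' low nibble 1..6 into 10..15.

```python
def hex_nibble(c: str) -> int:
    """Branchless hex digit -> nibble value; valid only for [0-9A-Fa-f]."""
    b = ord(c)
    # (b >> 6) is 0 for digits, 1 for letters; letters need the +9 offset.
    return (b & 0xF) + 9 * (b >> 6)
```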