Vectorize Convert.FromHexString #82521
Conversation
This is ready for review. arm64 is a bit slower but still shows a 3x improvement on my M1. PTAL @dotnet/area-system-memory
Tagging subscribers to this area: @dotnet/area-system-memory

Benchmark:
public static IEnumerable<string> HexTestData()
{
yield return "ABCD";
yield return "01234567";
yield return "0123456789ABCDEF";
yield return "0123456789ABCDEF0123456789ABCDEF";
yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
yield return "0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF";
}
[Benchmark]
[ArgumentsSource(nameof(HexTestData))]
public byte[] FromHexString(string hex) => Convert.FromHexString(hex);
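For context, here is a minimal scalar model (a Python sketch, purely illustrative; `from_hex_scalar` is a hypothetical name, not the BCL implementation) of what `Convert.FromHexString` computes: each pair of hex digits becomes one byte, high nibble first.

```python
def from_hex_scalar(s: str) -> bytes:
    """Scalar model of Convert.FromHexString (illustrative sketch only)."""
    def nibble(c: str) -> int:
        if "0" <= c <= "9":
            return ord(c) - ord("0")
        if "A" <= c <= "F":
            return ord(c) - ord("A") + 10
        if "a" <= c <= "f":
            return ord(c) - ord("a") + 10
        raise ValueError(f"invalid hex digit: {c!r}")

    if len(s) % 2 != 0:
        raise ValueError("hex string must have even length")
    # High nibble * 16 + low nibble for each digit pair.
    return bytes(
        nibble(s[i]) * 16 + nibble(s[i + 1]) for i in range(0, len(s), 2)
    )
```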
The regression for small inputs is within noise; I assume it's hidden by allocation overhead.
Ping @dotnet/area-system-memory @stephentoub, any interest in this? The final goal was to use this work for Guid parsing (should be a small change).
Closing due to lack of interest after 3 pings.
I'm interested :)
src/libraries/System.Runtime.Extensions/tests/System/Convert.FromHexString.cs
if (Ssse3.IsSupported)
{
    output = Ssse3.MultiplyAddAdjacent(nibbles,
        Vector128.Create((short)0x0110).AsSByte()).AsByte();
}
else
{
    // Workaround for missing MultiplyAddAdjacent on ARM
    Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
    Vector128<short> vl = AdvSimd.Multiply(
        AdvSimd.ZeroExtendWideningLower(nibbles.GetLower()).AsInt16(), ones).AsInt16();
    Vector128<short> vu = AdvSimd.Multiply(
        AdvSimd.ZeroExtendWideningLower(nibbles.GetUpper()).AsInt16(), ones).AsInt16();
    output = AdvSimd.AddSaturate(
        AdvSimd.Arm64.UnzipEven(vl, vu),
        AdvSimd.Arm64.UnzipOdd(vl, vu)).AsByte();
}
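For readers following along, the arithmetic of the ARM fallback above can be checked against a scalar model (a Python sketch with hypothetical names; list slicing stands in for the unzip intrinsics, and the saturating add never saturates here because every sum fits in a byte):

```python
def model_arm_workaround(nibbles):
    """Scalar model of the ARM fallback: widen the 16 decoded nibble values
    (0..15) to 16-bit, multiply by the repeating [0x10, 0x01] pattern,
    unzip even/odd lanes, then add pairs to form the output byte values."""
    assert len(nibbles) == 16
    ones = [0x10, 0x01] * 4                            # one short per widened lane
    vl = [n * m for n, m in zip(nibbles[:8], ones)]    # lower half, widened + multiplied
    vu = [n * m for n, m in zip(nibbles[8:], ones)]    # upper half, widened + multiplied
    uzp_even = vl[0::2] + vu[0::2]                     # UnzipEven(vl, vu)
    uzp_odd = vl[1::2] + vu[1::2]                      # UnzipOdd(vl, vu)
    return [e + o for e, o in zip(uzp_even, uzp_odd)]  # AddSaturate (no overflow here)
```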
Can this be a helper function for easier readability of the loop?
I was not sure what to name that helper, because apparently it's not a cross-platform MultiplyAddAdjacent, so that's why I decided to keep it here. Maybe we'll simplify it for arm in #82521 (comment)
Sorry, this got missed in the general hectic period of recent area ownership changes. Looking now.
// Workaround for missing MultiplyAddAdjacent on ARM
Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
Vector128<short> vl = AdvSimd.Multiply(
    AdvSimd.ZeroExtendWideningLower(nibbles.GetLower()).AsInt16(), ones).AsInt16();
Vector128<short> vu = AdvSimd.Multiply(
    AdvSimd.ZeroExtendWideningLower(nibbles.GetUpper()).AsInt16(), ones).AsInt16();
output = AdvSimd.AddSaturate(
    AdvSimd.Arm64.UnzipEven(vl, vu),
    AdvSimd.Arm64.UnzipOdd(vl, vu)).AsByte();
Is there not a better way to do this either? Arm64 has many specialized multiply instructions, including those that simultaneously widen, or do addition, or saturation.
@TamarChristinaArm might have some insight...
For reference, `Vector128<short> MultiplyAddAdjacent(Vector128<byte> left, Vector128<sbyte> right)` is `PMADDUBSW`. This effectively does the following:
DEST[0] = SaturateToSignedWord((short)(SRC[1] * DEST[1]) + (short)(SRC[0] * DEST[0]));
DEST[1] = SaturateToSignedWord((short)(SRC[3] * DEST[3]) + (short)(SRC[2] * DEST[2]));
...
That is, it takes an unsigned byte (`left`) and a signed byte (`right`). It multiplies these to produce a signed short. It then does a pairwise addition.
In the case of xarch, we're taking an unsigned byte (`nibbles`) of unknown value and a signed byte of known value `0x0110`, which is effectively unsigned. Since for `x-bit * x-bit = y-bit` the sign is only relevant when `y-bit > x-bit`, we can effectively treat this as an unsigned multiplication and then reinterpret it back as signed.
That is, we can at least do:
Vector128<byte> multiplier = Vector128.Create((ushort)0x0110).AsByte();
Vector128<short> vl = AdvSimd.MultiplyWideningLower(nibbles.GetLower(), multiplier.GetLower()).AsInt16();
Vector128<short> vu = AdvSimd.MultiplyWideningUpper(nibbles, multiplier).AsInt16();
This will allow us to remove a couple of the operations. I'm not aware of a trivial way to do a pairwise saturation here, however, so we may need to keep that latter bit as-is.
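To see concretely that the `0x0110` multiplier recombines adjacent nibbles, here is a scalar model of the `PMADDUBSW` semantics described above (a Python sketch; `pmaddubsw` is a hypothetical name). The multiplier short `0x0110` is the little-endian byte pattern `[0x10, 0x01]`, so each even lane is scaled by 16 and added to its odd neighbor.

```python
def pmaddubsw(left, right):
    """Scalar model of PMADDUBSW: 16 unsigned bytes x 16 signed bytes ->
    8 pairwise sums, saturated to signed 16-bit."""
    out = []
    for i in range(0, 16, 2):
        s = left[i] * right[i] + left[i + 1] * right[i + 1]
        out.append(max(-0x8000, min(0x7FFF, s)))  # saturate to signed short
    return out

nibbles = [0xA, 0xB, 0xC, 0xD] + [0] * 12  # decoded hex digits of "ABCD"
multiplier = [0x10, 0x01] * 8              # little-endian bytes of 0x0110
packed = pmaddubsw(nibbles, multiplier)    # each short holds one byte value
```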
What I did here is take a general impl of `MultiplyAddAdjacent` and simplify it for the constant argument. I ran benchmarks on M1 and they overall demonstrated similar ratios as for x64. I also think it's possible to make it more efficient.
Hi :),
I'm on holiday till Monday, but Tanner's version is much better. Unfortunately AArch64 lacks a lot of pairwise operations.
However, from a quick look (so please double check the below before implementing) at the constant you used:
Vector128<short> ones = Vector128.Create(0x10010).AsInt16();
This looks like it's multiplying even elements by `0x10` and odd elements by `0x1`.
So evens are just shift left by 4 and odds just zero extend?
Atm the sequence from Tanner is: movi + umull + umull + (uzp1 + uzp2) + sqadd, where the permutes run in parallel. So roughly 2 + 4 + 4 + 2 + 2 = 14 cycles on Neoverse-N1, 12 cycles assuming the movi got lifted.
Since you need the permute anyway and `nibbles` is unsigned, how about instead: create a zero vector outside the loop and do the permutes earlier to separate even/odd, but take advantage of this to get a zero extend for free.
zip1 (with zero vector) + uzp2 + ushll 4 + sqadd
Which would be 2 + 2 + 2 = 6 since the permutes again run in parallel.
Should be a 2x speedup.
@TamarChristinaArm @tannergooding thanks for the suggestion! Did I get it correctly:
var zip1 = AdvSimd.Arm64.ZipLow(nibbles, Vector128<byte>.Zero); // zip1
var uzp2 = AdvSimd.Arm64.UnzipOdd(nibbles, nibbles); // uzp2
var ushl = AdvSimd.ShiftLeftLogicalWideningLower(uzp2.GetLower(), 4); // ushll 4
output = AdvSimd.AddSaturate(zip1, ushl.AsByte()); // sqadd
Unfortunately it doesn't produce the expected results 🙁
Unsure about the `uzp2`, but I think the principle is that because the multiplier is effectively multiplying even elements by 16 and odd elements by 1, and we know both inputs are unsigned, we can do this:
var even = AdvSimd.Arm64.ZipLow(nibbles, Vector128<byte>.Zero).AsInt16();
var odd = AdvSimd.Arm64.ZipHigh(nibbles, Vector128<byte>.Zero).AsInt16();
even = AdvSimd.ShiftLeftLogical(even, 4);
output = AdvSimd.AddSaturate(even, odd);
This gives us a 4 instruction sequence: zip1, zip2, shl, sqadd.
This works because `ZipLow` and `ZipHigh` interleave the elements. That is, `ZipLow` is:
for (int p = 0; p < Vector128<T>.Count / 2; p++)
{
result[(2 * p) + 0] = left[p];
result[(2 * p) + 1] = right[p];
}
While `ZipHigh` takes the upper halves, i.e.:
for (int p = 0; p < Vector128<T>.Count / 2; p++)
{
    result[(2 * p) + 0] = left[p + (Vector128<T>.Count / 2)];
    result[(2 * p) + 1] = right[p + (Vector128<T>.Count / 2)];
}
This means we can get the effect of decoupling the even and odd elements while zero-extending at the same time.
We then just need to multiply the even results by 16, which is the same as shifting left by 4. We don't need to touch the odd results since they are just multiplied by 1.
Then we do the saturating add and are complete.
Ah, that's not quite right: `ZipLow` is doing lhs[0], rhs[0], lhs[1], rhs[1], ... and we want lhs[0], rhs[0], lhs[2], rhs[2], ...
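A scalar model (a Python sketch with hypothetical names) makes the mispairing visible: after reinterpreting the zipped bytes as shorts, the high and low nibbles of the same output byte land in different short lanes.

```python
def zip_low(a, b):
    """Scalar model of zip1: a[0], b[0], a[1], b[1], ..."""
    out = []
    for p in range(len(a) // 2):
        out += [a[p], b[p]]
    return out

def bytes_to_shorts(bs):
    """Little-endian reinterpret of byte lanes as 16-bit lanes."""
    return [bs[i] | (bs[i + 1] << 8) for i in range(0, len(bs), 2)]

nibbles = [0xA, 0xB, 0xC, 0xD] + [0] * 12  # decoded hex digits of "ABCD"
even = bytes_to_shorts(zip_low(nibbles, [0] * 16))
# even zero-extends the *adjacent* elements n0, n1, n2, n3 into short lanes,
# not the even-indexed ones, so (even << 4) + odd would pair wrong elements.
```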
This should work.. TransposeEven is the one that does lhs[0], rhs[0], lhs[2], rhs[2], ... (and odd indices for TransposeOdd):
var even = AdvSimd.Arm64.TransposeEven(nibbles, Vector128<byte>.Zero).AsInt16();
var odd = AdvSimd.Arm64.TransposeOdd(nibbles, even.AsByte()).AsInt16();
even = AdvSimd.ShiftLeftLogical(even, 4);
output = AdvSimd.AddSaturate(even, odd);
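Assuming the trn1/trn2 lane semantics described above, a scalar model (a Python sketch with hypothetical names) confirms that this sequence reassembles the bytes: trn1 with a zero vector zero-extends the even-indexed (high) nibbles into short lanes, trn2 does the same for the odd-indexed (low) nibbles, and shift+add combines them.

```python
def trn1(a, b):
    """Scalar model of TransposeEven: a[0], b[0], a[2], b[2], ..."""
    out = [0] * len(a)
    for p in range(len(a) // 2):
        out[2 * p] = a[2 * p]
        out[2 * p + 1] = b[2 * p]
    return out

def trn2(a, b):
    """Scalar model of TransposeOdd: a[1], b[1], a[3], b[3], ..."""
    out = [0] * len(a)
    for p in range(len(a) // 2):
        out[2 * p] = a[2 * p + 1]
        out[2 * p + 1] = b[2 * p + 1]
    return out

def bytes_to_shorts(bs):
    """Little-endian reinterpret of byte lanes as 16-bit lanes."""
    return [bs[i] | (bs[i + 1] << 8) for i in range(0, len(bs), 2)]

nibbles = [0xA, 0xB, 0xC, 0xD] + [0] * 12        # decoded digits of "ABCD"
even_bytes = trn1(nibbles, [0] * 16)             # [n0, 0, n2, 0, ...]
even = bytes_to_shorts(even_bytes)               # high nibbles, zero-extended
odd = bytes_to_shorts(trn2(nibbles, even_bytes)) # low nibbles, zero-extended
out = [(e << 4) + o for e, o in zip(even, odd)]  # shl 4 + add
```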
This should work.. TransposeEven is the one that does lhs[0], rhs[0], lhs[2], rhs[2], ...
Argh, yes sorry, I flipped the permute operations in my head. So it should be:
var zip1 = AdvSimd.Arm64.TransposeEven(nibbles, Vector128<byte>.Zero); // trn1
var uzp2 = AdvSimd.Arm64.UnzipOdd(nibbles, nibbles); // uzp2
var ushl = AdvSimd.ShiftLeftLogicalWideningLower(uzp2.GetLower(), 4); // ushll 4
output = AdvSimd.AddSaturate(zip1, ushl.AsByte()); // sqadd
or @tannergooding 's version above.
we don't get two movi reg, #0 emitted for Vector128.Zero
The loop doesn't look that big, and I assume there's no loop invariant motion? It might be worth lifting the 5 constants out of the loop. You'll definitely have enough left over on Advanced SIMD, but am unsure about AVX/SSE. But worth trying to see what comes out.
I assume there's no loop invariant motion?
We do loop invariant motion; it's just never done for simple things like zero vectors. Even if I do the hoisting by hand, it will be propagated back at its use. The main issue is that our RA is not perfect, so we try to minimize cheap cases occupying registers. We'll improve that eventually. Overall, most constant vectors are hoisted from this loop by the JIT (at least till the IsAscii check).
The main issue is that our RA is not perfect, so we try to minimize cheap cases occupying registers.
Ah, that's fine. `Zero` is typically free anyway; the `movi` idiom is handled at rename. So it should only really be an issue in cases where it prevents you from squeezing stuff inside a single fetch block. So no real priority, if the complicated ones are hoisted :)
Co-authored-by: Günther Foidl <[email protected]>
… vectorize-from-hex
No problem! I was just trying to minimize the number of open PRs. I quite often close my PRs and then re-open them after a while and finish them 🙂
@tannergooding anything else on this PR? I assume we could optimize more things for arm64 specifically here, but it's a trade-off between code readability/portability and perf?
I think what we have is good now, and makes up for the perf difference of xarch having 1 instruction vs Arm64 having 4.
Failure is #83655
Vectorize Convert.FromHexString (UTF-16) using "Algorithm 3" in http://0x80.pl/notesen/2022-01-17-validating-hex-parse.html by Geoff Langdale and Wojciech Mula.

Motivation

Convert.ToHexString is accelerated while Convert.FromHexString is not. Also, this PR's UTF8-based impl can be re-used for other things like Guid parsing (both UTF16 and UTF8) to e.g. speed up JSON parsing with GUIDs inside.

Benchmark

The regression for small inputs is within noise; I assume it's hidden by allocation overhead.
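On the scalar side, the per-digit decode can be done branchlessly with a well-known ASCII trick (a Python sketch; this is illustrative and not necessarily the exact algorithm from the linked note): digits '0'..'9' sit at 0x30..0x39 with bit 6 clear, while 'A'..'F'/'a'..'f' sit at 0x41+/0x61+ with bit 6 set, so adding 9 when bit 6 is set turns the letters' low nibble 1..6 into 10..15.

```python
def hex_nibble(c: str) -> int:
    """Branchless hex digit -> nibble value; valid only for [0-9A-Fa-f]."""
    b = ord(c)
    # (b >> 6) is 0 for digits, 1 for letters; letters need the +9 offset.
    return (b & 0xF) + 9 * (b >> 6)
```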