Improvements to the "Sum" SIMD algorithm #1112

Merged
merged 1 commit into from
Oct 25, 2018

Conversation

tannergooding
Member

Cleans up the code so that there is a single "Sum" algorithm, rather than one for aligned inputs and one for unaligned inputs.

For inputs with fewer elements than fit in the Vector type, it falls back to scalar code.
For inputs that are not naturally aligned (the address is not a multiple of 4 bytes), it does exclusively unaligned loads.
For all other inputs, it does at most two unaligned loads (one each for any leading and trailing unaligned elements); all other loads are aligned.
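
The three cases above can be sketched as a small decision function. This is only an illustration and not the code from this PR: `PlanLoads`, the string labels, and the 128-bit (4-float) vector width are assumptions made for the example, and the address is passed as a plain integer so that the sketch runs without unsafe code.

```csharp
using System;

// Hypothetical helper (not from the PR): given a byte address and an element
// count, report which of the three cases applies and how the elements split.
static (string Kind, int Leading, int AlignedVectors, int Trailing) PlanLoads(ulong address, int length)
{
    const int ElementBytes = 4;                    // sizeof(float)
    const int VectorBytes = 16;                    // assumed 128-bit vector
    const int Lanes = VectorBytes / ElementBytes;  // 4 floats per vector

    // Case 1: fewer elements than one vector holds, so use scalar code.
    if (length < Lanes)
        return ("scalar", 0, 0, 0);

    // Case 2: the address is not a multiple of sizeof(float), so stepping by
    // whole floats never reaches a 16-byte boundary; every load is unaligned.
    if (address % ElementBytes != 0)
        return ("all-unaligned", 0, 0, 0);

    // Case 3: one unaligned load covers the leading elements up to the next
    // 16-byte boundary, aligned loads cover the body, and one more unaligned
    // load covers whatever is left at the end.
    int misalignment = (int)(address % VectorBytes);
    int leading = misalignment == 0 ? 0 : (VectorBytes - misalignment) / ElementBytes;
    int remaining = length - leading;
    return ("aligned-body", leading, remaining / Lanes, remaining % Lanes);
}

// 10 floats starting 4 bytes past a 16-byte boundary:
// 3 leading elements, 1 aligned vector, 3 trailing elements.
Console.WriteLine(PlanLoads(0x1004, 10));
```

The sketch only counts elements; in the real algorithm the leading and trailing loads overlap the aligned body and are masked so that no element is added twice.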

@tannergooding
Member Author

CC. @danmosemsft, @eerhardt, @Anipik

@tannergooding
Member Author

This is a simple example of #836

@tannergooding
Member Author

Results in some minor perf improvements for the Microsoft.ML.CpuMath.PerformanceTests.

Before:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| Avx.SumU | 149.6 us | 0.9997 us | 0.9351 us |
| Native.SumU | 262.2 us | 2.064 us | 1.931 us |
| Sse.SumU | 263.1 us | 1.504 us | 1.407 us |

After:

| Method | Mean | Error | StdDev |
|---|---|---|---|
| Avx.Sum | 129.7 us | 1.103 us | 0.9775 us |
| Native.Sum | 261.2 us | 1.013 us | 0.8983 us |
| Sse.Sum | 255.3 us | 1.521 us | 1.422 us |

@tannergooding
Member Author

This can serve as the basis for the other algorithms as well. Generally the only tweaking that needs to happen is when dealing with leading/trailing elements, where you may need additional masking/etc to get the elements lined up correctly. For example, Scale requires you to ensure the masked out elements are the original value, rather than zero (which requires a couple additional instructions).
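
To illustrate the Scale remark, here is a scalar model of a masked 4-lane tail step. It is a sketch under assumptions rather than the project's code: `ScaleTail` is a hypothetical name, the lanes are simulated with a plain loop, and the reading that the original values are needed because the whole block is stored back is my interpretation of the comment above.

```csharp
using System;

// Hypothetical scalar model of a masked 4-lane tail step (not the PR's code).
// `remainder` is the number of trailing elements not yet processed (1..3);
// the earlier elements of the block are assumed to be scaled already.
static void ScaleTail(float[] data, float scale, int remainder)
{
    const int Lanes = 4;
    int start = data.Length - Lanes;   // back up so the block ends at the array's end

    for (int lane = 0; lane < Lanes; lane++)
    {
        bool unprocessed = lane >= Lanes - remainder;
        float original = data[start + lane];

        // Sum can use 0 for masked-out lanes, because adding 0 changes nothing.
        // Here the whole block is written back, so masked-out lanes must keep
        // their current value instead of becoming 0 or being scaled twice.
        data[start + lane] = unprocessed ? original * scale : original;
    }
}

// Six values scaled by 2, where the first four were already handled by the main loop.
float[] values = { 2f, 4f, 6f, 8f, 5f, 6f };
ScaleTail(values, 2f, 2);
Console.WriteLine(string.Join(", ", values));
```

In vector code, the per-lane choice between the scaled and the original value would presumably be done with mask and blend style operations, which is where the extra instructions come from.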

@tannergooding
Member Author

The test failure is in NormalizerTests.LpGcNormAndWhiteningWorkout and is due to a baseline diff: the last couple of digits of the result may differ because different indices are summed together depending on the alignment of the input data (for floating point, a + b + c can produce a different result than a + c + b).
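
A minimal, self-contained demonstration of that ordering effect follows. The values are chosen to make the difference obvious and are not taken from the test data; `SumInTwoOrders` is a hypothetical helper.

```csharp
using System;

// Adds the same three values in two different orders. The explicit casts force
// every intermediate result to be rounded to single precision.
static (float First, float Second) SumInTwoOrders(float a, float b, float c)
{
    float abThenC = (float)((float)(a + b) + c);   // (a + b) + c
    float acThenB = (float)((float)(a + c) + b);   // (a + c) + b
    return (abThenC, acThenB);
}

// Near 1e8 adjacent floats are 8 apart, so 1e8f + 1f rounds back to 1e8f
// and the 1 is lost in the second ordering.
var (inOrder, reordered) = SumInTwoOrders(1e8f, -1e8f, 1f);
Console.WriteLine($"{inOrder} vs {reordered}");   // 1 vs 0
```

The differences in the baseline are far smaller than in this contrived case, but they come from the same effect: changing which elements share a vector lane changes the order of the additions.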

@tannergooding
Member Author

tannergooding commented Oct 2, 2018

| Baseline | Diff | Delta |
|---|---|---|
| -0.176903129 | -0.1769031 | -0.000000029 |
| 0.114987023 | 0.114987031 | -0.000000008 |
| -0.153417692 | -0.153417677 | -0.000000015 |
| -0.109801926 | -0.109801918 | -0.000000008 |
| -0.0158602837 | -0.0158602744 | -0.0000000093 |
| 0.0344656035 | 0.0344656147 | -0.0000000112 |
| 0.160775661 | 0.160775676 | -0.000000015 |
| 0.169217348 | 0.169217363 | -0.000000015 |
| 0.122788094 | 0.122788109 | -0.000000015 |
| 0.17765902 | 0.177659035 | -0.000000015 |

Full Baseline

#@ TextLoader{
#@   sep=tab
#@   col=lpnorm:R4:0-10
#@   col=gcnorm:R4:11-21
#@   col=whitened:R4:22-32
#@ }
-0.686319232	0.192169383	-0.152238086	0.03493989	0.346903175	0.09483684	-0.132272437	-0.124785319	-0.5315855	-0.0973325446	0.114802495	-0.626524031	0.289601743	-0.0695612058	0.125636056	0.4509648	0.188099176	-0.0487401523	-0.04093227	-0.465160966	-0.0123033375	0.208920211	-2.604605	0.829638362	-0.5992434	0.19860521	1.33247662	0.369197041	-0.5760094	-0.5490271	-1.94509208	-0.393351972	0.507488966
-0.20306389	-0.1231699	-0.039946992	0.183090389	-0.3328916	0.279628932	-0.0066578323	0.432759076	-0.0798939839	-0.1664458	-0.7057302	-0.137441739	-0.055349838	0.0301625486	0.259335726	-0.270841062	0.3585301	0.0643675	0.5158729	-0.0108833946	-0.09981628	-0.653936446	-0.5923902	-0.324390084	-0.114805378	0.6855182	-1.055579	0.8767955	-0.0392023772	1.21807373	-0.160801888	-0.47570774	-2.22817
-0.268398017	-0.28734377	0.571529865	0.006315247	-0.246294647	-0.445224941	-0.344181	-0.20524554	0.284186125	-0.116832078	-0.06946772	-0.176903129	-0.19703348	0.715542555	0.114987023	-0.153417692	-0.3647864	-0.257424533	-0.109801926	0.410232216	-0.0158602837	0.0344656035	-0.9132714	-0.911281645	1.814283	0.07471426	-0.8969923	-1.44387519	-1.19571114	-0.6542767	0.887983143	-0.4604767	-0.17543222
0.117021732	0.438831449	-0.100304335	0.125380427	-0.413755417	0.0794076	0.133739114	-0.397038	-0.497342378	-0.2632989	0.313451052	0.160775661	0.485780418	-0.0587080531	0.169217348	-0.3752711	0.122788094	0.17765902	-0.358387738	-0.459687948	-0.223320842	0.3591552	0.236966148	1.004758	-0.233154371	0.3862052	-1.02724624	0.240614042	0.299898773	-1.03102541	-1.13852251	-0.6675951	0.766793966

Full Diff:

#@ TextLoader{
#@   sep=tab
#@   col=lpnorm:R4:0-10
#@   col=gcnorm:R4:11-21
#@   col=whitened:R4:22-32
#@ }
-0.686319232	0.192169383	-0.152238086	0.03493989	0.346903175	0.09483684	-0.132272437	-0.124785319	-0.5315855	-0.0973325446	0.114802495	-0.626524031	0.289601743	-0.0695612058	0.125636056	0.4509648	0.188099176	-0.0487401523	-0.04093227	-0.465160966	-0.0123033375	0.208920211	-2.604605	0.829638362	-0.5992434	0.19860521	1.33247662	0.369197041	-0.5760094	-0.5490271	-1.94509208	-0.393351972	0.507488966
-0.20306389	-0.1231699	-0.039946992	0.183090389	-0.3328916	0.279628932	-0.0066578323	0.432759076	-0.0798939839	-0.1664458	-0.7057302	-0.137441739	-0.055349838	0.0301625486	0.259335726	-0.270841062	0.3585301	0.0643675	0.5158729	-0.0108833946	-0.09981628	-0.653936446	-0.5923902	-0.324390084	-0.114805378	0.6855182	-1.055579	0.8767955	-0.0392023772	1.21807373	-0.160801888	-0.47570774	-2.22817
-0.268398017	-0.28734377	0.571529865	0.006315247	-0.246294647	-0.445224941	-0.344181	-0.20524554	0.284186125	-0.116832078	-0.06946772	-0.1769031	-0.19703348	0.715542555	0.114987031	-0.153417677	-0.3647864	-0.257424533	-0.109801918	0.410232216	-0.0158602744	0.0344656147	-0.9132714	-0.911281645	1.814283	0.07471426	-0.8969923	-1.44387519	-1.19571114	-0.6542767	0.887983143	-0.4604767	-0.17543222
0.117021732	0.438831449	-0.100304335	0.125380427	-0.413755417	0.0794076	0.133739114	-0.397038	-0.497342378	-0.2632989	0.313451052	0.160775676	0.485780418	-0.0587080531	0.169217363	-0.3752711	0.122788109	0.177659035	-0.358387738	-0.459687948	-0.223320842	0.3591552	0.236966148	1.004758	-0.233154371	0.3862052	-1.02724624	0.240614042	0.299898773	-1.03102541	-1.13852251	-0.6675951	0.766793966

@danmoseley danmoseley requested a review from Anipik October 5, 2018 23:06
result128 = Sse.AddScalar(result128, Sse.LoadScalarVector128(pSrcCurrent));
pSrcCurrent++;
// Handle any trailing elements that don't fit into a 128-bit block by moving back so that the next
// unaligned load will read to the end of the array and then mask out any elements already processed
Member

Should this be "next aligned load"?

Member Author

No, we are moving back from an aligned address to an unaligned one.
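
The trailing-element handling discussed in this thread can be modelled in scalar code. This is a sketch under assumptions and not the PR's implementation: `SumWithTail` is a hypothetical name, the four vector lanes are simulated with an array, and the order of the final horizontal add is chosen arbitrarily.

```csharp
using System;

// Hypothetical scalar model of a 4-lane vector sum (not the PR's code).
static float SumWithTail(float[] src)
{
    const int Lanes = 4;

    // Too few elements for one block: plain scalar sum.
    if (src.Length < Lanes)
    {
        float scalarSum = 0f;
        foreach (float value in src) scalarSum += value;
        return scalarSum;
    }

    float[] acc = new float[Lanes];   // stands in for the vector accumulator
    int i = 0;

    // Main loop: whole blocks (these would be aligned loads in the real code).
    for (; i + Lanes <= src.Length; i += Lanes)
        for (int lane = 0; lane < Lanes; lane++)
            acc[lane] += src[i + lane];

    int remainder = src.Length - i;
    if (remainder != 0)
    {
        // Move back from the current (aligned) position so that one more block
        // reads exactly to the end of the array. That block starts at an
        // unaligned position and overlaps elements that were already summed.
        int start = src.Length - Lanes;
        for (int lane = 0; lane < Lanes; lane++)
        {
            bool alreadyProcessed = lane < Lanes - remainder;
            acc[lane] += alreadyProcessed ? 0f : src[start + lane];   // mask out overlap
        }
    }

    // Horizontal add of the lanes.
    return (acc[0] + acc[1]) + (acc[2] + acc[3]);
}

Console.WriteLine(SumWithTail(new float[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 }));   // 55
```

Backing up and masking avoids both a scalar tail loop and any read past the end of the array.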

@tannergooding
Member Author

CC. @eerhardt, @Anipik for review.

{
// bitwise comparison is needed because Abs(Inf-Inf) and Abs(NaN-NaN) are not 0s.
return FloatUtils.GetBits(x) == FloatUtils.GetBits(y) || Math.Abs(x - y) < DoubleEps;
}

private const float SingleEps = 1e-6f;

private static bool EqualWithEpsSingle(float x, float y)
Member

Nit (not a new issue): it would be nice if the code consistently used either all C# or all .NET names for built-in types (float, Double, etc.).

Member Author

Right. I am following the .NET framework guidelines for names here. Ideally, we would fix up the rest of the names to match (as has already been done in most of the public surface area).

@@ -1061,29 +1061,123 @@ public static unsafe void MulElementWiseU(ReadOnlySpan<float> src1, ReadOnlySpan
}
}

-public static unsafe float SumU(ReadOnlySpan<float> src)
+public static unsafe float Sum(ReadOnlySpan<float> src)
Member

Assuming you're running on a machine supporting AVX -- unit tests would not hit this -- unless you ran them with the env variable set?

Member

I assume not, since @fiigii's change hasn't gone in yet.

Member Author

We have unit/perf tests that explicitly call these methods/code paths.

Member

@eerhardt eerhardt left a comment

:shipit:

Contributor

@Anipik Anipik left a comment

LGTM

@tannergooding
Member Author

Rebased to resolve conflicts.

@tannergooding tannergooding merged commit 76d1203 into dotnet:master Oct 25, 2018
@ghost ghost locked as resolved and limited conversation to collaborators Mar 28, 2022
4 participants