Skip to content

Conversation

brianpopow
Copy link
Collaborator

Prerequisites

  • I have written a descriptive pull-request title
  • I have verified that there are no overlapping pull-requests open
  • I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
  • I have provided test coverage for my change (where applicable)

Description

This PR add SSE2 version of Mean16x4 which is used during lossy encoding with mode 1.
Related to #1786

Profile Results

Before:
Mean

After:
mean_sse

The overall encoding time was 9002 ms.
Its an improvement, but only a small one. I guess every bit counts.

@codecov
Copy link

codecov bot commented Nov 7, 2021

Codecov Report

Merging #1814 (7d8225b) into master (7495a91) will increase coverage by 0.22%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1814      +/-   ##
==========================================
+ Coverage   87.13%   87.35%   +0.22%     
==========================================
  Files         936      936              
  Lines       48128    48154      +26     
  Branches     6037     6038       +1     
==========================================
+ Hits        41934    42063     +129     
+ Misses       5190     5092      -98     
+ Partials     1004      999       -5     
Flag Coverage Δ
unittests 87.35% <100.00%> (+0.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/ImageSharp/Formats/Webp/Lossy/LossyUtils.cs 100.00% <100.00%> (ø)
...rc/ImageSharp/Formats/Webp/Lossy/Vp8EncIterator.cs 100.00% <100.00%> (ø)
.../ImageSharp/Formats/Webp/Lossy/WebpLossyDecoder.cs 97.56% <100.00%> (ø)
src/ImageSharp/Formats/Webp/Lossy/YuvConversion.cs 99.33% <100.00%> (+0.05%) ⬆️
...rc/ImageSharp/Formats/Webp/Lossless/Vp8LEncoder.cs 97.51% <0.00%> (+0.12%) ⬆️
.../ImageSharp/Formats/Webp/Lossless/LosslessUtils.cs 97.54% <0.00%> (+8.85%) ⬆️
...ageSharp/Formats/Webp/Lossless/PredictorEncoder.cs 98.29% <0.00%> (+9.07%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7495a91...7d8225b. Read the comment docs.

Comment on lines 819 to 820
fixed (byte* inputPtr = input)
fixed (ushort* tmpPtr = tmp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same recommendations to avoid pinning. Alternatively you can pin YuvIn before the for (k = 0; k < 16; k += 4) loop, and pass pointers to the method.

ref ushort outputRef = ref MemoryMarshal.GetReference(tmp);
Unsafe.As<ushort, Vector128<ushort>>(ref outputRef) = f0.AsUInt16();

dc[0] = (uint)(tmp[1] + tmp[0]);
Copy link
Member

@JimBobSquarePants JimBobSquarePants Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks to me like if you reverse these span assignments you'll cut out 9 of 12 bounds checks.

Comment on lines 973 to 976
dc[0] = (uint)(tmp[1] + tmp[0]);
dc[1] = (uint)(tmp[3] + tmp[2]);
dc[2] = (uint)(tmp[5] + tmp[4]);
dc[3] = (uint)(tmp[7] + tmp[6]);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this the same as _mm_hadd_epi16 aka. Ssse3.HorizontalAdd?

I'm afraid 12 span indexer bound checks have measureable impact here. All of them seem unnecessary, since is tmp is always of 16 size and dc is always of 4 size. If we can't find any matching HorizontalAdd for this, maybe we should consider passing tmp as a pointer and and indexing dc with Unsafe.* stuff.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes it is the same as Ssse3.HorizontalAdd, good catch

[MethodImpl(InliningOptions.ShortMethod)]
public static void YuvToBgr(int y, int u, int v, Span<byte> bgr)
{
bgr[0] = (byte)YuvToB(y, u);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverse these also.

#if SUPPORTS_RUNTIME_INTRINSICS
if (Sse2.IsSupported)
{
tmp.Clear();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed? We override the contents in the end.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah i think you are right, its not needed

Reverse access to dc

Co-authored-by: James Jackson-South <[email protected]>
Copy link
Member

@JimBobSquarePants JimBobSquarePants left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍

Comment on lines 972 to 975
dc[3] = (uint)lower.GetElement(3);
dc[2] = (uint)lower.GetElement(2);
dc[1] = (uint)lower.GetElement(1);
dc[0] = (uint)lower.GetElement(0);
Copy link
Member

@antonfirsov antonfirsov Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure what does GetElement compile to but it can be terribly inefficient. dc[0...3] is 64 bits if I'm not mistaken:

Suggested change
dc[3] = (uint)lower.GetElement(3);
dc[2] = (uint)lower.GetElement(2);
dc[1] = (uint)lower.GetElement(1);
dc[0] = (uint)lower.GetElement(0);
Unsafe.As<uint, Vector64<short>>(ref MemoryMarshal.GetReference(dc)) = lower;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sorry, it's actually 128 bits. It needs some widening with intrinsics then.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this GetLower is not needed at all, now that i think again about it.

What about this: 1452ba0

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@antonfirsov so you think we should try to avoid GetElement entirely?

Copy link
Member

@antonfirsov antonfirsov Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a good practice to avoid GetElement whenever possible. It looks like it compiles to VPEXTRW extracting scalar registers from the SIMD register, but then you do scalar copy for each value. We can get much better results with shuffling:
SHARPLAB

I think we are looking for _mm_unpacklo_epi16 / Sse2.UnpackLow interleaving lower with zeros to get equivalent uint values, but not sure how would this act with negative short-s. Can hadd end up having negative values?

Sse2.UnpackLow(lower, Vector128<short>.Zero)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks indeed much better. I dont think there can be negative values.

Thanks for your help!

Copy link
Collaborator Author

@brianpopow brianpopow Nov 9, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using Sse2.Store looks even a tiny bit better. Should I go for that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dc is stackallocked, so you can pass it as a pointer easily if you want to, though I'm not sure if it's worth the trouble.
The main benefit would be the removal of one Slice() on the call site, the disadvantage is the inconsistency of the method signature (input is a span, dc is a pointer).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think its not worth it, lets keep the method signature consistent and keep it as it is.

This was referenced Jul 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants