Description
This report gives an overview of the major performance improvements and regressions in WASM, Mono AOT, and the Mono Interpreter across the .NET 8 preview releases. It focuses on relevant improvements and regressions that are either in progress or under investigation; these are tracked separately. Issues #77490 and #79288 track active speed and size regressions, respectively.
A full benchmark report will be available, in a form similar to #79245 and https://devblogs.microsoft.com/dotnet/performance_improvements_in_net_7/, when .NET 8 is released.
Setup
According to https://github.com/dotnet/perf-autofiling-issues, the following configurations are used.
Operating System | Architecture | Processor Name |
---|---|---|
macOS 13.0 | Arm64 | Apple M1 |
ubuntu 18.04 | X64 | Intel Xeon CPU E5-1650 v4 3.60GHz |
More details on .NET performance benchmarking are available at https://github.com/dotnet/performance.
Preview 7
The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.
Mono AOT compiler
The performance regressions and improvements are analyzed separately in #89238.
Mono Interpreter
The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 7.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 7.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
PerfLabTests.EnumPerf.EnumEquals | 646.25 | 229.29 | -64.52 |
System.Tests.Perf_Enum.ToString_NonFlags_Small(value: TopDirectoryOnly) | 633.28 | 235.90 | -62.74 |
System.Tests.Perf_Enum.ToString_Format_Flags_Large(value: All, format: "g") | 667.24 | 271.04 | |
System.Reflection.Attributes.IsDefinedClassHitInherit | 1315.59 | 562.93 | -57.21 |
System.Reflection.Activator<EmptyStruct>.CreateInstanceGeneric | 721.39 | 330.82 | -54.14 |
System.Numerics.Tests.Perf_Vector4.SubtractOperatorBenchmark | 20.82 | 9.59 | -53.92 |
System.Reflection.Invoke.Method0_NoParms | 853.86 | 399.59 | -53.20 |
System.Numerics.Tests.Perf_Matrix4x4.CreateRotationZBenchmark | 78.54 | 40.02 | -49.03 |
System.Reflection.Attributes.IsDefinedMethodBaseMissInherit | 2512.81 | 1431.26 | -43.04 |
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByScalarBenchmark | 183.31 | 106.83 | -41.71 |
System.Tests.Perf_Enum.InterpolateIntoStringBuilder_Flags(value: 32) | 7501.15 | 4383.76 | -41.55 |
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark | 189.92 | 111.79 | -41.13 |
System.IO.Tests.Perf_RandomAccess.ReadScatter(fileSize: 1048576, buffersSize: 16384, options: None) | 400115.22 | | |
System.Numerics.Tests.Perf_Matrix4x4.CreateRotationXWithCenterBenchmark | 90.04 | 60.34 | -32.98 |
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, IgnoreCase, True)) | 1024.28 | | |
System.Tests.Perf_Enum.StringFormat(value: Red, Green) | 7002.80 | 4942.10 | |
System.Tests.Perf_Enum.ToString_Flags(value: Red, Orange, Yellow, Green, ... | | | |
System.Numerics.Tests.Perf_VectorOf<Byte>.AddBenchmark | 11.28 | 8.19 | -27.44 |
System.Numerics.Tests.Perf_Vector4.DivideByScalarBenchmark | 30.25 | 21.97 | -27.36 |
System.Numerics.Tests.Perf_Vector2.EqualsBenchmark | 35.85 | 27.68 | -22.78 |
Vectorization of Vector4 in #87822 improved over 100 microbenchmarks, as reported in dotnet/perf-autofiling-issues#19758 and dotnet/perf-autofiling-issues#19760.
Fix path for empty partition in Enumerable.Select in #88425 improved the EmptyTakeSelectToArray microbenchmarks, as reported in dotnet/perf-autofiling-issues#19761.
Optimizing the BigInteger operators +, -, and * for trivial cases in #84733 improved some BigInteger microbenchmarks, as reported in dotnet/perf-autofiling-issues#19762.
Precomputing the CallInfo structure in #88369 improved about 200 microbenchmarks.
The BCL change #86287 and vectorization of Vector128 in #88064 improved a dozen Equals microbenchmarks.
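The % Difference column in these tables appears to be the relative change of the compare value against the baseline value. The following is a minimal sketch of that calculation, assuming the values are durations where lower is better; the function name `percent_difference` is illustrative and not part of the benchmark tooling. The sample rows are copied from the Preview 7 interpreter tables.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Relative change of `compare` against `baseline`, in percent.

    Negative values mean the compare run took less time (an improvement);
    positive values mean it took more time (a regression).
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compare - baseline) / baseline * 100.0


# (name, baseline value, compare value), copied from the Preview 7 tables.
rows = [
    ("PerfLabTests.EnumPerf.EnumEquals", 646.25, 229.29),
    ("System.Reflection.Invoke.Method0_NoParms", 853.86, 399.59),
    ("System.Collections.CtorFromCollection<String>.FrozenDictionary(Size: 512)",
     44266.49, 396363.53),
]

for name, baseline, compare in rows:
    print(f"{name} | {percent_difference(baseline, compare):.2f}")
```

Because the published values are rounded to two decimals, recomputing a percentage from them can differ from the reported figure in the last digit.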
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 7.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Collections.CtorFromCollection<String>.FrozenDictionary(Size: 512) | 44266.49 | 396363.53 | 795.40 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.EqualsAllBenchmark | 6.90 | 9.58 | 38.82 |
Microsoft.Extensions.DependencyInjection.TimeToFirstService.Scoped(Mode: "Expressions") | 49567.25 | 65031.35 | 31.19 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.BitwiseOrOperatorBenchmark | 9.62 | 12.45 | 29.41 |
System.Numerics.Tests.Perf_VectorOf<SByte>.OnesComplementOperatorBenchmark | 6.04 | 7.80 | 29.23 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.AllBitsSetBenchmark | 2.04 | 2.61 | 28.32 |
System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000) | 4495.94 | 5733.46 | 27.52 |
System.Memory.Span<Char>.SequenceEqual(Size: 33) | 85.83 | 108.56 | 26.49 |
System.Numerics.Tests.Perf_VectorOf<Single>.AddOperatorBenchmark | 7.67 | 9.58 | 24.98 |
Microsoft.Extensions.DependencyInjection.TimeToFirstService.Scoped(Mode: "ILEmit") | 49928.88 | 62377.01 | 24.93 |
System.Memory.Constructors<String>.SpanFromArray | 15.59 | 19.40 | 24.46 |
Microsoft.Extensions.DependencyInjection.ScopeValidation.TransientWithScopeValidation | 1815.08 | 2227.85 | 22.74 |
System.Numerics.Tests.Perf_VectorOf<Int64>.EqualityOperatorBenchmark | 6.56 | 7.77 | 18.48 |
System.IO.Tests.Perf_File.CopyToOverwrite(size: 4096) | 47118.52 | 55507.12 | 17.80 |
System.Tests.Perf_Decimal.TryParse(value: "123456.789") | 895.48 | 1023.98 | 14.34 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.AllBitsSetBenchmark | 1.48 | 1.69 | 14.11 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.AndNotBenchmark | 9.16 | 10.44 | 13.96 |
System.Memory.Span<Byte>.IndexOfValue(Size: 33) | 58.20 | 65.95 | 13.31 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.BitwiseOrOperatorBenchmark | 7.62 | 8.61 | 12.96 |
System.Tests.Perf_Int32.ParseSpan(value: "2147483647") | 206.91 | 233.69 | 12.94 |
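A "top N regressed" list like the one above can be derived from raw baseline and compare values by ranking on percent difference. The sketch below is one way to do that, not the tooling used for this report; the `Result` type and `top_regressions` helper are hypothetical names, and the sample rows are copied from the Preview 7 tables.

```python
from typing import NamedTuple


class Result(NamedTuple):
    name: str
    baseline: float
    compare: float

    @property
    def pct(self) -> float:
        """Percent change of compare against baseline."""
        return (self.compare - self.baseline) / self.baseline * 100.0


def top_regressions(results, n=20):
    """Return up to `n` results with the largest positive percent difference."""
    regressed = [r for r in results if r.pct > 0]
    return sorted(regressed, key=lambda r: r.pct, reverse=True)[:n]


results = [
    Result("System.Memory.Span<Char>.SequenceEqual(Size: 33)", 85.83, 108.56),
    Result("System.Collections.CtorFromCollection<String>.FrozenDictionary(Size: 512)",
           44266.49, 396363.53),
    # An improvement (negative percent difference), so it is filtered out.
    Result("PerfLabTests.EnumPerf.EnumEquals", 646.25, 229.29),
    Result("System.Tests.Perf_GC<Byte>.NewOperator_Array(length: 10000)",
           4495.94, 5733.46),
]

for r in top_regressions(results, n=2):
    print(f"{r.name} | {r.baseline:.2f} | {r.compare:.2f} | {r.pct:.2f}")
```

Ranking on percent difference alone favors benchmarks with very small baselines, which is worth keeping in mind when reading the vector `CountBenchmark` rows.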
Preview 6
The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.
Mono AOT WASM
The following sections present improvements and regressions introduced in Mono AOT WASM in Preview 6.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 6.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark | 0.38 | 0.00 | -100 |
System.Numerics.Tests.Perf_Quaternion.NegationOperatorBenchmark | 1.87 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.CountBenchmark | 0.34 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark | 0.22 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.InequalityOperatorBenchmark | 0.97 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.CountBenchmark | 0.29 | 0.00 | -100 |
System.Tests.Perf_Enum.HasFlag | 1.35 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.EqualityOperatorBenchmark | 2.28 | 0.01 | < |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark | 0.22 | 0.00 | -99.57 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.GreaterThanAllBenchmark | 2.50 | 0.02 | -99.35 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark | 85.94 | 2.58 | -97.00 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark | 85.93 | 4.27 | -95.02 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark | 85.94 | 4.30 | -94.99 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark | 85.93 | 4.35 | -94.94 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.LessThanOrEqualBenchmark | 2.91 | 0.26 | -91.04 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmark | 2.26 | 0.25 | -88.80 |
System.Numerics.Tests.Perf_Vector3.UnitZBenchmark | 3.84 | 0.54 | -85.93 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.BitwiseAndBenchmark | 4.07 | 0.69 | -83.07 |
System.Runtime.Intrinsics.Tests.Perf_Vector128.FloorFloatBenchmark | 20.82 | 3.59 | -82.73 |
System.Net.Primitives.Tests.IPAddressPerformanceTests.TryWriteBytes(address: 1020:3040:5060:7080:9010:1112:1314:1516) | 78.86 | 13.78 | -82.52 |
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 6.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark | 0.00 | 0.14 | 26004.19 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark | 0.00 | 0.07 | 12106.45 |
System.Numerics.Tests.Perf_VectorOf<Double>.CountBenchmark | 0.09 | 3.36 | 3767.73 |
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark | 0.00 | 0.06 | 2106.86 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.AllBitsSetBenchmark | 1.95 | 10.77 | 452.08 |
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark | 0.00 | 0.01 | 405.57 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.MaxBenchmark | 0.75 | 3.50 | 365.24 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.DotBenchmark | 0.87 | 3.58 | 312.42 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualBenchmark | 0.92 | 3.67 | 300.46 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GreaterThanOrEqualBenchmark | 0.92 | 3.55 | 286.90 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.DotBenchmark | 0.78 | 2.61 | 236.42 |
System.Numerics.Tests.Perf_VectorOf<SByte>.OnesComplementOperatorBenchmark | 0.75 | 2.51 | 236.33 |
System.Numerics.Tests.Perf_VectorOf<SByte>.BitwiseOrBenchmark | 2.62 | 8.52 | 225.70 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.ZeroBenchmark | 2.00 | 5.96 | 198.55 |
System.Numerics.Tests.Perf_VectorOf<Int64>.ZeroBenchmark | 1.98 | 5.88 | 196.21 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.MultiplyBenchmark | 3.10 | 9.12 | 194.26 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark | 0.98 | 2.75 | 180.71 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.EqualsBenchmark | 0.98 | 2.69 | 174.16 |
System.Numerics.Tests.Perf_VectorOf<SByte>.UnaryNegateOperatorBenchmark | 1.08 | 2.80 | 159.06 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MinBenchmark | 2.70 | 6.92 | 156.32 |
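Several rows above list a baseline of 0.00 next to a percent difference in the thousands. One plausible reading, which this report does not confirm, is that the displayed values are rounded to two decimals while the ratio was computed on unrounded measurements, so a near-zero baseline inflates the result. The sketch below illustrates the effect with made-up unrounded values; the `floor` threshold in `is_reliable` is a hypothetical filter, not something the benchmark tooling is known to apply.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Percent change of compare against baseline."""
    return (compare - baseline) / baseline * 100.0


def is_reliable(baseline: float, compare: float, floor: float = 0.5) -> bool:
    """Hypothetical filter: trust a row only if both values reach `floor`."""
    return min(baseline, compare) >= floor


# Made-up unrounded values: the baseline would display as 0.00.
tiny_baseline, tiny_compare = 0.0005, 0.14
print(f"{percent_difference(tiny_baseline, tiny_compare):.0f}%",
      "reliable:", is_reliable(tiny_baseline, tiny_compare))

# Values copied from the AllBitsSetBenchmark row above (1.95 -> 10.77).
print(f"{percent_difference(1.95, 10.77):.0f}%",
      "reliable:", is_reliable(1.95, 10.77))
```

Under this assumption, the first row reports a change of tens of thousands of percent for an absolute difference of about 0.14, while the second row reflects a measurable slowdown.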
Mono AOT compiler
The performance regressions and improvements are analyzed separately in #89238.
Mono Interpreter
The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 6.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 6.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Double>.CountBenchmark | 0.00 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark | 0.02 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark | 0.00 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark | 0.40 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<SByte>.OneBenchmark | 76.06 | 1.57 | -97.93 |
System.Numerics.Tests.Perf_VectorOf<Byte>.OneBenchmark | 76.01 | 1.87 | -97.53 |
System.Numerics.Tests.Perf_VectorOf<SByte>.NegateBenchmark | 221.32 | 6.26 | -97.16 |
System.Numerics.Tests.Perf_VectorOf<SByte>.UnaryNegateOperatorBenchmark | 221.61 | 6.27 | -97.16 |
System.Numerics.Tests.Perf_VectorOf<Byte>.UnaryNegateOperatorBenchmark | 214.44 | 6.20 | -97.10 |
System.Numerics.Tests.Perf_VectorOf<Byte>.NegateBenchmark | 214.55 | 6.37 | -97.02 |
System.Numerics.Tests.Perf_VectorOf<SByte>.SubtractBenchmark | 231.29 | 7.90 | -96.58 |
System.Numerics.Tests.Perf_VectorOf<SByte>.SubtractionOperatorBenchmark | 221.04 | 7.90 | -96.42 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.OneBenchmark | 50.92 | 1.83 | -96.41 |
System.Numerics.Tests.Perf_VectorOf<Byte>.AddBenchmark | 216.21 | 7.83 | -96.37 |
System.Numerics.Tests.Perf_VectorOf<Byte>.SubtractBenchmark | 214.79 | 7.79 | -96.37 |
System.Numerics.Tests.Perf_VectorOf<Byte>.SubtractionOperatorBenchmark | 215.60 | 7.92 | -96.32 |
System.Numerics.Tests.Perf_VectorOf<SByte>.MultiplyOperatorBenchmark | 225.86 | 8.35 | -96.30 |
System.Numerics.Tests.Perf_VectorOf<Byte>.AddOperatorBenchmark | 209.41 | 7.95 | -96.20 |
System.Numerics.Tests.Perf_VectorOf<SByte>.MultiplyBenchmark | 217.21 | 8.39 | -96.13 |
System.Numerics.Tests.Perf_VectorOf<SByte>.AddOperatorBenchmark | 214.44 | 8.33 | -96.11 |
Vectorization of `Vector<T>` operators improved over 200 microbenchmarks, as reported in dotnet/perf-autofiling-issues#18537.
Changes in #87219 introduced `Math.BigMul` in the NextUInt64 random method and improved several microbenchmarks reported in dotnet/perf-autofiling-issues#18690.
About 120 microbenchmarks improved in dotnet/perf-autofiling-issues#19027, potentially due to #87555 or other interpreter and BCL changes.
Frozen dictionary creation improved by 72% in #87510.
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 6.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Int64>.CountBenchmark | 0.01 | 0.23 | 2775.54 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark | 0.01 | 0.17 | 2177.17 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.ZeroBenchmark | 2.24 | 4.95 | 121.29 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.EqualityOperatorBenchmark | 7.65 | 16.63 | 117.46 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.OnesComplementOperatorBenchmark | 3.03 | 6.11 | 101.75 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark | 0.04 | 0.08 | 86.25 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark | 18.37 | 33.12 | 80.26 |
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Validated(ssl: True, chunkedResponse: False, responseLength: 100000) | 2230622.93 | 3965252.94 | 77.76 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark | 0.12 | 0.20 | 69.81 |
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 100000) | 2181340.94 | 3635706.61 | 66.67 |
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark | 18.27 | 30.07 | 64.56 |
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark | 1.36 | 2.10 | 55.23 |
HardwareIntrinsics.RayTracer.SoA.Render | 1.15 | 1.76 | 52.81 |
System.Numerics.Tests.Perf_Vector2.DivideByScalarBenchmark | 13.77 | 20.17 | 46.46 |
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: True, responseLength: 100000) | 2621801.93 | 3807493.79 | 45.22 |
System.Runtime.Intrinsics.Tests.Perf_Vector128.ConvertDoubleToLongBenchmark | 64.48 | 89.74 | 39.17 |
System.Linq.Tests.Perf_Enumerable.WhereSingleOrDefault_LastElementMatches(input: Array) | 2714.67 | 3708.23 | 36.59 |
System.Memory.Constructors_ValueTypesOnly<Byte>.SpanFromPointerLength | 6.95 | 9.47 | 36.28 |
Span.IndexerBench.CoveredIndex3(length: 1024) | 16595.22 | 22106.92 | 33.21 |
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) | 867.68 | 1154.02 | 33.00 |
Preview 5
Preview 5 introduced a number of improvements worth calling out individually. The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.
Mono AOT compiler
The performance regressions and improvements are analyzed separately in #89238.
Mono Interpreter
The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 5.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 5.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark | 0.18 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.CountBenchmark | 0.10 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark | 0.01 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.CountBenchmark | 0.03 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark | 1.12 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.CountBenchmark | 0.22 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.CountBenchmark | 0.08 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.CountBenchmark | 0.48 | 0.00 | -99.74 |
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark | 0.14 | 0.00 | -99.30 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark | 2.36 | 0.12 | -95.07 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivideBenchmark | 127.11 | 7.82 | -93.85 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyOperatorBenchmark | 123.89 | 7.68 | -93.80 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyBenchmark | 126.45 | 7.94 | -93.71 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.MultiplyOperatorBenchmark | 125.08 | 7.87 | -93.70 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivisionOperatorBenchmark | 123.79 | 7.83 | -93.67 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.DivideBenchmark | 126.19 | 8.05 | -93.62 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.MultiplyBenchmark | 127.05 | 8.23 | -93.52 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.DivisionOperatorBenchmark | 123.95 | 8.22 | -93.37 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark | 0.06 | 0.01 | -86.49 |
System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000) | 483385521.57 | 66414495.75 | -86.26 |
Vectorization of IndexOf in #85437 improved `System.Text.RegularExpressions` microbenchmarks reported in dotnet/perf-autofiling-issues#17517.
The addition of Vector128 and PackedSimd in #82773 improved about 70 microbenchmarks reported in dotnet/perf-autofiling-issues#17563 and dotnet/perf-autofiling-issues#17819.
Changes in Plane and Quaternion improved several microbenchmarks in dotnet/perf-autofiling-issues#18043.
The change in #85528 addressed performance problems with code like `EqualityComparer<T>.Default.Equals()`, which improved over 200 microbenchmarks reported in dotnet/perf-autofiling-issues#18349.
The implementation of the float32 `Vector128.Equals` intrinsic improved `System.Numerics.Tests` microbenchmarks.
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 5.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_Vector2.ZeroBenchmark | 0.03 | 1.05 | 3076.49 |
System.Numerics.Tests.Perf_VectorOf<Double>.ZeroBenchmark | 2.96 | 9.10 | 207.86 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.BitwiseOrOperatorBenchmark | 8.51 | 21.64 | 154.37 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.GreaterThanOrEqualAnyBenchmark | 24.29 | 47.23 | 94.44 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.InequalityOperatorBenchmark | 3.94 | 7.15 | 81.24 |
System.Numerics.Tests.Perf_Plane.CreateFromVerticesBenchmark | 76.92 | 132.40 | 72.12 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.ConditionalSelectBenchmark | 11.14 | 17.45 | 56.64 |
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) | 1877.78 | 2918.99 | 55.44 |
System.Diagnostics.Perf_Process.StartAndWaitForExit | 1286337.51 | 1968645.19 | 53.04 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanAllBenchmark | 24.23 | 36.78 | 51.79 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.ZeroBenchmark | 2.99 | 4.47 | 49.41 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.SubtractionOperatorBenchmark | 7.62 | 11.13 | 45.99 |
System.Memory.Span<Char>.Reverse(Size: 512) | 789.89 | 1116.00 | 41.28 |
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) | 1963.38 | 2745.38 | 39.82 |
System.Numerics.Tests.Perf_VectorOf<Single>.LessThanAllBenchmark | 59.72 | 82.75 | 38.57 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualityOperatorBenchmark | 27.40 | 37.64 | 37.35 |
System.Globalization.Tests.StringSearch.IndexOf_Word_NotFound(Options: (, None, False)) | 6382.39 | 8678.93 | 35.98 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.OnesComplementBenchmark | 6.38 | 8.61 | 34.98 |
System.Numerics.Tests.Perf_VectorOf<Int64>.ZeroBenchmark | 2.81 | 3.78 | 34.72 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.LessThanOrEqualAllBenchmark | 26.61 | 35.79 | 34.51 |
Preview 4
Preview 4 introduced a number of improvements worth calling out individually. The following section presents only major improvements with high-level analysis. The analysis should be treated with caution; readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.
Mono AOT compiler
The following sections present improvements and regressions introduced in the Mono AOT compiler in Preview 4.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 4.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Tests.Perf_DateTime.ToString(format: "s") | 417.41 | 103.88 | -75.11 |
System.Tests.Perf_DateTimeOffset.ToString(format: "s") | 431.57 | 114.37 | -73.49 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 100000) | 25903.87 | 7803.06 | -69.87 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000) | 25653.57 | 7923.08 | -69.11 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000000) | 24916.24 | 7700.13 | -69.09 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 1000000) | 25328.88 | 7962.83 | -68.56 |
System.Collections.Tests.Add_Remove_SteadyState<Int32>.Queue(Count: 512) | 18.37 | 8.31 | -54.78 |
System.Threading.Tests.Perf_Volatile.Read_double | 0.26 | 0.12 | -53.92 |
System.Numerics.Tests.Perf_VectorOf<Byte>.ZeroBenchmark | 5.66 | 2.67 | -52.77 |
System.Net.Primitives.Tests.IPAddressPerformanceTests.TryFormat(address: 1020:3040:5060:7080:9010:1112:1314:1516) | 243.27 | 128.93 | -46.99 |
System.Numerics.Tests.Perf_Vector3.DistanceSquaredBenchmark | 16.92 | 9.15 | -45.90 |
System.Numerics.Tests.Perf_Vector3.DistanceBenchmark | 23.13 | 13.70 | -40.79 |
PerfLabTests.EnumPerf.ObjectGetType | 0.03 | 0.02 | -38.31 |
System.Numerics.Tests.Perf_Vector3.DivideByVector3OperatorBenchmark | 17.44 | 10.91 | -37.47 |
BCL changes in #84210 and #84210 improved `Guid.Parse` and vectorized all sets in `Regex`, as reported in dotnet/perf-autofiling-issues#15183 and dotnet/perf-autofiling-issues#15177.
The implementation of a fast path for mini_init_method_rgctx in #84226 improved over 50 microbenchmarks reported in dotnet/perf-autofiling-issues#15717, dotnet/perf-autofiling-issues#15796, and dotnet/perf-autofiling-issues#15799.
The `get_Count` and `get_AllBitsSet` intrinsics on arm64 improved around 400 microbenchmarks, as reported in dotnet/perf-autofiling-issues#15800, dotnet/perf-autofiling-issues#15718, and dotnet/perf-autofiling-issues#15797.
Allowing the inlining of methods containing constructor calls and intrinsifying additional calls to `Type:op_Equality` improved over 100 microbenchmarks reported in dotnet/perf-autofiling-issues#16371 and dotnet/perf-autofiling-issues#16509.
V128 SIMD intrinsics on Arm64 across all codegen engines in #84289 improved over 400 microbenchmarks reported in dotnet/perf-autofiling-issues#16460, dotnet/perf-autofiling-issues#16621, and dotnet/perf-autofiling-issues#16660.
Adding Vector128.ConvertXX and Vector128.Create as intrinsics on arm64 improved 48 microbenchmarks reported in dotnet/perf-autofiling-issues#17314 and dotnet/perf-autofiling-issues#17315.
Making Guid.HexsToChars aggressively inlined in #85322 improved a couple of microbenchmarks.
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 4.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Tests.Perf_String.Substring_IntInt(s: "dzsdzsDDZSDZSDZSddsz", i1: 7, i2: 4) | 23.92 | 42.38 | 77.13 |
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) | 14.05 | 23.66 | 68.37 |
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterInt32(value: 4) | 13.98 | 22.92 | 64.00 |
Benchstone.BenchI.IniArray.Test | 186909527.87 | 304502098.85 | 62.91 |
Span.IndexerBench.Ref(length: 1024) | 686.54 | 1110.42 | 61.74 |
System.Tests.Perf_Int64.TryParse(value: "9223372036854775807") | 58.15 | 93.40 | 60.60 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.DivideBenchmark | 23.30 | 37.16 | 59.44 |
System.Tests.Perf_Int64.TryParse(value: "-9223372036854775808") | 59.06 | 93.58 | 58.45 |
System.Tests.Perf_Int64.TryParseSpan(value: "9223372036854775807") | 59.71 | 93.89 | 57.26 |
System.Buffers.Binary.Tests.BinaryReadAndWriteTests.MeasureReverseUsingNtoH | 1432.42 | 2191.50 | 52.99 |
System.Tests.Perf_Int64.TryParseSpan(value: "-9223372036854775808") | 61.80 | 94.18 | 52.39 |
System.Threading.Tests.Perf_Volatile.Write_double | 0.23 | 0.35 | 52.13 |
System.Numerics.Tests.Perf_VectorOf<Int32>.EqualsBenchmark | 0.81 | 1.23 | 50.47 |
System.Tests.Perf_String.Trim(s: "Test ") | 76.12 | 113.79 | 49.48 |
System.Tests.Perf_UInt16.Parse(value: "12345") | 35.63 | 52.72 | 47.98 |
System.Tests.Perf_Int64.Parse(value: "-9223372036854775808") | 62.30 | 91.72 | 47.22 |
System.Tests.Perf_UInt64.Parse(value: "18446744073709551615") | 70.51 | 103.27 | 46.44 |
System.Tests.Perf_Int64.Parse(value: "9223372036854775807") | 61.62 | 90.17 | 46.34 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.SumBenchmark | 2.76 | 3.99 | 44.34 |
System.Collections.Tests.Perf_BitArray.BitArrayGet(Size: 512) | 8039.61 | 11602.79 | 44.32 |
Mono Interpreter
The following sections present improvements and regressions introduced in the Mono Interpreter in Preview 4.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 4.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark | 0.00 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<Int16>.CountBenchmark | 0.18 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.CountBenchmark | 0.16 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark | 1.29 | 0.00 | -100 |
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark | 0.20 | 0.00 | -99.20 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark | 0.07 | 0.00 | -95.73 |
System.Tests.Perf_DateTime.ToString(format: "s") | 2233.23 | 281.76 | -87.38 |
System.Text.Json.Serialization.Tests.ColdStartSerialization<SimpleStructWithProperties>.NewJsonSerializerContext | 185975.98 | 28969.63 | -84.42 |
System.Tests.Perf_DateTimeOffset.ToString(format: "s") | 2311.74 | 385.39 | -83.32 |
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark | 0.44 | 0.10 | -77.43 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000000) | 45039.52 | 12494.67 | -72.25 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 10000) | 44649.63 | 12502.98 | -71.99 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 1000000) | 45124.15 | 13007.76 | -71.17 |
System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateNew(capacity: 100000) | 44604.36 | 13258.02 | -70.27 |
System.Reflection.Invoke.Ctor0_NoParams | 393.98 | 123.35 | -68.69 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark | 0.00 | 0.00 | -68.38 |
System.Tests.Perf_DateTimeOffset.ToString(format: null) | 6639.43 | 2509.03 | -62.21 |
System.Reflection.Activator<EmptyClass>.CreateInstanceGeneric | 575.27 | 221.73 | -61.45 |
System.Tests.Perf_DateTimeOffset.ToString(value: 12/30/2017 3:45:22 AM -08:00) | 6959.23 | 2746.69 | -60.53 |
System.Memory.ReadOnlySpan.Trim(input: "") | 49.19 | 19.80 | -59.73 |
The implementation of `IUtf8SpanFormattable` in #84469 caused both improvements and regressions, as reported in dotnet/perf-autofiling-issues#15630 and dotnet/perf-autofiling-issues#15626.
`DateTime{Offset}` formatting changes improved about 120 microbenchmarks reported in dotnet/perf-autofiling-issues#17009.
PR #85288 improved about 30 microbenchmarks reported in dotnet/perf-autofiling-issues#17245.
Handling Utf8Formatter.TryFormat and then delegating to the relevant helpers in #85277 improved about 30 microbenchmarks.
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 4.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.CountBenchmark | 0.00 | 0.23 | 9893.94 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int64>.CountBenchmark | 0.02 | 0.75 | 4216.78 |
System.Numerics.Tests.Perf_VectorOf<UInt32>.CountBenchmark | 0.00 | 0.12 | 3988.20 |
Microsoft.Extensions.DependencyInjection.ActivatorUtilitiesBenchmark.Factory | 276.60 | 852.40 | 208.17 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.AbsBenchmark | 2.32 | 4.51 | 94.06 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.AbsBenchmark | 2.37 | 4.34 | 83.29 |
System.Numerics.Tests.Perf_Vector2.ZeroBenchmark | 0.44 | 0.78 | 78.01 |
System.Memory.Constructors<Byte>.ArrayAsSpan | 12.20 | 21.63 | 77.34 |
Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.Indexer_FirstElement_String | 8.60 | 14.85 | 72.68 |
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get(ssl: True, chunkedResponse: False, responseLength: 100000) | 1903905.78 | 3227992.49 | 69.54 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Int.OnesComplementBenchmark | 6.62 | 10.83 | 63.43 |
System.Buffers.Text.Tests.Utf8FormatterTests.FormatterDecimal(value: 123456.789) | 491.42 | 801.06 | 63.00 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt64>.OnesComplementOperatorBenchmark | 6.29 | 10.12 | 60.75 |
Microsoft.AspNetCore.Server.Kestrel.Performance.PipeThroughputBenchmark.Parse_ParallelAsync(Length: 4096, Chunks: 1) | 8112.10 | 12805.61 | 57.85 |
System.Memory.Constructors<Byte>.MemoryMarshalCreateReadOnlySpan | 7.75 | 12.19 | 57.15 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark | 0.12 | 0.19 | 54.21 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.BitwiseAndBenchmark | 8.47 | 12.73 | 50.32 |
System.Numerics.Tests.Constructor.ConstructorBenchmark_Int16 | 29.48 | 43.17 | 46.45 |
System.Numerics.Tests.Perf_VectorOf<UInt16>.InequalityOperatorBenchmark | 19.53 | 27.98 | 43.23 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.BitwiseOrBenchmark | 39.39 | 55.74 | 41.51 |
Preview 3
The following section gives an overview of only the major improvements, with high-level analysis. The analysis should be treated with caution, and readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.
Mono AOT compiler
The following sections present the improvements and regressions introduced in the Mono AOT compiler in Preview 3.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 3.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int32>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Tests.Perf_Boolean.ToString(value: True) | 0.23 | 0.00 | -100 |
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark | 1.96 | 0.80 | -59.04 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.SumBenchmark | 6.65 | 3.26 | -50.93 |
System.Numerics.Tests.Perf_Vector4.InequalityOperatorBenchmark | 1.39 | 0.74 | -46.53 |
System.Tests.Perf_Enum.HasFlag | 0.23 | 0.13 | -44.47 |
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint | 1096.23 | 667.83 | -39.07 |
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong | 1102.75 | 746.09 | -32.34 |
System.Numerics.Tests.Perf_BitOperations.Log2_ulong | 1320.59 | 895.14 | -32.21 |
System.Tests.Perf_String.IndexerCheckLengthHoisting | 88.84 | 60.29 | -32.13 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.LessThanOrEqualAllBenchmark | 4.44 | 3.03 | -31.65 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.SumBenchmark | 4.02 | 2.76 | -31.25 |
System.Numerics.Tests.Perf_VectorOf<SByte>.MinBenchmark | 48.27 | 33.34 | -30.93 |
Inlining.InlineGCStruct.WithFormat | 2.86 | 1.99 | -30.52 |
PerfLabTests.CastingPerf.ObjScalarValueType | 108762.72 | 76497.64 | -29.66 |
System.Numerics.Tests.Perf_VectorOf<Byte>.InequalityOperatorBenchmark | 0.55 | 0.39 | -29.07 |
Microsoft.Extensions.Primitives.StringSegmentBenchmark.Equals_Object_Invalid | 2.86 | 2.04 | -28.66 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.EqualityOperatorBenchmark | 0.52 | 0.37 | -28.49 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.InequalityOperatorBenchmark | 0.62 | 0.45 | -28.32 |
The most improved benchmark grouping is `System.Numerics`, as outlined in dotnet/perf-autofiling-issues#14023, dotnet/perf-autofiling-issues#14224, dotnet/perf-autofiling-issues#14573, and dotnet/perf-autofiling-issues#14322. The changes implemented in #82420, #83337, and #83094 introduced Arm64 SIMD operations and improved about 1000 microbenchmarks.
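As a reading aid, the `% Difference` column in the Preview 3 tables appears consistent with a relative change against the baseline, so negative values indicate improvements (less time) and positive values indicate regressions. The sketch below is an assumption reconstructed from the published rows, not the tooling that produced the report, and `percent_difference` is a hypothetical helper. Because its inputs are the rounded values shown in the tables, results can differ slightly from the published column.

```python
def percent_difference(baseline: float, compare: float) -> float:
    """Relative change of `compare` against `baseline`, in percent.

    Assumed convention for the Preview 3 tables:
    negative = faster than baseline, positive = slower than baseline.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (compare - baseline) / baseline * 100.0


# (name, baseline, compare, published % difference) copied from the tables.
rows = [
    ("Perf_BitOperations.LeadingZeroCount_ulong", 1102.75, 746.09, -32.34),
    ("Perf_BitOperations.Log2_uint", 791.53, 1539.09, 94.44),
]

for name, baseline, compare, published in rows:
    computed = percent_difference(baseline, compare)
    # Small gaps are expected because the table values are rounded.
    print(f"{name}: computed {computed:.2f}, published {published:.2f}")
```

Rows whose baseline is displayed as `0.00` cannot be checked this way, since the rounded baseline carries no usable precision.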
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 3.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Byte>.ZeroBenchmark | 2.65 | 5.66 | 113.78 |
System.Numerics.Tests.Perf_BitOperations.Log2_uint | 791.53 | 1539.09 | 94.44 |
System.Collections.Tests.Add_Remove_SteadyState<Int32>.Queue(Count: 512) | 9.64 | 18.37 | 90.64 |
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000) | 2769.97 | 5142.05 | 85.63 |
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000) | 2771.03 | 5139.62 | 85.47 |
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 100) | 377.30 | 646.53 | 71.35 |
System.Numerics.Tests.Perf_BitOperations.PopCount_uint | 668.42 | 1104.04 | 65.17 |
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 100) | 377.61 | 598.53 | 58.50 |
System.Threading.Tests.Perf_Volatile.Read_double | 0.16 | 0.26 | 57.96 |
System.Memory.Span<Char>.Reverse(Size: 512) | 258.69 | 407.47 | 57.51 |
PerfLabTests.LowLevelPerf.StructWithInterfaceInterfaceMethod | 154024.04 | 239168.34 | 55.27 |
System.Text.Json.Tests.Perf_Segment.ReadSingleSegmentSequenceByN(numberOfBytes: 8192, TestCase: Json4KB) | 13635.35 | 20935.97 | 53.54 |
System.Text.Json.Tests.Perf_Reader.ReadSpanEmptyLoop(IsDataCompact: True, TestCase: Json4KB) | 10415.86 | 15732.85 | 51.04 |
System.Text.Json.Tests.Perf_Reader.ReadSingleSpanSequenceEmptyLoop(IsDataCompact: True, TestCase: Json4KB) | 10436.16 | 15712.23 | 50.55 |
System.Numerics.Tests.Perf_VectorOf<Int32>.EqualityOperatorBenchmark | 0.24 | 0.36 | 50.01 |
System.Collections.IndexerSetReverse.Array(Size: 512) | 456.86 | 681.13 | 49.08 |
System.Collections.IndexerSet<Int32>.Span(Size: 512) | 458.27 | 682.26 | 48.87 |
System.Numerics.Tests.Perf_VectorOf<Int64>.EqualityOperatorBenchmark | 0.27 | 0.40 | 48.57 |
System.Numerics.Tests.Perf_BitOperations.PopCount_ulong | 745.13 | 1102.84 | 48.00 |
System.Text.Json.Tests.Perf_Reader.ReadReturnBytes(IsDataCompact: False, TestCase: Json40KB) | 158074.36 | 231420.75 | 46.39 |
Mono Interpreter
The following sections present the improvements and regressions introduced in the Mono Interpreter in Preview 3.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 3.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Single>.CountBenchmark | 0.16 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.CountBenchmark | 0.01 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt16>.CountBenchmark | 0.11 | 0.00 | -100 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<UInt32>.CountBenchmark | 0.43 | 0.00 | -100 |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql1fehvs91yzkt9xam7ahjbhvpd9edll13ab46i74ktwwgkgbi792e5gkuuzevo5qm8qt83edag7zovoe686gmtw730kms2i5xgji4xcp25287q68fvhwszd3mszht2uh7bchlgkj5qnq1x9m4lg7vwn8cq5l756akua6oyx9k71bmxbysnmhvxvlxde4k9maumfgxd8gxhxx4mwpph2ttyox9zilt3ylv1q9s4bopfuoa8qlrzodg2q67sh85wx4slcd6w7ufnendaxai633ove2ktbaxdt2sz6y6mo42473xd274gz833p6hj3mu77c4m4od9e5s8btxleh0efqnu9zj9rwtbk5758lio35b3q426j5fwwq1qyknfedrsmqyfw1m38mkkotdf7n0vr6p3erhy8dkzntr9fwjrslxjgrbegih0n6bpb5bfuy55bu65ce9kejcfifxwpcs05umrsb8kvd64q2iwugbbi7vd35g5ho0rff9rhombgzzaniyq7bbjbqr88jyw4ccgnoyl31of3a5thv0vg08gnrqzxas800hewtw8tnwgw5pav81ntdpdd62689x3iqpc317y82b3e2trbpdzieoxldaz009tz37gqmh4bdp1bv9lnl5s58udb11z0h7i2sdl5nbyhjyfzxwzezmp4qx0i3eyvsd3fg8sryq9jhlvkonnfcvb4snl4mcbimdzg49tzdhqjmfxfcq3p1st6b9x2xyevo17evpqp4yc4f2rm0f26ivr3t2f5m0boc44vituxaovcqy1jrkcs6im2kdu3jvcexx2k76egve63aon5a6nbxss4rcke90npmqp35qluf571ms160y2nhaqef835wah41qru8tauu362v0r8konl8", oldChar: 'b', newChar: '+') | 99861.87 | 2074.68 | -97.92 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.CountBenchmark | 2.79 | 0.07 | -97.41 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.UnaryNegateOperatorBenchmark | 234.80 | 6.26 | -97.33 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.UnaryNegateOperatorBenchmark | 246.33 | 6.63 | -97.30 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.NegateBenchmark | 235.81 | 6.49 | -97.24 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.NegateBenchmark | 235.54 | 6.56 | -97.21 |
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark | 3.10 | 0.09 | -97.00 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.LessThanBenchmark | 273.32 | 8.63 | -96.84 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.LessThanBenchmark | 273.20 | 8.91 | -96.73 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.EqualsStaticBenchmark | 273.84 | 9.19 | -96.64 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.SubtractBenchmark | 247.26 | 8.65 | -96.50 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.GreaterThanBenchmark | 250.97 | 8.85 | -96.47 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.SubtractBenchmark | 244.27 | 8.76 | -96.41 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.MultiplyOperatorBenchmark | 249.17 | 8.97 | -96.40 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Byte>.AddBenchmark | 238.40 | 8.67 | -96.36 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.AddOperatorBenchmark | 236.35 | 8.68 | -96.32 |
The most improved benchmark groupings are `System.Buffers`, `System.Collections`, `System.Memory`, and `System.Text`, as outlined in dotnet/perf-autofiling-issues#14324, dotnet/perf-autofiling-issues#14325, dotnet/perf-autofiling-issues#14326, dotnet/perf-autofiling-issues#14355, dotnet/perf-autofiling-issues#14359, and dotnet/perf-autofiling-issues#14361. The changes implemented in #83498 and #83490 increased the inlining length limit from 20 to 30 and implemented `shr.un.imm`, which improved over 1000 microbenchmarks.
The change "Add vector horizontal sums on Arm64" (#83675) improved about 20 microbenchmarks, as detailed in dotnet/perf-autofiling-issues#14531.
Changes in #83512 caused both improvements and regressions as reported in dotnet/perf-autofiling-issues#15008 and dotnet/perf-autofiling-issues#15154.
Regressions
Here is a list of the top 20 regressed microbenchmarks in Preview 3.
Name | Baseline Value | Compare Value | % Difference |
---|---|---|---|
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<SByte>.CountBenchmark | 0.00 | 0.12 | 661187.94 |
System.Numerics.Tests.Perf_VectorOf<Int16>.CountBenchmark | 0.01 | 0.18 | 2061.26 |
System.Numerics.Tests.Perf_Vector3.EqualsBenchmark | 23.78 | 443.27 | 1764.35 |
System.Numerics.Tests.Perf_Vector4.EqualsBenchmark | 24.01 | 406.03 | 1590.83 |
System.Numerics.Tests.Perf_Vector2.EqualsBenchmark | 33.71 | 435.39 | 1191.71 |
System.Numerics.Tests.Perf_Matrix3x2.EqualsBenchmark | 162.13 | 1346.77 | 730.69 |
System.Numerics.Tests.Perf_Plane.EqualsBenchmark | 57.84 | 411.46 | 611.36 |
System.Numerics.Tests.Perf_Quaternion.EqualsBenchmark | 80.35 | 436.94 | 443.80 |
System.Numerics.Tests.Perf_VectorOf<SByte>.CountBenchmark | 0.04 | 0.20 | 431.24 |
System.Numerics.Tests.Perf_Matrix4x4.EqualsBenchmark | 376.19 | 1808.21 | 380.66 |
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark | 0.99 | 2.52 | 154.02 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Double>.EqualsBenchmark | 124.90 | 305.09 | 144.27 |
System.Numerics.Tests.Perf_VectorOf<Int32>.CountBenchmark | 0.19 | 0.44 | 127.07 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Float.EqualsBenchmark | 191.86 | 410.58 | 113.99 |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Single>.EqualsBenchmark | 199.71 | 410.56 | 105.57 |
System.Threading.Tests.Perf_Thread.CurrentThread | 3.50 | 6.37 | 81.95 |
System.Net.Http.Tests.SocketsHttpHandlerPerfTest.Get_EnumerateHeaders_Unvalidated(ssl: True, chunkedResponse: True, responseLength: 100000) | 1951914.28 | 3529445.53 | 80.81 |
System.Text.Json.Serialization.Tests.ReadJson<BinaryData>.DeserializeFromReader(Mode: SourceGen) | 33011.31 | 59326.04 | 79.71 |
System.Globalization.Tests.StringSearch.IsSuffix_DifferentLastChar(Options: (en-US, OrdinalIgnoreCase, False)) | 913.26 | 1618.90 | 77.26 |
System.Text.Json.Serialization.Tests.ReadJson<BinaryData>.DeserializeFromReader(Mode: Reflection) | 32968.66 | 58440.45 | 77.26 |
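Several `CountBenchmark` rows above show a baseline displayed as `0.00` next to a very large `% Difference`. A rough way to see what such a row implies is to invert the percentage and estimate the unrounded baseline. The sketch below assumes the column is computed as `(compare - baseline) / baseline * 100`, which is inferred from the other rows rather than stated in the report; `implied_baseline` is a hypothetical helper, and the compare value is itself rounded, so the estimate is coarse.

```python
def implied_baseline(compare: float, pct_difference: float) -> float:
    """Invert pct = (compare - baseline) / baseline * 100 to estimate baseline."""
    ratio = 1.0 + pct_difference / 100.0
    if ratio <= 0:
        raise ValueError("percent difference must be greater than -100")
    return compare / ratio


# Perf_Vector128Of<SByte>.CountBenchmark: baseline shown as 0.00, compare 0.12.
estimate = implied_baseline(0.12, 661187.94)
print(f"estimated baseline: {estimate:.2e}")

# Perf_VectorOf<Int16>.CountBenchmark: baseline shown as 0.01, compare 0.18.
print(f"estimated baseline: {implied_baseline(0.18, 2061.26):.4f}")
```

Both estimates are far below the precision shown in the table, so for these rows the percentage may say more about rounding and measurement noise near zero than about a proportional slowdown. The linked benchmark reports are the better source for these cases.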
Preview 2
A number of improvements introduced in Preview 2 are worth calling out individually. The following section presents only major improvements, with high-level analysis. The analysis should be treated with caution, and readers are encouraged to examine the benchmark reports for a thorough analysis and to call out major improvements not mentioned in this report.
Mono AOT compiler
The following sections present the improvements and regressions introduced in the Mono AOT compiler in Preview 2.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.
Name | Baseline Value | Compare Value | Difference | % Difference |
---|---|---|---|---|
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512) | 34.07 μs | 310.43 ns | -33756.76 ns | 99% |
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512) | 17.32 μs | 314.25 ns | -17007.28 ns | 98% |
System.Tests.Perf_Decimal.Floor | 81.17 ns | 16.81 ns | -64.36 ns | 79% |
System.Tests.Perf_Decimal.Round | 82.24 ns | 18.69 ns | -63.55 ns | 77% |
System.Tests.Perf_UInt32.TryFormat(value: 0) | 78.23 ns | 20.05 ns | -58.18 ns | 74% |
System.Tests.Perf_Int32.TryFormat(value: 4) | 78.02 ns | 20.47 ns | -57.55 ns | 74% |
System.Collections.TryGetValueFalse<String, String>.ConcurrentDictionary(Size: 512) | 44.69 μs | 12.92 μs | -31.77 μs | 71% |
System.Tests.Perf_Decimal.Divide | 346.08 ns | 102.16 ns | -243.92 ns | 70% |
System.Collections.ContainsKeyFalse<String, String>.ConcurrentDictionary(Size: 512) | 45.29 μs | 13.50 μs | -31.79 μs | 70% |
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_HeavyEscaping(NumberOfBytes: 1000) | 8.93 μs | 2.77 μs | -6.16 μs | 69% |
System.Text.Json.Reader.Tests.Perf_Base64.ReadBase64EncodedByteArray_NoEscaping(NumberOfBytes: 1000) | 8.83 μs | 2.77 μs | -6.06 μs | 69% |
System.Tests.Perf_UInt64.TryFormat(value: 0) | 84.40 ns | 26.53 ns | -57.87 ns | 69% |
System.Tests.Perf_Byte.ToString(value: 255) | 91.65 ns | 29.95 ns | -61.69 ns | 67% |
System.Tests.Perf_Version.TryFormat3 | 265.42 ns | 88.04 ns | -177.38 ns | 67% |
System.Tests.Perf_Version.TryFormat4 | 345.05 ns | 115.05 ns | -230.00 ns | 67% |
System.Collections.TryGetValueTrue<String, String>.ConcurrentDictionary(Size: 512) | 49.50 μs | 16.53 μs | -32.97 μs | 67% |
System.Tests.Perf_Version.TryFormat2 | 176.63 ns | 59.61 ns | -117.02 ns | 66% |
System.Collections.ContainsKeyTrue<String, String>.ConcurrentDictionary(Size: 512) | 50.43 μs | 17.54 μs | -32.89 μs | 65% |
LinqBenchmarks.Where01ForX | 1.57 secs | 548.00 ms | -1022.61 ms | 65% |
LinqBenchmarks.Where01LinqMethodX | 1.68 secs | 588.39 ms | -1095.38 ms | 65% |
The most improved benchmark groupings are `System.Collections`, `System.Decimal`, `System.Int`, and `System.Text`, as outlined in dotnet/perf-autofiling-issues#12996, dotnet/perf-autofiling-issues#13006, dotnet/perf-autofiling-issues#13217, and dotnet/perf-autofiling-issues#13264. The changes implemented in #81695 intrinsified `RuntimeHelpers.CreateSpan<T>`, which is widely used in the BCL, and replaced the `icall` path.
Arm64 SIMD operations implemented in #83094 and #82420 improved over 1000 microbenchmarks, according to dotnet/perf-autofiling-issues#13808, dotnet/perf-autofiling-issues#13807, dotnet/perf-autofiling-issues#14023, and dotnet/perf-autofiling-issues#13990.
The `System.Collections` benchmark grouping has been improved by the changes made in #81902, as outlined in dotnet/perf-autofiling-issues#13220. The changes added support for v128 constants and improved performance in about 75 microbenchmarks.
The `System.Text` benchmark grouping has been improved by the addition of S.R.I vectors in JsonReaderHelper, introduced in #81758 and outlined in dotnet/perf-autofiling-issues#12993. Furthermore, improved handling of the `ldtoken+ldtoken+Type::op_Equality` optimization, implemented in #81277, significantly improved the `System.Text` benchmark grouping, as detailed in dotnet/perf-autofiling-issues#12313.
The changes introduced in #81306, which removed types deriving from `JsonTypeInfo<T>`, have had a positive impact on both the `System.Numerics` and `System.Collections` benchmark groupings, as reported in dotnet/perf-autofiling-issues#12488 and dotnet/perf-autofiling-issues#12550.
All of the changes mentioned above are speed-related improvements in microbenchmarks. There was also a significant size improvement on WASM and iOS from enabling deduplication of generics. Issue #80419 contains references to the changes that reduced size on disk (SOD) by about 11% and 3%, respectively.
Regressions
Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.
Name | Baseline Value | Compare Value | Difference | % Difference |
---|---|---|---|---|
System.Tests.Perf_Random.Next_long_unseeded | 10.17 ns | 28.84 ns | 18.67 ns | -184% |
System.Numerics.Tests.Perf_Vector4.EqualityOperatorBenchmark | 0.79 ns | 1.96 ns | 1.17 ns | -148% |
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark | 60.14 ns | 140.30 ns | 80.17 ns | -133% |
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark | 60.73 ns | 132.19 ns | 71.46 ns | -118% |
System.Numerics.Tests.Perf_Vector4.TransformVector3ByMatrix4x4Benchmark | 62.72 ns | 131.48 ns | 68.76 ns | -110% |
System.Numerics.Tests.Perf_Vector4.TransformByMatrix4x4Benchmark | 63.09 ns | 131.10 ns | 68.00 ns | -108% |
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark | 56.47 ns | 112.12 ns | 55.65 ns | -99% |
System.Numerics.Tests.Perf_Quaternion.LengthSquaredBenchmark | 7.76 ns | 14.35 ns | 6.59 ns | -85% |
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark | 56.66 ns | 103.10 ns | 46.44 ns | -82% |
System.Numerics.Tests.Perf_Vector4.TransformVector2ByMatrix4x4Benchmark | 61.08 ns | 103.66 ns | 42.58 ns | -70% |
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark | 20.85 ns | 35.00 ns | 14.15 ns | -68% |
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_uint | 667.85 ns | 1.10 μs | 428.39 ns | -64% |
System.Tests.Perf_Random.Next_long_long_unseeded | 14.28 ns | 22.44 ns | 8.15 ns | -57% |
System.Numerics.Tests.Perf_Quaternion.ConjugateBenchmark | 18.32 ns | 28.76 ns | 10.44 ns | -57% |
System.Numerics.Tests.Perf_Quaternion.InverseBenchmark | 26.70 ns | 41.60 ns | 14.89 ns | -56% |
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark | 13.45 ns | 20.35 ns | 6.90 ns | -51% |
System.Numerics.Tests.Perf_BitOperations.LeadingZeroCount_ulong | 745.74 ns | 1.10 μs | 357.01 ns | -48% |
System.Numerics.Tests.Perf_BitOperations.Log2_ulong | 894.61 ns | 1.32 μs | 425.98 ns | -48% |
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark | 21.03 ns | 30.87 ns | 9.85 ns | -47% |
System.Numerics.Tests.Perf_Vector3.ReflectBenchmark | 37.23 ns | 54.13 ns | 16.90 ns | -45% |
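Note that the Preview 2 and Preview 1 tables appear to use the opposite sign convention from the Preview 3 tables: improvements are shown as positive percentages and regressions as negative ones, and the values carry mixed units. The sketch below is an assumption inferred from the published rows, not the report's actual tooling; `to_ns` and `preview2_percent` are hypothetical helpers. It normalizes the units to nanoseconds and reproduces the `% Difference` column for a few rows.

```python
# Conversion factors to nanoseconds for the unit suffixes used in the tables.
UNIT_TO_NS = {
    "ns": 1.0,
    "\u03bcs": 1e3,  # Greek small letter mu
    "\u00b5s": 1e3,  # micro sign, in case the text uses it instead
    "ms": 1e6,
    "secs": 1e9,
}


def to_ns(cell: str) -> float:
    """Parse a table cell such as '34.07 μs' into nanoseconds."""
    value, unit = cell.split()
    return float(value.replace(",", "")) * UNIT_TO_NS[unit]


def preview2_percent(baseline: str, compare: str) -> int:
    """Assumed convention: positive = improvement, negative = regression."""
    b, c = to_ns(baseline), to_ns(compare)
    return round((b - c) / b * 100.0)


# Improvement: System.Tests.Perf_Decimal.Floor
print(preview2_percent("81.17 ns", "16.81 ns"))
# Regression: System.Tests.Perf_Random.Next_long_unseeded
print(preview2_percent("10.17 ns", "28.84 ns"))
```

When comparing rows across previews, it helps to convert everything to one convention first, since a value such as `-48%` means a regression in these tables but would mean an improvement in the Preview 3 tables.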
Here is a list of ongoing regressions in the Preview 2 snapshot, with short descriptions.
Issue report | Description |
---|---|
dotnet/perf-autofiling-issues#12546 | Quaternion and Plane SIMD intrinsics |
dotnet/perf-autofiling-issues#12957 | Improve ConcurrentDictionary performance for strings |
dotnet/perf-autofiling-issues#12660 | Improved codegen of the vector accelerated System.Numerics.* types |
dotnet/perf-autofiling-issues#13187 | Implementation of Lemire's nearly divisionless method |
dotnet/perf-autofiling-issues#13500 | Use of Array.Reverse<T> in ImmutableArray<T>.Builder.Reverse |
Mono Interpreter
The following sections present the improvements and regressions introduced in the Mono Interpreter in Preview 2.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 2. Full report available here.
Name | Baseline Value | Compare Value | Difference | % Difference |
---|---|---|---|---|
System.Collections.Concurrent.Count<Int32>.Dictionary(Size: 512) | 140.03 μs | 1.76 μs | -138.26 μs | 99% |
System.Collections.Concurrent.Count<String>.Dictionary(Size: 512) | 136.03 μs | 1.86 μs | -134.17 μs | 99% |
System.Threading.Tests.Perf_Interlocked.CompareExchange_long | 37.56 ns | 6.66 ns | -30.90 ns | 82% |
System.Threading.Tests.Perf_Interlocked.CompareExchange_int | 34.18 ns | 8.33 ns | -25.85 ns | 76% |
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) | 3.81 μs | 1.09 μs | -2.72 μs | 71% |
System.Numerics.Tests.Perf_Vector4.ZeroBenchmark | 3.21 ns | 0.99 ns | -2.22 ns | 69% |
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: False) | 3.42 μs | 1.06 μs | -2.36 μs | 69% |
System.Tests.Perf_Decimal.Floor | 175.25 ns | 65.77 ns | -109.48 ns | 62% |
System.Numerics.Tests.Perf_Quaternion.LengthBenchmark | 63.64 ns | 24.08 ns | -39.56 ns | 62% |
System.Numerics.Tests.Perf_Quaternion.InequalityOperatorBenchmark | 89.74 ns | 34.82 ns | -54.93 ns | 61% |
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: False, UseSharedPool: False) | 4.34 μs | 1.70 μs | -2.64 μs | 61% |
System.Tests.Perf_Decimal.Round | 191.52 ns | 75.77 ns | -115.76 ns | 60% |
System.Numerics.Tests.Perf_Quaternion.DotBenchmark | 77.60 ns | 31.33 ns | -46.27 ns | 60% |
System.Numerics.Tests.Perf_Quaternion.DivideBenchmark | 88.55 ns | 36.47 ns | -52.07 ns | 59% |
System.Tests.Perf_Random.Next_int_int_unseeded | 154.47 ns | 65.37 ns | -89.11 ns | 58% |
System.Numerics.Tests.Perf_Quaternion.IsIdentityBenchmark | 81.52 ns | 35.06 ns | -46.46 ns | 57% |
System.Numerics.Tests.Perf_Quaternion.SubtractionOperatorBenchmark | 83.75 ns | 36.09 ns | -47.67 ns | 57% |
System.Numerics.Tests.Perf_Quaternion.SubtractBenchmark | 84.49 ns | 36.50 ns | -47.99 ns | 57% |
System.Collections.CtorFromCollection<Int32>.ConcurrentDictionary(Size: 512) | 461.77 μs | 200.10 μs | -261.67 μs | 57% |
System.Tests.Perf_UInt64.TryFormat(value: 0) | 250.12 ns | 109.72 ns | -140.40 ns | 56% |
The most improved benchmark groupings are `System.Collections`, `System.Numerics`, and `System.Decimal`, as outlined in dotnet/perf-autofiling-issues#12504, dotnet/perf-autofiling-issues#12544, dotnet/perf-autofiling-issues#13303, dotnet/perf-autofiling-issues#13247, dotnet/perf-autofiling-issues#13752, dotnet/perf-autofiling-issues#13761, and dotnet/perf-autofiling-issues#12744. The changes implemented in #81335, which intrinsified `System.Numerics.*` types, in #82093, which intrinsified `CreateSpan`, and in #81782, which introduced common Vector128 SIMD operations widely used in the BCL, improved over 1000 microbenchmarks.
The implementation of synch block fast paths created a regression in the Mono AOT compiler (#81380) but led to an improvement in about 100 microbenchmarks in the Mono Interpreter, as detailed in dotnet/perf-autofiling-issues#13245.
Similar to the change in the AOT compiler, the changes introduced in #81306, which removed types deriving from `JsonTypeInfo<T>`, improved several microbenchmarks in the Mono Interpreter. The change "Improve ConcurrentDictionary performance for strings" (#81557) produced the improvements reported in dotnet/perf-autofiling-issues#13003. Code refactoring also led to several improvements, presented in dotnet/perf-autofiling-issues#12301.
Regressions
Here is a list of the top 20 microbenchmark regressions in Preview 2. Full report available here.
Name | Baseline Value | Compare Value | Difference | % Difference |
---|---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<UInt64>.CountBenchmark | 0.06 ns | 3.10 ns | 3.04 ns | -5,059% |
System.Runtime.Intrinsics.Tests.Perf_Vector128Of<Int16>.CountBenchmark | 0.36 ns | 1.75 ns | 1.39 ns | -391% |
System.Collections.TryAddDefaultSize<String>.ConcurrentDictionary(Count: 512) | 297.96 μs | 574.34 μs | 276.38 μs | -93% |
System.Numerics.Tests.Perf_Vector2.UnitYBenchmark | 7.38 ns | 13.69 ns | 6.31 ns | -85% |
HardwareIntrinsics.RayTracer.SoA.Render | 2.41 ns | 4.38 ns | 1.97 ns | -82% |
System.Numerics.Tests.Perf_Vector2.TransformByMatrix3x2Benchmark | 48.06 ns | 86.28 ns | 38.22 ns | -80% |
System.IO.Compression.Brotli.Compress_WithoutState(level: Fastest, file: "TestDocument.pdf") | 291.36 μs | 522.83 μs | 231.47 μs | -79% |
System.IO.Compression.Brotli.Compress_WithState(level: Fastest, file: "TestDocument.pdf") | 296.93 μs | 525.99 μs | 229.06 μs | -77% |
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix3x2Benchmark | 44.65 ns | 75.61 ns | 30.96 ns | -69% |
System.Memory.Constructors_ValueTypesOnly<Byte>.ReadOnlyFromPointerLength | 6.33 ns | 10.49 ns | 4.16 ns | -66% |
PerfLabTests.EnumPerf.ObjectGetTypeNoBoxing | 3.87 ns | 6.20 ns | 2.32 ns | -60% |
System.Numerics.Tests.Perf_Vector3.SquareRootBenchmark | 23.34 ns | 37.02 ns | 13.68 ns | -59% |
System.Numerics.Tests.Perf_Vector3.TransformNormalByMatrix4x4Benchmark | 124.53 ns | 196.66 ns | 72.12 ns | -58% |
System.Diagnostics.Perf_Process.StartAndWaitForExit | 871.51 μs | 1.35 ms | 474.57 μs | -54% |
System.Numerics.Tests.Perf_Vector3.TransformByMatrix4x4Benchmark | 144.68 ns | 217.99 ns | 73.31 ns | -51% |
System.Collections.AddGivenSize<String>.List(Size: 512) | 12.21 μs | 18.32 μs | 6.11 μs | -50% |
System.IO.Tests.BinaryWriterExtendedTests.WriteAsciiCharArray(StringLengthInChars: 2000000) | 8.14 ms | 12.20 ms | 4.06 ms | -50% |
System.Numerics.Tests.Perf_VectorOf<Int32>.ZeroBenchmark | 3.20 ns | 4.80 ns | 1.59 ns | -50% |
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) | 5.73 μs | 8.56 μs | 2.83 μs | -49% |
System.Buffers.Tests.RentReturnArrayPoolTests<Object>.ProducerConsumer(RentalSize: 4096, ManipulateArray: False, Async: True, UseSharedPool: True) | 5.62 μs | 8.37 μs | 2.75 μs | -49% |
Here is a list of ongoing regressions in the Preview 2 snapshot, with short descriptions.
Issue report | Description |
---|---|
dotnet/perf-autofiling-issues#12707 | Use of unimplemented Vector operations |
dotnet/perf-autofiling-issues#13747 | Intrinsified common Vector128 operations |
Preview 1
This section presents an overview of the major performance improvements and regressions in the Mono Interpreter in .NET 8 Preview 1.
Improvements
Here is a list of the top 20 microbenchmark improvements in Preview 1.
Name | Baseline Value | Compare Value | Difference | % Difference |
---|---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanAnyBenchmark | 292.17 ns | 18.88 ns | -273.29 ns | 94% |
System.Numerics.Tests.Perf_VectorOf<Byte>.LessThanOrEqualAnyBenchmark | 298.08 ns | 20.47 ns | -277.61 ns | 93% |
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanOrEqualAnyBenchmark | 294.38 ns | 20.33 ns | -274.05 ns | 93% |
System.Numerics.Tests.Perf_VectorOf<SByte>.LessThanAnyBenchmark | 298.45 ns | 20.63 ns | -277.82 ns | 93% |
System.Numerics.Tests.Perf_VectorOf<Byte>.GreaterThanOrEqualAllBenchmark | 331.73 ns | 24.25 ns | -307.48 ns | 93% |
System.Numerics.Tests.Perf_VectorOf<UInt16>.GreaterThanOrEqualAllBenchmark | 218.05 ns | 20.58 ns | -197.47 ns | 91% |
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanAllBenchmark | 209.57 ns | 20.48 ns | -189.08 ns | 90% |
System.Numerics.Tests.Perf_VectorOf<Int16>.GreaterThanOrEqualAllBenchmark | 231.47 ns | 23.03 ns | -208.44 ns | 90% |
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanOrEqualAnyBenchmark | 188.87 ns | 20.02 ns | -168.84 ns | 89% |
System.Numerics.Tests.Perf_VectorOf<Int16>.LessThanAnyBenchmark | 186.21 ns | 20.05 ns | -166.16 ns | 89% |
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanOrEqualAnyBenchmark | 189.87 ns | 20.76 ns | -169.11 ns | 89% |
System.Numerics.Tests.Perf_VectorOf<UInt16>.LessThanAnyBenchmark | 186.54 ns | 21.38 ns | -165.15 ns | 89% |
System.Memory.Span<Byte>.IndexOfAnyFourValues(Size: 512) | 11.82 μs | 1.60 μs | -10.23 μs | 87% |
System.Memory.Span<Byte>.IndexOfAnyFiveValues(Size: 512) | 14.32 μs | 2.42 μs | -11.90 μs | 83% |
System.Numerics.Tests.Perf_VectorOf<Int32>.GreaterThanAllBenchmark | 120.71 ns | 20.59 ns | -100.11 ns | 83% |
System.Numerics.Tests.Perf_VectorOf<UInt32>.GreaterThanAllBenchmark | 124.72 ns | 21.39 ns | -103.32 ns | 83% |
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanOrEqualAllBenchmark | 136.11 ns | 24.20 ns | -111.91 ns | 82% |
System.Numerics.Tests.Perf_VectorOf<Single>.GreaterThanAllBenchmark | 128.50 ns | 24.30 ns | -104.20 ns | 81% |
System.Numerics.Tests.Perf_VectorOf<UInt64>.GreaterThanAllBenchmark | 105.81 ns | 20.48 ns | -85.33 ns | 81% |
System.Numerics.Tests.Perf_VectorOf<Int64>.GreaterThanAllBenchmark | 105.16 ns | 20.57 ns | -84.60 ns | 80% |
A number of improvements introduced in Preview 1 are worth calling out individually. The following section presents only major improvements, with high-level analysis.
The analysis should be treated with caution, and readers are encouraged to examine the benchmark reports for a thorough analysis.
The most improved benchmark groupings are `System.Runtime.Vectors`, `System.Runtime.Intrinsics`, and `System.Collections`, as outlined here and in dotnet/perf-autofiling-issues#10468.
Adding a `stobj.vt.noref` version for the no-reference case, which is twice as fast as `stobj.vt`, improved over 400 microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10468 and dotnet/perf-autofiling-issues#10464.
SpanHelpers are widely used in the BCL, and improvements related to them can significantly improve performance. Changes in 200a90a, 7fa0d5b, and c0447bc removed Mono-specific SpanHelpers, replaced branch patterns with super-instructions, and improved detection of dead basic blocks (bblocks). Over 300 microbenchmarks improved, as outlined in dotnet/perf-autofiling-issues#10989 and dotnet/perf-autofiling-issues#11155.
Change #77331 simplified the `getitem.span` opcode and avoided the typical use of `ldloca` with it, which improved over 50 microbenchmarks.
Allowing vtypes with a single scalar field to be passed to native code using the faster code path improved the `System.Text` and `System.Collections` benchmark groupings, as outlined in dotnet/perf-autofiling-issues#10987 and dotnet/perf-autofiling-issues#10938. The assumption is that those libraries rely on ObjectHandleOnStack types.
The intrinsic for string allocation, `newstr`, added in #79392 improved various microbenchmarks, as outlined in dotnet/perf-autofiling-issues#10694 and dotnet/perf-autofiling-issues#10670.
9a65109 contributed to dotnet/perf-autofiling-issues#10695 and dotnet/perf-autofiling-issues#10671.
All of the changes mentioned above are speed improvements in microbenchmarks. There was also a significant size improvement in WebAssembly from #79672, which reduced size on disk (SOD) in the Blazor template application by ~270 KB by trimming the `S.N.Vector` class in non-SIMD cases. Deduplication of symbols in WebAssembly achieves additional size savings.
Regressions
Here is a list of the top 20 microbenchmark regressions in Preview 1.
Name | Baseline Value | Compare Value | Difference | % Difference |
---|---|---|---|---|
System.Numerics.Tests.Perf_VectorOf<Byte>.CountBenchmark | 0.10 ns | 1.10 ns | 1.00 ns | -969% |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58lzfdql | 11.63 μs | 101.96 μs | 90.33 μs | -777% |
System.Tests.Perf_String.Replace_Char(text: "yfesgj0sg1ijslnjsb3uofdz3tbzf6ysgblu3at20nfab2wei1kxfbvsbpzwhanjczcqa2psra3aacxb67qnwbnfp2tok6v0a58l", ol | 1.30 μs | 8.82 μs | 7.52 μs | -578% |
System.Tests.Perf_Byte.ToString(value: 255) | 38.31 ns | 257.96 ns | 219.65 ns | -573% |
System.Tests.Perf_String.Replace_String(text: "This is a very nice sentence. This is another very nice sentence.", oldValue: "a", newValue: "b") | 962.59 ns | 6.30 μs | 5335.40 ns | -554% |
PerfLabTests.LowLevelPerf.IntegerFormatting | 6.08 ms | 34.30 ms | 28.21 ms | -464% |
System.Tests.Perf_Int32.ToString(value: 2147483647) | 59.17 ns | 332.19 ns | 273.01 ns | -461% |
System.Tests.Perf_Int16.ToString(value: 32767) | 53.24 ns | 297.84 ns | 244.60 ns | -459% |
System.Tests.Perf_Int32.ToString(value: 12345) | 52.90 ns | 293.56 ns | 240.66 ns | -455% |
System.Tests.Perf_String.Replace_Char(text: "This is a very nice sentence", oldChar: 'i', newChar: 'I') | 531.46 ns | 2.89 μs | 2355.30 ns | -443% |
System.Tests.Perf_SByte.ToString(value: 127) | 52.62 ns | 276.41 ns | 223.79 ns | -425% |
System.Numerics.Tests.Perf_Vector2.TransformNormalByMatrix4x4Benchmark | 21.70 ns | 108.97 ns | 87.28 ns | -402% |
System.Numerics.Tests.Perf_Vector2.TransformByMatrix4x4Benchmark | 26.37 ns | 114.02 ns | 87.65 ns | -332% |
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixOperatorBenchmark | 246.08 ns | 1.04 μs | 797.11 ns | -324% |
System.Numerics.Tests.Perf_Matrix4x4.MultiplyByMatrixBenchmark | 243.24 ns | 1.02 μs | 779.98 ns | -321% |
System.Tests.Perf_Byte.ToString(value: 0) | 7.06 ns | 27.18 ns | 20.11 ns | -285% |
System.Numerics.Tests.Perf_Matrix4x4.CreateTranslationFromScalarXYZ | 25.27 ns | 91.61 ns | 66.34 ns | -263% |
System.Numerics.Tests.Perf_Matrix4x4.AddBenchmark | 90.93 ns | 304.20 ns | 213.27 ns | -235% |
System.Numerics.Tests.Perf_Matrix4x4.LerpBenchmark | 141.51 ns | 443.45 ns | 301.94 ns | -213% |
System.Numerics.Tests.Perf_Matrix4x4.SubtractOperatorBenchmark | 100.31 ns | 307.60 ns | 207.29 ns | -207% |
Here is a list of ongoing regressions in the Preview 1 snapshot, with short descriptions.
Issue report | Description |
---|---|
dotnet/perf-autofiling-issues#12299 | Extracted code outside of interp main loop |
dotnet/perf-autofiling-issues#11449 | Investigating |
dotnet/perf-autofiling-issues#11453 | Redundant ldloca and stfld opcodes in the new Matrix4x4 implementation |
dotnet/perf-autofiling-issues#11147 | New ASCII APIs |
#79973 | Dependencies update |
#79336 | Managed implementation of UInt32ToDecStr |
#79876 | Unoptimized pattern ldstr; if (uncommon) throw ex (string) |