Skip to content

Conversation

kzrnm
Copy link
Contributor

@kzrnm kzrnm commented Mar 5, 2025

Fixes #111708

After MemoryMarshal was moved from System.Memory to System.Runtime in #106360, crossgen is properly applied to Number.FormatBigInteger. However, the loop logic in FormatBigInteger is significantly slow in tier-0. For more information #111708 (comment)

Discussions about applying AO have taken place in issues like #71261, #79645, #109087 (comment), etc., but these were mainly about frequently called methods like SpanHelpers. Given this, I don't think applying AO to BigInteger would cause any startup performance issues.

@ghost ghost added the area-System.Numerics label Mar 5, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 5, 2025
@EgorBo
Copy link
Member

EgorBo commented Mar 5, 2025

Let's first check if that was the issue:

@EgorBo
Copy link
Member

EgorBo commented Mar 5, 2025

@EgorBot -amd -arm -windows_intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private char[] _dest = new char[1_000_000];

    private BigInteger _value =
        BigInteger.Pow(BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 30))), 1_000);

    [Benchmark(Baseline = true)]
    public string Stringify() => _value.ToString();

    [Benchmark]
    public bool TryFormat() => _value.TryFormat(_dest, out _);
}

@EgorBo
Copy link
Member

EgorBo commented Mar 5, 2025

@kzrnm While my bot is likely broken for Linux atm (fixing it now), the Windows-x64 job couldn't confirm your theory - EgorBot/runtime-utils#310 (comment) the numbers for before/after are the same.

@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 5, 2025

@EgorBo

When running build.cmd/build.sh, crossgen is properly applied to only System.Private.CoreLib.In other words, libraries generate ReadyToRun DLLs through PublishReadyToRun, so we need to check the assemblies produced by publish.

https://github.com/dotnet/runtime/blob/7dd74fc403dda02fdccc541b306b8267c9a22aa9/src/coreclr/crossgen-corelib.proj

Since I wasn't sure how dotnet/runtime handles publishing, I ran crossgen2.exe directly and used the ReadyToRun assemblies for verification.

The System.Runtime.Numerics.dll output in artifacts is around 130 KB, whereas the one in the actual runtime is 350 KB.

Get-ChildItem $root\runtime\artifacts\bin\System.Runtime.Numerics\Release\net10.0\System.Runtime.Numerics.dll, 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.2\System.Runtime.Numerics.dll'

    Directory: D:\repository\dotnet\runtime\artifacts\bin\System.Runtime.Numerics\Release\net10.0

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          2025/03/06     8:06         134656 System.Runtime.Numerics.dll

    Directory: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.2

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          2025/01/16    21:51         354600 System.Runtime.Numerics.dll

When inspecting the latter with ILSpy, it displays "This assembly contains unmanaged code."

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2025

Ah, I didn't realize BigInteger is not in corelib (although, there is an internal copy there).

Anyway, it doesn't sound like a correct solution to me. AO bypasses all Tier1 optimizations including Dynamic PGO. Today the only method in BCL that has it is AwaitUnsafeOnCompleted that is often on hot path and allocates a lot in tier0 (and we would love to remove AO on it one day - requires some work in the JIT)

@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 6, 2025

@EgorBo I don't think the fact that AO isn't used elsewhere is a valid reason not to use it for BigInteger.

For example, SpanHelpers can be useful for lightweight applications, making R2R a suitable choice. However, BigInteger is purely a numerical computation type, and users are likely to prioritize faster calculations over a slight improvement in application startup time.

Additionally, the regression is occurring because Dynamic PGO and OSR are not being applied, causing execution to remain in Tier-0. While this might be an issue caused by JIT, considering the use case of BigInteger, I believe applying AO as a workaround is a reasonable way to fix the regression. It should not negatively impact users who do not use BigInteger, as no other methods are expected to call BigInteger.ToString().


there is an internal copy there

Are you referring to Number.BigInteger? It appears to be a limited-purpose type used for parsing floating-point number and does not implement ToString().

https://github.com/dotnet/runtime/blob/850c0ab4519b904a28f2d67abdaba1ac78c955ff/src/libraries/System.Private.CoreLib/src/System/Number.BigInteger.cs

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2025

I don't think the fact that AO isn't used elsewhere is a valid reason not to use it for BigInteger.

AO is a workaround, there must be a valid, well explained reason for it to apply. E.g in case of AwaitUnsafeOnCompleted it's blocked by this issue #90965 and removing it caused a massive impact on TE benchmarks' working set (see #90899)

Do we have an issue explaining the exact cause here?

Additionally, the regression is occurring because Dynamic PGO and OSR are not being applied

It should be eventually applied if it's hot enough - if it does not, it's a serious bug we need to investigate. If you need performance right at the start, you should consider disabling TieredCompilation completely or switch to NativeAOT. Generally, AO is expected to decrease the steady state performance - various "is type initialized" checks are typically not elided and no PGO is applied.

@kzrnm kzrnm force-pushed the BigIntegerRegression111708 branch from c769ae3 to 96b47bc Compare March 6, 2025 14:16
@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 6, 2025

Further testing has yielded even better results. By extracting the inner loop into a separate method, the code achieves higher performance without needing AO.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]   : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  ShortRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

Method Toolchain N Mean Ratio Allocated Alloc Ratio
DecimalString \main\corerun.exe 100 154.5 ns 1.00 - NA
DecimalString \pr\corerun.exe 100 165.2 ns 1.07 - NA
DecimalString \withAO\corerun.exe 100 154.7 ns 1.00 - NA
DecimalString \main\corerun.exe 1000 10,173.4 ns 1.00 - NA
DecimalString \pr\corerun.exe 1000 9,039.6 ns 0.89 - NA
DecimalString \withAO\corerun.exe 1000 9,936.7 ns 0.98 - NA
DecimalString \main\corerun.exe 10000 1,037,805.7 ns 1.00 - NA
DecimalString \pr\corerun.exe 10000 907,048.1 ns 0.87 - NA
DecimalString \withAO\corerun.exe 10000 1,045,343.4 ns 1.01 - NA
DecimalString \main\corerun.exe 100000 209,918,922.2 ns 1.00 - NA
DecimalString \pr\corerun.exe 100000 91,287,872.2 ns 0.43 - NA
DecimalString \withAO\corerun.exe 100000 104,468,880.0 ns 0.50 - NA
DecimalString \main\corerun.exe 1000000 20,839,716,466.7 ns 1.00 - NA
DecimalString \pr\corerun.exe 1000000 9,223,145,333.3 ns 0.44 - NA
DecimalString \withAO\corerun.exe 1000000 10,461,054,700.0 ns 0.50 - NA
Benchmark code
using BenchmarkDotNet.Attributes;
using System.Numerics;

[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ToStringTest
{
    char[] _dest = new char[1000010];
    [Params([
        100,
        1000,
        10000,
        100000,
        1000000,
    ])]
    public int N;

    BigInteger b;

    [GlobalSetup]
    public void Setup()
    {
        b = BigInteger.Parse(new string('9', N));
    }

    [Benchmark] public void DecimalString() => b.TryFormat(_dest, out _);
}

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2025

Job=ShortRun IterationCount=3 LaunchCount=1 WarmupCount=3

Maybe that is the reason? the fact that you're measuring a short run and it doesn't properly promote to the final tier?
What are the results with default BDN parameters or long run?

@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 6, 2025

Hmm, when measured with the Default job, the main version with 100,000 digits became as fast as the AO version. However, with 1,000,000 digits, the execution time was too long for proper warm-up to take place.

Moreover, suggesting that "for use cases where a single execution of a massive computation takes several seconds, TiredCompilation should be disabled" is an extremely niche piece of knowledge. It is desirable for the program to perform efficiently even when processing a large input only once.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]     : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  Job-XVYGXJ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-TFIFOI : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-MKOOHF : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2


Method Toolchain N Mean Ratio Code Size Allocated Alloc Ratio
DecimalString \main\corerun.exe 100000 104.64 ms 1.00 5,262 B - NA
DecimalString \pr\corerun.exe 100000 91.53 ms 0.87 5,224 B - NA
DecimalString \withAO\corerun.exe 100000 104.82 ms 1.00 3,821 B - NA
DecimalString \main\corerun.exe 1000000 20,975.64 ms 1.00 274 B - NA
DecimalString \pr\corerun.exe 1000000 9,192.49 ms 0.44 274 B - NA
DecimalString \withAO\corerun.exe 1000000 10,493.09 ms 0.50 274 B - NA

@kzrnm kzrnm changed the title Add AggressiveOptimization to Number.FormatBigInteger Make Number.FormatBigInteger JIT-firendly Mar 6, 2025
@kzrnm
Copy link
Contributor Author

kzrnm commented Apr 12, 2025

I noticed that Tier1 performance has regressed. It looks like a different approach might be needed, so I'll go ahead and close this for now.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3775)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25167.3
  [Host]   : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2
  ShortRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

Method Toolchain N Mean Ratio Code Size Allocated Alloc Ratio
DecimalString \main\corerun.exe 100000 296.9 ms 1.00 274 B 536 B 1.00
DecimalString \pr\corerun.exe 100000 119.9 ms 0.40 309 B 214 B 0.40
DecimalString2 \main\corerun.exe 100000 162.2 ms 1.00 569 B 268 B 1.00
DecimalString2 \pr\corerun.exe 100000 137.1 ms 0.85 569 B 268 B 1.00
using BenchmarkDotNet.Attributes;
using System.Numerics;

[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ToStringTest
{
    char[] _dest = new char[1000010];
    [Params([
        // 100,
        // 1000,
        // 10000,
        100000,
        // 1000000,
    ])]
    public int N;

    BigInteger b, s;

    [GlobalSetup]
    public void Setup()
    {
        s = BigInteger.Parse(new string('9', 1 << 8));
        b = BigInteger.Parse(new string('9', N));
    }

    [Benchmark] public void DecimalString() => b.TryFormat(_dest, out _);
    [Benchmark]
    public void DecimalString2()
    {
        for (int i = 0; i < 10000; i++)
        {
            s.TryFormat(_dest, out _);
        }
        b.TryFormat(_dest, out _);
    }
}

@kzrnm kzrnm closed this Apr 12, 2025
@kzrnm kzrnm reopened this Apr 12, 2025
@kzrnm
Copy link
Contributor Author

kzrnm commented Apr 12, 2025

@EgorBo
It seems that method calls were the cause of the slowdown.
By hard-coding the logic instead of using method calls, I could improve Tier0 performance without negatively impacting Tier1.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3775)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25167.3
  [Host]   : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2
  ShortRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

Method Toolchain N Mean Ratio Code Size Allocated Alloc Ratio
DecimalStringTier0 \main\corerun.exe 100000 277.4 ms 1.00 274 B 536 B 1.00
DecimalStringTier0 \pr\corerun.exe 100000 162.3 ms 0.59 309 B 196 B 0.37
DecimalStringTier1 \main\corerun.exe 100000 155.2 ms 1.00 569 B 268 B 1.00
DecimalStringTier1 \pr\corerun.exe 100000 148.9 ms 0.96 569 B 357 B 1.33
using BenchmarkDotNet.Attributes;
using System.Numerics;

[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ToStringTest
{
    char[] _dest = new char[1000010];
    [Params([
        100000,
    ])]
    public int N;

    BigInteger b, s;

    [GlobalSetup]
    public void Setup()
    {
        s = BigInteger.Parse(new string('9', 1 << 8));
        b = BigInteger.Parse(new string('9', N));
    }

    [Benchmark] public void DecimalStringTier0() => b.TryFormat(_dest, out _);
    [Benchmark]
    public void DecimalStringTier1()
    {
        for (int i = 0; i < 10000; i++)
        {
            s.TryFormat(_dest, out _);
        }
        b.TryFormat(_dest, out _);
    }
}

@stephentoub
Copy link
Member

Late to the party... what is this PR trying to improve? Tier 0 throughput of BigInteger formatting?

@kzrnm
Copy link
Contributor Author

kzrnm commented May 29, 2025

Tier 0 throughput

Yes, that's right - this wasn't an issue in .NET 8 because it always ran in Tier 1, but in .NET 9 it started running in Tier 0 and the problem became apparent. The root cause seems to be that On-Stack Replacement is not working as expected, but since I don't know much about JIT, I tried to improve it with libraries. By eliminating method calls and writing the logic directly, it became fast even on Tier 0, so I am making this PR.

@jkotas
Copy link
Member

jkotas commented Jul 26, 2025

This sounds like something that should be fixed by a longer warm up for the microbenchmark so that it does not run TIer0 code. It is very common for Tier0 code to be several times slower than the optimized code - it is by design.

@jeffhandley
Copy link
Member

@tannergooding What do you think about this change? As @jkotas called out, the Tier 0 trade-offs are expected and perhaps not something we'd want to try to address.

@kzrnm
Copy link
Contributor Author

kzrnm commented Sep 17, 2025

Now that #112178 has been merged, this is no longer needed at all.

@kzrnm kzrnm closed this Sep 17, 2025
@kzrnm kzrnm deleted the BigIntegerRegression111708 branch September 17, 2025 00:23
@github-actions github-actions bot locked and limited conversation to collaborators Oct 17, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-System.Numerics community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Perf] BigInteger formatting performance regression in .NET 9

6 participants