Windows/Arm64: Use 8.1 atomic instructions for GC code #71169

kunalspathak · 2022-06-22T22:01:09Z

Following the example of #70921, continue to optimize the following APIs using atomics in the GC code base for Windows/arm64:

Interlocked::Exchange
Interlocked::CompareExchange
Interlocked::ExchangePointer

I was not able to optimize more methods because MSVC does not provide intrinsics for them. I have opened https://developercommunity.visualstudio.com/t/Add-APIs-for-Arm64-intrinsics-for-_Inter/10078117?space=62&q=intrinsic&entry=myfeedback to suggest adding them. I couldn't even write an assembly equivalent for them, because MSVC doesn't handle __asm for arm64, and writing the method in a .asm file would incur a function call, whereas we want these APIs to be inlined.

I have declared/defined g_atomics_available_present for clrgc and gcsample but am not setting it, so it will be OFF by default for them.
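
For readers unfamiliar with the pattern, the flag-gated dispatch described above can be sketched roughly as follows. This is a portable illustration only, not the code in this PR: std::atomic stands in for the MSVC intrinsics, and the names g_atomics_present_sketch, ExchangeViaCasLoop, ExchangeSketch, and ExchangeSketchSelfCheck are hypothetical. Which instructions the compiler actually emits for each path depends on the toolchain and target flags.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime flag discussed above. It stays false
// unless the host sets it after detecting the CPU feature, which mirrors the
// "OFF by default" behavior described for clrgc and gcsample.
static bool g_atomics_present_sketch = false;

// Fallback path: a compare-and-swap retry loop.
template <typename T>
inline T ExchangeViaCasLoop(std::atomic<T>& target, T value)
{
    T observed = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(observed, value,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed))
    {
        // On failure, observed now holds the current value; just retry.
    }
    return observed;
}

// Dispatches on the flag. Both paths return the previous value of target.
template <typename T>
inline T ExchangeSketch(std::atomic<T>& target, T value)
{
    if (g_atomics_present_sketch)
    {
        // Path intended to map to a single read-modify-write operation.
        return target.exchange(value, std::memory_order_seq_cst);
    }
    return ExchangeViaCasLoop(target, value);
}

// Usage: both paths must be observably identical to callers.
inline void ExchangeSketchSelfCheck()
{
    std::atomic<int32_t> slot{1};
    g_atomics_present_sketch = false;
    assert(ExchangeSketch<int32_t>(slot, 2) == 1);
    g_atomics_present_sketch = true;
    assert(ExchangeSketch<int32_t>(slot, 3) == 2);
    assert(slot.load() == 3);
}
```

The branch on the flag is cheap and predictable, which is why keeping the wrapper inlined matters more than the cost of the check itself.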

kunalspathak · 2022-06-22T22:01:42Z

@Maoni0 @dotnet/jit-contrib

ghost · 2022-06-22T22:01:44Z

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

jkotas · 2022-06-23T04:26:19Z

Any measurable perf improvements from this change?

jkotas · 2022-06-23T04:27:56Z

To fix the NativeAOT build break, you can define and initialize the flag here:

runtime/src/coreclr/nativeaot/Runtime/windows/PalRedhawkMinWin.cpp (line 694 in dc3a6ac):

}

kunalspathak · 2022-06-23T15:17:29Z

> Any measurable perf improvements from this change?

Oddly, no. I tried measuring using GCPerfSim. The numbers are within the margin of error, so I don't see much impact. @Maoni0 - any other benchmarks you would recommend? I intentionally didn't set COMPlus_GCgen0size, so it gets a lower cache (the binaries don't have #71029) and hence triggers GC more often.

COMPlus_GCCpuGroup=1
COMPlus_Thread_UseAllCpuGroups=1
COMPlus_TieredCompilation=0

_1.bat
%CORE_RUN% E:\kpathak\GCPerfSim\net6.0\gcperfsim.dll -tc 1 -tagb 100 -tlgb 0 -lohar 0 -sohsi 0 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time

_2.bat
%CORE_RUN% gcperfsim.dll -tc 1 -tagb 100 -tlgb 0.1 -lohar 0 -sohsi 50 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time

jkotas · 2022-06-23T16:17:53Z

> Oddly no. I tried measuring using GCPerfSim. The numbers are in error margin and so I don't see much impact.

It is what I would expect as we have discussed offline. The GC should not be doing interlocked operations that often for this to make a measurable difference. I was more wondering whether there is a path through GC where this would make a measurable difference that I did not think about.

kunalspathak · 2022-06-23T16:35:28Z

> It is what I would expect as we have discussed offline.

Right, but @Maoni0 did point out some interesting code paths, like enter_spin_lock and r_join, that use Interlocked::CompareExchange. However, join and the methods around card table updates use Interlocked::Add, etc., which are not optimized as part of this PR, so that could be the reason.

jkotas · 2022-06-23T16:47:57Z

> some interesting code paths like enter_spin_lock and r_join

How much time do the GC benchmarks spend in these methods? Subtract the time spent spinning from that. The expected benefit will be 20% of what remains based on the data from your other PR.
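
As a rough worked version of that estimate, here is a small sketch. The function name ExpectedSavings and all sample counts are hypothetical; only the 20% factor comes from the comment above.

```cpp
#include <cassert>

// Estimates how many CPU samples the change could save: take the samples
// attributed to the interlocked-heavy methods, subtract the samples spent
// spinning, and apply the expected speedup fraction to the remainder.
inline double ExpectedSavings(double samplesInMethods,
                              double samplesSpinning,
                              double speedupFraction = 0.20)
{
    assert(samplesSpinning <= samplesInMethods);
    return (samplesInMethods - samplesSpinning) * speedupFraction;
}
```

For example, with a made-up profile of 1000 samples in these methods, 600 of them spent spinning, ExpectedSavings(1000, 600) gives about 80 samples. Comparing that figure to the total sample count of the run shows whether the benefit can rise above the noise.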

Maoni0 · 2022-06-23T18:04:16Z

You are not going to detect the difference by measuring the total execution time. If you make it so that it hits a code path like enter_spin_lock a lot more, you could potentially detect a difference by looking at the CPU sample counts. Right now you are using -tc 1, which means one allocating thread, and this will not contend the lock that enter_spin_lock enters. So you'd want to up the # of allocating threads by a lot, like -tc 16.
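
To illustrate why the thread count matters, here is a minimal sketch of a CompareExchange-based spin lock with a counter for failed acquisition attempts. It is not the GC's enter_spin_lock: the type SpinLockSketch, the helper RunContended, and the -1 = free / 0 = held convention are assumptions made for this example.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

struct SpinLockSketch
{
    std::atomic<int32_t> state{-1};            // -1 = free, 0 = held (assumed convention)
    std::atomic<uint64_t> failed_attempts{0};  // how often the CAS lost the race

    void Enter()
    {
        int32_t expected = -1;
        while (!state.compare_exchange_strong(expected, 0,
                                              std::memory_order_acquire,
                                              std::memory_order_relaxed))
        {
            failed_attempts.fetch_add(1, std::memory_order_relaxed);
            expected = -1;              // CAS overwrote expected; reset it
            std::this_thread::yield();  // back off before retrying
        }
    }

    void Leave() { state.store(-1, std::memory_order_release); }
};

// Runs threadCount threads that each take the lock iterationsPerThread times
// and increment a counter protected by it. Returns the final counter value.
inline uint64_t RunContended(SpinLockSketch& lock, int threadCount, int iterationsPerThread)
{
    uint64_t shared_counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < threadCount; ++t)
    {
        threads.emplace_back([&] {
            for (int i = 0; i < iterationsPerThread; ++i)
            {
                lock.Enter();
                ++shared_counter;  // guarded by the lock
                lock.Leave();
            }
        });
    }
    for (auto& th : threads) th.join();
    return shared_counter;
}
```

With a single thread the compare-exchange always succeeds on the first try, so failed_attempts stays at zero and the retry path is never exercised. With many threads the counter typically grows, which is the contended case where a cheaper CompareExchange would have a chance of showing up in CPU samples.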

kunalspathak · 2022-06-23T22:03:32Z

I don't see much difference with -tc 16 on server GC. For workstation GC, the numbers have a lot of variance: between 66 and 74 seconds over 5 iterations. I tried profiling workstation GC and don't see any methods that are hot.

However, when I checked at the instruction level, I did see the code around CompareExchange being hot.

Do we think we should still do this, or, as Jan pointed out, skip it since there won't be a measurable difference?

Maoni0 · 2022-06-24T17:52:00Z

I should have been more clear - this is not going to show up as a top method or anything. I meant you'd need to actually look at the CPU sample count that's spent in enter_spin_lock, which is inlined, so this would be in gc_heap::try_allocate_more_space.

I could skip doing this for GC.

kunalspathak · 2022-06-24T17:55:59Z

> I could skip doing this for GC.

Sounds good to me. I am testing another prototype in #71260, which should enable using atomics on linux/arm64 on machines that have the capability. On Windows, we will live without it.

ghost added the area-GC-coreclr label Jun 22, 2022

ghost assigned kunalspathak Jun 22, 2022

Maoni0 approved these changes Jun 23, 2022

kunalspathak added 4 commits June 23, 2022 08:08

Use atomic for CompareExchange, CompareExchangePointer, ExchangePointer

bb60ea3

Compile for gcsample and clrgc

ed13ac5

Revert _InterlockedCompareExchangePointer changes

01f5edc

fix nativeAOT build

b9ff4f7

kunalspathak force-pushed the gc_atomics branch from c92b817 to b9ff4f7 June 23, 2022 15:31

kunalspathak requested a review from MichalStrehovsky as a code owner June 23, 2022 15:31

kunalspathak mentioned this pull request Jun 23, 2022

Enable arm64 atomics for standalone GC #71221

Closed

kunalspathak closed this Jun 24, 2022

kunalspathak deleted the gc_atomics branch June 24, 2022 17:56

kunalspathak mentioned this pull request Jul 8, 2022

gcenv.interlocked's Interlocked use full memory barriers even with 8.1 Atomics #67824

Closed

ghost locked as resolved and limited conversation to collaborators Jul 24, 2022