Skip to content

Conversation

kunalspathak
Copy link
Contributor

@kunalspathak kunalspathak commented Jun 22, 2022

Following the example of #70921, continue to optimize following APIs using atomics in GC code base for Windows/arm64:

  • Interlocked::Exchange
  • Interlocked::CompareExchange
  • Interlocked::ExchangePointer

I was not able to optimize more methods because of lack of MSVC having them available. I have opened https://developercommunity.visualstudio.com/t/Add-APIs-for-Arm64-intrinsics-for-_Inter/10078117?space=62&q=intrinsic&entry=myfeedback to make a suggestion. I couldn't even write assembly equivalent for them because MSVC doesn't handle __asm for arm64 and writing the method in a .asm will invoke a function call, while we want these APIs to be inlined.

image

I have declare/defined g_atomics_available_present for clrgc and gcsample but not setting them, so it will be OFF by default for them.

@ghost ghost added the area-GC-coreclr label Jun 22, 2022
@ghost ghost assigned kunalspathak Jun 22, 2022
@kunalspathak
Copy link
Contributor Author

@Maoni0 @dotnet/jit-contrib

@ghost
Copy link

ghost commented Jun 22, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Following the example of #70921, continue to optimize following APIs using atomics in GC code base for Windows/arm64:

  • Interlocked::Exchange
  • Interlocked::CompareExchange
  • Interlocked::ExchangePointer

I was not able to optimize more methods because of lack of MSVC having them available. I have opened https://developercommunity.visualstudio.com/myfeedback?space=62&q=intrinsic&entry=myfeedback to make a suggestion. I couldn't even write assembly equivalent for them because MSVC doesn't handle __asm for arm64 and writing the method in a .asm will invoke a function call, while we want these APIs to be inlined.

image

I have declare/defined g_atomics_available_present for clrgc and gcsample but not setting them, so it will be OFF by default for them.

Author: kunalspathak
Assignees: -
Labels:

area-GC-coreclr

Milestone: -

@jkotas
Copy link
Member

jkotas commented Jun 23, 2022

Any measurable perf improvements from this change?

@jkotas
Copy link
Member

jkotas commented Jun 23, 2022

To fix the NativeAOT build break, you can define and initialize the flag here:

@kunalspathak
Copy link
Contributor Author

Any measurable perf improvements from this change?

Oddly no. I tried measuring using GCPerfSim. The numbers are in error margin and so I don't see much impact. @Maoni0 - any other benchmarks you would recommend? I intentionally didn't set COMPlus_GCgen0size so it gets lower cache (the binaries don't have #71029) and hence would trigger GC more often.

COMPlus_GCCpuGroup=1
COMPlus_Thread_UseAllCpuGroups=1
COMPlus_TieredCompilation=0

_1.bat
%CORE_RUN% E:\kpathak\GCPerfSim\net6.0\gcperfsim.dll -tc 1 -tagb 100 -tlgb 0 -lohar 0 -sohsi 0 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time

_2.bat
%CORE_RUN% gcperfsim.dll -tc 1 -tagb 100 -tlgb 0.1 -lohar 0 -sohsi 50 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time

image

@jkotas
Copy link
Member

jkotas commented Jun 23, 2022

Oddly no. I tried measuring using GCPerfSim. The numbers are in error margin and so I don't see much impact.

It is what I would expect as we have discussed offline. The GC should not be doing interlocked operations that often for this to make a measurable difference. I was more wondering whether there is a path through GC where this would make a measurable difference that I did not think about.

@kunalspathak
Copy link
Contributor Author

kunalspathak commented Jun 23, 2022

It is what I would expect as we have discussed offline.

Right, but @Maoni0 did point out some interesting code paths like enter_spin_lock and r_join uses Interlocked::CompareExchange. However join or methods around card table updates uses Interlocked::Add, etc. that are not optimized as part of this PR so that could be the reason.

@jkotas
Copy link
Member

jkotas commented Jun 23, 2022

some interesting code paths like enter_spin_lock and r_join

How much time do the GC benchmarks spend in these methods? Subtract the time spent spinning from that. The expected benefit will be 20% of what remains based on the data from your other PR.

@Maoni0
Copy link
Member

Maoni0 commented Jun 23, 2022

you are not going to detect the difference by measuring the total execution time. if you make it so that it hits the code path like enter_spin_lock a lot more, you could potentially detect difference by looking at the CPU sample counts. right now you are using
-tc 1 which means one allocating threads and this will not contend the lock that enter_spin_lock enters. so you'd want to up the # of allocating threads by a lot. like -tc 16.

@kunalspathak
Copy link
Contributor Author

I don't see much difference with -tc 16 serverGC . For the workstation GC, the numbers have lot of variances...between 66secs to 74 secs over 5 iterations. I tried to profile workstation GC and don't see methods that are hot.

image

However, when I checked at instruction level, I did see code around compareexchange to be hot.

image

Do we think we should still do this or as Jan pointed, since there won't be measurable difference skip doing it?

@Maoni0
Copy link
Member

Maoni0 commented Jun 24, 2022

I should have been more clear - this is not going to show up as a top method or anything. I meant you'd need to actually look at the CPU sample count that's spent in enter_spin_lock which is inlined so this would be in gc_heap::try_allocate_more_space.

I could skip doing this for GC.

@kunalspathak
Copy link
Contributor Author

I could skip doing this for GC.

Sounds good to me. I am testing another prototype in #71260 which should enable using atomics on linux/arm64 on machines that have capability. On windows, we will live without it.

@kunalspathak kunalspathak deleted the gc_atomics branch June 24, 2022 17:56
@ghost ghost locked as resolved and limited conversation to collaborators Jul 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants