Conversation

Member

@jakobbotsch jakobbotsch commented May 1, 2024

`RefTypeKill` is one of the most common types of ref position because we
create a separate one for every register whenever registers are killed,
and registers are killed with very high frequency (for example, all
callee-trashed registers at calls). This can be seen in metrics: in
`libraries_tests.run`, `RefTypeKill` is as common as `RefTypeUse` and
`RefTypeDef` combined on x64, and 3x as common as them combined on arm64,
due to that target having more registers.

The main complication in changing `RefTypeKill` to represent killing
a full set of registers is that we want to be able to easily
look up the next `RefPosition` at which a particular register is killed
or used. Today this is very simple: all kills (`RefTypeKill`) and uses
(`RefTypeFixedReg`) participate in an intrusive linked list, and finding
the next `RefPosition` is as simple as following this list.

This PR changes LSRA to model kills in bulk instead of creating an
individual `RefPosition` for every kill. To handle finding the next
fixed-reg ref position, LSRA is updated to take into account that
`RefPosition`s corresponding to a register can come from two different
linked lists: the intrusive `RefTypeFixedReg` linked list, and also a
linked list of all `RefTypeKill` ref positions. The latter list
may take some searching (not every kill necessarily kills the register
we are looking for), but the reduction in the number of `RefPosition`s
created more than makes up for this.

No diffs are expected.

This change reduces memory usage of the JIT by around 12% for arm64,
and by around 7% for x64.

@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 1, 2024
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

@jakobbotsch jakobbotsch marked this pull request as ready for review May 1, 2024 17:43
@jakobbotsch
Member Author

cc @dotnet/jit-contrib PTAL @kunalspathak

No asm diffs. Large TP improvements (4 to 8% on MinOpts arm64), and large reduction in memory use (~12% on arm64, ~7% on x64).

This mostly pays off the cost of predicate registers in MinOpts. It will also make it slightly less expensive to add predicate registers since we no longer pay for multiple new ref positions for their kills around calls.


@jakobbotsch jakobbotsch requested a review from kunalspathak May 1, 2024 17:46
Contributor

@kunalspathak kunalspathak left a comment


Thanks for doing this. When I tried this last year (#87294 (comment)), I had a different approach to getting the next `RefPosition`, but your approach is simpler and gives the expected TP win.

This mostly pays off the cost of predicate registers in MinOpts.

Yes, I had a TODO in here for zeroing out predicate registers, which I removed, but this will definitely lower the TP cost.

  // We could use RefTypeKill, but RefTypeFixedReg is used less commonly, so the check for delayRegFree
  // during register allocation should be cheaper in terms of TP.
- RefPosition* pos = newRefPosition(REG_SWIFT_ERROR, currentLoc, RefTypeFixedReg, call, RBM_SWIFT_ERROR);
+ RefPosition* pos = newRefPosition(REG_SWIFT_ERROR, currentLoc + 1, RefTypeFixedReg, call, RBM_SWIFT_ERROR);
Contributor
what is this change?

Member Author

We were hitting this assert:

// We can't have two RefPositions on a RegRecord at the same location, unless they are different types.
assert((regRecord->lastRefPosition == nullptr) || (regRecord->lastRefPosition->nodeLocation < theLocation) ||
(regRecord->lastRefPosition->refType != theRefType));

In the baseline we create these RefPositions:

DefList: { N047.t48. PUTARG_REG; N051.t49. PUTARG_REG }
N075 ( 20,  8) [000016] --CXG------                         ▌  CALL ind unman long   REG NA $103
<RefPosition #69  @75  RefTypeFixedReg <Reg:x0 > BB02 regmask=[x0] minReg=1 wt=100.00>
<RefPosition #70  @75  RefTypeUse <Ivl:17> BB02 regmask=[x0] minReg=1 last fixed wt=100.00>
<RefPosition #71  @75  RefTypeFixedReg <Reg:x21> BB02 regmask=[x21] minReg=1 wt=100.00>
<RefPosition #72  @75  RefTypeUse <Ivl:19> BB02 regmask=[x21] minReg=1 last fixed wt=100.00>
Last use of V09 between PUTARG and CALL. Removing occupied arg regs from preferences: [x0 x21]
<RefPosition #73  @75  RefTypeUse <Ivl:2 V09> LCL_VAR BB02 regmask=[x0-xip0 x19-x28] minReg=1 last wt=400.00>
<RefPosition #74  @76  RefTypeKill <Reg:x0 > BB02 regmask=[x0] minReg=1 wt=100.00>
<RefPosition #75  @76  RefTypeKill <Reg:x1 > BB02 regmask=[x1] minReg=1 wt=100.00>
<RefPosition #76  @76  RefTypeKill <Reg:x2 > BB02 regmask=[x2] minReg=1 wt=100.00>
<RefPosition #77  @76  RefTypeKill <Reg:x3 > BB02 regmask=[x3] minReg=1 wt=100.00>
<RefPosition #78  @76  RefTypeKill <Reg:x4 > BB02 regmask=[x4] minReg=1 wt=100.00>
<RefPosition #79  @76  RefTypeKill <Reg:x5 > BB02 regmask=[x5] minReg=1 wt=100.00>
<RefPosition #80  @76  RefTypeKill <Reg:x6 > BB02 regmask=[x6] minReg=1 wt=100.00>
<RefPosition #81  @76  RefTypeKill <Reg:x7 > BB02 regmask=[x7] minReg=1 wt=100.00>
<RefPosition #82  @76  RefTypeKill <Reg:x8 > BB02 regmask=[x8] minReg=1 wt=100.00>
<RefPosition #83  @76  RefTypeKill <Reg:x9 > BB02 regmask=[x9] minReg=1 wt=100.00>
<RefPosition #84  @76  RefTypeKill <Reg:x10> BB02 regmask=[x10] minReg=1 wt=100.00>
<RefPosition #85  @76  RefTypeKill <Reg:x11> BB02 regmask=[x11] minReg=1 wt=100.00>
<RefPosition #86  @76  RefTypeKill <Reg:x12> BB02 regmask=[x12] minReg=1 wt=100.00>
<RefPosition #87  @76  RefTypeKill <Reg:x13> BB02 regmask=[x13] minReg=1 wt=100.00>
<RefPosition #88  @76  RefTypeKill <Reg:x14> BB02 regmask=[x14] minReg=1 wt=100.00>
<RefPosition #89  @76  RefTypeKill <Reg:x15> BB02 regmask=[x15] minReg=1 wt=100.00>
<RefPosition #90  @76  RefTypeKill <Reg:xip0> BB02 regmask=[xip0] minReg=1 wt=100.00>
<RefPosition #91  @76  RefTypeKill <Reg:xip1> BB02 regmask=[xip1] minReg=1 wt=100.00>
<RefPosition #92  @76  RefTypeKill <Reg:x21> BB02 regmask=[x21] minReg=1 wt=100.00>
<RefPosition #93  @76  RefTypeKill <Reg:lr > BB02 regmask=[lr] minReg=1 wt=100.00>
<RefPosition #94  @76  RefTypeKillGCRefs CALL BB02 regmask=[x8-xip0 x19-x28] minReg=1 wt=400.00>
Interval 22: long RefPositions {} physReg:NA Preferences=[x0-xip0 x19-x28] Aversions=[]
<RefPosition #95  @76  RefTypeFixedReg <Reg:x0 > BB02 regmask=[x0] minReg=1 wt=100.00>
<RefPosition #96  @76  RefTypeDef <Ivl:22> CALL BB02 regmask=[x0] minReg=1 fixed wt=400.00>
<RefPosition #97  @75  RefTypeFixedReg <Reg:x21> CALL BB02 regmask=[x21] minReg=1 wt=400.00>

Notice that RefPosition #97 is at location 75 even though it comes after a RefPosition at location 76; that on its own seems wrong (which is why I changed this). But it doesn't hit the assert, because between RefPosition #71 and RefPosition #97 there is the kill at RefPosition #92.

With this change and this PR we now create these RefPositions:

DefList: { N047.t48. PUTARG_REG; N051.t49. PUTARG_REG }
N075 ( 20,  8) [000016] --CXG------                         ▌  CALL ind unman long   REG NA $103
<RefPosition #51  @75  RefTypeFixedReg <Reg:x0 > BB02 regmask=[x0] minReg=1 wt=100.00>
<RefPosition #52  @75  RefTypeUse <Ivl:17> BB02 regmask=[x0] minReg=1 last fixed wt=100.00>
<RefPosition #53  @75  RefTypeFixedReg <Reg:x21> BB02 regmask=[x21] minReg=1 wt=100.00>
<RefPosition #54  @75  RefTypeUse <Ivl:19> BB02 regmask=[x21] minReg=1 last fixed wt=100.00>
Last use of V09 between PUTARG and CALL. Removing occupied arg regs from preferences: [x0 x21]
<RefPosition #55  @75  RefTypeUse <Ivl:2 V09> LCL_VAR BB02 regmask=[x0-xip0 x19-x28] minReg=1 last wt=400.00>
<RefPosition #56  @76  RefTypeKill BB02 regmask=[x0-xip1 x21 lr] minReg=1 wt=100.00>
<RefPosition #57  @76  RefTypeKillGCRefs CALL BB02 regmask=[x8-xip0 x19-x28] minReg=1 wt=400.00>
Interval 22: long RefPositions {} physReg:NA Preferences=[x0-xip0 x19-x28] Aversions=[]
<RefPosition #58  @76  RefTypeFixedReg <Reg:x0 > BB02 regmask=[x0] minReg=1 wt=100.00>
<RefPosition #59  @76  RefTypeDef <Ivl:22> CALL BB02 regmask=[x0] minReg=1 fixed wt=400.00>
<RefPosition #60  @76  RefTypeFixedReg <Reg:x21> CALL BB02 regmask=[x21] minReg=1 wt=400.00>

Looking at this, however, it seems like the ref position is not inserted in the right location. We need to insert it before the definition of the call to ensure that it doesn't get allocated into x21. In fact, we should probably just mark the existing RefPosition #53 as delay-freed. I'll open a separate issue about this.

assert(nextIntervalRef[reg] == MaxLocation);
assert(spillCost[reg] == 0);
}
LsraLocation thisNextFixedRef = physRegRecord->getNextRefLocation();
Contributor


why is this not needed anymore?

Member Author

@jakobbotsch jakobbotsch May 2, 2024


The assert isn't true anymore: `nextFixedRef[reg]` has either the location of the next `RefTypeFixedReg` or of the next `RefTypeKill` that kills the register (to match the behavior before this change). But `physRegRecord->getNextRefLocation()` is always the next `RefTypeFixedReg` after this change.

@jakobbotsch jakobbotsch merged commit f95a76b into dotnet:main May 2, 2024
@jakobbotsch jakobbotsch deleted the bulk-kill-registers branch May 2, 2024 09:24
michaelgsharp pushed a commit to michaelgsharp/runtime that referenced this pull request May 9, 2024
Ruihan-Yin pushed a commit to Ruihan-Yin/runtime that referenced this pull request May 30, 2024
@github-actions github-actions bot locked and limited conversation to collaborators Jun 2, 2024