Short version: I was reading @stephentoub's article Performance Improvements in .NET Core 2.1. I noticed that his example for avoiding boxing allocations thanks to dotnet/coreclr#14698 uses the is operator followed by a cast, when in C# 7 the same code could be simplified using pattern matching. So I was wondering whether using the C# 7 feature also results in the same efficient code. It turns out it doesn't, and I think this should be improved.
More details:
Consider this code:
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        Cast(new Dog());
        Pattern(new Dog());
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Cast<T>(T thing)
    {
        if (thing is IAnimal)
            ((IAnimal)thing).MakeSound();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void Pattern<T>(T thing)
    {
        if (thing is IAnimal animal)
            animal.MakeSound();
    }
}

struct Dog : IAnimal
{
    public void Bark() { }
    void IAnimal.MakeSound() => Bark();
}

interface IAnimal
{
    void MakeSound();
}
The IL for the relevant methods is:
.method private hidebysig static void Cast<T>(!!T thing) cil managed noinlining
{
// Code size 30 (0x1e)
.maxstack 8
IL_0000: ldarg.0
IL_0001: box !!T
IL_0006: isinst IAnimal
IL_000b: brfalse.s IL_001d
IL_000d: ldarg.0
IL_000e: box !!T
IL_0013: castclass IAnimal
IL_0018: callvirt instance void IAnimal::MakeSound()
IL_001d: ret
}
.method private hidebysig static void Pattern<T>(!!T thing) cil managed noinlining
{
// Code size 22 (0x16)
.maxstack 2
.locals init (class IAnimal V_0)
IL_0000: ldarg.0
IL_0001: box !!T
IL_0006: isinst IAnimal
IL_000b: dup
IL_000c: stloc.0
IL_000d: brfalse.s IL_0015
IL_000f: ldloc.0
IL_0010: callvirt instance void IAnimal::MakeSound()
IL_0015: ret
}
Notice how in Pattern, the boxed object is saved to a local variable (typed as the interface).
The disassembly from .NET Core 2.1.0-preview2-26406-04 win10-x64 is:
; Assembly listing for method Program:Cast(struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 2, 2 ) struct ( 8) [rsp+0x08] do-not-enreg[XS] addr-exposed
;* V01 tmp0 [V01 ] ( 0, 0 ) ref -> zero-ref class-hnd exact
;* V02 tmp1 [V02 ] ( 0, 0 ) struct ( 8) zero-ref do-not-enreg[SF] class-hnd exact
;# V03 OutArgs [V03 ] ( 1, 1 ) lclBlk ( 0) [rsp+0x00]
;
; Lcl frame size = 0
G_M19994_IG01:
48894C2408 mov qword ptr [rsp+08H], rcx
G_M19994_IG02:
C3 ret
; Total bytes of code 6, prolog size 0 for method Program:Cast(struct)
; ============================================================
; Assembly listing for method Program:Pattern(struct)
; Emitting BLENDED_CODE for X64 CPU with AVX
; optimized code
; rsp based frame
; partially interruptible
; Final local variable assignments
;
; V00 arg0 [V00 ] ( 4, 4 ) struct ( 8) [rsp+0x30] do-not-enreg[XSF] addr-exposed
; V01 loc0 [V01,T02] ( 3, 2 ) ref -> rax class-hnd exact
; V02 tmp0 [V02,T00] ( 4, 8 ) ref -> rax class-hnd exact
; V03 tmp1 [V03,T01] ( 2, 4 ) ref -> rax class-hnd exact
; V04 OutArgs [V04 ] ( 1, 1 ) lclBlk (32) [rsp+0x00]
;
; Lcl frame size = 40
G_M22101_IG01:
4883EC28 sub rsp, 40
48894C2430 mov qword ptr [rsp+30H], rcx
G_M22101_IG02:
48B9005F64B2F87F0000 mov rcx, 0x7FF8B2645F00
E8A86B0F5F call CORINFO_HELP_NEWSFAST
480FBE4C2430 movsx rcx, byte ptr [rsp+30H]
884808 mov byte ptr [rax+8], cl
488BC8 mov rcx, rax
E897FBFFFF call Dog:IAnimal.MakeSound():this
90 nop
G_M22101_IG03:
4883C428 add rsp, 40
C3 ret
; Total bytes of code 47, prolog size 4 for method Program:Pattern(struct)
; ============================================================
Notice how for Cast, almost all the code, including the boxing allocation, is optimized away (the remaining mov seems to be unnecessary, but that's not really relevant here). But for Pattern, all the code is still there, including an allocation and a non-inlined call to Dog.IAnimal.MakeSound.
The two versions of the code do the same thing, so I think they should have comparable performance, especially since the pattern matching version is more readable and I suspect it will also be more common in new code than the other version.
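For anyone who wants to check the difference without reading disassembly, here is a minimal sketch that compares bytes allocated by the two methods using GC.GetAllocatedBytesForCurrentThread(). The AllocationProbe class, the Measure helper, and the Sounds counter are hypothetical names added only for this illustration; they are not part of the original repro. The numbers will depend on the runtime version, and with tiered compilation the first calls may run unoptimized code and allocate anyway, so treat the output as a rough indication rather than a benchmark.

```csharp
using System;
using System.Runtime.CompilerServices;

interface IAnimal
{
    void MakeSound();
}

struct Dog : IAnimal
{
    // Hypothetical counter so the calls have an observable effect.
    public static int Sounds;

    public void Bark() => Sounds++;
    void IAnimal.MakeSound() => Bark();
}

static class AllocationProbe
{
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Cast<T>(T thing)
    {
        if (thing is IAnimal)
            ((IAnimal)thing).MakeSound();
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    public static void Pattern<T>(T thing)
    {
        if (thing is IAnimal animal)
            animal.MakeSound();
    }

    // Hypothetical helper: returns the bytes allocated on the current thread
    // while running `action` `iterations` times, after one warm-up call.
    public static long Measure(Action action, int iterations)
    {
        action(); // warm up so the first JIT compilation is not counted

        long before = GC.GetAllocatedBytesForCurrentThread();
        for (int i = 0; i < iterations; i++)
            action();
        return GC.GetAllocatedBytesForCurrentThread() - before;
    }

    static void Main()
    {
        const int N = 100_000;
        Console.WriteLine($"Cast:    {Measure(() => Cast(new Dog()), N)} bytes");
        Console.WriteLine($"Pattern: {Measure(() => Pattern(new Dog()), N)} bytes");
    }
}
```

If the boxing is elided for one method but not the other, the two totals should differ by roughly one boxed Dog per iteration.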
How hard would it be to make this optimization work even in the pattern matching version?
If it would be too hard to perform this optimization in the JIT, is there a reasonable way for the C# compiler to emit IL that would be optimized?
cc (?): @AndyAyersMS, @benaadams, @justinvp
category:cq
theme:importer
skill-level:expert
cost:medium