Conversation

pedrobsaila
Contributor

Fixes #95747

@ghost added the area-CodeGen-coreclr label (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI) on Mar 1, 2024
@ghost

ghost commented Mar 1, 2024

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Fixes #95747

Author: pedrobsaila
Assignees: -
Labels:

area-CodeGen-coreclr

Milestone: -

@JulieLeeMSFT
Member

We don't have time to work on this for .NET 9, so we will review it in .NET 10.

@JulieLeeMSFT
Member

@amanasifkhalid, please review this community PR.

Contributor

@amanasifkhalid left a comment

@pedrobsaila sorry to keep you waiting -- when you're ready, could you please rebase this on top of main so we can kick off a new CI run? Bool opts have undergone some churn lately, so resolving the merge conflicts might take some work.

bool isBool; // If the compTree is boolean expression
};

struct IntBoolOpDsc
Contributor

Instead of the struct-and-initializer pattern, could you please rewrite this as a class so that all the relevant transformations are scoped to it? OptBoolsDsc might be a useful model.

@pedrobsaila
Contributor Author

pedrobsaila commented Mar 30, 2025

The results are the same as before: no noticeable diffs. Either my transformation fails to recognize some GenTree form I'm unaware of, or this pattern doesn't occur in the libraries/tests/aspnet/benchmark code, which would be surprising.

class IntBoolOpDsc
{
private:
IntBoolOpDsc(Compiler* comp)
Contributor

nit: Can we use an initializer list here instead? Ex:

IntBoolOpDsc(Compiler* comp)
    : ctsArray()
    , ctsArrayLength()
    // etc.
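
For illustration, a minimal sketch of what a constructor with a member initializer list could look like. Only `ctsArray` and `ctsArrayLength` come from the snippet above; the class name `IntBoolOpDscSketch`, the `Compiler` stub, the member types, and the accessors are assumptions made up for this example, not the PR's actual code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the JIT's Compiler type (assumption).
struct Compiler
{
};

class IntBoolOpDscSketch
{
public:
    // Members are initialized in declaration order, so the initializer
    // list is written in the same order as the declarations below.
    explicit IntBoolOpDscSketch(Compiler* comp)
        : m_comp(comp)
        , ctsArray(nullptr)
        , ctsArrayLength(0)
    {
    }

    Compiler*   GetCompiler() const { return m_comp; }
    std::size_t ConstantCount() const { return ctsArrayLength; }
    bool        HasConstants() const { return ctsArray != nullptr; }

private:
    Compiler*   m_comp;         // assumed member, not shown in the PR snippet
    long long*  ctsArray;       // element type is an assumption
    std::size_t ctsArrayLength; // length type is an assumption
};
```

Compared with assigning in the constructor body, the initializer list sets each member exactly once and keeps the constructor body empty.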

}

//-----------------------------------------------------------------------------
// Reinit: Procedure that reinitialize IntBoolOpDsc reference
Contributor

Suggested change
- // Reinit: Procedure that reinitialize IntBoolOpDsc reference
+ // Reinit: Procedure that reinitializes IntBoolOpDsc reference

// Arguments:
// tree lcl var tree
//
void IntBoolOpDsc::AppendToLclVarArray(GenTree* tree)
Contributor

If you need growable arrays, try using ArrayStack instead (you might need to #include "arraystack.h" first). For 8 or fewer elements, it allocates on the stack, and switches to the heap if needed. I imagine for most cases, these arrays won't exceed 8 elements, so ArrayStack makes sense here.
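
To make the inline-then-heap behavior described above concrete, here is a rough sketch of a small-buffer stack. It is not the JIT's `ArrayStack` and does not mirror its API or its allocator; the name `SmallStack` and all of its members are hypothetical, and it uses plain `new[]`/`delete[]` purely for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical small-buffer stack: the first InlineCapacity elements live
// inside the object itself; a heap buffer is used only once that is full.
// Requires T to be default-constructible and copy-assignable.
template <typename T, std::size_t InlineCapacity = 8>
class SmallStack
{
public:
    SmallStack() : m_data(m_inline), m_size(0), m_capacity(InlineCapacity) {}

    ~SmallStack()
    {
        if (m_data != m_inline)
        {
            delete[] m_data;
        }
    }

    SmallStack(const SmallStack&)            = delete;
    SmallStack& operator=(const SmallStack&) = delete;

    void Push(const T& value)
    {
        if (m_size == m_capacity)
        {
            Grow();
        }
        m_data[m_size++] = value;
    }

    T Pop()
    {
        assert(m_size > 0);
        return m_data[--m_size];
    }

    const T& At(std::size_t i) const
    {
        assert(i < m_size);
        return m_data[i];
    }

    std::size_t Size() const { return m_size; }
    bool        OnHeap() const { return m_data != m_inline; }

private:
    // Doubles the capacity and copies the existing elements over.
    void Grow()
    {
        std::size_t newCapacity = m_capacity * 2;
        T*          newData     = new T[newCapacity];
        for (std::size_t i = 0; i < m_size; i++)
        {
            newData[i] = m_data[i];
        }
        if (m_data != m_inline)
        {
            delete[] m_data;
        }
        m_data     = newData;
        m_capacity = newCapacity;
    }

    T           m_inline[InlineCapacity];
    T*          m_data;
    std::size_t m_size;
    std::size_t m_capacity;
};
```

With the default capacity, pushing eight elements keeps `OnHeap()` false, and the ninth push triggers the switch to a heap buffer.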

//-----------------------------------------------------------------------------
// Free: Procedure that frees IntBoolOpDsc reference
//
void IntBoolOpDsc::Free()
Contributor

Once you switch the arrays over to using ArrayStack, you can get rid of IntBoolOpDsc::Free entirely.

}

//-----------------------------------------------------------------------------
// TryOptimize: Function that fold constant INT OR operations
Contributor

Suggested change
- // TryOptimize: Function that fold constant INT OR operations
+ // TryOptimize: Function that folds constant INT OR operations
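
As a rough picture of what folding constants across a chain of OR operations means, here is a toy sketch on a hand-rolled expression tree. It is not the PR's implementation and does not use the JIT's `GenTree`; `Node`, `MakeConst`, `MakeVar`, `MakeOr`, `FoldOrChain`, `Eval`, and `CountConsts` are all hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Toy expression node (assumption): a constant, a variable index, or an OR.
struct Node
{
    enum Kind { Const, Var, Or };
    Kind                  kind  = Const;
    int64_t               value = 0; // constant value, or variable index for Var
    std::unique_ptr<Node> lhs;
    std::unique_ptr<Node> rhs;
};
using NodePtr = std::unique_ptr<Node>;

NodePtr MakeConst(int64_t v) { NodePtr n(new Node()); n->kind = Node::Const; n->value = v; return n; }
NodePtr MakeVar(int index)   { NodePtr n(new Node()); n->kind = Node::Var; n->value = index; return n; }
NodePtr MakeOr(NodePtr a, NodePtr b)
{
    NodePtr n(new Node());
    n->kind = Node::Or;
    n->lhs  = std::move(a);
    n->rhs  = std::move(b);
    return n;
}

// Walks an OR chain, OR-ing every constant leaf into `folded` and
// recording every variable leaf in `vars`.
void Collect(const Node* n, int64_t& folded, std::vector<int>& vars)
{
    switch (n->kind)
    {
        case Node::Const: folded |= n->value; break;
        case Node::Var:   vars.push_back(static_cast<int>(n->value)); break;
        case Node::Or:    Collect(n->lhs.get(), folded, vars); Collect(n->rhs.get(), folded, vars); break;
    }
}

// Rebuilds the chain as (v0 | v1 | ...) | foldedConstant.
NodePtr FoldOrChain(const Node* root)
{
    int64_t          folded = 0;
    std::vector<int> vars;
    Collect(root, folded, vars);

    NodePtr result;
    for (int v : vars)
    {
        NodePtr leaf = MakeVar(v);
        if (result) result = MakeOr(std::move(result), std::move(leaf));
        else        result = std::move(leaf);
    }
    if (!result) return MakeConst(folded);
    if (folded != 0) result = MakeOr(std::move(result), MakeConst(folded));
    return result;
}

int64_t Eval(const Node* n, const std::vector<int64_t>& env)
{
    switch (n->kind)
    {
        case Node::Const: return n->value;
        case Node::Var:   return env[static_cast<std::size_t>(n->value)];
        default:          return Eval(n->lhs.get(), env) | Eval(n->rhs.get(), env);
    }
}

int CountConsts(const Node* n)
{
    if (n->kind == Node::Const) return 1;
    if (n->kind == Node::Var) return 0;
    return CountConsts(n->lhs.get()) + CountConsts(n->rhs.get());
}
```

Reordering is valid here because bitwise OR is associative and commutative, so a tree for `(x | 10) | (y | 5)` can be rebuilt as `(x | y) | 15` with a single constant leaf.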

[MethodImpl(MethodImplOptions.NoInlining)]
private static int Or10Or5(int x, int y)
{
return (x | 10) | (y | 5);
Contributor

Could you please try running this test with your changes and the environment variable DOTNET_JitDisasm="Or10Or5" set so we can verify if the optimization kicked in? Let me know if you need help running the test locally.

Contributor Author

pedrobsaila commented Apr 13, 2025

; Assembly listing for method CBoolTest:Or10Or5(int,int):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )     int  ->  rcx         single-def
;  V01 arg1         [V01,T01] (  3,  3   )     int  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 0

G_M36715_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M36715_IG02:  ;; offset=0x0000
       mov      eax, edx
       or       eax, ecx
       or       eax, 15
                                                ;; size=7 bbWeight=1 PerfScore 0.75
G_M36715_IG03:  ;; offset=0x0007
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 8, prolog size 0, PerfScore 1.75, instruction count 4, allocated bytes for code 8 (MethodHash=f92e7094) for method CBoolTest:Or10Or5(int,int):int (FullOpts)
; ============================================================
; Assembly listing for method CBoolTest:LongOr10Or5(long,long):long (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )    long  ->  rcx         single-def
;  V01 arg1         [V01,T01] (  3,  3   )    long  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 0

G_M33752_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M33752_IG02:  ;; offset=0x0000
       mov      rax, rdx
       or       rax, rcx
       or       rax, 15
                                                ;; size=10 bbWeight=1 PerfScore 0.75
G_M33752_IG03:  ;; offset=0x000A
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 11, prolog size 0, PerfScore 1.75, instruction count 4, allocated bytes for code 11 (MethodHash=4c277c27) for method CBoolTest:LongOr10Or5(long,long):long (FullOpts)
; ============================================================
; Assembly listing for method CBoolTest:ByteOr10Or5(ubyte,ubyte):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   ubyte  ->  rcx         single-def
;  V01 arg1         [V01,T01] (  3,  3   )   ubyte  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 0

G_M39681_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M39681_IG02:  ;; offset=0x0000
       movzx    rax, dl
       movzx    rcx, cl
       or       eax, ecx
       or       eax, 15
                                                ;; size=11 bbWeight=1 PerfScore 1.00
G_M39681_IG03:  ;; offset=0x000B
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 12, prolog size 0, PerfScore 2.00, instruction count 5, allocated bytes for code 12 (MethodHash=65d464fe) for method CBoolTest:ByteOr10Or5(ubyte,ubyte):int (FullOpts)
; ============================================================
; Assembly listing for method CBoolTest:ShortOr10Or5(short,short):int (FullOpts)
; Emitting BLENDED_CODE for X64 with AVX - Windows
; FullOpts code
; optimized code
; rsp based frame
; partially interruptible
; No PGO data
; Final local variable assignments
;
;  V00 arg0         [V00,T00] (  3,  3   )   short  ->  rcx         single-def
;  V01 arg1         [V01,T01] (  3,  3   )   short  ->  rdx         single-def
;# V02 OutArgs      [V02    ] (  1,  1   )  struct ( 0) [rsp+0x00]  do-not-enreg[XS] addr-exposed "OutgoingArgSpace" <Empty>
;
; Lcl frame size = 0

G_M54649_IG01:  ;; offset=0x0000
                                                ;; size=0 bbWeight=1 PerfScore 0.00
G_M54649_IG02:  ;; offset=0x0000
       movsx    rax, dx
       movsx    rcx, cx
       or       eax, ecx
       or       eax, 15
                                                ;; size=13 bbWeight=1 PerfScore 1.00
G_M54649_IG03:  ;; offset=0x000D
       ret
                                                ;; size=1 bbWeight=1 PerfScore 1.00

; Total bytes of code 14, prolog size 0, PerfScore 2.00, instruction count 5, allocated bytes for code 14 (MethodHash=ad252a86) for method CBoolTest:ShortOr10Or5(short,short):int (FullOpts)
; ============================================================

Contributor Author

pedrobsaila commented Apr 13, 2025

The transformation kicks in if I explicitly set DOTNET_TieredCompilation to 0 (I'm working with a Debug configuration).
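
As a sanity check on the listings above: the single `or eax, 15` (or `or rax, 15`) in each method is what one would expect once the two constants are merged, since 10 | 5 = 15. A minimal sketch that checks the equivalence over all byte-sized inputs follows. `Or10Or5` mirrors the C# test method; `FoldedOr` and `MatchesForAllBytes` are names made up for this illustration.

```cpp
#include <cassert>

// Same expression as the C# test method in the PR.
int Or10Or5(int x, int y)
{
    return (x | 10) | (y | 5);
}

// The shape the disassembly suggests: OR the inputs, then one constant.
int FoldedOr(int x, int y)
{
    return (x | y) | 15;
}

// Exhaustively compares the two forms for all pairs of byte values.
bool MatchesForAllBytes()
{
    for (int x = 0; x <= 255; x++)
    {
        for (int y = 0; y <= 255; y++)
        {
            if (Or10Or5(x, y) != FoldedOr(x, y))
            {
                return false;
            }
        }
    }
    return true;
}
```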

@JulieLeeMSFT
Member

@pedrobsaila, is this PR ready for another round of review?

@pedrobsaila
Contributor Author

pedrobsaila commented Apr 14, 2025

> @pedrobsaila, is this PR ready for another round of review?

Yes, it is ready; sorry for keeping you waiting. Since my previous code was not producing any asm diffs, I tried extending support to long, byte, and short to see if that would produce any. Unfortunately, I had no luck. If you think the new code adds unnecessary complexity, let me know and I'll roll it back.

@pedrobsaila
Contributor Author

Just fixing a conflict with the main branch here.

Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Successfully merging this pull request may close these issues.

JIT: missing opportunities in constant folding around bitwise ops
3 participants