Skip to content

[GlobalMerge] Aggressively merge constants to reduce TOC entries #111756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Oct 24, 2024

Conversation

syzaara
Copy link
Contributor

@syzaara syzaara commented Oct 9, 2024

Symbols that get mapped into the read-only section are loaded as part of the text segment and will always need a TOC entry to be addressable. Add an option to aggressively merge these read only globals to reduce TOC usage.

Symbols that get mapped into the read-only section are loaded
as part of the text segment and will always need a TOC entry to be
addressable. Add an option to aggressively merge these read only
globals to reduce TOC usage.
@llvmbot
Copy link
Member

llvmbot commented Oct 9, 2024

@llvm/pr-subscribers-backend-powerpc

Author: Zaara Syeda (syzaara)

Changes

Symbols that get mapped into the read-only section are loaded as part of the text segment and will always need a TOC entry to be addressable. Add an option to aggressively merge these read only globals to reduce TOC usage.


Full diff: https://github.com/llvm/llvm-project/pull/111756.diff

2 Files Affected:

  • (modified) llvm/lib/CodeGen/GlobalMerge.cpp (+6-1)
  • (added) llvm/test/CodeGen/PowerPC/aix-aggressive-merge-const.ll (+93)
diff --git a/llvm/lib/CodeGen/GlobalMerge.cpp b/llvm/lib/CodeGen/GlobalMerge.cpp
index 007bea9a6585ef..4c6c8c600ee2bb 100644
--- a/llvm/lib/CodeGen/GlobalMerge.cpp
+++ b/llvm/lib/CodeGen/GlobalMerge.cpp
@@ -119,6 +119,11 @@ static cl::opt<bool> GlobalMergeGroupByUse(
     "global-merge-group-by-use", cl::Hidden,
     cl::desc("Improve global merge pass to look at uses"), cl::init(true));
 
+static cl::opt<bool> GlobalMergeAllConst(
+    "global-merge-all-const", cl::Hidden,
+    cl::desc("Merge all const globals without looking at uses"),
+    cl::init(false));
+
 static cl::opt<bool> GlobalMergeIgnoreSingleUse(
     "global-merge-ignore-single-use", cl::Hidden,
     cl::desc("Improve global merge pass to ignore globals only used alone"),
@@ -263,7 +268,7 @@ bool GlobalMergeImpl::doMerge(SmallVectorImpl<GlobalVariable *> &Globals,
       });
 
   // If we want to just blindly group all globals together, do so.
-  if (!GlobalMergeGroupByUse) {
+  if (!GlobalMergeGroupByUse || (GlobalMergeAllConst && isConst)) {
     BitVector AllGlobals(Globals.size());
     AllGlobals.set();
     return doMerge(Globals, AllGlobals, M, isConst, AddrSpace);
diff --git a/llvm/test/CodeGen/PowerPC/aix-aggressive-merge-const.ll b/llvm/test/CodeGen/PowerPC/aix-aggressive-merge-const.ll
new file mode 100644
index 00000000000000..ec2d409bfb971d
--- /dev/null
+++ b/llvm/test/CodeGen/PowerPC/aix-aggressive-merge-const.ll
@@ -0,0 +1,93 @@
+; RUN: llc --verify-machineinstrs -mtriple powerpc-ibm-aix-xcoff \
+; RUN: -global-merge-all-const=true < %s | FileCheck %s
+
+; RUN: llc --verify-machineinstrs -mtriple powerpc-ibm-aix-xcoff \
+; RUN: -global-merge-all-const=false < %s | FileCheck --check-prefix=NOMERGE %s
+
+%struct.pc_t = type { i8 }
+%struct.S = type { i32, i32, i32, i32, [9 x i32] }
+
+@constinit = private unnamed_addr constant <{ i32, i32, i32, i32, [9 x i32] }> <{ i32 0, i32 0, i32 0, i32 2, [9 x i32] zeroinitializer }>, align 4
+@.str = private unnamed_addr constant [6 x i8] c"hello\00", align 1
+@.str.1 = private unnamed_addr constant [6 x i8] c"world\00", align 1
+@.str.2 = private unnamed_addr constant [6 x i8] c"abcde\00", align 1
+@.str.3 = private unnamed_addr constant [6 x i8] c"fghij\00", align 1
+@pc = internal constant %struct.pc_t zeroinitializer, align 1
+@s = internal constant %struct.S { i32 1, i32 2, i32 3, i32 4, [9 x i32] [i32 5, i32 6, i32 7, i32 8, i32 9, i32 10, i32 11, i32 12, i32 13] }, align 4
+
+; Function Attrs: mustprogress
+define noundef i32 @f5() {
+entry:
+  %call = tail call noundef i32 @f4(ptr noundef nonnull @pc)
+  ret i32 %call
+}
+
+declare noundef i32 @f4(ptr noundef)
+declare noundef i32 @printf(ptr nocapture noundef readonly, ...)
+
+define noundef i32 @f1() {
+entry:
+  %call = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str)
+  ret i32 %call
+}
+
+
+; Function Attrs: mustprogress nofree nounwind
+define noundef i32 @f2() {
+entry:
+  %call = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.1)
+  ret i32 %call
+}
+
+define noundef i32 @f3() {
+entry:
+  %call = tail call i32 (ptr, ...) @printf(ptr noundef nonnull dereferenceable(1) @.str.2)
+  ret i32 %call
+}
+
+define noundef i32 @f7() {
+entry:
+  %call = tail call noundef i32 @f6(ptr noundef nonnull @s)
+  ret i32 %call
+}
+
+declare noundef i32 @f6(ptr noundef)
+
+; CHECK:             .csect L.._MergedGlobals[RO],2
+; CHECK-NEXT:        .lglobl pc                          # @_MergedGlobals
+; CHECK-NEXT:        .lglobl s
+; CHECK-NEXT:        .align  2
+; CHECK-NEXT:pc:
+; CHECK-NEXT:        .space  1
+; CHECK-NEXT:L...str:
+; CHECK-NEXT:        .string "hello"
+; CHECK-NEXT:L...str.1:
+; CHECK-NEXT:        .string "world"
+; CHECK-NEXT:L...str.2:
+; CHECK-NEXT:        .string "abcde"
+; CHECK-NEXT:L...str.3:
+; CHECK-NEXT:        .string "fghij"
+; CHECK-NEXT:        .space  3
+; CHECK-NEXT:L..constinit:
+; CHECK-NEXT:        .vbyte  4, 0                            # 0x0
+; CHECK-NEXT:        .vbyte  4, 0                            # 0x0
+; CHECK-NEXT:        .vbyte  4, 0                            # 0x0
+; CHECK-NEXT:        .vbyte  4, 2                            # 0x2
+; CHECK-NEXT:        .space  36
+; CHECK-NEXT:s:
+; CHECK-NEXT:        .vbyte  4, 1                            # 0x1
+; CHECK-NEXT:        .vbyte  4, 2                            # 0x2
+; CHECK-NEXT:        .vbyte  4, 3                            # 0x3
+; CHECK-NEXT:        .vbyte  4, 4                            # 0x4
+; CHECK-NEXT:        .vbyte  4, 5                            # 0x5
+; CHECK-NEXT:        .vbyte  4, 6                            # 0x6
+; CHECK-NEXT:        .vbyte  4, 7                            # 0x7
+; CHECK-NEXT:        .vbyte  4, 8                            # 0x8
+; CHECK-NEXT:        .vbyte  4, 9                            # 0x9
+; CHECK-NEXT:        .vbyte  4, 10                           # 0xa
+; CHECK-NEXT:        .vbyte  4, 11                           # 0xb
+; CHECK-NEXT:        .vbyte  4, 12                           # 0xc
+; CHECK-NEXT:        .vbyte  4, 13                           # 0xd
+
+
+; NOMERGE-NOT: L.._MergedGGlobals[RO]

@efriedma-quic
Copy link
Collaborator

This seems like it will increase the number of loads from the TOC... and inserting more TOC loads so you can shrink the TOC doesn't seem like a good tradeoff.

@syzaara
Copy link
Contributor Author

syzaara commented Oct 10, 2024

This seems like it will increase the number of loads from the TOC... and inserting more TOC loads so you can shrink the TOC doesn't seem like a good tradeoff.

The number of loads from TOC is not changed since we still need TOC loads for the constants before merging.

For example:

int callee(char *);

int str1() {
  return callee("hello");
}

int str2() {
  return callee("world\n");
}

The constant strings “hello” and “world\n” are placed in the TOC and without aggressive constant merging, the code we produce is:

.str1:
# %bb.0:                                # %entry
        mflr 0
        stwu 1, -64(1)
        lwz 3, L..C0(2)                         # @.str
        stw 0, 72(1)
        bl .callee[PR]
        nop 
        addi 1, 1, 64
        lwz 0, 8(1)
        mtlr 0
        blr 

.str2:
# %bb.0:                                # %entry
        mflr 0
        stwu 1, -64(1)
        lwz 3, L..C1(2)                         # @.str.1
        stw 0, 72(1)
        bl .callee[PR]
        nop
        addi 1, 1, 64
        lwz 0, 8(1)
        mtlr 0
        blr

We can observe the TOC loads:

lwz 3, L..C0(2)                         # @.str
lwz 3, L..C1(2)                         # @.str.1

.toc
L..C0:  
        .tc L...str[TC],L...str[RO]
L..C1:  
        .tc L...str.1[TC],L...str.1[RO]

.csect L...str[RO],2
.string "hello"                         # @.str
.csect L...str.1[RO],2
.byte   'w,'o,'r,'l,'d,0012,0000        # @.str.1

With aggressive merging of the constants, we can use one TOC entry, and keep the same number of TOC loads:

.str1:
# %bb.0:                                # %entry
        mflr 0
        stwu 1, -64(1)
        lwz 3, L..C0(2)                         # @_MergedGlobals
        stw 0, 72(1)
        addi 3, 3, 8
        bl .callee[PR]
        nop 
        addi 1, 1, 64
        lwz 0, 8(1)
        mtlr 0
        blr 

.str2:
# %bb.0:                                # %entry
        mflr 0
        stwu 1, -64(1)
        lwz 3, L..C0(2)                         # @_MergedGlobals
        stw 0, 72(1)
        addi 3, 3, 14
        bl .callee[PR]
        nop
        addi 1, 1, 64
        lwz 0, 8(1)
        mtlr 0
        blr

Now we see the TOC loads from the MergedGlobals:
lwz 3, L..C0(2) # @_MergedGlobals

.toc
L..C0:
        .tc L.._MergedGlobals[TC],L.._MergedGlobals[RO]

        .csect L.._MergedGlobals[RO],2
L...str:                                # @_MergedGlobals
        .string "hello"
L...str.1:
        .byte   'w,'o,'r,'l,'d,0012,0000
        .extern .callee[PR]

@efriedma-quic
Copy link
Collaborator

If you manage to merge every global in the module into a single variable, the number of loads stays the same, sure. But in general, that isn't possible: because of the offset limitations, you'll end up generating multiple global variables in the general case. And in that case, how you partition the globals into groups to be merged matters.

@syzaara
Copy link
Contributor Author

syzaara commented Oct 10, 2024

If you manage to merge every global in the module into a single variable, the number of loads stays the same, sure. But in general, that isn't possible: because of the offset limitations, you'll end up generating multiple global variables in the general case. And in that case, how you partition the globals into groups to be merged matters.

It seems like regardless of how the partitioning happens, we will end up reducing TOC size while keeping the number of loads the same.
For example:

static const int a[4000];
static const int b[4000];
static const int c[3000];
static const int d[3000];

int foo(int *);

int int1() {
 return foo(a);
}

int int2() {
 return foo(b);
}

int int3() {
 return foo(c);
}

int int4() {
 return foo(d);
}

We create 2 MergedGlobals due to the MaxOffset

GV to merge: @a = internal constant [4000 x i32] zeroinitializer, align 4
GV to merge: @b = internal constant [4000 x i32] zeroinitializer, align 4
GV to merge: @c = internal constant [3000 x i32] zeroinitializer, align 4
GV to merge: @d = internal constant [3000 x i32] zeroinitializer, align 4
 Trying to merge set, starts with #0, total of 4
MergedGV: @_MergedGlobals = private constant <{ [3000 x i32], [3000 x i32] }> zeroinitializer, align 4
MergedGV: @_MergedGlobals.1 = private constant <{ [4000 x i32], [4000 x i32] }> zeroinitializer, align 4

This reduces the TOC size to 2 entries, while keeping the number of loads the same.

@efriedma-quic
Copy link
Collaborator

static const int a[4000];
static const int b[4000];
static const int c[3000];
static const int d[3000];

int foo(int *, int*);

int int1() {
 return foo(a, c);
}

int int2() {
 return foo(b, d);
}

int int3() {
 return foo(a, c);
}

int int4() {
 return foo(b, d);
}

If you put "a" and "c" together, and "b" and "d" together, each function has one TOC load. If you merge a+b and c+d instead, each function has two TOC loads.

@syzaara
Copy link
Contributor Author

syzaara commented Oct 22, 2024

static const int a[4000];
static const int b[4000];
static const int c[3000];
static const int d[3000];

int foo(int *, int*);

int int1() {
 return foo(a, c);
}

int int2() {
 return foo(b, d);
}

int int3() {
 return foo(a, c);
}

int int4() {
 return foo(b, d);
}

If you put "a" and "c" together, and "b" and "d" together, each function has one TOC load. If you merge a+b and c+d instead, each function has two TOC loads.

I see what you mean now. I tried prototyping another approach where we merge the same way as we currently do and then follow that on with merging any left overs. I then compared it with this approach of blindly merging all constants together using SPEC. The results showed that there was negligible impact on performance but the TOC size was ~2% smaller if we merge all constants without looking at use. Since reducing TOC size is beneficial, we would like to move forward with this patch. As a next step, our goal is to use this approach in GlobalMerge to replace the PPCMergeStringPoolPass as this can handle everything that PPCMergeStringPoolPass was meant to handle while using the existing infrastructure. Additionally, we hope to move away from PPCMergeStringPoolPass to GlobalMerge since PPCMergeStringPoolPass has observed some intermittent issues on some of the buildbots in the past.

Copy link
Collaborator

@efriedma-quic efriedma-quic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you've done the analysis, and determined this is good enough for now, then that's fine.

LGTM

@syzaara syzaara merged commit f3131c9 into llvm:main Oct 24, 2024
10 checks passed
@frobtech frobtech mentioned this pull request Oct 25, 2024
NoumanAmir657 pushed a commit to NoumanAmir657/llvm-project that referenced this pull request Nov 4, 2024
…m#111756)

Symbols that get mapped into the read-only section are loaded as part of
the text segment and will always need a TOC entry to be addressable. Add
an option to aggressively merge these read only globals to reduce TOC
usage.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants