Search code, repositories, users, issues, pull requests...

As of Go 1.12, the stack bound check in every function prologue looks like

MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE 0x4018e4

(or some variation thereof). This involves a chain of three dependent instructions, the first two of which are memory loads and the last of which is a conditional branch. I don't have hard data for this, but I suspect this is really bad for the CPU pipeline. The two loads are absolutely going to be in cache, but the CPU still has to wait for the first load before issuing the second, and has to wait for the second before resolving the branch. The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, it's probably somewhat likely the branch predictor cache will fail us here.

Function prologue overhead was also reported to be high in The benefits and costs of writing a POSIX kernel in a high-level language by Cutler et al.

One way we could address this is by putting the stack bound in a dedicated register (leveraging our new ability to change the internal ABI, #27539). This would make the prologue a single register/register compare and a branch. The branch would still probably have poor prediction cache locality, but the register/register comparison would happen so quickly that we would lose very little to a speculation failure. We're already moving toward implementing goroutine preemption using signals, which would make it easy to poison this stack bound register when requesting a preemption.

/cc @dr2chase @randall77 @josharian

added this to the Unplanned milestone

on Apr 1, 2019

mentioned this

on Apr 1, 2019

cmd/compile: improve inlining cost model #17566

randall77

Contributor

I'd really want to see more data to show that this prologue is actually a performance problem. My intuition says two L1 loads and a predictable branch are ~free.

MemberAuthor

I agree about wanting more data. :) This is all just a hunch. The other half of the hunch is that we got only 5% from @dr2chase's register-based calling convention experiments because we still had all of these memory operations being the long pole in the function call sequence.

Contributor

I'd really want to see more data to show that this prologue is actually a performance problem.

Thinking out loud about how to tell...

How about hacking in a change to make all goroutines stack start giant and running with GOGC=off and inlining off? Then in theory there is never any stack growth and never any pre-emption and we make lots of function calls. We could then pick some cpu-bound, function-call-heavy benchmarks and run them as is and with function prologues removed.

Contributor

Here's an attempt at getting an upper bounds on the potential savings.

package p

import (
	"fmt"
	"testing"
)

func count(x uint) {
	if x != 0 {
		count(x - 1)
	}
}

func count2(x uint) {
	if x != 0 {
		count2(x - 1)
	}
}

func BenchmarkCount(b *testing.B) {
	for n := uint(1); n <= 1e6; n *= 100 {
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			count(n) // grow stack
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				count(n)
			}
		})
	}
}

func BenchmarkCount2(b *testing.B) {
	for n := uint(1); n <= 1e6; n *= 100 {
		b.Run(fmt.Sprint(n), func(b *testing.B) {
			count(n) // grow stack
			b.ResetTimer()
			for i := 0; i < b.N; i++ {
				count2(n)
			}
		})
	}
}

Now hack the toolchain to omit the stack check prologue for functions named count2.

diff --git a/src/cmd/internal/obj/x86/obj6.go b/src/cmd/internal/obj/x86/obj6.go
index eb0e88b494..422fa6588f 100644
--- a/src/cmd/internal/obj/x86/obj6.go
+++ b/src/cmd/internal/obj/x86/obj6.go
@@ -675,12 +675,12 @@ func preprocess(ctxt *obj.Link, cursym *obj.LSym, newprog obj.ProgAlloc) {
                }
        }
 
-       if !p.From.Sym.NoSplit() || p.From.Sym.Wrapper() {
+       if (!p.From.Sym.NoSplit() || p.From.Sym.Wrapper()) && cursym.Name != `"".count2` {
                p = obj.Appendp(p, newprog)
                p = load_g_cx(ctxt, p, newprog) // load g into CX
        }
 
-       if !cursym.Func.Text.From.Sym.NoSplit() {
+       if !cursym.Func.Text.From.Sym.NoSplit() && cursym.Name != `"".count2` {
                p = stacksplit(ctxt, cursym, p, newprog, autoffset, int32(textarg)) // emit split check
        }

Check with go tool compile -S (sans FUNCDATA, PCDATA, NOP, etc):

"".count STEXT size=70 args=0x8 locals=0x10
	0x0000 00000 (count_test.go:8)	TEXT	"".count(SB), ABIInternal, $16-8
	0x0000 00000 (count_test.go:8)	MOVQ	(TLS), CX
	0x0009 00009 (count_test.go:8)	CMPQ	SP, 16(CX)
	0x000d 00013 (count_test.go:8)	JLS	63
	0x000f 00015 (count_test.go:8)	SUBQ	$16, SP
	0x0013 00019 (count_test.go:8)	MOVQ	BP, 8(SP)
	0x0018 00024 (count_test.go:8)	LEAQ	8(SP), BP
	0x001d 00029 (count_test.go:9)	MOVQ	"".x+24(SP), AX
	0x0022 00034 (count_test.go:9)	TESTQ	AX, AX
	0x0025 00037 (count_test.go:9)	JNE	49
	0x0027 00039 (<unknown line number>)	MOVQ	8(SP), BP
	0x002c 00044 (<unknown line number>)	ADDQ	$16, SP
	0x0030 00048 (<unknown line number>)	RET
	0x0031 00049 (count_test.go:10)	DECQ	AX
	0x0034 00052 (count_test.go:10)	MOVQ	AX, (SP)
	0x0038 00056 (count_test.go:10)	CALL	"".count(SB)
	0x003d 00061 (count_test.go:10)	JMP	39
	0x003f 00063 (count_test.go:8)	CALL	runtime.morestack_noctxt(SB)
	0x0044 00068 (count_test.go:8)	JMP	0

"".count2 STEXT size=48 args=0x8 locals=0x10
	0x0000 00000 (count_test.go:14)	TEXT	"".count2(SB), ABIInternal, $16-8
	0x0000 00000 (count_test.go:14)	SUBQ	$16, SP
	0x0004 00004 (count_test.go:14)	MOVQ	BP, 8(SP)
	0x0009 00009 (count_test.go:14)	LEAQ	8(SP), BP
	0x000e 00014 (count_test.go:15)	MOVQ	"".x+24(SP), AX
	0x0013 00019 (count_test.go:15)	TESTQ	AX, AX
	0x0016 00022 (count_test.go:15)	JNE	34
	0x0018 00024 (<unknown line number>)	MOVQ	8(SP), BP
	0x001d 00029 (<unknown line number>)	ADDQ	$16, SP
	0x0021 00033 (<unknown line number>)	RET
	0x0022 00034 (count_test.go:16)	DECQ	AX
	0x0025 00037 (count_test.go:16)	MOVQ	AX, (SP)
	0x0029 00041 (count_test.go:16)	CALL	"".count2(SB)
	0x002e 00046 (count_test.go:16)	JMP	24

Results:

name              time/op
Count/1-8         4.49ns ± 2%
Count/100-8        269ns ± 1%
Count/10000-8     25.3µs ± 1%
Count/1000000-8   4.23ms ± 2%
Count2/1-8        3.51ns ± 2%
Count2/100-8       266ns ± 2%
Count2/10000-8    24.9µs ± 1%
Count2/1000000-8  4.07ms ±11%

I'm not entirely sure how to interpret this: Is the /1 number the meaningful one or the mid-range steady state ones? The former is pretty significant, but the latter is just 2-4% despite the prologue being a substantive percentage of the instructions being executed.

The Count2/1000000 benchmark has very high variance. Looking at the raw data, it is bimodal, starting stable at around 4.2-4.3ms for a period and then dropping and re-stabilizing at around 3.7ms, so there is some other effect occurring. I'm thus inclined to disregard that data point entirely.

MemberAuthor

I think the upper-bound for the compiler as a benchmark is around 3.5%. I used PEBS profiling to get a precisely-attributed CPU cycles profile for the compiler compiling cmd/compile/internal/ssa and wrote a tool to use the DWARF prologue markers to figure out which samples fell in the stack check prologue. Since it was easy to add, I also measured the bytes of text in the stack check prologue, which worked out to 2.0% of the text bytes.

I collected the profile using:

cat >/tmp/perfwrap <<EOF
#!/bin/sh
perf record -e cycles:ppp -o /tmp/perf.data.\$\$ -- "\$@"
EOF
chmod a+x /tmp/perfwrap

cd cmd/compile/internal/ssa
# Make sure dependencies are installed
go build -i
go build -gcflags "-o /tmp/_bench.o" -toolexec /tmp/perfwrap -x

And then ran prologuer (which I just wrote):

1916 of 54450 samples (3.52%) in prologue
202021 of 10097708 bytes (2.00%) in prologue

Contributor

It is promising that we got similar bounds (2-4%, 3.5%) via very different means. A few percent is nothing to sniff at, but neither is the loss of a register.

I wonder whether we could combine this with #20005 and pack wb.enabled and stack information all into a single register.

The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, it's probably somewhat likely the branch predictor cache will fail us here.

I forgot earlier: I wanted to echo that sentiment, and add to it. When @navytux and I investigated #18977, we concluded (perhaps erroneously) that branch predictor interference was a non-trivial source of performance changes. Having a branch at the beginning of every function call could have non-local negative performance effects by disrupting branch predictions elsewhere. Of course, that's not relevant here, since there is no plan to eliminate the branch, just the loads.

mentioned this

on Apr 2, 2019

cmd/compile: add line numbers for function exit paths? #31199

MemberAuthor

A few percent is nothing to sniff at, but neither is the loss of a register.

We've been talking about putting g back into a register on amd64 (like on most other architectures). I think this would supplant that.

I wonder whether we could combine this with #20005 and pack wb.enabled and stack information all into a single register.

Ooh, that's interesting. It would be easy enough to steal the bottom bit without affecting the stack bound check. And we have to interrupt everything to enable the write barrier, so it would be easy to set that bit when needed.

added

added

Contributor

#29067 that I opened a while back is related, although it's more about code layout

Member

I wonder if we could [ab]use the fact that stacks are at least 2KB-aligned and have something like:

RCX = RSP
RSP -= 0x40
RCX ^= RSP
RCX >>= 11
JNZ slowpath

That's 1 instruction more (assuming we need to allocate the stack frame anyway), but no memory loads and no registers taken. However, I am not sure we can actually arrange things for such check because the size we want to check is not necessary the stack frame size.
Or it something like the following better?

RCX = RSP
RCX &= 0xfffffffffffff800
RCX -= 0x50
JB slowpath

It also unties frame allocation from the check, so they can use different sizes.

Member

Rumors say that using flags from shift operations has higher latency and can cause stalls. So we might need to replace RCX >>= 11 with AND with mask or something else. Or maybe that's another argument in favor of the second snippet.

Member

Do we know what's the performance/object size cost of stealing a register? We could restrict compiler from using 1 register and see what happens.

dr2chase

Contributor

How urgent is your question? I think I could run that experiment.