Description
As of Go 1.12, the stack bound check in every function prologue looks like
MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE 0x4018e4
(or some variation thereof). This involves a chain of three dependent instructions, the first two of which are memory loads and the last of which is a conditional branch. I don't have hard data for this, but I suspect this is really bad for the CPU pipeline. The two loads are absolutely going to be in cache, but the CPU still has to wait for the first load before issuing the second, and has to wait for the second before resolving the branch. The branch is highly predictable and can probably be speculated over, but since almost every single function has such a branch, it's probably somewhat likely the branch predictor cache will fail us here.
Function prologue overhead was also reported to be high in The benefits and costs of writing a POSIX kernel in a high-level language by Cutler et al.
One way we could address this is by putting the stack bound in a dedicated register (leveraging our new ability to change the internal ABI, #27539). This would make the prologue a single register/register compare and a branch. The branch would still probably have poor prediction cache locality, but the register/register comparison would happen so quickly that we would lose very little to a speculation failure. We're already moving toward implementing goroutine preemption using signals, which would make it easy to poison this stack bound register when requesting a preemption.
Activity
randall77 commentedon Apr 1, 2019
I'd really want to see more data to show that this prologue is actually a performance problem. My intuition says two L1 loads and a predictable branch are ~free.
aclements commentedon Apr 1, 2019
I agree about wanting more data. :) This is all just a hunch. The other half of the hunch is that we got only 5% from @dr2chase's register-based calling convention experiments because we still had all of these memory operations being the long pole in the function call sequence.
josharian commentedon Apr 1, 2019
Thinking out loud about how to tell...
How about hacking in a change to make all goroutines stack start giant and running with GOGC=off and inlining off? Then in theory there is never any stack growth and never any pre-emption and we make lots of function calls. We could then pick some cpu-bound, function-call-heavy benchmarks and run them as is and with function prologues removed.
josharian commentedon Apr 2, 2019
Here's an attempt at getting an upper bounds on the potential savings.
Now hack the toolchain to omit the stack check prologue for functions named
count2
.Check with
go tool compile -S
(sans FUNCDATA, PCDATA, NOP, etc):Results:
I'm not entirely sure how to interpret this: Is the
/1
number the meaningful one or the mid-range steady state ones? The former is pretty significant, but the latter is just 2-4% despite the prologue being a substantive percentage of the instructions being executed.The Count2/1000000 benchmark has very high variance. Looking at the raw data, it is bimodal, starting stable at around 4.2-4.3ms for a period and then dropping and re-stabilizing at around 3.7ms, so there is some other effect occurring. I'm thus inclined to disregard that data point entirely.
aclements commentedon Apr 2, 2019
I think the upper-bound for the compiler as a benchmark is around 3.5%. I used PEBS profiling to get a precisely-attributed CPU cycles profile for the compiler compiling cmd/compile/internal/ssa and wrote a tool to use the DWARF prologue markers to figure out which samples fell in the stack check prologue. Since it was easy to add, I also measured the bytes of text in the stack check prologue, which worked out to 2.0% of the text bytes.
I collected the profile using:
And then ran prologuer (which I just wrote):
1916 of 54450 samples (3.52%) in prologue
202021 of 10097708 bytes (2.00%) in prologue
josharian commentedon Apr 2, 2019
It is promising that we got similar bounds (2-4%, 3.5%) via very different means. A few percent is nothing to sniff at, but neither is the loss of a register.
I wonder whether we could combine this with #20005 and pack wb.enabled and stack information all into a single register.
I forgot earlier: I wanted to echo that sentiment, and add to it. When @navytux and I investigated #18977, we concluded (perhaps erroneously) that branch predictor interference was a non-trivial source of performance changes. Having a branch at the beginning of every function call could have non-local negative performance effects by disrupting branch predictions elsewhere. Of course, that's not relevant here, since there is no plan to eliminate the branch, just the loads.
aclements commentedon Apr 2, 2019
We've been talking about putting g back into a register on amd64 (like on most other architectures). I think this would supplant that.
Ooh, that's interesting. It would be easy enough to steal the bottom bit without affecting the stack bound check. And we have to interrupt everything to enable the write barrier, so it would be easy to set that bit when needed.
CAFxX commentedon Apr 16, 2019
#29067 that I opened a while back is related, although it's more about code layout
dvyukov commentedon Jun 27, 2019
I wonder if we could [ab]use the fact that stacks are at least 2KB-aligned and have something like:
That's 1 instruction more (assuming we need to allocate the stack frame anyway), but no memory loads and no registers taken. However, I am not sure we can actually arrange things for such check because the size we want to check is not necessary the stack frame size.
Or it something like the following better?
It also unties frame allocation from the check, so they can use different sizes.
dvyukov commentedon Jun 28, 2019
Rumors say that using flags from shift operations has higher latency and can cause stalls. So we might need to replace
RCX >>= 11
with AND with mask or something else. Or maybe that's another argument in favor of the second snippet.dvyukov commentedon Jun 28, 2019
Do we know what's the performance/object size cost of stealing a register? We could restrict compiler from using 1 register and see what happens.
dr2chase commentedon Jun 28, 2019
How urgent is your question? I think I could run that experiment.
dvyukov commentedon Jun 29, 2019
Not urgent at all. But it would be useful to have that number before we do any decisions on this issue. I don't know if anybody have near future plans to attack this issue.