perf is a sampling-based analysis tool on Linux. It's kind of a Swiss Army knife tool, but the basic usage just samples PCs periodically and reports CPU usage by function.
For this issue, I'm interested in how perf gets call stacks, which is the -g option to perf record. Currently the default for perf is to do --call-graph=fp, which means use frame pointers to unwind stacks.
Example program:
package main

import (
	"os"
	"runtime/pprof"
)

type T struct{ a, b, c, d, e, f, g, h int }

//go:noinline
func leaf() {
	a = b
}

var a, b T

type U struct{ a, b, c, d, e, f, g, h, i, j, k, l int }

//go:noinline
func duff() {
	c = d
}

var c, d U

//go:noinline
func work() {
	for i := 0; i < 1000000000; i++ {
		leaf()
		duff()
	}
}

//go:noinline
func main() {
	if len(os.Args) >= 2 {
		f, _ := os.Create(os.Args[1])
		defer f.Close()
		pprof.StartCPUProfile(f)
	}
	work()
	if len(os.Args) >= 2 {
		pprof.StopCPUProfile()
	}
}
Example usage:
> go build example.go
> ./example cpu.prof // use Go's pprof
> perf record -g ./example // use perf
> perf report -g
Go's pprof seems to always get call stacks perfectly correct. perf, on the other hand, has some issues. Because perf uses frame pointers, it can sometimes get stack backtraces wrong. In particular, currently it has the following problems:
1. On amd64, if a sample point is in (some parts of) the prolog or epilog, it incorrectly skips the parent frame. It appears as if the grandparent directly called the sampled function.
2. On amd64, if the sample point is in a frameless leaf function, the same thing happens.
Both of these problems relate to the fact that perf uses frame pointers to unwind the stack. In both of the above situations the sampled function has not set up its own frame pointer (or has already torn it down), so the fp register still holds the parent's frame pointer and perf unwinds incorrectly. To get the parent frame, it does pc = *(fp+8); fp = *fp. When fp is from the parent frame, *(fp+8) is the parent's return address, which is a pc in the grandparent. A pc from the parent frame itself is never found: after the current sample point, the next pc is from the grandparent.
It seems that this is not a problem on arm64. I'm not sure exactly why, but it does not suffer from this problem. TODO: how about other architectures? Is this related to link-register vs stack push of the return address?
We have a hack to solve this problem (CL 7728) when the callee is runtime.duffzero or runtime.duffcopy. The caller sets up a dummy frame pointer before calling either of those functions. When perf samples inside those two functions, it correctly finds the parent frame. This hack was added because in perf profiles we see a fair amount of these two functions, and it helps to see the immediate caller (these functions are called from lots of places, unlike a typical frameless leaf function). But for all the other cases in 1 and 2, we are out of luck.
The runtime.duffzero/runtime.duffcopy hack was also ported to arm64, but probably that was not needed. It is also causing problems, see #73748. Probably we should remove it, although I don't yet understand how perf solves this problem on arm64.
So, with all that said, how might we proceed here?
1. perf is not important. Remove the hack above, and just live with the fact that perf backtraces might be missing the parent. Not the end of the world.
2. perf is really important. We should add frame pointer setup and teardown to frameless leaf functions.
3. Do nothing. The duff functions are the only frameless leaf functions that get proper parents.
4. Convince perf to do stack walking without using frame pointers. Modern perf has some other ways of finding stacks, including --call-graph=lbr (last branch record) and --call-graph=dwarf (using DWARF info in a .eh_frame section).
Only 4 would in principle handle the prolog/epilog problem. Just adding frame pointers everywhere would not.
As mentioned above, maybe this only matters for amd64?