Closed
Description
What did you do?
- I wanted to make sure I was creating accurate benchmarks.
- I found and read Dave Cheney's 2013 blog post on how to write benchmarks in Go. In the "A note on compiler optimisations" section he mentions that it is best practice to assign results to local and package-level variables so the compiler cannot optimize the benchmarked call away.
- I went to https://golang.org/pkg/testing/#hdr-Benchmarks
What did you expect to see?
I expected to see documentation on how to correctly write benchmarks that avoid compiler optimizations and examples that reflect best practices.
If the techniques described in Dave's blog post are no longer necessary, I expected to see explicit documentation to that effect.
What did you see instead?
Neither of those things.
Activity
meirf commented on Aug 31, 2018
@davecheney,
Is that kind of optimization avoidance still recommended?
If so, do you think that info should be put in https://golang.org/pkg/testing/#hdr-Benchmarks? I don't see an official benchmark wiki, so it seems useful to give a short explanation in the testing doc.
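For reference, the pattern from Dave's post looks roughly like this (a sketch rather than a verbatim quote; `Fib` and `result` follow the post's naming):

```go
package bench

import "testing"

// Fib is a deliberately recursive Fibonacci, standing in for the code under test.
func Fib(n int) int {
	if n < 2 {
		return n
	}
	return Fib(n-1) + Fib(n-2)
}

// result is the package-level sink from the post.
var result int

func BenchmarkFib(b *testing.B) {
	var r int
	for n := 0; n < b.N; n++ {
		// Record the result in a local every iteration...
		r = Fib(10)
	}
	// ...and copy it to the package-level variable so the compiler
	// cannot prove the calls are dead.
	result = r
}
```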
as commented on Sep 1, 2018
Couldn't reproduce benchmark code elision in currently-supported Go versions.
josharian commented on Sep 1, 2018
@as I believe it can happen now with inlined calls. There is discussion of purity analysis, which might impact non-inlined pure calls later.
gbbr commented on Oct 25, 2019
How come no action has been taken here? This does seem worth documenting. Does it warrant a doc update?
randall77 commented on Oct 25, 2019
@gbbr I think adding some documentation around this would be fine. No one has gotten to it yet. Want to send a CL?
nicksnyder commented on Oct 25, 2019
Based on the conversation in this thread, it is still unclear to me what the documentation should say. Does Dave's 2013 blog post reflect today's best practices?
gbbr commented on Oct 25, 2019
I am not able to come up with an example that illustrates the problem, so I'm not sure this really is a problem today. The example here is wrong, as is #14813, because it uses the loop's index as an argument to the function call. The other example in Dave's post here also does not show any noticeable difference between the two approaches.
cespare commented on Oct 25, 2019
Here's an example that demonstrates the problem. You have to be careful to ensure that the function is inlined but also that the result cannot be computed at compile time.
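(The original snippet and its timings aren't reproduced here; the pattern it describes looks roughly like this sketch, where `square` and `input` are made-up names:)

```go
package bench

import (
	"os"
	"testing"
)

// square is small enough for the compiler to inline.
func square(x int) int { return x * x }

// input is computed at run time, so square(input) cannot be folded at compile time.
var input = len(os.Args) + 37

func BenchmarkSquareDiscarded(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// The result is discarded: once square is inlined, the multiplication
		// is dead code and the loop body can be eliminated entirely.
		square(input)
	}
}
```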
Here's what I get using gc (tip) and gccgo (8.3.0):
cespare commented on Oct 25, 2019
I started one of the other discussions about this a few years ago on golang-dev. I think the situation is still quite unfortunate. To the best of my knowledge, the situation is:
Given (1) and (2), I think a lot of the current "sink" approaches are not good since they make the code uglier and they are hard to justify (they protect the benchmark against some, but not all, hypothetical future optimizations).
More recently some people have suggested that using runtime.KeepAlive to mark values that you want to always be computed in the benchmark is the best approach (such as @randall77's comment here). That seems better, but is still not completely ideal in two ways:
Some comments in these discussions have seemed to suggest that writing microbenchmarks is an advanced task for experienced programmers which always requires some knowledge about the compiler and system underneath, and therefore there's nothing to be done here. I don't really buy that, though: I think that even beginners can reasonably want to write benchmarks which measure how long their function f takes to run and compare different approaches to writing f, and we should make it as easy as possible to avoid any future pitfalls where the obvious benchmark code becomes broken.
I advocate that we decide on a single best approach here (which seems to be using runtime.KeepAlive or else a new helper in the testing package), document it thoroughly, and adopt it as widely as possible.
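A minimal sketch of the runtime.KeepAlive approach mentioned above, with `work` and `input` as hypothetical stand-ins for the code under test:

```go
package bench

import (
	"os"
	"runtime"
	"testing"
)

// work stands in for the function being benchmarked.
func work(x int) int { return x * x }

// input is computed at run time, so it is not a compile-time constant.
var input = len(os.Args) + 37

func BenchmarkWork(b *testing.B) {
	var r int
	for i := 0; i < b.N; i++ {
		r = work(input)
	}
	// Mark the last result as live so the compiler cannot treat the calls as dead code.
	runtime.KeepAlive(r)
}
```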
bcmills commented on Oct 4, 2021
(I filed proposal #48768 for the API from the above comment.)
eliben commented on Jun 22, 2023
I ran into this issue in a particularly insidious way recently, and want to
share the scenario, because it's somewhat different from what I've seen before.
IMHO this highlights the seriousness of this issue and the need to at least
document it properly.
My goal was to benchmark a function akin to this countCond (this example is artificial but it's very similar to the real one):
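(The real countCond lives in the gist linked below; a hypothetical stand-in with the same shape, counting slice elements that satisfy a condition, is enough to follow the rest of this comment:)

```go
// Hypothetical stand-in for countCond (the real code is in the linked gist):
// count the elements of a slice that satisfy a simple condition.
func countCond(input []byte) int {
	n := 0
	for _, c := range input {
		if c%3 == 0 {
			n++
		}
	}
	return n
}
```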
So I wrote a benchmark function:
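(Roughly the shape of that benchmark, not the exact code from the gist; the b.ResetTimer call is added here so the setup isn't measured. Note that the result of countCond is discarded, which is what lets the compiler throw the work away after inlining.)

```go
func BenchmarkCountCond(b *testing.B) {
	input := getInputContents() // build the test input once
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		countCond(input) // result unused
	}
}
```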
getInputContents creates a large testing slice input (full code is in https://gist.github.com/eliben/3ad700c3589d814eb87dfef704083abe).
I've been burned by compilers optimizing away benchmarking code many times before and am always on the lookout for some signs:
- a time per op that seems implausibly fast (sub-nanosecond for a loop-y operation, for example)
- a time per op that doesn't scale with input size (e.g. in a loop like countCond I would expect the time to grow Nx if I increase the input slice length Nx).
Neither of these happened here:
This is for input size of 400k. 100us/op doesn't sound very outlandish.
Growing the input size to 800k I saw:
Roughly 2x growth in time per op - makes sense.
But in fact, the compiler here inlined everything into the benchmark function, then realized the actual result isn't used anywhere and optimized the loop body of countCond away entirely, but left the loop itself in place. The loop just loops from 0 to len(input) with an empty body. This explains the remaining linear growth w.r.t. input size.
Mitigating this by having a sink or KeepAlive fixes the issue and I see the "real" benchmark numbers:
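(Applied to the sketch above, the KeepAlive variant of that fix looks roughly like this; it assumes the "runtime" import and the same countCond/getInputContents helpers:)

```go
func BenchmarkCountCondFixed(b *testing.B) {
	input := getInputContents()
	b.ResetTimer()
	var n int
	for i := 0; i < b.N; i++ {
		n = countCond(input)
	}
	// Keep the last result live so the loop body cannot be optimized away.
	runtime.KeepAlive(n)
}
```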
What I find particularly problematic about this scenario is that the partial
optimization performed by the compiler thwarted my "bad benchmark" mental smoke
test and misled me for quite a while. So there appears to be no guarantee that
when compiler optimizations confuse benchmarks, it's at least "obvious" by
just eyeballing the results.
gopherbot commented on Jun 22, 2023
Change https://go.dev/cl/505235 mentions this issue:
testing: improve benchmarking example
randall77 commented on Jun 23, 2023
I was thinking about this some more, along the lines of what Brian proposed, with ergonomic improvements.
What if BenchmarkFoo could return a func() ... which is the thing that is to be benchmarked? The testing package calls BenchmarkFoo once, then does the b.N loop and calls whatever BenchmarkFoo returns in that loop.
The "sink" in this case is the return value of the closure that BenchmarkFoo returns. We could allow any type(s) here; testing would just need to call it with reflect (for which we would probably want #49340 to avoid allocations by reflect).
Any test setup could be done in BenchmarkFoo before returning the closure. (There is no place for test teardown, but that kind of benchmark is rare.)
This is backwards-compatible because Benchmark* functions can't currently return anything. Benchmarks with no return values operate as before.
Possibly we could pass i as an argument to the closure? Usually it could be reconstructed by the benchmark using a closure variable, but it may be convenient in some cases.
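A purely hypothetical sketch of what a benchmark might look like under this proposal; none of this is an existing testing API, and makeTestInput and foo are made-up names:

```go
// BenchmarkFoo would run once for setup and return the closure that the
// testing package calls b.N times; the closure's return value acts as the sink.
func BenchmarkFoo(b *testing.B) func() int {
	data := makeTestInput() // hypothetical setup, run once
	return func() int {
		return foo(data) // hypothetical function under test, called b.N times
	}
}
```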
bcmills commented on Jun 23, 2023
@randall77, I like that direction but I think #48768 is a little clearer w.r.t. interactions with existing testing.B methods.
With the “return a closure” approach, presumably calling SetParallelism would cause the function to be invoked in parallel on the indicated number of goroutines? But it's not clear to me how you would indicate “run with the parallelism specified by the -cpu flag” (analogous to calling b.RunParallel).
I think we would also need to be careful to specify that Cleanup callbacks are executed only after the last call to the function has returned.
markdryan commented on Oct 12, 2023
I've just stumbled across this issue with utf16_test.BenchmarkDecodeRune. After getting weird results from this benchmark when testing a RISC-V patch, I took a look at the disassembly and noticed that the benchmark was neither calling nor inlining DecodeRune. DecodeRune seemed to have been completely optimised away. Assigning the return value of DecodeRune to a local variable and calling runtime.KeepAlive on that local at the end of the function fixed the issue.
Without the fix, I see
and with
This isn't a RISC-V specific problem. Disassembling an amd64 build of the benchmark shows that the DecodeRune code has been optimised away on those builds as well.
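A sketch of the shape of that fix, not the actual change to utf16_test; the surrogate-pair inputs here are made up for illustration:

```go
package bench

import (
	"runtime"
	"testing"
	"unicode/utf16"
)

func BenchmarkDecodeRune(b *testing.B) {
	var r rune
	for i := 0; i < b.N; i++ {
		// 0xD834 paired with any low surrogate in [0xDC00, 0xE000) is valid.
		r = utf16.DecodeRune(0xD834, rune(0xDC00+i%0x400))
	}
	// Keep the last decoded rune live so DecodeRune cannot be optimised away.
	runtime.KeepAlive(r)
}
```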
seankhliao commented on Dec 2, 2024
Should we fold this into Keep #61179 or Loop #61515?
seankhliao commented on Mar 22, 2025
Done with B.Loop https://pkg.go.dev/testing#B.Loop
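For example, the countCond benchmark from earlier in this thread can now be written with B.Loop (countCond and getInputContents are the names from @eliben's comment above). Per the B.Loop documentation, setup before the loop is not timed and calls inside the loop body are never optimized away, so no manual sink or runtime.KeepAlive is needed:

```go
func BenchmarkCountCond(b *testing.B) {
	input := getInputContents() // setup outside the loop is not timed
	for b.Loop() {
		countCond(input) // B.Loop keeps this call and its result live
	}
}
```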