Description
Go version
go version go1.22.4 linux/amd64
Output of go env in your module/workspace:
GO111MODULE='on'
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/user/.cache/go-build'
GOENV='/home/user/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/user/go-repos/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/user/go-repos:/opt/go/path:/home/user/go-code'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/user/go/src/github.com/golang/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/user/go/src/github.com/golang/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/user/go/src/github.com/uber-go/zap/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1740765925=/tmp/go-build -gno-record-gcc-switches'
What did you do?
We have been doing some performance testing of Go tip at Uber in preparation for Go 1.23.
What did you see happen?
We have noticed a degradation of around 8% on Linux machines with many cores (96) across all of Zap’s Field logging benchmarks. These benchmarks look something like this:
logger := New(
    zapcore.NewCore(
        zapcore.NewJSONEncoder(NewProductionConfig().EncoderConfig),
        &ztest.Discarder{}, // No actual I/O; logs get discarded.
        DebugLevel,
    ),
)
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
    for pb.Next() {
        logger.Info("Boolean.", Bool("foo", true))
    }
})
We don’t have an isolated Linux environment available to us, so these results are susceptible to a slight noisy-neighbor problem, but we have consistently seen some amount of degradation in these benchmarks:
$ go version
go version go1.22.4 linux/amd64
$ go test -bench Field -run nounittests -count 25 . | tee go1.22.4.log
$ ~/go/src/github.com/golang/go4/bin/go version
go version devel go1.23-93bbf719a6 Wed Jun 5 17:30:16 2024 +0000 linux/amd64
$ ~/go/src/github.com/golang/go4/bin/go test -bench Field -run nounittests -count 25 . | tee 93bbf719a6.log
$ benchstat go1.22.4.log 93bbf719a6.log
goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: AMD EPYC 7B13
│ go1.22.4.log │ 93bbf719a6.log │
│ sec/op │ sec/op vs base │
BoolField-96 110.6n ± 6% 127.9n ± 8% +15.64% (p=0.001 n=25)
ByteStringField-96 139.7n ± 1% 149.9n ± 2% +7.30% (p=0.000 n=25)
Float64Field-96 112.2n ± 4% 125.2n ± 4% +11.59% (p=0.000 n=25)
IntField-96 108.7n ± 3% 116.2n ± 2% +6.90% (p=0.000 n=25)
Int64Field-96 105.9n ± 4% 113.2n ± 2% +6.89% (p=0.009 n=25)
StringField-96 104.4n ± 2% 115.4n ± 4% +10.54% (p=0.000 n=25)
StringerField-96 105.4n ± 3% 115.5n ± 4% +9.58% (p=0.000 n=25)
TimeField-96 109.6n ± 2% 117.4n ± 2% +7.12% (p=0.000 n=25)
DurationField-96 111.6n ± 3% 121.9n ± 3% +9.23% (p=0.000 n=25)
ErrorField-96 108.4n ± 2% 115.7n ± 4% +6.73% (p=0.000 n=25)
ErrorsField-96 184.1n ± 2% 205.1n ± 4% +11.41% (p=0.000 n=25)
StackField-96 713.0n ± 3% 813.3n ± 3% +14.07% (p=0.000 n=25)
ObjectField-96 117.2n ± 2% 130.9n ± 3% +11.69% (p=0.000 n=25)
ReflectField-96 317.6n ± 2% 346.0n ± 3% +8.94% (p=0.000 n=25)
10Fields-96 584.7n ± 2% 622.4n ± 4% +6.45% (p=0.000 n=25)
100Fields-96 5.919µ ± 3% 5.630µ ± 5% ~ (p=0.073 n=25)
geomean 196.5n 213.4n +8.61%
We fiddled with GOMAXPROCS a bit and noticed the degradation is definitely related to parallelism.
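For reference, a stripped-down sketch of this kind of reduced-parallelism run (not our exact harness: the package, benchmark name, and GOMAXPROCS value below are arbitrary, and io.Discard stands in for zap's internal ztest.Discarder; a go test -cpu sweep achieves the same thing from the command line):

    package zapbench // hypothetical package for illustration

    import (
        "io"
        "runtime"
        "testing"

        "go.uber.org/zap"
        "go.uber.org/zap/zapcore"
    )

    func BenchmarkBoolFieldReducedParallelism(b *testing.B) {
        prev := runtime.GOMAXPROCS(16) // arbitrary reduced core count
        defer runtime.GOMAXPROCS(prev)

        logger := zap.New(
            zapcore.NewCore(
                zapcore.NewJSONEncoder(zap.NewProductionConfig().EncoderConfig),
                zapcore.AddSync(io.Discard), // no actual I/O, like ztest.Discarder
                zapcore.DebugLevel,
            ),
        )
        b.ResetTimer()
        // RunParallel runs GOMAXPROCS goroutines by default, so lowering
        // GOMAXPROCS directly reduces contention in the logging loop.
        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                logger.Info("Boolean.", zap.Bool("foo", true))
            }
        })
    }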
We didn’t see a whole lot in CPU profiles other than a general increase of about 2-4% of samples taken in the runtime package.
We were able to use git bisect to identify e995aa95cb5f379c1df5d5511ee09970261d877f as one cause. Specifically, the added calls to nanotime() seem to cause degradation in these highly parallelized benchmarks. However, this commit alone does not seem to account for the entire degradation:
$ ~/go/src/github.com/golang/go3/bin/go version
go version devel go1.23-e995aa95cb Mon Apr 8 21:43:16 2024 +0000 linux/amd64
$ ~/go/src/github.com/golang/go3/bin/go test -bench Field -run nounittests -count 25 . | tee e995aa95cb.log
$ benchstat go1.22.4.log e995aa95cb.log
goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: AMD EPYC 7B13
│ go1.22.4.log │ e995aa95cb.log │
│ sec/op │ sec/op vs base │
BoolField-96 110.6n ± 6% 121.1n ± 6% +9.49% (p=0.004 n=25)
ByteStringField-96 139.7n ± 1% 145.9n ± 2% +4.44% (p=0.002 n=25)
Float64Field-96 112.2n ± 4% 121.1n ± 1% +7.93% (p=0.000 n=25)
IntField-96 108.7n ± 3% 112.5n ± 2% +3.50% (p=0.009 n=25)
Int64Field-96 105.9n ± 4% 111.4n ± 3% ~ (p=0.200 n=25)
StringField-96 104.4n ± 2% 111.5n ± 2% +6.80% (p=0.000 n=25)
StringerField-96 105.4n ± 3% 113.4n ± 3% +7.59% (p=0.000 n=25)
TimeField-96 109.6n ± 2% 117.6n ± 2% +7.30% (p=0.000 n=25)
DurationField-96 111.6n ± 3% 116.8n ± 2% +4.66% (p=0.000 n=25)
ErrorField-96 108.4n ± 2% 113.7n ± 2% +4.89% (p=0.002 n=25)
ErrorsField-96 184.1n ± 2% 201.7n ± 4% +9.56% (p=0.000 n=25)
StackField-96 713.0n ± 3% 770.9n ± 2% +8.12% (p=0.000 n=25)
ObjectField-96 117.2n ± 2% 127.2n ± 3% +8.53% (p=0.000 n=25)
ReflectField-96 317.6n ± 2% 349.4n ± 5% +10.01% (p=0.000 n=25)
10Fields-96 584.7n ± 2% 620.5n ± 5% +6.12% (p=0.005 n=25)
100Fields-96 5.919µ ± 3% 6.046µ ± 25% ~ (p=0.064 n=25)
geomean 196.5n 209.5n +6.62%
We weren’t able to reliably identify any additional commits beyond this one that accounted for more of the degradation.
Note: this is not a duplicate of #67857, but rather an investigation of different Zap benchmark degradations.
What did you expect to see?
No practical degradation.
ianlancetaylor commented on Jun 7, 2024
CC @golang/runtime
mknyszek commented on Jun 7, 2024
These logging benchmarks appear to involve incredibly tight loops, which leads me to believe that it's possibly due to unfortunate microarchitectural effects -- see my comment on #67857.
And my next question would be: does this in any way correspond to production-level regressions? Unless the world is getting stopped a lot, I would not expect 15*GOMAXPROCS cpu-ns of CPU time to have any meaningful effect on performance in any real program (referencing your bisection result). I could see it having an effect in a microbenchmark that stops the world a lot, though.
One thing I might suggest trying is setting GOMEMLIMIT=<something big> and GOGC=off in your benchmarks to try and eliminate or minimize the GC cost of your benchmark (allocation cost will still be measured, though, which I assume you care about). The reason I say this is because microbenchmarks can end up being a torture test for the GC in a way that makes the benchmark less useful. In other words, the costs dominating your benchmark do not reflect the actual costs of your code in real-world contexts.
For example, a benchmark that allocates in a loop, immediately drops the memory, and has a teeny tiny live heap (a lot like most microbenchmarks!) is going to see a very, very high number of GC cycles, because it is going to sit at the minimum total heap size of 4 MiB. This means tons of STWs and CPU time spent on mark work that does not translate at all to real production systems with sizable live heaps. The tricky part with the GC is that its frequency (and thus its cost) is proportional to your live heap via GOGC, so considering the GC as "part of the benchmark" is really hard to do in a useful way. It's much easier if the benchmark measures the performance of some larger end-to-end system.
I'm curious to know if setting GOMEMLIMIT to something big and GOGC=off makes the problem go away completely. If it does, I am less concerned about these regressions. If it doesn't, then I think this becomes much more important to fix. Also, if this regression shows up in more end-to-end benchmarks, or in production, that's much more cause for alarm.
I'll also add: are you able to compare profiles before and after (pprof -diff_base=old.prof new.prof)? Even an 8% regression should stick out like a sore thumb.
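A minimal sketch of this suggestion applied from inside a benchmark rather than via environment variables: runtime/debug's SetGCPercent and SetMemoryLimit are the in-process equivalents of GOGC and GOMEMLIMIT. The benchmark name and its allocate-and-drop body below are hypothetical stand-ins for the zap Field benchmarks, not code from this thread.

    package zapbench // hypothetical package for illustration

    import (
        "math"
        "runtime/debug"
        "testing"
    )

    var sink []byte // package-level sink so the allocation is not optimized away

    func BenchmarkAllocNoGC(b *testing.B) {
        prevGC := debug.SetGCPercent(-1)                 // equivalent of GOGC=off
        prevLimit := debug.SetMemoryLimit(math.MaxInt64) // effectively no memory limit
        defer func() {
            debug.SetMemoryLimit(prevLimit)
            debug.SetGCPercent(prevGC)
        }()

        b.RunParallel(func(pb *testing.PB) {
            for pb.Next() {
                // Allocate and immediately drop memory, like the Field benchmarks do.
                // With the GC disabled, this garbage accumulates for the run,
                // which is the point: GC cost drops out of the measurement.
                sink = make([]byte, 128)
            }
        })
    }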
JacobOaks commented on Jun 7, 2024
Hey @mknyszek - thanks for the quick response & ideas!
Your explanation w.r.t. microbenchmarks being a GC torture test makes sense - these tests allocate memory that quickly becomes dead, which leads to constant allocation against a small heap target, causing GC to be triggered often. I suppose this would exacerbate any GC performance degradations that are otherwise minuscule in larger benchmarks/applications.
It's unfortunately not realistic for us to actually test Go tip in production - but we may be able to do this once an rc version is tagged.
I did turn off GC and re-ran these benchmarks, and as you predicted, this does seem to result in no degradation between the versions. Like I mentioned before, we had a hard time discerning any differences in profiles between the two versions, but I tried to make the issue worse by forcing more GC with GOGC=50 in hopes that differences between the profiles would surface better. With that, I did find that gcBgMarkWorker saw about a 3% increase in CPU time, which seems in line with your explanation. There weren't any other significant differences that I could tell (happy to share the profiles if it helps).
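One way to produce comparable profiles is go test -bench Field -cpuprofile cpu.prof under each toolchain, then go tool pprof -diff_base=old.prof new.prof. For a standalone reproducer, a minimal runtime/pprof sketch might look like the following; the output file name and the workload function are hypothetical placeholders, not code from this thread.

    package main // hypothetical standalone reproducer

    import (
        "log"
        "os"
        "runtime/pprof"
    )

    var sink int // keeps the workload from being optimized away

    // workload is a placeholder for the hot logging loop being measured.
    func workload() {
        for i := 0; i < 50_000_000; i++ {
            sink += i
        }
    }

    func main() {
        f, err := os.Create("cpu.prof") // hypothetical output file name
        if err != nil {
            log.Fatal(err)
        }
        defer f.Close()

        // Profile only the hot section so old/new toolchain runs line up.
        if err := pprof.StartCPUProfile(f); err != nil {
            log.Fatal(err)
        }
        defer pprof.StopCPUProfile()

        workload()
    }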
mknyszek commented on Jun 7, 2024
No problem. These reports are important. :) I hope my response did not discourage future reports. Thanks for the thorough issue you filed -- it's genuinely helpful that y'all do performance testing against tip.
Yeah, it's a bit unfortunate how easy it is to generate benchmarks that do this.
Understandable. Hopefully the RC gives us some more feedback.
Got it, that's good to know! Thanks for checking.
That's unfortunate; just to clarify, that's even with the automated diffing? The only other thing I might suggest is doing the same kind of differential profile with Linux perf if you can, which tends to produce more accurate results overall.
I'm going to leave this issue open for now while I investigate #67822, in case it becomes relevant.
thanm commented on Jun 12, 2024
I ran these benchmarks on my high-core 2-socket machine, with and without link-time randomization (as described in this note: #67822 (comment)).
Here's what I see with no randomization; rock steady:
Now here's what happens when I add in text layout randomization (10 instances of -count=3 runs, each with a random seed):
Note the p values, which are now all over the map. So I strongly suspect that, in addition to the GC artifacts mentioned by @mknyszek, some additional fraction of the apparent degradation is due to loop alignment (e.g. just happening to get the "right" alignment in one case and the "wrong" alignment in another).
JacobOaks commented on Jun 13, 2024
Hmm... FWIW I wasn't able to reproduce this using the script from #67822 (comment), including the same 10 seeds. Perhaps I'm doing something wrong.
No linker randomization:
Linker randomization:
I did sanity-check that go was being invoked correctly.
Very strange.