
runtime: performance degradation on tip on high core count machines #67858

Open

Description

@JacobOaks

Go version

go version go1.22.4 linux/amd64

Output of go env in your module/workspace:

GO111MODULE='on'
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/user/.cache/go-build'
GOENV='/home/user/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/user/go-repos/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/user/go-repos:/opt/go/path:/home/user/go-code'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/user/go/src/github.com/golang/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/user/go/src/github.com/golang/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.22.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/home/user/go/src/github.com/uber-go/zap/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1740765925=/tmp/go-build -gno-record-gcc-switches'

What did you do?

We have been doing some performance testing of Go tip at Uber in preparation for Go 1.23.

What did you see happen?

We have noticed a degradation of around 8% on Linux machines with a lot of cores (96) across all of Zap's Field logging benchmarks. These benchmarks look something like this:

logger := New(
	zapcore.NewCore(
		zapcore.NewJSONEncoder(NewProductionConfig().EncoderConfig),
		&ztest.Discarder{}, // No actual i/o, logs get discarded.
		DebugLevel,
	),
)
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
	for pb.Next() {
		logger.Info("Boolean.", Bool("foo", true))
	}
})
We don't have an isolated Linux environment available to us, so these results are susceptible to a slight noisy-neighbor problem, but we have consistently seen some amount of degradation on these benchmarks:

$ go version
go version go1.22.4 linux/amd64
$ go test -bench Field -run nounittests -count 25 . | tee go1.22.4.log
$ ~/go/src/github.com/golang/go4/bin/go version
go version devel go1.23-93bbf719a6 Wed Jun 5 17:30:16 2024 +0000 linux/amd64
$ ~/go/src/github.com/golang/go4/bin/go test -bench Field -run nounittests -count 25 . | tee 93bbf719a6.log
$ benchstat go1.22.4.log 93bbf719a6.log

goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: AMD EPYC 7B13
                   │ go1.22.4.log │           93bbf719a6.log            │
                   │    sec/op    │   sec/op     vs base                │
BoolField-96          110.6n ± 6%   127.9n ± 8%  +15.64% (p=0.001 n=25)
ByteStringField-96    139.7n ± 1%   149.9n ± 2%   +7.30% (p=0.000 n=25)
Float64Field-96       112.2n ± 4%   125.2n ± 4%  +11.59% (p=0.000 n=25)
IntField-96           108.7n ± 3%   116.2n ± 2%   +6.90% (p=0.000 n=25)
Int64Field-96         105.9n ± 4%   113.2n ± 2%   +6.89% (p=0.009 n=25)
StringField-96        104.4n ± 2%   115.4n ± 4%  +10.54% (p=0.000 n=25)
StringerField-96      105.4n ± 3%   115.5n ± 4%   +9.58% (p=0.000 n=25)
TimeField-96          109.6n ± 2%   117.4n ± 2%   +7.12% (p=0.000 n=25)
DurationField-96      111.6n ± 3%   121.9n ± 3%   +9.23% (p=0.000 n=25)
ErrorField-96         108.4n ± 2%   115.7n ± 4%   +6.73% (p=0.000 n=25)
ErrorsField-96        184.1n ± 2%   205.1n ± 4%  +11.41% (p=0.000 n=25)
StackField-96         713.0n ± 3%   813.3n ± 3%  +14.07% (p=0.000 n=25)
ObjectField-96        117.2n ± 2%   130.9n ± 3%  +11.69% (p=0.000 n=25)
ReflectField-96       317.6n ± 2%   346.0n ± 3%   +8.94% (p=0.000 n=25)
10Fields-96           584.7n ± 2%   622.4n ± 4%   +6.45% (p=0.000 n=25)
100Fields-96          5.919µ ± 3%   5.630µ ± 5%        ~ (p=0.073 n=25)
geomean               196.5n        213.4n        +8.61%

We fiddled with GOMAXPROCS a bit and noticed the degradation is definitely related to parallelism.
[Screenshot (2024-06-06): benchmark timings across a range of GOMAXPROCS values.]
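
For reference, the kind of GOMAXPROCS sweep described above can be scripted roughly like this (the GOMAXPROCS values and log-file names below are illustrative, not the exact ones we used):

for p in 1 2 4 8 16 32 64 96; do
  GOMAXPROCS=$p go test -bench Field -run nounittests -count 10 . | tee "field-gomaxprocs-$p.log"
done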

We didn’t see a whole lot in CPU profiles other than a general increase of about 2-4% of samples taken in the runtime package.

We were able to use git bisect to identify e995aa95cb5f379c1df5d5511ee09970261d877f as one cause. Specifically, the added calls to nanotime() seem to cause degradation in these highly parallelized benchmarks. However, this commit alone does not seem to account for the entire degradation:

$ ~/go/src/github.com/golang/go3/bin/go version
go version devel go1.23-e995aa95cb Mon Apr 8 21:43:16 2024 +0000 linux/amd64
$ ~/go/src/github.com/golang/go3/bin/go test -bench Field -run nounittests -count 25 . | tee e995aa95cb.log
$ benchstat go1224.log e995aa95cb.log
goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: AMD EPYC 7B13
                   │ go1.22.4.log │            e995aa95cb.log            │
                   │    sec/op    │    sec/op     vs base                │
BoolField-96          110.6n ± 6%   121.1n ±  6%   +9.49% (p=0.004 n=25)
ByteStringField-96    139.7n ± 1%   145.9n ±  2%   +4.44% (p=0.002 n=25)
Float64Field-96       112.2n ± 4%   121.1n ±  1%   +7.93% (p=0.000 n=25)
IntField-96           108.7n ± 3%   112.5n ±  2%   +3.50% (p=0.009 n=25)
Int64Field-96         105.9n ± 4%   111.4n ±  3%        ~ (p=0.200 n=25)
StringField-96        104.4n ± 2%   111.5n ±  2%   +6.80% (p=0.000 n=25)
StringerField-96      105.4n ± 3%   113.4n ±  3%   +7.59% (p=0.000 n=25)
TimeField-96          109.6n ± 2%   117.6n ±  2%   +7.30% (p=0.000 n=25)
DurationField-96      111.6n ± 3%   116.8n ±  2%   +4.66% (p=0.000 n=25)
ErrorField-96         108.4n ± 2%   113.7n ±  2%   +4.89% (p=0.002 n=25)
ErrorsField-96        184.1n ± 2%   201.7n ±  4%   +9.56% (p=0.000 n=25)
StackField-96         713.0n ± 3%   770.9n ±  2%   +8.12% (p=0.000 n=25)
ObjectField-96        117.2n ± 2%   127.2n ±  3%   +8.53% (p=0.000 n=25)
ReflectField-96       317.6n ± 2%   349.4n ±  5%  +10.01% (p=0.000 n=25)
10Fields-96           584.7n ± 2%   620.5n ±  5%   +6.12% (p=0.005 n=25)
100Fields-96          5.919µ ± 3%   6.046µ ± 25%        ~ (p=0.064 n=25)
geomean               196.5n        209.5n         +6.62%

We weren’t able to reliably identify any additional commits beyond this one that accounted for more of the degradation.
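For completeness, the bisection itself had roughly this shape (the commands below are a reconstruction rather than a transcript; at each step we rebuilt the toolchain and re-ran the Field benchmarks by hand before marking the commit):

cd ~/go/src/github.com/golang/go
git bisect start
git bisect bad master        # tip shows the regression
git bisect good go1.22.4     # the release does not
# at each step: (cd src && ./make.bash), re-run the zap Field benchmarks,
# then mark the commit with `git bisect good` or `git bisect bad`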

Note: this is not a duplicate of #67857, but rather an investigation of different Zap benchmark degradations.

What did you expect to see?

No practical degradation.

Activity

ianlancetaylor (Contributor) commented on Jun 7, 2024

CC @golang/runtime

mknyszek (Contributor) commented on Jun 7, 2024

These logging benchmarks appear to involve incredibly tight loops, which leads me to believe that it's possibly due to unfortunate microarchitectural effects -- see my comment on #67857.

And my next question would be: does this in any way correspond to production-level regressions? Unless the world is getting stopped a lot, I would not expect 15*GOMAXPROCS cpu-ns of CPU time to have any meaningful effect on performance in any real program (referencing your bisection result). I could see it having an effect in a microbenchmark that stops the world a lot, though.

One thing I might suggest trying is setting GOMEMLIMIT=<something big> and GOGC=off in your benchmarks to try and eliminate or minimize the GC cost of your benchmark (allocation cost will still be measured, though, which I assume you care about). The reason I say this is because microbenchmarks can end up being a torture test for the GC in a way that makes the benchmark less useful. In other words, the costs dominating your benchmark do not reflect the actual costs of your code in real-world contexts.

For example, a benchmark that allocates in a loop, immediately drops the memory, and has a teeny tiny live heap (a lot like most microbenchmarks!), is going to see a very, very high number of GC cycles because it is going to sit at the minimum total heap size of 4 MiB. This means tons of STWs and CPU time spent on mark work that does not translate at all to real production systems with sizable live heaps. The tricky part with the GC is that its frequency (and thus its cost) is proportional to your live heap via GOGC, so considering the GC as "part of the benchmark" is really hard to do in a useful way. It's much easier if the benchmark measures the performance of some larger end-to-end system.
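
As a concrete illustration, a minimal sketch of that pattern (not one of the zap benchmarks) would be:

package gcpressure_test

import "testing"

var sink []byte

// Every iteration heap-allocates and immediately drops the result, so the live heap stays
// near the runtime's 4 MiB minimum while the allocation rate keeps the GC running constantly.
func BenchmarkAllocAndDrop(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sink = make([]byte, 64) // escapes via the package-level sink, forcing a heap allocation
	}
	sink = nil
}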

I'm curious to know if setting GOMEMLIMIT to something big and GOGC=off makes the problem go away completely. If it does, I am less concerned about these regressions. If it doesn't, then I think this becomes much more important to fix. Also, if this regression shows up in more end-to-end benchmarks, or in production, that's much more cause for alarm.
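
A run along those lines might look like this (the 8GiB limit below is just an arbitrarily large placeholder):

GOGC=off GOMEMLIMIT=8GiB go test -bench Field -run nounittests -count 25 . | tee gogcoff.log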

I'll also add: are you able to compare profiles before and after (pprof -diff_base=old.prof new.prof)? Even an 8% regression should stick out like a sore thumb.
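
For example (a sketch; the toolchain aliases and file names are placeholders, with go1.22.4 and gotip standing in for the two toolchains):

go1.22.4 test -bench Field -run nounittests -count 5 -cpuprofile old.prof .
gotip test -bench Field -run nounittests -count 5 -cpuprofile new.prof .
go tool pprof -top -diff_base=old.prof new.prof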

self-assigned this on Jun 7, 2024
added the NeedsInvestigation label (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.) on Jun 7, 2024
added this to the Go1.23 milestone on Jun 7, 2024
JacobOaks (Author) commented on Jun 7, 2024

Hey @mknyszek - thanks for the quick response & ideas!

Your explanation w.r.t microbenchmarks being a GC torture test makes sense - these tests are allocating memory that quickly becomes dead - which leads to a situation with consistent allocations but low target memory, causing GC to get triggered often. I suppose this would exacerbate any GC performance degradations that are otherwise minuscule in larger benchmarks/applications.

It's unfortunately not realistic for us to actually test Go tip in production - but we may be able to do this once an rc version is tagged.

I did turn off GC and re-ran these benchmarks, and as you predicted, this does seem to result in no degradation between the versions. Like I mentioned before, we had a hard time discerning any differences in profiles between the two versions, but I tried to make the issue worse by forcing more GC with GOGC=50 in hopes that differences between the profiles would surface better, and did find that gcBgMarkWorker saw about a 3% increase in CPU time with this - which seems in line with your explanation. There weren't any other significant differences that I could tell (happy to share the profiles if it helps).

mknyszek (Contributor) commented on Jun 7, 2024

> Hey @mknyszek - thanks for the quick response & ideas!

No problem. These reports are important. :) I hope my response did not discourage future reports. Thanks for the thorough issue you filed -- it's genuinely helpful that y'all do performance testing against tip.

> Your explanation w.r.t microbenchmarks being a GC torture test makes sense - these tests are allocating memory that quickly becomes dead - which leads to a situation with consistent allocations but low target memory, causing GC to get triggered often. I suppose this would exacerbate any GC performance degradations that are otherwise minuscule in larger benchmarks/applications.

Yeah, it's a bit unfortunate how easy it is to generate benchmarks that do this.

> It's unfortunately not realistic for us to actually test Go tip in production - but we may be able to do this once an rc version is tagged.

Understandable. Hopefully the RC gives us some more feedback.

> I did turn off GC and re-ran these benchmarks, and as you predicted, this does seem to result in no degradation between the versions. Like I mentioned before, we had a hard time discerning any differences in profiles between the two versions, but I tried to make the issue worse by forcing more GC with GOGC=50 in hopes that differences between the profiles would surface better, and did find that gcBgMarkWorker saw about a 3% increase in CPU time with this - which seems in line with your explanation. There weren't any other significant differences that I could tell (happy to share the profiles if it helps).

Got it, that's good to know! Thanks for checking.

> we had a hard time discerning any differences in profiles between the two versions

That's unfortunate; just to clarify, that's even with the automated diffing? The only other thing I might suggest is doing the same kind of differential profile with Linux perf if you can, which tends to produce more accurate results overall.
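
A perf-based comparison might look roughly like this (binary and file names are illustrative, with go1.22.4 and gotip again standing in for the two toolchains):

go1.22.4 test -c -o zap.base.test .
gotip test -c -o zap.tip.test .
perf record -o base.data -- ./zap.base.test -test.bench=Field -test.run=nounittests -test.count=5
perf record -o tip.data -- ./zap.tip.test -test.bench=Field -test.run=nounittests -test.count=5
perf diff base.data tip.data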

I'm going to leave this issue open for now while I investigate #67822, in case it becomes relevant.

thanm (Contributor) commented on Jun 12, 2024

I ran these benchmarks on my high-core 2-socket machine, with and without link-time randomization (as described in this note: #67822 (comment)).

Here's what I see with no randomization; rock steady:

goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                   │ out.base.txt │             out.tip.txt             │
                   │    sec/op    │   sec/op     vs base                │
BoolField-96         116.1n ±  5%   127.7n ± 3%   +9.99% (p=0.000 n=30)
ByteStringField-96   135.2n ±  5%   143.5n ± 8%   +6.18% (p=0.001 n=30)
Float64Field-96      118.6n ±  4%   128.1n ± 5%   +8.06% (p=0.000 n=30)
IntField-96          104.6n ±  3%   114.9n ± 2%   +9.85% (p=0.000 n=30)
Int64Field-96        107.0n ±  2%   116.7n ± 2%   +9.12% (p=0.000 n=30)
StringField-96       104.7n ±  1%   115.4n ± 4%  +10.22% (p=0.000 n=30)
StringerField-96     107.5n ±  2%   114.1n ± 5%   +6.19% (p=0.000 n=30)
TimeField-96         107.6n ±  4%   118.3n ± 3%   +9.90% (p=0.000 n=30)
DurationField-96     110.9n ±  2%   120.4n ± 3%   +8.66% (p=0.000 n=30)
ErrorField-96        104.2n ±  2%   115.6n ± 2%  +10.94% (p=0.000 n=30)
ErrorsField-96       198.9n ±  3%   211.7n ± 2%   +6.41% (p=0.000 n=30)
StackField-96        673.3n ±  7%   747.2n ± 3%  +10.97% (p=0.000 n=30)
ObjectField-96       131.1n ±  1%   145.5n ± 2%  +10.98% (p=0.000 n=30)
ReflectField-96      265.9n ±  1%   297.7n ± 5%  +11.98% (p=0.000 n=30)
10Fields-96          560.2n ± 11%   576.4n ± 4%        ~ (p=0.090 n=30)
100Fields-96         8.545µ ±  1%   8.635µ ± 1%   +1.05% (p=0.003 n=30)
geomean              199.9n         216.5n        +8.29%

Now here's what happens when I add in text layout randomization (10 instances of -count=3 runs, each with a random seed):

goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: Intel(R) Xeon(R) CPU @ 2.00GHz
                   │ rout.base.txt │             rout.tip.txt             │
                   │    sec/op     │    sec/op     vs base                │
BoolField-96          127.3n ±  8%   131.7n ±  7%        ~ (p=0.301 n=30)
ByteStringField-96    155.5n ±  8%   146.8n ±  5%        ~ (p=0.506 n=30)
Float64Field-96       131.2n ±  9%   131.4n ±  3%        ~ (p=0.762 n=30)
IntField-96           119.5n ±  6%   114.9n ±  4%        ~ (p=0.636 n=30)
Int64Field-96         120.1n ±  6%   119.7n ±  4%        ~ (p=0.530 n=30)
StringField-96        115.0n ±  9%   120.1n ±  3%        ~ (p=0.124 n=30)
StringerField-96      112.2n ±  9%   117.9n ±  3%   +5.13% (p=0.044 n=30)
TimeField-96          115.8n ± 10%   128.4n ±  7%  +10.88% (p=0.004 n=30)
DurationField-96      124.2n ±  4%   123.0n ±  3%        ~ (p=0.605 n=30)
ErrorField-96         120.5n ±  8%   116.8n ± 10%        ~ (p=0.231 n=30)
ErrorsField-96        210.8n ±  4%   224.1n ±  3%   +6.36% (p=0.001 n=30)
StackField-96         748.4n ±  6%   779.4n ±  4%        ~ (p=0.156 n=30)
ObjectField-96        144.7n ±  3%   146.7n ±  5%        ~ (p=0.067 n=30)
ReflectField-96       324.1n ±  4%   314.8n ±  6%        ~ (p=0.807 n=30)
10Fields-96           578.1n ±  5%   580.1n ±  5%        ~ (p=0.894 n=30)
100Fields-96          8.917µ ±  1%   8.771µ ±  1%   -1.64% (p=0.002 n=30)
geomean               220.7n         223.0n         +1.03%

Note the p values, which are now all over the map. So I strongly suspect that in addition to the GC artifacts mentioned by @mknyszek, some additional fraction of the apparent degradation is due to loop alignment (e.g. just happening to get the "right" alignment in one case and the "wrong" alignment in another).
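
For anyone reproducing: the randomized runs were of roughly this shape (the seed values and output file name below are illustrative):

for seed in 1001 2002 3003 4004 5005 6006 7007 8008 9009 10010; do
  go test -ldflags=-randlayout=$seed -count=3 -run=nothing -bench=Field . | tee -a rout.tip.txt
done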

JacobOaks (Author) commented on Jun 13, 2024

> some additional fraction of the apparent degradation is due to loop alignment

Hmm... FWIW I wasn't able to reproduce this using the script from #67822 (comment), including the same 10 seeds. Perhaps I'm doing something wrong.

No linker randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: AMD EPYC 7B13
                   │ out.go1.22.txt │             out.tip.txt             │
                   │     sec/op     │   sec/op     vs base                │
BoolField-96            112.3n ± 3%   126.5n ± 4%  +12.60% (p=0.000 n=30)
ByteStringField-96      141.2n ± 3%   150.8n ± 2%   +6.83% (p=0.000 n=30)
Float64Field-96         115.6n ± 2%   127.2n ± 2%  +10.03% (p=0.000 n=30)
IntField-96             107.4n ± 2%   116.2n ± 1%   +8.14% (p=0.000 n=30)
Int64Field-96           106.2n ± 2%   116.0n ± 2%   +9.33% (p=0.000 n=30)
StringField-96          105.6n ± 1%   115.5n ± 1%   +9.38% (p=0.000 n=30)
StringerField-96        107.4n ± 2%   113.5n ± 2%   +5.68% (p=0.000 n=30)
TimeField-96            110.0n ± 2%   119.6n ± 1%   +8.73% (p=0.000 n=30)
DurationField-96        110.5n ± 1%   122.2n ± 1%  +10.54% (p=0.000 n=30)
ErrorField-96           107.8n ± 3%   115.9n ± 2%   +7.56% (p=0.000 n=30)
ErrorsField-96          188.1n ± 2%   209.0n ± 3%  +11.17% (p=0.000 n=30)
StackField-96           725.4n ± 2%   788.0n ± 1%   +8.64% (p=0.000 n=30)
ObjectField-96          120.1n ± 1%   129.1n ± 3%   +7.54% (p=0.000 n=30)
ReflectField-96         326.8n ± 1%   350.2n ± 1%   +7.16% (p=0.000 n=30)
10Fields-96             597.5n ± 2%   594.0n ± 4%        ~ (p=0.906 n=30)
100Fields-96            5.597µ ± 1%   5.667µ ± 5%        ~ (p=0.623 n=30)
geomean                 198.0n        213.2n        +7.70%

Linker randomization:

goos: linux
goarch: amd64
pkg: go.uber.org/zap
cpu: AMD EPYC 7B13
                   │ rout.go1.22.txt │            rout.tip.txt             │
                   │     sec/op      │   sec/op     vs base                │
BoolField-96             116.4n ± 2%   132.6n ± 3%  +13.87% (p=0.000 n=30)
ByteStringField-96       143.0n ± 2%   152.6n ± 2%   +6.64% (p=0.000 n=30)
Float64Field-96          115.5n ± 1%   128.0n ± 2%  +10.77% (p=0.000 n=30)
IntField-96              108.3n ± 1%   116.9n ± 3%   +7.94% (p=0.000 n=30)
Int64Field-96            107.4n ± 2%   117.6n ± 2%   +9.45% (p=0.000 n=30)
StringField-96           107.2n ± 2%   117.0n ± 2%   +9.14% (p=0.000 n=30)
StringerField-96         108.4n ± 2%   117.8n ± 2%   +8.68% (p=0.000 n=30)
TimeField-96             111.2n ± 1%   121.4n ± 1%   +9.22% (p=0.000 n=30)
DurationField-96         113.8n ± 3%   124.6n ± 2%   +9.44% (p=0.000 n=30)
ErrorField-96            108.5n ± 3%   119.7n ± 1%  +10.32% (p=0.000 n=30)
ErrorsField-96           186.7n ± 2%   208.5n ± 1%  +11.70% (p=0.000 n=30)
StackField-96            720.4n ± 1%   805.6n ± 3%  +11.82% (p=0.000 n=30)
ObjectField-96           120.2n ± 2%   132.0n ± 2%   +9.78% (p=0.000 n=30)
ReflectField-96          329.4n ± 2%   357.0n ± 3%   +8.39% (p=0.000 n=30)
10Fields-96              593.3n ± 2%   618.2n ± 2%   +4.20% (p=0.001 n=30)
100Fields-96             5.690µ ± 4%   5.744µ ± 2%        ~ (p=0.935 n=30)
geomean                  199.7n        217.4n        +8.85%

I did sanity-check that go was being invoked correctly:

+ go version
go version go1.22.4 linux/amd64
+ go test -ldflags=-randlayout=17017 -count=3 -run=nothing -bench=Field
...
+ go version
go version devel go1.23-97bc577812 Wed Jun 12 18:56:34 2024 +0000 linux/amd64
+ go test -ldflags=-randlayout=15015 -count=3 -run=nothing -bench=Field

Very strange.

(7 remaining items not shown)
