Description
What version of Go are you using (go version)?
go 1.11.12, planning to upgrade to go 1.12.15
Does this issue reproduce with the latest release?
Yes, with go 1.12.15
What operating system and processor architecture are you using (go env)?
Ubuntu 18.04
What did you do?
Since go 1.11 is about to go EOL, we are planning to upgrade to the next version. I attempted to upgrade to go 1.12.15, but I am seeing a 15-20% performance degradation. A similar result was seen with go 1.13.6.
What did you expect to see?
I was hoping to see performance improvement with a newer go compiler.
What did you see instead?
Instead of a performance improvement, I saw a performance degradation.
Is there anything in newer go versions (1.12 and beyond) that could explain the above performance degradation? I read online that the scheduler was modified to allow preemption of long-running goroutines. While I do not know if this could cause the performance issue, is there any way to revert this scheduler behavior just for test purposes? Any other suggestions are welcome.
Activity
ALTree commented on Jan 13, 2020
Hi,
thanks for reporting this, but your description of the issue is far too vague to be actionable.
Please provide a runnable, self-contained benchmark that demonstrates the problem on the latest 1.13 release, together with some benchmarking results on the platform you are running on.
Issue retitled from "Performance degradation in go 1.12" to "runtime: performance degradation in go 1.12".
interviewQ commented on Jan 13, 2020
Thanks for the response. I do see the performance issue with 1.13 also. Give me a few days to come up with sample code that demonstrates the issue.
interviewQ commented on Feb 4, 2020
Sorry for the delay; it took me a while to narrow down the problem code path. Attached is a sample benchmark test. I am running this test on my Mac and here are the results -
1.11.13 -
goos: darwin
goarch: amd64
BenchmarkBlockSplit-12 1 2506669772 ns/op
PASS
1.13.7 -
goos: darwin
goarch: amd64
BenchmarkBlockSplit-12 1 2945086182 ns/op
PASS
So, 2.5 sec -> 2.9 sec equates to 16% drop in performance.
perf.txt
randall77 commented on Feb 4, 2020
Here's a simple reproducer. I can reproduce on Linux.
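(The reproducer itself isn't included here; below is a minimal sketch of a benchmark with the same shape, assuming many parallel 64 KiB allocations that are split into sub-slices and never otherwise touched, per the discussion later in the thread.)

```go
package main

import (
	"runtime"
	"sync"
	"testing"
)

// Sketch of an allocation-heavy benchmark: each worker repeatedly
// allocates a 64 KiB block and splits it into 1 KiB sub-slices without
// writing to the memory, keeping the allocator and GC pacing busy.
func BenchmarkBlockSplit(b *testing.B) {
	const blockSize = 64 << 10 // assumed block size
	workers := runtime.GOMAXPROCS(0)
	for i := 0; i < b.N; i++ {
		var wg sync.WaitGroup
		for w := 0; w < workers; w++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for j := 0; j < 1000; j++ {
					buf := make([]byte, blockSize)
					// Split the block into 1 KiB views; the memory is never written.
					parts := make([][]byte, 0, blockSize/1024)
					for off := 0; off < blockSize; off += 1024 {
						parts = append(parts, buf[off:off+1024])
					}
					_ = parts
				}
			}()
		}
		wg.Wait()
	}
}
```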
This test is really allocation heavy. It's going to depend on the precise details of how allocation is done. I get these times:
I think most of the difference has to do with how big the heap grows. Here's my eyeballing of the maximum RSS (adding lots of iterations to the innermost loop so I can watch top while it runs):

This test is stressing the allocator so much that it goes way beyond 200% of live heap data, which should only be 640KB or so (with the default GOGC=100, the heap is allowed to grow to roughly twice the live data before the next collection). You'll see that we've made steady improvements on how much over we go. Of course, that means that we'll run garbage collections more often, which will hurt the run time.
I'm not sure there's any bug here. It just seems like a different speed/memory tradeoff. To actually demonstrate a regression, I think we'd need to hold heap size constant somehow.
@mknyszek
interviewQ commented on Feb 4, 2020
Our app, which is showing the performance degradation, is also allocation heavy, which is why we are seeing this regression with the newer Go runtime. Having said that, over the last few weeks I have made lots of changes to improve memory management and did see improvements that reduce the performance drop. From what you are saying above, it seems like allocation-heavy apps will have to take a hit when they upgrade to a newer Go runtime.
In your test above it seems the time taken increased from 0.689 sec to 1.151 sec, which is a ~67% increase. Is this big drop in performance solely attributed to running GC more often?
mknyszek commented on Feb 4, 2020
@interviewQ Some background: there was a known regression in the slow path of the allocator in Go 1.12, which was intentional in order to support returning memory to the OS more eagerly. For the vast majority of programs, this wasn't noticed. The regression only showed up in microbenchmarks, not in any real production services or applications (as far as we could see/find). You need to be really allocation bound to notice this, as @randall77's benchmark shows. Note that @randall77's reproducer makes many 64 KiB allocations without ever touching that memory (aside from zeroing), which would already meaningfully reduce the load on the allocator.
However, it's also been known that allocating heavily and in parallel has had serious lock contention issues since at least Go 1.11. This was noticed when we worked on the regression in Go 1.12. In Go 1.14 we worked to fix these issues (#35112) and it went reasonably well. I'm fairly confident that the allocator is now faster, so this seems to me like it's the change in the default trade-off in these "way past the heap goal" cases as @randall77 says.
With that being said, @interviewQ could you check whether your application's memory use went down and, if it did, increase GOGC to the point where you're using the same amount of memory as before, then measure CPU performance again? Hopefully the net effect will be that we made performance better. :) If not, we can move on from there.
interviewQ commented on Feb 4, 2020
First of all thanks a lot for these quick replies and very useful info.
I had tested with 1.14beta1 in December and it still showed the performance issue.
Did the fix for #35112 go in after 1.14beta1 was released? I was planning to retest with 1.14 when it gets released (it was due on Feb 1). However, I could try the tip of the master branch.
In our app, we have 1000 goroutines, all allocating memory in parallel, so the "allocating heavily and in parallel" case you describe above is exactly what applies to our app.
Using sync.Pool is something I have thought about, but it seems non-trivial since I will need to figure out when to put a buffer back in the pool. It is not impossible, just hard to get right.
What if I have one goroutine whose only job is to allocate memory? The 1000 goroutines could talk to this "allocator" goroutine via channels. That way the huge allocations happen from only one goroutine and contention is eliminated. Please let me know if this makes sense. Also, I come from the C++ world and am new to Go, so please pardon my ignorance.
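(For concreteness, a minimal sketch of the sync.Pool idea mentioned above, assuming fixed-size 64 KiB buffers; deciding when a buffer can safely be returned is still the hard part and is not solved here.)

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical pool of fixed-size 64 KiB blocks. Storing *[]byte avoids an
// extra allocation when the slice header is boxed into an interface value.
var blockPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 64<<10)
		return &b
	},
}

func getBlock() *[]byte { return blockPool.Get().(*[]byte) }

// putBlock must only be called once nothing references the buffer --
// knowing when that is true is the difficulty described above.
func putBlock(b *[]byte) { blockPool.Put(b) }

func main() {
	buf := getBlock()
	fmt.Println("got block of", len(*buf), "bytes")
	putBlock(buf)
}
```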
@mknyszek
mknyszek commented on Feb 4, 2020
There may be a regression in the default configuration, but have you looked at memory use like I mentioned? If it went down, you could increase GOGC from the default until your memory use is the same as before (in the steady state) and see if it performs better. Since you mentioned you're relatively new to Go, check out the comment at the top of https://golang.org/pkg/runtime for an explanation of GOGC, and also see https://golang.org/pkg/runtime/debug/#SetGCPercent for more information.
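(For illustration, a minimal sketch of raising the GC target from code via the SetGCPercent API linked above; 200 is just an assumed starting value to tune, equivalent to running with GOGC=200 in the environment.)

```go
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Let the heap grow to ~3x the live set between collections instead of
	// the default ~2x (GOGC=100), trading memory for less GC CPU time.
	old := debug.SetGCPercent(200)
	fmt.Println("GC percent:", old, "->", 200)
	// ... run the allocation-heavy workload and compare RSS and CPU ...
}
```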
The number of goroutines doesn't really matter since that's just concurrency. The GOMAXPROCS value is your actual level of parallelism (up to the number of independent CPU cores on your machine). What is GOMAXPROCS for your application?
Contention isn't really eliminated; you're just moving it from the allocator to the channels in this case. Before trying to restructure your code for this, please give my suggestion above a try. You might just have to increase GOGC to get the same level of CPU performance for the same memory usage.
interviewQ commented on Feb 4, 2020
GOMAXPROCS = number of CPU cores. Our app runs on EC2 instances, so it depends on which instance we are running on. In my test GOMAXPROCS = 72. I will try out your suggestions and will get back to you (most likely tomorrow).
interviewQ commented on Feb 4, 2020
I ran tests with various go versions and checked RSS usage.
1.11.13 - 3.3 GB
1.13.7 - 3.4GB
1.14 (tip of master) - 4.6 GB
Actually, RSS increased with 1.14; because of this I did not do the GOGC adjustment. On this note, I see mention of a SetMaxHeap function in the runtime (https://blog.golang.org/ismmkeynote). Is this something that would be included in future versions of Go?
Relative to 1.11.13, performance-wise I see 25% drop with 1.13.7 and 12% drop with 1.14.
By reducing heap allocation in the app I do see a performance improvement, so it is clear the allocator is the bottleneck.
@mknyszek
mknyszek commented on Feb 5, 2020
Just out of curiosity, how did you measure RSS?
I just want to be very precise on this because the virtual memory usage of Go increased significantly with 1.14 (around 600 MiB, which admittedly doesn't account for everything you're seeing), but most of that memory is not mapped in/committed.
Thanks for the quick turnaround on this and for your cooperation. 72 cores is not a level of parallelism I've personally ever tested; it's possible we have some scalability problems at that level that we haven't seen before.
As another experiment, can you try setting GOMAXPROCS to 48 or so? There are other things I'd be interested in looking at as well, but I'll wait for your reply first.
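(If changing the environment is awkward, a minimal sketch of capping parallelism from code for this experiment; 48 is the value suggested above.)

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap parallelism for the experiment; equivalent to running with
	// GOMAXPROCS=48 set in the environment.
	prev := runtime.GOMAXPROCS(48)
	fmt.Println("GOMAXPROCS:", prev, "->", runtime.GOMAXPROCS(0))
	// ... rest of the application ...
}
```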
mknyszek commented on Feb 5, 2020
@interviewQ Also, if you'd be willing to share a GC trace, that would give us a lot more insight into what the runtime is doing. You can collect one by running your application with GODEBUG=gctrace=1 set and capturing STDERR.
interviewQ commented on Feb 5, 2020
@mknyszek
I measured RSS by looking at the "top" output; did you mean something else when you mentioned the memory usage of the application? I had collected a gctrace but did not save it. I will run the test again and give you the output. I am planning to run the following tests -
Do you also need results from 1.14 (top of master branch)?
dmitshur commented on Feb 13, 2020
It's not clear to me yet if this should be a release blocker for 1.14, but I'll tentatively add it for visibility so we can discuss it in a release meeting.
dmitshur commented on Feb 14, 2020
We've looked over this, and it should not be a release blocker for Go 1.14, because it's not a regression that is specific to 1.14 (it affects previous releases too). Removing release-blocker.
gopherbot commented on Apr 12, 2020
Timed out in state WaitingForInfo. Closing.
(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)
networkimprov commented on Apr 12, 2020
@interviewQ I think this was waiting on pprof results from you, described in #36521 (comment).