runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3
Closed
What version of Go are you using (go version)?
$ gotip version
go version devel +3c47ead Thu Nov 7 19:20:57 2019 +0000 darwin/amd64
Does this issue reproduce with the latest release?
The current release, 1.13.3, runs faster. In fact, a version of gotip as of yesterday saw this program spending 50% of its time in GC. With this latest version of tip, it is now running at 33%.
What operating system and processor architecture are you using (go env)?
$ gotip env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/bill/Library/Caches/go-build"
GOENV="/Users/bill/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/bill/code/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/bill/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/bill/sdk/gotip/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f8/nl6gsnzs1m7530bkx9ct8rzc0000gn/T/go-build761411139=/tmp/go-build -gno-record-gcc-switches -fno-common"
What did you do?
https://github.com/ardanlabs/gotraining/tree/master/topics/go/profiling/trace
With the following code changes:
// Uncomment these two lines.
44 trace.Start(os.Stdout)
45 defer trace.Stop()
Comment out line 53 and uncomment line 56.
52 topic := "president"
53 // n := freq(topic, docs)
54 // n := freqConcurrent(topic, docs)
55 // n := freqConcurrentSem(topic, docs)
56 n := freqNumCPU(topic, docs)
57 // n := freqNumCPUTasks(topic, docs)
58 // n := freqActor(topic, docs)
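For context, the overall shape of the program with those changes is roughly the following (a minimal sketch; docs and freqNumCPU are stubbed stand-ins for the repo's real data and implementation, included only so the sketch compiles):

package main

import (
	"log"
	"os"
	"runtime/trace"
)

// docs and freqNumCPU stand in for the repo's real corpus and
// search implementation.
var docs []string

func freqNumCPU(topic string, docs []string) int { return 0 }

func main() {
	// Stream the execution trace to stdout so a run can be captured
	// with `./trace > t.out`, as in the steps below.
	if err := trace.Start(os.Stdout); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	topic := "president"
	n := freqNumCPU(topic, docs)

	// log writes to stderr, so this does not mix with the trace on stdout.
	log.Println("found", n, "instances of", topic)
}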
Run the program
$ gotip build
$ ./trace > t.out
$ gotip tool trace t.out
What did you expect to see?
I expected GC to be at or under 25% of the program's total run time, and I didn't expect the program to run slower. Also, the freqConcurrent version of the algorithm used to run in a comparable amount of time; on tip it is now faster as well, by close to 300 milliseconds.
What did you see instead?
With the latest version of tip for today, I saw GC using 33% of the total run time.
On tip:
Name | Wall duration | Self time | Average wall duration | Occurrences
GC | 282,674,620 ns | 282,674,620 ns | 674,641 ns | 419
Selection start: 3,595,151 ns
Selection extent: 845,408,873 ns
Total run time: 849.3 ms

On 1.13.3:
Name | Wall duration | Self time | Average wall duration | Occurrences
GC | 174,446,968 ns | 174,446,968 ns | 425,480 ns | 410
Selection start: 2,872,528 ns
Selection extent: 763,358,190 ns
Total run time: 768.0 ms
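(Checking the arithmetic on those rows: 282.7 ms of GC wall time over an 849.3 ms run is about 33%, while 174.4 ms over 768.0 ms is about 23%, which matches the 33% figure above and the under-25% expectation for 1.13.3.)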
odeke-em commented on Nov 7, 2019
Thank you for reporting this issue @ardan-bkennedy!
Kindly paging @mknyszek @randall77 @aclements @RLH.
[Issue retitled from "runtime/GC: Program appears to spend 10% more time in GC on tip" to "runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3".]

mknyszek commented on Nov 7, 2019
This is likely related to golang.org/cl/200439 which allows the GC to assist more than 25% in cases where there's a high rate of allocation.
Although this seems like a regression, please stay tuned. I'm currently in the process of landing a set of patches related to #35112 and by the end, with this additional GC use, it's a net win for heavily allocating applications (AFAICT).
The reason we're allowing GC to exceed 25% in these cases is that #35112 makes the page allocator fast enough to outrun the GC and drive the trigger ratio to very low values (like 0.01), which means the next mark phase starts almost immediately. At that point pretty much all new memory would be allocated black, leading to an unnecessary RSS increase. By bounding the trigger ratio as in golang.org/cl/200439, your application may end up assisting more, but in my experiments the latency win from #35112 still beats that latency hit by a significant margin.
I'll poke this thread again when I've finished landing the full stack of changes, so please try again at that point.
In the meantime, could you provide some information about your application? This will help me get a better idea of whether this will be a win, or whether this is a loss in single-threaded performance or something else.
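One lightweight way to watch the numbers being discussed here, without capturing a full execution trace, is the runtime's own statistics; the sketch below uses only standard runtime APIs and is not from the original thread (running with GODEBUG=gctrace=1 prints similar per-cycle figures):

package main

import (
	"fmt"
	"runtime"
)

// printGCStats prints the figures this thread is about: the fraction
// of available CPU spent in GC since program start, the number of
// completed cycles, the live heap, and the heap goal for the next cycle.
func printGCStats() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("cycles=%d gc-cpu=%.1f%% heap=%d KB next-gc=%d KB\n",
		m.NumGC, m.GCCPUFraction*100, m.HeapAlloc/1024, m.NextGC/1024)
}

func main() {
	// ... run the workload here, then:
	printGCStats()
}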
ardan-bkennedy commented on Nov 7, 2019
This runs as a 12-threaded Go program, so the code is using a pool of 12 goroutines, and the GC is keeping the heap at 4 MB. In the version of the code that creates a goroutine per file, I see the heap grow as high as 80 MB.
The program is opening, reading, decoding, and searching 4000 files, so it is memory intensive to an extent. Throwing 4000 goroutines at this problem on tip finishes the work faster than using a pool. That was never the case in 1.13.
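The pooled shape described above is roughly the following (a sketch under assumptions; the repo's freqNumCPU differs in its details, and processDoc is a hypothetical stand-in for the open/read/decode/search work):

package main

import (
	"runtime"
	"sync"
)

// processDoc is a hypothetical stand-in for opening, reading,
// decoding, and searching one file, returning the match count.
func processDoc(file, topic string) int { return 0 }

// pooled fans the files out to runtime.NumCPU() workers (12 on the
// reporter's machine) instead of one goroutine per file.
func pooled(topic string, files []string) int {
	ch := make(chan string, len(files))
	for _, f := range files {
		ch <- f
	}
	close(ch)

	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		total int
	)
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for f := range ch {
				n := processDoc(f, topic)
				mu.Lock()
				total += n
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return total
}

func main() {
	_ = pooled("president", nil)
}

The fan-out variant (the freqConcurrent shape) launches one goroutine per file instead of per CPU, which is what lets the heap grow to ~80 MB here.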
ardan-bkennedy commented on Nov 7, 2019
I find this interesting. This is my understanding.
A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more Ms can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.
This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy-allocation scenarios. This will increase the amount of MA time and reduce the application's throughput during a GC?
Your hope is that the performance loss there is gained back in the allocator?
In the end, the heap size remains as small as possible?
mknyszek commented on Nov 7, 2019

> A GC may start early (before the heap reaches the GC percent threshold) if it means reducing MA time.

Pretty much, though I wouldn't characterize it as "may start early", but rather as just "starts earlier". It's the pacer's job to drive GC use to 25%, and its primary tool for doing so is deciding when to start a GC.

> This will increase the amount of MA time and reduce the application's throughput during a GC?

Both latency and throughput, but yes, that's correct.

> Your hope is that the performance loss there is gained back in the allocator?

Correct. A heavily allocating RPC benchmark was able to drive the pacer to start a GC at the half-way point (trigger ratio = 0.5) in Go 1.13. The same benchmark drove the trigger ratio to 0.01 with the new allocator. The most convincing evidence that the allocator simply got faster was that the only thing that brought the trigger ratio back up was adding a sleep on the critical path.
In the end, this RPC benchmark saw a significant improvement in tail latency (-20% or more) and throughput (+30% or more), even with the new threshold.

> In the end, the heap size remains as small as possible?

Not quite. The threshold in the CL above was chosen to keep the heap size roughly the same across Go versions.
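To put numbers on the trigger ratio (my arithmetic, not from the thread): the pacer starts a cycle when the live heap has grown by trigger-ratio times the heap marked live at the end of the previous cycle, and with GOGC=100 the heap goal is twice that marked heap. For the 4 MB heap described above, a trigger ratio of 0.5 starts the next cycle at 6 MB, halfway to the 8 MB goal, while a ratio of 0.01 starts it at about 4.04 MB, essentially immediately.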
ianlancetaylor commented on Nov 16, 2019
@mknyszek Can you let @ardan-bkennedy know when to re-run tests for this issue? Thanks.
mknyszek commented on Nov 18, 2019
@ianlancetaylor Ah! Sorry. I completely forgot.
@ardan-bkennedy If it's on Linux, Windows, or FreeBSD, feel free to try again from tip any time. :) Still working out some issues on the less popular platforms.
ardan-bkennedy commented on Nov 19, 2019
@mknyszek I am running on a Mac. I need time to test this on Linux.
Side note: I find it interesting that you consider Darwin a less popular platform, when most developers I know are working on it.
mknyszek commented on Nov 19, 2019
@ardan-bkennedy That's my mistake; I omitted it by accident. I do consider it a popular platform. Please give it a try.
The "less popular" platforms I had in mind were AIX and OpenBSD, so anything that's not those two is fine, though AIX should be OK now.
ardan-bkennedy commented on Nov 19, 2019
I gave Darwin a try today and GC actually ran closer to 60%. I just downloaded tip once more and ran it again; the program now looks like it is running slower, but GC is at about 37%.
(Two trace screenshots were attached, labeled "12 Days Ago" and "Tonight".)
go version devel +8b1e8a424a Tue Nov 19 19:59:21 2019 +0000 darwin/amd64
mknyszek commented on Nov 20, 2019
@ardan-bkennedy Interesting. I wonder to what extent we're seeing additional start-up costs here, considering that the application only runs for about a second (though 70 ms is kind of a lot; I wonder what the distributions look like).
ardan-bkennedy commented on Nov 20, 2019
@mknyszek The program is available for you to run; I added instructions earlier. On my machine I expect the program to run as fast as the fan-out version with 4000 goroutines, which is ~750 ms.