
runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3 #35430

Closed

Description

@ardan-bkennedy

What version of Go are you using (go version)?

$ gotip version
go version devel +3c47ead Thu Nov 7 19:20:57 2019 +0000 darwin/amd64

Does this issue reproduce with the latest release?

The current release, 1.13.3, runs this program faster. In fact, a version of gotip from yesterday had this program spending 50% of its time in GC; with this latest version of tip it is now at 33%.

What operating system and processor architecture are you using (go env)?

go env Output
$ gotip env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/bill/Library/Caches/go-build"
GOENV="/Users/bill/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/bill/code/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/Users/bill/sdk/gotip"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/Users/bill/sdk/gotip/pkg/tool/darwin_amd64"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/f8/nl6gsnzs1m7530bkx9ct8rzc0000gn/T/go-build761411139=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://github.com/ardanlabs/gotraining/tree/master/topics/go/profiling/trace

With the following code changes.

// Uncomment these two lines.
44     trace.Start(os.Stdout)
45     defer trace.Stop()

Comment out line 53 and uncomment line 56.

52     topic := "president"
53     // n := freq(topic, docs)
54     // n := freqConcurrent(topic, docs)
55     // n := freqConcurrentSem(topic, docs)
56     n := freqNumCPU(topic, docs)
57     // n := freqNumCPUTasks(topic, docs)
58     // n := freqActor(topic, docs)

Run the program

$ gotip build
$ ./trace > t.out
$ gotip tool trace t.out

What did you expect to see?

I expected GC to be at or under 25% of the program's total run time, and I didn't expect the program to run slower. Also, the freqConcurrent version of the algorithm used to run in a comparable time; now, on tip, it is faster as well, by close to 300 milliseconds.

What did you see instead?

With the latest version of tip as of today, I saw GC using 33% of the total run time.

On Tip

GC | 282,674,620 ns | 282,674,620 ns | 674,641 ns | 419
Selection start: 3,595,151 ns
Selection extent: 845,408,873 ns
Total Run time: 849.3ms

On 1.13.3

GC | 174,446,968 ns | 174,446,968 ns | 425,480 ns | 410
Selection start: 2,872,528 ns
Selection extent: 763,358,190 ns
Total Run time: 768.0ms
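
(For reference, these GC percentages follow from dividing GC wall time by total run time: 282,674,620 ns / 849.3 ms ≈ 33% on tip versus 174,446,968 ns / 768.0 ms ≈ 23% on Go 1.13.3, which is roughly the 10-point gap in the title.)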

Activity

odeke-em (Member) commented on Nov 7, 2019

Thank you for reporting this issue @ardan-bkennedy!

Kindly paging @mknyszek @randall77 @aclements @RLH.

The issue title was changed from "runtime/GC: Program appears to spend 10% more time in GC on tip" to "runtime: program appears to spend 10% more time in GC on tip 3c47ead than on Go1.13.3" on Nov 7, 2019.
mknyszek (Contributor) commented on Nov 7, 2019

This is likely related to golang.org/cl/200439 which allows the GC to assist more than 25% in cases where there's a high rate of allocation.

Although this seems like a regression, please stay tuned. I'm currently in the process of landing a set of patches related to #35112 and by the end, with this additional GC use, it's a net win for heavily allocating applications (AFAICT).

The reason we're allowing GC to exceed 25% in these cases is that #35112 makes the page allocator fast enough to outrun the GC and drive the trigger ratio to very low values (like 0.01), which means the next mark phase starts almost immediately, so pretty much all new memory would be allocated black, leading to an unnecessary RSS increase. By bounding the trigger ratio as in golang.org/cl/200439, your application may end up assisting more, but the latency win from #35112 should still beat that latency hit by a significant margin in my experiments.

I'll poke this thread again when I've finished landing the full stack of changes, so please try again at that point.

In the meantime, could you provide some information about your application? In particular:

  • What is the value of GOMAXPROCS when running this program?
  • How heavily does it allocate/do you expect it to allocate?
    • Does it perform these allocations concurrently?

This will help me get a better idea of whether this will be a win, or whether this is a loss in single-threaded performance or something else.
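
A minimal sketch of one way to collect these numbers from inside the program itself, using only the standard library; reportGCStats is a hypothetical helper, and GCCPUFraction is the runtime's own estimate of the fraction of CPU time spent in GC since the program started:

package main

import (
	"fmt"
	"runtime"
)

// reportGCStats prints the figures relevant to the questions above:
// GOMAXPROCS, how much was allocated, and how much CPU time the
// runtime attributes to GC.
func reportGCStats() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Printf("GOMAXPROCS:    %d\n", runtime.GOMAXPROCS(0))
	fmt.Printf("TotalAlloc:    %d bytes\n", ms.TotalAlloc)
	fmt.Printf("Mallocs:       %d objects\n", ms.Mallocs)
	fmt.Printf("HeapAlloc:     %d bytes\n", ms.HeapAlloc)
	fmt.Printf("NumGC:         %d cycles\n", ms.NumGC)
	fmt.Printf("GCCPUFraction: %.4f\n", ms.GCCPUFraction)
}

func main() {
	// ... run the workload here, then report what the runtime saw.
	reportGCStats()
}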

ardan-bkennedy (Author) commented on Nov 7, 2019

Hardware Overview:

  Model Name:	MacBook Pro
  Model Identifier:	MacBookPro15,1
  Processor Name:	6-Core Intel Core i9
  Processor Speed:	2.9 GHz
  Number of Processors:	1
  Total Number of Cores:	6
  L2 Cache (per Core):	256 KB
  L3 Cache:	12 MB
  Hyper-Threading Technology:	Enabled
  Memory:	32 GB

This runs as a 12-thread Go program, so the code is using a pool of 12 goroutines and the GC is keeping the heap at 4 MB. In the version of the code that creates a goroutine per file, I see the heap grow as high as 80 MB.

The program is opening, reading, decoding, and searching 4000 files, so it's memory intensive to an extent. Throwing 4000 goroutines at this problem on tip finishes the work faster than using a pool. That was never the case in 1.13.
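
For readers who don't want to open the repo, a rough sketch of the pool pattern described here, assuming a search function that returns the hit count for one document; this is not the repo's actual freqNumCPU code, just the shape of it:

package sketch

import (
	"runtime"
	"sync"
	"sync/atomic"
)

// freqPool fans the documents out to GOMAXPROCS worker goroutines
// (12 on the machine described above) and totals the matches.
func freqPool(topic string, docs []string, search func(topic, doc string) int) int {
	ch := make(chan string, len(docs))
	for _, doc := range docs {
		ch <- doc
	}
	close(ch)

	var found int32
	g := runtime.GOMAXPROCS(0)
	var wg sync.WaitGroup
	wg.Add(g)
	for i := 0; i < g; i++ {
		go func() {
			defer wg.Done()
			for doc := range ch {
				atomic.AddInt32(&found, int32(search(topic, doc)))
			}
		}()
	}
	wg.Wait()
	return int(found)
}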

ardan-bkennedy (Author) commented on Nov 7, 2019

I find this interesting. This is my understanding.

A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more M's can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC Percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.

This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy allocation scenarios. This will increase the amount of MA time and reduce the application throughput during a GC?

Your hope is the performance loss there is gained back in the allocator?

In the end, the heap size remains as small as possible?

mknyszek (Contributor) commented on Nov 7, 2019

> I find this interesting. This is my understanding.
>
> A priority of the pacer is to maintain a smaller heap over time and to reduce mark assist (MA) so more M's can be used for application work during any GC cycle. A GC may start early (before the heap reaches the GC Percent threshold) if it means reducing MA time. In the end, the total GC time would stay at or below 25%.

Pretty much, though I wouldn't characterize it as "may start early", but rather as just "starts earlier". It's the pacer's job to drive GC use to 25%, and its primary tool for doing so is deciding when to start a GC.

> This change is allowing the GC time to grow above 25% to help reduce the size of the heap in some heavy allocation scenarios. This will increase the amount of MA time and reduce the application throughput during a GC?

Both latency and throughput, but yes that's correct.

> Your hope is the performance loss there is gained back in the allocator?

Correct. A heavily allocating RPC benchmark was able to drive the pacer to start a GC at the halfway point (trigger ratio = 0.5) in Go 1.13. The same benchmark drove the trigger ratio to 0.01 with the new allocator. The most convincing evidence that the allocator simply got faster was that the only thing that brought the trigger ratio back up was adding a sleep on the critical path.
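
To make those trigger ratios concrete, a small worked sketch, assuming the pre-1.14 pacer rule that the next cycle triggers at roughly heap_marked * (1 + triggerRatio), and using the ~4 MB live heap reported earlier in this thread:

package main

import "fmt"

func main() {
	heapMarked := 4.0 // MB live after the previous mark phase (figure reported above)

	// Go 1.13-style pacing on the RPC benchmark: triggerRatio ≈ 0.5,
	// so the next cycle starts once the heap grows to about 6 MB.
	fmt.Printf("trigger at ratio 0.5:  %.2f MB\n", heapMarked*(1+0.5))

	// With the new allocator driving triggerRatio toward 0.01, the next
	// cycle starts almost immediately, at about 4.04 MB.
	fmt.Printf("trigger at ratio 0.01: %.2f MB\n", heapMarked*(1+0.01))
}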

In the end, this RPC benchmark saw a significant improvement in tail latency (-20% or more) and throughput (+30% or more), even with the new threshold.

> In the end, the heap size remains as small as possible?

Not quite. The threshold in the CL above was chosen to keep the heap size roughly the same across Go versions.

The NeedsInvestigation label was added on Nov 8, 2019.
This issue was added to the Go1.14 milestone on Nov 8, 2019.
ianlancetaylor (Contributor) commented on Nov 16, 2019

@mknyszek Can you let @ardan-bkennedy know when to re-run tests for this issue? Thanks.

mknyszek (Contributor) commented on Nov 18, 2019

@ianlancetaylor Ah! Sorry. I completely forgot.

@ardan-bkennedy If it's on linux, windows, or freebsd, feel free to try again from tip any time. :) I'm still working out some issues on the less popular platforms.

ardan-bkennedy (Author) commented on Nov 19, 2019

@mknyszek I am running on a Mac. I need time to test this on linux.

Side Note: I find it interesting that you consider Darwin a less popular platform when most developers I know are working on that platform?

mknyszek (Contributor) commented on Nov 19, 2019

> @mknyszek I am running on a Mac. I need time to test this on linux.
>
> Side Note: I find it interesting that you consider Darwin a less popular platform when most developers I know are working on that platform?

@ardan-bkennedy That's my mistake, I omitted it by accident. I do consider it a popular platform. Please give it a try.

The "less popular" platforms I had in mind were AIX and OpenBSD, so really anything that's not those two, though AIX should be OK now.

ardan-bkennedy (Author) commented on Nov 19, 2019

I gave Darwin a try earlier today and GC actually ran closer to 60%. I just downloaded tip once more and ran it again. The program looks like it is running slower, but GC looks like it is at 37%.

12 Days Ago

GC | 282,674,620 ns | 282,674,620 ns | 674,641 ns | 419
Selection start: 3,595,151 ns
Selection extent: 845,408,873 ns
Total Run time: 849.3ms

Tonight
go version devel +8b1e8a424a Tue Nov 19 19:59:21 2019 +0000 darwin/amd64

GC | 341,260,881 ns | 341,260,881 ns | 773,834 ns | 441
Selection start: 2,867,305 ns
Selection extent: 915,831,104 ns
Total Run time: 919.96ms

mknyszek (Contributor) commented on Nov 20, 2019

@ardan-bkennedy Interesting. I wonder to what extent we're seeing additional start-up costs here, considering that the application only runs for about a second (though 70 ms is kind of a lot, I wonder what the distributions look like).
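
One way to separate start-up cost from the search itself would be to time only the call in main; this is a hypothetical sketch using a stub in place of the repo's search function, not the actual code:

package main

import (
	"fmt"
	"time"
)

// freqNumCPU is a stub standing in for the repo's real function.
func freqNumCPU(topic string, docs []string) int { return 0 }

func main() {
	topic := "president"
	var docs []string // the 4000 file names in the real program

	// Time only the search phase so repeated runs can be compared
	// without the process start-up cost folded in.
	start := time.Now()
	n := freqNumCPU(topic, docs)
	fmt.Printf("found %d instances of %q in %v\n", n, topic, time.Since(start))
}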

ardan-bkennedy (Author) commented on Nov 20, 2019

@mknyszek The program is available for you to run; I added instructions earlier. On my machine I expect the program to run as fast as the fan-out version I run with 4000 goroutines, which is ~750 ms.

17 remaining items in the thread are not shown here.


Metadata

Assignees: none
Labels: FrozenDueToAge, NeedsInvestigation
Participants: @toothrot, @mknyszek, @agnivade, @ardan-bkennedy, @aclements