
runtime: GC causes latency spikes #14812

Closed
@Dieterbe

Description

@Dieterbe

Hello,
while running the program at https://github.com/raintank/raintank-metric/tree/master/metric_tank
I'm seeing mark-and-scan times of 15s of CPU time and 2000~2500 ms of wall-clock time (8-core system) for a heap of about 6.5GB
(STW pauses are fine and ~1ms)
I used https://circleci.com/gh/raintank/raintank-metric/507 to obtain the data below.

$ metric_tank --version
metrics_tank (built with go1.6, git hash 8897ef4f8f8f1a2585ee88ecadee501bfc1a4139)
$ go version
go version go1.6 linux/amd64
$ uname -a #on the host where the app runs
Linux metric-tank-3-qa 3.19.0-43-generic #49~14.04.1-Ubuntu SMP Thu Dec 31 15:44:49 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

I know the app is currently not optimized for GC workload: while I've gotten allocations down in various parts of the program, there are currently probably a million or more live pointers referencing pieces of data. I was going to work on optimizing this when Dave Cheney suggested there's a problem with the runtime and that I should file a bug (https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg)
Here's the log with gctrace and schedtrace enabled: https://gist.githubusercontent.com/Dieterbe/18453451c5af0cdececa/raw/9c4f2abd85bb7a815c6cda5c1828334d3d29817d/log.txt

At http://dieter.plaetinck.be/files/go/mt3-qa-gc-vs-no-gc.zip you'll find a zip containing this log, the binary, a CPU profile taken during GC run 1482, and a CPU and heap profile taken between runs 1482 and 1483.
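
For reference, gctrace and schedtrace output like that log is enabled through the GODEBUG environment variable; the exact invocation isn't part of this report, so the 1000 ms schedtrace interval below is only an assumption:

$ GODEBUG=gctrace=1,schedtrace=1000 ./metric_tank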

I also have these two dashboards that seem useful (they both end just after the spike induced by GC run 1482):
https://snapshot.raintank.io/dashboard/snapshot/MtLqvc4F6015zbs4iMQSPzfizvG7OQjC
shows memory usage, GC runs, and STW pause times. It also shows that the incoming request load on the app is constant, which tells me that any extra load is caused by the GC, not by a changing workload.
https://snapshot.raintank.io/dashboard/snapshot/c2zwTZCF7BmfyzEuGF6cHN9GX9aM1V99
shows the system stats; note the CPU spikes corresponding to the GC workload.

let me know if there's anything else I can provide,
thanks,
Dieter.

Activity

RLH

RLH commented on Mar 14, 2016

@RLH
Contributor

The GC will use as much CPU as is available. If your program is basically idle, which it appears to be, the GC will use the idle CPU and CPU load will naturally go up. If your application is active and load is already high, then the GC will limit its load to 25% of GOMAXPROCS. The mark and scan phase is concurrent, so it is unclear how it is adversely affecting your idle application.

changed the title from "mark and scan needs excessive amount of time (15s for 6.5GB heap)" to "runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap)" on Mar 14, 2016
added this to the Go1.7 milestone on Mar 14, 2016
Dieterbe

Dieterbe commented on Mar 16, 2016

@Dieterbe
Contributor, Author

The mark and scan phase is concurrent, it is unclear how it is adversely affecting your idle application.

Just a guess, but perhaps the cause is the extra workload induced by the write barrier? (I watched your GopherCon talk again today :) Interestingly, when I use top, I have never been able to catch a core running at 100%.

But you're right that there are essentially two things going on, which may or may not be related:

  • mark and scan takes too long
  • the app slows down during GC runs, while the CPU cores don't saturate.

Let me know how I can help.

aclements

aclements commented on Apr 3, 2016

@aclements
Member

Hi @Dieterbe, could you clarify what the issue is? 15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

If it's that response time goes up during GC, could you try the CL in #15022? (And, if you're feeling adventurous, there's also https://go-review.googlesource.com/21036 and https://go-review.googlesource.com/21282)

Dieterbe

Dieterbe commented on Apr 4, 2016

@Dieterbe
Contributor, Author

Hey @aclements!

15s for 6.5GB is actually pretty good (I get ~5s/GB of CPU time on some benchmarks locally, but this can vary a lot based on heap layout and hardware).

OK, fair enough. I just reported this here because @davecheney mentioned in
https://groups.google.com/forum/#!topic/golang-nuts/Q0rXKYjy1cg
that 1.5s for 5GB was unexpected and that I should open a ticket for it; hence this ticket.

If it's that the CPU utilization goes up during GC, please clarify why this is a problem (the GC has to do its work somehow, and FPGA accelerators for GC are still an open area of research :)

Of course, this by itself is not a problem.

If it's that response time goes up during GC, could you try the CL in #15022?

Initially the ticket wasn't about this, but it was brought up and is definitely a problem for us, so from now on we may as well consider this the issue at hand.
I recompiled my app with a Go toolchain that includes your patch and did a test run before and after.
Unfortunately I see no change and the latency spikes are still there (details at grafana/metrictank#172).
Note that I can verify this problem quite early on; e.g. in this case I've seen spikes as early as GC run 270. The issue is probably there much earlier, but my app needs to load a lot of data before I can test. The bug mentioned in #15022 looks like it only activates after a sufficient number of GC runs.

changed the title from "runtime: mark and scan needs excessive amount of time (15s for 6.5GB heap)" to "runtime: GC causes latency spikes" on Apr 22, 2016
aclements

aclements commented on May 16, 2016

@aclements
Member

@Dieterbe, would it be possible for you to collect a runtime trace (https://godoc.org/runtime/trace) around one of the periods of increased latency? If you do this with current master, the trace file will be entirely self-contained (otherwise, I'll also need the binary to read the trace file).

I have a hunch about what could be causing this. GC shrinks the stacks, so if many of your goroutines are constantly growing and shrinking the amount of stack they're using by at least a factor of 4, you would see a spike as many goroutines re-grew their stacks after the shrink. This should be more smeared out on master than with Go 1.6 since f11e4eb made shrinking happen concurrently at the beginning of the GC cycle, but if this is the problem I don't think that would have completely mitigated it. (Unfortunately, the trace doesn't say when stack growth happens, so it wouldn't be a smoking gun, but if many distinct goroutines have high latency right after GC that will be some evidence for this theory.)
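
For reference, a trace like this can also be collected directly with the runtime/trace package rather than over HTTP; a minimal sketch, with an arbitrary file name and a 20-second window:

package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// Write an execution trace that `go tool trace` can read.
	f, err := os.Create("trace.bin")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	// Keep tracing long enough to span at least one GC cycle and the
	// latency spike that follows it.
	time.Sleep(20 * time.Second)
	trace.Stop()
}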

Dieterbe

Dieterbe commented on May 16, 2016

@Dieterbe
Contributor, Author

Hey @aclements
I ran: curl 'http://localhost:6063/debug/pprof/trace?seconds=20' > trace.bin
About 5~7 seconds in, I think (it's a bit hard to tell), is where the GC kicks in and a latency spike was observed.
Files: http://dieter.plaetinck.be/files/go-gc-team-is-awesome/trace.bin and, for the binary, http://dieter.plaetinck.be/files/go-gc-team-is-awesome/metric_tank (compiled with the official 1.6.2). Hopefully this helps to diagnose; if not, let me know and maybe I can get a better trace.
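
For that curl command to work, the app has to expose the pprof trace endpoint; a minimal sketch of the usual net/http/pprof wiring, assuming only that the listener uses the same localhost:6063 address as above (not taken from metric_tank's actual code):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers, including /debug/pprof/trace
)

func main() {
	// Serve the pprof endpoints on the port the curl command above targets.
	go func() {
		log.Println(http.ListenAndServe("localhost:6063", nil))
	}()

	select {} // stand-in for the application's real work
}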

modified the milestones: Go1.7Maybe, Go1.7 on May 24, 2016
Dieterbe

Dieterbe commented on May 30, 2016

@Dieterbe
Contributor, Author

I read through #9477 and #10345 and wonder if this issue is another similar case. Note that this app is centered around a map (https://github.com/raintank/raintank-metric/blob/master/metric_tank/mdata/aggmetrics.go#L13) that has just over 1M values (and each value in turn has a bunch of pointers to things that have more pointers, with lots of strings involved too). Optimizing this is on my todo list, but in the meantime I wonder if maybe a GC thread blocks the map, leaving other application threads (mutators) unable to interact with it. Since everything in the app needs this map, that could explain the slowdowns.

aclements

aclements commented on May 31, 2016

@aclements
Member

@Dieterbe, it's possible. Could you try the fix I posted for #10345? (https://golang.org/cl/23540)

Note that it's not that the GC thread blocks the map. Mutators are free to read and write the map while GC is scanning it; there's no synchronization on the map itself. The issue is that whatever thread gets picked to scan the buckets array of the map is stuck not being able to do anything else until it's scanned the whole bucket array. If there's other mutator work queued up on that thread, it's blocked during this time.

(Sorry I haven't had a chance to dig into the trace you sent.)
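
A common mitigation for the single large bucket array described above is to shard the map so that no single scan unit is huge; a rough sketch, assuming string keys and a made-up shardedMap type (the real app would also need per-shard locking):

package main

import "hash/fnv"

const numShards = 64

type Metric struct{ /* values, pointers, strings, ... */ }

// shardedMap splits one huge map into many smaller ones so the GC can
// scan each (much smaller) bucket array as a separate unit of work.
type shardedMap struct {
	shards [numShards]map[string]*Metric
}

func newShardedMap() *shardedMap {
	m := &shardedMap{}
	for i := range m.shards {
		m.shards[i] = make(map[string]*Metric)
	}
	return m
}

func (m *shardedMap) shard(key string) map[string]*Metric {
	h := fnv.New32a()
	h.Write([]byte(key))
	return m.shards[h.Sum32()%numShards]
}

func (m *shardedMap) Get(key string) *Metric    { return m.shard(key)[key] }
func (m *shardedMap) Set(key string, v *Metric) { m.shard(key)[key] = v }

func main() {
	m := newShardedMap()
	m.Set("some.metric.name", &Metric{})
	_ = m.Get("some.metric.name")
}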

143 remaining items

gopherbot

gopherbot commented on Sep 19, 2022

@gopherbot
Contributor

Change https://go.dev/cl/431877 mentions this issue: runtime: export total GC Assist ns in MemStats and GCStats

mknyszek

mknyszek commented on Sep 19, 2022

@mknyszek
Contributor

Go 1.18 introduced a new GC pacer (that reduces the amount of assists) and 1.19 introduced GOMEMLIMIT (I saw GOMAXHEAP mentioned somewhere earlier). We've also bounded sweeping on the allocation path back in Go 1.15, I believe. Skimming over the issue, I get the impression that there's a chance some or all of the issues that have remained here, beyond what was already fixed earlier, have been addressed by these changes. It's possible that may not be the case, but many of the sub-threads are fairly old.

I'm inclined to put this into WaitingForInfo unless anyone here wants to chime in with an update. We can always file new issues if it turns out something remains (and it'll probably be clearer and easier to manage than continuing a conversation that started halfway through this issue :)). EDIT: It's already in WaitingForInfo. In that case, this is just an update.
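
For reference, both of the knobs mentioned above can be set from code as well as through the environment; a minimal sketch using runtime/debug, with an arbitrary 4 GiB limit:

package main

import "runtime/debug"

func main() {
	// Roughly equivalent to GOMEMLIMIT=4GiB (Go 1.19+): a soft memory
	// limit the pacer uses to run the GC more aggressively as the heap
	// approaches it.
	debug.SetMemoryLimit(4 << 30)

	// Same as the default GOGC=100; raising it trades memory for fewer
	// GC cycles, lowering it does the opposite.
	debug.SetGCPercent(100)

	// ... application work ...
}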

added WaitingForInfo and removed WaitingForInfo on Dec 26, 2022
Salamandastron1

Salamandastron1 commented on Jan 24, 2023

@Salamandastron1

I have seen others succeed in speeding up their applications by reducing the number of allocations they make. If Go accommodated this way of working more directly, it would be easier than optimizing the compiler further. Right off the bat, it would be ideal to support passing the multiple return values of one function directly into another, to reduce the need to allocate for an error.

gopherbot

gopherbot commented on Jan 26, 2023

@gopherbot
Contributor

Timed out in state WaitingForInfo. Closing.

(I am just a bot, though. Please speak up if this is a mistake or you have the requested information.)

locked and limited conversation to collaborators on Jan 26, 2024

Metadata

Assignees

No one assigned

    Labels

    FrozenDueToAge, NeedsInvestigation, Performance, WaitingForInfo, compiler/runtime, early-in-cycle

    Type

    No type

    Projects

    No projects

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @Dieterbe, @wendigo, @rsc, @quentinmit, @mkevac
