
runtime: aggressive GC completion is disruptive to co-tenants #17969

Open
@rhysh

Description

What version of Go are you using (go version)?

go version devel +1e3c57c Wed Nov 16 20:31:40 2016 +0000 with CL 33093 PS 4 cherry-picked on top, with GOEXPERIMENT=preemptibleloops

What operating system and processor architecture are you using (go env)?

linux/amd64

What did you do?

I have a program that reads ~1MB records from stdin, decodes the records, and sends UDP datagrams. The program runs with around 20 goroutines. Around 40 copies of the program run on an 8-core host.

What did you expect to see?

I expected the 95th percentile of mark termination and sweep termination pauses to be 100µs or less.

What did you see instead?

The program's 95th percentile sweep termination time is around 60ms, and 95th percentile mark termination time is around 30ms. Here's an example gctrace line:

gc 11249 @74577.976s 0%: 11+185+35 ms clock, 90+143/358/582+280 ms cpu, 112->114->56 MB, 115 MB goal, 8 P
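
For context, lines like this come from running the program with GODEBUG=gctrace=1: the three "ms clock" numbers are the wall-clock durations of the STW sweep termination, the concurrent mark and scan, and the STW mark termination phases. Here is a minimal sketch of a filter that pulls the two STW phases out of gctrace output (illustrative only, not the exact tooling behind the distributions below):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

// clockRE matches the three-phase "ms clock" section of a Go 1.8-era gctrace
// line, e.g. "gc 11249 @74577.976s 0%: 11+185+35 ms clock, ...".
// Capture 1 is the STW sweep termination time and capture 2 is the STW mark
// termination time, both in milliseconds.
var clockRE = regexp.MustCompile(`: ([0-9.]+)\+[0-9.]+\+([0-9.]+) ms clock`)

func main() {
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		m := clockRE.FindStringSubmatch(sc.Text())
		if m == nil {
			continue // not a gctrace line
		}
		sweepTerm, _ := strconv.ParseFloat(m[1], 64)
		markTerm, _ := strconv.ParseFloat(m[2], 64)
		fmt.Printf("sweep-term %gms\tmark-term %gms\n", sweepTerm, markTerm)
	}
}
```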

The mark termination pauses are very easy to identify in the execution trace—if this bug needs to be about either sweep term pauses or mark term pauses, let's use it for mark term.

Sweep termination pause distribution (milliseconds):

N 10000  sum 115454  mean 11.5454  gmean 0.602555  std dev 22.7221  variance 516.294

     min 0.011
   1%ile 0.014
   5%ile 0.016
  25%ile 0.023
  median 0.34
  75%ile 12
  95%ile 59
  99%ile 107
     max 199

⠀⠀⠀⠀⠀⠀⠰⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.274
⠀⠀⠀⠀⠀⠀⡂⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠄⠰⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                     50                     100

Mark termination pause distribution (milliseconds):

N 10000  sum 61391.3  mean 6.13913  gmean 0.93492  std dev 12.9254  variance 167.066

     min 0.04
   1%ile 0.053
   5%ile 0.064
  25%ile 0.1
  median 0.82
  75%ile 6.6
  95%ile 29
  99%ile 64
     max 209

⠀⠀⠀⠀⠀⠀⠰⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.447
⠀⠀⠀⠀⠀⠀⡁⠄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠄⠑⠒⠦⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                                   50

And here's the execution trace of a slow mark termination pause, taking over 16ms. Proc 6 organizes the mark termination phase. Procs 1, 2, 3, 4, and 5 execute it quickly at around t=1007ms. Proc 0 does mark termination around t=1014ms, and Proc 7 delays until t=1024ms at which point the global pause concludes.

(screenshot: execution trace of a mark termination pause lasting over 16ms, taken 2016-11-17)
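
For reference, execution traces like this one can be captured with the runtime/trace package and browsed with go tool trace; a minimal sketch (the trace.out file name is only an example):

```go
package main

import (
	"log"
	"os"
	"runtime/trace"
	"time"
)

func main() {
	// Write the execution trace to a file; inspect it afterwards with
	// `go tool trace trace.out`.
	f, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := trace.Start(f); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the workload of interest while the trace is recording ...
	time.Sleep(5 * time.Second)
}
```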

Activity

dr2chase commented on Nov 17, 2016

A quick sanity check -- does this mean that the total number of goroutines on your 8-core box is about 800? (40 copies times 20 goroutines). What's the OS scheduling quantum? How many of those per-process goroutines are eligible to run?

My first, lightly informed guess is that some of the "running" goroutines are actually waiting to be given a core by the kernel. They aren't waiting at a Go safe point, so each such thread cannot reach a safe point until the kernel lets it run, and the GC cannot finish a phase until every thread has reached a safe point.

rhysh commented on Nov 17, 2016

Here's the execution trace of a 77ms mark termination pause:

(screenshot: execution trace of the 77ms mark termination pause, taken 2016-11-17)

Proc 0 does mark termination 72ms after Proc 4 begins the phase, and 68ms after the other straggler (Proc 1) observes the phase. There's an additional 5ms delay between when Proc 0 finishes its mark termination work and when Proc 4 declares the phase complete.

rhysh commented on Nov 17, 2016

@dr2chase Yes, there would be around 800 goroutines total on the machine.

From /proc/sys/kernel/sched_rr_timeslice_ms, the scheduling quantum appears to be 25ms.

Each instance of the program is generally idle, waiting for a record to come on stdin. When one arrives, it's processed by a single goroutine. That goroutine later hands the data off to another goroutine which does some more analysis. Each process usually has 0 running goroutines. Sometimes they'll have 1 running goroutine. Occasionally for bursts of around 100µs there'll be up to three goroutines running in parallel in a process.
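
To make that shape concrete, here is an illustrative sketch of the per-process structure (hypothetical names and addresses, not the real code): one goroutine reads records from stdin and a second goroutine does the follow-up analysis and sends UDP datagrams.

```go
package main

import (
	"bufio"
	"io"
	"log"
	"net"
	"os"
)

func main() {
	conn, err := net.Dial("udp", "127.0.0.1:9999") // hypothetical destination
	if err != nil {
		log.Fatal(err)
	}
	records := make(chan []byte)

	// Second stage: further analysis, then a UDP datagram per record.
	go func() {
		for rec := range records {
			if len(rec) > 1400 {
				rec = rec[:1400]
			}
			conn.Write(rec)
		}
	}()

	// First stage: read ~1MB records from stdin and hand them off.
	r := bufio.NewReaderSize(os.Stdin, 1<<20)
	buf := make([]byte, 1<<20)
	for {
		n, err := r.Read(buf)
		if n > 0 {
			rec := make([]byte, n)
			copy(rec, buf[:n])
			records <- rec
		}
		if err == io.EOF {
			close(records)
			return
		}
		if err != nil {
			log.Fatal(err)
		}
	}
}
```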

The total CPU usage of all of the programs on that machine—including the 40 Go programs and the JVM process that feeds them data—is well below 8 cores when averaged over several seconds. If the threads are unable to execute, it's not from lack of CPU—at least in the average case. Are you asking if there might be sub-second bursts of high CPU demand?

I think the execution trace indicates that all goroutines/threads are at a safe point: there appears to be a "proc stop" event following each goroutine execution before the start of the mark termination phase. There may be a problem getting each P picked up by an M (with a core from the OS) in order to run the mark termination work for that P, but it doesn't look to me like any goroutines are pausing short of a safe point.

I set /proc/sys/kernel/sched_rr_timeslice_ms to 1 on one machine, which doesn't seem to have had a significant impact on the GC pause durations:

Sweep termination pause distribution (milliseconds):

N 1000  sum 10702.4  mean 10.7024  gmean 0.591121  std dev 20.7411  variance 430.195

     min 0.011
   1%ile 0.014
   5%ile 0.016
  25%ile 0.022
  median 0.355
  75%ile 12
  95%ile 52.65
  99%ile 101.317
     max 178

⠀⠀⠀⠀⠀⠀⢰⢲⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.180
⠀⠀⠀⠀⠀⠀⠌⠈⡄⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠅⠀⠲⠒⠲⠶⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
       0                     50                      100

Mark termination pause distribution (milliseconds):

N 1000  sum 6475.96  mean 6.47596  gmean 0.931998  std dev 13.2368  variance 175.213

     min 0.04
   1%ile 0.051
   5%ile 0.06135
  25%ile 0.1
  median 0.835
  75%ile 6.55833
  95%ile 32
  99%ile 65.6633
     max 147

⠀⠀⠀⠀⠀⠀⢐⠢⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.303
⠀⠀⠀⠀⠀⠀⠌⠨⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠅⠈⠓⠒⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                                   50

Added the NeedsInvestigation label on Nov 18, 2016
Added this to the Go1.8Maybe milestone on Nov 18, 2016

RLH commented on Nov 18, 2016

40 Go programs and a JVM competing for 8 HW threads managed by the OS could result in one or more of the threads backing GOMAXPROCS being starved. The fact that the starved thread is a GC thread isn't special. Go hasn't attempted to solve these kinds of co-tenancy problems; they are hard. The GC assumes it has the machine to itself and that GOMAXPROCS represents the number of HW threads at its disposal.

One thing that might help is reducing GOMAXPROCS to 2 or 4 so that when a GC does start it doesn't grab all 8 of the HW threads.

rhysh commented on Nov 18, 2016

I've changed GOMAXPROCS to 2 and the tail latencies are now significantly better-controlled. Thanks @RLH!
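
For reference, the cap can be applied either with the GOMAXPROCS environment variable at process start or from code; a minimal sketch of the in-code route (whether to hard-code the value is workload-dependent):

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Cap the number of OS threads executing Go code simultaneously.
	// Equivalent to starting the process with GOMAXPROCS=2.
	prev := runtime.GOMAXPROCS(2)
	fmt.Printf("GOMAXPROCS lowered from %d to %d\n", prev, runtime.GOMAXPROCS(0))

	// ... rest of the program ...
}
```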

The STW phases need to happen on each P, so why are GOMAXPROCS Ms required to do that? Without addressing co-tenancy in general, could the scheduler be adjusted so that once the bit of STW bookkeeping is done on a particular P, the M releases that P and tries to grab a P that still needs to complete the phase? A change like that might allow more programs to meet the GC pause goals without requiring tuning of GOMAXPROCS.

Here are the distributions of the mark termination pauses (milliseconds) with go version devel +1e3c57c Wed Nov 16 20:31:40 2016 +0000 (meaning that the loop preemption patch is not active):

GOMAXPROCS unset (defaulting to 8):

N 10000  sum 59861.4  mean 5.98614  gmean 0.911389  std dev 12.7627  variance 162.888

     min 0.039
   1%ile 0.054
   5%ile 0.064
  25%ile 0.1
  median 0.81
  75%ile 6.3
  95%ile 29
  99%ile 65
     max 186

⠀⠀⠀⠀⠀⠀⠰⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 0.461
⠀⠀⠀⠀⠀⠀⡁⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠄⠓⠲⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                                 50

GOMAXPROCS manually set to 2:

N 10000  sum 8406.86  mean 0.840686  gmean 0.0734456  std dev 3.39466  variance 11.5237

     min 0.012
   1%ile 0.018
   5%ile 0.022
  25%ile 0.032
  median 0.04
  75%ile 0.055
  95%ile 5.3
  99%ile 15.6633
     max 90

⠀⠀⠀⠀⠀⠀⠂⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡖ 33.757
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⡇
⠠⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠤⠄⠧ 0.000
⠈⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠙⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠋⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠉⠁
      0                       10                        20

RLH commented on Nov 20, 2016

The problem isn't that the GC's STW phase needs GOMAXPROCS HW threads to do its job. However, any critical path needs at least one, and the other Go programs and the JVM are conspiring with the OS to prevent that. Any program's critical path could be delayed; it just happens that we are talking about the GC today. To make things worse, if one of the co-tenants is doing a concurrent mark it will recruit up to GOMAXPROCS idle Ps, potentially using up all the HW threads the OS is managing. The OS simply notes the pressure on the HW thread resource, takes away a HW thread in the middle of the STW critical path, and doesn't give it back for a long time.

The GOMAXPROCS=2 hack simply limits the HW threads the OS gives to any single Go program. Instead of 1 Go program being able to eat up all 8 HW threads, it now takes 4. The numbers you reported seem to confirm this.

rhysh commented on Nov 25, 2016

So the change to GOMAXPROCS makes the processes less noisy for the benefit of their neighbors, rather than making each process individually better able to run in a noisy environment.

It sounds then like there's nothing to be done for a while, at least for Go 1.8. Should this issue be closed, or postponed/rolled into future co-tenancy work?

Thanks for helping me understand this behavior.

RLH commented on Nov 25, 2016

Title changed from "runtime: multi-ms mark termination pauses, even with loop preemption active" to "runtime: aggressive GC completion is disruptive to co-tenants" on Nov 28, 2016

rhysh commented on Nov 28, 2016

On the other hand, the Go runtime is becoming a noisier neighbor: the GC is getting better and better at completing quickly. Commit 0bae74e (for #14179) landed after I took this data, but it looks like it will further those efforts. Since GOMAXPROCS defaults to NumCPU, even a mostly-idle daemon can create significant pressure on HW threads. This makes the host machine a noisier place for all programs.

Can Go be a less noisy neighbor by default? Is it worth addressing in Go 1.8 or 1.9?

To be specific: why is it beneficial to complete GC cycles quickly?

  • Write barrier overhead is measurable but low, and my understanding of the plans for ROC is that the barrier would need to be enabled all of the time.
  • The known latency bugs due to weird assist behavior are nearly solved (#14812, "runtime: GC causes latency spikes", remains outstanding), so there's less risk to having mutator assists enabled for a longer time.
  • A faster GC cycle will result in less floating garbage, but the volume of new floating garbage is bounded by the mutator assists.

As the behavior and performance of the GC improves, what effect do the idle mark workers have on the performance of a single Go program? They seem to have a negative effect on neighboring programs, at least in aggregate.

/cc @RLH @aclements

(29 remaining timeline items not shown)

modified the milestones: Go1.10, Go1.11 on Nov 22, 2017
modified the milestones: Go1.11, Go1.12 on Jul 9, 2018
modified the milestones: Go1.12, Go1.13 on Jan 8, 2019
modified the milestones: Go1.13, Go1.14 on Jun 25, 2019
modified the milestones: Go1.14, Backlog on Oct 9, 2019

Metadata

Assignees: no one assigned

Labels: NeedsInvestigation, Performance, early-in-cycle

Projects: none

Development: no branches or pull requests

Participants: @bradfitz, @josharian, @rsc, @quentinmit, @minux
