runtime: performance degradation for versions > 1.4.x #14189
gctrace reports that you have 8 HW threads (GOMAXPROCS=8) and the GC is …
@RLH I'll try that now with …
@RLH So, some preliminary results. Interesting. Using …
So, the response time for both came down, with a greater improvement on 1.6rc1. It's now in line with what I'd expect for the new GC, but we're still seeing better performance with 1.4.3 on overall throughput. Does the GOGC setting behave the same?
Yes, GOGC behaves the same.
The unrelated discussion concerning tsm1.Cache needs to be moved to its own issue.
It looks like our GC rate in 1.6 is 1.5x what it was in 1.4. Based on your logs, go-1.4.3.txt does 762 GCs in 8.73 minutes = 87 GCs/minute and go-1.6.txt does 856 GCs in 6.35 minutes = 135 GCs/minute. I haven't analyzed the heap sizes yet, but that suggests we're keeping the heap somewhat smaller than 1.4 did, which is simply forcing the GC to do more work. It would also be good to compare the scan rates (bytes/sec). Comparing the heap size CDFs and the scan rates should give us a better sense of whether this is a raw GC throughput problem or a heap trigger problem.
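For anyone who wants to reproduce this kind of comparison from their own logs, here is a minimal sketch (mine, not from the thread) that counts GC cycles per minute from a `GODEBUG=gctrace=1` log. The file name and the exact gctrace field layout are assumptions and may need adjusting per Go version.

```go
// gcrate.go: rough GC-frequency estimate from a gctrace log.
// Assumes Go 1.5+ gctrace lines of the form "gc N @T.TTTs ...".
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open(os.Args[1]) // e.g. go-1.6.txt
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var count int
	var lastAt float64 // timestamp (seconds since start) of the last GC seen
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// gctrace lines look like: gc 42 @123.456s 1%: ...
		if len(fields) > 2 && fields[0] == "gc" && strings.HasPrefix(fields[2], "@") {
			count++
			fmt.Sscanf(fields[2], "@%fs", &lastAt)
		}
	}
	if lastAt > 0 {
		fmt.Printf("%d GCs in %.1f min = %.0f GCs/min\n",
			count, lastAt/60, float64(count)/(lastAt/60))
	}
}
```

Run it as `go run gcrate.go go-1.6.txt`; the same approach extends to pulling out heap sizes for a CDF.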
We have a similar issue that is preventing us from moving to Go 1.5 on our production services. I have some GODEBUG gctrace outputs comparing Go 1.4.3 and 1.5.2 under various values of GOGC. Would these be of interest for this issue?
As of today our focus is 1.6 and we won't be changing 1.5. Do you have 1.6 numbers?
I will work on getting the equivalent 1.6 numbers.
Here are details of a similar degradation we are experiencing that is keeping us on Go 1.4.3. For context, the service I took these measurements from is one instance in a cluster of real-time advertising bidders. It receives bid requests via HTTP and performs two key operations that I timed. The first, "augment", makes a single HTTP call to a Couchbase cluster (using https://github.com/couchbase/go-couchbase) which returns about 5kB of protobuf-encoded data. The second, "evaluate", is a pure Go section of the code that makes no network calls and is heavily optimised to avoid garbage generation. Responses from our service are expected to complete within 50ms.

I ran each test for 10 minutes with gctrace=1 and various values of GOGC. Our instrumentation takes samples every 10 seconds. The service is running on an Amazon c4.xlarge instance with 4 CPUs and 7.5GB RAM. At the time of the Go 1.5 and 1.4.3 tests the service was receiving about 1100 requests per second. When I ran the 1.6 tests the rate was about 1400 per second. We rate this service on this instance type at about 2000 requests per second max (under Go 1.4.3).
Charts of the 99.9th percentile (orange) and maximum (blue) timings for the augment operation (Go 1.4.3 is the last section of the top graph, starting at 10:22):
And the same for the evaluate operation:
Overall I would characterise the results as: Go 1.5 and 1.6 produce slower timings than 1.4.3, but they are more consistent. The 99.9th centiles were larger, but there were fewer instances where the maximum time exceeded our deadlines. Attached are the gctrace outputs from each of the tests: go1.4-gogcdefault.txt, …
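As an aside for anyone repeating these GOGC sweeps: the same ratio can also be changed from inside the process. A minimal sketch (mine, not from the thread) using the standard runtime/debug.SetGCPercent API; the specific value is only an example.

```go
package main

import (
	"fmt"
	"runtime/debug"
	"time"
)

func main() {
	// Equivalent to starting the process with GOGC=200: the heap may grow
	// to roughly 3x the live set before the next collection is triggered.
	old := debug.SetGCPercent(200)
	fmt.Println("previous GOGC:", old)

	// ... run the workload being measured ...
	time.Sleep(time.Second)
}
```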
Thanks Ian,

Looking at the raw 1.6 gctrace numbers in go1.6-gogc100, go1.6-gogc200, and …

The 10 minute go1.6-gogc800 had an interesting single anomaly. Only 21 GC …

I'm concluding that the latency problems are not related to the GC.

The only vaguely similar numbers that I have seen were when the OS / …

It looks like GOMAXPROCS is set to 4; was anything else running on the machine?
There are other support processes on that instance (e.g. a consul agent) but they have very low CPU utilization and memory usage. GOMAXPROCS was 4 for that service (in 1.5 builds we don't set it at all, so it's left as the Go default).

I saw the interrupt/spike in the GC times instrumentation we also capture. I don't know what that is for sure, but it could have been a refresh of an in-memory data cache that we use as part of our bid prediction logic. I can re-run the test with those settings if you think it is worth it.

With the analysis you gave I agree that it's probably not the time spent in GC that is causing the latency. However, there is lower latency when there are fewer GC cycles. Do you think it could be thread contention? Would repeating with different GOMAXPROCS give any insight?
If a CPU is being taken away, setting GOMAXPROCS to 3 will give the process …
@iand, CPU profiles may be enlightening. It's not exactly the right tool, but if this is being caused by something in the runtime or the standard libraries, it's possible it would show up as out of place in a profile.
I retested the Go 1.5.2 and 1.6 versions of our service with varying GOMAXPROCS values (3, 6 and 8). GOGC was left as default. I didn't see any appreciable change in latency. If I squinted hard I might say that the network operation "augment" saw a slight improvement at higher GOMAXPROCS and the CPU operation "evaluate" saw a slight worsening. I can post images of our metrics, but it hardly seems worth the noise.

I took multiple CPU profiles of the service with default settings for GOGC and GOMAXPROCS under Go 1.4.3, 1.5.2 and 1.6. I can make those available with binaries privately if you think they will help. Email me at [email protected] and I will share a link to them. I have had a superficial look at them and nothing leaps out as obviously different.
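For anyone wanting to gather comparable profiles, a minimal sketch (not from the thread) using the standard runtime/pprof API; the file name, the GOMAXPROCS value, and the 30-second window are arbitrary choices.

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

func main() {
	// Optionally pin the scheduler to the value under test (3, 6, 8, ...).
	runtime.GOMAXPROCS(3)

	f, err := os.Create("cpu-go1.6.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// ... serve traffic for the measurement window ...
	time.Sleep(30 * time.Second)
}
```

The resulting file can then be compared across Go versions with `go tool pprof <binary> cpu-go1.6.pprof`.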
CL https://golang.org/cl/21503 mentions this issue.
Two GC-related functions, scang and casgstatus, wait in an active spin loop. Active spinning is never a good idea in user-space. Once we wait several times more than the expected wait time, something unexpected is happening (e.g. the thread we are waiting for is descheduled or handling a page fault) and we need to yield to the OS scheduler. Moreover, the expected wait time is very high for these functions: scang wait time can be tens of milliseconds, casgstatus can be hundreds of microseconds. It does not make sense to spin even for that time.

go install -a std profile on a 4-core machine shows that 11% of time is spent in the active spin in scang:

```
 6.12%  compile  compile  [.] runtime.scang
 3.27%  compile  compile  [.] runtime.readgstatus
 1.72%  compile  compile  [.] runtime/internal/atomic.Load
```

The active spin also increases tail latency in the case of the slightest oversubscription: GC goroutines spend a whole quantum in the loop instead of executing user code.

Here is the scang wait time histogram during go install -a std:

```
   13707.0000 -  1815442.7667 [118]
 1815442.7667 -  3617178.5333 [  9]
 3617178.5333 -  5418914.3000 [ 11]
 5418914.3000 -  7220650.0667 [  5]
 7220650.0667 -  9022385.8333 [ 12]
 9022385.8333 - 10824121.6000 [ 13]
10824121.6000 - 12625857.3667 [ 15]
12625857.3667 - 14427593.1333 [ 18]
14427593.1333 - 16229328.9000 [ 18]
16229328.9000 - 18031064.6667 [ 32]
18031064.6667 - 19832800.4333 [ 28]
19832800.4333 - 21634536.2000 [  6]
21634536.2000 - 23436271.9667 [ 15]
23436271.9667 - 25238007.7333 [ 11]
25238007.7333 - 27039743.5000 [ 27]
27039743.5000 - 28841479.2667 [ 20]
28841479.2667 - 30643215.0333 [ 10]
30643215.0333 - 32444950.8000 [  7]
32444950.8000 - 34246686.5667 [  4]
34246686.5667 - 36048422.3333 [  4]
36048422.3333 - 37850158.1000 [  1]
37850158.1000 - 39651893.8667 [  5]
39651893.8667 - 41453629.6333 [  2]
41453629.6333 - 43255365.4000 [  2]
43255365.4000 - 45057101.1667 [  2]
45057101.1667 - 46858836.9333 [  1]
46858836.9333 - 48660572.7000 [  2]
48660572.7000 - 50462308.4667 [  3]
50462308.4667 - 52264044.2333 [  2]
52264044.2333 - 54065780.0000 [  2]
```

and the zoomed-in first part:

```
 13707.0000 -  19916.7667 [  2]
 19916.7667 -  26126.5333 [  2]
 26126.5333 -  32336.3000 [  9]
 32336.3000 -  38546.0667 [  8]
 38546.0667 -  44755.8333 [ 12]
 44755.8333 -  50965.6000 [ 10]
 50965.6000 -  57175.3667 [  5]
 57175.3667 -  63385.1333 [  6]
 63385.1333 -  69594.9000 [  5]
 69594.9000 -  75804.6667 [  6]
 75804.6667 -  82014.4333 [  6]
 82014.4333 -  88224.2000 [  4]
 88224.2000 -  94433.9667 [  1]
 94433.9667 - 100643.7333 [  1]
100643.7333 - 106853.5000 [  2]
106853.5000 - 113063.2667 [  0]
113063.2667 - 119273.0333 [  2]
119273.0333 - 125482.8000 [  2]
125482.8000 - 131692.5667 [  1]
131692.5667 - 137902.3333 [  1]
137902.3333 - 144112.1000 [  0]
144112.1000 - 150321.8667 [  2]
150321.8667 - 156531.6333 [  1]
156531.6333 - 162741.4000 [  1]
162741.4000 - 168951.1667 [  0]
168951.1667 - 175160.9333 [  0]
175160.9333 - 181370.7000 [  1]
181370.7000 - 187580.4667 [  1]
187580.4667 - 193790.2333 [  2]
193790.2333 - 200000.0000 [  0]
```

Here is the casgstatus wait time histogram:

```
   631.0000 -   5276.6333 [  3]
  5276.6333 -   9922.2667 [  5]
  9922.2667 -  14567.9000 [  2]
 14567.9000 -  19213.5333 [  6]
 19213.5333 -  23859.1667 [  5]
 23859.1667 -  28504.8000 [  6]
 28504.8000 -  33150.4333 [  6]
 33150.4333 -  37796.0667 [  2]
 37796.0667 -  42441.7000 [  1]
 42441.7000 -  47087.3333 [  3]
 47087.3333 -  51732.9667 [  0]
 51732.9667 -  56378.6000 [  1]
 56378.6000 -  61024.2333 [  0]
 61024.2333 -  65669.8667 [  0]
 65669.8667 -  70315.5000 [  0]
 70315.5000 -  74961.1333 [  1]
 74961.1333 -  79606.7667 [  0]
 79606.7667 -  84252.4000 [  0]
 84252.4000 -  88898.0333 [  0]
 88898.0333 -  93543.6667 [  0]
 93543.6667 -  98189.3000 [  0]
 98189.3000 - 102834.9333 [  0]
102834.9333 - 107480.5667 [  1]
107480.5667 - 112126.2000 [  0]
112126.2000 - 116771.8333 [  0]
116771.8333 - 121417.4667 [  0]
121417.4667 - 126063.1000 [  0]
126063.1000 - 130708.7333 [  0]
130708.7333 - 135354.3667 [  0]
135354.3667 - 140000.0000 [  1]
```

Ideally we eliminate the waiting by switching to an async state machine for GC, but for now just yield to the OS scheduler after a reasonable wait time. To choose yielding parameters I've measured golang.org/x/benchmarks/http tail latencies with different yield delays and oversubscription levels.

With no oversubscription (to the degree possible):

```
scang yield delay = 1, casgstatus yield delay = 1
Latency-50   1.41ms ±15%  1.41ms ± 5%     ~     (p=0.611 n=13+12)
Latency-95   5.21ms ± 2%  5.15ms ± 2%  -1.15%   (p=0.012 n=13+13)
Latency-99   7.16ms ± 2%  7.05ms ± 2%  -1.54%   (p=0.002 n=13+13)
Latency-999  10.7ms ± 9%  10.2ms ±10%  -5.46%   (p=0.004 n=12+13)

scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50   1.41ms ±15%  1.41ms ± 8%     ~     (p=0.511 n=13+13)
Latency-95   5.21ms ± 2%  5.14ms ± 2%  -1.23%   (p=0.006 n=13+13)
Latency-99   7.16ms ± 2%  7.02ms ± 2%  -1.94%   (p=0.000 n=13+13)
Latency-999  10.7ms ± 9%  10.1ms ± 8%  -6.14%   (p=0.000 n=12+13)

scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50   1.41ms ±15%  1.45ms ± 6%     ~     (p=0.724 n=13+13)
Latency-95   5.21ms ± 2%  5.18ms ± 1%     ~     (p=0.287 n=13+13)
Latency-99   7.16ms ± 2%  7.05ms ± 2%  -1.64%   (p=0.002 n=13+13)
Latency-999  10.7ms ± 9%  10.0ms ± 5%  -6.72%   (p=0.000 n=12+13)

scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50   1.41ms ±15%  1.51ms ± 7%  +6.57%   (p=0.002 n=13+13)
Latency-95   5.21ms ± 2%  5.21ms ± 2%     ~     (p=0.960 n=13+13)
Latency-99   7.16ms ± 2%  7.06ms ± 2%  -1.50%   (p=0.012 n=13+13)
Latency-999  10.7ms ± 9%  10.0ms ± 6%  -6.49%   (p=0.000 n=12+13)

scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50   1.41ms ±15%  1.53ms ± 6%  +8.48%   (p=0.000 n=13+12)
Latency-95   5.21ms ± 2%  5.23ms ± 2%     ~     (p=0.287 n=13+13)
Latency-99   7.16ms ± 2%  7.08ms ± 2%  -1.21%   (p=0.004 n=13+13)
Latency-999  10.7ms ± 9%   9.9ms ± 3%  -7.99%   (p=0.000 n=12+12)

scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50   1.41ms ±15%  1.47ms ± 5%     ~     (p=0.072 n=13+13)
Latency-95   5.21ms ± 2%  5.17ms ± 2%     ~     (p=0.091 n=13+13)
Latency-99   7.16ms ± 2%  7.02ms ± 2%  -1.99%   (p=0.000 n=13+13)
Latency-999  10.7ms ± 9%   9.9ms ± 5%  -7.86%   (p=0.000 n=12+13)
```

With slight oversubscription (another instance of the http benchmark was running in the background with reduced GOMAXPROCS):

```
scang yield delay = 1, casgstatus yield delay = 1
Latency-50    840µs ± 3%   804µs ± 3%  -4.37%   (p=0.000 n=15+18)
Latency-95   6.52ms ± 4%  6.03ms ± 4%  -7.51%   (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%  10.0ms ± 4%  -7.33%   (p=0.000 n=18+14)
Latency-999  18.0ms ± 9%  16.8ms ± 7%  -6.84%   (p=0.000 n=18+18)

scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50    840µs ± 3%   809µs ± 3%  -3.71%   (p=0.000 n=15+17)
Latency-95   6.52ms ± 4%  6.11ms ± 4%  -6.29%   (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%   9.9ms ± 6%  -7.55%   (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.5ms ±11%  -8.49%   (p=0.000 n=18+18)

scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50    840µs ± 3%   823µs ± 5%  -2.06%   (p=0.002 n=15+18)
Latency-95   6.52ms ± 4%  6.32ms ± 3%  -3.05%   (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 4%  -5.22%   (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.7ms ±10%  -7.09%   (p=0.000 n=18+18)

scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50    840µs ± 3%   836µs ± 5%     ~     (p=0.442 n=15+18)
Latency-95   6.52ms ± 4%  6.39ms ± 3%  -2.00%   (p=0.000 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 6%  -5.15%   (p=0.000 n=18+17)
Latency-999  18.0ms ± 9%  16.6ms ± 8%  -7.48%   (p=0.000 n=18+18)

scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50    840µs ± 3%   836µs ± 6%     ~     (p=0.401 n=15+18)
Latency-95   6.52ms ± 4%  6.40ms ± 4%  -1.79%   (p=0.010 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 5%  -4.95%   (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.5ms ±14%  -8.17%   (p=0.000 n=18+18)

scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50    840µs ± 3%   828µs ± 2%  -1.49%   (p=0.001 n=15+17)
Latency-95   6.52ms ± 4%  6.38ms ± 4%  -2.04%   (p=0.001 n=18+18)
Latency-99   10.8ms ± 7%  10.2ms ± 4%  -4.77%   (p=0.000 n=18+18)
Latency-999  18.0ms ± 9%  16.9ms ± 9%  -6.23%   (p=0.000 n=18+18)
```

With significant oversubscription (the background http benchmark was running with full GOMAXPROCS):

```
scang yield delay = 1, casgstatus yield delay = 1
Latency-50   1.32ms ±12%  1.30ms ±13%     ~     (p=0.454 n=14+14)
Latency-95   16.3ms ±10%  15.3ms ± 7%  -6.29%   (p=0.001 n=14+14)
Latency-99   29.4ms ±10%  27.9ms ± 5%  -5.04%   (p=0.001 n=14+12)
Latency-999  49.9ms ±19%  45.9ms ± 5%  -8.00%   (p=0.008 n=14+13)

scang yield delay = 5000, casgstatus yield delay = 3000
Latency-50   1.32ms ±12%  1.29ms ± 9%     ~     (p=0.227 n=14+14)
Latency-95   16.3ms ±10%  15.4ms ± 5%  -5.27%   (p=0.002 n=14+14)
Latency-99   29.4ms ±10%  27.9ms ± 6%  -5.16%   (p=0.001 n=14+14)
Latency-999  49.9ms ±19%  46.8ms ± 8%  -6.21%   (p=0.050 n=14+14)

scang yield delay = 10000, casgstatus yield delay = 5000
Latency-50   1.32ms ±12%  1.35ms ± 9%     ~     (p=0.401 n=14+14)
Latency-95   16.3ms ±10%  15.0ms ± 4%  -7.67%   (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.4ms ± 5%  -6.98%   (p=0.000 n=14+14)
Latency-999  49.9ms ±19%  44.7ms ± 5%  -10.56%  (p=0.000 n=14+11)

scang yield delay = 30000, casgstatus yield delay = 10000
Latency-50   1.32ms ±12%  1.36ms ±10%     ~     (p=0.246 n=14+14)
Latency-95   16.3ms ±10%  14.9ms ± 5%  -8.31%   (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.4ms ± 7%  -6.70%   (p=0.000 n=14+14)
Latency-999  49.9ms ±19%  44.9ms ±15%  -10.13%  (p=0.003 n=14+14)

scang yield delay = 100000, casgstatus yield delay = 50000
Latency-50   1.32ms ±12%  1.41ms ± 9%  +6.37%   (p=0.008 n=14+13)
Latency-95   16.3ms ±10%  15.1ms ± 8%  -7.45%   (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.5ms ±12%  -6.67%   (p=0.002 n=14+14)
Latency-999  49.9ms ±19%  45.9ms ±16%  -8.06%   (p=0.019 n=14+14)

scang yield delay = 200000, casgstatus yield delay = 100000
Latency-50   1.32ms ±12%  1.42ms ±10%  +7.21%   (p=0.003 n=14+14)
Latency-95   16.3ms ±10%  15.0ms ± 7%  -7.59%   (p=0.000 n=14+14)
Latency-99   29.4ms ±10%  27.3ms ± 8%  -7.20%   (p=0.000 n=14+14)
Latency-999  49.9ms ±19%  44.8ms ± 8%  -10.21%  (p=0.001 n=14+13)
```

All numbers are on 8 cores and with GOGC=10 (the http benchmark has a tiny heap, few goroutines and a low allocation rate, so by default GC barely affects tail latency). 10us/5us yield delays seem to provide a reasonable compromise and give 5-10% tail latency reduction. That's what is used in this change.

go install -a std results on a 4-core machine:

```
name      old time/op   new time/op   delta
Time      8.39s ± 2%    7.94s ± 2%    -5.34%  (p=0.000 n=47+49)
UserTime  24.6s ± 2%    22.9s ± 2%    -6.76%  (p=0.000 n=49+49)
SysTime   1.77s ± 9%    1.89s ±11%    +7.00%  (p=0.000 n=49+49)
CpuLoad   315ns ± 2%    313ns ± 1%    -0.59%  (p=0.000 n=49+48)  # %CPU
MaxRSS    97.1ms ± 4%   97.5ms ± 9%      ~    (p=0.838 n=46+49)  # bytes
```

Update #14396
Update #14189

Change-Id: I3f4109bf8f7fd79b39c466576690a778232055a2
Reviewed-on: https://go-review.googlesource.com/21503
Run-TryBot: Dmitry Vyukov <[email protected]>
TryBot-Result: Gobot Gobot <[email protected]>
Reviewed-by: Rick Hudson <[email protected]>
Reviewed-by: Austin Clements <[email protected]>
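The pattern the CL describes — spin briefly, then hand the CPU back — can be illustrated in ordinary user code. This is only a conceptual sketch of the idea, not the runtime's actual scang/casgstatus code; the iteration counts and sleep duration are arbitrary.

```go
package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

// waitUntilSet spins on flag for a short, bounded time and then starts
// yielding, first to the Go scheduler and eventually to the OS, instead
// of burning a full quantum in a tight loop.
func waitUntilSet(flag *uint32) {
	const spinIters = 100 // cheap busy-wait budget before giving up the CPU
	for i := 0; atomic.LoadUint32(flag) == 0; i++ {
		switch {
		case i < spinIters:
			// active spin: fine if the wait is expected to be very short
		case i < 2*spinIters:
			runtime.Gosched() // let other goroutines run
		default:
			time.Sleep(10 * time.Microsecond) // yield to the OS scheduler
		}
	}
}

func main() {
	var flag uint32
	go func() {
		time.Sleep(time.Millisecond)
		atomic.StoreUint32(&flag, 1)
	}()
	waitUntilSet(&flag)
}
```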
I just wanted to revisit this issue and let everyone know that as of v0.13.0 (released today), InfluxDB is now being built with Go 1.6.2. We did a lot of work over the last couple of months to reduce unnecessary allocations, but the GC performance in Go has also been steadily improving. Based on our own internal testing (influxdata/influxdb#6511) and user feedback (influxdata/influxdb#5217), we're consistently seeing better performance across the board when using Go 1.6.2. This is also without tuning the …

I'm going to close this issue out because I don't think it represents an actionable item for the Go team any longer. For anyone else who still sees better performance on Go 1.4.3, I'd recommend working to identify and reduce unnecessary allocations - you'll be happier for it!

Finally, we're seeing even better performance with …
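For anyone following the advice above about hunting down unnecessary allocations, here is a minimal sketch (mine, not from the thread) of the standard approach: a Go benchmark with allocation reporting, run with `go test -bench . -benchmem`. The package and function names are placeholders for whatever hot path is under suspicion.

```go
package hotpath_test // placeholder package name

import (
	"bytes"
	"testing"
)

// encodePoint is a stand-in for the hot-path function being investigated;
// replace it with the real code under test.
func encodePoint(buf *bytes.Buffer, key string, value float64) {
	buf.WriteString(key)
}

func BenchmarkEncodePoint(b *testing.B) {
	var buf bytes.Buffer
	b.ReportAllocs() // report allocs/op and B/op alongside ns/op
	for i := 0; i < b.N; i++ {
		buf.Reset()
		encodePoint(&buf, "cpu,host=a value=", 0.64)
	}
}
```

Comparing allocs/op before and after a change is usually a quicker signal than rerunning full load tests.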
@toddboom, that's great! Thanks for updating the issue.
This issue is a follow-up to a thread that started on the mailing list here:
https://groups.google.com/forum/#!topic/golang-dev/sfwfQTu7eM4
We recently reverted to Go 1.4.3 for the most recent release of InfluxDB after testing against 1.5.x and collecting feedback from both our own benchmarks and the experiences of InfluxDB users. There were significant performance hits with both 1.5.x and 1.6beta1 across a variety of dimensions. For example, with write performance we saw a 33% reduction in overall throughput, coupled with a 26% increase in average response time. We expected that the GC improvements would lead to a reduction in response times, but that hasn't proven to be the case. It's far from exhaustive, but here are the findings from a medium-scale write test:
As requested in the mailing list thread, we ran the test with `GODEBUG=gctrace=1` and recorded the output alongside our log messages for context. It doesn't look like the GC-specific lines should be too hard to grep out if our other log lines are too noisy. We also ran with `cpuprofile` and `memprofile` on, and collected the output. We've included results for these tests on all three versions of Go listed above, and provided the compiled binaries as well. You can find them in this Google Drive folder: https://drive.google.com/open?id=0B3YrTka8Ai_McUNCak1YSW9ZM28
I hope this is enough information to assist in debugging the underlying issue. Definitely let me know if there are any other tests I can run to help. We're looking forward to getting to better performance than Go 1.4.3!
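For a long-running server like this, one common way to expose the same profiles on demand is the standard net/http/pprof package. A minimal sketch, under the assumption that adding a debug HTTP listener is acceptable; the port is an arbitrary choice and the paths are the package defaults.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Profiles can then be fetched while the service runs, e.g.:
	//   go tool pprof http://localhost:6060/debug/pprof/profile   (30s CPU profile)
	//   go tool pprof http://localhost:6060/debug/pprof/heap      (heap profile)
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the real server's main loop
}
```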