
runtime: timer doesn't scale on multi-CPU systems with a lot of timers #15133

Closed

Description

@valyala

Please answer these questions before submitting your issue. Thanks!

What version of Go are you using (go version)?

go version devel +dfad3f8 Tue Apr 5 16:10:33 2016 +0300 linux/amd64

What operating system and processor architecture are you using (go env)?

GOARCH="amd64"
GOBIN=""
GOEXE=""
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GORACE=""
CC="gcc"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=/tmp/go-build040466407=/tmp/go-build -gno-record-gcc-switches"
CXX="g++"
CGO_ENABLED="1"

What did you do?

I am running an HTTP server that serves more than a million concurrent TCP keep-alive connections on a system with 32 CPU cores. Each connection has read and write deadlines set via net.Conn.SetReadDeadline and net.Conn.SetWriteDeadline.
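For reference, here is a minimal sketch of that per-connection pattern (the echo logic, listen address, and the 90s/5s values are illustrative; the real server is fasthttp-based):

package main

import (
        "log"
        "net"
        "time"
)

// serveConn mirrors the pattern from the report: every request/response
// pair refreshes both deadlines, and each refresh goes through
// runtime.deltimer + runtime.addtimer under the global timers lock.
func serveConn(c net.Conn) {
        defer c.Close()
        buf := make([]byte, 4096)
        for {
                c.SetReadDeadline(time.Now().Add(90 * time.Second))
                n, err := c.Read(buf)
                if err != nil {
                        return
                }
                c.SetWriteDeadline(time.Now().Add(5 * time.Second))
                if _, err := c.Write(buf[:n]); err != nil {
                        return
                }
        }
}

func main() {
        ln, err := net.Listen("tcp", ":8080")
        if err != nil {
                log.Fatal(err)
        }
        for {
                c, err := ln.Accept()
                if err != nil {
                        log.Fatal(err)
                }
                go serveConn(c) // around a million of these in the real setup
        }
}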

What did you expect to see?

Server performance should scale with GOMAXPROCS up to the number of CPU cores.

What did you see instead?

  • The addtimer and deltimer functions from runtime/time.go are at the top of the CPU profile.
  • iowait reaches 20% due to memory ping-pong across CPU cores inside the siftupTimer and siftdownTimer functions in runtime/time.go.
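Such a profile can be captured with net/http/pprof; a minimal sketch, assuming the server can expose a debug listener on an internal port:

package main

import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers the /debug/pprof/ handlers
)

func main() {
        // Collect with: go tool pprof http://localhost:6060/debug/pprof/profile
        // then inspect with top20, top20 -cum, list addtimer$ (see below).
        log.Fatal(http.ListenAndServe("localhost:6060", nil))
}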

@dvyukov, could you look into this?

Activity

dvyukov (Member) commented on Apr 6, 2016

Server performance should scale with GOMAXPROCS up to the number of CPU cores.

What did you see instead? How does performance depend on the number of cores?

iowait reaches 20% due to memory ping-pong across CPU cores inside the siftupTimer and siftdownTimer functions in runtime/time.go.

I don't understand the relation between iowait and memory ping-pong. iowait means waiting for IO, like a hard drive; memory accesses are not IO.

What exactly does the profile look like?

Each connection has read and write deadlines set via net.Conn.SetReadDeadline and net.Conn.SetWriteDeadline.

Would it be possible to change it to SetDeadline? It would set up 1 timer instead of 2.
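A sketch of the suggested change (helper and package names are illustrative):

package deadlines

import (
        "net"
        "time"
)

// Two runtime timers per connection:
func refreshSeparate(c net.Conn, d time.Duration) {
        c.SetReadDeadline(time.Now().Add(d))
        c.SetWriteDeadline(time.Now().Add(d))
}

// One runtime timer per connection: when the read and write deadlines
// are equal, SetDeadline lets the runtime share a single timer.
func refreshCombined(c net.Conn, d time.Duration) {
        c.SetDeadline(time.Now().Add(d))
}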

dvyukov (Member) commented on Apr 6, 2016

We can probably merge this into #6239

valyala (Contributor, Author) commented on Apr 6, 2016

What did you see instead? How does performance depend on the number of cores?

The iowait and system CPU shares grow as the number of CPU cores increases.

I don't understand the relation between iowait and memory ping-pong. iowait means waiting for IO, like a hard drive; memory accesses are not IO.

I'm not an expert in iowait, but the fact is that iowait completely vanishes from 20% to 0% when deadlines on net.TCPConn connections are disabled. See the following graph:

[CPU usage graph: cpuusage]

Connection deadlines are enabled during 15:43-15:56.
Connection deadlines are disabled starting from 15:56.

What exactly does the profile look like?

Here is the CPU profile for the process with deadlines enabled:

(pprof) top20
9.85mins of 15.65mins total (62.97%)
Dropped 623 nodes (cum <= 0.08mins)
Showing top 20 nodes out of 187 (cum >= 4.98mins)
      flat  flat%   sum%        cum   cum%
  3.27mins 20.92% 20.92%   3.27mins 20.92%  runtime.futex
  1.98mins 12.66% 33.57%   2.20mins 14.04%  syscall.Syscall
  0.66mins  4.25% 37.82%   0.66mins  4.25%  runtime.memmove
  0.54mins  3.48% 41.30%   0.54mins  3.48%  runtime.procyield
  0.52mins  3.31% 44.61%   3.18mins 20.29%  runtime.lock
  0.41mins  2.61% 47.22%   0.41mins  2.61%  runtime/internal/atomic.Xchg
  0.34mins  2.18% 49.40%   0.34mins  2.18%  runtime.epollwait
  0.25mins  1.61% 51.02%   0.25mins  1.61%  runtime.epollctl
  0.25mins  1.60% 52.61%   0.25mins  1.60%  runtime.siftdownTimer
  0.24mins  1.55% 54.16%   0.24mins  1.55%  runtime.siftupTimer
  0.24mins  1.54% 55.70%   0.24mins  1.54%  runtime.osyield
  0.19mins  1.19% 56.89%   0.19mins  1.19%  runtime/internal/atomic.Cas
  0.15mins  0.96% 57.85%   3.03mins 19.36%  runtime.deltimer
  0.14mins  0.91% 58.76%  11.56mins 73.90%  github.com/valyala/fasthttp.(*Server).serveConn
  0.12mins  0.76% 59.52%      2mins 12.78%  runtime.unlock
  0.12mins  0.74% 60.26%   0.16mins     1%  runtime.deferreturn
  0.11mins  0.71% 60.97%   0.12mins  0.77%  runtime.gopark
  0.11mins  0.71% 61.68%   0.42mins  2.68%  runtime.gentraceback
  0.10mins  0.66% 62.35%   0.23mins  1.47%  lib/metric.(*sample).Update
  0.10mins  0.62% 62.97%   4.98mins 31.86%  net.runtime_pollSetDeadline

(pprof) top20 -cum
387.31s of 938.84s total (41.25%)
Dropped 623 nodes (cum <= 4.69s)
Showing top 20 nodes out of 187 (cum >= 112.06s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    864.20s 92.05%  runtime.goexit
     0.01s 0.0011% 0.0011%    733.84s 78.16%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
     0.39s 0.042% 0.043%    733.75s 78.15%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.10s 0.011% 0.053%    693.90s 73.91%  github.com/valyala/fasthttp.(*Server).(github.com/valyala/fasthttp.serveConn)-fm
     8.53s  0.91%  0.96%    693.80s 73.90%  github.com/valyala/fasthttp.(*Server).serveConn
     1.28s  0.14%  1.10%    306.32s 32.63%  net.setDeadlineImpl
     5.86s  0.62%  1.72%    299.09s 31.86%  net.runtime_pollSetDeadline
   196.36s 20.92% 22.64%    196.36s 20.92%  runtime.futex
    31.04s  3.31% 25.94%    190.52s 20.29%  runtime.lock
        9s  0.96% 26.90%    181.79s 19.36%  runtime.deltimer
     0.80s 0.085% 26.99%    159.09s 16.95%  lib/httpserver.(*customConn).SetReadDeadline
     0.72s 0.077% 27.06%    158.10s 16.84%  net.(*conn).SetReadDeadline
     0.36s 0.038% 27.10%    157.38s 16.76%  net.(*netFD).setReadDeadline
     2.63s  0.28% 27.38%    154.28s 16.43%  lib/httpserver.(*customConn).SetWriteDeadline
     0.84s 0.089% 27.47%    150.37s 16.02%  net.(*conn).SetWriteDeadline
     0.23s 0.024% 27.50%    149.53s 15.93%  net.(*netFD).setWriteDeadline
     0.80s 0.085% 27.58%    139.24s 14.83%  runtime.addtimer
   118.83s 12.66% 40.24%    131.79s 14.04%  syscall.Syscall
     7.14s  0.76% 41.00%    120.01s 12.78%  runtime.unlock
     2.39s  0.25% 41.25%    112.06s 11.94%  github.com/valyala/fasthttp.acquireByteReader

(pprof) list addtimer$
Total: 15.65mins
ROUTINE ======================== runtime.addtimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
     800ms   2.32mins (flat, cum) 14.83% of Total
         .          .     80:// Ready the goroutine arg.
         .          .     81:func goroutineReady(arg interface{}, seq uintptr) {
         .          .     82:   goready(arg.(*g), 0)
         .          .     83:}
         .          .     84:
     360ms      360ms     85:func addtimer(t *timer) {
      50ms   1.35mins     86:   lock(&timers.lock)
     320ms      7.42s     87:   addtimerLocked(t)
      30ms     50.47s     88:   unlock(&timers.lock)
      40ms       40ms     89:}
         .          .     90:
         .          .     91:// Add a timer to the heap and start or kick the timer proc.
         .          .     92:// If the new timer is earlier than any of the others.
         .          .     93:// Timers are locked.
         .          .     94:func addtimerLocked(t *timer) {


(pprof) list deltimer$
Total: 15.65mins
ROUTINE ======================== runtime.deltimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
        9s   3.03mins (flat, cum) 19.36% of Total
         .          .    117:   }
         .          .    118:}
         .          .    119:
         .          .    120:// Delete timer t from the heap.
         .          .    121:// Do not need to update the timerproc: if it wakes up early, no big deal.
     200ms      200ms    122:func deltimer(t *timer) bool {
         .          .    123:   // Dereference t so that any panic happens before the lock is held.
         .          .    124:   // Discard result, because t might be moving in the heap.
      50ms       50ms    125:   _ = t.i
         .          .    126:
     180ms   1.61mins    127:   lock(&timers.lock)
         .          .    128:   // t may not be registered anymore and may have
         .          .    129:   // a bogus i (typically 0, if generated by Go).
         .          .    130:   // Verify it before proceeding.
     570ms      570ms    131:   i := t.i
     1.64s      1.64s    132:   last := len(timers.t) - 1
     2.64s      2.64s    133:   if i < 0 || i > last || timers.t[i] != t {
         .          .    134:       unlock(&timers.lock)
         .          .    135:       return false
         .          .    136:   }
         .          .    137:   if i != last {
     1.47s      1.47s    138:       timers.t[i] = timers.t[last]
     330ms      330ms    139:       timers.t[i].i = i
         .          .    140:   }
     1.71s      1.71s    141:   timers.t[last] = nil
      30ms       30ms    142:   timers.t = timers.t[:last]
         .          .    143:   if i != last {
      30ms     11.32s    144:       siftupTimer(i)
      30ms      6.23s    145:       siftdownTimer(i)
         .          .    146:   }
      30ms     58.89s    147:   unlock(&timers.lock)
      90ms       90ms    148:   return true
         .          .    149:}
         .          .    150:
         .          .    151:// Timerproc runs the time-driven events.
         .          .    152:// It sleeps until the next event in the timers heap.
         .          .    153:// If addtimer inserts a new earlier event, addtimer1 wakes timerproc early.

Here is the CPU profile for the app with connection deadlines disabled:

(pprof) top20
465.38s of 1002.60s total (46.42%)
Dropped 622 nodes (cum <= 5.01s)
Showing top 20 nodes out of 192 (cum >= 15.87s)
      flat  flat%   sum%        cum   cum%
    54.74s  5.46%  5.46%     54.74s  5.46%  runtime.memmove
    54.54s  5.44% 10.90%    150.32s 14.99%  runtime.lock
    48.96s  4.88% 15.78%    330.74s 32.99%  runtime.findrunnable
    34.12s  3.40% 19.19%     34.12s  3.40%  runtime.futex
    32.93s  3.28% 22.47%     32.93s  3.28%  runtime/internal/atomic.Xchg
    27.41s  2.73% 25.20%     27.41s  2.73%  runtime.procyield
    26.82s  2.68% 27.88%     33.59s  3.35%  runtime.runqgrab
    20.20s  2.01% 29.89%     35.12s  3.50%  runtime.selectgoImpl
    17.45s  1.74% 31.63%     17.45s  1.74%  runtime/internal/atomic.Cas
    16.69s  1.66% 33.30%     21.79s  2.17%  github.com/valyala/fasthttp.(*headerScanner).next
    16.56s  1.65% 34.95%     16.56s  1.65%  math/big.addMulVVW
    16.38s  1.63% 36.58%     25.14s  2.51%  github.com/valyala/fasthttp.(*RequestHeader).AppendBytes
    14.24s  1.42% 38.01%    446.04s 44.49%  github.com/valyala/fasthttp.(*Server).serveConn
    13.88s  1.38% 39.39%     22.81s  2.28%  runtime.netpollready
    13.72s  1.37% 40.76%     13.72s  1.37%  runtime/internal/atomic.Load
    12.19s  1.22% 41.97%     32.18s  3.21%  runtime.unlock
    11.54s  1.15% 43.12%    351.66s 35.07%  runtime.schedule
    11.42s  1.14% 44.26%     11.42s  1.14%  runtime.indexbytebody
    11.02s  1.10% 45.36%     11.02s  1.10%  runtime.osyield
    10.57s  1.05% 46.42%     15.87s  1.58%  github.com/valyala/fasthttp.appendArgBytes

(pprof) top20 -cum
155.46s of 1002.60s total (15.51%)
Dropped 622 nodes (cum <= 5.01s)
Showing top 20 nodes out of 192 (cum >= 61.52s)
      flat  flat%   sum%        cum   cum%
         0     0%     0%    636.02s 63.44%  runtime.goexit
         0     0%     0%    466.84s 46.56%  github.com/valyala/fasthttp.(*workerPool).getCh.func1
     2.85s  0.28%  0.28%    466.84s 46.56%  github.com/valyala/fasthttp.(*workerPool).workerFunc
     0.28s 0.028%  0.31%    446.32s 44.52%  github.com/valyala/fasthttp.(*Server).(github.com/valyala/fasthttp.serveConn)-fm
    14.24s  1.42%  1.73%    446.04s 44.49%  github.com/valyala/fasthttp.(*Server).serveConn
     3.38s  0.34%  2.07%    361.68s 36.07%  runtime.mcall
     1.30s  0.13%  2.20%    354.98s 35.41%  runtime.park_m
    11.54s  1.15%  3.35%    351.66s 35.07%  runtime.schedule
    48.96s  4.88%  8.23%    330.74s 32.99%  runtime.findrunnable
     1.74s  0.17%  8.41%    206.95s 20.64%  lib/httpserver.getHTTPHandler.func1
     1.21s  0.12%  8.53%    199.45s 19.89%  lib/httpserver.httpHandler
     0.43s 0.043%  8.57%    161.15s 16.07%  main.requestHandler
     0.14s 0.014%  8.58%    158.64s 15.82%  main.proxyHandler
    54.54s  5.44% 14.02%    150.32s 14.99%  runtime.lock
     2.46s  0.25% 14.27%    149.65s 14.93%  main.proxyRequest
         0     0% 14.27%    112.92s 11.26%  github.com/valyala/fasthttp.(*PipelineClient).worker.func1
     4.15s  0.41% 14.68%    112.92s 11.26%  github.com/valyala/fasthttp.(*PipelineClient).writer
     5.54s  0.55% 15.24%    106.39s 10.61%  github.com/valyala/fasthttp.acquireByteReader
     0.73s 0.073% 15.31%     67.26s  6.71%  github.com/valyala/fasthttp.(*RequestHeader).SetBytesV
     1.97s   0.2% 15.51%     61.52s  6.14%  github.com/valyala/fasthttp.(*RequestHeader).SetCanonical

(pprof) list addtimer$
Total: 16.71mins
ROUTINE ======================== runtime.addtimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
     150ms      1.60s (flat, cum)  0.16% of Total
         .          .     80:// Ready the goroutine arg.
         .          .     81:func goroutineReady(arg interface{}, seq uintptr) {
         .          .     82:   goready(arg.(*g), 0)
         .          .     83:}
         .          .     84:
     100ms      100ms     85:func addtimer(t *timer) {
      40ms      780ms     86:   lock(&timers.lock)
      10ms      310ms     87:   addtimerLocked(t)
         .      410ms     88:   unlock(&timers.lock)
         .          .     89:}
         .          .     90:
         .          .     91:// Add a timer to the heap and start or kick the timer proc.
         .          .     92:// If the new timer is earlier than any of the others.
         .          .     93:// Timers are locked.


(pprof) list deltimer$
Total: 16.71mins
ROUTINE ======================== runtime.deltimer in /home/aliaksandr/work/go1.5/src/runtime/time.go
     2.21s      4.53s (flat, cum)  0.45% of Total
         .          .    117:   }
         .          .    118:}
         .          .    119:
         .          .    120:// Delete timer t from the heap.
         .          .    121:// Do not need to update the timerproc: if it wakes up early, no big deal.
     370ms      370ms    122:func deltimer(t *timer) bool {
         .          .    123:   // Dereference t so that any panic happens before the lock is held.
         .          .    124:   // Discard result, because t might be moving in the heap.
      10ms       10ms    125:   _ = t.i
         .          .    126:
      60ms      1.03s    127:   lock(&timers.lock)
         .          .    128:   // t may not be registered anymore and may have
         .          .    129:   // a bogus i (typically 0, if generated by Go).
         .          .    130:   // Verify it before proceeding.
      40ms       40ms    131:   i := t.i
      30ms       30ms    132:   last := len(timers.t) - 1
     1.06s      1.06s    133:   if i < 0 || i > last || timers.t[i] != t {
         .          .    134:       unlock(&timers.lock)
         .          .    135:       return false
         .          .    136:   }
         .          .    137:   if i != last {
     230ms      230ms    138:       timers.t[i] = timers.t[last]
      70ms       70ms    139:       timers.t[i].i = i
         .          .    140:   }
     250ms      250ms    141:   timers.t[last] = nil
         .          .    142:   timers.t = timers.t[:last]
         .          .    143:   if i != last {
      20ms      180ms    144:       siftupTimer(i)
      10ms      110ms    145:       siftdownTimer(i)
         .          .    146:   }
      20ms      1.11s    147:   unlock(&timers.lock)
      40ms       40ms    148:   return true
         .          .    149:}
         .          .    150:
         .          .    151:// Timerproc runs the time-driven events.
         .          .    152:// It sleeps until the next event in the timers heap.
         .          .    153:// If addtimer inserts a new earlier event, addtimer1 wakes timerproc early.

Would it be possible to change it to SetDeadline? It would set up 1 timer instead of 2.

No, this won't work, since read and write deadlines are configured independently in our application.
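A sketch of why a single combined deadline doesn't fit (the values are illustrative, matching the timeouts in the repro further down):

package deadlines

import (
        "net"
        "time"
)

// The two deadlines differ by design: a long idle timeout for
// keep-alive reads, a short per-response write timeout.
func refreshDeadlines(c net.Conn) {
        c.SetReadDeadline(time.Now().Add(90 * time.Second)) // idle keep-alive
        c.SetWriteDeadline(time.Now().Add(5 * time.Second)) // response write
}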

valyala (Contributor, Author) commented on Apr 6, 2016

FYI, this is not an OS scalability problem: the portion of the CPU usage graph above before 15:33 corresponds to the 'app process per CPU core' mode of the app with connection deadlines enabled. We had to switch to this mode because of the timer scalability issue in the Go runtime.

dvyukov (Member) commented on Apr 6, 2016

Thanks for the detailed info!

This confirms that the main issue is the global timers mutex; siftdown/siftup themselves consume only ~3% of time. If we distribute timers, siftdown/siftup should become faster as well (smaller per-P heaps plus better locality).
Unfortunately this is not trivial to do. There is #6239 for this.
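A user-level sketch of the distribution idea: shard timers across independently locked heaps so concurrent addtimer/deltimer calls rarely contend on the same mutex, and each heap stays small. All names and the shard count are illustrative; this is not the runtime's code (CL 34784 later took a similar bucketing approach inside the runtime).

package timershard

import (
        "container/heap"
        "sync"
        "time"
)

const numShards = 64 // illustrative; could be tied to GOMAXPROCS

type timerEntry struct {
        when time.Time
        f    func() // callback, analogous to the runtime timer's f
        i    int    // index in its heap, mirrors the runtime timer's i
}

type timerHeap []*timerEntry

func (h timerHeap) Len() int           { return len(h) }
func (h timerHeap) Less(i, j int) bool { return h[i].when.Before(h[j].when) }
func (h timerHeap) Swap(i, j int) {
        h[i], h[j] = h[j], h[i]
        h[i].i, h[j].i = i, j
}
func (h *timerHeap) Push(x interface{}) {
        e := x.(*timerEntry)
        e.i = len(*h)
        *h = append(*h, e)
}
func (h *timerHeap) Pop() interface{} {
        old := *h
        e := old[len(old)-1]
        *h = old[:len(old)-1]
        return e
}

type shard struct {
        mu sync.Mutex
        h  timerHeap
}

var shards [numShards]shard

// addTimer picks a shard from a caller-supplied key (e.g. a per-P or
// per-connection value), so unrelated timers lock different mutexes.
func addTimer(key uintptr, when time.Time, f func()) *timerEntry {
        s := &shards[key%numShards]
        e := &timerEntry{when: when, f: f}
        s.mu.Lock()
        heap.Push(&s.h, e)
        s.mu.Unlock()
        return e
}

func delTimer(key uintptr, e *timerEntry) {
        s := &shards[key%numShards]
        s.mu.Lock()
        if e.i >= 0 && e.i < len(s.h) && s.h[e.i] == e {
                heap.Remove(&s.h, e.i)
        }
        s.mu.Unlock()
}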

FYI, this is not an OS scalability problem: the portion of the CPU usage graph above before 15:33 corresponds to the 'app process per CPU core' mode of the app with connection deadlines enabled. We had to switch to this mode because of the timer scalability issue in the Go runtime.

Sounds pretty bad. @rsc @aclements

added this to the Unplanned milestone on Apr 7, 2016
gopherbot (Contributor) commented on Jan 6, 2017

CL https://golang.org/cl/34784 mentions this issue.

valyala (Contributor, Author) commented on Jan 10, 2017

The CL fixes the timer scalability issue for us. Simple repro:

Hammer the following HTTP server with more than 100K concurrent HTTP keep-alive connections and send several requests per second over each connection. The server must run on a multi-core machine with GOMAXPROCS=NumCPU.

package main

import (
        "flag"
        "fmt"
        "log"
        "net/http"
        "time"
)

var addr = flag.String("addr", ":80", "TCP address to listen to")

func main() {
        flag.Parse()
        s := &http.Server{
                Addr: *addr,

                // These timeouts trigger high iowait without CL 34784
                // if many requests are sent over more than 100K
                // keep-alive HTTP connections.
                ReadTimeout:  90 * time.Second,
                WriteTimeout: 5 * time.Second,

                Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
                        fmt.Fprintf(w, "OK")
                }),
        }
        if err := s.ListenAndServe(); err != nil {
                log.Fatalf("error in ListenAndServe: %s", err)
        }
}

CPU load should look like this:

[CPU usage graph]

  • 12:37-12:42: the server without the CL; note the high iowait share.
  • 12:43-12:50: the server with the CL applied.
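For completeness, a minimal load-generator sketch under the stated assumptions (the flags, request shape, and one-request-per-second pacing are illustrative; driving 100K+ connections also requires raised file-descriptor limits and possibly several client machines):

package main

import (
        "flag"
        "fmt"
        "log"
        "net"
        "time"
)

var (
        addr  = flag.String("addr", "127.0.0.1:80", "server address")
        conns = flag.Int("conns", 100000, "number of keep-alive connections")
)

func main() {
        flag.Parse()
        for i := 0; i < *conns; i++ {
                go func() {
                        c, err := net.Dial("tcp", *addr)
                        if err != nil {
                                log.Print(err)
                                return
                        }
                        defer c.Close()
                        buf := make([]byte, 4096)
                        for range time.Tick(time.Second) {
                                if _, err := fmt.Fprintf(c, "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"); err != nil {
                                        return
                                }
                                // Crude: read whatever part of the response has arrived.
                                if _, err := c.Read(buf); err != nil {
                                        return
                                }
                        }
                }()
        }
        select {} // run until interrupted
}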
modified the milestones: Go1.10, Unplanned on Jul 7, 2017
added the early-in-cycle label (a change that should be done early in the 3-month dev cycle) on Jul 7, 2017
self-assigned this on Jul 7, 2017
added a commit that references this issue on Jun 27, 2018
locked and limited conversation to collaborators on Sep 12, 2018