Raw performance drops by ~25% when using timeouts #626

Closed
cipriancraciun opened this issue Aug 12, 2019 · 8 comments

@cipriancraciun
Contributor

I've observed that just enabling ReadTimeout, WriteTimeout, and IdleTimeout makes the raw performance drop by about 25%.

I've tested it on a "dummy" server using fasthttp:
https://github.com/volution/kawipiko/blob/c78613fbf7b81782c0ab9bbb363312d19086d346/sources/cmd/server-dummy.go#L45-L47

Just enabling or disabling those three lines changes the performance.

(I'm using the latest Go 1.12 on 64-bit Linux and the latest fasthttp code.)
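
For reference, the configuration in question looks roughly like the following sketch (the handler and the concrete timeout values are only illustrative, not the ones from the linked file):

package main

import (
        "time"

        "github.com/valyala/fasthttp"
)

func main() {
        server := &fasthttp.Server{
                Handler: func(ctx *fasthttp.RequestCtx) {
                        ctx.SetBodyString("hello")
                },
                // These are the three settings in question; commenting them
                // out restores the higher throughput.
                ReadTimeout:  30 * time.Second,
                WriteTimeout: 30 * time.Second,
                IdleTimeout:  360 * time.Second,
        }
        if err := server.ListenAndServe("127.0.0.1:8080"); err != nil {
                panic(err)
        }
}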


I understand that the underlying issue is the SetDeadline implementation of net.Conn, and I've even found a Go issue (in which @valyala participated) on this topic:
golang/go#25729

That issue has since been closed, but the actual problem persists.


Could we perhaps implement an "alternative" timeout solution, something along these lines: for each connection we keep a "last touched" timestamp (updated each time we fully read a request or fully write a response), and a background goroutine checks these timestamps and, whenever one has been overrun, simply calls c.Close() on that connection.

It won't be as precise as using c.SetDeadline, but it can still remove "dead" connections and mitigate "slow-loris" attacks.
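
To make the idea more concrete, here is a minimal sketch of what I have in mind (the names and structure are only illustrative, not an actual fasthttp patch):

package timeouts

import (
        "net"
        "sync"
        "time"
)

// connTracker keeps a "last touched" timestamp per connection; a single
// background goroutine closes connections whose timestamp has grown older
// than the allowed timeout.
type connTracker struct {
        mu      sync.Mutex
        touched map[net.Conn]time.Time
}

func newConnTracker(timeout, interval time.Duration) *connTracker {
        t := &connTracker{touched: make(map[net.Conn]time.Time)}
        go func() {
                for range time.Tick(interval) {
                        now := time.Now()
                        t.mu.Lock()
                        for c, last := range t.touched {
                                if now.Sub(last) > timeout {
                                        c.Close() // "slam the door"; no error response is sent
                                        delete(t.touched, c)
                                }
                        }
                        t.mu.Unlock()
                }
        }()
        return t
}

// touch is called each time a request is fully read or a response is
// fully written.
func (t *connTracker) touch(c net.Conn) {
        t.mu.Lock()
        t.touched[c] = time.Now()
        t.mu.Unlock()
}

// forget is called when the connection is closed normally.
func (t *connTracker) forget(c net.Conn) {
        t.mu.Lock()
        delete(t.touched, c)
        t.mu.Unlock()
}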

@erikdubbelboer
Collaborator

What are you using to benchmark this?

The issue you linked to was closed because the performance issue was fixed in Go.

I just did a benchmark myself and I only get a 2% slowdown with timeouts enabled: https://gist.github.com/erikdubbelboer/0b5468f1a7ba9679b435973b0590d685

@cipriancraciun
Contributor Author

I've used wrk to benchmark the simple dummy server I've mentioned:
https://github.com/volution/kawipiko/blob/c78613fbf7b81782c0ab9bbb363312d19086d346/sources/cmd/server-dummy.go

As mentioned, the tests were run on openSUSE Tumbleweed, Linux 5.1.7 x86-64, Go 1.12.

The following is the output of go version and go env (sanitized, and removed empty variables):

go version go1.12.7 linux/amd64

GOARCH="amd64"
GOHOSTARCH="amd64"
GOHOSTOS="linux"
GOOS="linux"
GOROOT="/usr/lib64/go/1.12"
GOTOOLDIR="/usr/lib64/go/1.12/pkg/tool/linux_amd64"
GCCGO="gccgo"
CC="gcc"
CXX="g++"
CGO_ENABLED="1"
GOMOD="{sanitized}/go.mod"
CGO_CFLAGS="-g -O2"
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map={sanitized} -gno-record-gcc-switches"

The following snippet was used to build the server:

go build \
        -v -a \
        -tags netgo \
        -ldflags 'all=-s -extld=gcc -extldflags=-static' \
        -gcflags 'all=-l=4' \
        -o ./server-dummy \
        -- ./cmd/server-dummy.go \
#

(I've also tried without -tags netgo and without static linking, to the same effect.)


The following snippets were used to run and benchmark the server.

The complex taskset ... invocation pins the server and the client on different physical cores of my laptop, sets the scheduling class to "realtime", raises the I/O priority, etc. In short, I make sure that other processes on my laptop (which is already pretty "quiet") don't interfere with the benchmark.

sudo -u root -n -E -P -- \
taskset -c 0,1 \
nice -n -19 -- \
ionice -c 2 -n 0 -- \
chrt -r 10 \
prlimit -n16384 -- \
sudo -u "${USER}" -n -E -P -- \
env -i -- \
\
./server-dummy \
#
sudo -u root -n -E -P -- \
taskset -c 2,3 \
nice -n -19 -- \
ionice -c 2 -n 0 -- \
chrt -r 10 \
prlimit -n16384 -- \
sudo -u "${USER}" -n -E -P -- \
env -i -- \
\
"$( type -P -- wrk )" \
        --threads 2 \
        --connections 128 \
        --timeout 1s \
        --duration 30s \
        --latency \
        -- "http://127.9.185.194:8080/" \
#

The following are the outputs of wrk:

  • without timeouts: ~135K r/s
Running 30s test @ http://127.9.185.194:8080/
  2 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     0.89ms  602.76us  23.86ms   61.92%
    Req/Sec    68.09k     0.90k   70.08k    83.00%
  Latency Distribution
     50%    0.87ms
     75%    1.37ms
     90%    1.65ms
     99%    2.06ms
  4065131 requests in 30.00s, 201.59MB read
Requests/sec: 135496.22
Transfer/sec:      6.72MB
  • with timeouts: ~110K r/s
Running 30s test @ http://127.9.185.194:8080/
  2 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.11ms  759.00us  16.92ms   62.13%
    Req/Sec    55.44k     0.94k   57.26k    80.83%
  Latency Distribution
     50%    1.02ms
     75%    1.75ms
     90%    2.07ms
     99%    2.69ms
  3309620 requests in 30.00s, 164.13MB read
Requests/sec: 110316.99
Transfer/sec:      5.47MB

@erikdubbelboer
Collaborator

These are the outputs I'm getting on my MacBook with go1.12.8:

Without timeouts:

$ wrk --threads 2 --connections 128 --timeout 1s --duration 30s --latency http://127.0.0.1:1337/
Running 30s test @ http://127.0.0.1:1337/
  2 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.14ms    1.35ms  33.82ms   91.32%
    Req/Sec    71.07k     4.64k   79.15k    75.91%
  Latency Distribution
     50%  613.00us
     75%    1.50ms
     90%    2.32ms
     99%    6.51ms
  4255987 requests in 30.10s, 361.24MB read
Requests/sec: 141386.73
Transfer/sec:     12.00MB

With timeouts:

$ wrk --threads 2 --connections 128 --timeout 1s --duration 30s --latency http://127.0.0.1:1337/
Running 30s test @ http://127.0.0.1:1337/
  2 threads and 128 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     1.75ms    3.41ms  56.27ms   92.28%
    Req/Sec    69.63k     4.49k   80.78k    69.60%
  Latency Distribution
     50%  838.00us
     75%    1.03ms
     90%    2.95ms
     99%   18.39ms
  4170823 requests in 30.10s, 354.01MB read
Requests/sec: 138556.68
Transfer/sec:     11.76MB

That's only 2% slower.

@cipriancraciun
Contributor Author

These are the outputs I'm getting on my MacBook with go1.12.8
[...]
That's only 2% slower.

Which code are you using for the server?

@erikdubbelboer
Collaborator

Your exact code.

But I have tested it on an Ubuntu machine just now and there I get a 14% slowdown with the timeouts, so I guess Linux is just much slower at this. Now I'm wondering if we should try to fix this here or open an issue with Go again.

@cipriancraciun
Contributor Author

Now I'm wondering if we should try to fix this here or open an issue with Go again.

Unless there is a hidden issue in fasthttp that is triggered when timeouts are used, I would say that the Go issue should be reopened, as this situation might impact other networking-related projects as well.


However, with regard to fasthttp, and given HTTP/1.1 semantics, I would say that these timeouts can easily be implemented in our own code without relying on Go's SetDeadline.

So, we have the following classes of timeouts:

  • reading / writing the entire request / response headers;
  • reading / writing at least a byte from the request / response body;
  • reading / writing the entire request / response body;
  • waiting for the next request on a keep-alive connection;

Thus, as stated in my original comment, we could implement a "background" worker that checks the status of each connection and, if it "took too long", simply closes that connection. Granted, "slamming the door" in the client's face is not very graceful, but in the end, apart from a "basic" 400 error code, there isn't anything useful we can provide to the client.

Moreover, I would say the sources of these timeouts are:

  • an actual client that "fell off" the network, and thus won't even receive the reply;
  • an attacker trying to DoS our server.

In neither case do I think we should waste CPU, network packets, and especially network buffers to queue a 400 error.


Also, one could always implement a "hybrid" timeout approach by enabling either SetDeadline or the "background" collector on a per-connection basis. (I would guess that those implementing proxy servers would prefer SetDeadline.)

And, speaking of "hybrid" timeouts, another reason we need timeouts is to eliminate dead clients that eat resources under load. Therefore, if the server is not loaded, we could "relax" these timeouts, as sketched below.
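
A rough sketch of what such "relaxing" could look like (the activeConns counter and relaxBelow threshold are hypothetical, just to illustrate the idea):

package timeouts

import "time"

// effectiveTimeout stretches the configured timeout while the server is
// lightly loaded, and enforces it strictly once the number of active
// connections crosses a threshold.
func effectiveTimeout(base time.Duration, activeConns, relaxBelow int) time.Duration {
        if activeConns < relaxBelow {
                return base * 4 // lightly loaded: be lenient with slow clients
        }
        return base // loaded: enforce the configured timeout
}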

@erikdubbelboer
Collaborator

I have done another test. I have taken the basic connection-handling code that fasthttp uses, to see whether the performance difference shows up with very basic code as well: https://gist.github.com/erikdubbelboer/f3d45fe3d04235b9c16fcf3524d54079

On Ubuntu I'm getting:

Requests/sec: 236582.31 # without timeouts
Requests/sec: 217022.16 # with timeouts

This is a 9% slowdown, which isn't the 14% I'm getting with fasthttp, but it does show that Go itself is causing the slowdown.
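
In rough terms, such a stripped-down test has this shape (the sketch below is only an illustration of the approach, not the code from the gist): a plain net listener that reads one request's headers, writes a fixed response, and optionally calls SetReadDeadline before every request.

package main

import (
        "bufio"
        "flag"
        "net"
        "time"
)

func main() {
        withDeadline := flag.Bool("deadline", false, "call SetReadDeadline before each request")
        flag.Parse()

        ln, err := net.Listen("tcp", "127.0.0.1:1337")
        if err != nil {
                panic(err)
        }
        response := []byte("HTTP/1.1 200 OK\r\nContent-Length: 5\r\nConnection: keep-alive\r\n\r\nhello")

        for {
                conn, err := ln.Accept()
                if err != nil {
                        continue
                }
                go func(c net.Conn) {
                        defer c.Close()
                        br := bufio.NewReader(c)
                        for {
                                if *withDeadline {
                                        // This is the call whose cost is being measured.
                                        if err := c.SetReadDeadline(time.Now().Add(10 * time.Second)); err != nil {
                                                return
                                        }
                                }
                                // Consume one request: read header lines until the blank line.
                                for {
                                        line, err := br.ReadString('\n')
                                        if err != nil {
                                                return
                                        }
                                        if line == "\r\n" || line == "\n" {
                                                break
                                        }
                                }
                                if _, err := c.Write(response); err != nil {
                                        return
                                }
                        }
                }(conn)
        }
}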

@erikdubbelboer
Collaborator

I have done some more performance testing, and the only difference when I enable Server.ReadTimeout is that net.Conn.SetReadDeadline() takes up a noticeable amount of CPU.

I have tried to write my own deadline code and the only fast option I could manage was something like:

        // Assumes the following shared state (declared elsewhere):
        //   var mu sync.Mutex
        //   var deadlines = map[net.Conn]*time.Timer{}
        mu.Lock()
        if t, ok := deadlines[conn]; ok {
                // Timers created with time.AfterFunc have no channel to drain
                // (t.C is nil), so Stop followed by Reset is enough here.
                t.Stop()
                t.Reset(s.ReadTimeout)
        } else {
                deadlines[conn] = time.AfterFunc(s.ReadTimeout, func() {
                        mu.Lock()
                        delete(deadlines, conn)
                        mu.Unlock()
                        conn.Close()
                })
        }
        mu.Unlock()

Having a single goroutine and a list of deadlines is much slower, as Go itself is much better at handling the timeouts.
My code resulted in a 5% slowdown, which is only a little better than the 9% I was seeing otherwise. I don't think it's worth looking into this further, as it's such a small improvement and would bring a lot of complexity with it. Also, as soon as you're actually doing something in your handler instead of serving a static byte slice, the overhead of the deadline code will very quickly become negligible.
