Fasthttp behind Aws load balancer. Keepalive conn are causing trouble #348

Closed
Rulox opened this issue Jun 12, 2018 · 42 comments

@Rulox

Rulox commented Jun 12, 2018

Hi!

We're using a lightweight fasthttp server as a proxy in our services infrastructure. However, we've been experiencing some issues when we put it behind an Amazon Load Balancer. Sometimes (seemingly at random) the ALB returns a 502 because the request can't reach the fasthttp service. Note that the ALB uses keepalive connections by default and that can't be changed.

After doing some research, we suspected that fasthttp was closing the keepalive connections at some point, so the ALB couldn't reuse them and would return a 502.

If we set Server.DisableKeepalive = true, everything works as expected (with a lot more load, of course).

We reduced our implementation to the minimum to test:

s := &fasthttp.Server{
	Handler:     OurHandler,
	Concurrency: fasthttp.DefaultConcurrency,
}
s.DisableKeepalive = true // If this is false, we see the error randomly.

log.Fatal(s.ListenAndServe(":" + strconv.Itoa(port)))

The handler basically does this:

// h.proxy is a *fasthttp.HostClient configured with some parameters
if err := h.proxy.Do(req, resp); err != nil {
	log.Error("error when proxying the request: ", err)
}

Is there any chance someone has experienced this? I'm not sure how we should proceed with the keepalive connections in the fasthttp.Server, as we are using pretty much all the default parameters.

Thanks in advance!

@erikdubbelboer
Collaborator

Can you try this? It's what we use with the Google load balancer:

s := &fasthttp.Server{
	Handler:              OurHandler,
	ReadTimeout:          time.Hour,
	WriteTimeout:         time.Hour,
	MaxKeepaliveDuration: time.Hour,
}

How many requests are you getting per second?

@Rulox
Author

Rulox commented Jun 12, 2018

Hey, thanks for the quick response.
We've been trying to tweak some of the timeouts too, and sometimes it seems better.

About the requests, it's weird. We have 2 instances of the same service:

  • The first one receives around 30k req/s (POST) and it's fine.
  • The other one receives only 1 req/s (GET), and it's the one that is not working correctly. (Same handler; we don't check the method.)

Thanks again, we'll try those parameters.

@erikdubbelboer
Collaborator

You might also want to try my fork of fasthttp which is actually being maintained (this original version is not maintained anymore): https://github.com/erikdubbelboer/fasthttp

@kirillDanshin
Collaborator

@erikdubbelboer if you fixed this issue, can you please send us a PR with the fix?
also, this original version is maintained again 🙂

@Hemant-Mann

Hey @erikdubbelboer
I am also using fasthttp behind the Google load balancer with your settings - #348 (comment)

But I am still sometimes getting a 502 - backend_connection_closed_before_data_sent_to_client

What could be the issue?

@kirillDanshin
Collaborator

@Hemant-Mann I think your 502 errors may occur exactly when the connection timeout happens. Please check your logs for time patterns and share some info with us.

@erikdubbelboer
Collaborator

@Hemant-Mann I'll have another look. I'm about to board a 16 hour flight so it will take a while 😄

@kirillDanshin
Collaborator

@erikdubbelboer have a good trip ;)

@Rulox
Author

Rulox commented Aug 14, 2018

Hey guys, just an FYI: we are still having the issue. The only thing we could do was disable keepalive in the fasthttp Server.

We didn't have time to try your fork @erikdubbelboer :( (Have a nice trip btw!)

PS: I can get some logs of what happens during these errors, but most of the time the request doesn't even reach the Go app (running inside a Docker container), so I don't think it's going to be possible or easy.

Thank you and glad this is being maintained again!

@kirillDanshin
Collaborator

@Rulox I'll be glad to help as soon as I get some logs.
FYI, it may be an issue with the LB configuration: you should synchronize the keep-alive time in the LB settings and in your fasthttp server configuration.

@erikdubbelboer
Collaborator

While searching for the bug I noticed that I have always misunderstood MaxKeepaliveDuration. I thought this was the max idle time between requests. But it is actually the max total time the connection can exist. This means that when you set it to 1 hour connections will always be terminated after that hour. This could explain the issue you are having. The documentation explains this well, I just never read it before 😢

I suggest setting MaxKeepaliveDuration to 0 so connections will be kept open forever. Keep ReadTimeout at a high value so connections don't easily expire while being idle. Let me know if that fixes your issue.
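
For reference, a rough standalone sketch of that suggestion; the handler and port here are placeholders rather than code from this thread:

package main

import (
	"log"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		// Placeholder handler; in this thread it would be the reverse proxy handler.
		Handler: func(ctx *fasthttp.RequestCtx) {
			ctx.SetBodyString("ok")
		},
		// Keep a high ReadTimeout so idle keep-alive connections don't expire too easily.
		ReadTimeout:  time.Hour,
		WriteTimeout: time.Hour,
		// 0 means connections are never closed just because of their total age;
		// they still go away when ReadTimeout expires or the peer closes them.
		MaxKeepaliveDuration: 0,
	}

	log.Fatal(s.ListenAndServe(":8080"))
}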

@kirillDanshin
Collaborator

@erikdubbelboer unfortunately, this shouldn't be a go-to solution, because you can end up leaking connections in the pool.

@erikdubbelboer
Collaborator

@kirillDanshin I don't see how? Connections will still time out after ReadTimeout. Unless you don't set that either, but my example included a ReadTimeout of 1 hour.

@kirillDanshin
Collaborator

@erikdubbelboer timeouts didn't work for me when using ELB: it keeps connections alive for a really long time, but never reuses them after the time configured in the ELB.

@erikdubbelboer
Collaborator

@kirillDanshin That sounds like an AWS issue then. I only use the Google load balancer, which has always worked perfectly fine so far.

@Rulox
Author

Rulox commented Aug 16, 2018

Hey guys, I don't think it has anything to do with the idle keepalive duration setting. We're also seeing the error on the first 5/6 requests when a user enters our website, so these are new connections from a new host.

Our fasthttp service acts as a proxy with a server and a client (both fasthttp, check my first comment)

    ___________                   _________________
   |  Client   | <-------------> | fasthttp Server |       ___________________________________
   |___________|                 | fasthttp Client | <--> | Other services in docker network  |
                                 |_________________|      |___________________________________|

So far this is a recap:

  • There are no logs, because the ELB returns a 502 before the request reaches our service (this only happens with this service). It's as if the ELB wants to reuse a connection that the service has already dropped, so Docker doesn't really know what to do and can't reach the destination (hence the 502 from the ELB).
  • Disabling keepalive in the fasthttp server "fixes" the problem (no more 502s, but performance is really bad, so we can't do this, as we receive hundreds of thousands of req/sec).
  • We've set the same keepalive parameters (duration, idle, etc.) in the ELB and the fasthttp server, and quality somehow improved, but we still see the errors.

I'm starting to think it might be the combination of a fasthttp server and client? That doesn't make much sense, but we have other services (in Python, for example) that are working well without disruption.

I've tried to switch from fasthttp to net/http to check if there's something we're doing wrong, but to be honest, fasthttp's performance is much better, and that's something we really need.

Also, @Hemant-Mann says he's having issues with the Google LB too, which makes me think it's not only on our end? (Hemant, are you using fasthttp just as a web server? Anything you can relate to my architecture?)

Thanks guys!

@erikdubbelboer
Collaborator

In your diagram, is Client the ELB?
Can you maybe share the Go code so we can check it?
How many new connections per second are we talking about?
Have you tried setting /proc/sys/net/core/somaxconn to 1023, for example?
Could it be that the server the fasthttp code is running on is at 100% CPU usage? I have seen overloaded servers not accepting new connections before.

I have looked over the code again, but I don't see anything that could cause this:

  • Serve just accepts new connections in a loop and hands them off to the worker pool.
  • The worker pool function just executes the connection function and closes the connection when it's done.
  • And the connection function just reads in a loop and returns after ReadTimeout.

One more thing you could try is removing this optimization. In this comment valyala mentions that the issue was fixed in Go 1.10, so the optimization shouldn't be needed anymore, and it could maybe cause issues in some cases by closing the connection before you would expect.

I think the issue for @Hemant-Mann might have been the MaxKeepaliveDuration if it only happens once in a while.

@Hemant-Mann

Hemant-Mann commented Aug 16, 2018

Hi Everyone

So I analysed my GCP load balancer logs for the last 24 hours

Additional Info:

Below are the stats (Date_Hour: error_count):

{
  "2018-08-15_21": 41,
  "2018-08-15_22": 27,
  "2018-08-15_23": 1,
  "2018-08-16_06": 2,
  "2018-08-16_07": 1,
  "2018-08-16_08": 1,
  "2018-08-16_09": 3,
  "2018-08-16_12": 1,
  "2018-08-16_14": 1,
  "2018-08-16_15": 10,
  "2018-08-16_16": 1,
  "2018-08-16_19": 5,
  "2018-08-16_21": 1
}

This shows the number of 502's generated because of backend_connection_closed_before_data_sent_to_client

@kirillDanshin
Collaborator

@Hemant-Mann please send us your CPU load stats and ulimit -n

@Hemant-Mann

@kirillDanshin I have not configured any separate monitoring for CPU stats other than what Google Cloud offers, and the backend service autoscales at 60% capacity

and ulimit -n = 100000

@erikdubbelboer
Collaborator

@Hemant-Mann autoscaling might trigger at 60%, but individual machines might still be at 100%. We had the same issue in the past. Can you check whether any machines are at 100% around the error times?

@Rulox
Author

Rulox commented Aug 16, 2018

Hey @erikdubbelboer I'm going to try to extract the main code as an example.

It happens in our pre-production env with a few connections (1 or 2) too, so we've ruled out overload (CPU, connection pool, etc.). Thanks!

@erikdubbelboer
Collaborator

If it already happens with that few connections then I think I really need your code to reproduce the issue.

@erikdubbelboer
Collaborator

@Rulox any update on this?

@bbuchanan-bcs

I use fasthttp for multiple services, all behind ALBs in AWS, although I'm probably using a year-old version. I've never seen this issue. The only settings I change from the defaults are ReadTimeout, MaxRequestBodySize and Concurrency.

@Rulox
Author

Rulox commented Aug 30, 2018

@erikdubbelboer Yeah sorry it's been a crazy month.

I made this as an example https://github.com/Rulox/proxy-tiny

That's pretty much our code (minus the business logic, which is some header manipulation and security, but nothing else).

The main.go in the root contains the reverse proxy implementation. I prepared a docker-compose.yml as well if you want to try it (see the readme).

If you see anything weird please let me know! Thanks a lot

@erikdubbelboer
Collaborator

@Rulox in the

// Here we do some things for Auth (check and add headers basically) [..]

code, do you remove the Connection header? If you don't remove it and your upstream sends back a Connection: close header, you will forward it to the AWS load balancer, which will then close the connection as well.

This proxy is very little code and should work in theory. Have you tried this simple proxy with the AWS load balancer as well, and does it cause the exact same issues?

@Rulox
Author

Rulox commented Aug 31, 2018

@erikdubbelboer thanks for the quick response. Yes, I remove the Connection header before doing the request to the proxied service, and before sending back the response to the client (the ELB in this case).

Like this:

func (h *HttpProxy) reverseProxyHandler(ctx *fasthttp.RequestCtx) {
	req := &ctx.Request
	resp := &ctx.Response

	// Strip the Connection header before proxying the request upstream.
	req.Header.Del("Connection")
	if err := h.proxy.Do(req, resp); err != nil {
		resp.SetStatusCode(fasthttp.StatusServiceUnavailable)
		fmt.Printf("error when proxying the request: %s", err)
	}
	// And strip it from the response before it goes back to the client (ELB).
	resp.Header.Del("Connection")
}

That's going to be my next test: use this code behind the ELB. Thanks for the heads up.

@Hemant-Mann

Hemant-Mann commented Sep 2, 2018

Hi Guys
I have configured fasthttp with default settings and it seems to be working just fine with GCP.

Thanks for your efforts: @erikdubbelboer and @kirillDanshin

@bslizon
Contributor

bslizon commented Sep 28, 2018

Maybe you can look at this issue moby/moby#31208 and this article https://success.docker.com/article/ipvs-connection-timeout-issue
We put our app in Docker behind an AWS ELB and found the same problem: there was always a non-zero HTTPCode_ELB_5XX_Count.
Our solution was to enable TCP keepalive with a period of 3 minutes; PR #427, which adds this, has been merged.
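
For reference, a rough sketch of that workaround, assuming the TCP keepalive support added by that PR is exposed as Server.TCPKeepalive and Server.TCPKeepalivePeriod (as in current fasthttp); the handler and port are placeholders:

package main

import (
	"log"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) { ctx.SetBodyString("ok") }, // placeholder handler
		// Send TCP-level keepalive probes on accepted connections so that idle
		// connections through the load balancer / docker network are not silently dropped.
		TCPKeepalive:       true,
		TCPKeepalivePeriod: 3 * time.Minute, // the 3 minute period mentioned above
	}

	log.Fatal(s.ListenAndServe(":8080"))
}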

@Rulox
Author

Rulox commented Sep 28, 2018

Awesome @bslizon, I'll try it, thanks.

@erikdubbelboer
Collaborator

@Rulox is this still an issue or can this be closed?

@Rulox
Author

Rulox commented Oct 16, 2018

@erikdubbelboer You can close it if you want to keep the list clean. We haven't had the time to test it, as we're swamped with different things, so I'm not sure when we will. Sorry for any inconvenience.

@erikdubbelboer
Collaborator

@Rulox Ok I'll close it for now. You can reopen it if you find the same issue in the future.

@Arnold1

Arnold1 commented Jun 29, 2019

@erikdubbelboer hi, I'm using fasthttp v1.1.0. What are the suggested settings for MaxKeepaliveDuration, ReadTimeout and WriteTimeout? I want to make sure the connections stay open as long as possible and don't get closed after MaxKeepaliveDuration...

@erikdubbelboer
Collaborator

You'll have to use 1.3.0 and set IdleTimeout.
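
A rough sketch of that suggestion on fasthttp v1.3.0 or later; the handler, port and timeout value are placeholders:

package main

import (
	"log"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) { ctx.SetBodyString("ok") }, // placeholder handler
		// IdleTimeout only limits how long a keep-alive connection may sit idle
		// between requests; there is no cap on the total lifetime of the connection.
		IdleTimeout: 10 * time.Minute, // example value, tune to your load balancer settings
	}

	log.Fatal(s.ListenAndServe(":8080"))
}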

@Arnold1

Arnold1 commented Jul 8, 2019

@erikdubbelboer what is the issue with fasthttp v1.1.0?

  • 1: is there a way to monitor how many concurrent requests I receive?
  • 2: is there a way to monitor how many hosts are connected to my fasthttp server?
  • 3: is there a way to monitor every time a new connection opens?

For 1 and 3 I can use:

// This function is intended be used by monitoring systems
func (s *Server) GetCurrentConcurrency() uint32 {
	return atomic.LoadUint32(&s.concurrency)
}

// GetOpenConnectionsCount returns a number of opened connections.
//
// This function is intended be used by monitoring systems
func (s *Server) GetOpenConnectionsCount() int32 {
	return atomic.LoadInt32(&s.open) - 1
}

@erikdubbelboer
Collaborator

@Arnold1 v1.1.0 is an old release. It's obviously always better to use a newer version to get the latest improvements and bug fixes. I just released v1.4.0; I suggest you use that.

  1. I suggest you add a simple counter to your Handler and print the difference each second.
  2. Server.GetOpenConnectionsCount returns the number of open connections, like you already found out.
  3. You can use the Server.ConnState API to keep track of this (see the sketch below).
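
A rough sketch combining points 1-3, assuming the Server.ConnState hook and Server.GetOpenConnectionsCount available in recent releases; the handler, port and reporting interval are placeholders:

package main

import (
	"log"
	"net"
	"sync/atomic"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	var inflight int64 // requests currently being served (point 1)
	var opened int64   // total connections accepted so far (point 3)

	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) {
			atomic.AddInt64(&inflight, 1)
			defer atomic.AddInt64(&inflight, -1)
			ctx.SetBodyString("ok") // placeholder handler
		},
		// ConnState is called whenever a connection changes state.
		ConnState: func(conn net.Conn, state fasthttp.ConnState) {
			if state == fasthttp.StateNew {
				atomic.AddInt64(&opened, 1)
			}
		},
	}

	// Print the counters once per second (point 2 via GetOpenConnectionsCount).
	go func() {
		for range time.Tick(time.Second) {
			log.Printf("inflight=%d opened=%d open_conns=%d",
				atomic.LoadInt64(&inflight),
				atomic.LoadInt64(&opened),
				s.GetOpenConnectionsCount())
		}
	}()

	log.Fatal(s.ListenAndServe(":8080"))
}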

@Arnold1

Arnold1 commented Jul 9, 2019

@erikdubbelboer

Ok, I will try v1.4.

  1. I tried to use server.GetCurrentConcurrency() but it is always 0. Why is that?

@erikdubbelboer
Collaborator

server.GetCurrentConcurrency() gets your current concurrency setting, which you have set to 0 to use the default. You want server.GetOpenConnectionsCount() instead.

@Arnold1

Arnold1 commented Jul 9, 2019

Isn't OpenConnections different from concurrentRequests?

I'm thinking of adding a counter to my handler:

func requestRoute(ctx *fasthttp.RequestCtx) {
	atomic.AddUint32(&concurrency, 1)
	defer atomic.AddUint32(&concurrency, ^uint32(0)) // adding ^uint32(0) decrements the counter by one

	// do stuff
}

@erikdubbelboer
Collaborator

Yes, server.GetOpenConnectionsCount() will also count idle connections. If you want the number of concurrent requests currently being served, you'll have to do it like that, yes.
