Fasthttp behind Aws load balancer. Keepalive conn are causing trouble #348

Closed
Rulox opened this issue Jun 12, 2018 · 42 comments

@Rulox

Rulox commented Jun 12, 2018

Hi!

We're using a lightweight fasthttp server as a proxy in our services infrastructure. However, we've been experiencing some issues when we put it behind an Amazon Load Balancer. Sometimes (seemingly at random) the ALB returns a 502 because the request can't reach the fasthttp service. Note that the ALB uses keepalive connections by default and that can't be changed.

After doing some research, we suspected that fasthttp was closing the keepalive connections at some point, so the ALB couldn't reuse them and would return a 502.

If we set Server.DisableKeepalive = true, everything works as expected (with a lot more load, of course).

We reduced our implementation to the minimum to test:

s := &fasthttp.Server{
	Handler:     OurHandler,
	Concurrency: fasthttp.DefaultConcurrency,
}
s.DisableKeepalive = true // If this is false, we see the error randomly.

log.Fatal(s.ListenAndServe(":" + strconv.Itoa(port)))

The handler basically does this:

// h.proxy is a *fasthttp.HostClient configured with some parameters
if err := h.proxy.Do(req, resp); err != nil {
	log.Error("error when proxying the request: ", err)
}

Is there any chance someone has experienced this? I'm not sure how we should proceed with the keepalive connections in the fasthttp.Server, as we are using pretty much all the default parameters.

Thanks in advance!

@erikdubbelboer
Collaborator

Can you try this? It's what we use with the Google load balancer:

s := &fasthttp.Server{
	Handler:              OurHandler,
	ReadTimeout:          time.Hour,
	WriteTimeout:         time.Hour,
	MaxKeepaliveDuration: time.Hour,
}

How many requests are you getting per second?

@Rulox
Author

Rulox commented Jun 12, 2018

Hey, thanks for the quick response.
We've been trying to tweak some of the timeouts too, and sometimes it seems better.

About the requests, it's weird. We have 2 instances of the same service:

  • The first one receives around 30k req/s (POST) and it's fine.
  • The other one receives only 1 req/s (GET), and it's the one that is not working correctly. (Same handler; we don't check the method.)

Thanks again, we'll try those parameters.

@erikdubbelboer
Collaborator

You might also want to try my fork of fasthttp which is actually being maintained (this original version is not maintained anymore): https://github.com/erikdubbelboer/fasthttp

@kirillDanshin
Collaborator

@erikdubbelboer if you fixed this issue, can you please send us a PR with the fix?
also, this original version is maintained again 🙂

@Hemant-Mann

Hey @erikdubbelboer
I am also using fasthttp behind the Google load balancer with your settings - #348 (comment)

But I am still sometimes getting a 502 - backend_connection_closed_before_data_sent_to_client

What could be the issue?

@kirillDanshin
Collaborator

@Hemant-Mann I think your 502 errors may occur exactly when the connection timeout happens. Please check your logs for time patterns and share some info with us.

@erikdubbelboer
Collaborator

@Hemant-Mann I'll have another look. I'm about to board a 16 hour flight so it will take a while 😄

@kirillDanshin
Collaborator

@erikdubbelboer have a good trip ;)

@Rulox
Author

Rulox commented Aug 14, 2018

Hey guys, just an FYI: we are still having the issue. The only thing we could do was disable keepalive in the fasthttp Server.

We didn't have time to try your fork @erikdubbelboer :( (Have a nice trip btw!)

PS: I can get some logs of what happens during these errors, but most of the time the request doesn't even reach the Go app (running inside a Docker container), so I don't think it's going to be possible or easy.

Thank you and glad this is being maintained again!

@kirillDanshin
Collaborator

@Rulox I'll be glad to help as soon as I get some logs.
FYI, it may be an issue with the LB configuration: you should synchronize the keep-alive time in the LB settings and in your fasthttp server configuration.

@erikdubbelboer
Collaborator

While searching for the bug I noticed that I have always misunderstood MaxKeepaliveDuration. I thought this was the max idle time between requests. But it is actually the max total time the connection can exist. This means that when you set it to 1 hour connections will always be terminated after that hour. This could explain the issue you are having. The documentation explains this well, I just never read it before 😢

I suggest setting MaxKeepaliveDuration to 0 so connections will be kept open forever. Keep ReadTimeout at a high value so connections don't easily expire while being idle. Let me know if that fixes your issue.
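
For reference, a rough standalone sketch of that suggestion; the handler and port here are placeholders rather than code from this thread:

package main

import (
	"log"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		// Placeholder handler; in this thread it would be the reverse proxy handler.
		Handler: func(ctx *fasthttp.RequestCtx) {
			ctx.SetBodyString("ok")
		},
		// Keep a high ReadTimeout so idle keep-alive connections don't expire too easily.
		ReadTimeout:  time.Hour,
		WriteTimeout: time.Hour,
		// 0 means connections are never closed just because of their total age;
		// they still go away when ReadTimeout expires or the peer closes them.
		MaxKeepaliveDuration: 0,
	}

	log.Fatal(s.ListenAndServe(":8080"))
}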

@kirillDanshin
Collaborator

@erikdubbelboer unfortunately, this shouldn't be a go-to solution, because you can end up leaking connections in the pool.

@erikdubbelboer
Collaborator

@kirillDanshin I don't see how? Connections will still time out after ReadTimeout. Unless you don't set that either, but my example included a ReadTimeout of 1 hour.

@kirillDanshin
Collaborator

@erikdubbelboer timeouts didn't work for me when using ELB: it keeps connections alive for a really long time, but never reuses them after the time configured in the ELB.

@erikdubbelboer
Collaborator

@kirillDanshin That sounds like an AWS issue then. I only use the Google load balancer, which has always worked perfectly fine so far.

@Rulox
Author

Rulox commented Aug 16, 2018

Hey guys, I don't think it has anything to do with the idle keepalive duration setting. We're also seeing the error on the first 5/6 requests when a user enters our website, so these are new connections from a new host.

Our fasthttp service acts as a proxy with a server and a client (both fasthttp, check my first comment)

    ___________                   _________________
   |  Client   | <-------------> | fasthttp Server |       ___________________________________
   |___________|                 | fasthttp Client | <--> | Other services in docker network  |
                                 |_________________|      |___________________________________|

So far this is a recap:

  • There are no logs, because the ELB returns a 502 before the request reaches our service (this only happens with this service). It's as if the ELB wants to reuse a connection that the service has already dropped, so Docker doesn't really know what to do and can't reach the destination (hence the 502 from the ELB).
  • Disabling keepalive in the fasthttp server "fixes" the problem (no more 502s, but performance is really bad, so we can't do this, as we receive hundreds of thousands of req/sec).
  • We've set the same keepalive parameters (duration, idle, etc.) in the ELB and the fasthttp server, and quality somehow improved, but we still see the errors.

I'm starting to think it might be the combination of a fasthttp server and client? That doesn't make much sense, but we have other services (in Python, for example) that are working well without disruption.

I've tried to switch from fasthttp to net/http to check if there's something we're doing wrong, but to be honest, fasthttp's performance is much better, and that's something we really need.

Also, @Hemant-Mann says he's having issues with the Google LB too, which makes me think it's not only on our end? (Hemant, are you using fasthttp just as a web server? Anything you can relate to my architecture?)

Thanks guys!

@erikdubbelboer
Collaborator

In your diagram, is Client the ELB?
Can you maybe share the Go code so we can check it?
How many new connections per second are we talking about?
Have you tried setting /proc/sys/net/core/somaxconn to 1023, for example?
Could it be that the server the fasthttp code is running on is at 100% CPU usage? I have seen overloaded servers not accepting new connections before.

I have looked over the code again, but I don't see anything that could cause this:

  • Serve just accepts new connections in a loop and hands them off to the worker pool.
  • The worker pool function just executes the connection function and closes the connection when it's done.
  • And the connection function just reads in a loop and returns after ReadTimeout.

One more thing you could try is removing this optimization. In this comment valyala mentions that the issue was fixed in Go 1.10, so the optimization shouldn't be needed anymore, and it could maybe cause issues in some cases by closing the connection before you would expect.

I think the issue for @Hemant-Mann might have been the MaxKeepaliveDuration if it only happens once in a while.

@Hemant-Mann

Hemant-Mann commented Aug 16, 2018

Hi Everyone

So I analysed my GCP load balancer logs for the last 24 hours

Additional Info:

Below are the stats (Date_Hour: error_count):

{
  "2018-08-15_21": 41,
  "2018-08-15_22": 27,
  "2018-08-15_23": 1,
  "2018-08-16_06": 2,
  "2018-08-16_07": 1,
  "2018-08-16_08": 1,
  "2018-08-16_09": 3,
  "2018-08-16_12": 1,
  "2018-08-16_14": 1,
  "2018-08-16_15": 10,
  "2018-08-16_16": 1,
  "2018-08-16_19": 5,
  "2018-08-16_21": 1
}

This shows the number of 502's generated because of backend_connection_closed_before_data_sent_to_client

@kirillDanshin
Collaborator

@Hemant-Mann please send us your CPU load stats and ulimit -n

@Hemant-Mann

@kirillDanshin I have not configured any separate monitoring for CPU stats other than what Google Cloud offers, and the backend service autoscales at 60% capacity

and ulimit -n = 100000

@erikdubbelboer
Collaborator

@Hemant-Mann autoscaling might trigger at 60%, but individual machines might still be at 100%. We had the same issue in the past. Can you check whether any machines are at 100% around the error times?

@Rulox
Author

Rulox commented Aug 16, 2018

Hey @erikdubbelboer I'm going to try to extract the main code as an example.

It happens in our pre-production env with a few connections (1 or 2) too, so we've ruled out overload (CPU, connection pool, etc.). Thanks!

@erikdubbelboer
Collaborator

If it already happens with that few connections then I think I really need your code to reproduce the issue.

@erikdubbelboer
Collaborator

@Rulox any update on this?

@bbuchanan-bcs

I use fasthttp for multiple services, all behind ALBs in AWS, although I'm probably using a year-old version. I've never seen this issue. The only settings I change from the defaults are ReadTimeout, MaxRequestBodySize and Concurrency.

@Rulox
Author

Rulox commented Aug 30, 2018

@erikdubbelboer Yeah sorry it's been a crazy month.

I made this as an example https://github.com/Rulox/proxy-tiny

That's pretty much our code (minus the business logic, which is some header manipulation and security, but nothing else).

The main.go in the root contains the reverse proxy implementation. I prepared a docker-compose.yml as well if you want to try it (see the readme).

If you see anything weird please let me know! Thanks a lot

@erikdubbelboer
Collaborator

@Rulox in the

// Here we do some things for Auth (check and add headers basically) [..]

code, do you remove the Connection header? If you don't remove it and your upstream sends back a Connection: close header, you will forward it to the AWS load balancer, which will then close the connection as well.

This proxy is very little code and should work in theory. Have you tried this simple proxy with the AWS load balancer as well, and does it cause the exact same issues?

@Rulox
Author

Rulox commented Aug 31, 2018

@erikdubbelboer thanks for the quick response. Yes, I remove the Connection header before doing the request to the proxied service, and before sending back the response to the client (the ELB in this case).

Like this:

func (h *HttpProxy) reverseProxyHandler(ctx *fasthttp.RequestCtx) {
	req := &ctx.Request
	resp := &ctx.Response

	// Strip the Connection header before proxying the request upstream.
	req.Header.Del("Connection")
	if err := h.proxy.Do(req, resp); err != nil {
		resp.SetStatusCode(fasthttp.StatusServiceUnavailable)
		fmt.Printf("error when proxying the request: %s", err)
	}
	// And strip it from the response before it goes back to the client (ELB).
	resp.Header.Del("Connection")
}

That's going to be my next test: use this code behind the ELB. Thanks for the heads up.

@Hemant-Mann

Hemant-Mann commented Sep 2, 2018

Hi Guys
I have configured fasthttp with default settings and it seems to be working just fine with GCP.

Thanks for your efforts: @erikdubbelboer and @kirillDanshin

@bslizon
Contributor

bslizon commented Sep 28, 2018

Maybe you can look at this issue moby/moby#31208 and this article https://success.docker.com/article/ipvs-connection-timeout-issue
We put our app in Docker behind an AWS ELB and found the same problem: there was always a non-zero HTTPCode_ELB_5XX_Count.
Our solution was to enable TCP keepalive with a period of 3 minutes; PR #427, which adds this, has been merged.
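
For reference, a rough sketch of that workaround, assuming the TCP keepalive support added by that PR is exposed as Server.TCPKeepalive and Server.TCPKeepalivePeriod (as in current fasthttp); the handler and port are placeholders:

package main

import (
	"log"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) { ctx.SetBodyString("ok") }, // placeholder handler
		// Send TCP-level keepalive probes on accepted connections so that idle
		// connections through the load balancer / docker network are not silently dropped.
		TCPKeepalive:       true,
		TCPKeepalivePeriod: 3 * time.Minute, // the 3 minute period mentioned above
	}

	log.Fatal(s.ListenAndServe(":8080"))
}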

@Rulox
Author

Rulox commented Sep 28, 2018

Awesome @bslizon, I'll try it, thanks.

@erikdubbelboer
Collaborator

@Rulox is this still an issue or can this be closed?

@Rulox
Author

Rulox commented Oct 16, 2018

@erikdubbelboer You can close it if you want to keep the list clean. We haven't had the time to test it, as we're swamped with different things, so I'm not sure when we will. Sorry for any inconvenience.

@erikdubbelboer
Collaborator

@Rulox Ok I'll close it for now. You can reopen it if you find the same issue in the future.

@Arnold1

Arnold1 commented Jun 29, 2019

@erikdubbelboer hi, I'm using fasthttp v1.1.0. What are the suggested settings for MaxKeepaliveDuration, ReadTimeout and WriteTimeout? I want to make sure the connections stay open as long as possible and don't get closed after MaxKeepaliveDuration...

@erikdubbelboer
Collaborator

You'll have to use 1.3.0 and set IdleTimeout.
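
A rough sketch of that suggestion on fasthttp v1.3.0 or later; the handler, port and timeout value are placeholders:

package main

import (
	"log"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) { ctx.SetBodyString("ok") }, // placeholder handler
		// IdleTimeout only limits how long a keep-alive connection may sit idle
		// between requests; there is no cap on the total lifetime of the connection.
		IdleTimeout: 10 * time.Minute, // example value, tune to your load balancer settings
	}

	log.Fatal(s.ListenAndServe(":8080"))
}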

@Arnold1

Arnold1 commented Jul 8, 2019

@erikdubbelboer what is the issue with fasthttp v1.1.0?

  • 1: is there a way to monitor how many concurrent requests I receive?
  • 2: is there a way to monitor how many hosts are connected to my fasthttp server?
  • 3: is there a way to monitor every time a new connection opens?

For 1 and 3 I can use:

// This function is intended be used by monitoring systems
func (s *Server) GetCurrentConcurrency() uint32 {
	return atomic.LoadUint32(&s.concurrency)
}

// GetOpenConnectionsCount returns a number of opened connections.
//
// This function is intended be used by monitoring systems
func (s *Server) GetOpenConnectionsCount() int32 {
	return atomic.LoadInt32(&s.open) - 1
}

@erikdubbelboer
Collaborator

@Arnold1 v1.1.0 is an old release. It's obviously always better to use a newer version to get the latest improvements and bug fixes. I just released v1.4.0; I suggest you use that.

  1. I suggest you add a simple counter to your Handler and print the difference each second.
  2. Server.GetOpenConnectionsCount returns the number of open connections, like you already found out.
  3. You can use the Server.ConnState API to keep track of this (see the sketch below).
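
A rough sketch combining points 1-3, assuming the Server.ConnState hook and Server.GetOpenConnectionsCount available in recent releases; the handler, port and reporting interval are placeholders:

package main

import (
	"log"
	"net"
	"sync/atomic"
	"time"

	"github.com/valyala/fasthttp"
)

func main() {
	var inflight int64 // requests currently being served (point 1)
	var opened int64   // total connections accepted so far (point 3)

	s := &fasthttp.Server{
		Handler: func(ctx *fasthttp.RequestCtx) {
			atomic.AddInt64(&inflight, 1)
			defer atomic.AddInt64(&inflight, -1)
			ctx.SetBodyString("ok") // placeholder handler
		},
		// ConnState is called whenever a connection changes state.
		ConnState: func(conn net.Conn, state fasthttp.ConnState) {
			if state == fasthttp.StateNew {
				atomic.AddInt64(&opened, 1)
			}
		},
	}

	// Print the counters once per second (point 2 via GetOpenConnectionsCount).
	go func() {
		for range time.Tick(time.Second) {
			log.Printf("inflight=%d opened=%d open_conns=%d",
				atomic.LoadInt64(&inflight),
				atomic.LoadInt64(&opened),
				s.GetOpenConnectionsCount())
		}
	}()

	log.Fatal(s.ListenAndServe(":8080"))
}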

@Arnold1

Arnold1 commented Jul 9, 2019

@erikdubbelboer

Ok, I will try v1.4.

  1. I tried to use server.GetCurrentConcurrency() but it is always 0. Why is that?

@erikdubbelboer
Collaborator

server.GetCurrentConcurrency() gets your current concurrency setting, which you have set to 0 to use the default. You want server.GetOpenConnectionsCount() instead.

@Arnold1

Arnold1 commented Jul 9, 2019

Isn't OpenConnections different from concurrentRequests?

I'm thinking of adding a counter to my handler:

func requestRoute(ctx *fasthttp.RequestCtx) {
	atomic.AddUint32(&concurrency, 1)
	defer atomic.AddUint32(&concurrency, ^uint32(0)) // adding ^uint32(0) decrements the counter by one

	// do stuff
}

@erikdubbelboer
Collaborator

Yes, server.GetOpenConnectionsCount() will also count idle connections. If you want the number of concurrent requests currently being served, you'll have to do it like that, yes.
