Skip to content

NodeHttp2Handler does not handle session failures #1525

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
isker opened this issue Sep 14, 2020 · 12 comments
Closed

NodeHttp2Handler does not handle session failures #1525

isker opened this issue Sep 14, 2020 · 12 comments

Comments

@isker
Copy link
Contributor

isker commented Sep 14, 2020

Describe the bug
NodeHttp2Handler pools its Http2Sessions per authority. An Http2Session can be closed for a variety of reasons, including those outside of the control of the client. (See the docs for the goaway event which can be triggered by the server at any time.)

Whenever a session is closed, it is invalid for further use, but NodeHttp2Handler only invalidates its cache for sessions it has closed itself. This is insufficient - if the server triggers node to close the session for any reason, the NodeHttp2Handler is invalid for that authority forever.

SDK version number
@aws-sdk/[email protected]

Is the issue in the browser/Node.js/ReactNative?
node

Details of the browser/Node.js/ReactNative version

λ node -v
v13.9.0

To Reproduce (observed behavior)
The repro steps and context for this issue are the same as in #1524.

Expected behavior
NodeHttp2Handler should invalidate its cache on the close event for any Http2Session. Compare to another popular client:
https://github.com/hisco/http2-client/blob/12a9e6fa6701a46e92e9b7adf395ce646b2a26b4/lib/request.js#L299-L303

@isker isker changed the title NodeHttp2Handler does not handle session failures NodeHttp2Handler does not handle session failures Sep 14, 2020
@TysonAndre
Copy link
Contributor

TysonAndre commented Mar 29, 2021

I believe that a patch to fix this would look something along these lines? Note that I haven't tested or even run this - anything involving http/2 would probably require multiple unit tests - node's http/2 implementation has some edge cases in older node versions that can lead to crashes (possibly related to closing a session while it's in the process of closing)

--- a/packages/node-http-handler/src/node-http2-handler.ts
+++ b/packages/node-http-handler/src/node-http2-handler.ts
@@ -107,14 +107,31 @@ export class NodeHttp2Handler implements HttpHandler {
 
     const newSession = connect(authority);
     connectionPool.set(authority, newSession);
+    const destroySessionCb = () => {
+      this.destroySession(authority, newSession);
+    };
+    newSession.on('close' , destroySessionCb); // probably redundant?
+    newSession.on('goaway' , destroySessionCb);
+    newSession.on('error' , destroySessionCb);
+    newSession.on('frameError' , destroySessionCb);
 
     const sessionTimeout = this.sessionTimeout;
     if (sessionTimeout) {
-      newSession.setTimeout(sessionTimeout, () => {
-        newSession.close();
-        connectionPool.delete(authority);
-      });
+      newSession.setTimeout(sessionTimeout, destroySessionCb);
     }
     return newSession;
   }
+
+  /**
+   * Destroy a session and remove it from the http2 pool.
+   *
+   * This check ensures that the session is only closed once
+   * and that an event on one session does not close a different session.
+   */
+  private destroySession(authority: string, session: ClientHttp2Session): void {
+    if (this.connectionPool.get(authority) === session) {
+      this.connectionPool.delete(authority);
+      session.close();
+    }
+  }
 }

I'm moderately familiar with node's http/2 implementation - I'd proposed a fix for a similar bug in https://github.com/parse-community/node-apn/ in parse-community/node-apn#27 - the maintainers went with a different approach inspired by that code, an HTTP/2 client for APNs(Apple Push Notification Service), to tolerate spurious connection errors to HTTP/2

I'm not sure which approach you planned to take, but feel free to use this patch or base code off of the unit tests I wrote for https://github.com/parse-community/node-apn/pull/27/files#diff-ecc2f3b14c967f02fd074ac2d0e0b876e683c84f2755d251bbaa74a813a82130 (e.g. createAndStartMockServer, createAndStartMockLowLevelServer) - but I assume aws's sdk already has something similar

I'd also been involved in setting up tests for https://github.com/parse-community/node-apn/blob/master/test/client.js

  • I mention this because the tests of a mock server emitting goaway on the client may be relevant

Also, what's the expected impact?

I'd been assuming that a spurious aws server error or networking error may lead to prompt errors (and the setTimeout not being reached), meaning that a failed connection would result in all subsequent requests to aws for that authority failing, which has made me hesitate to start using NodeHttp2Handler for a long lived node process?


NodeHttp2Handler should invalidate its cache on the close event for any Http2Session

I think it should be done earlier, when the first error is encountered - close is only called after the session is finished being destroyed, which would only happen after all requests on that http2 connection had completed

https://nodejs.org/api/http2.html#http2_event_close

The 'close' event is emitted once the Http2Session has been destroyed. Its listener does not expect any arguments.

@TysonAndre
Copy link
Contributor

https://github.com/hisco/http2-client/blob/12a9e6fa6701a46e92e9b7adf395ce646b2a26b4/lib/request.js#L306-L320

The linked client also checks the .closed property before closing to see if it was already closed, and removes all event listeners of the client

TysonAndre added a commit to TysonAndre/aws-sdk-js-v3 that referenced this issue Apr 22, 2021
fix: detect errors on the NodeHttp2Handler, immediately destroy
connections on unexpected error mode, and reconnect.

Prior to this PR, if the server sent the client a GOAWAY frame,
the session would not be removed from the connection pool and requests
would fail indefinitely.

This tries to avoid keeping streams(requests) on the Http2Session
(tcp connection) from being stuck in an open state waiting for a gentle
close even if there were unexpected protocol or connection errors
during the request, assuming http2 errors are rare.
(if a server or load balancer or network is misbehaving,
close() might get stuck waiting for requests to finish,
especially if requests and sessions don't have timeouts?)

I'm only slightly familiar with http/2 client implementations from
working on clients for Apple Push Notification Service.

- In those, the client could rely on a session and request timeout
  existing, so close() would finish. In aws-sdk-js-v3, timeouts are optional.
- I've seen some strange race conditions prior to node 12 in different
  client implementations for close and/or destroying

Fixes aws#1525
TysonAndre added a commit to TysonAndre/aws-sdk-js-v3 that referenced this issue Apr 22, 2021
fix: detect errors on the NodeHttp2Handler, immediately destroy
connections on unexpected error mode, and reconnect.

Prior to this PR, if the server sent the client a GOAWAY frame,
the session would not be removed from the connection pool and requests
would fail indefinitely.

This tries to avoid keeping streams(requests) on the Http2Session
(tcp connection) from being stuck in an open state waiting for a gentle
close even if there were unexpected protocol or connection errors
during the request, assuming http2 errors are rare.
(if a server or load balancer or network is misbehaving,
close() might get stuck waiting for requests to finish,
especially if requests and sessions don't have timeouts?)

I'm only slightly familiar with http/2 client implementations from
working on clients for Apple Push Notification Service.

- In those, the client could rely on a session and request timeout
  existing, so close() would finish. In aws-sdk-js-v3, timeouts are optional.
- I've seen some strange race conditions prior to node 12 in different
  client implementations for close and/or destroying

Fixes aws#1525
@hari-007
Copy link

hari-007 commented May 9, 2021

Is this below error from client-lex-runtime-v2 also relates to this?.

image

It is happening after the bot session timeout time, even though I am on the putSession call !!

  • If so, when are you (@trivikr or @TysonAndre) planning to push this high priority labeled patch?

  • If not, I want to open an issue for this as it is forcing me to restart the node server

Any simple reply would be appreciated.

@TysonAndre
Copy link
Contributor

Is this below error from client-lex-runtime-v2 also relates to this?.

That seems likely to be related to this - the intent of the PR was to fix issues like that when http/2 was used (reconnect for destroyed sessions) so that I could safely use clients such as dynamo (and various others) with http/2

The PR you linked wasn't merged yet and will likely be fixed by that PR - 3.14.0 was published on April 30th and the PR was merged 5 days later on May 5th.

If so, when are you (@trivikr or @TysonAndre) planning to push this high priority labeled patch?

I'm an external contributor to AWS, not a maintainer, so I can't publish anything, but releases seem to typically be on fridays?

Unit tests were added to confirm that disconnections initiated by the server
resulted in the connection being removed from the connection pool.

More thorough manual tests would be useful to test all possible networking edge cases,
e.g. completely losing a connection to aws load balancers, a connection closing without a goaway frame, etc,
testing this with a high volume of traffic to see if it recovers properly, etc.,
but I'm not familiar with how to simulate that accurately in node

I'd want to have it properly tested however the aws team normally tests changes rather than rushing to merge it out

@trivikr
Copy link
Member

trivikr commented May 11, 2021

Closing as the fix was released in v3.15.0.

Please reopen or comment if the issue still exists.

@trivikr trivikr closed this as completed May 11, 2021
@hari-007
Copy link

@trivikr : I am still able to reproduce this issue. And it is not happening with V2 though.

  • It is exactly happening after 5 minutes of idle action. Also, so consistent (very exact 5 minute window, it will throw the session timeout)
  • Everything is the same as I commented above. Not working with 3.14.0, 3.15.0, and 3.16.0 too
  • For now, I am switching back to V2 library as my project timeline closing. Hope someone can reproduce, fix, and release soon. @TysonAndre : Got you, thanks for the quick reply.

Demo git project to reproduce this:

  • Here is the demo reproducable git repo for > node-aws-lex-bot-communicator-demo
    • I created this demo project to show case this session failure scenario with Lex client.
    • I have given all the instructions for how to reproduce in that repo ReadMe.MD file
    • Along with the session destroyed, I often got the Connection reset error like below :

image

@TysonAndre
Copy link
Contributor

TysonAndre commented May 15, 2021

It is exactly happening after 5 minutes of idle action. Also, so consistent (very exact 5 minute window, it will throw the session timeout)

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#target-group-attributes

I'm not sure if this is actually the issue (I'm not familiar with client-lex-runtime-v2 and don't work for aws/amazon), but I suspect it may be related to the load balancer idle timeout.

deregistration_delay.timeout_seconds
The amount of time for Elastic Load Balancing to wait before deregistering a target. The range is 0–3600 seconds. The default value is 300 seconds. (5 minutes)

@TysonAndre
Copy link
Contributor

@hari-007 in case it's an issue that only happens on older node versions, what node major&minor version (node --version) are you using in production?

@TysonAndre
Copy link
Contributor

+  private destroySession(authority: string, session: ClientHttp2Session): void {
+    if (this.connectionPool.get(authority) !== session) {
+      // Already closed?
+      return;
+    }
+    this.connectionPool.delete(authority);
+    session.removeAllListeners("goaway");
+    session.removeAllListeners("error");
+    session.removeAllListeners("frameError");

This seems like it may have been overly broad - the implementation was based on another one mentioned earlier in the link - https://github.com/hisco/http2-client/blob/12a9e6fa6701a46e92e9b7adf395ce646b2a26b4/lib/request.js#L306-L320
(in case the lack of listener cleanup would lead to a leak - I'm not sure if that's necessary)

I was assuming the http2 client wouldn't register its own listeners, but I'm thinking that assumption is a mistake - https://nodejs.org/api/events.html - if it was registering its own listeners before/after aws-sdk registered one or during shutdown to suppress thrown exceptions, then removeAllListeners would cause issues

@TysonAndre
Copy link
Contributor

https://github.com/hari-007/node-aws-lex-bot-communicator-demo/blob/f80045d406506ab3f4791cf3c4954438ae7e1d2d/lex-client-v3.js

image

@hari-007 Is that screenshot the output from the console.log you added? Do subsequent requests to await this.lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams); succeed or fail (until the next idle timeout)

(someone with more time could run this themselves and figure it out)

It's likely you only know if a connection is reset once you actually try to send data and the remote tcp server sends a TCP RST in response

      return await lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams);
      

      /** Trail - 2 */
      // Here we are creating LexRunTimeV2 instance once per class instantiate
      // return await this.lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams);
    } catch (ex) {
      console.log(ex);
      throw ex;
    }

A possible mitigation to keep connections alive would be to call session.ping() - just the presence of this with an appropriate interval would make the load balancer less likely to shut down connections

A project I've contributed to had used this approach for push http/2 connections to a different service for push notifications - https://github.com/parse-community/node-apn/blob/c0d2bfb714fd2c428cf84654229d69783ba0a9c6/lib/client.js#L35-L45

However, I'm not sure if the lack of ping is deliberate on the part of aws (to make resource leaks less likely if an application doesn't reuse agents), though it'd be convenient to have an option to enable ping with a configurable timeout when enabling a NodeHttp2Handler (and clearInterval when destroying the handler)

@hari-007
Copy link

hari-007 commented May 15, 2021

@TysonAndre :: Much appreciate your quick response. I was just mentioned you to thank you for the reply, but you took the charge of the issue and provided a detailed analysis again. Thanks for this.

I'm not sure if this is actually the issue (I'm not familiar with client-lex-runtime-v2 and don't work for AWS/amazon), but I suspect it may be related to the load balancer idle timeout.

  • Yes, I too thought about the same. As this session destroy is consistent (After 5 to 6 consecutive trials with timer observed).

what node major&minor version (node --version) are you using in production?

  • Currently, it is my local. And building this Lex app for my freelance project (Probably my client is using >12.x.x and <15.x.x)
  • I am running with 12.22.1 from Mac

Is that screenshot the output from the console.log you added? Do subsequent requests to await this.lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams); succeed or fail (until the next idle timeout)

  • Yes, for that console log. This screenshot (ECONNREST Reset) happened randomly once and then it got resolved from the next request onwards. And Subsequent requests got succeded.
  • But the actual session destroy forcing me to restart the server.

A possible mitigation to keep connections alive would be to call session.ping() - just the presence of this with an appropriate interval would make the load balancer less likely to shut down connections

A project I've contributed to had used this approach to push http/2 connections to a different service for push notifications - https://github.com/parse-community/node-apn/blob/c0d2bfb714fd2c428cf84654229d69783ba0a9c6/lib/client.js#L35-L45

  • As I have the alternative of V2 library, I am not using your approach.
  • But if there is no V2 library, I am sure that your time internal ping hack to keep LB connection alive would be a prominent solution until we get the session handling from node-http.

@github-actions
Copy link

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators May 30, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

5 participants