NodeHttp2Handler does not handle session failures #1525

isker · 2020-09-14T21:40:54Z

Describe the bug
NodeHttp2Handler pools its Http2Sessions per authority. An Http2Session can be closed for a variety of reasons, including those outside of the control of the client. (See the docs for the goaway event which can be triggered by the server at any time.)

Whenever a session is closed, it is invalid for further use, but NodeHttp2Handler only invalidates its cache for sessions it has closed itself. This is insufficient - if the server triggers node to close the session for any reason, the NodeHttp2Handler is invalid for that authority forever.

SDK version number
@aws-sdk/[email protected]

Is the issue in the browser/Node.js/ReactNative?
node

Details of the browser/Node.js/ReactNative version

λ node -v
v13.9.0

To Reproduce (observed behavior)
The repro steps and context for this issue are the same as in #1524.

Expected behavior
NodeHttp2Handler should invalidate its cache on the close event for any Http2Session. Compare to another popular client:
https://github.com/hisco/http2-client/blob/12a9e6fa6701a46e92e9b7adf395ce646b2a26b4/lib/request.js#L299-L303

The text was updated successfully, but these errors were encountered:

TysonAndre · 2021-03-29T16:13:45Z

I believe that a patch to fix this would look something along these lines? Note that I haven't tested or even run this - anything involving http/2 would probably require multiple unit tests - node's http/2 implementation has some edge cases in older node versions that can lead to crashes (possibly related to closing a session while it's in the process of closing)

--- a/packages/node-http-handler/src/node-http2-handler.ts
+++ b/packages/node-http-handler/src/node-http2-handler.ts
@@ -107,14 +107,31 @@ export class NodeHttp2Handler implements HttpHandler {
 
     const newSession = connect(authority);
     connectionPool.set(authority, newSession);
+    const destroySessionCb = () => {
+      this.destroySession(authority, newSession);
+    };
+    newSession.on('close' , destroySessionCb); // probably redundant?
+    newSession.on('goaway' , destroySessionCb);
+    newSession.on('error' , destroySessionCb);
+    newSession.on('frameError' , destroySessionCb);
 
     const sessionTimeout = this.sessionTimeout;
     if (sessionTimeout) {
-      newSession.setTimeout(sessionTimeout, () => {
-        newSession.close();
-        connectionPool.delete(authority);
-      });
+      newSession.setTimeout(sessionTimeout, destroySessionCb);
     }
     return newSession;
   }
+
+  /**
+   * Destroy a session and remove it from the http2 pool.
+   *
+   * This check ensures that the session is only closed once
+   * and that an event on one session does not close a different session.
+   */
+  private destroySession(authority: string, session: ClientHttp2Session): void {
+    if (this.connectionPool.get(authority) === session) {
+      this.connectionPool.delete(authority);
+      session.close();
+    }
+  }
 }

I'm moderately familiar with node's http/2 implementation - I'd proposed a fix for a similar bug in https://github.com/parse-community/node-apn/ in parse-community/node-apn#27 - the maintainers went with a different approach inspired by that code, an HTTP/2 client for APNs(Apple Push Notification Service), to tolerate spurious connection errors to HTTP/2

I'm not sure which approach you planned to take, but feel free to use this patch or base code off of the unit tests I wrote for https://github.com/parse-community/node-apn/pull/27/files#diff-ecc2f3b14c967f02fd074ac2d0e0b876e683c84f2755d251bbaa74a813a82130 (e.g. createAndStartMockServer, createAndStartMockLowLevelServer) - but I assume aws's sdk already has something similar

I'd also been involved in setting up tests for https://github.com/parse-community/node-apn/blob/master/test/client.js

I mention this because the tests of a mock server emitting goaway on the client may be relevant

Also, what's the expected impact?

I'd been assuming that a spurious aws server error or networking error may lead to prompt errors (and the setTimeout not being reached), meaning that a failed connection would result in all subsequent requests to aws for that authority failing, which has made me hesitate to start using NodeHttp2Handler for a long lived node process?

NodeHttp2Handler should invalidate its cache on the close event for any Http2Session

I think it should be done earlier, when the first error is encountered - close is only called after the session is finished being destroyed, which would only happen after all requests on that http2 connection had completed

https://nodejs.org/api/http2.html#http2_event_close

The 'close' event is emitted once the Http2Session has been destroyed. Its listener does not expect any arguments.

TysonAndre · 2021-04-06T17:19:06Z

https://github.com/hisco/http2-client/blob/12a9e6fa6701a46e92e9b7adf395ce646b2a26b4/lib/request.js#L306-L320

The linked client also checks the .closed property before closing to see if it was already closed, and removes all event listeners of the client

fix: detect errors on the NodeHttp2Handler, immediately destroy connections on unexpected error mode, and reconnect. Prior to this PR, if the server sent the client a GOAWAY frame, the session would not be removed from the connection pool and requests would fail indefinitely. This tries to avoid keeping streams(requests) on the Http2Session (tcp connection) from being stuck in an open state waiting for a gentle close even if there were unexpected protocol or connection errors during the request, assuming http2 errors are rare. (if a server or load balancer or network is misbehaving, close() might get stuck waiting for requests to finish, especially if requests and sessions don't have timeouts?) I'm only slightly familiar with http/2 client implementations from working on clients for Apple Push Notification Service. - In those, the client could rely on a session and request timeout existing, so close() would finish. In aws-sdk-js-v3, timeouts are optional. - I've seen some strange race conditions prior to node 12 in different client implementations for close and/or destroying Fixes aws#1525

hari-007 · 2021-05-09T16:38:38Z

Is this below error from client-lex-runtime-v2 also relates to this?.

It is happening after the bot session timeout time, even though I am on the putSession call !!

If so, when are you (@trivikr or @TysonAndre) planning to push this high priority labeled patch?
- Also, I saw that fix [fix(node-http-handler): handle NodeHttp2Handler session failure fix(node-http-handler): handle NodeHttp2Handler session failure #2289] got merged 4 days back and wondering what would be the release/publish a policy for this kind of important issues
If not, I want to open an issue for this as it is forcing me to restart the node server

Any simple reply would be appreciated.

TysonAndre · 2021-05-09T17:22:34Z

Is this below error from client-lex-runtime-v2 also relates to this?.

That seems likely to be related to this - the intent of the PR was to fix issues like that when http/2 was used (reconnect for destroyed sessions) so that I could safely use clients such as dynamo (and various others) with http/2

The PR you linked wasn't merged yet and will likely be fixed by that PR - 3.14.0 was published on April 30th and the PR was merged 5 days later on May 5th.

If so, when are you (@trivikr or @TysonAndre) planning to push this high priority labeled patch?

I'm an external contributor to AWS, not a maintainer, so I can't publish anything, but releases seem to typically be on fridays?

Unit tests were added to confirm that disconnections initiated by the server
resulted in the connection being removed from the connection pool.

More thorough manual tests would be useful to test all possible networking edge cases,
e.g. completely losing a connection to aws load balancers, a connection closing without a goaway frame, etc,
testing this with a high volume of traffic to see if it recovers properly, etc.,
but I'm not familiar with how to simulate that accurately in node

I'd want to have it properly tested however the aws team normally tests changes rather than rushing to merge it out

trivikr · 2021-05-11T01:14:57Z

Closing as the fix was released in v3.15.0.

Please reopen or comment if the issue still exists.

hari-007 · 2021-05-15T17:39:38Z

@trivikr : I am still able to reproduce this issue. And it is not happening with V2 though.

It is exactly happening after 5 minutes of idle action. Also, so consistent (very exact 5 minute window, it will throw the session timeout)
Everything is the same as I commented above. Not working with 3.14.0, 3.15.0, and 3.16.0 too
For now, I am switching back to V2 library as my project timeline closing. Hope someone can reproduce, fix, and release soon. @TysonAndre : Got you, thanks for the quick reply.

Demo git project to reproduce this:

Here is the demo reproducable git repo for > node-aws-lex-bot-communicator-demo
- I created this demo project to show case this session failure scenario with Lex client.
- I have given all the instructions for how to reproduce in that repo ReadMe.MD file
- Along with the session destroyed, I often got the Connection reset error like below :

TysonAndre · 2021-05-15T17:45:11Z

It is exactly happening after 5 minutes of idle action. Also, so consistent (very exact 5 minute window, it will throw the session timeout)

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-target-groups.html#target-group-attributes

I'm not sure if this is actually the issue (I'm not familiar with client-lex-runtime-v2 and don't work for aws/amazon), but I suspect it may be related to the load balancer idle timeout.

deregistration_delay.timeout_seconds
The amount of time for Elastic Load Balancing to wait before deregistering a target. The range is 0–3600 seconds. The default value is 300 seconds. (5 minutes)

TysonAndre · 2021-05-15T17:48:31Z

@hari-007 in case it's an issue that only happens on older node versions, what node major&minor version (node --version) are you using in production?

TysonAndre · 2021-05-15T17:55:13Z

+  private destroySession(authority: string, session: ClientHttp2Session): void {
+    if (this.connectionPool.get(authority) !== session) {
+      // Already closed?
+      return;
+    }
+    this.connectionPool.delete(authority);
+    session.removeAllListeners("goaway");
+    session.removeAllListeners("error");
+    session.removeAllListeners("frameError");

This seems like it may have been overly broad - the implementation was based on another one mentioned earlier in the link - https://github.com/hisco/http2-client/blob/12a9e6fa6701a46e92e9b7adf395ce646b2a26b4/lib/request.js#L306-L320
(in case the lack of listener cleanup would lead to a leak - I'm not sure if that's necessary)

I was assuming the http2 client wouldn't register its own listeners, but I'm thinking that assumption is a mistake - https://nodejs.org/api/events.html - if it was registering its own listeners before/after aws-sdk registered one or during shutdown to suppress thrown exceptions, then removeAllListeners would cause issues

TysonAndre · 2021-05-15T18:05:09Z

https://github.com/hari-007/node-aws-lex-bot-communicator-demo/blob/f80045d406506ab3f4791cf3c4954438ae7e1d2d/lex-client-v3.js

@hari-007 Is that screenshot the output from the console.log you added? Do subsequent requests to await this.lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams); succeed or fail (until the next idle timeout)

(someone with more time could run this themselves and figure it out)

It's likely you only know if a connection is reset once you actually try to send data and the remote tcp server sends a TCP RST in response

      return await lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams);
      

      /** Trail - 2 */
      // Here we are creating LexRunTimeV2 instance once per class instantiate
      // return await this.lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams);
    } catch (ex) {
      console.log(ex);
      throw ex;
    }

A possible mitigation to keep connections alive would be to call session.ping() - just the presence of this with an appropriate interval would make the load balancer less likely to shut down connections

A project I've contributed to had used this approach for push http/2 connections to a different service for push notifications - https://github.com/parse-community/node-apn/blob/c0d2bfb714fd2c428cf84654229d69783ba0a9c6/lib/client.js#L35-L45

However, I'm not sure if the lack of ping is deliberate on the part of aws (to make resource leaks less likely if an application doesn't reuse agents), though it'd be convenient to have an option to enable ping with a configurable timeout when enabling a NodeHttp2Handler (and clearInterval when destroying the handler)

hari-007 · 2021-05-15T18:36:50Z

@TysonAndre :: Much appreciate your quick response. I was just mentioned you to thank you for the reply, but you took the charge of the issue and provided a detailed analysis again. Thanks for this.

I'm not sure if this is actually the issue (I'm not familiar with client-lex-runtime-v2 and don't work for AWS/amazon), but I suspect it may be related to the load balancer idle timeout.

Yes, I too thought about the same. As this session destroy is consistent (After 5 to 6 consecutive trials with timer observed).

what node major&minor version (node --version) are you using in production?

Currently, it is my local. And building this Lex app for my freelance project (Probably my client is using >12.x.x and <15.x.x)
I am running with 12.22.1 from Mac

Is that screenshot the output from the console.log you added? Do subsequent requests to await this.lexRunTimeV2Service.recognizeText(lexServiceV2RequestParams); succeed or fail (until the next idle timeout)

Yes, for that console log. This screenshot (ECONNREST Reset) happened randomly once and then it got resolved from the next request onwards. And Subsequent requests got succeded.
But the actual session destroy forcing me to restart the server.

A possible mitigation to keep connections alive would be to call session.ping() - just the presence of this with an appropriate interval would make the load balancer less likely to shut down connections

A project I've contributed to had used this approach to push http/2 connections to a different service for push notifications - https://github.com/parse-community/node-apn/blob/c0d2bfb714fd2c428cf84654229d69783ba0a9c6/lib/client.js#L35-L45

As I have the alternative of V2 library, I am not using your approach.
But if there is no V2 library, I am sure that your time internal ping hack to keep LB connection alive would be a prominent solution until we get the session handling from node-http.

github-actions · 2021-05-30T00:13:40Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs and link to relevant comments in this thread.

isker changed the title ~~NodeHttp2Handler does not handle session failures~~ NodeHttp2Handler does not handle session failures Sep 14, 2020

AllanZhengYP added the High Priority label Oct 8, 2020

TysonAndre mentioned this issue Apr 22, 2021

fix(node-http-handler): handle NodeHttp2Handler session failure #2289

Merged

trivikr closed this as completed May 11, 2021

TysonAndre mentioned this issue May 28, 2021

Set default H2 session timeout #1206

Closed

github-actions bot locked as resolved and limited conversation to collaborators May 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

NodeHttp2Handler does not handle session failures #1525

NodeHttp2Handler does not handle session failures #1525

isker commented Sep 14, 2020 •

edited

Loading

TysonAndre commented Mar 29, 2021 •

edited

Loading

Uh oh!

TysonAndre commented Apr 6, 2021

Uh oh!

hari-007 commented May 9, 2021

Uh oh!

TysonAndre commented May 9, 2021

Uh oh!

trivikr commented May 11, 2021

Uh oh!

hari-007 commented May 15, 2021

Uh oh!

TysonAndre commented May 15, 2021 •

edited

Loading

Uh oh!

TysonAndre commented May 15, 2021

Uh oh!

TysonAndre commented May 15, 2021

Uh oh!

TysonAndre commented May 15, 2021

Uh oh!

hari-007 commented May 15, 2021 •

edited

Loading

Uh oh!

github-actions bot commented May 30, 2021

Uh oh!

NodeHttp2Handler does not handle session failures #1525

NodeHttp2Handler does not handle session failures #1525

Comments

isker commented Sep 14, 2020 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

TysonAndre commented Mar 29, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonAndre commented Apr 6, 2021

Uh oh!

hari-007 commented May 9, 2021

Uh oh!

TysonAndre commented May 9, 2021

Uh oh!

trivikr commented May 11, 2021

Uh oh!

hari-007 commented May 15, 2021

Demo git project to reproduce this:

Uh oh!

TysonAndre commented May 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TysonAndre commented May 15, 2021

Uh oh!

TysonAndre commented May 15, 2021

Uh oh!

TysonAndre commented May 15, 2021

Uh oh!

hari-007 commented May 15, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

github-actions bot commented May 30, 2021

Uh oh!

isker commented Sep 14, 2020 •

edited

Loading

TysonAndre commented Mar 29, 2021 •

edited

Loading

TysonAndre commented May 15, 2021 •

edited

Loading

hari-007 commented May 15, 2021 •

edited

Loading