
Request timeout #29


Merged
merged 22 commits into from
Oct 13, 2020

Conversation

davimacedo
Member

No description provided.

@davimacedo
Member Author

@TysonAndre what do you think?

@TysonAndre

I'm not sure if you've seen this, but https://github.com/parse-community/node-apn/pull/27/files#diff-7e3bd8e7059a21bfd255649cbfc84b33 added various integration tests for the client properly handling some of the edge cases. You may want to copy those to make sure that the code works as intended on the supported node versions.

@TysonAndre

Looking at it again,

  • The issues I remember with APNs silently terminating TCP connections were with the binary protocol, not the HTTP/2 protocol. It's possible but not certain HTTP/2 has similar issues.
  • I'm still learning about http/2 in node

The node library seems fairly low level and seems to leave it to applications to detect dead connections and reconnect.

The golang library seems more mature (it's a completely different language, but looking at other languages may help in thinking up robust solutions). It internally maintains a connection pool, and to detect dead connections it sends a PING frame and checks whether a ping response (PONG) arrived within 15 seconds of that ping. https://github.com/golang/go/blob/f8d80977b784fd4879963e61dc9fca1fc9bf2193/src/net/http/h2_bundle.go#L7202-L7214
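As an illustration only (this is not node-apn's code), the same PING-based liveness probe can be sketched with node's http2 API, where `session.ping(callback)` delivers the PONG result to a callback; the helper name and the deadline are assumptions:

```javascript
// Sketch: probe an http2.ClientHttp2Session the way Go's transport does.
// Resolves true if a PONG arrives within `timeoutMs`, false otherwise.
function pingWithTimeout(session, timeoutMs) {
  return new Promise(resolve => {
    let settled = false;
    const timer = setTimeout(() => {
      if (!settled) {
        settled = true;
        resolve(false); // no PONG in time: treat the connection as dead
      }
    }, timeoutMs);
    // node's session.ping invokes the callback with (err, duration, payload)
    // once the peer answers the PING frame (or with an error).
    session.ping(err => {
      if (!settled) {
        settled = true;
        clearTimeout(timer);
        resolve(!err);
      }
    });
  });
}
```

A caller could then recycle the session when the probe fails, e.g. `if (!(await pingWithTimeout(session, 15000))) session.destroy();`.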

(In my experience, https://github.com/uniqush/uniqush-push/ (https://github.com/uniqush/uniqush-push/blob/master/srv/apns/http_api/processor.go) has been relatively stable, and these issues were new to me when working on a project using node-apn, but that may just be a consequence of other factors, such as golang's http2 library vs node's. It could also be some other factor, like the OS, or the system load being lower.)

@isidai

isidai commented Oct 6, 2020

I’m currently using @parse-community/node-apn version 3.1.0. I came here because I also encountered an issue with the Promise not resolving after a timeout.

I enabled debug logging and got the following logs while the problem was occurring:

apn Request error: Error [ERR_HTTP2_STREAM_CANCEL]: The pending stream has been canceled (caused by: connect ETIMEDOUT xxx.xxx.xxx.xxx:443)
apn Request ended with status null and responseData: 
apn Session error: Error: connect ETIMEDOUT xxx.xxx.xxx.xxx:443
apn Session closed

Looking into this pull request, it's not really the root cause of the issue, but I did notice a few problems.

  1. The duration passed to the this.session.ping callback is undefined in the event of an error, so on the error path the duration is never properly set.

  2. When an error is caught, the ping response duration was found to be undefined:
    "No ping response after undefined ms"
    "Ping response after undefined ms"

There also didn't seem to be any recovery process for errors that could occur during the health check.
https://github.com/parse-community/node-apn/blob/master/lib/provider.js#L45

  1. The shutdown function in provider.js does not pass any arguments.
    The client.js shutdown allows you to pass a callback, but you can't pass a callback from the provider.

  2. There's also a discrepancy with the "Note" in the readme (link 1) stemming from (link 2)
    Link 1: https://github.com/parse-community/node-apn/blob/master/doc/provider.markdown#connectionshutdown
    Link 2: 

    if (this.isDestroyed) {
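To illustrate points 1 and 2 above, a health-check tick that guards the error path could look roughly like this (a hypothetical sketch, not this PR's code; `recycle` stands in for whatever reconnect logic the provider would use):

```javascript
// Sketch: one health-check tick that never logs an undefined duration
// and recovers when the ping itself fails or throws.
function checkSessionHealth(session, recycle, log = console.error) {
  try {
    session.ping((error, duration) => {
      if (error) {
        // `duration` is not meaningful on the error path, so don't log it
        log(`No ping response (${error.message}), recycling session`);
        recycle();
      }
    });
  } catch (error) {
    // session.ping throws if the session is already closed or destroyed
    log(`Health check failed (${error.message}), recycling session`);
    recycle();
  }
}
```

This would run on an interval, e.g. `setInterval(() => checkSessionHealth(session, reconnect), 5000)`.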

@davimacedo
Member Author

davimacedo commented Oct 6, 2020

Hi guys.

I believe that there is a lot to be improved here, and the idea of managing a pool of connections would be the best. But, at least for my case, this PR seems to correctly address the issue with stuck pushes. From time to time the pushes were stuck, and my understanding now is that this happens because, if for some reason Apple does not respond to the request, the request is never closed and the promise never resolves. I tested this new PR and the problem stopped happening. I've sent about 1M push notifications. I also enabled the debug logs while sending the pushes and noticed that some of them timed out, but the process resumed smoothly for the next requests.

I'm only afraid that if we start addressing a lot of different issues (most of them regarding the legacy code) in this PR, we will end up taking a lot of time to solve the main issue.

So my suggestion is: let's list the showstopper items we should fix before merging this PR, fix them, and address the other ones in future PRs. What do you think we really need to fix before merging this PR?

Also, please feel free to either commit suggested changes to this PR or open new PRs for the other issues.

It would also be good if a few of you could test this branch in your environments as well.

@davimacedo davimacedo requested a review from dplewis October 8, 2020 05:16
@davimacedo
Member Author

@dplewis @funkenstrahlen would you mind taking a look here as well? I'd like to have it merged for the next version of Parse Server.

@TysonAndre

I'm not sure if you've seen this, but https://github.com/parse-community/node-apn/pull/27/files#diff-7e3bd8e7059a21bfd255649cbfc84b33 added various integration tests for the client properly handling some of the edge cases. You may want to copy those to make sure that the code works as intended on the supported node versions.

The test case should have caught the this.destroy bug if it had been added. The test cases that seem unrealistic, such as pathologically slow responses, could be removed or have their expectations adjusted (e.g. expect all promises to be resolved and some minimum/maximum number to get successful responses).

@davimacedo
Member Author

I'm not sure if you've seen this, but https://github.com/parse-community/node-apn/pull/27/files#diff-7e3bd8e7059a21bfd255649cbfc84b33 added various integration tests for the client properly handling some of the edge cases. You may want to copy those to make sure that the code works as intended on the supported node versions.

The test case should have caught the this.destroy bug if it had been added. The test cases that seem unrealistic, such as pathologically slow responses, could be removed or have their expectations adjusted (e.g. expect all promises to be resolved and some minimum/maximum number to get successful responses).

I've included the tests. I just removed the one that tries to simulate a loaded server.


@TysonAndre left a comment


Looks good overall.

I would prefer weakening the test of an overloaded server (e.g. only check that all promises resolve) over removing the test due to issues (in the tested code?), but this is not a blocker since this is a large PR.
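The weakened expectation suggested here could be expressed as a small helper that, instead of asserting timings against a pathologically slow mock server, only checks how many of the send promises settled successfully (names are illustrative, not from the PR's test suite):

```javascript
// Sketch: summarize an array of promises without assuming anything about
// timing; a test then asserts only that everything settled.
async function settleSummary(promises) {
  const results = await Promise.allSettled(promises);
  return {
    total: results.length,
    fulfilled: results.filter(r => r.status === 'fulfilled').length,
  };
}
```

A test could then assert `summary.total === notifications.length` and put loose minimum/maximum bounds on `summary.fulfilled`.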

@davimacedo
Member Author

davimacedo commented Oct 9, 2020

Looks good overall.

I would prefer weakening the test of an overloaded server (e.g. only check that all promises resolve) over removing the test due to issues (in the tested code?), but this is not a blocker since this is a large PR.

In this case, the test was simulating a scenario in which APNs servers would delay more and more over time while continuing to send successful responses. I am not sure it can happen in the real world, and the current PR does not address this issue. If something takes longer than the timeout threshold, it simply cancels the request and does not try to predict whether it is the server slowing down. I'd go as is for now and try to improve later. In fact, today I noticed that one of my batches had a lot of timeouts in sequence and eventually resumed. The good news is that the process didn't get stuck and the server continued sending pushes normally. For some reason, APNs servers have had intermittent bad performance at some times of the day for approximately the last 30 days, and that's probably why we started having problems. But I'm not sure what would be the best approach to overcome this scenario. Maybe recycle the connection after X timeouts? Maybe retry the timed-out requests after some time? I'm not sure that simply waiting longer is the right way to solve it. I believe those requests are never responded to, and that's why the process was stuck.

@petitchevalroux

Just to give you some feedback: we have been using this PR in production since last Thursday with success. The hanging problem seems to be resolved. Thanks for your work.

@davimacedo
Member Author

Just to give you some feedback: we have been using this PR in production since last Thursday with success. The hanging problem seems to be resolved. Thanks for your work.

It is good to know!

Guys, if no other concerns arise, I am ready to move forward with merging this PR. Please let me know.


@TysonAndre left a comment


LGTM other than the event listener leak when the environment variable DEBUG=apn is set

@TysonAndre

The latest 5 commits seem reasonable; I haven't tested them, but I don't expect any issues.

@TysonAndre

LGTM for the latest changes, logs seem reasonable for the test output with DEBUG=apn for common use cases

@davimacedo
Member Author

Nice. Thank you so much for all your help! And let's continue addressing the other ideas you've raised. Feel free to open new PRs as well.

@davimacedo davimacedo merged commit 94cc9e4 into master Oct 13, 2020
@mtrezza mtrezza deleted the requestTimeout branch March 27, 2022 11:47