TransferManager hitting "Connection Reset" #373

Closed
rcoh opened this issue Mar 6, 2015 · 20 comments

rcoh commented Mar 6, 2015

We're using the transfer manager to download files, but we periodically run into the following error (full stack trace at bottom):

com.amazonaws.AmazonClientException: Unable to store object contents to disk: Connection reset

The documentation seems to indicate that this happens when a connection is reused too many times:

Also, don't overuse a connection. Amazon S3 will accept up to 100 requests before it closes a connection (resulting in 'connection reset'). Rather than having this happen, use a connection for 80-90 requests before closing and re-opening a new connection.

Obviously we can retry these requests, but that isn't ideal. Is something we're doing wrong with the library causing this, or is the library not managing connections properly?

com.amazonaws.AmazonClientException: Unable to store object contents to disk: Connection reset
    at com.amazonaws.services.s3.internal.ServiceUtils.downloadObjectToFile(ServiceUtils.java:270)
    at com.amazonaws.services.s3.internal.ServiceUtils.retryableDownloadS3ObjectToFile(ServiceUtils.java:344)
    at com.amazonaws.services.s3.transfer.TransferManager$2.call(TransferManager.java:731)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
    at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
    at sun.security.ssl.InputRecord.read(InputRecord.java:509)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
    at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
    at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151)
    at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
    at java.io.FilterInputStream.read(FilterInputStream.java:107)
    at com.amazonaws.services.s3.internal.ServiceUtils.downloadObjectToFile(ServiceUtils.java:265)
    ... 6 more
@rcoh rcoh closed this as completed Mar 6, 2015
@rcoh rcoh reopened this Mar 6, 2015
@david-at-aws
Contributor

According to that stack trace you've made it to the point where you're reading the object content, which means S3 sent you a successful response; the resets aren't happening because we've accidentally exceeded the service's 100-requests-per-connection limit. Something else is going wrong that's resetting the connection mid-download. Unfortunately it's hard to say what - connections can be reset by anyone on the network between you and S3 for many different reasons. :(

If you can grab request ids for some of these failed requests (maybe by turning on Apache HttpClient header logging?), the S3 team can double-check what's happening on their end for these requests. If you can grab a packet capture from your end that would also be interesting (although it'll presumably be quite large if you're pulling a bunch of data from S3).
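
For reference, with the SDK's default Apache HttpClient transport, header logging can be switched on through commons-logging's SimpleLog system properties (a sketch only; with log4j or logback you would instead set the org.apache.http.headers logger to DEBUG). The x-amz-request-id and x-amz-id-2 response headers are what the S3 team would need:

// Set these before the S3 client is created; HTTP headers are then logged to stderr,
// including the x-amz-request-id / x-amz-id-2 response headers.
System.setProperty("org.apache.commons.logging.Log",
        "org.apache.commons.logging.impl.SimpleLog");
System.setProperty("org.apache.commons.logging.simplelog.showdatetime", "true");
System.setProperty(
        "org.apache.commons.logging.simplelog.log.org.apache.http.headers", "debug");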

If you can't uncover the root cause of the resets, retries are your only recourse. The TransferManager has retries built in, but it's explicitly not retrying on SocketException. Strange... I'll chase down whether it's safe for us to add retries on SocketException in a future release - from a quick glance it seems like we should be able to.

rcoh commented Mar 12, 2015

Thanks. Unfortunately given our data volumes and the rarity of this exception we can't feasibly do that amount of logging. Retrying should work fine though. Thanks!

@rcoh rcoh closed this as completed Mar 12, 2015
@acmcelwee

I'm seeing this pretty consistently as well. I'm using transferManager.downloadDirectory to download directories with a mix of file sizes, from a few kilobytes to a few gigabytes; the total directory sizes range from 40-80G. For now I've wrapped my downloadDirectory calls in retry logic, but from reading the downloadDirectory code, retrying at that level means losing any progress made so far, since it passes false for resumeExistingDownload.

Retrying within the TransferManager layer would help a lot by executing the retries at the S3 object level, retaining the progress already made on the overall directory. Either handling retries of SocketException or allowing resumable calls of downloadDirectory would be enough to resolve the issue for me.

AWS SDK Version: 1.9.39
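
A minimal sketch of that kind of wrapper (hypothetical names; each retry starts the whole directory download over from scratch):

import java.io.File;
import com.amazonaws.AmazonClientException;
import com.amazonaws.services.s3.transfer.MultipleFileDownload;
import com.amazonaws.services.s3.transfer.TransferManager;

public class DirectoryDownloadRetry {
    // Retry the whole downloadDirectory call; progress on completed objects is lost each time.
    static void downloadWithRetry(TransferManager tm, String bucket, String keyPrefix,
                                  File destDir, int maxAttempts) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                MultipleFileDownload dl = tm.downloadDirectory(bucket, keyPrefix, destDir);
                dl.waitForCompletion();   // throws AmazonClientException if any object fails
                return;
            } catch (AmazonClientException e) {
                if (attempt >= maxAttempts) throw e;   // give up after maxAttempts tries
            }
        }
    }
}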

EDIT:
I'll also mention that I've tried tweaking the following settings in the ClientConfiguration of the s3 client passed to the transfer manager:

  • raising/lowering socket and connection timeouts
  • lowering the connection TTL
  • raising/lowering the max connections

I think the one that helped the most was lowering the connection TTL, but the issue still persists. I'll also note that in all cases I'm running the transfer manager on a host in EC2.
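
For anyone else trying the same knobs, they correspond to ClientConfiguration roughly as in the sketch below (values are placeholders, not recommendations):

import com.amazonaws.ClientConfiguration;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.transfer.TransferManager;

public class TunedTransferManager {
    static TransferManager create() {
        ClientConfiguration conf = new ClientConfiguration();
        conf.setSocketTimeout(60 * 1000);       // socket timeout, ms (SDK default is 50s)
        conf.setConnectionTimeout(10 * 1000);   // connect timeout, ms
        conf.setConnectionTTL(60 * 1000);       // recycle pooled connections after 60s
        conf.setMaxConnections(25);             // connection pool size (SDK default is 50)
        return new TransferManager(new AmazonS3Client(conf));   // default credential chain
    }
}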

@rcoh rcoh reopened this Jun 3, 2015
@david-at-aws
Contributor

Adding an option to do resumable downloadDirectory calls sounds like a good idea. I'll add that to our backlog, or would be happy to take a look at a pull request if you'd like to put one together. I'd also still very much like to figure out a way to retry on SocketExceptions without causing trouble for aborts, but haven't had a chance to dig into it yet.

@acmcelwee

@david-at-aws PR here. However, I looked for the test suite for the s3 and TransferManager stuff, but I'm not finding anything. Is there a suite that I can validate my PR against and update w/ a test for the resume functionality? What's the common contributor workflow for people validating their changes w/ their own projects? Publish an artifact to either local or my own nexus repo?

@david-at-aws
Contributor

We've got a test suite internally that we run changes through before merging them. We realize it'd be much better for you to be able to run the tests yourself before sending us the PR, and we're working on separating it from some Amazon-internal infrastructure it currently depends on so we can publish it on GitHub - unfortunately not quite there yet.

@acmcelwee

Got it. Do you publish nightly snapshots that I could pull into a project?

Also, an update on the original issue: enabling TCP keepalive in the ClientConfiguration and lowering my connection TTL seems to be the most effective combination I've found for reducing the frequency of the "Connection Reset" exception.
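
In ClientConfiguration terms that combination is roughly (extending the sketch above; values are examples only):

conf.setUseTcpKeepAlive(true);      // enable TCP keepalive on the pooled connections
conf.setConnectionTTL(30 * 1000);   // recycle pooled connections after 30 seconds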

l15k4 commented Oct 13, 2015

Shouldn't the RetryPolicy issue a new request when this happens, so that the exception doesn't bubble up and the file is downloaded again if it failed before?

l15k4 commented Oct 13, 2015

I'm using version 1.10.26, and when I download in parallel with 3 threads (each having its own client instance) I get the following error very frequently:

Caused by: java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:209) ~[na:1.8.0_45]
at java.net.SocketInputStream.read(SocketInputStream.java:141) ~[na:1.8.0_45]
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198) ~[gwiq.jar:0.9-SNAPSHOT]
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178) ~[gwiq.jar:0.9-SNAPSHOT]
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:137) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.event.ProgressInputStream.read(ProgressInputStream.java:151) ~[gwiq.jar:0.9-SNAPSHOT]
at java.security.DigestInputStream.read(DigestInputStream.java:161) ~[na:1.8.0_45]
at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59) ~[gwiq.jar:0.9-SNAPSHOT]
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72) ~[gwiq.jar:0.9-SNAPSHOT]
at java.io.BufferedInputStream.fill(BufferedInputStream.java:246) ~[na:1.8.0_45]
at java.io.BufferedInputStream.read1(BufferedInputStream.java:286) ~[na:1.8.0_45]
at java.io.BufferedInputStream.read(BufferedInputStream.java:345) ~[na:1.8.0_45]
at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[na:1.8.0_45]

l15k4 commented Oct 13, 2015

I noticed that this happens in a "lazy" environment, when the stream isn't read right after the connection returns the response. For instance, if you create an Iterator of S3Objects and read their content lazily while iterating...
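
To make that concrete, the sketch below (hypothetical names) buffers open S3Object handles and only drains them later; each response's socket sits idle in the meantime, which is exactly the window where the reset shows up. Reading or copying the content immediately after getObject avoids it:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3Object;

public class LazyReadAntiPattern {
    static void lazyRead(AmazonS3 s3, String bucket, List<String> keys) throws IOException {
        List<S3Object> open = new ArrayList<S3Object>();
        for (String key : keys) {
            open.add(s3.getObject(bucket, key));   // response received; socket now sits idle
        }
        // ...other work happens here while the open streams go unread...
        byte[] buf = new byte[8192];
        for (S3Object obj : open) {
            try (InputStream in = obj.getObjectContent()) {
                while (in.read(buf) != -1) {
                    // "Connection reset" tends to surface here, long after getObject returned
                }
            }
        }
    }
}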

jtrunick commented Nov 4, 2015

I'm running into this issue as well. Any suggestions as to ideal configuration parameters appreciated. @acmcelwee

@acmcelwee

I never found any ideal config to get things working consistently. We started snappy-compressing tar archives of that data, so our downloads w/ TransferManager are now all large single files rather than a "directory" of a large number of files. Things have worked out a lot better since we made the switch.

l15k4 commented Nov 5, 2015

@jtrunick try profiling how much time passes between establishing the connection and actually reading the content... I bet the problem is there... Lazy collections would cause that.
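
A quick way to measure that gap (sketch; hypothetical names):

S3Object obj = s3.getObject(bucket, key);            // connection used, response headers received
long received = System.currentTimeMillis();
// ...whatever the lazy pipeline does before it touches the stream...
long idleMillis = System.currentTimeMillis() - received;
int firstByte = obj.getObjectContent().read();       // first byte actually pulled off the socket
System.out.println("stream sat unread for " + idleMillis + " ms");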

dagnir commented Aug 11, 2016

Hello @rcoh, @acmcelwee, and @l15k4, sorry for the long delay in response. Are you guys still experiencing this issue regularly?

@dagnir dagnir self-assigned this Aug 11, 2016

dagnir commented Aug 11, 2016

Pinging @jtrunick as well

@acmcelwee

I'm not actively using any of the code where I ran into the issue, so I don't have any new data points to add.

rcoh commented Aug 11, 2016

No new data points either.

l15k4 commented Aug 19, 2016

I'm more than sure that this happens when people process data lazily, using iterators or streams, which increases the lifetime of a particular socket connection. AWS doesn't like long-lived S3 socket connections...

It can always be fixed by increasing the socketTimeout from the default 50 seconds to, say, 120 seconds:

s3Conf.setSocketTimeout(120 * 1000)

dagnir commented Aug 19, 2016

Thank you @acmcelwee, @rcoh, and @l15k4. A fix has been made to TransferManager so that it retries the download on a connection reset, and optionally (by passing in a flag to download and downloadDirectory) resumes the download from the current end of the partial object on disk.

edit: The update should be available in our next release.
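
Once that release is out, the resumable variant should look something like the snippet below (hypothetical names; assuming the flag lands as a trailing boolean on downloadDirectory, as described above):

// Retries happen inside TransferManager; with the flag set, partially downloaded
// objects resume from the current end of the file on disk instead of starting over.
MultipleFileDownload dl = tm.downloadDirectory(bucket, keyPrefix, destDir, true);
dl.waitForCompletion();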

@dagnir dagnir closed this as completed Aug 19, 2016

rdifalco commented Sep 2, 2017

I ran into this for a 7G download.
