AmazonS3.getObject (GetObjectRequest, File) sucks! #485

almson · 2015-08-11T13:44:30Z

More specifically, the code that it delegates to in com.amazonaws.services.s3.internal.ServiceUtils.retryableDownloadS3ObjectToFile is flawed. Here are the problems I've identified in the methods for downloading S3 objects to disk:

The retry logic in retryableDownloadS3ObjectToFile is poor. Ideally, it should just use the ClientConfiguration retry logic. The wrong things it current does are:
- Only retries once
- Does not retry on SocketException, which is the most common exception you'll encounter when downloading large files. Ugh! Utter fail. SDKDefaultRetryCondition will retry on a client exception caused by a SocketException
retryableDownloadS3ObjectToFile aborts the underlying S3ObjectInputStream on success
retryableDownloadS3ObjectToFile has a horrible name
AmazonS3Client checks MD5 integrity with the elegant DigestValidationInputStream, and then asks retryableDownloadS3ObjectToFile to do the integrity check again on the file itself.
ServiceUtils.downloadObjectToFile calculates the expensive integrity check even if told not to.
Why does retryableDownloadS3ObjectToFile have such a complicated method of receiving its parameters?
ServiceUtils.downloadToFile doesn't check if expectedFileLength is -1, even though that's exactly what downloadObjectToFile will pass to it.

The text was updated successfully, but these errors were encountered:

shorea · 2015-08-11T16:47:37Z

@almson Thanks for the feedback! I agree with some of your points.

The retry logic in retryableDownloadS3ObjectToFile is poor.
The ClientConfiguration retry logic is honored in the core (AmazonHttpClient). There SocketExceptions are retried as you'd expect. retryableDownloadS3ObjectToFile gets this for free by calling the getObject method on the client so it could potentially retry 2 * numOfRetriesInClientConfig in the worst case scenario, not just once. You are correct this does not retry on SocketExceptions while reading the S3ObjectInputStream, only those thrown when making the HTTP request and receiving the response.
retryableDownloadS3ObjectToFile aborts the underlying S3ObjectInputStream on success
Yes this seems odd. I don't see any benefit in aborting when all bytes have been read from the input stream. As the JavaDocs state abort is only beneficial when the cost of reading to the end of the stream outweighs the benefits of reusing an http connection. Ideally we should only be aborting on failure when there could be a large number of bytes still unread in the stream.
retryableDownloadS3ObjectToFile has a horrible name
Yes I agree. We are open to suggestions and do accept pull requests. :)
AmazonS3Client checks MD5 integrity with the elegant DigestValidationInputStream, and then asks retryableDownloadS3ObjectToFile to do the integrity check again on the file itself.
Something could be lost in translation while writing to the file so it's important to check the MD5 sum of the file rather than just the input stream. Perhaps we could turn off the input stream validation for this use case as it doesn't make a whole lot of sense to validate twice.
ServiceUtils.downloadObjectToFile calculates the expensive integrity check even if told not to.
Yes this seems like a bug. There's no point in calculating it if we aren't using it.
Why does retryableDownloadS3ObjectToFile have such a complicated method of receiving its parameters?
Do you mean the RetryableS3DownloadTask callback? We could pass in the client and request directly. No strong preference either way.
ServiceUtils.downloadToFile doesn't check if expectedFileLength is -1, even though that's exactly what downloadObjectToFile will pass to it.
This is definitely confusing. The file length is only checked if appendData is true. appendData won't be true in getObject(GetObjectRequest, File) as we explicitly pass in ServiceUtils.OVERWRITE_MODE. We pass this flag around a couple of times so it's not immediately obvious that appendData will be false when we explicitly pass -1 as the expectedFileLength, we can clean this up to reduce the confusion.

almson · 2015-08-11T18:20:48Z

Thank you for taking my feedback into consideration!

Something could be lost in translation while writing to the file so it's important to check the MD5 sum of the file rather than just the input stream.

Generally, you have to trust the OS and the hardware to write to a file what you ask it to write (or nothing would work!). More users are paranoid about performance than file integrity, so I would suggest sticking to the fast stream-based checker. If that's not enough for someone, they're free to re-verify the file on disk as many times as they like. Do you have a reason to think that corruption of the file is more likely than the background probability of RAM, bus, disk failures?

The file length is only checked if appendData is true. appendData won't be true in getObject(GetObjectRequest, File) as we explicitly pass in ServiceUtils.OVERWRITE_MODE.

I realize that, however there is a public function in ServiceUtils:

 public static void downloadObjectToFile(S3Object s3Object,
        final File destinationFile, boolean performIntegrityCheck,
        boolean appendData) {
    downloadToFile(s3Object, destinationFile, performIntegrityCheck, appendData, -1);
}

which will fail whenever it is called with appendData == true. It's inadvisable to keep it in the codebase as-is.

shorea · 2015-08-13T22:33:40Z

Generally, you have to trust the OS and the hardware to write to a file what you ask it to write

I've discussed with a colleague and I've been convinced that the validation on the input stream would be sufficient.

which will fail whenever it is called with appendData == true. It's inadvisable to keep it in the codebase as-is.

Yes I agree, this code could use some cleanup.

I've created tasks on our backlog to address the issues. If there are ones your are particularly frustrated by or passionate about we do accept pull requests. Thanks again for such great feedback!

breedloj · 2016-04-21T17:30:54Z

Circling back around on these requests: we've made changes around a number of these suggestions so I'm going to go ahead and close this Issue for now. Please feel free to create a new issue if you have other suggestions or find that there are still major issues with getObject.

breedloj closed this as completed Apr 21, 2016

dagnir mentioned this issue Sep 5, 2017

Transfer Manager Issues #1288

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

AmazonS3.getObject (GetObjectRequest, File) sucks! #485

AmazonS3.getObject (GetObjectRequest, File) sucks! #485

almson commented Aug 11, 2015

shorea commented Aug 11, 2015

Uh oh!

almson commented Aug 11, 2015

Uh oh!

shorea commented Aug 13, 2015

Uh oh!

breedloj commented Apr 21, 2016

Uh oh!

AmazonS3.getObject (GetObjectRequest, File) sucks! #485

AmazonS3.getObject (GetObjectRequest, File) sucks! #485

Comments

almson commented Aug 11, 2015

shorea commented Aug 11, 2015

Uh oh!

almson commented Aug 11, 2015

Uh oh!

shorea commented Aug 13, 2015

Uh oh!

breedloj commented Apr 21, 2016

Uh oh!