Skip to content

AmazonS3.getObject (GetObjectRequest, File) sucks! #485

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
almson opened this issue Aug 11, 2015 · 4 comments
Closed

AmazonS3.getObject (GetObjectRequest, File) sucks! #485

almson opened this issue Aug 11, 2015 · 4 comments

Comments

@almson
Copy link

almson commented Aug 11, 2015

More specifically, the code that it delegates to in com.amazonaws.services.s3.internal.ServiceUtils.retryableDownloadS3ObjectToFile is flawed. Here are the problems I've identified in the methods for downloading S3 objects to disk:

  • The retry logic in retryableDownloadS3ObjectToFile is poor. Ideally, it should just use the ClientConfiguration retry logic. The wrong things it current does are:
    • Only retries once
    • Does not retry on SocketException, which is the most common exception you'll encounter when downloading large files. Ugh! Utter fail. SDKDefaultRetryCondition will retry on a client exception caused by a SocketException
  • retryableDownloadS3ObjectToFile aborts the underlying S3ObjectInputStream on success
  • retryableDownloadS3ObjectToFile has a horrible name
  • AmazonS3Client checks MD5 integrity with the elegant DigestValidationInputStream, and then asks retryableDownloadS3ObjectToFile to do the integrity check again on the file itself.
  • ServiceUtils.downloadObjectToFile calculates the expensive integrity check even if told not to.
  • Why does retryableDownloadS3ObjectToFile have such a complicated method of receiving its parameters?
  • ServiceUtils.downloadToFile doesn't check if expectedFileLength is -1, even though that's exactly what downloadObjectToFile will pass to it.
@shorea
Copy link
Contributor

shorea commented Aug 11, 2015

@almson Thanks for the feedback! I agree with some of your points.

  • The retry logic in retryableDownloadS3ObjectToFile is poor.
    The ClientConfiguration retry logic is honored in the core (AmazonHttpClient). There SocketExceptions are retried as you'd expect. retryableDownloadS3ObjectToFile gets this for free by calling the getObject method on the client so it could potentially retry 2 * numOfRetriesInClientConfig in the worst case scenario, not just once. You are correct this does not retry on SocketExceptions while reading the S3ObjectInputStream, only those thrown when making the HTTP request and receiving the response.
  • retryableDownloadS3ObjectToFile aborts the underlying S3ObjectInputStream on success
    Yes this seems odd. I don't see any benefit in aborting when all bytes have been read from the input stream. As the JavaDocs state abort is only beneficial when the cost of reading to the end of the stream outweighs the benefits of reusing an http connection. Ideally we should only be aborting on failure when there could be a large number of bytes still unread in the stream.
  • retryableDownloadS3ObjectToFile has a horrible name
    Yes I agree. We are open to suggestions and do accept pull requests. :)
  • AmazonS3Client checks MD5 integrity with the elegant DigestValidationInputStream, and then asks retryableDownloadS3ObjectToFile to do the integrity check again on the file itself.
    Something could be lost in translation while writing to the file so it's important to check the MD5 sum of the file rather than just the input stream. Perhaps we could turn off the input stream validation for this use case as it doesn't make a whole lot of sense to validate twice.
  • ServiceUtils.downloadObjectToFile calculates the expensive integrity check even if told not to.
    Yes this seems like a bug. There's no point in calculating it if we aren't using it.
  • Why does retryableDownloadS3ObjectToFile have such a complicated method of receiving its parameters?
    Do you mean the RetryableS3DownloadTask callback? We could pass in the client and request directly. No strong preference either way.
  • ServiceUtils.downloadToFile doesn't check if expectedFileLength is -1, even though that's exactly what downloadObjectToFile will pass to it.
    This is definitely confusing. The file length is only checked if appendData is true. appendData won't be true in getObject(GetObjectRequest, File) as we explicitly pass in ServiceUtils.OVERWRITE_MODE. We pass this flag around a couple of times so it's not immediately obvious that appendData will be false when we explicitly pass -1 as the expectedFileLength, we can clean this up to reduce the confusion.

@almson
Copy link
Author

almson commented Aug 11, 2015

Thank you for taking my feedback into consideration!

Something could be lost in translation while writing to the file so it's important to check the MD5 sum of the file rather than just the input stream.

Generally, you have to trust the OS and the hardware to write to a file what you ask it to write (or nothing would work!). More users are paranoid about performance than file integrity, so I would suggest sticking to the fast stream-based checker. If that's not enough for someone, they're free to re-verify the file on disk as many times as they like. Do you have a reason to think that corruption of the file is more likely than the background probability of RAM, bus, disk failures?

The file length is only checked if appendData is true. appendData won't be true in getObject(GetObjectRequest, File) as we explicitly pass in ServiceUtils.OVERWRITE_MODE.

I realize that, however there is a public function in ServiceUtils:

 public static void downloadObjectToFile(S3Object s3Object,
        final File destinationFile, boolean performIntegrityCheck,
        boolean appendData) {
    downloadToFile(s3Object, destinationFile, performIntegrityCheck, appendData, -1);
}

which will fail whenever it is called with appendData == true. It's inadvisable to keep it in the codebase as-is.

@shorea
Copy link
Contributor

shorea commented Aug 13, 2015

Generally, you have to trust the OS and the hardware to write to a file what you ask it to write

I've discussed with a colleague and I've been convinced that the validation on the input stream would be sufficient.

which will fail whenever it is called with appendData == true. It's inadvisable to keep it in the codebase as-is.

Yes I agree, this code could use some cleanup.

I've created tasks on our backlog to address the issues. If there are ones your are particularly frustrated by or passionate about we do accept pull requests. Thanks again for such great feedback!

@breedloj
Copy link
Contributor

Circling back around on these requests: we've made changes around a number of these suggestions so I'm going to go ahead and close this Issue for now. Please feel free to create a new issue if you have other suggestions or find that there are still major issues with getObject.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants