azure/gs/s3: re-design checksum computation for external dependencies/outputs #1410
Comments
Ok, it turns out the user has to set Content-MD5 themselves for this to work. Also, the ETag does indeed change when you simply copy a blob from one location to another. So we need to either require Content-MD5 to be present or compute and set it ourselves. We can still use the ETag for dependencies, though. Also, md5 on Google Cloud Storage is only supported for non-composite objects, so it looks like we need to handle that as well.
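For reference, a minimal sketch of computing the value that would go into Content-MD5 for a local file. The header carries the base64-encoded binary MD5 digest (not the hex string). The function name `content_md5` and the 1 MiB read size are illustrative assumptions, not something from this project; how the value is attached to a blob depends on the storage SDK in use.

```python
import base64
import hashlib


def content_md5(path, chunk_size=1024 * 1024):
    """Return the base64-encoded MD5 digest of the file at `path`.

    Hypothetical helper: the name and the default chunk size are
    assumptions made for this example.
    """
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # Read in fixed-size chunks so large files are not loaded into memory.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    # Content-MD5 is the base64 form of the raw 16-byte digest.
    return base64.b64encode(digest.digest()).decode("ascii")
```

The result can be compared against the Content-MD5 reported by the storage service, when one is set, to decide whether a file has changed.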
Ok, it turns out that when you move multipart objects on s3 they don't retain their etag. See https://discordapp.com/channels/485586884165107732/485596304961962003/563864123587297280 . So we might need to come up with a general approach to handle all of these clouds :(
Another approach to handle all of these in a unified way is https://discordapp.com/channels/485586884165107732/485596304961962003/563899005151608843
Related workaround for s3 https://discordapp.com/channels/485586884165107732/485596304961962003/563910400450625546
I'm giving the s3 workaround a shot. As far as I can see, we need to check whether the source object of the copy was uploaded as multipart, by checking if its etag has a suffix. If it does, we need to take the object size and the number of parts (from the etag suffix) and calculate the chunk size to set in the TransferConfig passed to the copy, so that the copy ends up with the same number of parts.
Actually, head_object returns the part count and the size; using that is less hacky.
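A rough sketch of the bookkeeping described above, kept free of network calls. It assumes the part sizes have already been fetched (for example with boto3's `head_object(..., PartNumber=n)`, which reports the size of that part and the `PartsCount`), and it produces byte ranges in the `bytes=first-last` form that `upload_part_copy` takes as `CopySourceRange`. The helper names `multipart_count` and `copy_ranges` are hypothetical and are not taken from the actual fix.

```python
def multipart_count(etag):
    """Return the part count from an ETag like '"<hex>-3"', or None.

    Hypothetical helper: a plain (non-multipart) ETag has no '-N' suffix.
    """
    value = etag.strip('"')
    _, sep, suffix = value.rpartition("-")
    if sep and suffix.isdigit():
        return int(suffix)
    return None


def copy_ranges(part_sizes):
    """Turn a list of part sizes (bytes) into inclusive byte ranges.

    Each entry has the 'bytes=first-last' form, one range per part, so a
    part-by-part copy can reuse the boundaries of the source object.
    """
    ranges = []
    start = 0
    for size in part_sizes:
        end = start + size - 1
        ranges.append("bytes={}-{}".format(start, end))
        start = end + 1
    return ranges
```

Reusing the original part boundaries, rather than only the part count, is the safer choice here, since the multipart etag appears to depend on the content of each part.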
* s3: fixed wrong etag when copying multipart objects
  The etag of multipart objects depends on the number of parts; when copying to the cache we should do so in the same number of parts that the original object was moved/uploaded in.
  Fixes part of #1410
* s3: added check on copy for equal etag
* s3: added specific exception for ETag mismatch
* s3: use multipart copy to preserve etags
  Signed-off-by: Ruslan Kuprieiev <[email protected]>
* test: add tests for etag preservation on s3
  Signed-off-by: Ruslan Kuprieiev <[email protected]>
* test: requirements: use dev version moto
  Specifically because of this getmoto/moto#1941
  Signed-off-by: Ruslan Kuprieiev <[email protected]>
* test: requirements: install dev moto without -e
  Signed-off-by: Ruslan Kuprieiev <[email protected]>
* test: stop using moto
  Turned out to be quite buggy.
  Signed-off-by: Ruslan Kuprieiev <[email protected]>
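To illustrate why the part layout matters, here is a sketch of the commonly observed (but, as far as I know, not officially documented) way a multipart etag is formed: the MD5 of the concatenated binary MD5 digests of each part, followed by `-<number of parts>`. This is an assumption about S3 behaviour, it may not hold for every configuration (for example some server-side encryption modes), and `multipart_etag` is a hypothetical name used only for this example.

```python
import hashlib


def multipart_etag(data, chunk_size):
    """Compute an S3-style multipart etag for non-empty `data`.

    Assumption: etag = md5(md5(part1) + md5(part2) + ...) + "-" + part count,
    with every part except the last being `chunk_size` bytes long.
    """
    parts = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    # Concatenate the raw (binary) digests of the individual parts.
    combined = b"".join(hashlib.md5(part).digest() for part in parts)
    return "{}-{}".format(hashlib.md5(combined).hexdigest(), len(parts))
```

With this scheme the same bytes split into a different number of parts give a different etag, which matches the mismatch seen when a multipart object is copied with a different chunk size.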
See aws/aws-sdk-java#1303: that behavior has been generally considered undocumented and, per aws-sdk-java, shouldn't be used.
Closing in favor of #3920
Unlike S3, azure does provide content-md5 (need to investigate if it still works, since s3 had support for it some time ago and doesn't have it right now). This is related to the external outs/deps feature, which is not implemented yet. Usual push/pull are not going to be affected.