remote: getting a directory checksum requires uploading a file to an external cache #2647
@MrOutis I don't understand the issue described, tbh. Could you please provide an example where it's clear what cache is being used, what remote is being used, where the directory is located, etc.?
Sure, @shcheklein. Take a look at the following script; it sets up a mock S3 server with a directory that is then used as a dependency in a dvc run stage:

#!/usr/bin/env bash
set -x
export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'
moto_server s3 &> /dev/null &
# File structure:
# dvc-temp
# ├── data
# │ ├── alice
# │ ├── alpha
# │ └── subdir
# │ ├── 1
# │ ├── 2
# │ └── 3
# ├── empty_dir
# ├── empty_file
# └── foo
python -c '
import boto3
session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="cache/")
s3.put_object(Bucket="dvc-temp", Key="empty_dir/")
s3.put_object(Bucket="dvc-temp", Key="empty_file", Body=b"")
s3.put_object(Bucket="dvc-temp", Key="foo", Body=b"foo")
s3.put_object(Bucket="dvc-temp", Key="data/alice", Body=b"alice")
s3.put_object(Bucket="dvc-temp", Key="data/alpha", Body=b"alpha")
s3.put_object(Bucket="dvc-temp", Key="data/subdir/1", Body=b"1")
s3.put_object(Bucket="dvc-temp", Key="data/subdir/2", Body=b"2")
s3.put_object(Bucket="dvc-temp", Key="data/subdir/3", Body=b"3")
'
repository=$(mktemp -d)
cd "$repository"
dvc init --no-scm
dvc remote add s3 s3://dvc-temp
dvc remote modify s3 endpointurl http://localhost:5000
dvc remote add cache remote://s3/cache
dvc config cache.s3 cache
dvc run -d remote://s3/data 'echo something'

To use a directory as an external dependency, DVC needs to compute a checksum for it. It does so by building an "index" file listing the checksums of the directory's contents and uploading that file to the cache.
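For illustration, such an index is a JSON list of the contents' checksums and relative paths; the md5 values below are placeholders:

[
    {"md5": "<md5 of alice>", "relpath": "alice"},
    {"md5": "<md5 of alpha>", "relpath": "alpha"},
    {"md5": "<md5 of 1>", "relpath": "subdir/1"},
    {"md5": "<md5 of 2>", "relpath": "subdir/2"},
    {"md5": "<md5 of 3>", "relpath": "subdir/3"}
]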
Note: newlines added for readability, but the actual file shouldn't contain any. Such a file would be located in the cache, under a path derived from its own checksum. As I already stated, it is useful for checking the integrity of the directory. The problem is that setting up a cache for working with external dependencies is not required by DVC; as an example, you can use a plain external dependency without configuring any cache at all. Let's say that the following commands are equivalent:

$ dvc import-url remote://s3/data
$ dvc run -d remote://s3/data -o data/ 'aws s3 cp --recursive s3://dvc-temp/data data'

Because both of these need the directory's checksum, they run into the same problem. For external outputs (it doesn't matter whether it is a directory or not), we throw an error when a cache is not set:

$ dvc run -o remote://s3/data/alice 'echo something'
ERROR: failed to run command - no cache location setup for 's3' outputs.

The discussion revolves around the behavior of creating such an index file and requiring the user to set up a cache for external directory dependencies.

Hope it is more understandable now, @shcheklein :)
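A rough sketch of that mechanism against the mock server above, assuming non-multipart uploads (so the S3 ETag equals the MD5 of the content); the cache key and index layout here are illustrative, not DVC's actual internals:

import json

import boto3

s3 = boto3.session.Session().client("s3", endpoint_url="http://localhost:5000")

def etag(bucket, key):
    # For non-multipart objects, S3's ETag is the MD5 of the content.
    return s3.head_object(Bucket=bucket, Key=key)["ETag"].strip('"')

# 1. Collect the checksums of the directory's contents into an index.
keys = ["data/alice", "data/alpha", "data/subdir/1", "data/subdir/2", "data/subdir/3"]
index = [{"md5": etag("dvc-temp", k), "relpath": k[len("data/"):]} for k in keys]

# 2. Upload the index and let the remote report its checksum.
s3.put_object(Bucket="dvc-temp", Key="cache/dir-index", Body=json.dumps(index).encode())
print("directory checksum:", etag("dvc-temp", "cache/dir-index"))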
Thanks, @MrOutis! It's definitely helpful. Some questions that are still not exactly clear to me:

Is it an integrity check, or a check to see whether a remote dir has changed so that a stage can be rerun if needed? If it's an integrity check, when do we run it?

Even if we create an index file on the remote cache, how does it help to see whether the directory changed or not? We need to recalculate that file every time, right?

Am I right that the question is not about creating the index file itself (we will need it anyway, right?) but about whether we need to cache it or just calculate it when it's needed? If so, can you clarify again what the benefits of caching it are? I don't understand whether caching this file saves us from recalculating the dir checksum every time we want to check if something has changed or not.
@MrOutis ^
Users are getting confused by this https://discordapp.com/channels/485586884165107732/563406153334128681/653666017368735782 🙁 Need to at least handle it gracefully, until we decide what to do with this.
I used "integrity check" wrongly; I was talking about the remote directory changing, as you pointed out.

@shcheklein, I'll come back to this one later; I need to think it through more.
Another user ran into this https://discordapp.com/channels/485586884165107732/485596304961962003/667036162472411144
@efiop , #2647 (comment) the issue is still
@MrOutis Yep, I don't think it is more (or as) important than the other p0s, but this is a nasty UI issue 🙁. Added it to the next sprint; maybe we'll have time to look into it this sprint.
First step: at least handle this error gracefully; we can think about in-memory hash computation after that.
Simplest script to reproduce, assuming there is a directory already on the remote:
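Something along these lines, with placeholder bucket and path, and no cache configured for the remote:

$ dvc init --no-scm
$ dvc run -d s3://bucket/dir 'echo done'

With no cache location set up, DVC has nowhere to upload the directory index, so computing the dependency's checksum fails.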
Keeping it open to consider in-memory hash computation for some remotes.
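A minimal sketch of that in-memory alternative: build the index content as before, then hash it locally with hashlib instead of uploading it and asking the remote for an ETag.

import hashlib
import json

index = [{"md5": "<md5 of alice>", "relpath": "alice"}]  # built as shown earlier
body = json.dumps(index).encode()
# For a non-multipart upload, S3 would report exactly this value as the ETag:
dir_checksum = hashlib.md5(body).hexdigest()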
Closing in favor of #4144; will be fixed very soon.
Version: 0.62.1
Description: The current approach to calculating a directory checksum is to collect its contents' checksums, then "upload" the result and take the checksum provided by the remote:
https://github.com/iterative/dvc/blob/a091b67341b00fea8ef43906f59c88cb38f384ff/dvc/remote/base.py#L246-L256
For example, with an S3 remote, DVC will create a file containing an "index" of the directory with its checksums in the S3 cache, and then get the ETag of that file.
This is an easy way to check whether the directory has changed. Since the "directory index" is already there, you are one HEAD operation away from knowing whether the directory changed or not; otherwise, you'd need to recompute the file on the spot.

Right now we don't require setting up a cache for using external dependencies, but as mentioned above, this is needed for directories.
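A sketch of that cheap check against the mock server from the comments above (the cache key is illustrative, not DVC's actual layout):

import boto3

s3 = boto3.session.Session().client("s3", endpoint_url="http://localhost:5000")
# One HEAD request fetches the stored index's checksum; nothing is downloaded.
stored = s3.head_object(Bucket="dvc-temp", Key="cache/dir-index")["ETag"].strip('"')
# Compare `stored` with a freshly computed index checksum to detect changes.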
Let's discuss whether there's another way around uploading files, or settle on that implementation and modify the code and docs accordingly.
Related: #1654