
remote: getting a directory checksum requires uploading a file to an external cache #2647

Closed
ghost opened this issue Oct 22, 2019 · 14 comments · Fixed by #3225
Labels: discussion, p2-medium, research

Comments

@ghost

ghost commented Oct 22, 2019

Version: 0.62.1

Description: The current approach to calculating a directory checksum is to collect its contents' checksums into a file, "upload" that file, and take the checksum provided by the remote:

https://github.com/iterative/dvc/blob/a091b67341b00fea8ef43906f59c88cb38f384ff/dvc/remote/base.py#L246-L256

For example, if it is an S3 remote, DVC will create a file containing an "index" of the directory with its checksums in the S3 cache, and then try to get the ETag of that file.

This is an easy way to check whether the directory has changed. Since the "directory index" is already there, you are one HEAD operation away from knowing whether the directory changed. Otherwise, you'd need to recompute the file on the spot.
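For illustration, here is a minimal sketch of that upload-then-ask flow (not DVC's actual code; remote.upload and remote.get_etag are hypothetical stand-ins for the remote's API):

import json
import tempfile

def get_dir_checksum(remote, dir_info):
    # dir_info is the list of {"etag": ..., "relpath": ...} entries
    # describing the directory contents.
    with tempfile.NamedTemporaryFile("w", suffix=".dir", delete=False) as tmp:
        json.dump(dir_info, tmp, sort_keys=True)

    # Upload the serialized "index" to the remote cache, then ask the
    # remote what checksum it assigned to it (e.g. the S3 ETag). This
    # upload is the reason a cache must be configured.
    remote.upload(tmp.name, "cache/tmp.dir")          # hypothetical API
    return remote.get_etag("cache/tmp.dir") + ".dir"  # hypothetical API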

Right now we don't require setting up a cache for using external dependencies, but as mentioned above, this is needed for directories.

Let's discuss whether there's a way around uploading files, or settle on the current implementation and modify the code and docs accordingly.

Related: #1654

@ghost ghost added the discussion label Oct 22, 2019
@shcheklein
Member

@MrOutis I don't understand the issue described, tbh. Could you please provide an example where it's clear what cache is being used, what remote is being used, where the directory is located, etc.?

@ghost
Author

ghost commented Oct 22, 2019

Sure, @shcheklein. Take a look at the following script; it sets up a mock S3 server with a directory, which is then used as a dependency in a dvc run command:

#!/usr/bin/env bash

set -x

export AWS_ACCESS_KEY_ID='testing'
export AWS_SECRET_ACCESS_KEY='testing'
export AWS_SECURITY_TOKEN='testing'
export AWS_SESSION_TOKEN='testing'

moto_server s3 &> /dev/null &

# File structure:
#       dvc-temp
#       ├── data
#       │  ├── alice
#       │  ├── alpha
#       │  └── subdir
#       │     ├── 1
#       │     ├── 2
#       │     └── 3
#       ├── empty_dir
#       ├── empty_file
#       └── foo

python -c '
import boto3

session = boto3.session.Session()
s3 = session.client("s3", endpoint_url="http://localhost:5000")
s3.create_bucket(Bucket="dvc-temp")
s3.put_object(Bucket="dvc-temp", Key="cache/")
s3.put_object(Bucket="dvc-temp", Key="empty_dir/")
s3.put_object(Bucket="dvc-temp", Key="empty_file", Body=b"")
s3.put_object(Bucket="dvc-temp", Key="foo", Body=b"foo")
s3.put_object(Bucket="dvc-temp", Key="data/alice", Body=b"alice")
s3.put_object(Bucket="dvc-temp", Key="data/alpha", Body=b"alpha")
s3.put_object(Bucket="dvc-temp", Key="data/subdir/1", Body=b"1")
s3.put_object(Bucket="dvc-temp", Key="data/subdir/2", Body=b"2")
s3.put_object(Bucket="dvc-temp", Key="data/subdir/3", Body=b"3")
'

repository=$(mktemp -d)
cd "$repository"
dvc init --no-scm
dvc remote add s3 s3://dvc-temp
dvc remote modify s3 endpointurl http://localhost:5000

dvc remote add cache remote://s3/cache
dvc config cache.s3 cache

dvc run -d remote://s3/data 'echo something'

To use remote://s3/data as a dependency, DVC needs to create a file on the remote cache with the following content:

[{'etag': '6384e2b2184bcbf58eccf10ca7a6563c', 'relpath': 'alice'},
 {'etag': '2c1743a391305fbf367df8e4f069f9f9', 'relpath': 'alpha'},
 {'etag': 'c4ca4238a0b923820dcc509a6f75849b', 'relpath': 'subdir/1'},
 {'etag': 'c81e728d9d4c2f636f067f89cc14862c', 'relpath': 'subdir/2'},
 {'etag': 'eccbc87e4b5ce2fe28308fd9f2a7baf3', 'relpath': 'subdir/3'}]

Note: newlines were added for readability; the actual file shouldn't contain any.

Such a file would be located at remote://cache/af/134c33d65939675318e71363241eee.dir; this is what I called an index file earlier (probably a misnomer, but it is a list of the files inside the directory and their checksums).
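For reference, the cache path follows DVC's content-addressed layout, derived from the checksum (a sketch; checksum is whatever the remote reports, e.g. the ETag above):

import posixpath

def dir_cache_path(cache_prefix, checksum):
    # The first two hex characters become a subdirectory, the rest the
    # file name; the ".dir" suffix marks a directory index, not a blob.
    return posixpath.join(cache_prefix, checksum[:2], checksum[2:] + ".dir")

# dir_cache_path("cache", "af134c33d65939675318e71363241eee")
# -> "cache/af/134c33d65939675318e71363241eee.dir"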

As I already stated, it is useful for checking the integrity of the directory.

The problem is that setting up a cache for working with external dependencies is not required by DVC.

As an example, you can do dvc import-url remote://s3/data/alice and it won't need a cache at all (because it is a single file and doesn't run the _get_dir_info_checksum method).
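For a single file, the checksum comes straight from the object's metadata, no cache involved; against the mock server above it is one HEAD request (boto3 sketch):

import boto3

s3 = boto3.client("s3", endpoint_url="http://localhost:5000")
# A single HEAD request returns the ETag, which serves as the file's
# checksum, so no index file and no cache upload are needed.
etag = s3.head_object(Bucket="dvc-temp", Key="data/alice")["ETag"].strip('"')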

Let's say that the following commands are equivalent:

$ dvc import-url remote://s3/data
$ dvc run -d remote://s3/data -o data/ 'aws s3 cp --recursive s3://dvc-temp/data data'

Because data/ is a directory, it would require a configured cache.

For external outputs (it doesn't matter if it is a directory or not), we throw an error when a cache is not set:

$ dvc run -o remote://s3/data/alice 'echo something'

ERROR: failed to run command - no cache location setup for 's3' outputs.

The discussion revolves around the behavior of creating such an index file and requiring the user to set up a cache in order to dvc import-url a directory.

Hope it is more understandable now, @shcheklein :)

@shcheklein
Member

Thanks, @MrOutis ! It's definitely helpful. Some questions that are still not exactly clear to me:

This is an easy way to perform an integrity check.

Is it an integrity check or a check to see whether a remote dir has changed, to rerun a stage if needed? If it's an integrity check, when do we run it?

head operation away from knowing if the directory changed or not

Even if we create an index file on the remote cache, how does it help to see whether the directory changed or not? We need to calculate that file every time, right?

Am I right that the question is not about creating the index file itself (we will need it anyway, right?) and more about whether we need to cache it or just calculate it when it's needed?

If so, can you clarify again what the benefits of caching it are? I don't understand whether caching this file saves us from recalculating the dir checksum every time we want to check if something has changed or not.

@efiop
Contributor

efiop commented Dec 9, 2019

@MrOutis ^

@efiop efiop added the p1-important label Dec 9, 2019
@efiop
Contributor

efiop commented Dec 9, 2019

Users are getting confused by this: https://discordapp.com/channels/485586884165107732/563406153334128681/653666017368735782 🙁 We need to at least handle it gracefully until we decide what to do with this.

@shcheklein
Member

@efiop @MrOutis count me in - I got confused once :) (even though I was aware of this ticket). My 2cs: I would say the biggest problem here is the UI - we left a user-facing NPE.

@ghost
Author

ghost commented Dec 10, 2019

@shcheklein:

Is it an integrity check or a check to see that a remote dir has changed to rerun a stage if needed? If it's an integrity check, when do we run it?

I used "integrity check" wrongly; I was talking about the remote directory changing, as you pointed out.

If so, then can you clarify again what are the benefits of caching it? I don't understand if caching this file saves us from recalculating the dir checksum every time we want to check if something has changed on not.

@shcheklein, I'll come back to this one later; I need to think it through.

@efiop efiop added p0-critical, p1-important and removed p1-important, p0-critical labels Jan 15, 2020
@efiop
Contributor

efiop commented Jan 15, 2020

@ghost
Author

ghost commented Jan 15, 2020

@efiop, #2647 (comment) - the issue is still p1. Not sure if that is intended.

@efiop
Contributor

efiop commented Jan 16, 2020

@MrOutis Yep, I don't think it is more (or as) important than the other p0s, but it is a nasty UI issue 🙁. Added it to the next sprint; maybe we'll have time to look into it this sprint.

@efiop efiop added the research label Jan 21, 2020
@pared pared self-assigned this Jan 22, 2020
@efiop
Contributor

efiop commented Jan 22, 2020

First step: at least handle this error gracefully; we can think about in-memory hash computation after that.
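Something along these lines, perhaps (a sketch with hypothetical names, not the actual patch):

class RemoteCacheRequiredError(Exception):
    def __init__(self, path_info):
        super().__init__(
            "cache location is required to process directory '{}'; "
            "please set up a remote cache".format(path_info)
        )

def _get_dir_info_checksum(self, dir_info):
    # Fail with an actionable message instead of the user-facing NPE
    # mentioned earlier in the thread.
    if self.cache is None:
        raise RemoteCacheRequiredError(self.path_info)
    ...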

@pared
Contributor

pared commented Jan 23, 2020

Simplest script to reproduce:

#!/bin/bash

rm -rf repo
mkdir repo
pushd repo

set -ex
dvc init --no-scm -q

dvc run -d s3://{bucket}/data -o result 'aws2 s3 ls {bucket}/data/ >> result'

assuming there is a {bucket} with a data folder in it

@efiop
Contributor

efiop commented Jan 24, 2020

Keeping it open to consider in-memory hash computation for some remotes.
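For remotes whose checksum is a plain content hash, this could be as simple as hashing the serialized index locally (a sketch; the assumption holds for single-part S3 ETags, which are the MD5 of the content, but not for multipart ETags):

import hashlib
import json

def dir_checksum_in_memory(dir_info):
    # Serialize the index exactly as it would have been uploaded...
    contents = json.dumps(dir_info, sort_keys=True).encode("utf-8")
    # ...and hash it locally, skipping the round trip to the remote.
    return hashlib.md5(contents).hexdigest() + ".dir"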

@efiop efiop reopened this Jan 24, 2020
@efiop efiop added the p2-medium label and removed the p1-important label Jan 24, 2020
@pared pared removed their assignment Jan 30, 2020
@efiop
Contributor

efiop commented Sep 1, 2020

Closing in favor of #4144; will be fixed very soon.

@efiop efiop closed this as completed Sep 1, 2020