Cloud versioning: Track existing cloud dataset and push updates to it #8704


Closed
dberenbaum opened this issue Dec 16, 2022 · 5 comments
Labels
p1-important Important, aka current backlog of things to do

Comments


dberenbaum commented Dec 16, 2022

Related to #8382. The current UX is awkward if you want to track an existing cloud dataset and later push updates to it. You have to do something like:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ mkdir my_dataset
$ dvc add my_dataset
$ dvc update my_dataset

Edit: replaced dvc pull with dvc update my_dataset.
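For reference, the two dvc remote commands above would write entries into the project's DVC config; the result should look roughly like this (remote name and URL taken from the example above):

```ini
# Sketch of the relevant .dvc/config section after `dvc remote add`
# and `dvc remote modify` — exact formatting may differ slightly.
['remote "my_dataset"']
    url = s3://bucket/path/to/dataset
    worktree = true
```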

dvc import-url --worktree s3://bucket/path/to/dataset my_dataset could do the workflow above in a single command. The remote config could be stored either in the .dvc file or in config files. This would enable an easy workflow to update a cloud dataset like:

$ dvc import-url --worktree s3://bucket/path/to/dataset my_dataset
# Modify
$ rm my_dataset/old_files*
$ cp -R new_files/* my_dataset
# Track changes
$ dvc add/commit my_dataset
$ dvc push
@dberenbaum dberenbaum added A: cloud-versioning p1-important Important, aka current backlog of things to do labels Dec 16, 2022
@dtrifiro dtrifiro self-assigned this Jan 10, 2023
@dtrifiro dtrifiro added this to DVC Jan 10, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Jan 10, 2023
@dtrifiro dtrifiro moved this from Backlog to Todo in DVC Jan 10, 2023
@dtrifiro dtrifiro moved this from Todo to In Progress in DVC Jan 13, 2023
dberenbaum commented:

@dtrifiro Sorry for changing product requirements here, but thinking about this more, I lean towards this making more sense as dvc add --worktree than dvc import-url --worktree. It's a less confusing UX, since the commands after downloading the dataset (add, push) are more typical of dvc add.

Also, import-url normally assumes both source deps and outs, and here there are really no deps. The final output should be something like this:

outs:
- md5: 56db666947dc82019468d7fa5c25d5b4.dir
  size: 59789011
  nfiles: 2605
  path: cats-dogs
  files:
  - size: 34315
    etag: 10d2a131081a3095726c5721ed31c21f
    md5: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
    remote: myremote
  - size: 28377
    etag: 0f2bfe74e9c363064087d0cd8a322106
    md5: 0f2bfe74e9c363064087d0cd8a322106
    relpath: data/train/cats/cat.100.jpg
    remote: myremote

Compare that to import-url outputs, which would be like:

md5: bc6502513858d60c8273554d99917133
frozen: true
deps:
- md5: 32666fe69170a00790117e76b90e42e2.dir
  size: 59789011
  nfiles: 2605
  path: s3://dave-sandbox-versioning/test/worktree/cats-dogs
  files:
  - size: 34315
    version_id: svPEUz3wOVBye5DMfyXkpz7CB2r.DpfW
    etag: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
  - size: 28377
    version_id: d3zndjpU2vUZZ.oSqQ1FKb24jDPh_pcG
    etag: 0f2bfe74e9c363064087d0cd8a322106
    relpath: data/train/cats/cat.100.jpg

@dberenbaum dberenbaum changed the title cloud versioning: import-url --worktree cloud versioning: track existing cloud dataset and push updates to it Jan 14, 2023

dberenbaum commented Jan 14, 2023

You have to do something like:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ mkdir my_dataset
$ dvc add my_dataset
$ dvc pull

A slightly simpler workflow to accomplish this now would be:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ dvc add s3://bucket/path/to/dataset/my_dataset -o my_dataset

dvc add --worktree could be a shortcut for this, although in testing, dvc add -o was unusually slow, so we might need to clean up its UI if we use it.


daavoo commented Jan 17, 2023

What is the difference between the dvc add --worktree as described in the last comments and #8411 ?

Should the flag be dvc add --external and detect internally that it's a cloud versioned remote?

dberenbaum commented:

Should the flag be dvc add --external and detect internally that it's a cloud versioned remote?

dvc add --external saves the cloud path and metadata and operates exclusively on the cloud. There's no way to pull a local copy of the data.

dvc add s3://bucket/path/to/dataset/my_dataset -o my_dataset downloads the dataset but does not save the cloud path or cloud metadata.

What we need is a combination of those behaviors. As evidenced by these two commands, the current UX is a mess, so I'm not tied to any particular flag (if one is needed at all) as long as it:

  1. Records the cloud location and metadata.
  2. Allows for data to be pushed, pulled, and updated between local and cloud.
  3. Works with a single dvc add ... command.
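Put together, a single command meeting all three requirements might look roughly like this (the --worktree flag is the proposal under discussion in this thread, not an existing option, and the exact spelling is not final):

```shell
# Hypothetical: records the cloud location and metadata, downloads the
# data locally, and leaves it ready for dvc push/pull and dvc update.
$ dvc add --worktree s3://bucket/path/to/dataset -o my_dataset
```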

What is the difference between the dvc add --worktree as described in the last comments and #8411 ?

It might be related, but #8411 is focused on pipeline outputs written directly to the cloud. Since pipelines already support external dependencies without any special behavior, adding this feature for outputs should make a cloud-centric pipeline fairly easy. We still need to discuss in #8411 whether that should have the same behavior as this issue (for example, whether those outputs can be pulled locally).
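To illustrate the distinction with #8411, a cloud-centric pipeline stage might look roughly like this (stage name, script, and bucket paths are made up for illustration; external outputs are the part #8411 proposes):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py           # hypothetical script
    deps:
      - s3://bucket/raw-data            # external deps already work today
    outs:
      - s3://bucket/processed-data      # external outs are what #8411 proposes
```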

@omesser omesser changed the title cloud versioning: track existing cloud dataset and push updates to it Cloud versioning: Track existing cloud dataset and push updates to it Jan 25, 2023
dberenbaum commented:

@dtrifiro Given that we are deprioritizing worktree remotes, we can close this issue, but first I wanted to check whether you have any POC you could push as a PR. I think it would be useful if we need to come back to this in the future.

@dberenbaum dberenbaum closed this as not planned Feb 7, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Feb 7, 2023