Cloud versioning: Track existing cloud dataset and push updates to it #8704


Closed
dberenbaum opened this issue Dec 16, 2022 · 5 comments
Labels
p1-important Important, aka current backlog of things to do

Comments


dberenbaum commented Dec 16, 2022

Related to #8382. The current UX is awkward if you want to track an existing cloud dataset and later push updates to it. You have to do something like:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ mkdir my_dataset
$ dvc add my_dataset
$ dvc update my_dataset

Edit: replaced dvc pull with dvc update my_dataset.
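For reference, the two dvc remote commands above would write entries into the project's DVC config; the result should look roughly like this (remote name and URL taken from the example above):

```ini
# Sketch of the relevant .dvc/config section after `dvc remote add`
# and `dvc remote modify` — exact formatting may differ slightly.
['remote "my_dataset"']
    url = s3://bucket/path/to/dataset
    worktree = true
```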

dvc import-url --worktree s3://bucket/path/to/dataset my_dataset could do the workflow above in a single command. The remote config could be stored either in the .dvc file or in config files. This would enable an easy workflow to update a cloud dataset like:

$ dvc import-url --worktree s3://bucket/path/to/dataset my_dataset
# Modify
$ rm my_dataset/old_files*
$ cp -R new_files/* my_dataset
# Track changes
$ dvc add/commit my_dataset
$ dvc push
@dberenbaum dberenbaum added A: cloud-versioning p1-important Important, aka current backlog of things to do labels Dec 16, 2022
@dtrifiro dtrifiro self-assigned this Jan 10, 2023
@dtrifiro dtrifiro added this to DVC Jan 10, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Jan 10, 2023
@dtrifiro dtrifiro moved this from Backlog to Todo in DVC Jan 10, 2023
@dtrifiro dtrifiro moved this from Todo to In Progress in DVC Jan 13, 2023
dberenbaum commented:

@dtrifiro Sorry for changing product requirements here, but thinking about this more, I lean towards this making more sense as dvc add --worktree than dvc import-url --worktree. It's a less confusing UX, since the commands after downloading the dataset (add, push) are more typical of dvc add.

Also, import-url normally assumes both source deps and outs, and here there are really no deps. The final output should be something like this:

outs:
- md5: 56db666947dc82019468d7fa5c25d5b4.dir
  size: 59789011
  nfiles: 2605
  path: cats-dogs
  files:
  - size: 34315
    etag: 10d2a131081a3095726c5721ed31c21f
    md5: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
    remote: myremote
  - size: 28377
    etag: 0f2bfe74e9c363064087d0cd8a322106
    md5: 0f2bfe74e9c363064087d0cd8a322106
    relpath: data/train/cats/cat.100.jpg
    remote: myremote

Compare that to import-url outputs, which would be like:

md5: bc6502513858d60c8273554d99917133
frozen: true
deps:
- md5: 32666fe69170a00790117e76b90e42e2.dir
  size: 59789011
  nfiles: 2605
  path: s3://dave-sandbox-versioning/test/worktree/cats-dogs
  files:
  - size: 34315
    version_id: svPEUz3wOVBye5DMfyXkpz7CB2r.DpfW
    etag: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
  - size: 28377
    version_id: d3zndjpU2vUZZ.oSqQ1FKb24jDPh_pcG
    etag: 0f2bfe74e9c363064087d0cd8a322106
    relpath: data/train/cats/cat.100.jpg

@dberenbaum dberenbaum changed the title cloud versioning: import-url --worktree cloud versioning: track existing cloud dataset and push updates to it Jan 14, 2023

dberenbaum commented Jan 14, 2023

You have to do something like:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ mkdir my_dataset
$ dvc add my_dataset
$ dvc pull

A slightly simpler workflow to accomplish this now would be:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ dvc add s3://bucket/path/to/dataset/my_dataset -o my_dataset

dvc add --worktree could be a shortcut for this, although in testing, dvc add -o was unusually slow, so we might need to clean up its UI if we use it.


daavoo commented Jan 17, 2023

What is the difference between the dvc add --worktree as described in the last comments and #8411 ?

Should the flag be dvc add --external and detect internally that it's a cloud versioned remote?

dberenbaum commented:

Should the flag be dvc add --external and detect internally that it's a cloud versioned remote?

dvc add --external saves the cloud path and metadata and operates exclusively on the cloud. There's no way to pull a local copy of the data.

dvc add s3://bucket/path/to/dataset/my_dataset -o my_dataset downloads the dataset but does not save the cloud path or cloud metadata.

What we need is a combination of those behaviors. As evidenced by these two commands, the current UX is a mess, so I'm not tied to any particular flag (if one is needed at all) as long as it:

  1. Records the cloud location and metadata.
  2. Allows for data to be pushed, pulled, and updated between local and cloud.
  3. Works with a single dvc add ... command.
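Put together, a single command meeting all three requirements might look roughly like this (the --worktree flag is the proposal under discussion in this thread, not an existing option, and the exact spelling is not final):

```shell
# Hypothetical: records the cloud location and metadata, downloads the
# data locally, and leaves it ready for dvc push/pull and dvc update.
$ dvc add --worktree s3://bucket/path/to/dataset -o my_dataset
```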

What is the difference between the dvc add --worktree as described in the last comments and #8411 ?

It might be related, but #8411 is focused on pipeline outputs written directly to the cloud. Since pipelines already support external dependencies without any special behavior, adding this feature for outputs should make a cloud-centric pipeline fairly easy. We still need to discuss in #8411 whether that should have the same behavior as this issue (for example, whether those outputs can be pulled locally).
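To illustrate the distinction with #8411, a cloud-centric pipeline stage might look roughly like this (stage name, script, and bucket paths are made up for illustration; external outputs are the part #8411 proposes):

```yaml
stages:
  preprocess:
    cmd: python preprocess.py           # hypothetical script
    deps:
      - s3://bucket/raw-data            # external deps already work today
    outs:
      - s3://bucket/processed-data      # external outs are what #8411 proposes
```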

@omesser omesser changed the title cloud versioning: track existing cloud dataset and push updates to it Cloud versioning: Track existing cloud dataset and push updates to it Jan 25, 2023
dberenbaum commented:

@dtrifiro Given that we are deprioritizing worktree remotes, we can close this issue, but first I wanted to check whether you have any POC you could push as a PR. I think it would be useful if we need to come back to this in the future.

@dberenbaum dberenbaum closed this as not planned Feb 7, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in DVC Feb 7, 2023