Cloud versioning: Track existing cloud dataset and push updates to it #8704
Comments
@dtrifiro Sorry for changing product reqs here, but thinking about this more, I lean towards it making more sense in

Also,

outs:
- md5: 56db666947dc82019468d7fa5c25d5b4.dir
  size: 59789011
  nfiles: 2605
  path: cats-dogs
  files:
  - size: 34315
    etag: 10d2a131081a3095726c5721ed31c21f
    md5: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
    remote: myremote
  - size: 28377
    etag: 0f2bfe74e9c363064087d0cd8a322106
    md5: 0f2bfe74e9c363064087d0cd8a322106
    relpath: data/train/cats/cat.100.jpg
    remote: myremote

Compare that to

md5: bc6502513858d60c8273554d99917133
frozen: true
deps:
- md5: 32666fe69170a00790117e76b90e42e2.dir
  size: 59789011
  nfiles: 2605
  path: s3://dave-sandbox-versioning/test/worktree/cats-dogs
  files:
  - size: 34315
    version_id: svPEUz3wOVBye5DMfyXkpz7CB2r.DpfW
    etag: 10d2a131081a3095726c5721ed31c21f
    relpath: data/train/cats/cat.10.jpg
  - size: 28377
    version_id: d3zndjpU2vUZZ.oSqQ1FKb24jDPh_pcG
    etag: 0f2bfe74e9c363064087d0cd8a322106
    relpath: data/train/cats/cat.100.jpg
...
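To make the comparison concrete, here is a minimal sketch that diffs the per-file fields of the two listings. It only uses the two entries quoted in this comment; the helper names `field_diff` and `same_objects` are hypothetical and are not part of DVC.

```python
# Per-file entries copied from the two listings quoted in this comment
# (first two files only). Nothing here calls DVC itself.
add_entries = [
    {"size": 34315, "etag": "10d2a131081a3095726c5721ed31c21f",
     "md5": "10d2a131081a3095726c5721ed31c21f",
     "relpath": "data/train/cats/cat.10.jpg", "remote": "myremote"},
    {"size": 28377, "etag": "0f2bfe74e9c363064087d0cd8a322106",
     "md5": "0f2bfe74e9c363064087d0cd8a322106",
     "relpath": "data/train/cats/cat.100.jpg", "remote": "myremote"},
]
import_entries = [
    {"size": 34315, "version_id": "svPEUz3wOVBye5DMfyXkpz7CB2r.DpfW",
     "etag": "10d2a131081a3095726c5721ed31c21f",
     "relpath": "data/train/cats/cat.10.jpg"},
    {"size": 28377, "version_id": "d3zndjpU2vUZZ.oSqQ1FKb24jDPh_pcG",
     "etag": "0f2bfe74e9c363064087d0cd8a322106",
     "relpath": "data/train/cats/cat.100.jpg"},
]


def field_diff(a, b):
    """Hypothetical helper: field names only in a, only in b, and in both."""
    keys_a = set().union(*(entry.keys() for entry in a))
    keys_b = set().union(*(entry.keys() for entry in b))
    return sorted(keys_a - keys_b), sorted(keys_b - keys_a), sorted(keys_a & keys_b)


def same_objects(a, b):
    """Hypothetical helper: do both listings describe the same files?

    Files are matched on (relpath, size, etag), the fields both formats share.
    """
    def key(entry):
        return (entry["relpath"], entry["size"], entry["etag"])
    return sorted(map(key, a)) == sorted(map(key, b))


only_add, only_import, shared = field_diff(add_entries, import_entries)
print(only_add)     # ['md5', 'remote']
print(only_import)  # ['version_id']
print(shared)       # ['etag', 'relpath', 'size']
print(same_objects(add_entries, import_entries))  # True
```

For these two entries, the first listing carries `md5` and `remote` per file, while the second carries `version_id`; both describe the same objects by `relpath`, `size`, and `etag`.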
A slightly simpler workflow to accomplish this now would be:

$ dvc remote add my_dataset s3://bucket/path/to/dataset
$ dvc remote modify my_dataset worktree true
$ dvc add s3://bucket/path/to/dataset/my_dataset -o my_dataset
What is the difference between the Should the flag be
What we need is a combination of those behaviors. As evidenced by these two commands, the current UX is a mess, so I'm not that tied to which flag this fits under (if it needs any flag) as long as it:
It might be related, but #8411 is focused on pipeline outputs written directly to the cloud. Since pipelines already support external dependencies without any special behavior, adding this feature for outputs should make a cloud-centric pipeline fairly easy. We still need to discuss in #8411 whether that should have the same behavior as this issue (for example, whether those outputs can be pulled locally).
@dtrifiro Given that we are deprioritizing
Related to #8382. The current UX is awkward if you want to track an existing cloud dataset and later push updates to it. You have to do something like:
Edit: replaced dvc pull with dvc update my_dataset.

dvc import-url --worktree s3://bucket/path/to/dataset my_dataset could do the workflow above in a single command. The remote config could be stored either in the .dvc file or in config files. This would enable an easy workflow to update a cloud dataset like: