cloud versioning: pull non-DVC cloud updates #8382

Closed
dberenbaum opened this issue Sep 29, 2022 · 5 comments · Fixed by #8649
Labels: A: data-sync (related to dvc get/fetch/import/pull/push), p1-important (important, aka current backlog of things to do)

Comments

@dberenbaum
Contributor

In a project with a cloud-versioned remote, there could be either manual or automated updates happening on the remote outside of any DVC process. For example, a daily ETL process uploads new data directly to the cloud remote. Or someone unfamiliar with DVC wants to upload some new data.

A DVC user should be able to not only pull the versions of data tracked in their project, but also pull the latest version of data available on the cloud. For example, dvc update could be used to get the latest data the same way it does today for imports.

@dberenbaum dberenbaum added A: data-sync Related to dvc get/fetch/import/pull/push A: cloud-versioning labels Sep 29, 2022
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Oct 12, 2022
@dberenbaum dberenbaum added p3-nice-to-have It should be done this or next sprint and removed p2-medium Medium priority, should be done, but less important labels Nov 17, 2022
@dberenbaum dberenbaum added p1-important Important, aka current backlog of things to do and removed p3-nice-to-have It should be done this or next sprint labels Nov 29, 2022
@pmrowla pmrowla self-assigned this Nov 29, 2022
@pmrowla pmrowla added this to DVC Nov 29, 2022
@pmrowla pmrowla moved this to Backlog in DVC Nov 29, 2022
@pmrowla pmrowla moved this from Backlog to Todo in DVC Nov 29, 2022
@pmrowla
Contributor

pmrowla commented Nov 29, 2022

After discussion with @dberenbaum, the scope for this issue in the initial release will be:

  • dvc update will only be applicable in worktree = true scenarios (and not version_aware = true, worktree = false)
  • update should pull the latest versions of outs we are aware of (and have .dvc files for locally).
    • For a file output, update would get latest modifications to the file
    • For a dir output, update would get latest modifications to files in the dir, newly added files in the dir, and deletions of files within the dir
  • update will ignore files/dirs in the remote bucket that we are not aware of already (i.e. new files/dirs that are in the bucket but do not appear in any of our local .dvc/dvc.yaml files)
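
The scope above can be illustrated with a minimal sketch of how a directory output might be compared against the remote. This is not DVC's implementation: `UpdatePlan`, `plan_dir_update`, and the dictionary shapes are hypothetical, and the remote listing is assumed to already contain only the latest, non-deleted version of each key.

```python
from dataclasses import dataclass, field


@dataclass
class UpdatePlan:
    # Hypothetical container for this sketch; not a DVC class.
    modified: list = field(default_factory=list)
    added: list = field(default_factory=list)
    deleted: list = field(default_factory=list)


def plan_dir_update(dir_prefix, tracked, remote_latest):
    """Compare one tracked directory against the latest remote listing.

    tracked:       {path relative to the dir: version ID recorded locally}
    remote_latest: {full bucket key: latest version ID}; keys whose latest
                   version is a delete marker are assumed to be omitted
    """
    prefix = dir_prefix.rstrip("/") + "/"
    # Only keys under the tracked directory are considered. Anything else
    # in the bucket is unknown to the project and is ignored.
    remote = {
        key[len(prefix):]: version
        for key, version in remote_latest.items()
        if key.startswith(prefix)
    }
    plan = UpdatePlan()
    for path, version in sorted(remote.items()):
        if path not in tracked:
            plan.added.append(path)      # new file inside the tracked dir
        elif tracked[path] != version:
            plan.modified.append(path)   # newer version exists remotely
    # Tracked files that are no longer present remotely.
    plan.deleted = sorted(set(tracked) - set(remote))
    return plan
```

For example, with `data/` tracked locally, a key such as `other/x.csv` in the same bucket would not appear in any of the three lists.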

@pmrowla
Contributor

pmrowla commented Nov 29, 2022

@dberenbaum one thing we didn't discuss was how to handle deletion for a standalone file output, where a tracked file no longer exists in the bucket (and has a delete marker set).

  • For pipeline stage outs this seems like an error or at least a warning case to me? Or maybe this could also just be silently ignored (since the worktree + pipeline use case is fairly vague at the moment)?
  • For regular dvc add outs, we could treat it as if the out had been removed with dvc remove and delete the corresponding .dvc file, but this is dangerous (if the user hasn't git committed their recent changes to the .dvc file, those changes would be lost). It seems like this should just be an error/warning case as well. We could also consider outputting a suggestion like: "... files were marked as delete in the remote, use 'dvc remove ...' to delete them locally"

@dberenbaum
Contributor Author

One thing to clarify that may simplify this: For now, it's fine to keep the current syntax for dvc update where it requires a target. We may need to support dvc update without a target eventually, but for now being able to update a dataset at a time is enough.

> @dberenbaum one thing we didn't discuss was how to handle deletion for a standalone file output, where a tracked file no longer exists in the bucket (and has a delete marker set).

If you import data with import/import-url, then delete the source and run dvc update, it will fail. Let's stay consistent with that for now.
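
As a rough sketch of that "fail, like imports do" behavior for a standalone file output: the error type and function below are hypothetical names chosen for illustration, not DVC APIs, and the listing is again assumed to omit keys whose latest version is a delete marker.

```python
class UpdateError(Exception):
    """Hypothetical error type for this sketch; not a DVC class."""


def latest_version_or_fail(key, remote_latest):
    """Return the latest version ID for a tracked standalone file.

    remote_latest: {bucket key: latest version ID}; delete-marked keys
    are assumed to be absent from this mapping.
    """
    try:
        return remote_latest[key]
    except KeyError:
        # The source is gone, so the update fails instead of silently
        # deleting the local .dvc file.
        raise UpdateError(
            f"'{key}' no longer exists in the remote; cannot update"
        ) from None
```

Failing here leaves the local .dvc file untouched, so uncommitted changes are not lost.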

@pmrowla pmrowla moved this from Todo to In Progress in DVC Nov 30, 2022
@pmrowla
Contributor

pmrowla commented Dec 1, 2022

@dberenbaum on dvc update, should the .dvc file be updated to contain all of the latest version IDs? Or should we keep existing/old version IDs (to avoid generating potential merge conflicts like we do on dvc push)?

@dberenbaum
Contributor Author

🤔 If we can keep the existing/old version IDs, that seems best.
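
One way to picture "keep the existing version IDs" is the sketch below. It assumes each entry carries a content identifier (called `etag` here) alongside its `version_id`, and that an unchanged identifier means the content is unchanged. The function name and entry shape are hypothetical and this is not necessarily how DVC resolves it.

```python
def merge_version_ids(old_entries, new_entries):
    """Prefer previously recorded version IDs when content is unchanged.

    Both arguments map path -> {"etag": ..., "version_id": ...}
    (an assumed shape for this sketch).
    """
    merged = {}
    for path, new in new_entries.items():
        old = old_entries.get(path)
        if old is not None and old["etag"] == new["etag"]:
            # Same content: keep the old version ID so the .dvc file does
            # not change and cannot cause a spurious merge conflict.
            merged[path] = dict(old)
        else:
            # New or modified file: record the latest version ID.
            merged[path] = dict(new)
    return merged
```

Paths missing from `new_entries` are dropped from the result, which matches the deletion handling for files inside a directory output.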

Repository owner moved this from In Progress to Done in DVC Dec 6, 2022