cloud versioning: multiple remotes/metadata format #8356
Supporting multiple remotes per-output leads to questionable behavior unless the user is always pushing everything to all of the remotes they intend to use, every time they make a modification to data in their DVC repo.
This is similar to the problem in #8298. I think there are mainly 2 common use cases for having multiple remotes in DVC: data segregation (pushing different outputs to different remotes), and read-only vs read/write access.
In practice, the data segregation use case only works in DVC if the user is already mapping specific outputs to specific remotes. In this case, they are actually still only pushing each output to a single remote (per output), and should probably be using the per-output remote field for that anyway.

For the read-only vs read/write use case, it works in non-cloud-versioning scenarios since you can do things like set up an S3 remote with write credentials, and then use a public read-only endpoint for the same storage as the default remote. For DVC's purposes, this is not possible to do in a cloud versioning scenario, unless your default read-only remote is still the same versioned bucket.
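For context, a minimal sketch of that non-cloud-versioning read-only/read-write pattern; the remote names and URLs here are placeholders, not taken from the issue:

```
# Writers push through an S3 remote configured with write credentials.
dvc remote add s3-write s3://example-bucket/dvcstore

# Readers pull through a public, read-only endpoint for the same storage,
# set as the default remote.
dvc remote add -d public-read https://example-bucket.s3.amazonaws.com/dvcstore

dvc pull              # uses the default read-only remote
dvc push -r s3-write  # writers push explicitly to the credentialed remote
```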
True, so for this use case, it may not be needed to append multiple version_ids.
I think users may want private but still read-only cloud remotes (the bucket policy grants the user read permissions but not write permissions). A 3rd use case is migrating remotes (let's say a company switches cloud providers). Without keeping track of the remote that generated each version_id, it becomes hard to tell which remote historical data came from. This info should probably be separate from the existing remote config.
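Purely to illustrate "keeping track of the remote that generated each version", a hypothetical .dvc entry might record something like the following; the field layout and values are invented for this sketch, not a proposed format:

```
# Hypothetical .dvc contents (illustrative only): the remote name is stored
# alongside the cloud version metadata for the output.
cat data.csv.dvc
# outs:
# - path: data.csv
#   size: 1042
#   etag: '"0x8DA9F000000000"'
#   version_id: 3sL4kJ2pQ9xYzExample
#   remote: old-provider-remote
```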
Moving this to p1 since it will be a breaking change and keeps coming up.
see also: #8653 (comment)
Moving this to p0 since it blocks any public release of cloud versioning.
Thinking more about this, I think it would be better to only support a single remote for cloud versioning. Since versioning is happening in the cloud and not in DVC, I think it makes it easier if it were tied to a single remote. Still working on this, but I am not very convinced by the above reasons for the need. :)
I agree that it's questionable whether or not we should support pushing a single DVC output to more than one remote, but I think we do need to at least support multiple (non-default) remotes in a DVC repo. This could be done via the existing remote config.
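A quick sketch of the "multiple non-default remotes" setup being referred to; names and URLs are placeholders:

```
# One default remote plus an additional, explicitly named remote.
dvc remote add -d main s3://main-bucket/dvcstore
dvc remote add archive s3://archive-bucket/dvcstore

# The non-default remote is only used when named explicitly.
dvc push              # goes to "main"
dvc push -r archive   # goes to "archive"
```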
@dberenbaum the thing to remember is that we can't support migrating cloud versioned remotes the same way that we can with regular remotes. For cloud versioning, users will have to maintain the existing remote for as long as they want to access the data versioned by older git commits. We can technically push historical data to a new remote as object versions in the new remote, but we cannot actually update the old .dvc or dvc.lock files (unless we rewrite git history to replace/update every historical git commit with new .dvc/dvc.lock files). So even if we supported pushing old data to a new remote, there would be no way for the user to actually access it afterwards.

Supporting multiple remotes would really only work in a workflow where the user is explicitly pushing a file to multiple remotes all at once, every time that file changes.
We could work around this by making DVC use the remotes defined in the DVC config for a particular git commit (instead of the current behavior where we use the latest commit's remote and try to get all historical DVC objects from that one remote).
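For what it's worth, the remote configuration that was in effect for an older commit can already be inspected directly from git history, which is roughly the information DVC would need to consult for this workaround (the revision here is arbitrary):

```
# Remote config as it existed at an older commit...
git show HEAD~20:.dvc/config

# ...versus the config at the currently checked-out revision.
cat .dvc/config
```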
For regular DVC remotes, you can do things like define a default read-only remote alongside a separate remote with write credentials for the same storage. For cloud versioning, the only practical way to handle this for versioned remotes would be for users to manage it on the cloud credentials side. So you would only define a single s3 remote, and then manage access through AWS permissions (where some users only have read-only access to a bucket, and some users have write/delete privileges).
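A sketch of that single-remote setup, with read vs read/write handled entirely on the AWS side; the bucket name is a placeholder:

```
# One cloud-versioned remote shared by everyone; the DVC config is identical
# for read-only and read/write users.
dvc remote add -d versioned s3://example-bucket/data
dvc remote modify versioned version_aware true

# Access is controlled by AWS, not DVC: read-only users get s3:GetObject and
# s3:GetObjectVersion on the bucket, writers additionally get s3:PutObject.
# Object versioning must be enabled on the bucket itself:
aws s3api get-bucket-versioning --bucket example-bucket
```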
It seems like maybe we should just always explicitly set the remote.
Is that how it works right now for cloud versioning? Only the latest config is used?
For cloud versioning right now we always use either an explicitly specified remote or the default one (the relevant code is at lines 80 to 84 and lines 132 to 133 in fc1a192).
We do actually fill the values with the correct remote per-commit when we generate the repo index for a given commit, so fixing this for fetch/pull should be doable. For push we don't support pushing old commits since we can't rewrite git history, but we still have the same bug (with replacing the index-filled remote) to fix there as well. For push this is also a dvc-data issue.
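As a concrete example of the pull side being discussed (revision name and target are placeholders):

```
# Fetch data referenced by an older revision. The repo index for that revision
# is filled with the remote that applied at the time, but today the remote
# from the current config ends up being used instead.
dvc fetch --rev v1.0 data.csv

# Fetching across history hits the same behavior for every revision.
dvc fetch --all-commits
```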
I agree that handling multiple remotes is not a critical use case. This is a p0 because it defines the metadata structure for cloud versioning. Maybe @pmrowla described the problem better in iterative/dvc-data#241.

If you look up through this issue and notice how many times it has already been referenced by other issues in early development of cloud versioning, it might help clarify that having no information about the remote associated with the etag/version_id has quickly become a mess, so we need to keep this info in the .dvc file. Even outside of cloud versioning, imports and other features that reference etags or other external data might benefit from keeping info about their remotes.

Whether to allow for one or multiple remote entries is less critical. It seemed cleaner to me since there is not much to prevent users from configuring multiple remotes and pushing to them, and I don't know how dvc would handle those situations, but it's not a blocker as a product requirement.
@pmrowla Migrating remotes is also not a blocker, but looking at the full comment here about migrating remotes, it still seems like there are benefits to having the remote info saved.
Do we plan to support content-addressable storage and cloud-versioned storage at the same time?
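For reference, a sketch of what mixing the two kinds of remotes would look like at the config level (names and URLs are placeholders):

```
# A traditional content-addressed remote...
dvc remote add -d cas s3://example-bucket/dvc-cache

# ...alongside a cloud-versioned (version_aware) remote in the same repo.
dvc remote add versioned s3://example-bucket/data
dvc remote modify versioned version_aware true
```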
Also, what should happen if you do push to a different remote?
Since we really only support pushing to a single remote now, should we be setting that explicitly? Right now nothing stops you, but I'm not sure it makes sense for us to allow this behavior.
I think we should just prevent it. What you are saying makes sense, but I think that's more of a product question on how it should behave.
It's fine to drop
Sounds good. It probably makes sense to set the top-level |
Part of #7995
Description
Cloud versioning does not support multiple remotes.
Reproduction
You can reproduce with the script below. Replace CLOUD_REMOTE_1 and CLOUD_REMOTE_2 with your own s3 paths and configure your credentials.
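The reproduction script itself did not survive in this copy of the issue; a minimal sketch of the sequence it describes, assuming two version-enabled S3 buckets:

```
#!/bin/bash
set -eux

git init repro && cd repro
dvc init

# Two cloud-versioned remotes; CLOUD_REMOTE_1/CLOUD_REMOTE_2 are your s3 paths.
dvc remote add -d remote1 "$CLOUD_REMOTE_1"
dvc remote modify remote1 version_aware true
dvc remote add remote2 "$CLOUD_REMOTE_2"
dvc remote modify remote2 version_aware true

# Track a file and push it to the first remote.
echo foo > data.txt
dvc add data.txt
dvc push -r remote1
cat data.txt.dvc   # records the version_id from remote1

# Push the same output to the second remote: the recorded version_id is
# overwritten rather than kept per remote.
dvc push -r remote2
cat data.txt.dvc
```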
Expected
DVC should keep the remote for each version_id, so that when pushing to a different remote, it appends a version_id for the other remote instead of overwriting it. When using the remote: syntax in a .dvc file, DVC should push to that remote instead of the default.