This issue was moved to a discussion.

Change remote globally in Git history #2960

Closed
dmpetrov opened this issue Dec 16, 2019 · 22 comments
Labels
discussion requires active participation to reach a conclusion feature request Requesting a new feature

Comments

@dmpetrov
Member

dmpetrov commented Dec 16, 2019

Changing buckets and cloud easily for a single project is a very compelling feature.

In practice, though, when I need to move a data remote from one bucket to another (or to another cloud), I can do that properly only at the HEAD of the repo. All the old commits will still have the old remote (in .dvc/config). As a result, when I check out an older commit in the Git history, the old remote will be used.

So I need to keep the old data remote (bucket), or I'll have trouble using my old Git commits.

Is it possible to make remote settings "global"? A single remote change should change it everywhere in the Git history. How does Git do that, and can it work for DVC? Are there other options?

All ideas are welcome!

@dmpetrov dmpetrov added feature request Requesting a new feature discussion requires active participation to reach a conclusion labels Dec 16, 2019
@efiop
Contributor

efiop commented Dec 16, 2019

@dmpetrov This is mostly about properly maintaining your project, e.g. using tags or branches for released versions, which can be easily updated to point to the new bucket.

@efiop
Contributor

efiop commented Dec 16, 2019

E.g. say some project died and closed its remote, but someone has the cache somewhere. In that case, you fork it, set up your own remote, and rely on that. Or you could pass -r remote to specific commands, like import/get (which we don't support right now, but we do have a ticket open).

@dmpetrov
Member Author

@efiop could you please elaborate? How should the team properly maintain the project to minimize this amount of work?

Use case: A team has a long-running project (year+), hundreds of commits, dozens of releases (with tags). All Git history is important. Some clients might still use old ML models, all changes in data sources have to be fully tracked from the inception. Not released commits (without tags) might be also important (we decided to release an old but a promising experiment).

Suddenly the team decides to change its cloud provider. It looks like a tremendous amount of work has to be done to make that happen. It would be great if DVC could do it in a few commands.

@efiop
Contributor

efiop commented Dec 22, 2019

Oops, sorry for the delay.

> Not released commits (without tags) might be also important (we decided to release an old but a promising experiment).

Well, giving that to your customers was a bad idea from the beginning 🙂

So say you have some tag like v1 that was using s3. Now you are migrating to gs, so what you do is you go to v1, create a branch from it, adjust the remote to point to the new gs location, commit, move v1 to this new commit. Then you'll have to make your users update.

But if you, as a maintainer, would take it more seriously earlier, you would create some kind of proxy remote. E.g. http remote (like we do in dvc core project) that you will be able to trivially switch from s3 to gs without needing to adjust anything in the projects themselves.
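The proxy idea can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not a description of how the DVC core project's HTTP remote is actually set up: the backend URL, the `redirect_target` helper, and the `serve` function are all hypothetical, the backend objects are assumed to be readable over plain HTTPS, and whether DVC's HTTP remote follows redirects in every case is not confirmed in this thread. The point is that projects keep pointing at one stable URL, and only the proxy knows which bucket is behind it.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical backend location. Moving from one cloud to another would
# mean changing only this value, not any committed .dvc/config.
BACKEND_BASE_URL = "https://storage.googleapis.com/my-bucket/dvc-cache"


def redirect_target(request_path: str, backend_base_url: str = BACKEND_BASE_URL) -> str:
    """Map a path requested from the proxy to the matching backend URL."""
    return backend_base_url.rstrip("/") + "/" + request_path.lstrip("/")


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET/HEAD with a 302 redirect to the current backend."""

    def _redirect(self) -> None:
        self.send_response(302)
        self.send_header("Location", redirect_target(self.path))
        self.end_headers()

    do_GET = _redirect
    do_HEAD = _redirect


def serve(port: int = 8000) -> None:
    """Run the redirecting proxy until interrupted (not called here)."""
    HTTPServer(("127.0.0.1", port), RedirectHandler).serve_forever()
```

With something like this in place, every commit in history would reference the proxy's URL as its remote, and the switch from s3 to gs would happen in one place outside the repository.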

@dmpetrov
Member Author

dmpetrov commented Dec 23, 2019

@efiop these are good workarounds. I urge everyone to think about a holistic solution instead.

Ideally, we need the same solution as Git has - I can easily move a repo with the entire history from, for example, GitHub to GitLab by a couple of commands just by changing remote, pushing and removing the old one. Is there a way to implement something similar in DVC?

@ghost

ghost commented Dec 23, 2019

@dmpetrov , @efiop , what about using external tools like the AWS CLI or the Google Cloud Platform CLI to sync the cache?

For example, if you are migrating to GCP, it would be gsutil cp -r .dvc/cache gs://my-bucket

@ghost

ghost commented Dec 23, 2019

If you don't care about modifying your Git history, you can run git rebase -i --root and modify the commit that added the remote so that it points to the new one.
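Whatever tool ends up rewriting the history, the per-commit change is just "replace the url of one remote in .dvc/config". A minimal sketch of that step is below. The `retarget_remote` helper and the sample remote name and bucket URLs are hypothetical, and the section layout is assumed to look like the one DVC writes (`['remote "name"']` followed by an indented `url = ...` line). Keep in mind that rewriting history changes commit hashes.

```python
import re

# Matches a remote section header such as ['remote "storage"'].
_REMOTE_SECTION = re.compile(r"""^\s*\[\s*'?remote\s+"(?P<name>[^"]+)"'?\s*\]\s*$""")
# Matches any section header, e.g. [core].
_ANY_SECTION = re.compile(r"^\s*\[.*\]\s*$")
# Matches a url option line and captures its indentation.
_URL_LINE = re.compile(r"^(?P<indent>\s*)url\s*=.*$")


def retarget_remote(config_text: str, remote_name: str, new_url: str) -> str:
    """Return config_text with the url of the named remote replaced."""
    in_target = False
    out = []
    for line in config_text.splitlines():
        if _ANY_SECTION.match(line):
            header = _REMOTE_SECTION.match(line)
            in_target = bool(header) and header.group("name") == remote_name
        elif in_target:
            url_line = _URL_LINE.match(line)
            if url_line:
                line = f"{url_line.group('indent')}url = {new_url}"
        out.append(line)
    trailing_newline = "\n" if config_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline


# Hypothetical example config, as it might appear in an old commit.
old_config = (
    "[core]\n"
    "    remote = storage\n"
    "['remote \"storage\"']\n"
    "    url = s3://old-bucket/path\n"
)
new_config = retarget_remote(old_config, "storage", "gs://new-bucket/path")
```

Only the matching remote's url line changes; other sections and other remotes are passed through untouched.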

@dmpetrov
Member Author

@MrOutis sure, you can use the tools. But it is not clear how to change the links in Git history: you will still have .dvc/config files in history pointing to the old remote. This is the major question.

Modifying the commit - yes, it is a possibility. Is there any better way to define and change data remotes? (most likely it should not be committed to config)

@shcheklein
Member

👍 on my end, it feels that there should be a better solution to this.

@efiop
Contributor

efiop commented Dec 25, 2019

> Ideally, we need the same solution as Git has - I can easily move a repo with the entire history from, for example, GitHub to GitLab by a couple of commands just by changing remote, pushing and removing the old one. Is there a way to implement something similar in DVC?

Sure:

dvc pull (maybe with --all-commits or --all-tags or --all-branches)
dvc remote add gs gs://bucket/path
dvc push -r gs (maybe with --all-commits or --all-tags or --all-branches)
dvc remote remove s3
dvc remote default gs
git add .dvc/config
git commit
git push

Maybe I don't quite understand how you propose to "modify git history". The workarounds I've provided initially update the branches/tags that are used or propose to use some type of proxy to route your pulls. Currently I can't think of any other way to achieve this.

@shcheklein
Member

How about we always rely on the latest commit in a branch to determine the actual remote? No matter what is committed in the history.

@efiop
Contributor

efiop commented Dec 26, 2019

@shcheklein sounds fragile and non-obvious. Plus, it again won't work until the user runs dvc update or something similar to refresh the Git repo cache. You might also want to move to another remote while dropping the old one, which would also break this.

@dmpetrov
Member Author

dmpetrov commented Jan 3, 2020

@efiop The code you provided looks like another workaround, not a holistic solution for changing remotes "globally". The problem - if a user checks out an old revision (clones or imports v2 tag for example) it will point to an outdated remote.

@shcheklein thank you! It is definitely a global solution that might work. There are some issues with this approach (thanks @efiop for pointing this out), but at least we have something to consider and/or improve.

@Suor
Contributor

Suor commented Jan 26, 2020

It looks like you simply want to rewrite the history for the config file.

EDIT. Another option is completely separating config from git history, which we already support in the form of global/user/local configs.

@dimitry12

> But if you, as a maintainer, would take it more seriously earlier, you would create some kind of proxy remote. E.g. http remote (like we do in dvc core project) that you will be able to trivially switch from s3 to gs without needing to adjust anything in the projects themselves.

@efiop can you please elaborate or point to the details of proxy-remote implementation?

Also, to summarize the possible solutions I see in the thread, since I am facing the same inconvenience (I want to import an older model from a GitHub-based registry using the --rev parameter, and at the time of that old revision the registry used a dvc-remote that is now gone):

  • Rewrite git history by editing .dvc/config in all commits so that it contains the currently active dvc-remote. That breaks many things outside of the dvc as commit hashes change.
  • (yet unsupported?) --remote parameter for dvc import which would use one of the dvc-remotes specified in master/main's .dvc/config?
  • (unsupported for good) When using dvc import, automatically use the default remote from master/main's .dvc/config instead of that revision's .dvc/config. I agree that's a fragile and breaking change, and it may not work for teams who don't dvc push --all-commits.
  • Using proxy/alias for dvc-remote (see the first part of this comment). Seems like a nice solution.
  • Use global .dvc/config when dvc is in the role of data-registry. How would that work if I want to use different default dvc-remotes for different registries?

@dberenbaum
Contributor

> Use global .dvc/config when dvc is in the role of data-registry. How would that work if I want to use different default dvc-remotes for different registries?

Hi @dimitry12, for this option, would dvc config --local (.dvc/config.local) work? It would give you a local config that's ignored by Git.
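To illustrate why a local config helps here, below is a small sketch of layered config resolution, assuming (as I understand DVC's config levels) that values in .dvc/config.local take precedence over the committed .dvc/config. The `merge_configs` helper, the remote name, and the bucket URLs are hypothetical and are not part of DVC's API.

```python
from typing import Dict, Mapping, Sequence

Config = Dict[str, Dict[str, str]]


def merge_configs(layers: Sequence[Mapping[str, Mapping[str, str]]]) -> Config:
    """Merge config layers; later layers override earlier ones per option."""
    merged: Config = {}
    for layer in layers:
        for section, options in layer.items():
            merged.setdefault(section, {}).update(options)
    return merged


# What an old commit's .dvc/config might contain (hypothetical values).
repo_config = {
    "core": {"remote": "storage"},
    'remote "storage"': {"url": "s3://old-bucket/path"},
}
# What a Git-ignored .dvc/config.local might contain (hypothetical values).
local_config = {
    'remote "storage"': {"url": "gs://new-bucket/path"},
}

# Lowest-priority layer first, highest-priority layer last.
effective = merge_configs([repo_config, local_config])
```

Because the local file is not tracked, the override stays in effect no matter which commit is checked out, but it has to be set up in every working copy.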

@Suor
Contributor

Suor commented Mar 8, 2021

@dberenbaum yes, local config will also work, as long as you manage it properly, i.e. update it across all your working copies.

@dberenbaum
Contributor

Thanks, @Suor! I'm wondering if this issue can be closed then since it seems that the introduction of the local config makes changing remotes globally possible and on par with Git.

@dimitry12

Local config works for me.

@shcheklein
Member

@dberenbaum my 2 cents on this: local config can be a good temporary solution, but it somewhat undermines the point of repos being self-descriptive/self-contained. Why can that be important? For two reasons (and maybe I'm missing something else):

  1. Asking everyone to set up a local config means an additional step people will need to remember to do. Not the biggest problem, but it makes everything a bit more fragile.
  2. It complicates automation: CI (e.g. CML), Viewer, and other environments where we access repos from code. Local config is a possibility but will require additional steps from users. In the Viewer it will require additional UI/UX to manage this.

@dberenbaum
Contributor

Another argument for why local config is insufficient: for data registry repos where the data is being fetched from outside the project via get/import, this won't work. See https://discord.com/channels/485586884165107732/563406153334128681/826365388400754728.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

6 participants