
DVC PULL doesn't work when i pushed data from one GIT Repo to S3 remote storage and try pulling from other GIT Repo #4253


Closed
aswin-datakalp opened this issue Jul 21, 2020 · 7 comments
Labels: awaiting response, question

Comments

@aswin-datakalp

Hi,

Details:
DVC version: 1.1.7
Python version: 3.6.9
Platform: Linux-5.3.0-62-generic-x86_64-with-Ubuntu-18.04-bionic
Binary: False
Package: pip
Supported remotes: http, https, s3

Problem:
I have two Git repos, say "Repo_1" and "Repo_2".

In Repo_1 I initialize DVC, add a few data files, add an S3 bucket as remote storage, and push the data there with dvc push. I can see the data in the AWS S3 bucket.

Then in Repo_2 I initialize DVC with "dvc init", add the same S3 bucket as remote storage, and try to pull the data with "dvc pull". But no data is pulled!

Can you explain why it is not possible to link two Git repos this way, pushing data from one repo to AWS and pulling it from another repo that has the same AWS remote storage?

Thanks

@ghost added the triage label Jul 21, 2020
@aswin-datakalp changed the title from "DVC PULL doesn't work when i pushed data from one GIT repo to S3 remote storage and try pulling from other" to "DVC PULL doesn't work when i pushed data from one GIT Repo to S3 remote storage and try pulling from other GIT Repo" Jul 21, 2020
@efiop
Contributor

efiop commented Jul 21, 2020

Hi @aswin-datakalp Short answer: It should be the same repo :) Information about dvc-tracked files is stored in git-tracked dvc-files, so you not only need to dvc push, but also to git push your changes to dvc-files and then git pull them in another instance of your repo before you can dvc pull. Take a look at this https://dvc.org/doc/use-cases/sharing-data-and-model-files article where we go through the process. Also check out our get-started guide https://dvc.org/doc/start if you haven't already.

@efiop added the awaiting response and question labels Jul 21, 2020
@ghost removed the triage label Jul 21, 2020
@aswin-datakalp
Author

Hi @efiop,

I think you have misunderstood.

Your solution will definitely work when two people use the same Git repo: person 1 adds data and does dvc push and git push, then person 2 first does git pull to get the latest Git-tracked DVC files, and after that dvc pull works for them.

My problem is different. I have two different Git repositories, both initialized with DVC and both pointing to the same remote storage (an AWS S3 bucket).

I push data from repo 1 to the AWS storage bucket, and that succeeds.

Then I go to the other repo, which is initialized with dvc init, and add the remote storage with dvc remote add -d <name> <bucket_url>. Adding it also succeeds.

This confirms that both repos share the same remote storage.

When I run dvc pull from repo 2 (the data was added and pushed to the remote from repo 1, and I am pulling it from repo 2), no data is pulled.

Can you help me do this kind of explicit pull, where you push to AWS from one repo and then pull from a completely different repo that has the same remote storage configured and nothing else?

I hope the problem is clear now!

@efiop
Contributor

efiop commented Jul 22, 2020

@aswin-datakalp That won't work. A DVC remote is just content-based storage that doesn't know the names of the files it stores. The names are stored in your Git repo (e.g. in data.dvc). You can copy that data.dvc to your other repo and then dvc pull, but that is hacky.

What you probably want is dvc get/import (see https://dvc.org/doc/use-cases/data-registries), which lets you bring data from one DVC repo into another Git/DVC repo.
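To illustrate the point about content-based storage, here is a toy Python model (not DVC's actual code). The class names `ContentStore` and `Repo` and the `pointers` dictionary are hypothetical stand-ins for the remote, a Git repo, and the Git-tracked .dvc files. The assumption is only that the remote maps a content hash to bytes and nothing else.

```python
import hashlib


class ContentStore:
    """Toy stand-in for a DVC remote: blobs keyed only by their content hash."""

    def __init__(self):
        self.blobs = {}

    def push(self, data: bytes) -> str:
        digest = hashlib.md5(data).hexdigest()
        self.blobs[digest] = data  # note: no file name is stored
        return digest

    def pull(self, digest: str) -> bytes:
        return self.blobs[digest]


class Repo:
    """Toy stand-in for a Git repo that holds name -> hash pointer files."""

    def __init__(self, remote: ContentStore):
        self.remote = remote
        self.pointers = {}   # plays the role of the Git-tracked *.dvc files
        self.workspace = {}

    def add_and_push(self, name: str, data: bytes) -> None:
        self.workspace[name] = data
        self.pointers[name] = self.remote.push(data)

    def pull(self) -> list:
        """Fetch every file named by a pointer; return the names restored."""
        for name, digest in self.pointers.items():
            self.workspace[name] = self.remote.pull(digest)
        return sorted(self.pointers)


remote = ContentStore()
repo_1 = Repo(remote)
repo_1.add_and_push("data.csv", b"a,b\n1,2\n")

repo_2 = Repo(remote)        # same remote, but no pointer files
first_pull = repo_2.pull()   # nothing to ask the remote for

repo_2.pointers = dict(repo_1.pointers)  # the "copy data.dvc over" workaround
second_pull = repo_2.pull()

print(first_pull, second_pull)
```

In this sketch the second repo restores nothing until it has the pointer, even though the blob is sitting in the shared store the whole time.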

@dparmar61

dparmar61 commented Apr 8, 2022

@efiop https://dvc.org/doc/use-cases/sharing-data-and-model-files gives a 404 error. I tried pulling a repo from GitHub that has a .dvc file into a new repo and ran dvc pull, but it gives the error below.

WARNING: No file hash info found for '/dhaval/dvc_test/DVC_test/nlp_outbound/normalization/config'. It won't be created.
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/dhaval/dvc_test/DVC_test/nlp_outbound/normalization/config
Is your cache up to date?
https://error.dvc.org/missing-files

@dberenbaum
Contributor

@dparmar61 It looks like all the data was pulled except for /dhaval/dvc_test/DVC_test/nlp_outbound/normalization/config. Most likely, this is expected behavior and you can ignore the error. For example, that path might be specified as an output of a pipeline in dvc.yaml, and if that pipeline has not been run or its contents were not saved in a corresponding dvc.lock file, you will get an error like the one you saw. It should only be a concern if you know that this path was pushed to the remote.
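The situation described above can be pictured with a small toy Python sketch. It is only an illustration of the reasoning, not DVC's implementation: `plan_checkout`, the example paths, and the hash value are all made up, and the assumption is simply that an output can be restored only if a hash was recorded for it.

```python
def plan_checkout(declared_outputs, recorded_hashes):
    """Split declared outputs into those with a recorded hash and those without."""
    restorable = [p for p in declared_outputs if p in recorded_hashes]
    skipped = [p for p in declared_outputs if p not in recorded_hashes]
    return restorable, skipped


# Hypothetical example: two outputs are declared, but only one has a hash on record.
declared = ["data/train.csv", "normalization/config"]
recorded = {"data/train.csv": "0123abcd"}  # made-up hash value

restorable, skipped = plan_checkout(declared, recorded)
print("can restore:", restorable)
print("no hash info, will not be created:", skipped)
```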

@dparmar61

dparmar61 commented Apr 10, 2022

@dberenbaum The config directory contains some data required to run the code, and I do not want to keep it on GitHub. So I moved the config directory to S3 and pushed only the config.dvc file to GitHub, along with the other required DVC files such as config and .gitignore. I think "dvc pull" should be able to fetch the original data from S3 using the hash in this .dvc file.

@dparmar61

@dberenbaum It is working now. Maybe it was due to different DVC versions being used for push and pull.
