pull: how to only download data from a specified remote? #8298
mea culpa, the doc explains how the remote flag works and it seems consistent with the behaviour I experienced:
However, I'm still wondering how I can download all the data from a specified remote without explicitly listing all the stages/data. (Ideally I'd like to download only what's required for the repro rather than everything, see #4742.)
--remote and remote specified explicitly as stages output don't have the expected behaviour
Discussed that first we should document the behavior better in push/pull, but we will also leave this open as a feature request.
I took a closer look to document this, and I agree with @courentin that the current behavior is unexpected/unhelpful:
Update on current behavior. I tested with two local remotes, default and other:
$ tree
.
├── bar.dvc
└── foo.dvc
0 directories, 2 files
$ cat .dvc/config
[core]
    remote = default
['remote "default"']
    url = /Users/dave/dvcremote
['remote "other"']
    url = /Users/dave/dvcremote2
$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo
$ cat bar.dvc
outs:
- md5: c157a79031e1c40f85931829bc5fc552
  size: 4
  hash: md5
  path: bar
  remote: other
Here's what happens with a simple dvc pull:
$ dvc pull
A foo
A bar
2 files added and 2 files fetched
This is what I would expect. It pulls each file from its respective remote. Next, pulling only from the other remote:
$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>
This makes sense to me also. It pulls only from the other remote. Adding --allow-missing avoids the checkout error:
$ dvc pull -r other --allow-missing
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A bar
1 file added and 1 file fetched
Finally, we pull only from default:
$ dvc pull -r default
A bar
A foo
2 files added and 2 files fetched
This gives us the same behavior as a plain dvc pull.
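The three runs above are consistent with a simple precedence rule. The following is only a sketch of that rule as inferred from the logs, not DVC's actual implementation; the function names `resolve_remote` and `plan` are hypothetical, and the dictionaries stand in for the parsed `foo.dvc` and `bar.dvc` outputs.

```python
from typing import Optional


def resolve_remote(out: dict, default_remote: str, requested: Optional[str] = None) -> str:
    """Return the remote an output appears to be fetched from.

    Assumed precedence (inferred from the logs, not taken from DVC's code):
    1. an explicit `remote` field on the output,
    2. the remote passed with -r/--remote,
    3. core.remote from .dvc/config.
    """
    return out.get("remote") or requested or default_remote


# Stand-ins for the outs entries of foo.dvc and bar.dvc.
OUTS = [
    {"path": "foo"},                     # no remote field
    {"path": "bar", "remote": "other"},  # remote: other
]


def plan(requested: Optional[str] = None) -> dict:
    """Map each output path to the remote it would be looked up in."""
    return {out["path"]: resolve_remote(out, "default", requested) for out in OUTS}


print(plan())           # dvc pull
print(plan("other"))    # dvc pull -r other: foo is looked up in other, where it is missing
print(plan("default"))  # dvc pull -r default: bar still comes from other
```

Under this reading, -r only changes where outputs without a remote field are looked up, which would explain why `dvc pull -r default` still fetches bar.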
Thank you for taking a look :)
For my use case, I think it's ok to include data that has no remote field.
Hello! Thanks a lot for your help responding to this question! I believe I am in exactly the same boat as OP. I have 2 datasets, which I uploaded to S3 from my local machine using DVC. Locally, I have a folder of images called datasetA that I uploaded to S3 by running dvc add datasetA, then dvc push -r remoteA (remoteA is defined in my .dvc config file). I cleared the cache (with a manual file delete), then followed the same steps to push datasetB to remoteB. In my datasetA.dvc and datasetB.dvc files, I have the remote field set to remoteA and remoteB respectively (the names of the remotes in the config). I did this manually by editing the files.
My goal is to be able to run dvc pull -r remoteA and get datasetA files only, and vice versa with B. So I cleared my cache (manually again) and ran the above commands, but they both pulled from both remoteA and remoteB. I still have a default remote set to remoteA, but I don't know if that is the issue. Is there something I am missing in how you configured your .dvc files to make it work? Thank you so much for everyone's time and help. (I also wish I could supply code, but for other reasons I am unable to 😞, sorry for the inconvenience.)
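One possible workaround for this goal is to pass explicit targets to dvc pull instead of relying on -r alone. The sketch below picks the .dvc files whose outputs are tagged with a given remote and builds the command line. It assumes the .dvc files have already been parsed into dictionaries (for example with a YAML library); `targets_for_remote`, `pull_command`, and the sample contents are hypothetical, and whether each target is then fetched from its tagged remote depends on the DVC version.

```python
def targets_for_remote(dvc_files: dict, remote: str, include_untagged: bool = False) -> list:
    """Return the .dvc file names that have an output tagged with `remote`.

    dvc_files maps a file name to its parsed content (assumed to be a dict
    with an `outs` list). With include_untagged=True, files with an output
    that has no remote field are selected as well.
    """
    selected = []
    for filename, content in sorted(dvc_files.items()):
        for out in content.get("outs", []):
            tag = out.get("remote")
            if tag == remote or (include_untagged and tag is None):
                selected.append(filename)
                break  # one matching output is enough to select the file
    return selected


def pull_command(targets: list) -> list:
    """Build a `dvc pull` argument list restricted to the given targets."""
    if not targets:
        # A bare `dvc pull` would pull everything, so refuse an empty list.
        raise ValueError("no targets selected")
    return ["dvc", "pull", *targets]


# Hypothetical parsed contents mirroring the setup described above.
files = {
    "datasetA.dvc": {"outs": [{"path": "datasetA", "remote": "remoteA"}]},
    "datasetB.dvc": {"outs": [{"path": "datasetB", "remote": "remoteB"}]},
}

print(pull_command(targets_for_remote(files, "remoteA")))
```

The resulting list can be handed to `subprocess.run`. The `include_untagged` switch covers the case mentioned earlier in the thread where data without a remote field should be pulled too.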
Bug Report
Description
We have a DVC project with two remotes (remote_a and remote_b).
Most of our stages are parametrized and some outputs contain a remote attribute.
For example:
We have set up some CI with CML to reproduce stages on each PR. Thus we have two jobs running, one on remote_a and the other on remote_b. We have this kind of setup because we need to run our machine learning models on 2 different sets of data that need to reside in 2 different AWS regions. Thus, job a should not have access to remote_b (which is an S3 bucket), and the reverse is true as well.
However, when running dvc pull --remote remote_a, it failed with the error Forbidden: An error occurred (403) when calling the HeadObject operation (full logs below). Looking at the logs, it seems that dvc pull --remote remote_a needs read access to remote_b.
Logs of the error
The DVC docs seem pretty clear that only the specified remote will be pulled.
Why does dvc pull --remote remote_a need access to remote_b though?
Environment information
Output of dvc doctor: