pull: how to only download data from a specified remote? #8298

Closed
courentin opened this issue Sep 15, 2022 · 6 comments · Fixed by #10365
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments

@courentin
Contributor

Bug Report

Description

We have a DVC project with two remotes (remote_a and remote_b).
Most of our stages are parametrized, and some outputs contain a remote attribute.

For example:

stages:
  my_stage:
    foreach: ['remote_a', 'remote_b']
    do:
      cmd: echo "my job on ${ key }" > file_${ key }.txt
      outs:
        - file_${ key }.txt:
          remote: ${ key }
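
For reference, the remote definitions behind this might look roughly like the following .dvc/config fragment (the bucket names and regions here are made up for illustration; the real project uses two S3 buckets in different AWS regions):

```ini
['remote "remote_a"']
    url = s3://my-bucket-eu-west-1/dvc
['remote "remote_b"']
    url = s3://my-bucket-us-east-1/dvc
```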

We have set up CI with CML to reproduce stages on each PR, so we have two jobs running: one on remote_a and the other on remote_b. We need this setup because we run our machine learning models on two different sets of data that must reside in two different AWS regions. Thus, job A should not have access to remote_b (which is an S3 bucket), and vice versa.

However, when running dvc pull --remote remote_a, it fails with the error Forbidden: An error occurred (403) when calling the HeadObject operation (full logs below). Looking at the logs, it seems that dvc pull --remote remote_a needs read access to remote_b.

Logs of the error
2022-09-14 15:45:05,240 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2022-09-14 15:45:05,321 WARNING: Output 'speech_to_text/models/hparams/dump_transfer.yaml'(stage: 'dump_transfer_yaml') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
2022-09-14 15:45:05,463 DEBUG: Preparing to transfer data from 'dvc-repository-speech-models-eu' to '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Preparing to collect status from '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Collecting status from '/github/home/dvc_cache'
2022-09-14 15:45:05,465 DEBUG: Preparing to collect status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Collecting status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Querying 1 oids via object_exists
2022-09-14 15:45:06,391 ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 110, in _error_wrapper
    return await func(*args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/aiobotocore/client.py", line 265, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/fetch.py", line 45, in fetch
    used = self.used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 430, in used_objs
    for odb, objs in self.index.used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/index.py", line 240, in used_objs
    for odb, objs in stage.get_used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 695, in get_used_objs
    for odb, objs in out.get_used_objs(*args, **kwargs).items():
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 968, in get_used_objs
    obj = self._collect_used_dir_cache(**kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 908, in _collect_used_dir_cache
    self.get_dir_cache(jobs=jobs, remote=remote)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 890, in get_dir_cache
    self.repo.cloud.pull([obj.hash_info], **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 136, in pull
    return self.transfer(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 88, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/transfer.py", line 158, in transfer
    status = compare_status(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 185, in compare_status
    src_exists, src_missing = status(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 136, in status
    exists = hashes.intersection(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 56, in _indexed_dir_hashes
    dir_exists.update(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/db.py", line 279, in list_oids_exists
    yield from itertools.compress(oids, in_remote)
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 269, in exists
    return self.fs.exists(path)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 111, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 96, in sync
    raise return_result
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 888, in _exists
    await self._info(path, bucket, key, version_id=version_id)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 332, in _call_s3
    return await _error_wrapper(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 137, in _error_wrapper
    raise err
PermissionError: Forbidden
------------------------------------------------------------
2022-09-14 15:45:06,478 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,478 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/github/home/dvc_cache/.7m6JcKcUQKoTh7ZJHogetT.tmp'
2022-09-14 15:45:06,484 DEBUG: Version info for developers:
DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-14 15:45:06,486 DEBUG: Analytics is enabled.
2022-09-14 15:45:06,527 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'
2022-09-14 15:45:06,529 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'

The DVC docs seem pretty clear that only the specified remote will be pulled.

Why does dvc pull --remote remote_a need access to remote_b, then?

Environment information

Output of dvc doctor:

DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
@dtrifiro dtrifiro added the A: data-sync Related to dvc get/fetch/import/pull/push label Sep 15, 2022
@courentin
Contributor Author

Mea culpa: the doc explains how the remote flag works, and it seems consistent with the behaviour I experienced:

The dvc remote used is determined in order, based on

  • the remote fields in the dvc.yaml or .dvc files.
  • the value passed to the --remote option via CLI.
  • the value of the core.remote config option (see dvc remote default).

However, I'm really wondering how I can download all the data from a specified remote without explicitly listing all the stages/data. (Ideally I'd like not to download everything, only what's required for the repro, #4742.)
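
One possible workaround today (a hypothetical helper script, not a DVC feature) is to collect the outputs whose .dvc files pin them to a given remote and pass only those as explicit targets to dvc pull -r. A naive sketch, assuming simple single-output .dvc files like the ones shown later in this thread:

```python
# Hypothetical helper (not part of DVC): collect outputs whose .dvc file
# pins them to a given remote, so they can be passed to `dvc pull -r` as
# explicit targets.
import re
from pathlib import Path


def outs_for_remote(dvc_dir: str, remote: str) -> list:
    """Return output paths whose .dvc file contains `remote: <remote>`."""
    targets = []
    for dvc_file in sorted(Path(dvc_dir).glob("*.dvc")):
        text = dvc_file.read_text()
        # Naive line-based parse; fine for simple single-output .dvc files.
        path_m = re.search(r"^\s*path:\s*(\S+)", text, re.MULTILINE)
        remote_m = re.search(r"^\s*remote:\s*(\S+)", text, re.MULTILINE)
        if path_m and remote_m and remote_m.group(1) == remote:
            targets.append(path_m.group(1))
    return targets
```

The resulting paths could then be fed to `dvc pull -r remote_a <targets>`. A real implementation should use a proper YAML parser and also scan outs declared in dvc.yaml.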

@dtrifiro dtrifiro changed the title pul: with --remote and remote specified explicitly as stages output don't have the expected behaviour pull: with --remote and remote specified explicitly as stages output don't have the expected behaviour Sep 20, 2022
@dtrifiro dtrifiro changed the title pull: with --remote and remote specified explicitly as stages output don't have the expected behaviour pull: how to only download data from a specified remote? Sep 20, 2022
@dberenbaum dberenbaum added this to DVC Sep 26, 2022
@dberenbaum dberenbaum moved this to Backlog in DVC Sep 26, 2022
@dberenbaum dberenbaum removed the status in DVC Sep 26, 2022
@dberenbaum
Contributor

Discussed that first we should document the behavior better in push/pull, but we will also leave this open as a feature request.

@dberenbaum dberenbaum added feature request Requesting a new feature and removed feature request Requesting a new feature labels Sep 27, 2022
@dberenbaum
Contributor

I took a closer look to document this, and I agree with @courentin that the current behavior is unexpected/unhelpful:

  1. For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?
  2. For data that has a specified remote field, I think DVC should skip it on push/pull. It seems surprising and potentially dangerous to need access to remote B even when specifying remote A. With the current behavior, there's no simple workaround to push things when you have access to only one remote. Is there a use case where the current behavior makes more sense?
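
The semantics proposed in points 1 and 2 could be sketched as a single predicate (this is the proposal, not DVC's behavior at the time of writing):

```python
# Proposed push/pull filtering (a sketch of the suggestion above, not
# current DVC behavior): with `--remote <selected>`, transfer an output
# iff it has no `remote:` field, or its `remote:` field matches.
from typing import Optional


def should_transfer(out_remote: Optional[str], selected: str) -> bool:
    return out_remote is None or out_remote == selected
```

With the foreach example at the top of the issue, dvc pull -r remote_a would then fetch file_remote_a.txt and skip file_remote_b.txt, never touching remote_b.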

@dberenbaum dberenbaum moved this to Backlog in DVC Sep 30, 2022
@dberenbaum dberenbaum added bug Did we break something? p2-medium Medium priority, should be done, but less important labels Nov 29, 2022
@dberenbaum dberenbaum removed this from DVC Nov 29, 2022
@dberenbaum
Contributor

Update on current behavior.

I tested with two local remotes, default and other, and two files, foo and bar, with bar.dvc including remote: other:

$ tree
.
├── bar.dvc
└── foo.dvc

0 directories, 2 files

$ cat .dvc/config
[core]
    remote = default
['remote "default"']
    url = /Users/dave/dvcremote
['remote "other"']
    url = /Users/dave/dvcremote2

$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo

$ cat bar.dvc
outs:
- md5: c157a79031e1c40f85931829bc5fc552
  size: 4
  hash: md5
  path: bar
  remote: other

Here's what dvc pull does with different options (I reset to the state above before each pull).

Simple dvc pull:

$ dvc pull
A       foo
A       bar
2 files added and 2 files fetched

This is what I would expect. It pulls each file from its respective remote.

Next, pulling only from the other remote:

$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

This makes sense to me also. It pulls only from the other remote. If we want it not to fail, we can include --allow-missing:

$ dvc pull -r other --allow-missing
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched

Finally, we pull only from default:

$ dvc pull -r default
A       bar
A       foo
2 files added and 2 files fetched

This gives us the same behavior as dvc pull without any specified remote. This is the only option that doesn't make sense to me. If I manually specify -r default, I would not expect data to be pulled from other.

@courentin
Contributor Author

Thank you for taking a look :)

For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?

For my use case, I think it's okay to include data that has no remote field.

@dberenbaum dberenbaum added this to DVC Oct 13, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Oct 13, 2023
@dberenbaum dberenbaum removed the status in DVC Oct 24, 2023
@dberenbaum dberenbaum removed this from DVC Oct 24, 2023
@spaghevin

Hello! Thanks a lot for responding to this question! I believe I am in the exact same boat as OP.

I have two datasets, which I uploaded to S3 from local using DVC. Locally, I have a folder of images called datasetA that I uploaded by running dvc add datasetA and then dvc push -r remoteA (remoteA is defined in my .dvc/config). I cleared the cache (by manually deleting the files), then followed the same steps to push datasetB to remoteB. In my datasetA.dvc and datasetB.dvc files, I manually edited the remote metadata values to remoteA and remoteB respectively (the names of the remotes in the config).

Next, pulling only from the other remote:

$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

My goal is to be able to run dvc pull -r remoteA and get the datasetA files only, and vice versa with B. So I cleared my cache (manually again) and ran the above commands, but each pulled from both remoteA and remoteB. I still have a default remote set to remoteA, but I don't know if that is the issue. I am wondering if there is something I am missing in how you configured your dvc files to make it work? Thank you so much for everyone's time and help.

(Also, I wish I could supply code, but for other reasons I am unable to 😞; sorry for the inconvenience.)
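
One thing worth double-checking in this situation is where the remote field sits in each .dvc file: it must be a key of the output entry under outs (as in the bar.dvc example quoted above), not a top-level key. Roughly (hash and size elided):

```yaml
outs:
- md5: <dataset hash>
  hash: md5
  path: datasetA
  remote: remoteA  # nested under the out entry, not at the top level
```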
