pull: how to only download data from a specified remote? #8298

Closed
courentin opened this issue Sep 15, 2022 · 6 comments · Fixed by #10365
Labels
A: data-sync Related to dvc get/fetch/import/pull/push bug Did we break something? p2-medium Medium priority, should be done, but less important

Comments

@courentin
Contributor

Bug Report

Description

We have a DVC project with two remotes (remote_a and remote_b).
Most of our stages are parametrized, and some outputs contain a remote attribute.

For example:

stages:
  my_stage:
    foreach: ['remote_a', 'remote_b']
    do:
      cmd: echo "my job on ${ key }" > file_${ key }.txt
      outs:
        - file_${ key }.txt:
          remote: ${ key }
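
For reference, the remote definitions behind this might look roughly like the following .dvc/config fragment (the bucket names and regions here are made up for illustration; the real project uses two S3 buckets in different AWS regions):

```ini
['remote "remote_a"']
    url = s3://my-bucket-eu-west-1/dvc
['remote "remote_b"']
    url = s3://my-bucket-us-east-1/dvc
```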

We have set up CI with CML to reproduce stages on each PR, so we have two jobs running: one on remote_a and the other on remote_b. We need this setup because we run our machine learning models on two different sets of data that must reside in two different AWS regions. Thus, job A should not have access to remote_b (which is an S3 bucket), and vice versa.

However, when running dvc pull --remote remote_a, it fails with the error Forbidden: An error occurred (403) when calling the HeadObject operation (full logs below). Looking at the logs, it seems that dvc pull --remote remote_a needs read access to remote_b.

Logs of the error
2022-09-14 15:45:05,240 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2022-09-14 15:45:05,321 WARNING: Output 'speech_to_text/models/hparams/dump_transfer.yaml'(stage: 'dump_transfer_yaml') is missing version info. Cache for it will not be collected. Use `dvc repro` to get your pipeline up to date.
2022-09-14 15:45:05,463 DEBUG: Preparing to transfer data from 'dvc-repository-speech-models-eu' to '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Preparing to collect status from '/github/home/dvc_cache'
2022-09-14 15:45:05,463 DEBUG: Collecting status from '/github/home/dvc_cache'
2022-09-14 15:45:05,465 DEBUG: Preparing to collect status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Collecting status from 'dvc-repository-speech-models-eu'
2022-09-14 15:45:05,465 DEBUG: Querying 1 oids via object_exists
2022-09-14 15:45:06,391 ERROR: unexpected error - Forbidden: An error occurred (403) when calling the HeadObject operation: Forbidden
------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 110, in _error_wrapper
    return await func(*args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/aiobotocore/client.py", line 265, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/__init__.py", line 185, in main
    ret = cmd.do_run()
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/cli/command.py", line 22, in do_run
    return self.run()
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/commands/data_sync.py", line 31, in run
    stats = self.repo.pull(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/pull.py", line 34, in pull
    processed_files_count = self.fetch(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 49, in wrapper
    return f(repo, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/fetch.py", line 45, in fetch
    used = self.used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/__init__.py", line 430, in used_objs
    for odb, objs in self.index.used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/repo/index.py", line 240, in used_objs
    for odb, objs in stage.get_used_objs(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/stage/__init__.py", line 695, in get_used_objs
    for odb, objs in out.get_used_objs(*args, **kwargs).items():
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 968, in get_used_objs
    obj = self._collect_used_dir_cache(**kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 908, in _collect_used_dir_cache
    self.get_dir_cache(jobs=jobs, remote=remote)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/output.py", line 890, in get_dir_cache
    self.repo.cloud.pull([obj.hash_info], **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 136, in pull
    return self.transfer(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc/data_cloud.py", line 88, in transfer
    return transfer(src_odb, dest_odb, objs, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/transfer.py", line 158, in transfer
    status = compare_status(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 185, in compare_status
    src_exists, src_missing = status(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 136, in status
    exists = hashes.intersection(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_data/status.py", line 56, in _indexed_dir_hashes
    dir_exists.update(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/db.py", line 279, in list_oids_exists
    yield from itertools.compress(oids, in_remote)
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 608, in result_iterator
    yield fs.pop().result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 445, in result
    return self.__get_result()
  File "/usr/lib/python3.9/concurrent/futures/_base.py", line 390, in __get_result
    raise self._exception
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/dvc_objects/fs/base.py", line 269, in exists
    return self.fs.exists(path)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 111, in wrapper
    return sync(self.loop, func, *args, **kwargs)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 96, in sync
    raise return_result
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/fsspec/asyn.py", line 53, in _runner
    result[0] = await coro
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 888, in _exists
    await self._info(path, bucket, key, version_id=version_id)
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 1140, in _info
    out = await self._call_s3(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 332, in _call_s3
    return await _error_wrapper(
  File "/__w/speech-models/speech-models/.venv/lib/python3.9/site-packages/s3fs/core.py", line 137, in _error_wrapper
    raise err
PermissionError: Forbidden
------------------------------------------------------------
2022-09-14 15:45:06,478 DEBUG: link type reflink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,478 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: link type hardlink is not available ([Errno 95] no more link types left to try out)
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/__w/speech-models/.5LXz6fyLREr373Vyh2BUtX.tmp'
2022-09-14 15:45:06,479 DEBUG: Removing '/github/home/dvc_cache/.7m6JcKcUQKoTh7ZJHogetT.tmp'
2022-09-14 15:45:06,484 DEBUG: Version info for developers:
DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git

Having any troubles? Hit us up at https://dvc.org/support, we are always happy to help!
2022-09-14 15:45:06,486 DEBUG: Analytics is enabled.
2022-09-14 15:45:06,527 DEBUG: Trying to spawn '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'
2022-09-14 15:45:06,529 DEBUG: Spawned '['daemon', '-q', 'analytics', '/tmp/tmpghyf5o18']'

The DVC docs seem pretty clear that only the specified remote will be pulled.

Why does dvc pull --remote remote_a need access to remote_b, then?

Environment information

Output of dvc doctor:

DVC version: 2.22.0 (pip)
---------------------------------
Platform: Python 3.9.5 on Linux-5.4.0-1083-aws-x86_64-with-glibc2.31
Supports:
        http (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2022.7.1, boto3 = 1.21.21)
Cache types: symlink
Cache directory: ext4 on /dev/nvme0n1p1
Caches: local
Remotes: s3, s3
Workspace directory: ext4 on /dev/nvme0n1p1
Repo: dvc, git
@dtrifiro dtrifiro added the A: data-sync Related to dvc get/fetch/import/pull/push label Sep 15, 2022
@courentin
Contributor Author

Mea culpa: the doc explains how the remote flag works, and it seems consistent with the behaviour I experienced:

The dvc remote used is determined in order, based on

  • the remote fields in the dvc.yaml or .dvc files.
  • the value passed to the --remote option via CLI.
  • the value of the core.remote config option (see dvc remote default).

However, I'm really wondering how I can download all the data from a specified remote without explicitly listing all the stages/data. (Ideally I'd like not to download everything, only what's required for the repro, #4742.)
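
One possible workaround today (a hypothetical helper script, not a DVC feature) is to collect the outputs whose .dvc files pin them to a given remote and pass only those as explicit targets to dvc pull -r. A naive sketch, assuming simple single-output .dvc files like the ones shown later in this thread:

```python
# Hypothetical helper (not part of DVC): collect outputs whose .dvc file
# pins them to a given remote, so they can be passed to `dvc pull -r` as
# explicit targets.
import re
from pathlib import Path


def outs_for_remote(dvc_dir: str, remote: str) -> list:
    """Return output paths whose .dvc file contains `remote: <remote>`."""
    targets = []
    for dvc_file in sorted(Path(dvc_dir).glob("*.dvc")):
        text = dvc_file.read_text()
        # Naive line-based parse; fine for simple single-output .dvc files.
        path_m = re.search(r"^\s*path:\s*(\S+)", text, re.MULTILINE)
        remote_m = re.search(r"^\s*remote:\s*(\S+)", text, re.MULTILINE)
        if path_m and remote_m and remote_m.group(1) == remote:
            targets.append(path_m.group(1))
    return targets
```

The resulting paths could then be fed to `dvc pull -r remote_a <targets>`. A real implementation should use a proper YAML parser and also scan outs declared in dvc.yaml.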

@dtrifiro dtrifiro changed the title pul: with --remote and remote specified explicitly as stages output don't have the expected behaviour pull: with --remote and remote specified explicitly as stages output don't have the expected behaviour Sep 20, 2022
@dtrifiro dtrifiro changed the title pull: with --remote and remote specified explicitly as stages output don't have the expected behaviour pull: how to only download data from a specified remote? Sep 20, 2022
@dberenbaum dberenbaum added this to DVC Sep 26, 2022
@dberenbaum dberenbaum moved this to Backlog in DVC Sep 26, 2022
@dberenbaum dberenbaum removed the status in DVC Sep 26, 2022
@dberenbaum
Contributor

Discussed that first we should document the behavior better in push/pull, but we will also leave this open as a feature request.

@dberenbaum dberenbaum added feature request Requesting a new feature and removed feature request Requesting a new feature labels Sep 27, 2022
@dberenbaum
Contributor

I took a closer look to document this, and I agree with @courentin that the current behavior is unexpected/unhelpful:

  1. For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?
  2. For data that has a specified remote field, I think DVC should skip it on push/pull. It seems surprising and potentially dangerous to need access to remote B even when specifying remote A. With the current behavior, there's no simple workaround to push things when you have access to only one remote. Is there a use case where the current behavior makes more sense?
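
The semantics proposed in points 1 and 2 could be sketched as a single predicate (this is the proposal, not DVC's behavior at the time of writing):

```python
# Proposed push/pull filtering (a sketch of the suggestion above, not
# current DVC behavior): with `--remote <selected>`, transfer an output
# iff it has no `remote:` field, or its `remote:` field matches.
from typing import Optional


def should_transfer(out_remote: Optional[str], selected: str) -> bool:
    return out_remote is None or out_remote == selected
```

With the foreach example at the top of the issue, dvc pull -r remote_a would then fetch file_remote_a.txt and skip file_remote_b.txt, never touching remote_b.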

@dberenbaum dberenbaum moved this to Backlog in DVC Sep 30, 2022
@dberenbaum dberenbaum added bug Did we break something? p2-medium Medium priority, should be done, but less important labels Nov 29, 2022
@dberenbaum dberenbaum removed this from DVC Nov 29, 2022
@dberenbaum
Contributor

Update on current behavior.

I tested with two local remotes, default and other, and two files, foo and bar, with bar.dvc including remote: other:

$ tree
.
├── bar.dvc
└── foo.dvc

0 directories, 2 files

$ cat .dvc/config
[core]
    remote = default
['remote "default"']
    url = /Users/dave/dvcremote
['remote "other"']
    url = /Users/dave/dvcremote2

$ cat foo.dvc
outs:
- md5: d3b07384d113edec49eaa6238ad5ff00
  size: 4
  hash: md5
  path: foo

$ cat bar.dvc
outs:
- md5: c157a79031e1c40f85931829bc5fc552
  size: 4
  hash: md5
  path: bar
  remote: other

Here's what dvc pull does with different options (I reset to the state above before each pull).

Simple dvc pull:

$ dvc pull
A       foo
A       bar
2 files added and 2 files fetched

This is what I would expect. It pulls each file from its respective remote.

Next, pulling only from the other remote:

$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

This makes sense to me also. It pulls only from the other remote. If we want it not to fail, we can include --allow-missing:

$ dvc pull -r other --allow-missing
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched

Finally, we pull only from default:

$ dvc pull -r default
A       bar
A       foo
2 files added and 2 files fetched

This gives us the same behavior as dvc pull without any specified remote. This is the only option that doesn't make sense to me. If I manually specify -r default, I would not expect data to be pulled from other.

@courentin
Contributor Author

Thank you for taking a look :)

For data that has no remote field, it makes sense to keep the current behavior to push/pull to/from --remote A instead of the default. We could keep a feature request for an additional flag to only push/pull data that matches that specified remote. @courentin Do you need this, or is it okay to include data that has no remote field?

For my use case, I think it's okay to include data that has no remote field.

@dberenbaum dberenbaum added this to DVC Oct 13, 2023
@github-project-automation github-project-automation bot moved this to Backlog in DVC Oct 13, 2023
@dberenbaum dberenbaum removed the status in DVC Oct 24, 2023
@dberenbaum dberenbaum removed this from DVC Oct 24, 2023
@spaghevin

Hello! Thanks a lot for responding to this question! I believe I am in the exact same boat as OP.

I have two datasets, which I uploaded to S3 from local using DVC. Locally, I have a folder of images called datasetA that I uploaded by running dvc add datasetA and then dvc push -r remoteA (remoteA is defined in my .dvc/config). I cleared the cache (by manually deleting the files), then followed the same steps to push datasetB to remoteB. In my datasetA.dvc and datasetB.dvc files, I manually edited the remote metadata values to remoteA and remoteB respectively (the names of the remotes in the config).

Next, pulling only from the other remote:

$ dvc pull -r other
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
md5: d3b07384d113edec49eaa6238ad5ff00
A       bar
1 file added and 1 file fetched
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
foo
Is your cache up to date?
<https://error.dvc.org/missing-files>

My goal is to be able to run dvc pull -r remoteA and get the datasetA files only, and vice versa with B. So I cleared my cache (manually again) and ran the above commands, but each pulled from both remoteA and remoteB. I still have a default remote set to remoteA, but I don't know if that is the issue. I am wondering if there is something I am missing in how you configured your dvc files to make it work? Thank you so much for everyone's time and help.

(Also, I wish I could supply code, but for other reasons I am unable to 😞; sorry for the inconvenience.)
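
One thing worth double-checking in this situation is where the remote field sits in each .dvc file: it must be a key of the output entry under outs (as in the bar.dvc example quoted above), not a top-level key. Roughly (hash and size elided):

```yaml
outs:
- md5: <dataset hash>
  hash: md5
  path: datasetA
  remote: remoteA  # nested under the out entry, not at the top level
```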
