Commit d305089

Merge pull request #2091 from iterative/straight-to-remote
Add docs regarding --to-remote option for add/import-url
2 parents 118858c + d58af5b commit d305089

5 files changed

+107
-29
lines changed

content/docs/command-reference/add.md

Lines changed: 32 additions & 4 deletions
@@ -7,7 +7,8 @@ file.
77

88
```usage
99
usage: dvc add [-h] [-q | -v] [-R] [--no-commit] [--external]
10-
[--glob] [--file <filename>] [--desc <text>]
10+
[--glob] [--file <filename>] [-o <path>] [--to-remote]
11+
[-r <name>] [-j <number>] [--desc <text>]
1112
targets [targets ...]
1213
1314
positional arguments:
@@ -36,12 +37,13 @@ After checking that each `target` hasn't been added before (or tracked with
3637
other DVC commands), a few actions are taken under the hood:
3738

3839
1. Calculate the file hash.
39-
2. Move the file contents to the cache (by default in `.dvc/cache`), using the
40-
file hash to form the cached file path. (See
40+
2. Move the file contents to the cache (by default in `.dvc/cache`), or to
41+
remote storage if `--to-remote` is given, using the file hash to form the
42+
cached file path. (See
4143
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
4244
for more details.)
4345
3. Attempt to replace the file with a link to the cached data (more details on
44-
file linking further down).
46+
file linking further down). Skipped if `--to-remote` is used.
4547
4. Create a corresponding `.dvc` file to track the file, using its path and hash
4648
to identify the cached data. The `.dvc` file lists the DVC-tracked file as an
4749
<abbr>output</abbr> (`outs` field). Unless the `--file` option is used, the
@@ -70,6 +72,20 @@ large files. DVC also supports other link types for use on file systems without
7072
`reflink` support, but they have to be specified manually. Refer to the
7173
`cache.type` config option in `dvc config cache` for more information.
7274

75+
### Transferring data directly to remote storage
76+
77+
When you have a very big dataset that you want to move from some external
78+
location to [remote storage](/doc/command-reference/remote) while avoiding
79+
storing it locally, you can use the `--to-remote` option. This will transfer a
80+
copy of the target data directly to a remote of your choice (or the default
81+
one). A `.dvc` file will be created normally, but the data won't be found in
82+
your local project until you `dvc pull` it.
83+
84+
This option is useful when the local system can't handle the target data, but
85+
you still want to track and store it in remote storage, so that whenever you
86+
switch to a different system that can handle it, you can simply pull the data
87+
and start working on it.
88+
7389
### Adding entire directories
7490

7591
A `dvc add` target can be either a file or a directory. In the latter case, a
@@ -148,6 +164,18 @@ not.
148164
> Note that external outputs typically require an external cache setup. See
149165
> link above for more details.
150166
167+
- `--to-remote` - import an external target, but don't move it into the
168+
workspace, nor cache it. [Transfer](#transferring-data-directly-to-remote-storage) it
169+
directly to remote storage (the default one, unless `-r` is specified)
170+
instead. Use `dvc pull` to get the data locally.
171+
172+
- `-r <name>`, `--remote <name>` - name of the
173+
[remote storage](/doc/command-reference/remote) to transfer external target to
174+
(can only be used with `--to-remote`).
175+
176+
- `-o <path>`, `--out <path>` - destination `path` for the transferred data (can
177+
only be used with `--to-remote`).
178+
151179
- `--desc <text>` - user description of the data (optional). This doesn't affect
152180
any DVC operations.
153181
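
Steps 1 and 2 of the `dvc add` description in this file (hash the file, then use the hash to form the cached file path) can be illustrated with a minimal Python sketch. This is not DVC's implementation: the choice of MD5, the two-character subdirectory layout, and the helper names `file_md5` and `cache_path` are assumptions made for illustration only.

```python
import hashlib
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(file_hash: str, cache_dir: Path = Path(".dvc/cache")) -> Path:
    """Form a cache path from a hash (hypothetical helper).

    Assumed layout: the first two hex characters name a subdirectory and
    the remaining characters name the file inside it.
    """
    return cache_dir / file_hash[:2] / file_hash[2:]
```

Under these assumptions, a file whose hash is `5d41402abc4b2a76b9719d911017c592` would map to `.dvc/cache/5d/41402abc4b2a76b9719d911017c592`. With `--to-remote`, the same hash-derived path would presumably be formed under the remote's location rather than the local cache.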

content/docs/command-reference/get-url.md

Lines changed: 2 additions & 6 deletions
@@ -9,7 +9,7 @@ Download a file or directory from a supported URL (for example `s3://`,
99
## Synopsis
1010

1111
```usage
12-
usage: dvc get-url [-h] [-q | -v] [-j <number>] url [out]
12+
usage: dvc get-url [-h] [-q | -v] url [out]
1313
1414
positional arguments:
1515
url (See supported URLs in the description.)
@@ -31,7 +31,7 @@ while `out` can be used to specify the directory and/or file name desired for
3131
the downloaded data. If an existing directory is specified, then the file or
3232
directory will be placed inside.
3333

34-
DVC supports several types of (local or) remote data sources (protocols):
34+
DVC supports several types of (local or) remote locations (protocols):
3535

3636
| Type | Description | `url` format example |
3737
| --------- | ---------------------------- | --------------------------------------------- |
@@ -72,10 +72,6 @@ $ wget https://example.com/path/to/data.csv
7272

7373
## Options
7474

75-
- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
76-
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
77-
default is `4`. Using more jobs may speed up the operation.
78-
7975
- `-h`, `--help` - print the usage/help message, and exit.
8076

8177
- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no

content/docs/command-reference/get.md

Lines changed: 1 addition & 8 deletions
@@ -8,8 +8,7 @@ directory.
88
## Synopsis
99

1010
```usage
11-
usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] [-j <number>]
12-
url path
11+
usage: dvc get [-h] [-q | -v] [-o <path>] [--rev <commit>] url path
1312
1413
positional arguments:
1514
url Location of DVC or Git repository to download from
@@ -66,12 +65,6 @@ name.
6665
download the file or directory from. The latest commit in `master` (tip of the
6766
default branch) is used by default when this option is not specified.
6867

69-
- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
70-
from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
71-
default is `4`. Using more jobs may speed up the operation. Note that the
72-
default value can be set in the source repo using the `jobs` config option of
73-
`dvc remote modify`.
74-
7568
- `--show-url` - instead of downloading the file or directory, just print the
7669
storage location (URL) of the target data. If `path` is a Git-tracked file,
7770
this option is ignored.

content/docs/command-reference/import-url.md

Lines changed: 67 additions & 6 deletions
@@ -1,8 +1,8 @@
11
# import-url
22

3-
Download a file or directory from a supported URL (for example `s3://`,
4-
`ssh://`, and other protocols) into the <abbr>workspace</abbr>, and track it (an
5-
import `.dvc` file is created).
3+
Track a file or directory found in an external location (`s3://`, `/local/path`,
4+
etc.), and download it to the local project, or make a copy in
5+
[remote storage](/doc/command-reference/remote).
66

77
> See `dvc import` to download and track data/model files or directories from
88
> other <abbr>DVC repositories</abbr> (e.g. hosted on GitHub).
@@ -11,7 +11,8 @@ import `.dvc` file is created).
1111

1212
```usage
1313
usage: dvc import-url [-h] [-q | -v] [-j <number>] [--file <filename>]
14-
[--no-exec] [--desc <text>]
14+
[--no-exec] [--to-remote] [-r <name>]
15+
[--desc <text>]
1516
url [out]
1617
1718
positional arguments:
@@ -22,8 +23,9 @@ positional arguments:
2223
## Description
2324

2425
In some cases it's convenient to add a data file or directory from an external
25-
location into the workspace, such that it can be updated later, if/when the
26-
external data source changes. Example scenarios:
26+
location into the workspace (or to
27+
[remote storage](/doc/command-reference/remote)), such that it can be updated
28+
later, if/when the external data source changes. Example scenarios:
2729

2830
- A remote system may produce occasional data files that are used in other
2931
projects.
@@ -37,6 +39,12 @@ external data source changes. Example scenarios:
3739
having to manually copy files from the supported locations (listed below), which
3840
may require installing a different tool for each type.
3941

42+
When you don't want to store the target data on your local system, you can still
43+
create an import `.dvc` file while transferring a file or directory directly to
44+
remote storage, by using the `--to-remote` option. See the
45+
[Transfer to remote storage](#example-transfer-to-remote-storage) example for
46+
more details.
47+
4048
The `url` argument specifies the external location of the data to be imported.
4149
The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
4250
working directory with its original file name e.g. `data.txt` (or to a location
@@ -131,6 +139,15 @@ $ dvc run -n download_data \
131139
finish the operation(s)); or if the target data already exist locally and you
132140
want to "DVCfy" this state of the project (see also `dvc commit`).
133141

142+
- `--to-remote` - import an external target, but don't move it into the
143+
workspace, nor cache it. [Transfer](#example-transfer-to-remote-storage) it
144+
directly to remote storage (the default one, unless `-r` is specified)
145+
instead. Use `dvc pull` to get the data locally.
146+
147+
- `-r <name>`, `--remote <name>` - name of the
148+
[remote storage](/doc/command-reference/remote) (can only be used with
149+
`--to-remote`).
150+
134151
- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
135152
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
136153
default is `4`. Using more jobs may speed up the operation.
@@ -340,3 +357,47 @@ $ dvc repro
340357
Running stage 'prepare' with command:
341358
python src/prepare.py data/data.xml
342359
```
360+
361+
## Example: Transfer to remote storage
362+
363+
When you have a large dataset in an external location, you may want to import it
364+
to your project without downloading it to the local file system (for using it
365+
later/elsewhere). The `--to-remote` option lets you skip the download, while
366+
storing the imported data [remotely](/doc/command-reference/remote). Let's
367+
initialize a DVC project, and set up a remote:
368+
369+
```dvc
370+
$ mkdir example # workspace
371+
$ cd example
372+
$ git init
373+
$ dvc init
374+
$ mkdir /tmp/dvc-storage
375+
$ dvc remote add myremote /tmp/dvc-storage
376+
```
377+
378+
Now let's create an import `.dvc` file without downloading the target data,
379+
transferring it directly to remote storage instead:
380+
381+
```dvc
382+
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \
383+
--to-remote -r myremote
384+
...
385+
```
386+
387+
The only change in our local <abbr>workspace</abbr> is a newly created import
388+
`.dvc` file:
389+
390+
```dvc
391+
$ ls
392+
data.xml.dvc
393+
```
394+
395+
Whenever anyone wants to actually download the imported data (for example from a
396+
system that can handle it), they can use `dvc pull` as usual:
397+
398+
```dvc
399+
$ dvc pull data.xml.dvc -r myremote
400+
401+
A data.xml
402+
1 file added and 1 file fetched
403+
```

content/docs/command-reference/import.md

Lines changed: 5 additions & 5 deletions
@@ -105,11 +105,11 @@ repo at `url`) are not supported.
105105
finish the operation(s)); or if the target data already exist locally and you
106106
want to "DVCfy" this state of the project (see also `dvc commit`).
107107

108-
- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
109-
from the remote. The default value is `4 * cpu_count()`. For SSH remotes, the
110-
default is `4`. Using more jobs may speed up the operation. Note that the
111-
default value can be set in the source repo using the `jobs` config option of
112-
`dvc remote modify`.
108+
- `-j <number>`, `--jobs <number>` - number of threads to run simultaneously to
109+
handle the downloading of files from the remote. The default value is
110+
`4 * cpu_count()`. For SSH remotes, the default is just `4`. Using more jobs
111+
may improve the total download speed if a combination of small and large files
112+
is being fetched.
113113

114114
- `--desc <text>` - user description of the data (optional). This doesn't affect
115115
any DVC operations.
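
The default for `-j` described in this hunk (`4 * cpu_count()`, or just `4` for SSH remotes) can be written out as a short Python sketch. The function name `default_jobs` and the `remote_scheme` parameter are hypothetical and only restate the documented rule; they are not taken from DVC's code.

```python
import os


def default_jobs(remote_scheme: str) -> int:
    """Return the documented default parallelism level (illustrative only)."""
    if remote_scheme == "ssh":
        # SSH remotes use a fixed default of 4.
        return 4
    # os.cpu_count() may return None, so fall back to 1 CPU in that case.
    return 4 * (os.cpu_count() or 1)
```

On an 8-core machine this rule would give 32 jobs for a non-SSH remote, which suggests why an explicit `-j` value can be useful on shared or bandwidth-limited systems.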
