Commit c5a6e75

Merge pull request #1365 from imhardikj/dvc-pull
use example-get-started repo in pull examples
2 parents 6007482 + 9dbc8de commit c5a6e75

File tree

1 file changed

+82
-49
lines changed
  • content/docs/command-reference

content/docs/command-reference/pull.md

Lines changed: 82 additions & 49 deletions
Original file line number | Diff line number | Diff line change
@@ -107,89 +107,122 @@ reflinks or hardlinks to put it in the workspace without copying. See
107107

108108
## Examples
109109

110-
For using the `dvc pull` command, a remote storage must be defined. (See
111-
`dvc remote add`.) For an existing <abbr>project</abbr>, remotes are usually
112-
already set up and you can use `dvc remote list` to check them. To remember how
113-
it's done, and set a context for the example, let's define a default SSH remote:
110+
Let's employ a simple <abbr>workspace</abbr> with some data, code, ML models,
111+
pipeline stages, such as the <abbr>DVC project</abbr> created for the
112+
[Get Started](/doc/tutorials/get-started). Then we can see what happens with
113+
`dvc pull`.
114+
115+
<details>
116+
117+
### Click and expand to set up the project
118+
119+
Start by cloning our example repo if you don't already have it:
114120

115121
```dvc
116-
$ dvc remote add -d r1 ssh://_username_@_host_/path/to/dvc/remote/storage
117-
$ dvc remote list
118-
r1 ssh://_username_@_host_/path/to/dvc/remote/storage
122+
$ git clone https://github.com/iterative/example-get-started
123+
$ cd example-get-started
119124
```
120125

121-
> DVC supports several
122-
> [remote types](/doc/command-reference/remote/add#supported-storage-types).
126+
</details>
123127

124-
Having some images and other files in remote storage, we can pull all changed
125-
files from the current Git branch:
128+
The workspace looks almost like the one in this
129+
[pipeline setup](/doc/tutorials/pipelines):
126130

127131
```dvc
128-
$ dvc pull --remote r1
132+
.
133+
├── data
134+
│   └── data.xml.dvc
135+
...
136+
└── train.dvc
129137
```
130138

131-
We can download specific files that are <abbr>outputs</abbr> of a specific
132-
DVC-file:
139+
We can now just run `dvc pull` to download the most recent `data/data.xml`,
140+
`model.pkl`, and other DVC-tracked files into the <abbr>workspace</abbr>:
133141

134142
```dvc
135-
$ dvc pull data.zip.dvc
143+
$ dvc pull
144+
145+
$ tree example-get-started/
146+
example-get-started/
147+
├── data
148+
│   ├── data.xml
149+
│   ├── data.xml.dvc
150+
...
151+
├── model.pkl
152+
└── train.dvc
136153
```
137154

138-
In this case we left off the `--remote` option, so it will have pulled from the
139-
default remote. The only files considered in this case are what is listed in the
140-
`out` field of the DVC-file `targets`.
155+
We can download specific <abbr>outputs</abbr> of a single DVC-file:
156+
157+
```dvc
158+
$ dvc pull train.dvc
159+
```
141160

142161
## Example: With dependencies
143162

144-
Demonstrating the `--with-deps` option requires a larger example. First, assume
145-
a [pipeline](/doc/command-reference/pipeline) has been setup with these
163+
> Please delete the `.dvc/cache` directory first (with `rm -Rf .dvc/cache`) to
164+
> follow this example if you tried the previous ones.
165+
166+
Our [pipeline](/doc/command-reference/pipeline) has been set up with these
146167
[stages](/doc/command-reference/run):
147168

148169
```dvc
149-
$ dvc pipeline show
150-
151-
data/Posts.xml.zip.dvc
152-
Posts.xml.dvc
153-
Posts.tsv.dvc
154-
Posts-test.tsv.dvc
155-
matrix-train.p.dvc
156-
model.p.dvc
157-
Dvcfile
170+
$ dvc pipeline show evaluate.dvc
171+
data/data.xml.dvc
172+
prepare.dvc
173+
featurize.dvc
174+
train.dvc
175+
evaluate.dvc
158176
```
159177

160-
Imagine the remote storage has been modified such that the data in some of these
161-
stages should be updated in the <abbr>workspace</abbr>.
178+
Imagine the [remote storage](/doc/command-reference/remote) has been modified
179+
such that the data in some of these stages should be updated in the
180+
<abbr>workspace</abbr>.
162181

163182
```dvc
164-
$ dvc status --cloud
165-
166-
deleted: data/model.p
167-
deleted: data/matrix-test.p
168-
deleted: data/matrix-train.p
183+
$ dvc status -c
184+
deleted: data/features/test.pkl
185+
deleted: data/features/train.pkl
186+
deleted: model.pkl
187+
...
169188
```
170189

171190
One could do a simple `dvc pull` to get all the data, but what if you only want
172191
to retrieve part of the data?
173192

174193
```dvc
175-
$ dvc pull --remote r1 --with-deps matrix-train.p.dvc
194+
$ dvc pull --with-deps featurize.dvc
176195
177-
... Do some work based on the partial update
196+
... Use the partial update, then pull the remaining data:
178197
179-
$ dvc pull --remote r1 --with-deps model.p.dvc
198+
$ dvc pull
199+
Everything is up to date.
200+
```
180201

181-
... Pull the rest of the data
202+
With the first `dvc pull` we specified a stage in the middle of this pipeline
203+
(`featurize.dvc`) while using `--with-deps`. DVC started with that DVC-file and
204+
searched backwards through the pipeline for data files to download. Later we ran
205+
`dvc pull` to download all the remaining data files.
182206

183-
$ dvc pull --remote r1
207+
## Example: Download from specific remote storage
184208

185-
Everything is up to date.
209+
For using the `dvc pull` command, a remote storage must be defined. (See
210+
`dvc remote add`.) For an existing <abbr>project</abbr>, remotes are usually
211+
already set up and you can use `dvc remote list` to check them. To remember how
212+
it's done, and set a context for the example, let's define a default SSH remote:
213+
214+
```dvc
215+
$ dvc remote add -d r1 ssh://_username_@_host_/path/to/dvc/remote/storage
216+
$ dvc remote list
217+
r1 ssh://_username_@_host_/path/to/dvc/remote/storage
186218
```
187219

188-
With the first `dvc pull` we specified a stage in the middle of this pipeline
189-
(`matrix-train.p.dvc`) while using `--with-deps`. DVC started with that DVC-file
190-
and searched backwards through the pipeline for data files to download. Because
191-
the `model.p.dvc` stage occurs later, its data was not pulled.
220+
> DVC supports several
221+
> [remote types](/doc/command-reference/remote/add#supported-storage-types).
192222
193-
Then we ran `dvc pull` specifying the last stage, `model.p.dvc`, and its data
194-
was downloaded. Finally, we ran `dvc pull` with no flags to make sure that all
195-
data was already pulled with the previous commands.
223+
To download DVC-tracked data from a specific DVC remote, use the `--remote`
224+
(`-r`) option of `dvc pull`:
225+
226+
```dvc
227+
$ dvc pull --remote r1
228+
```
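
The `--with-deps` explanation added in this diff says DVC starts at the given DVC-file and searches backwards through the pipeline. As a rough illustration only, the toy sketch below models that selection over the stage order printed by `dvc pipeline show evaluate.dvc`. It does not call DVC and is not how DVC is implemented: the helper name `stages_with_deps` is hypothetical, and treating the listed order as a simple linear chain is an assumption made for the example.

```shell
#!/bin/sh
# Toy model only: this does NOT run DVC or reflect its internals.
# Stage order copied from the `dvc pipeline show evaluate.dvc` output above;
# treating it as a linear chain is an assumption of this sketch.
stages="data/data.xml.dvc prepare.dvc featurize.dvc train.dvc evaluate.dvc"

# Hypothetical helper: print the target stage and every stage listed before it.
# Prints nothing to stdout and returns 1 if the target is not in the list.
stages_with_deps() {
  target="$1"
  selected=""
  for stage in $stages; do
    # Accumulate one stage per line until the target is reached.
    selected="${selected}${stage}
"
    if [ "$stage" = "$target" ]; then
      printf '%s' "$selected"
      return 0
    fi
  done
  echo "unknown stage: $target" >&2
  return 1
}

stages_with_deps featurize.dvc
```

Under these assumptions, the call for `featurize.dvc` lists `data/data.xml.dvc`, `prepare.dvc` and `featurize.dvc`, and leaves out `train.dvc` and `evaluate.dvc`. This mirrors why a later plain `dvc pull` is still needed in the example to fetch the remaining data.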
