Commit d9fc7be

Kaushik Ghose authored
Merge pull request #114 from common-workflow-language/scatter
[WIP] Scatter guide
2 parents befb17c + d72542c commit d9fc7be

8 files changed: +285 -0 lines changed

_episodes/23-scatter-workflow.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
---
title: "Scattering Workflows"
teaching: 10
exercises: 0
questions:
- "How do I run tools or workflows in parallel?"
objectives:
- "Learn how to create workflows that can run a step over a list of inputs."
keypoints:
- "A workflow can scatter over an input array in a step of a workflow, if the workflow engine
  supports the `ScatterFeatureRequirement`."
- The `scatter` field is specified for each step you want to scatter
- The `scatter` field references the step level inputs, not the workflow inputs
- Scatter runs on each step specified independently
---
Now that we know how to write workflows, we can start using the `ScatterFeatureRequirement`.
This feature tells the runner that you wish to run a tool or workflow multiple times over a list
of inputs. The workflow then takes the input(s) as an array and runs the specified step(s)
on each element of the array as if it were a single input. This allows you to run the same workflow
on multiple inputs without having to generate many different commands or input YAML files.

~~~
requirements:
  - class: ScatterFeatureRequirement
~~~
{: .source}

The most common reason a new user might want to use scatter is to perform the same analysis on
different samples. Let's start with a simple workflow that calls our first example tool,
`1st-tool.cwl`, and takes an array of strings as its input:

*scatter-workflow.cwl*

~~~
{% include cwl/23-scatter-workflow/scatter-workflow.cwl %}
~~~
{: .source}

Aside from the `requirements` section including `ScatterFeatureRequirement`, what is
going on here?

~~~
inputs:
  message_array: string[]
~~~

First of all, notice that the workflow-level input here accepts an array of strings.
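For reference, `string[]` is a shorthand. A minimal sketch of the equivalent long form in CWL v1.0, shown only to make the array type explicit:

~~~
inputs:
  message_array:
    type:
      type: array    # an array...
      items: string  # ...whose elements are strings
~~~
{: .source}

Either spelling describes the same array-of-strings input, so the shorthand is usually the more readable choice.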

~~~
steps:
  echo:
    run: 1st-tool.cwl
    scatter: message
    in:
      message: message_array
    out: []
~~~

Here we've added a new field to the step `echo` called `scatter`. This field tells the
runner that we'd like to scatter over this input for this particular step. Note that
the input named after `scatter` is the step's input (`message`), not the workflow input
(`message_array`).

For our first scatter, it's as simple as that! Since our tool doesn't collect any outputs, we
still use `outputs: []` in our workflow. If the scattered step does produce outputs that you
want to collect, remember that each of them becomes an array, so be sure to update the
workflow output to an array type as well!
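As a sketch of what collecting scattered outputs could look like, assume a variant of the tool that captures its standard output in an output named `echo_out`, such as the `1st-tool-mod.cwl` file that appears later in this episode. The workflow output name `echo_files` is hypothetical and chosen only for illustration:

~~~
cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement

inputs:
  message_array: string[]

steps:
  echo:
    run: 1st-tool-mod.cwl      # variant whose `echo_out` output has type stdout
    scatter: message
    in:
      message: message_array
    out: [echo_out]

outputs:
  echo_files:                  # hypothetical name
    type: File[]               # one File per element of message_array
    outputSource: echo/echo_out
~~~
{: .source}

Each scattered job produces a single `File`, so the step as a whole yields a `File[]`, and the workflow output type has to match.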

Using the following input file:

*scatter-job.yml*

~~~
{% include cwl/23-scatter-workflow/scatter-job.yml %}
~~~
{: .source}

As a reminder, `1st-tool.cwl` simply calls the command `echo` on a message. If we invoke
`cwl-runner scatter-workflow.cwl scatter-job.yml` on the command line:

~~~
$ cwl-runner scatter-workflow.cwl scatter-job.yml
[workflow scatter-workflow.cwl] start
[step echo] start
[job echo] /tmp/tmp0hqmg400$ echo \
    'Hello world!'
Hello world!
[job echo] completed success
[step echo] start
[job echo_2] /tmp/tmpu65_m1zw$ echo \
    'Hola mundo!'
Hola mundo!
[job echo_2] completed success
[step echo] start
[job echo_3] /tmp/tmp5cs7a2wh$ echo \
    'Bonjour le monde!'
Bonjour le monde!
[job echo_3] completed success
[step echo] start
[job echo_4] /tmp/tmp301wo7p8$ echo \
    'Hallo welt!'
Hallo welt!
[job echo_4] completed success
[step echo] completed success
[workflow scatter-workflow.cwl] completed success
{}
Final process status is success
~~~

You can see that the workflow calls `echo` once for each element of our
`message_array`. OK, so what if we want to scatter over two steps in a workflow?

Let's perform a simple echo like above, but capture `stdout` by replacing `outputs: []`
with the following lines:

*1st-tool-mod.cwl*

~~~
outputs:
  echo_out:
    type: stdout
~~~

Then add a second step that uses `wc -c` to count the characters (bytes) in each file. See
the tool below:

*wc-tool.cwl*

~~~
{% include cwl/23-scatter-workflow/wc-tool.cwl %}
~~~
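Note that `wc-tool.cwl` declares `outputs: []`, so the counts are not collected as workflow data. If you wanted to keep them, one option is a variant that captures standard output, in the same way as `1st-tool-mod.cwl`. This is only a sketch: the output name `count_file` is hypothetical and is not part of the lesson's files.

~~~
cwlVersion: v1.0
class: CommandLineTool
baseCommand: wc
arguments: ["-c"]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs:
  count_file:        # hypothetical name
    type: stdout     # capture the output of `wc -c` as a File
~~~
{: .source}

A step that scatters over this variant would then list `out: [count_file]` and produce one file per input.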

Now, how do we incorporate scatter? Remember the scatter field is under each step:

~~~
{% include cwl/23-scatter-workflow/scatter-two-steps.cwl %}
~~~

Here we have placed the scatter field under each step. This is fine for this example since
it runs quickly, but if you're running many samples through a more complex workflow, you may
wish to consider an alternative. Here we are running scatter on each step independently, but
since a sample in the second step does not need the first step to have finished every language,
we aren't using the scatter functionality efficiently. The second step expects an array as input
from the first step, so it will wait until everything in step one is finished before doing anything.
Pretend that `echo Hello world!` takes 1 minute to perform and `wc -c` on its output takes 3 minutes,
while `echo Hallo welt!` takes 5 minutes to perform and `wc -c` on its output takes 3 minutes.
Even though the `Hello world!` sample could finish in 4 minutes, it will actually finish in 8 minutes
because the `wc` step must wait for `echo Hallo welt!` to complete. You can see how this might not
scale well.

OK, so how do we scatter on steps that can proceed independently of other samples? Remember from
chapter 22 that we can make an entire workflow a single step in another workflow! Convert our
two-step workflow to a single-step subworkflow:

*scatter-nested-workflow.cwl*

~~~
{% include cwl/23-scatter-workflow/scatter-nested-workflow.cwl %}
~~~

Now the scatter acts on a single step, but that step consists of two steps, so each sample can
run through both steps without waiting for the other samples.
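The nested workflow in this episode declares `outputs: []` at both levels. If you also want to collect the echoed files, a possible sketch follows. The output names `echo_file` and `echo_files` are hypothetical; everything else mirrors `scatter-nested-workflow.cwl`.

~~~
cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement
  - class: SubworkflowFeatureRequirement

inputs:
  message_array: string[]

steps:
  subworkflow:
    run:
      class: Workflow
      inputs:
        message: string
      outputs:
        echo_file:               # hypothetical name; a single File per sample
          type: File
          outputSource: echo/echo_out
      steps:
        echo:
          run: 1st-tool-mod.cwl
          in:
            message: message
          out: [echo_out]
        wc:
          run: wc-tool.cwl
          in:
            input_file: echo/echo_out
          out: []
    scatter: message
    in:
      message: message_array
    out: [echo_file]

outputs:
  echo_files:                    # hypothetical name; scatter turns File into File[]
    type: File[]
    outputSource: subworkflow/echo_file
~~~
{: .source}

Inside the subworkflow every type stays scalar, because it handles one sample at a time. Only the outer workflow, where the scatter happens, sees arrays.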
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs:
  echo_out:
    type: stdout
Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: echo
inputs:
  message:
    type: string
    inputBinding:
      position: 1
outputs: []
Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
message_array:
  - Hello world!
  - Hola mundo!
  - Bonjour le monde!
  - Hallo welt!
Lines changed: 35 additions & 0 deletions
@@ -0,0 +1,35 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement
  - class: SubworkflowFeatureRequirement

inputs:
  message_array: string[]

steps:
  subworkflow:
    run:
      class: Workflow
      inputs:
        message: string
      outputs: []
      steps:
        echo:
          run: 1st-tool-mod.cwl
          in:
            message: message
          out: [echo_out]
        wc:
          run: wc-tool.cwl
          in:
            input_file: echo/echo_out
          out: []
    scatter: message
    in:
      message: message_array
    out: []
outputs: []
Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement

inputs:
  message_array: string[]

steps:
  echo:
    run: 1st-tool-mod.cwl
    scatter: message
    in:
      message: message_array
    out: [echo_out]
  wc:
    run: wc-tool.cwl
    scatter: input_file
    in:
      input_file: echo/echo_out
    out: []

outputs: []
Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: Workflow

requirements:
  - class: ScatterFeatureRequirement

inputs:
  message_array: string[]

steps:
  echo:
    run: 1st-tool.cwl
    scatter: message
    in:
      message: message_array
    out: []

outputs: []
Lines changed: 12 additions & 0 deletions
@@ -0,0 +1,12 @@
#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: wc
arguments: ["-c"]
inputs:
  input_file:
    type: File
    inputBinding:
      position: 1
outputs: []
