|
| 1 | +--- |
| 2 | +title: "Scattering Workflows" |
| 3 | +teaching: 10 |
| 4 | +exercises: 0 |
| 5 | +questions: |
| 6 | +- "How do I run tools or workflows in parallel?" |
| 7 | +objectives: |
| 8 | +- "Learn how to create workflows that can run a step over a list of inputs." |
| 9 | +keypoints: |
| 10 | +- "A workflow can scatter over an input array in a step of a workflow, if the workflow engine |
| 11 | +supports the `ScatterFeatureRequirement`." |
| 12 | +- The `scatter` field is specified for each step you want to scatter |
| 13 | +- The `scatter` field references the step level inputs, not the workflow inputs |
| 14 | +- Scatter runs on each step specified independently |
| 15 | +--- |
| 16 | +Now that we know how to write workflows, we can start using the `ScatterFeatureRequirement`. |
| 17 | +This feature tells the runner that you wish to run a tool or workflow multiple times over a list |
| 18 | +of inputs. The workflow then takes the input(s) as an array and runs the specified step(s) |
| 19 | +on each element of the array as if it were a single input. This allows you to run the same workflow |
| 20 | +on multiple inputs without having to generate many different commands or input YAML files. |
| 21 | + |
| 22 | +~~~ |
| 23 | +requirements: |
| 24 | + - class: ScatterFeatureRequirement |
| 25 | +~~~ |
| 26 | +{: .source} |
| 27 | + |
| 28 | +The most common reason a new user might want to use scatter is to perform the same analysis on |
| 29 | +different samples. Let's start with a simple workflow that calls our first example and takes |
| 30 | +an array of strings as input to the workflow: |
| 31 | + |
| 32 | +*scatter-workflow.cwl* |
| 33 | + |
| 34 | +~~~ |
| 35 | +{% include cwl/23-scatter-workflow/scatter-workflow.cwl %} |
| 36 | +~~~ |
| 37 | +{: .source} |
| 38 | + |
| 39 | +Aside from the `requirements` section including `ScatterFeatureRequirement`, what is |
| 40 | +going on here? |
| 41 | + |
| 42 | +~~~ |
| 43 | +inputs: |
| 44 | + message_array: string[] |
| 45 | +~~~ |
| 46 | + |
| 47 | +First of all, notice that the workflow level input here accepts an array of strings. |
| 48 | + |
| 49 | +~~~ |
| 50 | +steps: |
| 51 | +  echo: |
| 52 | +    run: 1st-tool.cwl |
| 53 | +    scatter: message |
| 54 | +    in: |
| 55 | +      message: message_array |
| 56 | +    out: [] |
| 57 | +~~~ |
| 58 | + |
| 59 | +Here we've added a new field to the step `echo` called `scatter`. This field tells the |
| 60 | +runner that we'd like to scatter over this input for this particular step. Note that |
| 61 | +the input listed after scatter is the step's input, not the workflow input. |
| 62 | + |
| 63 | +For our first scatter, it's as simple as that! Since our tool doesn't collect any outputs, we |
| 64 | +still use `outputs: []` in our workflow, but if you expect that the final output of your |
| 65 | +workflow will now have multiple outputs to collect, be sure to update that to an array type |
| 66 | +as well! |
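| | + |
| | +As a rough sketch of that update, suppose the tool exposed a `File` output named `echo_out` |
| | +(an assumption here; `1st-tool.cwl` itself declares no outputs). The step would then list that |
| | +output in `out`, and the workflow would collect one file per array element as `File[]`. The |
| | +workflow output name `echo_files` is hypothetical: |
| | + |
| | +~~~ |
| | +steps: |
| | +  echo: |
| | +    run: 1st-tool.cwl      # assumed to declare a File output named echo_out |
| | +    scatter: message |
| | +    in: |
| | +      message: message_array |
| | +    out: [echo_out]        # each scattered job yields one File |
| | + |
| | +outputs: |
| | +  echo_files:              # hypothetical name for the collected results |
| | +    type: File[]           # array type, because the step was scattered |
| | +    outputSource: echo/echo_out |
| | +~~~ |
| | +{: .source} |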
| 67 | + |
| 68 | +We will use the following input file: |
| 69 | + |
| 70 | +*scatter-job.yml* |
| 71 | + |
| 72 | +~~~ |
| 73 | +{% include cwl/23-scatter-workflow/scatter-job.yml %} |
| 74 | +~~~ |
| 75 | +{: .source} |
| 76 | + |
| 77 | +As a reminder, `1st-tool.cwl` simply calls the command `echo` on a message. If we invoke |
| 78 | +`cwl-runner scatter-workflow.cwl scatter-job.yml` on the command line: |
| 79 | + |
| 80 | +~~~ |
| 81 | +$ cwl-runner scatter-workflow.cwl scatter-job.yml |
| 82 | +[workflow scatter-workflow.cwl] start |
| 83 | +[step echo] start |
| 84 | +[job echo] /tmp/tmp0hqmg400$ echo \ |
| 85 | + 'Hello world!' |
| 86 | +Hello world! |
| 87 | +[job echo] completed success |
| 88 | +[step echo] start |
| 89 | +[job echo_2] /tmp/tmpu65_m1zw$ echo \ |
| 90 | + 'Hola mundo!' |
| 91 | +Hola mundo! |
| 92 | +[job echo_2] completed success |
| 93 | +[step echo] start |
| 94 | +[job echo_3] /tmp/tmp5cs7a2wh$ echo \ |
| 95 | + 'Bonjour le monde!' |
| 96 | +Bonjour le monde! |
| 97 | +[job echo_3] completed success |
| 98 | +[step echo] start |
| 99 | +[job echo_4] /tmp/tmp301wo7p8$ echo \ |
| 100 | + 'Hallo welt!' |
| 101 | +Hallo welt! |
| 102 | +[job echo_4] completed success |
| 103 | +[step echo] completed success |
| 104 | +[workflow scatter-workflow.cwl] completed success |
| 105 | +{} |
| 106 | +Final process status is success |
| 107 | +~~~ |
| 108 | + |
| 109 | +You can see that the workflow calls echo once for each element of our |
| 110 | +`message_array`. OK, so what if we want to scatter over two steps in a workflow? |
| 111 | + |
| 112 | +Let's perform a simple echo as above, but capture `stdout` by replacing `outputs: []` |
| 113 | +with the following lines: |
| 114 | + |
| 115 | +*1st-tool-mod.cwl* |
| 116 | + |
| 117 | +~~~ |
| 118 | +outputs: |
| 119 | +  echo_out: |
| 120 | +    type: stdout |
| 121 | +~~~ |
| 122 | + |
| 123 | +Next, add a second step that uses `wc -c` to count the bytes in each file. See the tool |
| 124 | +below: |
| 125 | + |
| 126 | +*wc-tool.cwl* |
| 127 | + |
| 128 | +~~~ |
| 129 | +{% include cwl/23-scatter-workflow/wc-tool.cwl %} |
| 130 | +~~~ |
| 131 | + |
| 132 | +Now, how do we incorporate scatter? Remember the scatter field is under each step: |
| 133 | + |
| 134 | +~~~ |
| 135 | +{% include cwl/23-scatter-workflow/scatter-two-steps.cwl %} |
| 136 | +~~~ |
| 137 | + |
| 138 | +Here we have placed the scatter field under each step. This is fine for this example since |
| 139 | +it runs quickly, but if you're running many samples through a more complex workflow, you may |
| 140 | +wish to consider an alternative. Here each step scatters independently, yet the second step |
| 141 | +for one language does not actually need the first step to have finished every language, so we |
| 142 | +aren't using scatter efficiently. The second step expects an array as input from |
| 143 | +the first step, so it waits until everything in step one is finished before doing anything. |
| 144 | +Pretend that `echo Hello world!` takes 1 minute, `wc -c` on its output takes 3 minutes, |
| 145 | +`echo Hallo welt!` takes 5 minutes, and `wc -c` on that output takes 3 minutes. |
| 146 | +Even though the `Hello world!` sample could finish in 4 minutes (1 + 3), it actually finishes in 8 minutes (5 + 3) |
| 147 | +because the second step must wait on `echo Hallo welt!`. You can see how this might not scale |
| 148 | +well. |
| 149 | + |
| 150 | +OK, so how do we scatter over steps that can proceed independently of other samples? Remember from |
| 151 | +chapter 22 that we can make an entire workflow a single step in another workflow! Convert our |
| 152 | +two-step workflow to a single-step subworkflow: |
| 153 | + |
| 154 | +*scatter-nested-workflow.cwl* |
| 155 | + |
| 156 | +~~~ |
| 157 | +{% include cwl/23-scatter-workflow/scatter-nested-workflow.cwl %} |
| 158 | +~~~ |
| 159 | + |
| 160 | +Now the scatter acts on a single step, but that step is a subworkflow containing both tools, so each |
| 161 | +input passes through both of them independently of the others and the inputs can be processed in parallel. |
| 162 | + |
| 163 | + |