From 8489e7485eca60d241e5046a6f464eba6ef27142 Mon Sep 17 00:00:00 2001
From: kristen
Date: Tue, 31 Jul 2018 10:07:48 -0500
Subject: [PATCH 1/2] Scatter first commit

This is the first commit for a *simple* scatter lesson. This is just a
gentle introduction that scatters over the 'hello world' example in the
original user guide.
---
 _episodes/23-scatter-workflow.md              | 109 ++++++++++++++++++
 .../cwl/23-scatter-workflow/1st-tool.cwl      |  11 ++
 .../cwl/23-scatter-workflow/scatter-job.yml   |   5 +
 .../23-scatter-workflow/scatter-workflow.cwl  |  20 ++++
 4 files changed, 145 insertions(+)
 create mode 100644 _episodes/23-scatter-workflow.md
 create mode 100755 _includes/cwl/23-scatter-workflow/1st-tool.cwl
 create mode 100755 _includes/cwl/23-scatter-workflow/scatter-job.yml
 create mode 100644 _includes/cwl/23-scatter-workflow/scatter-workflow.cwl

diff --git a/_episodes/23-scatter-workflow.md b/_episodes/23-scatter-workflow.md
new file mode 100644
index 00000000..7630b77f
--- /dev/null
+++ b/_episodes/23-scatter-workflow.md
@@ -0,0 +1,109 @@
+---
+title: "Scatter Workflows"
+teaching: 10
+exercises: 0
+questions:
+- "How do I run tools or workflows in parallel?"
+objectives:
+- "Learn how to create workflows that can run a step over a list of inputs."
+keypoints:
+- "A workflow can scatter over an input array in a step of a workflow, if the workflow engine
+supports the `ScatterFeatureRequirement`."
+- The `scatter` field is specified for each step you want to scatter
+- The `scatter` field references the step level inputs, not the workflow inputs
+---
+Now that we know how to write workflows, we can start utilizing the `ScatterFeatureRequirement`.
+This feature tells the runner that you wish to run a tool or workflow multiple times over a list
+of inputs. The workflow then takes the input(s) as an array and will run the specified step(s)
+on each element of the array as if it were a single input. This allows you to run the same workflow
+on multiple inputs without having to generate many different commands or input YAML files.
+
+~~~
+requirements:
+  - class: ScatterFeatureRequirement
+~~~
+{: .source}
+
+The most common reason a new user might want to use scatter is to perform the same analysis on
+different samples. Let's start with a simple workflow that calls our first example and takes
+an array of strings as input to the workflow:
+
+*scatter-workflow.cwl*
+
+~~~
+{% include cwl/23-scatter-workflow/scatter-workflow.cwl %}
+~~~
+{: .source}
+
+Aside from the `requirements` section including `ScatterFeatureRequirement`, what is
+going on here?
+
+~~~
+inputs:
+  message_array: string[]
+~~~
+
+First of all, notice that the workflow-level input here accepts an array of strings.
+
+~~~
+steps:
+  echo:
+    run: 1st-tool.cwl
+    scatter: message
+    in:
+      message: message_array
+    out: []
+~~~
+
+Here we've added a new field to the step `echo` called `scatter`. This field tells the
+runner that we'd like to scatter over this input for this particular step. Note that
+the input listed after scatter is the step's input, not the workflow input.
+
+For our first scatter, it's as simple as that! Since our tool doesn't collect any outputs, we
+still use `outputs: []` in our workflow, but if you expect that the final output of your
+workflow will now have multiple outputs to collect, be sure to update that to an array type
+as well!
+
+Using the following input file:
+
+*scatter-job.yml*
+
+~~~
+{% include cwl/23-scatter-workflow/scatter-job.yml %}
+~~~
+{: .source}
+
+As a reminder, `1st-tool.cwl` simply calls the command `echo` on a message. If we invoke
+`cwl-runner scatter-workflow.cwl scatter-job.yml` on the command line:
+
+~~~
+$ cwl-runner scatter-workflow.cwl scatter-job.yml
+[workflow scatter-workflow.cwl] start
+[step echo] start
+[job echo] /tmp/tmp0hqmg400$ echo \
+    'Hello world!'
+Hello world!
+[job echo] completed success
+[step echo] start
+[job echo_2] /tmp/tmpu65_m1zw$ echo \
+    'Hola mundo!'
+Hola mundo!
+[job echo_2] completed success
+[step echo] start
+[job echo_3] /tmp/tmp5cs7a2wh$ echo \
+    'Bonjour le monde!'
+Bonjour le monde!
+[job echo_3] completed success
+[step echo] start
+[job echo_4] /tmp/tmp301wo7p8$ echo \
+    'Hallo welt!'
+Hallo welt!
+[job echo_4] completed success
+[step echo] completed success
+[workflow scatter-workflow.cwl] completed success
+{}
+Final process status is success
+~~~
+
+You can see that the workflow calls echo multiple times on each element of our
+`message_array`.
diff --git a/_includes/cwl/23-scatter-workflow/1st-tool.cwl b/_includes/cwl/23-scatter-workflow/1st-tool.cwl
new file mode 100755
index 00000000..1d550fc2
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/1st-tool.cwl
@@ -0,0 +1,11 @@
+#!/usr/bin/env cwl-runner
+
+cwlVersion: v1.0
+class: CommandLineTool
+baseCommand: echo
+inputs:
+  message:
+    type: string
+    inputBinding:
+      position: 1
+outputs: []
diff --git a/_includes/cwl/23-scatter-workflow/scatter-job.yml b/_includes/cwl/23-scatter-workflow/scatter-job.yml
new file mode 100755
index 00000000..566d5724
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/scatter-job.yml
@@ -0,0 +1,5 @@
+message_array:
+  - Hello world!
+  - Hola mundo!
+  - Bonjour le monde!
+  - Hallo welt!
diff --git a/_includes/cwl/23-scatter-workflow/scatter-workflow.cwl b/_includes/cwl/23-scatter-workflow/scatter-workflow.cwl
new file mode 100644
index 00000000..3fb436b3
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/scatter-workflow.cwl
@@ -0,0 +1,20 @@
+#!/usr/bin/env cwl-runner
+
+cwlVersion: v1.0
+class: Workflow
+
+requirements:
+- class: ScatterFeatureRequirement
+
+inputs:
+  message_array: string[]
+
+steps:
+  echo:
+    run: 1st-tool.cwl
+    scatter: message
+    in:
+      message: message_array
+    out: []
+
+outputs: []

From 1f15eb0df971ea062ac023cac15b1cbf8e7bff5b Mon Sep 17 00:00:00 2001
From: kristen
Date: Tue, 21 Aug 2018 13:22:19 -0600
Subject: [PATCH 2/2] Added subworkflow scatter

Added an example that uses a second step in a workflow, scattering per
step and as a subworkflow to illustrate that scattering over a
subworkflow may be more efficient in some circumstances.
---
 _episodes/23-scatter-workflow.md              | 58 ++++++++++++++++++-
 .../cwl/23-scatter-workflow/1st-tool-mod.cwl  | 13 +++++
 .../scatter-nested-workflow.cwl               | 35 +++++++++++
 .../23-scatter-workflow/scatter-two-steps.cwl | 26 +++++++++
 _includes/cwl/23-scatter-workflow/wc-tool.cwl | 12 ++++
 5 files changed, 142 insertions(+), 2 deletions(-)
 create mode 100755 _includes/cwl/23-scatter-workflow/1st-tool-mod.cwl
 create mode 100644 _includes/cwl/23-scatter-workflow/scatter-nested-workflow.cwl
 create mode 100644 _includes/cwl/23-scatter-workflow/scatter-two-steps.cwl
 create mode 100755 _includes/cwl/23-scatter-workflow/wc-tool.cwl

diff --git a/_episodes/23-scatter-workflow.md b/_episodes/23-scatter-workflow.md
index 7630b77f..1c839037 100644
--- a/_episodes/23-scatter-workflow.md
+++ b/_episodes/23-scatter-workflow.md
@@ -1,5 +1,5 @@
 ---
-title: "Scatter Workflows"
+title: "Scattering Workflows"
 teaching: 10
 exercises: 0
 questions:
@@ -11,6 +11,7 @@ keypoints:
 supports the `ScatterFeatureRequirement`."
 - The `scatter` field is specified for each step you want to scatter
 - The `scatter` field references the step level inputs, not the workflow inputs
+- Scatter runs on each step specified independently
 ---
 Now that we know how to write workflows, we can start utilizing the `ScatterFeatureRequirement`.
 This feature tells the runner that you wish to run a tool or workflow multiple times over a list
@@ -106,4 +107,57 @@ Final process status is success
 ~~~
 
 You can see that the workflow calls echo multiple times on each element of our
-`message_array`.
+`message_array`. Ok, so what if we want to scatter over two steps in a workflow?
+
+Let's perform a simple echo like above, but capturing `stdout` by adding the following
+lines instead of `outputs: []`:
+
+*1st-tool-mod.cwl*
+
+~~~
+outputs:
+  echo_out:
+    type: stdout
+~~~
+
+And add a second step that uses `wc` to count the characters in each file. See the tool
+below:
+
+*wc-tool.cwl*
+
+~~~
+{% include cwl/23-scatter-workflow/wc-tool.cwl %}
+~~~
+
+Now, how do we incorporate scatter? Remember the scatter field is under each step:
+
+~~~
+{% include cwl/23-scatter-workflow/scatter-two-steps.cwl %}
+~~~
+
+Here we have placed the scatter field under each step. This is fine for this example since
+it runs quickly, but if you're running many samples for a more complex workflow, you may
+wish to consider an alternative. Here we are running scatter on each step independently, but
+since `wc` on one message does not need `echo` to finish for every message, we aren't
+using the scatter functionality efficiently. The second step expects an array as input from
+the first step, so it will wait until everything in step one is finished before doing anything.
+Pretend that `echo Hello World!` takes 1 minute to perform, `wc -c` on the output takes 3 minutes
+and that `echo Hallo welt!` takes 5 minutes to perform, and `wc` on that output takes 3 minutes.
+Even though the `Hello World!` message could finish in 4 minutes, it will actually finish in 8 minutes
+because the `wc` step cannot start until `echo Hallo welt!` has finished. You can see how this might
+not scale well.
+
+Ok, so how do we scatter on steps that can proceed independently of other samples? Remember from
+chapter 22 that we can make an entire workflow a single step in another workflow! Convert our
+two-step workflow to a single-step subworkflow:
+
+*scatter-nested-workflow.cwl*
+
+~~~
+{% include cwl/23-scatter-workflow/scatter-nested-workflow.cwl %}
+~~~
+
+Now the scatter acts on a single step, but that step consists of two steps, so each message
+proceeds through both steps in parallel with the others.
+
+
diff --git a/_includes/cwl/23-scatter-workflow/1st-tool-mod.cwl b/_includes/cwl/23-scatter-workflow/1st-tool-mod.cwl
new file mode 100755
index 00000000..4ebbdf38
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/1st-tool-mod.cwl
@@ -0,0 +1,13 @@
+#!/usr/bin/env cwl-runner
+
+cwlVersion: v1.0
+class: CommandLineTool
+baseCommand: echo
+inputs:
+  message:
+    type: string
+    inputBinding:
+      position: 1
+outputs:
+  echo_out:
+    type: stdout
diff --git a/_includes/cwl/23-scatter-workflow/scatter-nested-workflow.cwl b/_includes/cwl/23-scatter-workflow/scatter-nested-workflow.cwl
new file mode 100644
index 00000000..45b6bcdb
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/scatter-nested-workflow.cwl
@@ -0,0 +1,35 @@
+#!/usr/bin/env cwl-runner
+
+cwlVersion: v1.0
+class: Workflow
+
+requirements:
+- class: ScatterFeatureRequirement
+- class: SubworkflowFeatureRequirement
+
+inputs:
+  message_array: string[]
+
+steps:
+  subworkflow:
+    run:
+      class: Workflow
+      inputs:
+        message: string
+      outputs: []
+      steps:
+        echo:
+          run: 1st-tool-mod.cwl
+          in:
+            message: message
+          out: [echo_out]
+        wc:
+          run: wc-tool.cwl
+          in:
+            input_file: echo/echo_out
+          out: []
+    scatter: message
+    in:
+      message: message_array
+    out: []
+outputs: []
diff --git a/_includes/cwl/23-scatter-workflow/scatter-two-steps.cwl b/_includes/cwl/23-scatter-workflow/scatter-two-steps.cwl
new file mode 100644
index 00000000..127e6221
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/scatter-two-steps.cwl
@@ -0,0 +1,26 @@
+#!/usr/bin/env cwl-runner
+
+cwlVersion: v1.0
+class: Workflow
+
+requirements:
+- class: ScatterFeatureRequirement
+
+inputs:
+  message_array: string[]
+
+steps:
+  echo:
+    run: 1st-tool-mod.cwl
+    scatter: message
+    in:
+      message: message_array
+    out: [echo_out]
+  wc:
+    run: wc-tool.cwl
+    scatter: input_file
+    in:
+      input_file: echo/echo_out
+    out: []
+
+outputs: []
diff --git a/_includes/cwl/23-scatter-workflow/wc-tool.cwl b/_includes/cwl/23-scatter-workflow/wc-tool.cwl
new file mode 100755
index 00000000..c3064634
--- /dev/null
+++ b/_includes/cwl/23-scatter-workflow/wc-tool.cwl
@@ -0,0 +1,12 @@
+#!/usr/bin/env cwl-runner
+
+cwlVersion: v1.0
+class: CommandLineTool
+baseCommand: wc
+arguments: ["-c"]
+inputs:
+  input_file:
+    type: File
+    inputBinding:
+      position: 1
+outputs: []
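
The lesson text says that a workflow collecting results from a scattered step needs an array-typed output. A minimal sketch of what that could look like, assuming the `1st-tool-mod.cwl` tool from this series; the file name `scatter-collect.cwl` and the output name `echo_files` are hypothetical and not part of either commit:

```yaml
#!/usr/bin/env cwl-runner
# scatter-collect.cwl (hypothetical file name, for illustration only)

cwlVersion: v1.0
class: Workflow

requirements:
- class: ScatterFeatureRequirement

inputs:
  message_array: string[]

steps:
  echo:
    run: 1st-tool-mod.cwl
    # scatter names the step input, not the workflow input
    scatter: message
    in:
      message: message_array
    out: [echo_out]

outputs:
  # Each scattered job yields one File, so the workflow output is File[]
  echo_files:
    type: File[]
    outputSource: echo/echo_out
```

Run with the same job file (`cwl-runner scatter-collect.cwl scatter-job.yml`), the final output object should list one captured stdout file per message rather than `{}`.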