Scattering CWL across tabular samples and interactive computing environments from CWL
Nathan Sheffield, PhD
www.databio.org/slides
Problem 1
I got a sample table from my collaborator.
How do I run a CWL workflow on it?
A simple CWL job
cwl-runner wc-tool.cwl wc-job.yml
wc-tool.cwl
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  file:
    type: File
    inputBinding:
      position: 1
outputs: []
wc-job.yml
file:
  class: File
  path: data/frog1_data.txt
Scatter across multiple samples...
Scattering with nested workflows
main.cwl
steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      n_input_bam: alignment/bam
      gtf: gtf
    out: [featurecounts]
inputs.yaml
fq:
  - class: File
    location: rnaseq/raw_fastq/s1.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/s2.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/s3.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/ref/genes.gtf
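Writing one File entry per sample by hand gets tedious quickly. As a rough sketch (not part of any CWL tooling), the inputs object above could be generated from a list of paths. The function name `build_scatter_inputs` is made up for illustration; the printed JSON is also valid YAML, so it could be saved as a job file.

```python
import json

# EDAM format identifier copied from the inputs.yaml shown above.
FASTQ_FORMAT = "http://edamontology.org/format_1930"


def build_scatter_inputs(fastq_paths, genome_dir, gtf_path):
    """Build a CWL inputs object with one File entry per FASTQ path.

    Hypothetical helper: the keys mirror the inputs.yaml on the slide.
    """
    return {
        "fq": [
            {"class": "File", "location": path, "format": FASTQ_FORMAT}
            for path in fastq_paths
        ],
        "genome": {"class": "Directory", "location": genome_dir},
        "gtf": {"class": "File", "location": gtf_path},
    }


inputs = build_scatter_inputs(
    [f"rnaseq/raw_fastq/s{i}.fq" for i in (1, 2, 3)],
    "hg19-chr1-STAR-index",
    "rnaseq/ref/genes.gtf",
)
print(json.dumps(inputs, indent=2))
```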
But I have a CSV sample table
sample_name,library,file
frog_1,anySampleType,data/frog1_data.txt
frog_2,anySampleType,data/frog2_data.txt
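One manual route is to turn each CSV row into its own CWL job object, shaped like wc-job.yml. The sketch below uses only the Python standard library. The helper `rows_to_cwl_jobs` and its parameters are hypothetical, and it assumes the table has `sample_name` and `file` columns as in the example above.

```python
import csv
import io
import json


def rows_to_cwl_jobs(csv_text, file_column="file", input_name="file"):
    """Return {sample_name: CWL job object}, one entry per CSV row.

    Hypothetical helper: each job object has the same shape as wc-job.yml.
    """
    jobs = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        jobs[row["sample_name"]] = {
            input_name: {"class": "File", "path": row[file_column]}
        }
    return jobs


SAMPLE_TABLE = """sample_name,library,file
frog_1,anySampleType,data/frog1_data.txt
frog_2,anySampleType,data/frog2_data.txt
"""

jobs = rows_to_cwl_jobs(SAMPLE_TABLE)
for name, job in jobs.items():
    # JSON is also valid YAML, so each object could be written out as a job file.
    print(name, json.dumps(job))
```

Each job object would then need its own cwl-runner call, which is the bookkeeping the rest of this section automates.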
Introduction to looper
looper run project_config.yaml
cwl_interface.yaml:
pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
command_template: >
  cwl-runner wc-tool.cwl {sample.sample_yaml_cwl}
pre_submit:
  python_functions:
    - looper.write_sample_yaml_cwl
project_config.yaml:
pep_version: 2.0.0
sample_table: file_list.csv
sample_modifiers:
  append:
    pipeline_interfaces: cwl_interface.yaml
looper:
  output_dir: pipeline_results
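The command_template is filled in once per sample. The sketch below illustrates that idea with Python's built-in `str.format` as a stand-in; looper does its own template rendering, so this is an analogy and not its implementation. The `sample_yaml_cwl` paths are assumed to follow the `pipeline_results/submission/<sample_name>_sample_cwl.yaml` pattern, and `render_commands` is a made-up name.

```python
from types import SimpleNamespace

# Template string copied from cwl_interface.yaml.
COMMAND_TEMPLATE = "cwl-runner wc-tool.cwl {sample.sample_yaml_cwl}"


def render_commands(template, samples):
    """Fill the template once per sample (stand-in for looper's rendering)."""
    return [template.format(sample=sample) for sample in samples]


# Stand-in sample objects; in looper the attribute would be set by the
# pre-submit function named in the interface file.
samples = [
    SimpleNamespace(
        sample_name=name,
        sample_yaml_cwl=f"pipeline_results/submission/{name}_sample_cwl.yaml",
    )
    for name in ("frog_1", "frog_2")
]

commands = render_commands(COMMAND_TEMPLATE, samples)
for command in commands:
    print(command)
```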
Scattering across samples using looper
> looper run project_config.yaml
Looper version: 1.3.1-dev
Command: run
## [1 of 2] sample: frog_1; pipeline: count_lines
Calling pre-submit function: looper.write_sample_yaml_cwl
Writing sample yaml to pipeline_results/submission/frog_1_sample_cwl.yaml
Writing script to /home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/pipeline_results/submission/count_lines_frog_1.sub
Job script (n=1; 0.00Gb): pipeline_results/submission/count_lines_frog_1.sub
Compute node: zither
Start time: 2021-01-26 14:54:50
INFO /home/nsheff/.local/bin/cwl-runner 3.0.20200807132242
INFO Resolved 'wc-tool.cwl' to 'file:///home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/wc-tool.cwl'
INFO [job wc-tool.cwl] /tmp/7vhoojf2$ wc \
-l \
/tmp/tmpxcekd0he/stg6b7f7559-6e4f-409b-8b9f-adc73dd5ca82/frog1_data.txt
4 /tmp/tmpxcekd0he/stg6b7f7559-6e4f-409b-8b9f-adc73dd5ca82/frog1_data.txt
INFO [job wc-tool.cwl] completed success
{}
INFO Final process status is success
## [2 of 2] sample: frog_2; pipeline: count_lines
Calling pre-submit function: looper.write_sample_yaml_cwl
Writing sample yaml to pipeline_results/submission/frog_2_sample_cwl.yaml
Writing script to /home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/pipeline_results/submission/count_lines_frog_2.sub
Job script (n=1; 0.00Gb): pipeline_results/submission/count_lines_frog_2.sub
Compute node: zither
Start time: 2021-01-26 14:54:51
INFO /home/nsheff/.local/bin/cwl-runner 3.0.20200807132242
INFO Resolved 'wc-tool.cwl' to 'file:///home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/wc-tool.cwl'
INFO [job wc-tool.cwl] /tmp/96ojstvh$ wc \
-l \
/tmp/tmp09syks28/stgd2a8068d-a68f-4759-9cf6-0f359aa49740/frog2_data.txt
7 /tmp/tmp09syks28/stgd2a8068d-a68f-4759-9cf6-0f359aa49740/frog2_data.txt
INFO [job wc-tool.cwl] completed success
{}
INFO Final process status is success
Looper finished
Samples valid for job generation: 2 of 2
Commands submitted: 2 of 2
Jobs submitted: 2
Looper uses a generic input format.
looper run config.yaml
R (pepr):
install.packages("pepr")
library("pepr")
p = pepr::Project("config.yaml")
projConfig = config(p)
mySamples = sampleTable(p)
Python (peppy):
pip install peppy
import peppy
prj = peppy.Project("config.yaml")
samples = prj.samples
sample_table = prj.sample_table
Problem 2
I want to test something in a workflow computing environment:
- I'm troubleshooting a failing command
- I want to try a step interactively with other data
- I want to demo a different approach using the same tools
How do I run interactive code at the terminal...
as if I were a workflow?
Docker
#!/usr/bin/env cwl-runner
cwlVersion: v1.0
class: CommandLineTool
baseCommand: node
hints:
  DockerRequirement:
    dockerPull: node:slim
inputs:
  src:
    type: File
    inputBinding:
      position: 1
outputs:
  example_out:
    type: stdout
stdout: output.txt
docker run -i --rm \
  --volume /home:/home \
  --volume /tmp:/tmp \
  --volume /ext:/ext \
  --env=TMPDIR \
  --workdir `pwd` \
  --user=1000:1000 \
  --network="host" \
  --docker-arg \
  --another-docker-arg \
  --yet-another-docker-arg \
  node:slim node ... command
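Nobody wants to type that by hand, but the command is mechanical enough to assemble programmatically. The sketch below builds a comparable argument list in Python. It is a simplified illustration and not how cwl-runner or any other tool constructs its command; `docker_argv`, its defaults, and `script.js` are all hypothetical.

```python
import os
import shlex


def docker_argv(image, command, volumes=("/home", "/tmp"), workdir=None, user=None):
    """Assemble a `docker run` argument list for one containerized command.

    Hypothetical helper; each volume is mounted at the same path inside
    the container, as in the command above.
    """
    argv = ["docker", "run", "-i", "--rm"]
    for volume in volumes:
        argv += ["--volume", f"{volume}:{volume}"]
    # Default to the current directory, like `pwd` in the shell version.
    argv += ["--workdir", workdir or os.getcwd()]
    if user is not None:
        argv += ["--user", user]
    argv.append(image)
    argv += list(command)
    return argv


argv = docker_argv(
    "node:slim",
    ["node", "script.js"],  # script.js is a made-up placeholder
    workdir="/home/me/project",
    user="1000:1000",
)
print(shlex.join(argv))
```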
Intro to Bulker
Simple commands run in containers behind the scenes
bulker.io
Bulker basics
pip install bulker
bulker load demo
bulker activate demo
cowsay Hello world! <- actually runs in docker
Bulker + CWL = cwl2man
bulker cwl2man -c workflow.cwl -m manifest.yaml
bulker load my-interactive-env -m manifest.yaml
bulker activate my-interactive-env
voila!
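The core idea behind turning CWL into an interactive environment is pairing each tool's command with its container image. The sketch below shows that extraction for a single, already-parsed CommandLineTool. It is not bulker's implementation, and the returned dictionary is a simplified, hypothetical structure and not bulker's actual manifest schema.

```python
def cwl_to_command_entry(tool):
    """Pair a CommandLineTool's base command with its Docker image.

    `tool` is a CWL document already parsed into a dict. The result is a
    simplified, hypothetical structure, not a real bulker manifest.
    """
    base = tool.get("baseCommand")
    if isinstance(base, list):
        base = base[0] if base else None

    image = None
    for section in ("requirements", "hints"):
        entries = tool.get(section) or {}
        # CWL allows a map keyed by class name or a list of {class: ...} entries.
        if isinstance(entries, list):
            entries = {entry.get("class"): entry for entry in entries}
        docker = entries.get("DockerRequirement")
        if docker and "dockerPull" in docker:
            image = docker["dockerPull"]
            break
    return {"command": base, "docker_image": image}


# The node example from the Docker slide, written as a Python dict.
node_tool = {
    "cwlVersion": "v1.0",
    "class": "CommandLineTool",
    "baseCommand": "node",
    "hints": {"DockerRequirement": {"dockerPull": "node:slim"}},
}

entry = cwl_to_command_entry(node_tool)
print(entry)
```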
Summary
looper : scattering CWL workflows across tabular data
bulker : portable, interactive environments from CWL
Thank You
Collaborators
Vince Reuter
Andre Rendeiro
Levi Waldron
Sheffield lab
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
Aaron Gu
John Stubbs
Bingjie Xue
nsheff · databio.org · nsheffield@virginia.edu