Scattering CWL across tabular samples and interactive computing environments from CWL

Nathan Sheffield, PhD

www.databio.org/slides

Problem 1

I got a sample table from my collaborator.
How do I run a CWL workflow on it?

A simple CWL job

cwl-runner wc-tool.cwl wc-job.yml

cwlVersion: v1.0
class: CommandLineTool
baseCommand: [wc, -l]
inputs:
  file:
    type: File
    inputBinding:
      position: 1
outputs: []

file:
  class: File
  path: data/frog1_data.txt

Scatter across multiple samples...

Scattering with nested workflows

main.cwl

steps:
  alignment:
    run: alignment.cwl
    scatter: fq
    in:
      fq: fq
      genome: genome
      gtf: gtf
    out: [qc_html, bam]
  featureCounts:
    requirements:
      ResourceRequirement:
        ramMin: 500
    run: featureCounts.cwl
    in:
      n_input_bam: alignment/bam
      gtf: gtf
    out: [featurecounts]

inputs.yaml

fq:
  - class: File
    location: rnaseq/raw_fastq/s1.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/s2.fq
    format: http://edamontology.org/format_1930
  - class: File
    location: rnaseq/raw_fastq/s3.fq
    format: http://edamontology.org/format_1930
genome:
  class: Directory
  location: hg19-chr1-STAR-index
gtf:
  class: File
  location: rnaseq/ref/genes.gtf

Adapted from Peter Amstutz
github.com/common-workflow-library/rnaseq-cwl-training
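
Conceptually, `scatter: fq` runs the alignment step once per element of `fq` and gathers each output into an array, which the non-scattered featureCounts step then receives whole. A minimal Python sketch of that idea follows; `align` and `feature_counts` are hypothetical stand-ins for the two CWL steps, not the real tools, and the output file names are invented for illustration.

```python
from typing import Dict, List


def align(fq: str, genome: str, gtf: str) -> Dict[str, str]:
    """Hypothetical stand-in for alignment.cwl: one FASTQ in, one BAM and QC report out."""
    stem = fq.rsplit("/", 1)[-1].rsplit(".", 1)[0]
    return {"bam": f"{stem}.bam", "qc_html": f"{stem}_qc.html"}


def feature_counts(bams: List[str], gtf: str) -> str:
    """Hypothetical stand-in for featureCounts.cwl: all BAMs in, one summary out."""
    return f"featurecounts over {len(bams)} BAM files"


def run_workflow(fq: List[str], genome: str, gtf: str) -> str:
    # scatter: fq -> one alignment job per FASTQ; genome and gtf are shared.
    scattered = [align(f, genome, gtf) for f in fq]
    # Each output of a scattered step becomes an array, in input order.
    bams = [job["bam"] for job in scattered]
    # The downstream step is not scattered, so it receives the whole array.
    return feature_counts(bams, gtf)


fastqs = [f"rnaseq/raw_fastq/s{i}.fq" for i in (1, 2, 3)]
summary = run_workflow(fastqs, "hg19-chr1-STAR-index", "rnaseq/ref/genes.gtf")
print(summary)
```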

But I have a CSV sample table

sample_name,library,file
frog_1,anySampleType,data/frog1_data.txt
frog_2,anySampleType,data/frog2_data.txt
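
The gap between this table and a CWL job file is small: each row needs to become one input object. The sketch below shows one way to do that by hand, using only the standard library. The `row_to_job` helper and the choice of `file` as the only file-valued column are assumptions made for illustration; this is not how looper itself is implemented.

```python
import csv
import io
import json

# The sample table from the slide, inlined so the sketch is self-contained.
SAMPLE_TABLE = """sample_name,library,file
frog_1,anySampleType,data/frog1_data.txt
frog_2,anySampleType,data/frog2_data.txt
"""


def row_to_job(row, file_columns=("file",)):
    """Build a CWL input object from one table row.

    Only file_columns are kept (an assumption): wc-tool.cwl declares a
    single input named `file`, so the other columns are left out.
    """
    return {
        key: {"class": "File", "path": value}
        for key, value in row.items()
        if key in file_columns
    }


jobs = {
    row["sample_name"]: row_to_job(row)
    for row in csv.DictReader(io.StringIO(SAMPLE_TABLE))
}

# JSON is valid YAML, so each object could be saved and used as a job file.
print(json.dumps(jobs["frog_1"], indent=2))
```

This works for two samples, but it hard-codes which columns are files and where the job files go, which is the bookkeeping a general tool takes over.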

Introduction to looper

looper run project_config.yaml

cwl_interface.yaml:


pipeline_name: count_lines
pipeline_type: sample
input_schema: input_schema.yaml
command_template: >
  cwl-runner wc-tool.cwl {sample.sample_yaml_cwl}
pre_submit:
  python_functions:
    - looper.write_sample_yaml_cwl

project_config.yaml:


pep_version: 2.0.0
sample_table: file_list.csv
sample_modifiers:
  append:
    pipeline_interfaces: cwl_interface.yaml
looper:
  output_dir: pipeline_results

Scattering across samples using looper


> looper run project_config.yaml


Looper version: 1.3.1-dev
Command: run
## [1 of 2] sample: frog_1; pipeline: count_lines
Calling pre-submit function: looper.write_sample_yaml_cwl
Writing sample yaml to pipeline_results/submission/frog_1_sample_cwl.yaml
Writing script to /home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/pipeline_results/submission/count_lines_frog_1.sub
Job script (n=1; 0.00Gb): pipeline_results/submission/count_lines_frog_1.sub
Compute node: zither
Start time: 2021-01-26 14:54:50
INFO /home/nsheff/.local/bin/cwl-runner 3.0.20200807132242
INFO Resolved 'wc-tool.cwl' to 'file:///home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/wc-tool.cwl'
INFO [job wc-tool.cwl] /tmp/7vhoojf2$ wc \
    -l \
    /tmp/tmpxcekd0he/stg6b7f7559-6e4f-409b-8b9f-adc73dd5ca82/frog1_data.txt
4 /tmp/tmpxcekd0he/stg6b7f7559-6e4f-409b-8b9f-adc73dd5ca82/frog1_data.txt
INFO [job wc-tool.cwl] completed success
{}
INFO Final process status is success
## [2 of 2] sample: frog_2; pipeline: count_lines
Calling pre-submit function: looper.write_sample_yaml_cwl
Writing sample yaml to pipeline_results/submission/frog_2_sample_cwl.yaml
Writing script to /home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/pipeline_results/submission/count_lines_frog_2.sub
Job script (n=1; 0.00Gb): pipeline_results/submission/count_lines_frog_2.sub
Compute node: zither
Start time: 2021-01-26 14:54:51
INFO /home/nsheff/.local/bin/cwl-runner 3.0.20200807132242
INFO Resolved 'wc-tool.cwl' to 'file:///home/nsheff/code/incubator/learn_cwl/cwl-pep/simple_demo/wc-tool.cwl'
INFO [job wc-tool.cwl] /tmp/96ojstvh$ wc \
    -l \
    /tmp/tmp09syks28/stgd2a8068d-a68f-4759-9cf6-0f359aa49740/frog2_data.txt
7 /tmp/tmp09syks28/stgd2a8068d-a68f-4759-9cf6-0f359aa49740/frog2_data.txt
INFO [job wc-tool.cwl] completed success
{}
INFO Final process status is success

Looper finished
Samples valid for job generation: 2 of 2
Commands submitted: 2 of 2
Jobs submitted: 2
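
The run above follows a simple pattern: for each sample, a pre-submit function records a per-sample YAML path, and the command template is then filled in with that sample's attributes. Here is a rough Python sketch of that pattern. It uses `str.format` as a stand-in for whatever template engine looper actually uses, the local `write_sample_yaml_cwl` function is a simplified imitation that only computes the path, and the path pattern is copied from the log lines above.

```python
from types import SimpleNamespace

COMMAND_TEMPLATE = "cwl-runner wc-tool.cwl {sample.sample_yaml_cwl}"


def write_sample_yaml_cwl(sample, output_dir):
    """Imitation of a pre-submit hook: set the path of the sample's job file.

    Only the path is computed here; no file is written.
    """
    sample.sample_yaml_cwl = (
        f"{output_dir}/submission/{sample.sample_name}_sample_cwl.yaml"
    )
    return sample


samples = [SimpleNamespace(sample_name=name) for name in ("frog_1", "frog_2")]

commands = []
for sample in samples:
    sample = write_sample_yaml_cwl(sample, "pipeline_results")
    # str.format resolves the dotted {sample.attr} placeholder on the object.
    commands.append(COMMAND_TEMPLATE.format(sample=sample))

for command in commands:
    print(command)
```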

Looper uses a generic input format.


looper run config.yaml


install.packages("pepr")

library("pepr")

p = pepr::Project("config.yaml")
projConfig = config(p)
mySamples = sampleTable(p)


pip install peppy

import peppy

prj = peppy.Project("config.yaml")
samples = prj.samples
sample_table = prj.sample_table

Features of PEP

Project modifiers

Sample modifiers

Schema validation

Learn more:
http://pep.databio.org
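
To make two of these features concrete, here is a small sketch using plain dictionaries instead of the PEP libraries. `apply_append` imitates the `append` sample modifier from project_config.yaml by adding the same attribute to every sample, and `missing_attributes` is a very rough stand-in for schema validation. Both helper names are hypothetical, and real PEP validation is schema-based and does considerably more.

```python
samples = [
    {"sample_name": "frog_1", "library": "anySampleType", "file": "data/frog1_data.txt"},
    {"sample_name": "frog_2", "library": "anySampleType", "file": "data/frog2_data.txt"},
]


def apply_append(samples, append):
    """Imitate the 'append' sample modifier: add constant attributes to every sample."""
    return [{**sample, **append} for sample in samples]


def missing_attributes(sample, required):
    """Rough stand-in for validation: list required attributes that are absent or empty."""
    return [name for name in required if not sample.get(name)]


modified = apply_append(samples, {"pipeline_interfaces": "cwl_interface.yaml"})
problems = {
    sample["sample_name"]: missing_attributes(sample, ["sample_name", "file"])
    for sample in modified
}
print(modified[0]["pipeline_interfaces"], problems)
```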

Problem 2

I want to test something in a workflow computing environment:

I'm troubleshooting a failing command
I want to try a step interactively with other data
I want to demo a different approach using the same tools

How do I run interactive code at the terminal...
as if I were a workflow?

Docker

#!/usr/bin/env cwl-runner

cwlVersion: v1.0
class: CommandLineTool
baseCommand: node
hints:
  DockerRequirement:
    dockerPull: node:slim
inputs:
  src:
    type: File
    inputBinding:
      position: 1
outputs:
  example_out:
    type: stdout
stdout: output.txt

docker run -i --rm \
  --volume /home:/home \
  --volume /tmp:/tmp \
  --volume /ext:/ext \
  --env=TMPDIR \
  --workdir `pwd` \
  --user=1000:1000 \
  --network="host" \
  --docker-arg \
  --another-docker-arg \
  --yet-another-docker-arg \
node:slim node ... command
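
The long command is mechanical to derive from the tool description, which is why typing it by hand is tedious. The sketch below assembles a comparable `docker run` command from a parsed tool document. It is an illustration only, not cwltool's actual implementation; the `docker_command` helper, the default volume list, and the `script.js` argument are assumptions, and it expects `hints` and `requirements` in map form as on the slide.

```python
import shlex


def docker_command(tool, tool_args, workdir, volumes=("/home", "/tmp"), uid=1000, gid=1000):
    """Assemble a `docker run` argv for a CommandLineTool-like dict."""
    # A DockerRequirement may appear under requirements or hints (map form assumed).
    docker = (
        tool.get("requirements", {}).get("DockerRequirement")
        or tool.get("hints", {}).get("DockerRequirement")
    )
    if docker is None:
        raise ValueError("tool does not declare a DockerRequirement")

    base = tool["baseCommand"]
    base = [base] if isinstance(base, str) else list(base)

    argv = ["docker", "run", "-i", "--rm"]
    for path in volumes:
        # Mount each host path at the same location inside the container.
        argv += ["--volume", f"{path}:{path}"]
    argv += ["--env=TMPDIR", "--workdir", workdir, f"--user={uid}:{gid}"]
    return argv + [docker["dockerPull"]] + base + list(tool_args)


tool = {
    "baseCommand": "node",
    "hints": {"DockerRequirement": {"dockerPull": "node:slim"}},
}
argv = docker_command(tool, ["script.js"], workdir="/home/user/project")
print(shlex.join(argv))
```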

Intro to Bulker

Simple commands run in containers behind-the-scenes

bulker.io

Bulker basics


pip install bulker
bulker load demo
bulker activate demo
cowsay Hello world!   # actually runs in Docker
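
One way to make a plain command run in a container behind the scenes is to put a small wrapper script on the PATH under the command's name. The sketch below generates such a wrapper. The template is illustrative and is not bulker's actual one, and the image name `example/cowsay` is a hypothetical placeholder.

```python
import shlex

SHIM_TEMPLATE = """#!/bin/sh
# Hypothetical shim: forwards all arguments to the same command in a container.
exec docker run -i --rm --volume "$PWD:$PWD" --workdir "$PWD" {image} {command} "$@"
"""


def make_shim(command, image):
    """Return the text of a wrapper script that runs `command` inside `image`."""
    return SHIM_TEMPLATE.format(
        image=shlex.quote(image), command=shlex.quote(command)
    )


shim = make_shim("cowsay", "example/cowsay")
print(shim)
```

Saved as an executable file named `cowsay` in a directory that comes first on the PATH, a script like this would let `cowsay Hello world!` behave like a native command.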

Bulker + CWL = cwl2man


bulker cwl2man -c workflow.cwl -m manifest.yaml
bulker load my-interactive-env -m manifest.yaml
bulker activate my-interactive-env

Voilà!
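
The idea behind converting CWL into a manifest can be sketched in a few lines: walk the tool descriptions, and for each one that names a container image, record which command maps to which image. The code below is a simplified illustration, not the cwl2man implementation. The `cwl_to_manifest` helper is hypothetical, the manifest key names are assumptions about the layout, and the tools are given as already-parsed dictionaries with `hints` and `requirements` in map form.

```python
def cwl_to_manifest(name, tools):
    """Collect command-to-image mappings from parsed CommandLineTool documents."""
    commands = []
    for tool in tools:
        docker = (
            tool.get("requirements", {}).get("DockerRequirement")
            or tool.get("hints", {}).get("DockerRequirement")
        )
        if not docker:
            continue  # this tool does not name a container image
        base = tool["baseCommand"]
        executable = base if isinstance(base, str) else base[0]
        commands.append(
            {
                "command": executable,
                "docker_image": docker["dockerPull"],
                "docker_command": executable,
            }
        )
    return {"manifest": {"name": name, "commands": commands}}


# The two tools shown earlier in the slides, as parsed dictionaries.
node_tool = {
    "class": "CommandLineTool",
    "baseCommand": "node",
    "hints": {"DockerRequirement": {"dockerPull": "node:slim"}},
}
wc_tool = {"class": "CommandLineTool", "baseCommand": ["wc", "-l"]}

manifest = cwl_to_manifest("my-interactive-env", [node_tool, wc_tool])
print(manifest)
```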

Summary

looper: scattering CWL workflows across tabular data

bulker: portable, interactive environments from CWL

Thank You

Collaborators
Vince Reuter
Andre Rendeiro
Levi Waldron

Sheffield lab
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
Aaron Gu
John Stubbs
Bingjie Xue

Funding:

NIGMS R35-GM128636

nsheff · databio.org · nsheffield@virginia.edu