Bioinformatics pipeline development and deployment with Pypiper and Looper

by Nathan Sheffield

Slides at http://databio.org/slides/pypiper_looper.html
Jordan got data from the sequencer today.
He sits down at the terminal to process it.
bowtie2 sample.fastq ...
trimmomatic ...
macs2 callpeak -t output.bam ...
Nice!
Two weeks later...
Jordan got data from the sequencer today.
He sits down at the terminal to process it.
Hmm... what did I do last time?
Jordan has an idea.
He sits down at the terminal to script it.
Nice!
Two weeks later...
Jordan got data from the sequencer today.
He gets the script running...
The server crashes. It would be nice if the script could pick up where it left off...
Now Jordan has 500 samples for a time series experiment.
He starts writing some looping functions to handle cluster submission.
This is going to take a while...
In the meantime, Jordan generates other samples requiring slightly different parameters.
No problem, I'll just duplicate this script...
Stop! There is a better way...

Challenges with shell pipelines

No record of the output of the tools
Failed steps do not halt the entire pipeline
Difficult to scale to 500 samples
Two pipelines running simultaneously may interfere
Tracking which version was used with which samples
Memory use is left unmonitored and unchecked
Requires custom parsers to extract results
Python modules

Pypiper

Builds a pipeline for a single sample.

Looper

Deploys pipelines across samples.


Comprehensive pipeline management system

Pypiper

Builds a pipeline for a single sample.

Pypiper features

Simplicity
Restartability
File integrity lock
Memory monitoring
Job monitoring
Robust error handling
Automatic logging
Easy result reports
Simplicity
Bash script:
shuf -i 1-500000000 -n 10000000 > outfile.txt
Pypiper script:
pm.run("shuf -i 1-500000000 -n 10000000 > outfile.txt")
Using pypiper is as easy as writing a shell script.

Additional options provide power on demand.
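
For example, run() takes optional keyword arguments (names below are from Pypiper's documented options; verify against your installed version):

pm.timestamp("Shuffle step")   # mark a named step in the log
pm.run("shuf -i 1-500000000 -n 10000000 > outfile.txt", shell=True)   # force a shell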
Restartability
target = os.path.join(outfolder, "outfile.txt")  # output file
cmd = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)
Commands (optionally) only run if target does not already exist.

Pipeline will thus pick up where it left off.

File integrity lock
Lock files ensure commands only run if the target is unlocked:
  • pipelines will not proceed with incomplete files
  • multiple pipelines can create/use the same files
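
Conceptually, the lock works something like this sketch (illustrative only, not Pypiper's actual implementation; wait_for_lock and run_command are hypothetical helpers):

import os

lock = target + ".lock"
if os.path.exists(lock):
    wait_for_lock(lock)        # hypothetical: another pipeline owns the target
elif not os.path.exists(target):
    open(lock, "w").close()    # claim the target before writing it
    run_command(command)       # hypothetical: produce the target
    os.remove(lock)            # release; the target is now complete and trusted
# a target that exists and is unlocked is complete: skip the command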
Job monitoring
Pypiper uses a flag system to track status
Flags: job running · job completed · job failed

Summarizing jobs is easy: just count the flags:

running = 2    completed = 17    failed = 1
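
A quick tally, for example (flag filenames assumed to follow the pattern *_<status>.flag; adjust the glob to your output layout):

from glob import glob

for status in ("running", "completed", "failed"):
    print(status, len(glob("pipeline_output/*_%s.flag" % status)))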
Robust error handling

If a process fails, the pipeline fails.
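
For non-essential steps, run() accepts a nofail option so a failure is logged without halting the pipeline (option name per Pypiper's documentation; confirm for your version):

pm.run("fastqc sample.fastq", nofail=True)   # failure logged; pipeline continues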
Automatic logging

Output is automatically split to screen and file.
Easy result reports
reads = count_reads(unaligned_file)   # count_reads: your own helper function
aligned = count_reads(aligned_file)
pm.report_result("aligned_reads", aligned)
pm.report_result("alignment_rate", aligned/reads)
Output:
aligned_reads	2526232
alignment_rate	0.64234
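
Results land in a tab-separated stats file in the output folder, so no custom parsers are needed downstream. A minimal reader sketch (the stats filename here is an assumption; check your output folder for the actual name):

import csv

with open("pipeline_output/stats.tsv") as f:
    stats = {row[0]: row[1] for row in csv.reader(f, delimiter="\t") if len(row) >= 2}
print(stats["alignment_rate"])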
Example pipeline
	import pypiper, os
	outfolder = "pipeline_output/"  # folder for results
	pm = pypiper.PipelineManager(name="shuf", outfolder=outfolder)

	target = os.path.join(outfolder, "outfile.txt")  # output file
	command = "shuf -i 1-500000000 -n 10000000 > " + target
	pm.run(command, target)

	pm.stop_pipeline()
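
Save it as, say, shuf_pipeline.py (filename illustrative) and run it like any other Python script; the log, flags, and stats file all land in pipeline_output/:

python shuf_pipeline.py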
	

Looper

Deploys pipelines across samples.
Pipeline Users:
Looper needs to know:

What samples?
Where to store results?
What pipelines to run?

You provide:

sample_annotation.csv
project_config.yaml
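
A minimal sample_annotation.csv might look like this (sample_name identifies the sample and library identifies the protocol; the other columns are illustrative, so check the Looper docs for what your pipelines expect):

sample_name,library,organism,flowcell,lane
frog_0h,RRBS,frog,C6G0A,1
frog_1h,RRBS,frog,C6G0A,2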


Pipeline Developers:
Looper needs to know:

What protocol?
What resources or arguments?
You provide:

protocol_mapping.yaml
pipeline_interface.yaml

Looper features

Flexible pipelines
Flexible resources
Project API
Single-input runs
Flexible compute
Job status-aware
Subprojects
Collate input files
Flexible pipelines
RRBS: rrbs.py
WGBS: wgbs.py
EG: wgbs.py
SMART-seq:  >
  rnaBitSeq.py -f;
  rnaTopHat.py -f
ATAC-SEQ: atacseq.py
CHIP-SEQ: chipseq.py
protocol_mapping.yaml maps protocols to pipelines
Flexible resources
pipeline_script.py:  # this key is variable: the pipeline script's filename
  name: value  # used for assessing pipeline flags (optional)
  looper_args: True
  arguments:
    "-k" : value
    "--key2" : null # value-less argument flags
  resources:
    default:
      file_size: "0"
      cores: "4"
      mem: "6000"
      time: "2-00:00:00"
    resource_package_name:
      file_size: "2000"
      cores: "4"
      mem: "6000"
      time: "2-00:00:00"
pipeline_interface.yaml: resources vary by input file size
Project API
Single-input runs
looper run project_config.yaml
metadata:
  sample_annotation: table_experiments.csv
  output_dir: /groups/lab/projects/example
  pipelines_dir: /groups/lab/projects/example/pipelines
All you need is one file: project_config.yaml
Flexible compute
compute:
  # Use this to change your cluster manager (SLURM, SGE, LSF, etc.)
  submission_template: templates/slurm_template.sub
  submission_command: sbatch
  # To run on the localhost:
  # submission_template: templates/localhost_template.sub
  # submission_command: sh
Job status-aware
Looper only submits jobs for samples not already flagged as running, completed, or failed.
looper summarize project_config.yaml
looper check project_config.yaml
Subprojects
subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv
Hierarchical replacement: subproject values override the parent project's settings.
Lets you define multiple project variants in a single file.
looper run project_config.yaml --sp cancer
Collate input files
data_sources:
  # specify the ABSOLUTE PATH of input files using variable paths
  # entries correspond to values in data columns
  # {variable} identifies sample annotation columns
  my_samples: "{RAWDATA}/{flowcell}_{lane}/{name}.bam"
  encode_rrbs: "/lab/projects/encode/fastq/{sample_name}.fastq.gz"
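
For example, a hypothetical sample with flowcell C6G0A, lane 1, and name frog_0h would resolve the my_samples entry to:

{RAWDATA}/C6G0A_1/frog_0h.bam

with {RAWDATA} itself supplied separately (for example, as a project-level variable).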
Combine for a complete pipelining system

How is this better than _____ ?

  • low barrier to entry (plain Python; no new workflow language to learn)
  • decoupled single-sample processing (Pypiper) from deployment across samples (Looper)
  • simplified parallelism

Parallelism Philosophy


Parallel by sample:       Very effective      Easy
Parallel by process:      Quite effective     Easy
Parallel by dependence:   Kind-of effective   Hard

Getting started

Read the docs!

Using a pipeline

Create a sample_annotation.csv

Create a project_config.yaml

Templates exist for both; follow the Looper tutorials.

Building a pipeline

Follow the Pypiper tutorials.

Write a Pypiper pipeline to handle a single sample.

Connect it to Looper with a protocol mapping and a pipeline interface.
Thanks for listening!

Slides at http://databio.org/slides/pypiper_looper.html