Bioinformatics pipeline development and deployment with Pypiper and Looper

Nathan Sheffield, PhD
www.databio.org/slides
Jordan got data from the sequencer today.
He sits down at the terminal to process it.
bowtie2 sample.fastq ...
trimmomatic ...
macs2 output.fastq ...
Nice!
Two weeks later...
Jordan got data from the sequencer today.
He sits down at the terminal to process it.
Hmm... what did I do last time?
Jordan has an idea.
He sits down at the terminal to script it.
Nice!
Two weeks later...
Jordan got data from the sequencer today.
He gets the script running...
The server crashes. It would be nice if the script could pick up where it left off...
Now Jordan has 500 samples for a time series experiment.
He starts writing some looping functions to handle cluster submission.
This is going to take a while...
In the meantime, Jordan generates other samples requiring slightly different parameters.
No problem, I'll just duplicate this script...
Stop! There is a better way...

Challenges with shell pipelines

No record of the output of the tools
Failed steps do not halt the pipeline, so downstream steps run on incomplete data
Difficult to scale to 500 samples
Two pipelines running simultaneously may interfere
Tracking which version was used with which samples
Memory use is left unmonitored and unchecked
Requires custom parsers to extract results
Python modules

Pypiper

Builds a pipeline for a single sample.

Looper

Deploys pipelines across samples.


Comprehensive pipeline management system

Pypiper

Builds a pipeline for a single sample.

Pypiper features

Simplicity
Restartability
File integrity lock
Memory monitoring
Job monitoring
Robust error handling
Automatic logging
Easy result reports
Collate input files
Simplicity
Bash script:
shuf -i 1-500000000 -n 10000000 > outfile.txt
Pypiper script:
pm.run("shuf -i 1-500000000 -n 10000000 > outfile.txt")
Using pypiper is as easy as writing a shell script.

Additional options provide power on demand.
Restartability
target = os.path.join(outfolder, "outfile.txt")  # output file
command = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)
Commands (optionally) only run if target does not already exist.

Pipeline will thus pick up where it left off.

File integrity lock
Lock files ensure commands only run if the target is unlocked:
  • pipelines will not proceed with incomplete files
  • multiple pipelines can create/use the same files
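The core idea, as a minimal sketch (illustrative only; the function and lock file names are assumptions, and pypiper's real implementation also handles waiting, recovery, and cleanup):

import os, subprocess

def run_with_lock(command, target):
    lock = target + ".lock"   # lock file name is illustrative
    if os.path.exists(lock):
        return                # another pipeline owns this target; skip
    if os.path.exists(target):
        return                # target already complete; nothing to do
    open(lock, "w").close()   # claim the target
    try:
        subprocess.run(command, shell=True, check=True)
    finally:
        os.remove(lock)       # release only once the command has finished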
Job monitoring
Pypiper uses a flag system to track status: each job drops a flag file marking it as running, completed, or failed.

Summarizing jobs is easy: just count the flags (in this example: running = 2, completed = 17, failed = 1); see the sketch below.
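A minimal sketch of counting flags (the folder layout and flag naming here are assumptions for illustration, not pypiper's exact convention):

import glob

# Tally flag files across all sample output folders;
# assumed layout: pipeline_output/<sample>/<pipeline>_<status>.flag
for status in ("running", "completed", "failed"):
    n = len(glob.glob(f"pipeline_output/*/*_{status}.flag"))
    print(f"{status} = {n}")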
Robust error handling

If any command fails (returns a nonzero exit code), the pipeline halts and is flagged as failed, so errors cannot silently propagate to downstream steps.
Automatic logging

All output is automatically duplicated to both the screen and a log file.
Easy result reports
reads = count_reads(unaligned_file)    # count_reads: a user-defined helper
aligned = count_reads(aligned_file)
pm.report_result("aligned_reads", aligned)
pm.report_result("alignment_rate", aligned / reads)
Output:
aligned_reads	2526232
alignment_rate	0.64234
Example pipeline
	import pypiper, os
	outfolder = "pipeline_output/"  # folder for results
	pm = pypiper.PipelineManager(name="shuf", outfolder=outfolder)

	target = os.path.join(outfolder, "outfile.txt")  # output file
	command = "shuf -i 1-500000000 -n 10000000 > " + target
	pm.run(command, target)

	pm.stop_pipeline()
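Saved as, say, shuf_pipeline.py (the name is illustrative), the script runs directly:

python shuf_pipeline.py

Because of restartability, re-running it after an interruption skips the shuf command whenever outfile.txt already exists.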

Looper

Deploys pipelines across samples by connecting
samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
  • maps protocols to pipelines
  • maps sample attributes (columns) to pipeline arguments
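Given this interface, looper builds a command line for each RNA-seq sample roughly like the following (illustrative; the actual values come from that sample's annotation columns):

path/to/rna-seq.py --option1 <value of sample_attribute> --option2 <value of sample_attribute2>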
Looper features

Single-input runs
Flexible pipelines
Flexible resources
Flexible compute
Job status-aware

Single-input runs

Run your entire project with one line:
looper run project_config.yaml
Flexible pipelines

protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq

Many-to-many mappings: several protocols can share one pipeline, and one protocol can trigger several pipelines.
Flexible resources

pipeline_key:
  name: pipeline_name
  arguments:
    "--option": value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"

Resources can vary by input file size: samples whose input exceeds a package's file_size threshold request that package's cores, memory, and time.
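A minimal sketch of the selection idea (an illustration under assumed semantics and units, not looper's exact rule):

def pick_resources(input_size, packages):
    # Choose the package with the largest file_size threshold that the
    # input size still meets; "default" (threshold 0) always qualifies.
    qualifying = [p for p in packages.values()
                  if input_size >= float(p["file_size"])]
    return max(qualifying, key=lambda p: float(p["file_size"]))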
Flexible compute

compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh

Adjust the compute package on the fly:
> looper run project_config.yaml --compute localhost
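Each submission_template is a shell script with placeholders that looper fills in per job; a minimal SLURM template might look like this (the placeholder names are assumptions for illustration):

#!/bin/bash
#SBATCH --job-name={JOBNAME}
#SBATCH --cpus-per-task={CORES}
#SBATCH --mem={MEM}
#SBATCH --time={TIME}

{CODE}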
Job status-aware

Looper only submits jobs for samples not already flagged as running, completed, or failed.

looper check project_config.yaml
looper summarize project_config.yaml
Combine for a complete pipelining system

How is this better than _____ ?

• low barrier to entry (i.e., language)
• decoupled single-sample processing (pypiper) from deployment across samples (looper)
• simplified parallelism

Parallelism Philosophy

by process: very easy
by sample: easy
by dependence: hard

Getting started

Read the docs!

Using a pipeline

Create a sample_annotation.csv

Create a project_config.yaml

Templates exist for both; follow the Looper tutorials (a minimal example of each file appears below).
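An illustrative pair (column and key names are examples only, not the required schema; the Looper tutorial has the authoritative templates):

sample_annotation.csv:

sample_name,protocol,data_source
frog_1,RNA-seq,/data/frog_1.fastq
frog_2,RNA-seq,/data/frog_2.fastq

project_config.yaml:

metadata:
  sample_annotation: sample_annotation.csv
  output_dir: pipeline_output/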

Building a pipeline

Follow the tutorials for Pypiper.

Write a Pypiper pipeline to handle a single sample.

Connect it to Looper with a protocol mapping and a pipeline interface.

Thanks for listening!

Slides at http://databio.org/slides/pypiper_looper.html