Bioinformatics pipeline development and deployment with Pypiper and Looper

Nathan Sheffield, PhD
www.databio.org/slides
Jordan got data from the sequencer today.
He sits down at the terminal to process it.
bowtie2 sample.fastq ...
trimmomatic ...
macs2 output.fastq ...
Nice!
Two weeks later...
Jordan got data from the sequencer today.
He sits down at the terminal to process it.
Hmm... what did I do last time?
Jordan has an idea.
He sits down at the terminal to script it.
Nice!
Two weeks later...
Jordan got data from the sequencer today.
He gets the script running...
The server crashes. It would be nice if the script could pick up where it left off...
Now Jordan has 500 samples for a time series experiment.
He starts writing some looping functions to handle cluster submission.
This is going to take a while...
In the meantime, Jordan generates other samples requiring slightly different parameters.
No problem, I'll just duplicate this script...
Stop! There is a better way...

Challenges with shell pipelines

No record of the output of the tools
Failed steps do not halt the pipeline, so downstream steps run on incomplete data
Difficult to scale to 500 samples
Two pipelines running simultaneously may interfere
Tracking which version was used with which samples
Memory use is left unmonitored and unchecked
Requires custom parsers to extract results
Python modules

Pypiper

Builds a pipeline for a single sample.

Looper

Deploys pipelines across samples.


Comprehensive pipeline management system

Pypiper

Builds a pipeline for a single sample.

Pypiper features

Simplicity
Restartability
File integrity lock
Memory monitoring
Job monitoring
Robust error handling
Automatic logging
Easy result reports
Collate input files
Simplicity
Bash script:
shuf -i 1-500000000 -n 10000000 > outfile.txt
Pypiper script:
pm.run("shuf -i 1-500000000 -n 10000000 > outfile.txt")
Using pypiper is as easy as writing a shell script.

Additional options provide power on demand.
Restartability
target = os.path.join(outfolder, "outfile.txt")  # output file
command = "shuf -i 1-500000000 -n 10000000 > " + target
pm.run(command, target)
Commands (optionally) only run if target does not already exist.

Pipeline will thus pick up where it left off.

File integrity lock
Lock files ensure commands only run if the target is unlocked:
  • pipelines will not proceed with incomplete files
  • multiple pipelines can create/use the same files
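The core idea, as a minimal sketch (illustrative only; the function and lock file names are assumptions, and pypiper's real implementation also handles waiting, recovery, and cleanup):

import os, subprocess

def run_with_lock(command, target):
    lock = target + ".lock"   # lock file name is illustrative
    if os.path.exists(lock):
        return                # another pipeline owns this target; skip
    if os.path.exists(target):
        return                # target already complete; nothing to do
    open(lock, "w").close()   # claim the target
    try:
        subprocess.run(command, shell=True, check=True)
    finally:
        os.remove(lock)       # release only once the command has finished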
Job monitoring
Pypiper uses a flag system to track status: each job drops a flag file marking it as running, completed, or failed.

Summarizing jobs is easy: just count the flags (in this example: running = 2, completed = 17, failed = 1); see the sketch below.
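A minimal sketch of counting flags (the folder layout and flag naming here are assumptions for illustration, not pypiper's exact convention):

import glob

# Tally flag files across all sample output folders;
# assumed layout: pipeline_output/<sample>/<pipeline>_<status>.flag
for status in ("running", "completed", "failed"):
    n = len(glob.glob(f"pipeline_output/*/*_{status}.flag"))
    print(f"{status} = {n}")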
Robust error handling

If any command fails (returns a nonzero exit code), the pipeline halts and is flagged as failed, so errors cannot silently propagate to downstream steps.
Automatic logging

All output is automatically duplicated to both the screen and a log file.
Easy result reports
reads = count_reads(unaligned_file)    # count_reads: a user-defined helper
aligned = count_reads(aligned_file)
pm.report_result("aligned_reads", aligned)
pm.report_result("alignment_rate", aligned / reads)
Output:
aligned_reads	2526232
alignment_rate	0.64234
Example pipeline
	import pypiper, os
	outfolder = "pipeline_output/"  # folder for results
	pm = pypiper.PipelineManager(name="shuf", outfolder=outfolder)

	target = os.path.join(outfolder, "outfile.txt")  # output file
	command = "shuf -i 1-500000000 -n 10000000 > " + target
	pm.run(command, target)

	pm.stop_pipeline()
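Saved as, say, shuf_pipeline.py (the name is illustrative), the script runs directly:

python shuf_pipeline.py

Because of restartability, re-running it after an interruption skips the shuf command whenever outfile.txt already exists.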

Looper

Deploys pipelines across samples by connecting
samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
  • maps protocols to pipelines
  • maps sample attributes (columns) to pipeline arguments
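Given this interface, looper builds a command line for each RNA-seq sample roughly like the following (illustrative; the actual values come from that sample's annotation columns):

path/to/rna-seq.py --option1 <value of sample_attribute> --option2 <value of sample_attribute2>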
Looper features

Single-input runs
Flexible pipelines
Flexible resources
Flexible compute
Job status-aware

Single-input runs

Run your entire project with one line:
looper run project_config.yaml
Flexible pipelines

protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq

Many-to-many mappings: several protocols can share one pipeline, and one protocol can trigger several pipelines.
Flexible resources

pipeline_key:
  name: pipeline_name
  arguments:
    "--option": value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"

Resources can vary by input file size: samples whose input exceeds a package's file_size threshold request that package's cores, memory, and time.
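A minimal sketch of the selection idea (an illustration under assumed semantics and units, not looper's exact rule):

def pick_resources(input_size, packages):
    # Choose the package with the largest file_size threshold that the
    # input size still meets; "default" (threshold 0) always qualifies.
    qualifying = [p for p in packages.values()
                  if input_size >= float(p["file_size"])]
    return max(qualifying, key=lambda p: float(p["file_size"]))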
Flexible compute

compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh

Adjust the compute package on the fly:
> looper run project_config.yaml --compute localhost
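Each submission_template is a shell script with placeholders that looper fills in per job; a minimal SLURM template might look like this (the placeholder names are assumptions for illustration):

#!/bin/bash
#SBATCH --job-name={JOBNAME}
#SBATCH --cpus-per-task={CORES}
#SBATCH --mem={MEM}
#SBATCH --time={TIME}

{CODE}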
Job status-aware

Looper only submits jobs for samples not already flagged as running, completed, or failed.

looper check project_config.yaml
looper summarize project_config.yaml
Combine for a complete pipelining system

How is this better than _____ ?

• low barrier to entry (i.e., language)
• decoupled single-sample processing (pypiper) from deployment across samples (looper)
• simplified parallelism

Parallelism Philosophy

by process: very easy
by sample: easy
by dependence: hard

Getting started

Read the docs!

Using a pipeline

Create a sample_annotation.csv

Create a project_config.yaml

Templates exist for both; follow the Looper tutorials (a minimal example of each file appears below).
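An illustrative pair (column and key names are examples only, not the required schema; the Looper tutorial has the authoritative templates):

sample_annotation.csv:

sample_name,protocol,data_source
frog_1,RNA-seq,/data/frog_1.fastq
frog_2,RNA-seq,/data/frog_2.fastq

project_config.yaml:

metadata:
  sample_annotation: sample_annotation.csv
  output_dir: pipeline_output/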

Building a pipeline

Follow the tutorials for Pypiper.

Write a Pypiper pipeline to handle a single sample.

Connect it to Looper with a protocol mapping and a pipeline interface.

Thanks for listening!

Slides at http://databio.org/slides/pypiper_looper.html