Tools for epigenome analysis of genomic regions and data-intensive project management

Nathan Sheffield, PhD
www.databio.org/slides

Overview

LOLA: Locus Overlap Analysis
MIRA: Methylation-based Inference of Regulatory Activity
PEP: Portable Encapsulated Projects





Locus Overlap Analysis

Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

A shiny app and server for interactive LOLA analysis.
Public server: http://lolaweb.databio.org
GitHub: https://github.com/databio/LOLAweb

DEMO

Methylation-based Inference of Regulatory Activity (MIRA)

Lawson et al. (2018). Bioinformatics.

DNA methylation


Bisulfite-seq

Region pooling


Sheffield et al. (2017). Nature Medicine.

Organizing large-scale biological data around standardized projects

Data is becoming more...

abundant
available
powerful
So why are the world's problems not solved?


First step in bioinformatics analysis:

pipeline

Papers with "bioinformatics pipeline" in title
Problem solved?

Data munging

Then, downstream tools need a different organization

What if?

PEP: Portable Encapsulated Projects

PEP format

project_config.yaml
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder

samples.csv
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
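
As a rough illustration of how a tool might consume the sample table above, here is a minimal sketch using only the Python standard library. It is not the peppy implementation; the helper name `read_samples` is hypothetical, and the table is embedded as a string so the example is self-contained.

```python
import csv
import io

# Sample table copied from the slide; the paths are placeholders.
SAMPLES_CSV = """\
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
"""


def read_samples(text):
    """Parse a sample table into a list of dicts, one dict per sample (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in the table.
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


samples = read_samples(SAMPLES_CSV)
for sample in samples:
    print(sample["sample_name"], sample["data_source"])
```

Each row becomes one sample whose attributes are the column names, which is the view of a project that the rest of the talk builds on.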

PEP portability features

Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | data_source       |
| ----------- | - | :------: | -------- | ----------------- |
| frog_0h     | 0 | RNA-seq  | frog     | /path/to/frog0.gz |
| frog_1h     | 1 | RNA-seq  | frog     | /path/to/frog1.gz |
| frog_2h     | 2 | RNA-seq  | frog     | /path/to/frog2.gz |
| frog_3h     | 3 | RNA-seq  | frog     | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | data_source  |
| ----------- | - | :------: | -------- | ------------ |
| frog_0h     | 0 | RNA-seq  | frog     | my_samples   |
| frog_1h     | 1 | RNA-seq  | frog     | my_samples   |
| frog_2h     | 2 | RNA-seq  | frog     | my_samples   |
| frog_3h     | 3 | RNA-seq  | frog     | my_samples   |
| crab_0h     | 0 | RNA-seq  | crab     | your_samples |
| crab_3h     | 3 | RNA-seq  | crab     | your_samples |
Project config file:
derived_columns: [data_source]
data_sources:
  my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
  your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
{variable} identifies sample annotation columns
Benefit: Enables distributed files, portability
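
The template expansion described above can be sketched in a few lines of Python. This is an illustration of the idea, not peppy's code: the function name `derive` is hypothetical, and the config is written as plain dicts rather than loaded from YAML.

```python
# Settings mirroring the project config on the slide.
derived_columns = ["data_source"]
data_sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}


def derive(sample, derived_columns, data_sources):
    """Return a copy of the sample with derived columns expanded (hypothetical helper)."""
    out = dict(sample)
    for column in derived_columns:
        template = data_sources.get(out[column])
        if template is not None:
            # {variable} placeholders are filled from the sample's own attributes.
            out[column] = template.format(**out)
    return out


frog = {"sample_name": "frog_0h", "t": 0, "protocol": "RNA-seq",
        "organism": "frog", "data_source": "my_samples"}
crab = {"sample_name": "crab_3h", "t": 3, "protocol": "RNA-seq",
        "organism": "crab", "data_source": "your_samples"}

print(derive(frog, derived_columns, data_sources)["data_source"])
print(derive(crab, derived_columns, data_sources)["data_source"])
```

Because the sample table stores only the key (`my_samples`), moving the data means editing one template in the config rather than every row.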
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism |
| ----------- | :------: | -------- |
| human_1     | RNA-seq  | human    |
| human_2     | RNA-seq  | human    |
| human_3     | RNA-seq  | human    |
| mouse_1     | RNA-seq  | mouse    |
After:
| sample_name | protocol | organism | genome |
| ----------- | :------: | -------- | ------ |
| human_1     | RNA-seq  | human    | hg38   |
| human_2     | RNA-seq  | human    | hg38   |
| human_3     | RNA-seq  | human    | hg38   |
| mouse_1     | RNA-seq  | mouse    | mm10   |
Project config file:
implied_columns:
  organism:
    human:
      genome: hg38
    mouse:
      genome: mm10
Benefit: Separates project-level metadata from sample-level metadata
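
One way to picture the implied-attribute lookup is the sketch below. It assumes the config has already been loaded into a nested dict; the function name `imply` is hypothetical and this is not how peppy is actually written.

```python
# Nested mapping mirroring the implied_columns section on the slide:
# column -> value of that column -> attributes to add.
implied_columns = {
    "organism": {
        "human": {"genome": "hg38"},
        "mouse": {"genome": "mm10"},
    }
}


def imply(sample, implied_columns):
    """Return a copy of the sample with implied attributes added (hypothetical helper)."""
    out = dict(sample)
    for column, by_value in implied_columns.items():
        # Samples whose value has no entry are left unchanged.
        out.update(by_value.get(out.get(column), {}))
    return out


human = {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"}
mouse = {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"}

print(imply(human, implied_columns))
print(imply(mouse, implied_columns))
```

The genome assignment lives in the project config, so the sample table stays free of project-specific choices.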
Subprojects
Define activatable project attributes.
subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
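
Activating a subproject can be thought of as overlaying its settings on the base config. The sketch below illustrates that idea under assumptions: the base `metadata` values are borrowed from the earlier project_config.yaml example, and `deep_update` and `activate` are hypothetical helpers, not peppy functions.

```python
import copy

# Base config plus the subprojects section from the slide.
config = {
    "metadata": {
        "sample_annotation": "/path/to/samples.csv",
        "output_dir": "/path/to/output/folder",
    },
    "subprojects": {
        "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
        "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
    },
}


def deep_update(base, overrides):
    """Recursively overlay overrides onto base, modifying base in place."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def activate(config, name):
    """Return a new config with the named subproject applied (raises KeyError if unknown)."""
    active = copy.deepcopy(config)
    overrides = active.pop("subprojects", {})[name]
    return deep_update(active, overrides)


cancer = activate(config, "cancer")
print(cancer["metadata"]["sample_annotation"])
print(cancer["metadata"]["output_dir"])
```

Only the overridden keys change; everything else, such as `output_dir`, is inherited from the base project.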
How is this portable and encapsulated?

Encapsulated:
A project is an extensible object, with samples and settings as attributes.
Portable:
  1. A project can be moved from one analysis tool to another
  2. A project can be moved from one computing environment to another
peppy package
import peppy

# Project lives in the peppy namespace when the package is imported this way
prj = peppy.Project("pep_config.yaml")
samples = prj.get_samples()

for sample in samples:
    print(sample.name)
    # do further analysis on each sample
Project API
pepr package
library("pepr")

prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)

for (sample in samples) {
	message(pepr::sampleName(sample))
	# do further analysis on each sample
}

Looper

Deploys pipelines across samples by connecting samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
  • maps protocols to pipelines
  • maps sample attributes (columns) to pipeline arguments
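
The two mappings above can be sketched as a small command builder. This is an illustration under assumptions, not looper's implementation: the pipeline interface is written as dicts instead of YAML, `build_command` is a hypothetical helper, and the sample's attribute values are invented for the example.

```python
import shlex

# Pipeline interface from the slide, expressed as Python dicts.
protocol_mappings = {"RNA-seq": "rna-seq"}
pipelines = {
    "rna-seq": {
        "name": "RNA-seq_pipeline",
        "path": "path/to/rna-seq.py",
        "arguments": {
            "--option1": "sample_attribute",
            "--option2": "sample_attribute2",
        },
    }
}


def build_command(sample, protocol_mappings, pipelines):
    """Build a shell command for one sample (hypothetical helper)."""
    # Step 1: the sample's protocol selects the pipeline.
    pipeline = pipelines[protocol_mappings[sample["protocol"]]]
    parts = [pipeline["path"]]
    # Step 2: each argument is filled from the named sample attribute (column).
    for flag, attribute in pipeline["arguments"].items():
        parts += [flag, str(sample[attribute])]
    return shlex.join(parts)


# Illustrative sample; the attribute values are made up.
sample = {
    "sample_name": "frog_0h",
    "protocol": "RNA-seq",
    "sample_attribute": "value1",
    "sample_attribute2": "value2",
}
command = build_command(sample, protocol_mappings, pipelines)
print(command)
```

Running the same function over every sample in a project yields one command per sample, which is the core of the "loop" idea.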
Looper features

    Single-input runs
    Flexible pipelines
    Flexible resources
    Flexible compute
    Job status-aware
    Single-input runs
    Run your entire project with one line:
    looper run project_config.yaml
    Flexible pipelines
    protocol_mappings:
      RRBS: rrbs
      WGBS: wgbs
      EG: wgbs.py
      SMART-seq: rnaBitSeq -f; rnaTopHat -f
      ATAC-SEQ: atacseq
      DNase-seq: atacseq
      CHIP-SEQ: chipseq
    Many-to-many mappings
    Flexible resources
    pipeline_key:
      name: pipeline_name
      arguments:
        "--option" : value
      resources:
        default:
          file_size: "0"
          cores: "2"
          mem: "6000"
          time: "01:00:00"
        large_input:
          file_size: "2000"
          cores: "4"
          mem: "12000"
          time: "08:00:00"
    Resources can vary by input file size
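
A plausible reading of the resources section is that each package sets a minimum input size, and the package with the largest threshold not exceeding the input is chosen. The sketch below implements that assumed rule; `choose_resources` is a hypothetical helper, and the units of `file_size` are whatever the config uses (the slide does not state them).

```python
# Resource packages from the slide, expressed as Python dicts.
resources = {
    "default": {"file_size": "0", "cores": "2", "mem": "6000", "time": "01:00:00"},
    "large_input": {"file_size": "2000", "cores": "4", "mem": "12000", "time": "08:00:00"},
}


def choose_resources(input_size, resources):
    """Pick the package with the largest file_size threshold <= input_size (assumed rule)."""
    eligible = [
        (float(package["file_size"]), name)
        for name, package in resources.items()
        if float(package["file_size"]) <= input_size
    ]
    if not eligible:
        raise ValueError("no resource package accepts this input size")
    _, name = max(eligible)
    return name, resources[name]


for size in (500, 3500):
    name, package = choose_resources(size, resources)
    print(size, name, package["cores"], package["mem"], package["time"])
```

With a `default` package whose threshold is 0, every input matches at least one package, so small samples do not reserve large allocations.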
    Flexible compute
    compute:
      slurm:
        submission_template: templates/slurm_template.sub
        submission_command: sbatch
      localhost:
        submission_template: templates/localhost_template.sub
        submission_command: sh

    Adjust compute package on-the-fly:
    > looper run project_config.yaml --compute localhost
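
Switching compute packages amounts to looking up a different template and submission command. The sketch below shows only that lookup, under assumptions: `submission_argv` is a hypothetical helper, `job.sub` is a made-up script name, and filling in the template is omitted because the template contents are not shown in the talk.

```python
# Compute packages from the slide, expressed as Python dicts.
compute = {
    "slurm": {
        "submission_template": "templates/slurm_template.sub",
        "submission_command": "sbatch",
    },
    "localhost": {
        "submission_template": "templates/localhost_template.sub",
        "submission_command": "sh",
    },
}


def submission_argv(package, script_path, compute):
    """Return the argv that would submit a job script under the chosen package."""
    settings = compute[package]
    return [settings["submission_command"], script_path]


# "job.sub" is a hypothetical, already-rendered submission script.
print(submission_argv("slurm", "job.sub", compute))
print(submission_argv("localhost", "job.sub", compute))
```

The project and pipeline definitions stay the same; only the package name changes between a cluster and a laptop.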
    Job status-aware
    Looper only submits jobs for samples not already flagged as running, completed, or failed.
    looper check project_config.yaml
    looper summarize project_config.yaml
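
The submission rule above can be sketched as a simple filter. How looper actually records status is not described in the talk, so the sketch assumes a plain dict from sample name to status; `samples_to_submit` is a hypothetical helper.

```python
# Statuses that block resubmission, as listed on the slide.
SKIP_STATUSES = {"running", "completed", "failed"}


def samples_to_submit(sample_names, statuses):
    """Keep only samples that carry none of the blocking statuses."""
    return [name for name in sample_names if statuses.get(name) not in SKIP_STATUSES]


# Illustrative state: three samples already have a status, one has none.
names = ["frog_0h", "frog_1h", "frog_2h", "frog_3h"]
statuses = {"frog_0h": "completed", "frog_1h": "running", "frog_2h": "failed"}

pending = samples_to_submit(names, statuses)
print(pending)
```

This makes rerunning the same command safe: only samples without a recorded status are submitted again.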

    Conclusion

    • The PEP format is a novel approach to standardizing projects.
    • Initial tools like geofetch and looper build PEP projects and connect them to pipelines.
    • Python and R packages provide a universal interface to PEP metadata for tools and analysis.
    More information at pepkit.github.io.

    Thank You


    nsheff · databio.org · nsheffield@virginia.edu