Tools for epigenome analysis of genomic regions and data-intensive project management

Nathan Sheffield, PhD

www.databio.org/slides

Overview

LOLA
Locus Overlap Analysis

MIRA
Methylation-based Inference of Regulatory Activity

PEP
Portable Encapsulated Projects

Locus Overlap Analysis

http://code.databio.org/LOLA/

Sheffield and Bock (2016). Bioinformatics.

Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

A shiny app and server for interactive LOLA analysis.
Public server: http://lolaweb.databio.org
GitHub: https://github.com/databio/LOLAweb

DEMO

Methylation-based Inference of Regulatory Activity (MIRA)

http://code.databio.org/MIRA/

Lawson et al. (2018). Bioinformatics.

DNA methylation

Bisulfite-seq

Region pooling

Sheffield et al. (2017). Nature Medicine.

Organizing large-scale biological data around standardized projects

Data is becoming more...

abundant

available

powerful

So why are the world's problems not solved?

First step in bioinformatics analysis:

pipeline

Papers with "bioinformatics pipeline" in title

Problem solved?

Data munging

Then, downstream tools need a different organization

What if?

PEP: Portable Encapsulated Projects

PEP format

project_config.yaml

metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder

samples.csv

sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
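As a rough illustration of how little machinery the sample annotation needs, the table above can be read with Python's standard `csv` module. This is only a sketch: the CSV text is inlined rather than read from `/path/to/samples.csv`, and `skipinitialspace=True` is used because the slide's example has a space after each comma.

```python
import csv
import io

# The sample annotation from the slide, inlined so the example is self-contained.
samples_csv = """sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
"""

# Each row becomes a dict keyed by the header line: one dict per sample.
reader = csv.DictReader(io.StringIO(samples_csv), skipinitialspace=True)
samples = list(reader)

for sample in samples:
    print(sample["sample_name"], sample["data_source"])
```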

PEP portability features

Derived attributes

Implied attributes

Subprojects

Derived attributes

Automatically build new sample attributes from existing attributes.

Without derived attribute:



| sample_name   | t    | protocol        | organism | data_source            |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | /path/to/frog0.gz      |
| frog_1h       | 1    | RNA-seq         | frog     | /path/to/frog1.gz      |
| frog_2h       | 2    | RNA-seq         | frog     | /path/to/frog2.gz      |
| frog_3h       | 3    | RNA-seq         | frog     | /path/to/frog3.gz      |

Using derived attribute:



| sample_name   | t    | protocol        | organism | data_source            |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | my_samples             |
| frog_1h       | 1    | RNA-seq         | frog     | my_samples             |
| frog_2h       | 2    | RNA-seq         | frog     | my_samples             |
| frog_3h       | 3    | RNA-seq         | frog     | my_samples             |
| crab_0h       | 0    | RNA-seq         | crab     | your_samples           |
| crab_3h       | 3    | RNA-seq         | crab     | your_samples           |



Project config file:

derived_columns: [data_source]
data_sources:
  my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
  your_samples: "/path/to/your/samples/{organism}_{t}h.gz"

{variable} identifies sample annotation columns

Benefit: Enables distributed files, portability
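The substitution described above can be sketched in plain Python. This is a minimal illustration, not peppy's implementation: the `derive` helper and the in-memory sample dictionaries are hypothetical, and the templates are the ones from the config above.

```python
# Hypothetical sketch of derived-attribute expansion; not peppy's actual code.
data_sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

samples = [
    {"sample_name": "frog_0h", "t": 0, "organism": "frog", "data_source": "my_samples"},
    {"sample_name": "crab_3h", "t": 3, "organism": "crab", "data_source": "your_samples"},
]


def derive(sample, derived_columns, sources):
    """Return a copy of `sample` with each derived column expanded to a path."""
    out = dict(sample)
    for column in derived_columns:
        template = sources[sample[column]]
        # {variable} placeholders are filled from the sample's own attributes.
        out[column] = template.format(**sample)
    return out


derived = [derive(s, ["data_source"], data_sources) for s in samples]
for s in derived:
    print(s["sample_name"], s["data_source"])
```

Because the file locations live in the project config rather than in the sample table, moving the data only requires editing the two templates.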

Implied attributes

Add new sample attributes conditioned on values of existing attributes

Before:



| sample_name   | protocol        | organism | 
| ------------- | :-------------: | -------- | 
| human_1       | RNA-seq         | human    | 
| human_2       | RNA-seq         | human    | 
| human_3       | RNA-seq         | human    | 
| mouse_1       | RNA-seq         | mouse    |

After:



| sample_name   | protocol        | organism | genome | 
| ------------- | :-------------: | -------- | ------ |
| human_1       | RNA-seq         | human    | hg38   |
| human_2       | RNA-seq         | human    | hg38   |
| human_3       | RNA-seq         | human    | hg38   |
| mouse_1       | RNA-seq         | mouse    | mm10   |



Project config file:

implied_columns:
  organism:
    human:
      genome: hg38
    mouse:
      genome: mm10

Benefit: Divides project from sample metadata
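The lookup behind implied attributes can be sketched as a nested dictionary keyed first by column and then by value. This is an illustrative sketch only: the `imply` helper is hypothetical and is not how peppy is necessarily implemented.

```python
# Hypothetical sketch of implied-attribute expansion; not peppy's actual code.
implied_columns = {
    "organism": {
        "human": {"genome": "hg38"},
        "mouse": {"genome": "mm10"},
    }
}

samples = [
    {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
    {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"},
]


def imply(sample, rules):
    """Return a copy of `sample` with attributes implied by its existing values."""
    out = dict(sample)
    for column, by_value in rules.items():
        # Look up the sample's value for this column; unmatched values add nothing.
        out.update(by_value.get(sample.get(column), {}))
    return out


implied = [imply(s, implied_columns) for s in samples]
for s in implied:
    print(s["sample_name"], s["genome"])
```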

Subprojects

Define activatable project attributes.

subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv

Benefit: Defines multiple similar projects in a single file
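One way to think about activating a subproject is as an overlay on the base config. The sketch below assumes that subproject values replace same-named base values and leave everything else intact; the `activate` and `deep_update` helpers and the base `metadata` values are hypothetical.

```python
import copy

# Hypothetical base config; the top-level metadata values are placeholders.
config = {
    "metadata": {
        "sample_annotation": "/path/to/samples.csv",
        "output_dir": "/path/to/output/folder",
    },
    "subprojects": {
        "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
        "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
    },
}


def deep_update(base, overrides):
    """Recursively overlay `overrides` onto `base`, in place."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def activate(cfg, name):
    """Return a new config with subproject `name` applied over the base settings."""
    active = copy.deepcopy({k: v for k, v in cfg.items() if k != "subprojects"})
    return deep_update(active, cfg["subprojects"][name])


cancer = activate(config, "cancer")
print(cancer["metadata"])
```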

How is this portable and encapsulated?

Encapsulated:

A project is an extensible object, with samples and settings as attributes.

Portable:

A project can be moved from one analysis tool to another.
A project can be moved from one computing environment to another.

peppy package

import peppy

# Load the project from its PEP config file
prj = peppy.Project("pep_config.yaml")
samples = prj.samples

for sample in samples:
    print(sample.name)
    # do further analysis on each sample

Project API

pepr package

library("pepr")

prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)

for (sample in samples) {
	message(pepr::sampleName(sample))
	# do further analysis on each sample
}

Looper

Deploys pipelines across samples by connecting samples to any command-line tool

https://looper.databio.org

pipeline_interface.yaml

protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2

maps protocols to pipelines

maps sample attributes (columns) to pipeline arguments
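The attribute-to-argument mapping can be sketched as a small command builder. This is an assumption-laden illustration rather than looper's code: the `build_command` helper and the sample's attribute values are made up, and only the `path` and `arguments` keys from the interface above are used.

```python
import shlex

pipeline = {
    "name": "RNA-seq_pipeline",
    "path": "path/to/rna-seq.py",
    "arguments": {
        "--option1": "sample_attribute",
        "--option2": "sample_attribute2",
    },
}

# Hypothetical sample; the attribute values are made up for illustration.
sample = {
    "sample_name": "frog_0h",
    "sample_attribute": "value one",
    "sample_attribute2": "value2",
}


def build_command(pipeline, sample):
    """Assemble a shell command from the pipeline path and mapped sample attributes."""
    parts = [pipeline["path"]]
    for option, attribute in pipeline["arguments"].items():
        parts += [option, str(sample[attribute])]
    # Quote each part so values containing spaces survive the shell.
    return " ".join(shlex.quote(p) for p in parts)


command = build_command(pipeline, sample)
print(command)
```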

Looper features

Single-input runs

Flexible pipelines

Flexible resources

Flexible compute

Job status-aware

Single-input runs
Run your entire project with one line:

looper run project_config.yaml

Flexible pipelines

protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq

Many-to-many mappings
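To make the many-to-many relationship concrete, the sketch below splits each mapping value on `;` and then inverts the table. It assumes the semicolon separates pipelines, as the SMART-seq entry suggests; `pipelines_for` is a hypothetical helper, and looper itself may resolve protocol names differently.

```python
from collections import defaultdict

protocol_mappings = {
    "RRBS": "rrbs",
    "WGBS": "wgbs",
    "EG": "wgbs.py",
    "SMART-seq": "rnaBitSeq -f; rnaTopHat -f",
    "ATAC-SEQ": "atacseq",
    "DNase-seq": "atacseq",
    "CHIP-SEQ": "chipseq",
}


def pipelines_for(protocol, mappings):
    """Split a mapping value on ';' into one entry per pipeline."""
    return [entry.strip() for entry in mappings[protocol].split(";") if entry.strip()]


# Invert the mapping to list the protocols served by each pipeline entry.
protocols_by_pipeline = defaultdict(list)
for protocol in protocol_mappings:
    for entry in pipelines_for(protocol, protocol_mappings):
        protocols_by_pipeline[entry].append(protocol)

# One protocol can map to several pipelines...
print(pipelines_for("SMART-seq", protocol_mappings))
# ...and several protocols can share one pipeline.
print(protocols_by_pipeline["atacseq"])
```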

Flexible resources

pipeline_key:
  name: pipeline_name
  arguments:
    "--option" : value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"

Resources can vary by input file size
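A plausible selection rule is to pick the package with the largest `file_size` threshold that the input still meets. The sketch below assumes exactly that rule and assumes the input size is given in the same units as `file_size` (the slide does not state them); `choose_resources` is a hypothetical helper, not looper's API.

```python
resources = {
    "default": {"file_size": "0", "cores": "2", "mem": "6000", "time": "01:00:00"},
    "large_input": {"file_size": "2000", "cores": "4", "mem": "12000", "time": "08:00:00"},
}


def choose_resources(input_size, packages):
    """Pick the package with the largest file_size threshold not exceeding input_size."""
    eligible = [
        (float(pkg["file_size"]), name)
        for name, pkg in packages.items()
        if float(pkg["file_size"]) <= input_size
    ]
    if not eligible:
        raise ValueError("no resource package covers this input size")
    _, name = max(eligible)
    return name, packages[name]


small_name, _ = choose_resources(150, resources)
large_name, large = choose_resources(3500, resources)
print(small_name, large_name, large["cores"])
```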

Flexible compute

compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh

Adjust compute package on-the-fly:

looper run project_config.yaml --compute localhost

Job status-aware
Looper only submits jobs for samples not already flagged as running, completed, or failed.

looper check project_config.yaml

looper summarize project_config.yaml
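The status check can be pictured as a scan for flag files in each sample's output folder. This is a sketch under stated assumptions: the `<pipeline>_<status>.flag` naming and both helper functions are hypothetical, and a temporary directory stands in for the project's output folder.

```python
import tempfile
from pathlib import Path

# Statuses named on the slide; the "<status>.flag" file naming is an assumption.
SKIP_STATUSES = ("running", "completed", "failed")


def sample_status(sample_dir):
    """Return the first matching status flag found in `sample_dir`, else None."""
    for status in SKIP_STATUSES:
        if any(Path(sample_dir).glob(f"*{status}.flag")):
            return status
    return None


def samples_to_submit(output_dir, sample_names):
    """Keep only samples with no status flag in their output folder."""
    return [n for n in sample_names if sample_status(Path(output_dir) / n) is None]


with tempfile.TemporaryDirectory() as out:
    for name in ("frog_0h", "frog_1h", "frog_2h"):
        (Path(out) / name).mkdir()
    # Pretend frog_0h already finished and frog_1h failed.
    (Path(out) / "frog_0h" / "rna-seq_completed.flag").touch()
    (Path(out) / "frog_1h" / "rna-seq_failed.flag").touch()
    to_submit = samples_to_submit(out, ["frog_0h", "frog_1h", "frog_2h"])

print(to_submit)
```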

Conclusion

The PEP format is a novel approach to standardizing projects.
Initial tools like geofetch and looper build PEP projects and connect them to pipelines.
Python and R packages provide a universal interface to PEP metadata for tools and analysis.

More information at pepkit.github.io.

Thank You

nsheff · databio.org · nsheffield@virginia.edu