Tools for epigenome analysis of genomic regions and data-intensive project management

Nathan Sheffield, PhD
www.databio.org/slides

Overview

LOLA
Locus Overlap Analysis
MIRA
Methylation-based Inference of Regulatory Activity
PEP
Portable Encapsulated Projects




www.databio.org/slides

Locus Overlap Analysis

Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

A shiny app and server for interactive LOLA analysis.
Public server: http://lolaweb.databio.org
GitHub: https://github.com/databio/LOLAweb

DEMO

Methylation-based Inference of Regulatory Activity (MIRA)

Lawson et al. (2018). Bioinformatics.

DNA methylation

DNA methylation

Bisulfite-seq

Region pooling


Sheffield et al. (2017). Nature Medicine.

Organizing large-scale biological data around standardized projects

Data is becoming more...

abundant
available
powerful
So why are the world's problems not solved?


First step in bioinformatics analysis:

pipeline

Papers with
"bioinformatics pipeline"
in title
Problem solved?
Problem solved?
Problem solved?

Data munging

Then, downstream tools need a different organization

What if?

PEP: Portable Encapsulated Projects

PEP format

Start with a simple CSV with tabular data.
samples.csv
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz

PEP format

Add a YAML for project-level data.
samples.csv
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz

project_config.yaml
sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value

Add programmatic sample and project modifiers.

Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | input_file | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz | | frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz | | frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz | | frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | input_file | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | my_samples | | frog_1h | 1 | RNA-seq | frog | my_samples | | frog_2h | 2 | RNA-seq | frog | my_samples | | frog_3h | 3 | RNA-seq | frog | my_samples | | crab_0h | 0 | RNA-seq | crab | your_samples | | crab_3h | 3 | RNA-seq | crab | your_samples |
| sample_name | t | protocol | organism | input_file | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | my_samples | | frog_1h | 1 | RNA-seq | frog | my_samples | | frog_2h | 2 | RNA-seq | frog | my_samples | | frog_3h | 3 | RNA-seq | frog | my_samples | | crab_0h | 0 | RNA-seq | crab | your_samples | | crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
{variable} identifies sample annotation columns
Benefit: Enables distributed files, portability
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism | | ------------- | :-------------: | -------- | | human_1 | RNA-seq | human | | human_2 | RNA-seq | human | | human_3 | RNA-seq | human | | mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome | | ------------- | :-------------: | -------- | ------ | | human_1 | RNA-seq | human | hg38 | | human_2 | RNA-seq | human | hg38 | | human_3 | RNA-seq | human | hg38 | | mouse_1 | RNA-seq | mouse | mm10 |
| sample_name | protocol | organism | | ------------- | :-------------: | -------- | | human_1 | RNA-seq | human | | human_2 | RNA-seq | human | | human_3 | RNA-seq | human | | mouse_1 | RNA-seq | mouse |
Project config file:
sample_modifiers:
  imply:
    - if: 
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10
Benefit: Divides project from sample metadata
Subprojects
Define activatable project attributes.
project_modifiers:
  amendments:
    diverse:
      metadata:
        sample_annotation: psa_rrbs_diverse.csv
    cancer:
      metadata:
        sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
peppy package
import peppy

prj = Project("pep_config.yaml")
samples = prj.get_samples()

for sample in samples:
	print(sample.name)
	# do further analysis to each sample
Project API
pepr package
library("pepr")

prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)

for (sample in samples) {
	message(pepr::sampleName(sample))
	# do further analysis to each sample
	}

Looper

Deploys pipelines across samples by connecting
samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
  • maps protocols to pipelines
  • maps sample attributes (columns) to pipeline arguments
  • Looper features

    Single-input runs
    Flexible pipelines
    Flexible resources
    Flexible compute
    Job status-aware
    Single-input runs
    Run your entire project with one line:
    looper run project_config.yaml
    Flexible pipelines
    protocol_mappings:
      RRBS: rrbs
      WGBS: wgbs
      EG: wgbs.py
      SMART-seq: rnaBitSeq -f; rnaTopHat -f
      ATAC-SEQ: atacseq
      DNase-seq: atacseq
      CHIP-SEQ: chipseq
    Many-to-many mappings
    Flexible resources
    pipeline_key:
      name: pipeline_name
      arguments:
        "--option" : value
      resources:
        default:
          file_size: "0"
          cores: "2"
          mem: "6000"
          time: "01:00:00"
        large_input:
          file_size: "2000"
          cores: "4"
          mem: "12000"
          time: "08:00:00"
    Resources can vary by input file size
    Flexible compute
    compute:
      slurm:
        submission_template: templates/slurm_template.sub
        submission_command: sbatch
      localhost:
        submission_template: templates/localhost_template.sub
        submission_command: sh

    Adjust compute package on-the-fly:
    > looper run project_config.yaml --compute localhost
    Job status-aware
    Looper only submits jobs for samples not already flagged as running, completed, or failed.
    looper check project_config.yaml
    looper summarize project_config.yaml

    Conclusion

    • PEP format is a novel approach to standardize projects.
    • Initial tools like geofetch and looper build PEP projects and connect them to pipelines
    • Python and R packages provide a universal interface to PEP metadata for tools and analysis
    More information at pepkit.github.io.

    Thank You


    nsheff · databio.org · nsheffield@virginia.edu