Tools for epigenome analysis of genomic regions and data-intensive project management


Nathan Sheffield, PhD
Center for Public Health Genomics

Overview

LOLA
Locus Overlap Analysis
MIRA
Methylation-based Inference of Regulatory Activity
PEP
Portable Encapsulated Projects




These slides are posted at www.databio.org/slides

Locus Overlap Analysis (LOLA)

Sheffield and Bock (2016). Bioinformatics.

LOLA Web

A Shiny app and server for interactive LOLA analysis.
Public server: http://lola.databio.org/shinyLOLA
GitHub: https://github.com/databio/shinyLOLA

DEMO

Methylation-based Inference of Regulatory Activity (MIRA)

Lawson et al. (2017). Bioinformatics (accepted).

DNA methylation

DNA methylation

Bisulfite-seq

Region pooling


Sheffield et al. (2017). Nature Medicine.

Organizing large-scale biological data around standardized projects

Data is becoming more...

abundant
available
powerful
So why are the world's problems not solved?


First step in bioinformatics analysis:

pipeline

Papers with "bioinformatics pipeline" in the title
Problem solved?

Data munging

What if?

PEP format

PEP format

A modern structure for organizing large-scale,
sample-intensive biological research projects
Home & Documentation: http://pepkit.github.io/
GitHub: http://github.com/pepkit

PEP format

project_config.yaml
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder

samples.csv
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
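
As a rough sketch of what any tool consuming this pair of files does (illustrative only, not the pep package implementation shown later; assumes PyYAML is installed):

# Illustrative sketch: read the project config, then load the sample
# annotation sheet it points to. Not the actual pep package code.
import csv
import yaml

with open("project_config.yaml") as cfg_file:
    config = yaml.safe_load(cfg_file)

with open(config["metadata"]["sample_annotation"]) as csv_file:
    # skipinitialspace tolerates the spaces after commas shown above
    samples = list(csv.DictReader(csv_file, skipinitialspace=True))

for sample in samples:
    print(sample["sample_name"], sample["data_source"])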

PEP features

Derived columns enable distributed files and portability
Implied columns divide project and sample metadata
Subprojects store related concepts in one file

Derived columns enable distributed files and portability
derived_columns: [data_source]
data_sources:
  my_samples: "${RAWDATA}/{organism}_{t}h.gz"

{variable} identifies sample annotation columns

Without derived column:

| sample_name | t | protocol | organism | data_source       |
| ----------- | - | :------: | -------- | ----------------- |
| frog_0h     | 0 | RNA-seq  | frog     | /path/to/frog0.gz |
| frog_1h     | 1 | RNA-seq  | frog     | /path/to/frog1.gz |
| frog_2h     | 2 | RNA-seq  | frog     | /path/to/frog2.gz |
| frog_3h     | 3 | RNA-seq  | frog     | /path/to/frog3.gz |

Using derived column plus distributed data:

| sample_name | t | protocol | organism | data_source  |
| ----------- | - | :------: | -------- | ------------ |
| frog_0h     | 0 | RNA-seq  | frog     | my_samples   |
| frog_1h     | 1 | RNA-seq  | frog     | my_samples   |
| frog_2h     | 2 | RNA-seq  | frog     | my_samples   |
| frog_3h     | 3 | RNA-seq  | frog     | my_samples   |
| crab_0h     | 0 | RNA-seq  | crab     | your_samples |
| crab_3h     | 3 | RNA-seq  | crab     | your_samples |

Project config file:

derived_columns: [data_source]
data_sources:
  my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
  your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
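
A hypothetical sketch of how a derived column is resolved (illustrative, not the actual pep implementation): the sample's data_source value selects a template, and {organism} and {t} are filled from that sample's own columns.

# Illustrative derived-column resolution (not the real pep code)
data_sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

sample = {"sample_name": "frog_0h", "t": "0", "organism": "frog",
          "data_source": "my_samples"}

# the data_source value picks a template; {organism} and {t} come from the sample row
resolved = data_sources[sample["data_source"]].format(**sample)
print(resolved)  # /path/to/my/samples/frog_0h.gz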
Implied columns divide project and sample metadata
implied_columns:
  organism:
    human:
      genome: hg38
      macs_genome_size: "hs"

Every sample with `human` in its `organism` column gains a `genome` attribute set to `hg38`, and so on.
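
A minimal sketch of the idea (illustrative only, not the actual pep implementation): any sample whose organism matches a key under implied_columns gains the attributes listed for that key.

# Illustrative sketch of implied columns
implied_columns = {
    "organism": {
        "human": {"genome": "hg38", "macs_genome_size": "hs"},
    }
}

sample = {"sample_name": "sample_1", "organism": "human"}

# add the implied attributes for any matching column value
for column, value_map in implied_columns.items():
    sample.update(value_map.get(sample.get(column), {}))

print(sample["genome"])  # hg38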
Subprojects
subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv

Hierarchical replacement
Lets you define multiple projects in a single file

looper run project_config.yaml --sp cancer
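
A sketch of the hierarchical replacement idea (the merge logic and the extra config values here are illustrative assumptions): activating a subproject overrides only the keys it declares, and everything else is inherited from the main config.

# Illustrative sketch of subproject activation; extra keys are examples only
config = {
    "metadata": {"sample_annotation": "psa_rrbs_diverse.csv",
                 "output_dir": "/path/to/output"},
    "subprojects": {
        "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
    },
}

def activate_subproject(config, name):
    # overwrite only the keys the subproject declares; inherit the rest
    for section, values in config["subprojects"][name].items():
        config.setdefault(section, {}).update(values)
    return config

activate_subproject(config, "cancer")
print(config["metadata"]["sample_annotation"])  # psa_rrbs_intracancer.csv
print(config["metadata"]["output_dir"])         # inherited unchanged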
How is this portable and encapsulated?
Encapsulated: The vision of a project as an extensible object, with samples, configurations, etc. as members of the Project object.
Portable in two senses:
  1. A project can easily be moved from one analysis tool to another
  2. A project can be moved from one computing environment to another

geofetch

Connects the Gene Expression Omnibus (GEO)
and Sequence Read Archive (SRA)
with PEP format

$ python geofetch.py -i GSE502503
pep package
import pep

prj = pep.Project("pep_config.yaml")
samples = prj.get_samples()

for sample in samples:
    print(sample.name)
    # do further analysis on each sample
Project API
pepr package
library("pepr")

prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)

for (sample in samples) {
	message(pepr::sampleName(sample))
	# do further analysis to each sample
	}

Looper

Connects samples to any command-line tool
Deploys pipelines across samples
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq.py

pipelines:
  rna-seq.py:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": value_attribute
      "--option2": value_attribute2
  • maps protocols to pipelines
  • maps sample attributes (columns) to pipeline arguments (see the sketch below)
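
Conceptually, that mapping might look like this (an illustrative sketch, not looper's actual code): the sample's protocol selects a pipeline, and each argument flag is filled with the named attribute from the sample.

# Illustrative sketch of interface-driven command construction (not looper's code)
pipeline_interface = {
    "protocol_mappings": {"RNA-seq": "rna-seq.py"},
    "pipelines": {
        "rna-seq.py": {
            "name": "RNA-seq_pipeline",
            "path": "path/to/rna-seq.py",
            "arguments": {"--option1": "value_attribute",
                          "--option2": "value_attribute2"},
        }
    },
}

# hypothetical sample carrying the attributes named in the interface
sample = {"protocol": "RNA-seq", "value_attribute": "frog_0h.gz",
          "value_attribute2": "frog"}

pipeline = pipeline_interface["pipelines"][
    pipeline_interface["protocol_mappings"][sample["protocol"]]]
args = " ".join("{} {}".format(flag, sample[attr])
                for flag, attr in pipeline["arguments"].items())
print("python {} {}".format(pipeline["path"], args))
# python path/to/rna-seq.py --option1 frog_0h.gz --option2 frog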

Looper features

Single-input runs
Flexible pipelines
Flexible resources
Flexible compute
Job status-aware

Single-input runs

Run your entire project with one line:
looper run project_config.yaml
Flexible pipelines

protocol_mappings:
  RRBS: rrbs.py
  WGBS: wgbs.py
  EG: wgbs.py
  SMART-seq: >
    rnaBitSeq.py -f;
    rnaTopHat.py -f
  ATAC-SEQ: atacseq.py
  CHIP-SEQ: chipseq.py

Many-to-many mappings
Flexible resources

pipeline_script:
  name: pipeline_name
  arguments:
    "--option": value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"

Resources can vary by input file size
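
One way this selection could work (a sketch under assumed semantics, not looper's exact logic): choose the package with the largest file_size threshold that the sample's input still meets, falling back to the default package.

# Sketch of size-based resource selection; thresholds are taken from the slide,
# but the selection rule itself is an assumption, not looper's implementation.
resources = {
    "default":     {"file_size": 0,    "cores": 2, "mem": 6000,  "time": "01:00:00"},
    "large_input": {"file_size": 2000, "cores": 4, "mem": 12000, "time": "08:00:00"},
}

def pick_package(input_size):
    # keep packages whose threshold the input meets, then take the largest threshold
    eligible = [pkg for pkg in resources.values() if input_size >= pkg["file_size"]]
    return max(eligible, key=lambda pkg: pkg["file_size"])

print(pick_package(500)["cores"])   # 2 -> default package
print(pick_package(3000)["cores"])  # 4 -> large_input package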
Flexible compute

compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh

Adjust compute package on-the-fly:
> looper run project_config.yaml --compute localhost
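
Behind each compute package is a submission template. A rough sketch of the idea (the template text and field names here are hypothetical, not the actual templates shipped with looper): job variables are substituted into the template, and the resulting script is handed to the package's submission_command.

# Hypothetical submission template and substitution; field names are illustrative
slurm_template = """#!/bin/bash
#SBATCH --job-name={JOBNAME}
#SBATCH --cpus-per-task={CORES}
#SBATCH --mem={MEM}
#SBATCH --time={TIME}
{CODE}
"""

job = {"JOBNAME": "frog_0h_rnaseq", "CORES": 4, "MEM": 12000,
       "TIME": "08:00:00",
       "CODE": "python path/to/rna-seq.py --option1 frog_0h.gz"}

script = slurm_template.format(**job)
print(script)  # this script would be passed to sbatch (or to sh for localhost)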
Job status-aware

Looper only submits jobs for samples not already flagged as running, completed, or failed.

looper summarize project_config.yaml
looper check project_config.yaml

Conclusion

• PEP format is a novel approach to standardizing projects.
• Initial tools like geofetch and looper build PEP projects and connect them to pipelines.
• Python and R packages provide a universal interface to PEP metadata for tools and analysis.

More information at pepkit.github.io.

Thank You

Sheffield lab
John Lawson
Vince Reuter

Christoph Bock

Heinrich Kovar
Eleni Tomazou

databio.org
github.com/nsheff