Tools for epigenome analysis of genomic regions and data-intensive project management
Nathan Sheffield, PhD
www.databio.org/slides
Overview
LOLA
Locus Overlap Analysis
MIRA
Methylation-based Inference of Regulatory Activity
PEP
Portable Encapsulated Projects
Locus Overlap Analysis
Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.
DEMO
Methylation-based Inference of Regulatory Activity (MIRA)
Lawson et al. (2018). Bioinformatics.
DNA methylation
Bisulfite-seq
Region pooling
Sheffield et al. (2017). Nature Medicine.
Organizing large-scale biological data around standardized projects
Data is becoming more...
abundant
available
powerful
So why are the world's problems not solved?
First step in bioinformatics analysis: pipeline
Papers with "bioinformatics pipeline" in title
Problem solved?
Then, downstream tools need a different organization
PEP: Portable Encapsulated Projects
PEP format
Start with a simple CSV with tabular data.
samples.csv
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
Add a YAML for project-level data.
project_config.yaml
sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value
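As a rough illustration of why a plain CSV is a convenient starting point, the sample table can be read with nothing but the Python standard library. This is only a sketch: `read_sample_table` is a hypothetical helper, not part of any PEP tool, and the table is inlined so the example runs without files.

```python
import csv
import io

# Inline copy of the samples.csv table from the slides.
SAMPLES_CSV = """sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
"""


def read_sample_table(handle):
    """Return one dict per sample row, keyed by the CSV header (hypothetical helper)."""
    return list(csv.DictReader(handle))


samples = read_sample_table(io.StringIO(SAMPLES_CSV))
for sample in samples:
    print(sample["sample_name"], sample["input_file"])
```

Each row becomes a dictionary of sample attributes, which is the shape the modifier sketches later in the deck assume.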
Add programmatic sample and project modifiers.
Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | input_file |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | input_file |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
{variable} identifies sample annotation columns
Benefit: Enables distributed files, portability
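The derive step can be pictured as a template lookup followed by string formatting. The sketch below is an assumption about the mechanics, not the peppy implementation; `derive_attributes` is a hypothetical name, and the sample dictionaries mirror rows from the table above.

```python
def derive_attributes(sample, attributes, sources):
    """Return a copy of `sample` with derived attributes expanded.

    If an attribute's value is a key in `sources`, it is replaced by the
    matching template, filled in with the sample's own attributes.
    """
    derived = dict(sample)
    for attr in attributes:
        key = derived.get(attr)
        if key in sources:
            derived[attr] = sources[key].format(**sample)
    return derived


sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}
frog = {"sample_name": "frog_0h", "t": "0", "protocol": "RNA-seq",
        "organism": "frog", "input_file": "my_samples"}
crab = {"sample_name": "crab_3h", "t": "3", "protocol": "RNA-seq",
        "organism": "crab", "input_file": "your_samples"}

for s in (frog, crab):
    print(derive_attributes(s, ["input_file"], sources)["input_file"])
# /path/to/my/samples/frog_0h.gz
# /path/to/your/samples/crab_3h.gz
```

Moving the data to a new location then means editing one template in the config, not every row of the table.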
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism |
| ------------- | :-------------: | -------- |
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome |
| ------------- | :-------------: | -------- | ------ |
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |
Project config file:
sample_modifiers:
  imply:
    - if:
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10
Benefit: Divides project from sample metadata
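One way to read the imply rules is as a list of if/then pairs applied to each sample in turn. The following is a minimal sketch under that assumption; `imply_attributes` is a hypothetical function, not the peppy API.

```python
def imply_attributes(sample, rules):
    """Return a copy of `sample` with attributes added by every matching rule."""
    implied = dict(sample)
    for rule in rules:
        # A rule fires only when every "if" condition matches the sample.
        if all(implied.get(key) == value for key, value in rule["if"].items()):
            implied.update(rule["then"])
    return implied


rules = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
]

human = {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"}
mouse = {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"}

print(imply_attributes(human, rules)["genome"])  # hg38
print(imply_attributes(mouse, rules)["genome"])  # mm10
```

A sample that matches no rule is returned unchanged, so the sample table never needs a genome column.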
Subprojects
Define activatable project attributes.
project_modifiers:
  amendments:
    diverse:
      metadata:
        sample_annotation: psa_rrbs_diverse.csv
    cancer:
      metadata:
        sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
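Activating a subproject can be thought of as overlaying its block onto the base configuration. The sketch below assumes a recursive dictionary merge; `deep_update` and `activate` are hypothetical helpers, and the base `metadata` values (including `psa_rrbs.csv`) are invented for illustration.

```python
import copy


def deep_update(base, overrides):
    """Return a new dict in which `overrides` is merged recursively into `base`."""
    result = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_update(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def activate(config, name):
    """Apply the named amendment on top of the base project config."""
    amendments = config.get("project_modifiers", {}).get("amendments", {})
    return deep_update(config, amendments[name])


config = {
    "metadata": {
        "sample_annotation": "psa_rrbs.csv",  # hypothetical default table
        "output_dir": "/path/to/output/folder",
    },
    "project_modifiers": {
        "amendments": {
            "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
            "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
        }
    },
}

cancer = activate(config, "cancer")
print(cancer["metadata"]["sample_annotation"])  # psa_rrbs_intracancer.csv
print(cancer["metadata"]["output_dir"])         # /path/to/output/folder
```

Only the overridden key changes; everything else is inherited from the base project.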
peppy package
import peppy
prj = peppy.Project("pep_config.yaml")
# prj.samples holds one Sample object per row of the sample table
for sample in prj.samples:
    print(sample.sample_name)
    # do further analysis on each sample
Project API
pepr package
library("pepr")
prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)
for (sample in samples) {
  message(pepr::sampleName(sample))
  # do further analysis on each sample
}
Looper
Deploys pipelines across samples by connecting samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq
pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
maps protocols to pipelines
maps sample attributes (columns) to pipeline arguments
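The two mappings above can be sketched as a small function that turns one sample into one command line. This is an illustration of the idea, not looper's code: `build_command` is hypothetical, and `input_file` and `organism` stand in for the `sample_attribute` placeholders in the interface file.

```python
import shlex

pipeline_interface = {
    "protocol_mappings": {"RNA-seq": "rna-seq"},
    "pipelines": {
        "rna-seq": {
            "name": "RNA-seq_pipeline",
            "path": "path/to/rna-seq.py",
            # Assumed attribute names replacing the placeholders on the slide.
            "arguments": {"--option1": "input_file", "--option2": "organism"},
        }
    },
}


def build_command(sample, interface):
    """Map a sample's protocol to a pipeline, then its attributes to arguments."""
    key = interface["protocol_mappings"][sample["protocol"]]
    pipeline = interface["pipelines"][key]
    parts = [pipeline["path"]]
    for flag, attribute in pipeline["arguments"].items():
        parts += [flag, str(sample[attribute])]
    # Quote each part so the result is safe to hand to a shell.
    return " ".join(shlex.quote(part) for part in parts)


sample = {"sample_name": "frog_0h", "protocol": "RNA-seq",
          "organism": "frog", "input_file": "/path/to/frog0.gz"}
print(build_command(sample, pipeline_interface))
# path/to/rna-seq.py --option1 /path/to/frog0.gz --option2 frog
```

Because the pipeline only sees command-line arguments, it needs no knowledge of the project format.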
Looper features
Single-input runs
Flexible pipelines
Flexible resources
Flexible compute
Job status-aware
Single-input runs
Run your entire project with one line:
looper run project_config.yaml
Flexible pipelines
protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq
Many-to-many mappings
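To make the many-to-many relationship concrete, the sketch below resolves the mapping in both directions. It assumes that a semicolon separates multiple pipelines for one protocol, as the SMART-seq entry suggests; `pipelines_for` and `protocols_for` are hypothetical helpers.

```python
protocol_mappings = {
    "RRBS": "rrbs",
    "WGBS": "wgbs",
    "EG": "wgbs.py",
    "SMART-seq": "rnaBitSeq -f; rnaTopHat -f",
    "ATAC-SEQ": "atacseq",
    "DNase-seq": "atacseq",
    "CHIP-SEQ": "chipseq",
}


def pipelines_for(protocol, mappings):
    """List every pipeline entry mapped to a protocol (one protocol, many pipelines)."""
    entry = mappings.get(protocol, "")
    return [part.strip() for part in entry.split(";") if part.strip()]


def protocols_for(pipeline, mappings):
    """List every protocol that maps to a pipeline (one pipeline, many protocols)."""
    return [proto for proto in mappings if pipeline in pipelines_for(proto, mappings)]


print(pipelines_for("SMART-seq", protocol_mappings))
# ['rnaBitSeq -f', 'rnaTopHat -f']
print(protocols_for("atacseq", protocol_mappings))
# ['ATAC-SEQ', 'DNase-seq']
```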
Flexible resources
pipeline_key:
  name: pipeline_name
  arguments:
    "--option": value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"
Resources can vary by input file size
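A plausible selection rule is to pick the package with the largest `file_size` threshold that the input still meets. The sketch below assumes exactly that; `select_resources` is a hypothetical function, and the input size is taken to be in the same (unstated) unit as `file_size`.

```python
resources = {
    "default": {"file_size": "0", "cores": "2", "mem": "6000", "time": "01:00:00"},
    "large_input": {"file_size": "2000", "cores": "4", "mem": "12000", "time": "08:00:00"},
}


def select_resources(packages, input_size):
    """Pick the package with the highest file_size threshold not above input_size."""
    eligible = [pkg for pkg in packages.values()
                if float(pkg["file_size"]) <= input_size]
    return max(eligible, key=lambda pkg: float(pkg["file_size"]))


print(select_resources(resources, 10)["cores"])    # 2
print(select_resources(resources, 5000)["cores"])  # 4
```

The `default` package has a threshold of 0, so every non-negative input size matches at least one package.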
Flexible compute
compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh
Adjust compute package on-the-fly:
> looper run project_config.yaml --compute localhost
Job status-aware
Looper only submits jobs for samples not already flagged as running, completed, or failed.
looper check project_config.yaml
looper summarize project_config.yaml
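Status awareness can be sketched as a check for per-sample flag files before submission. The file naming used here (`<sample>_<state>.flag`) and both helper functions are assumptions made for illustration; the slides do not say how looper stores job status.

```python
import tempfile
from pathlib import Path

# States that should prevent resubmission, as listed on the slide.
SKIP_STATES = ("running", "completed", "failed")


def sample_status(results_dir, sample_name):
    """Return the flagged state for a sample, or None if no flag file exists."""
    for state in SKIP_STATES:
        # Hypothetical flag file layout: <sample_name>_<state>.flag
        if (Path(results_dir) / f"{sample_name}_{state}.flag").exists():
            return state
    return None


def samples_to_submit(results_dir, sample_names):
    """Keep only the samples that carry no status flag."""
    return [name for name in sample_names
            if sample_status(results_dir, name) is None]


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "frog_0h_completed.flag").touch()
    (Path(tmp) / "frog_1h_running.flag").touch()
    todo = samples_to_submit(tmp, ["frog_0h", "frog_1h", "frog_2h", "frog_3h"])

print(todo)  # ['frog_2h', 'frog_3h']
```

Rerunning the same command after a partial failure would then submit only the samples that have not been attempted.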
Conclusion
The PEP format is a novel approach to standardizing projects.
Initial tools like geofetch and looper build PEP projects and connect them to pipelines
Python and R packages provide a universal interface to PEP metadata for tools and analysis
More information at pepkit.github.io.