outline
Motivation and background
Data management
Epigenome analysis
20% / 50% / 30%
Questions
Most pipelines require individual metadata organization
What if?
Why is this hard to do?
Because of microwave syndrome...
Microwave syndrome
In user interface design, prioritizing easy access to integrated functions over their individual components.
The UNIX philosophy
[T]he power of a system comes more from the relationships among programs than from the programs themselves.
Many UNIX programs do quite trivial tasks in isolation, but, combined with other programs, become general and useful tools.
- Kernighan and Pike, The UNIX Programming Environment (1983, p. viii)
Problem
Solution
PEP: Portable Encapsulated Projects
PEP format
project_config.yaml
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder
samples.csv
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
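To make the two-file layout concrete, here is a minimal Python sketch of how a tool might read a PEP: the project config points at the sample annotation, and each CSV row becomes one sample. The `load_samples` helper is hypothetical and is not the PEP toolkit's API. The config is shown as an already-parsed dictionary, and the CSV is inlined, so the example needs nothing beyond the standard library.

```python
import csv
import io

# Project config as it might look after parsing project_config.yaml
# (a plain dict here, so the sketch needs no YAML library).
project_config = {
    "metadata": {
        "sample_annotation": "/path/to/samples.csv",
        "output_dir": "/path/to/output/folder",
    }
}

# First rows of samples.csv, inlined so the sketch is self-contained.
SAMPLES_CSV = """sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
"""


def load_samples(handle):
    """Hypothetical helper: return one dict per row of the sample annotation."""
    # skipinitialspace drops the blank that follows each comma in the slide's CSV.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [dict(row) for row in reader]


samples = load_samples(io.StringIO(SAMPLES_CSV))
for sample in samples:
    print(sample["sample_name"], sample["data_source"])
```

Each sample is a plain dictionary of attributes, which is the shape the later sketches assume.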
PEP portability features
Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
derived_columns: [data_source]
data_sources:
  my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
  your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
{variable} identifies sample annotation columns
Benefit: Enables distributed files, portability
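The expansion of a derived attribute can be sketched in a few lines of Python. This is an illustration under assumptions, not the toolkit's implementation: `derive_attributes` is a hypothetical name, and the `{variable}` placeholders are filled with Python's `str.format` from the sample's own columns.

```python
def derive_attributes(sample, derived_columns, data_sources):
    """Return a copy of `sample` with each derived column expanded."""
    resolved = dict(sample)
    for column in derived_columns:
        template = data_sources.get(sample[column])
        if template is not None:
            # {organism}, {t}, ... are looked up among the sample's attributes.
            resolved[column] = template.format(**sample)
    return resolved


data_sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}
sample = {
    "sample_name": "frog_1h",
    "t": "1",
    "protocol": "RNA-seq",
    "organism": "frog",
    "data_source": "my_samples",
}

derived = derive_attributes(sample, ["data_source"], data_sources)
print(derived["data_source"])  # /path/to/my/samples/frog_1h.gz
```

Moving the data only requires editing the template in the project config; the sample table stays untouched.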
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism |
| ------------- | :-------------: | -------- |
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome |
| ------------- | :-------------: | -------- | ------ |
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |
Project config file:
implied_columns:
  organism:
    human:
      genome: hg38
    mouse:
      genome: mm10
Benefit: Separates project-level metadata from sample-level metadata
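As a rough Python sketch of the Before/After tables, the function below adds attributes to a sample based on the value of an existing column. The name `imply_attributes` is hypothetical, and the lookup rule (exact match on the column value) is an assumption.

```python
def imply_attributes(sample, implied_columns):
    """Return a copy of `sample` with attributes implied by existing values."""
    resolved = dict(sample)
    for column, by_value in implied_columns.items():
        # Look up the sample's value for this column, e.g. organism == "human".
        implied = by_value.get(sample.get(column), {})
        resolved.update(implied)
    return resolved


implied_columns = {
    "organism": {
        "human": {"genome": "hg38"},
        "mouse": {"genome": "mm10"},
    }
}

human = imply_attributes(
    {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
    implied_columns,
)
print(human["genome"])  # hg38
```

Samples whose value has no entry (a frog sample, say) pass through unchanged.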
Subprojects
Define activatable project attributes.
subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
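One way to picture subproject activation is as a recursive overlay of the subproject's settings onto the base project. The Python sketch below assumes that semantics; `activate_subproject` and `merge` are hypothetical names, and the base `metadata` values are borrowed from the earlier config example for illustration.

```python
import copy


def merge(base, overrides):
    """Recursively overlay `overrides` onto `base` (modifies and returns `base`)."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            merge(base[key], value)
        else:
            base[key] = value
    return base


def activate_subproject(config, name):
    """Return a new config with the named subproject's settings applied."""
    active = copy.deepcopy(config)
    subprojects = active.pop("subprojects", {})
    return merge(active, subprojects[name])


config = {
    "metadata": {
        "sample_annotation": "/path/to/samples.csv",
        "output_dir": "/path/to/output/folder",
    },
    "subprojects": {
        "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
        "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
    },
}

cancer = activate_subproject(config, "cancer")
print(cancer["metadata"]["sample_annotation"])  # psa_rrbs_intracancer.csv
```

Settings the subproject does not mention, such as `output_dir`, are inherited from the base project.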
How is this portable and encapsulated?
Encapsulated:
A project is an extensible object, with samples and settings as attributes.
Portable:
A project can be moved from one analysis tool to another.
A project can be moved from one computing environment to another.
Looper
Deploys pipelines across samples by connecting samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq
pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
maps protocols to pipelines
maps sample attributes (columns) to pipeline arguments
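The two mappings can be sketched in Python as a function that turns one sample into one command line. This is an illustrative sketch, not Looper's code: `build_command` is a hypothetical name, and the sample's attribute values are made up to match the placeholder column names on the slide.

```python
import shlex

pipeline_interface = {
    "protocol_mappings": {"RNA-seq": "rna-seq"},
    "pipelines": {
        "rna-seq": {
            "name": "RNA-seq_pipeline",
            "path": "path/to/rna-seq.py",
            "arguments": {
                "--option1": "sample_attribute",
                "--option2": "sample_attribute2",
            },
        }
    },
}


def build_command(sample, interface):
    """Map a sample's protocol to a pipeline, then its attributes to arguments."""
    key = interface["protocol_mappings"][sample["protocol"]]
    pipeline = interface["pipelines"][key]
    parts = [pipeline["path"]]
    for option, attribute in pipeline["arguments"].items():
        parts += [option, str(sample[attribute])]
    # Quote each part so values containing spaces survive the shell.
    return " ".join(shlex.quote(part) for part in parts)


sample = {
    "sample_name": "frog_0h",
    "protocol": "RNA-seq",
    "sample_attribute": "frog",
    "sample_attribute2": "/path/to/frog0.gz",
}
print(build_command(sample, pipeline_interface))
# path/to/rna-seq.py --option1 frog --option2 /path/to/frog0.gz
```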
Looper features
Single-input runs
Flexible pipelines
Flexible resources
Flexible compute
Job status-aware
Single-input runs
Run your entire project with one line:
looper run project_config.yaml
Flexible pipelines
protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq
Many-to-many mappings
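A short Python sketch of both directions of the mapping, assuming that a semicolon separates several pipelines for one protocol (as the SMART-seq entry suggests). The helper names `pipelines_for` and `protocols_for` are hypothetical.

```python
protocol_mappings = {
    "RRBS": "rrbs",
    "WGBS": "wgbs",
    "SMART-seq": "rnaBitSeq -f; rnaTopHat -f",
    "ATAC-SEQ": "atacseq",
    "DNase-seq": "atacseq",
    "CHIP-SEQ": "chipseq",
}


def pipelines_for(protocol, mappings):
    """One protocol can fan out to several pipelines (split on ';')."""
    return [part.strip() for part in mappings[protocol].split(";") if part.strip()]


def protocols_for(pipeline, mappings):
    """Several protocols can share one pipeline."""
    return [
        protocol
        for protocol in mappings
        if pipeline in pipelines_for(protocol, mappings)
    ]


print(pipelines_for("SMART-seq", protocol_mappings))  # ['rnaBitSeq -f', 'rnaTopHat -f']
print(protocols_for("atacseq", protocol_mappings))    # ['ATAC-SEQ', 'DNase-seq']
```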
Flexible resources
pipeline_key:
  name: pipeline_name
  arguments:
    "--option": value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"
Resources can vary by input file size
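A minimal Python sketch of size-based selection follows. The selection rule is an assumption (pick the package with the largest `file_size` threshold that the input meets), as is the convention that the input size is given in the same unit as `file_size`; the slide states only that resources can vary by input size. `select_resources` is a hypothetical name.

```python
resources = {
    "default": {"file_size": "0", "cores": "2", "mem": "6000", "time": "01:00:00"},
    "large_input": {"file_size": "2000", "cores": "4", "mem": "12000", "time": "08:00:00"},
}


def select_resources(packages, input_size):
    """Return (name, settings) of the largest package whose threshold is met."""
    eligible = [
        (float(settings["file_size"]), name)
        for name, settings in packages.items()
        if float(settings["file_size"]) <= input_size
    ]
    if not eligible:
        raise ValueError("no resource package covers this input size")
    _, name = max(eligible)
    return name, packages[name]


print(select_resources(resources, 500)[0])   # default
print(select_resources(resources, 3500)[0])  # large_input
```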
Flexible compute
compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh
Adjust compute package on-the-fly:
looper run project_config.yaml --compute localhost
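The effect of switching compute packages can be sketched as a lookup: the package name selects both the template used to write a job script and the command used to submit it. The Python below is an illustration only; `submission_plan` and the job script path are hypothetical, and filling the template itself is left out because the template contents are not shown on the slide.

```python
compute = {
    "slurm": {
        "submission_template": "templates/slurm_template.sub",
        "submission_command": "sbatch",
    },
    "localhost": {
        "submission_template": "templates/localhost_template.sub",
        "submission_command": "sh",
    },
}


def submission_plan(packages, package, script_path):
    """Return the template to fill and the argv that would submit the script."""
    settings = packages[package]
    argv = [settings["submission_command"], script_path]
    return settings["submission_template"], argv


# "submission/frog_0h.sub" is a made-up path for the generated job script.
template, argv = submission_plan(compute, "localhost", "submission/frog_0h.sub")
print(template)  # templates/localhost_template.sub
print(argv)      # ['sh', 'submission/frog_0h.sub']
```

The same job script path submitted under `slurm` would yield `['sbatch', 'submission/frog_0h.sub']`, with no change to the project or the pipeline.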
Job status-aware
Looper only submits jobs for samples not already flagged as running, completed, or failed.
looper check project_config.yaml
looper summarize project_config.yaml
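One simple way to implement status awareness is with flag files in each sample's output folder. The Python sketch below assumes that mechanism and a `*_<status>.flag` naming pattern; both are assumptions for illustration rather than a description of Looper's internals, and the helper names are hypothetical.

```python
import tempfile
from pathlib import Path

STATUSES = ("running", "completed", "failed")


def sample_status(sample_dir):
    """Return the first status whose flag file exists, or None if unflagged."""
    for status in STATUSES:
        if any(Path(sample_dir).glob(f"*_{status}.flag")):
            return status
    return None


def samples_to_submit(sample_dirs):
    """Keep only samples that carry no running/completed/failed flag."""
    return [name for name, path in sample_dirs.items() if sample_status(path) is None]


# Demo in a throwaway directory: frog_0h is already completed, frog_1h is new.
with tempfile.TemporaryDirectory() as root:
    dirs = {name: Path(root) / name for name in ("frog_0h", "frog_1h")}
    for path in dirs.values():
        path.mkdir()
    (dirs["frog_0h"] / "rna-seq_completed.flag").touch()

    done_status = sample_status(dirs["frog_0h"])
    to_submit = samples_to_submit(dirs)

print(done_status)  # completed
print(to_submit)    # ['frog_1h']
```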
A robust ATAC-seq pipeline
built on the PEP toolkit
Comparison
Prealignments
Nuclear-mitochondrial DNA (NuMts) confuse aligners
Inaccurate alignment statistics
Requires pre-defined NuMt locations
Wastes compute power
### Advantages of serial alignments
- Accuracy: better alignment rates, and no blacklist needed.
- Speed.
- Modular reference assemblies.
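The control flow of serial alignment can be sketched with a toy aligner: reads are aligned to each prealignment reference in turn, and only the reads that fail to align move on to the next reference. Everything below is a stand-in for illustration. `toy_align` treats a reference as a set of exact sequences, whereas a real pipeline would call an actual aligner; the reference names and reads are invented.

```python
def toy_align(reads, reference):
    """Stand-in aligner: a read 'aligns' if it appears in the reference set."""
    aligned = [read for read in reads if read in reference]
    unaligned = [read for read in reads if read not in reference]
    return aligned, unaligned


def serial_align(reads, references):
    """Align to each (name, reference) in order, passing on unaligned reads."""
    counts = {}
    remaining = list(reads)
    for name, reference in references:
        aligned, remaining = toy_align(remaining, reference)
        counts[name] = len(aligned)
    return counts, remaining


reads = ["AAAA", "CCCC", "AAAA", "GGGG", "TTTT"]
references = [
    ("mitochondrial", {"AAAA"}),       # prealignment: soak up mitochondrial reads
    ("genome", {"AAAA", "CCCC", "GGGG"}),
]

counts, leftover = serial_align(reads, references)
print(counts)    # {'mitochondrial': 2, 'genome': 2}
print(leftover)  # ['TTTT']
```

Because the mitochondrial reads are removed first, they never reach the primary genome, even though the toy genome also contains a matching sequence. Each reference is an independent step, which is what makes the reference assemblies modular.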
Output
Epigenome analysis methods
LOLA
Locus Overlap Analysis
MIRA
Methylation-based Inference of Regulatory Activity
Locus Overlap Analysis
Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.
DEMO
Methylation-based Inference of Regulatory Activity (MIRA)
Lawson et al. (2018). Bioinformatics.
DNA methylation
Bisulfite-seq
Region pooling
Sheffield et al. (2017). Nature Medicine.
Thank You
Sheffield lab
John Lawson
Vince Reuter
Ognen Duzlevski
Jason Smith
Jianglin Feng
Michal Stolarczyk
Aaron Gu
Anant Tewari
Christoph Bock
Andre Rendeiro
Johanna Klughammer
Howard Chang
Ryan Corces
Yuning Wei
Jin Xu
nsheff · databio.org · nsheffield@virginia.edu