Organizing large-scale biological data around standardized projects
Nathan Sheffield, PhD
www.databio.org/slides
Data is becoming more...
abundant
available
powerful
So why are the world's problems not solved?
First step in bioinformatics analysis:
pipeline
Papers with "bioinformatics pipeline" in title
Problem solved?
Then, downstream tools need a different organization
What if?
Why is this hard to do?
Because of microwave syndrome...
Microwave syndrome
In user interface design, prioritizing easy access to integrated functions over their individual components.
The UNIX philosophy
[T]he power of a system comes more from the relationships among programs than from the programs themselves.
Many UNIX programs do quite trivial tasks in isolation, but, combined with other programs, become general and useful tools.
- Kernighan and Pike, The UNIX Programming Environment (1983, p. viii)
Problem
Solution
PEP: Portable Encapsulated Projects
PEP format
project_config.yaml
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder
samples.csv
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
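A sample table like this can be read with nothing but the standard library. The sketch below is illustrative only and is not how peppy does it; the helper name `read_samples` is hypothetical, and the table is inlined so the example runs on its own.

```python
import csv
import io

# Sample table from the slide, inlined so the sketch is self-contained.
SAMPLES_CSV = """sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
"""


def read_samples(text):
    """Parse a sample table into a list of dicts, one dict per sample.

    `read_samples` is a hypothetical helper, not part of any PEP tool.
    skipinitialspace=True drops the blank that follows each comma.
    """
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    return [dict(row) for row in reader]


samples = read_samples(SAMPLES_CSV)
for sample in samples:
    print(sample["sample_name"], sample["data_source"])
```

Each row becomes one sample, and each column becomes a sample attribute.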
PEP portability features
Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | data_source |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
derived_columns: [data_source]
data_sources:
  my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
  your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
{variable} identifies sample annotation columns
Benefit: Enables distributed files, portability
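The expansion of a derived attribute can be pictured as plain string templating. This is a minimal sketch under that assumption, not the peppy implementation; `derive_attribute` is a hypothetical helper, and the templates and sample values are copied from the slide.

```python
def derive_attribute(sample, attr, sources):
    """Return a copy of `sample` with sample[attr] expanded.

    The current value of sample[attr] is treated as a key into `sources`;
    the matching template is filled in from the sample's other attributes.
    (Hypothetical helper for illustration only.)
    """
    template = sources[sample[attr]]
    derived = dict(sample)
    derived[attr] = template.format(**sample)
    return derived


data_sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

frog = {"sample_name": "frog_0h", "t": 0, "protocol": "RNA-seq",
        "organism": "frog", "data_source": "my_samples"}
crab = {"sample_name": "crab_3h", "t": 3, "protocol": "RNA-seq",
        "organism": "crab", "data_source": "your_samples"}

frog_derived = derive_attribute(frog, "data_source", data_sources)
crab_derived = derive_attribute(crab, "data_source", data_sources)
print(frog_derived["data_source"])  # /path/to/my/samples/frog_0h.gz
print(crab_derived["data_source"])  # /path/to/your/samples/crab_3h.gz
```

Moving the data to a new location then means editing one template in the project config rather than every row of the sample table.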
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism |
| ------------- | :-------------: | -------- |
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome |
| ------------- | :-------------: | -------- | ------ |
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |
Project config file:
implied_columns:
  organism:
    human:
      genome: hg38
    mouse:
      genome: mm10
Benefit: Separates project-level metadata from sample-level metadata
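One way to think about implied attributes is as a lookup keyed on an existing attribute's value. The sketch below assumes exactly that and is not taken from peppy; `apply_implied` is a hypothetical helper, and the rules mirror the config on the slide.

```python
# Rules copied from the slide: attribute -> value -> attributes to add.
implied_columns = {
    "organism": {
        "human": {"genome": "hg38"},
        "mouse": {"genome": "mm10"},
    }
}


def apply_implied(sample, rules):
    """Return a copy of `sample` with implied attributes added.

    For each rule attribute, the sample's value selects a dict of new
    attributes; values without a rule add nothing. (Hypothetical helper.)
    """
    out = dict(sample)
    for attr, by_value in rules.items():
        out.update(by_value.get(sample.get(attr), {}))
    return out


human = apply_implied({"sample_name": "human_1", "protocol": "RNA-seq",
                       "organism": "human"}, implied_columns)
mouse = apply_implied({"sample_name": "mouse_1", "protocol": "RNA-seq",
                       "organism": "mouse"}, implied_columns)
print(human["genome"], mouse["genome"])  # hg38 mm10
```

The sample table keeps only what is true of the sample (its organism), while the choice of reference genome lives with the project.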
Subprojects
Define activatable project attributes.
subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
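Activating a subproject can be sketched as overlaying its settings on the main config. The recursive merge below is an assumption made for illustration, not the documented peppy behavior; `deep_update` and `activate_subproject` are hypothetical names, and the base `metadata` values are borrowed from the earlier project_config.yaml slide.

```python
import copy

config = {
    "metadata": {
        "sample_annotation": "/path/to/samples.csv",
        "output_dir": "/path/to/output/folder",
    },
    "subprojects": {
        "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
        "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
    },
}


def deep_update(base, overrides):
    """Recursively overlay `overrides` onto `base` (modifies `base`)."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_update(base[key], value)
        else:
            base[key] = value
    return base


def activate_subproject(config, name):
    """Return a new config with the named subproject's settings applied."""
    active = copy.deepcopy({k: v for k, v in config.items() if k != "subprojects"})
    return deep_update(active, copy.deepcopy(config["subprojects"][name]))


cancer = activate_subproject(config, "cancer")
print(cancer["metadata"]["sample_annotation"])  # psa_rrbs_intracancer.csv
print(cancer["metadata"]["output_dir"])         # /path/to/output/folder
```

Only the overridden attribute changes; everything else is inherited from the main project.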
How is this portable and encapsulated?
Encapsulated:
A project is an extensible object, with samples and settings as attributes.
Portable:
A project can be moved from one analysis tool to another
A project can be moved from one computing environment to another
peppy package
import peppy
prj = peppy.Project("pep_config.yaml")
samples = prj.get_samples()
for sample in samples:
    print(sample.name)
    # do further analysis on each sample
Project API
pepr package
library("pepr")
prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)
for (sample in samples) {
    message(pepr::sampleName(sample))
    # do further analysis on each sample
}
Looper
Deploys pipelines across samples by connecting samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq
pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
maps protocols to pipelines
maps sample attributes (columns) to pipeline arguments
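The two mappings above can be combined into a command line roughly as follows. This is a sketch of the idea, not looper's code; `build_command` is a hypothetical helper, and the flag names `--sample-name` and `--input` are made up for the example.

```python
import shlex

# Same shape as pipeline_interface.yaml; the two flags are illustrative.
interface = {
    "protocol_mappings": {"RNA-seq": "rna-seq"},
    "pipelines": {
        "rna-seq": {
            "name": "RNA-seq_pipeline",
            "path": "path/to/rna-seq.py",
            "arguments": {
                "--sample-name": "sample_name",
                "--input": "data_source",
            },
        }
    },
}


def build_command(sample, interface):
    """Build the shell command for one sample (hypothetical helper).

    The sample's protocol selects the pipeline; each argument flag is
    filled with the value of the sample attribute it maps to.
    """
    key = interface["protocol_mappings"][sample["protocol"]]
    pipeline = interface["pipelines"][key]
    parts = [pipeline["path"]]
    for flag, attr in pipeline["arguments"].items():
        parts += [flag, str(sample[attr])]
    return " ".join(shlex.quote(part) for part in parts)


sample = {"sample_name": "frog_0h", "protocol": "RNA-seq",
          "organism": "frog", "data_source": "/path/to/frog0.gz"}
command = build_command(sample, interface)
print(command)
# path/to/rna-seq.py --sample-name frog_0h --input /path/to/frog0.gz
```

Because the pipeline only ever sees a command line, any command-line tool can be plugged in without knowing about the project format.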
Looper features
Single-input runs
Flexible pipelines
Flexible resources
Flexible compute
Job status-aware
Single-input runs
Run your entire project with one line:
looper run project_config.yaml
Flexible pipelines
protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq
Many-to-many mappings
Flexible resources
pipeline_key:
  name: pipeline_name
  arguments:
    "--option": value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"
Resources can vary by input file size
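Choosing a resource package by input size might look like the sketch below. It assumes that `file_size` is a lower threshold and that the package with the largest threshold not exceeding the input wins; that rule, the inclusive comparison, and the helper name `select_resources` are assumptions, not confirmed looper behavior. The slide does not state the unit of `file_size`, so the input size is taken to be in the same unit.

```python
# Resource packages copied from the slide.
resources = {
    "default": {"file_size": "0", "cores": "2",
                "mem": "6000", "time": "01:00:00"},
    "large_input": {"file_size": "2000", "cores": "4",
                    "mem": "12000", "time": "08:00:00"},
}


def select_resources(packages, input_size):
    """Pick the package with the largest file_size threshold <= input_size.

    `input_size` must be in the same (unstated) unit as file_size.
    Hypothetical helper; the selection rule is an assumption.
    """
    eligible = [p for p in packages.values()
                if float(p["file_size"]) <= input_size]
    return max(eligible, key=lambda p: float(p["file_size"]))


small = select_resources(resources, 500)
large = select_resources(resources, 2500)
print(small["cores"], small["time"])  # 2 01:00:00
print(large["cores"], large["time"])  # 4 08:00:00
```

The `default` package has a threshold of 0, so every non-negative input size matches at least one package.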
Flexible compute
compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh
Adjust compute package on-the-fly:
> looper run project_config.yaml --compute localhost
Job status-aware
Looper only submits jobs for samples not already flagged as running, completed, or failed.
looper check project_config.yaml
looper summarize project_config.yaml
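Status awareness can be sketched with marker files: a sample is skipped if any status flag already exists for it. The file naming scheme (`<sample>_<status>.flag`) and the helper `needs_submission` are hypothetical choices for this example, not a description of looper's actual layout.

```python
import tempfile
from pathlib import Path

# Statuses named on the slide.
FLAGS = ("running", "completed", "failed")


def needs_submission(sample_name, flag_dir):
    """True if no status flag file exists for the sample.

    Flag files are assumed to be named <sample>_<status>.flag
    (a hypothetical convention for this sketch).
    """
    return not any(
        (Path(flag_dir) / f"{sample_name}_{flag}.flag").exists()
        for flag in FLAGS
    )


with tempfile.TemporaryDirectory() as tmp:
    # Pretend frog_0h already finished and frog_2h failed.
    (Path(tmp) / "frog_0h_completed.flag").touch()
    (Path(tmp) / "frog_2h_failed.flag").touch()
    names = ["frog_0h", "frog_1h", "frog_2h", "frog_3h"]
    to_submit = [name for name in names if needs_submission(name, tmp)]

print(to_submit)  # ['frog_1h', 'frog_3h']
```

Re-running the same command after a partial run then submits only the samples that have not been attempted.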
Conclusion
The PEP format is a novel approach to standardizing projects.
Initial tools like geofetch and looper build PEP projects and connect them to pipelines
Python and R packages provide a universal interface to PEP metadata for tools and analysis
More information at pepkit.github.io.