We are now in the Era of Large Biomedical Data
Hypothesis:
The most important advances of the future will come from studies that can integrate data from lots of sources
Integrating data introduces 2 major challenges:
Data scale
Data harmonization
Why is data harmonization hard?
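Because the effort grows quadratically with the number of datasets.
Each new dataset must be compared with every dataset already included, so adding one to N existing datasets adds N pairwise comparisons.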
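The count behind this statement can be written out. Assuming every pair of datasets needs its own harmonization step (a simplification made here for illustration), the first dataset adds no comparisons, the second adds one, and so on, so the total for N datasets is:

```latex
\[
  \underbrace{0 + 1 + 2 + \dots + (N-1)}_{\text{pairs added by each new dataset}}
  \;=\; \binom{N}{2}
  \;=\; \frac{N(N-1)}{2}
\]
```

For example, 10 datasets require 45 pairwise comparisons, and 100 datasets require 4,950.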
The conundrum
We stand to benefit immensely from integrating broader and broader data sources.
BUT...the wider our integration effort, the more challenging the integration.
Pepkit
A structure and toolkit for organizing large-scale, sample-intensive biological research projects
Sheffield et al. (2021). GigaScience.
1. Metadata management
2. Pipeline development
3. Reproducible computing environments
PEP: Portable Encapsulated Projects
PEP format
Start with a simple CSV with tabular data.
samples.csv
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
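Because the sample table is plain CSV, any tool can read it. The following is a minimal sketch using only the Python standard library; it is not Pepkit's own loader, and the `load_samples` helper is a hypothetical name introduced here for illustration.

```python
import csv
import io

# Contents of samples.csv from the slide, inlined so the sketch is self-contained.
SAMPLES_CSV = """\
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
"""


def load_samples(text):
    """Parse a sample table into a list of dicts, one dict per sample row."""
    return list(csv.DictReader(io.StringIO(text)))


samples = load_samples(SAMPLES_CSV)
for sample in samples:
    # Each column header becomes a sample attribute.
    print(sample["sample_name"], sample["input_file"])
```

Each row is one sample and each column is one sample attribute, which is the shape the later modifier examples operate on.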
Add a YAML for project-level data.
project_config.yaml
sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value
Add programmatic sample and project modifiers.
Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | input_file |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz |
| frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz |
| frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz |
| frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | input_file |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h | 0 | RNA-seq | frog | my_samples |
| frog_1h | 1 | RNA-seq | frog | my_samples |
| frog_2h | 2 | RNA-seq | frog | my_samples |
| frog_3h | 3 | RNA-seq | frog | my_samples |
| crab_0h | 0 | RNA-seq | crab | your_samples |
| crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
A {variable} placeholder is filled in from the sample's value in the sample annotation column of that name
Benefit: Enables distributed files, portability
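To make the substitution concrete, here is a minimal sketch of the idea in plain Python. It is an illustration of the behavior described on this slide, not Pepkit's implementation; the `derive_attributes` function is a hypothetical name, and the sample rows are copied from the table above.

```python
def derive_attributes(samples, attributes, sources):
    """Replace each derived attribute's source key with its filled-in template.

    samples: list of dicts, one per sample
    attributes: names of the attributes to derive (e.g. ["input_file"])
    sources: mapping from source key to a template with {column} placeholders
    """
    derived = []
    for sample in samples:
        new = dict(sample)  # copy so the input rows are left untouched
        for attr in attributes:
            key = new.get(attr)
            if key in sources:
                # Fill {organism}, {t}, ... from the sample's own columns.
                new[attr] = sources[key].format(**new)
        derived.append(new)
    return derived


samples = [
    {"sample_name": "frog_0h", "t": "0", "organism": "frog", "input_file": "my_samples"},
    {"sample_name": "crab_3h", "t": "3", "organism": "crab", "input_file": "your_samples"},
]
sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

result = derive_attributes(samples, ["input_file"], sources)
for sample in result:
    print(sample["sample_name"], "->", sample["input_file"])
```

Moving the data only requires editing the template in the project config; the sample table itself stays unchanged.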
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism |
| ------------- | :-------------: | -------- |
| human_1 | RNA-seq | human |
| human_2 | RNA-seq | human |
| human_3 | RNA-seq | human |
| mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome |
| ------------- | :-------------: | -------- | ------ |
| human_1 | RNA-seq | human | hg38 |
| human_2 | RNA-seq | human | hg38 |
| human_3 | RNA-seq | human | hg38 |
| mouse_1 | RNA-seq | mouse | mm10 |
Project config file:
sample_modifiers:
  imply:
    - if:
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10
Benefit: Separates project-level metadata from sample-level metadata
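The if/then rules can be read as a small rule engine applied to each sample. The sketch below illustrates that reading in plain Python; it is not Pepkit's implementation, `imply_attributes` is a hypothetical name, and it assumes each `if` condition compares a column against a single value.

```python
def imply_attributes(samples, rules):
    """Add the attributes in a rule's 'then' block to every sample matching its 'if' block."""
    implied = []
    for sample in samples:
        new = dict(sample)  # copy so the input rows are left untouched
        for rule in rules:
            # A rule matches only when every condition in 'if' holds.
            if all(new.get(col) == value for col, value in rule["if"].items()):
                new.update(rule["then"])
        implied.append(new)
    return implied


samples = [
    {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
    {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"},
]
rules = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
]

result = imply_attributes(samples, rules)
for sample in result:
    print(sample["sample_name"], sample["genome"])
```

The genome assignment lives in one place in the project config, so switching every human sample to a different assembly is a one-line change.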
Subprojects
Define alternative sets of project attributes that can be activated on demand.
project_modifiers:
  amendments:
    diverse:
      metadata:
        sample_annotation: psa_rrbs_diverse.csv
    cancer:
      metadata:
        sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
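One way to picture activation is as an overlay: the chosen block's values replace the matching values in the base project config, and everything else is inherited. The sketch below illustrates that idea in plain Python using the key names from this slide. It is not Pepkit's implementation; `activate_amendment` is a hypothetical name, and the base `sample_annotation` and `output_dir` values are placeholders invented for the example.

```python
import copy


def _deep_update(target, overrides):
    """Recursively overlay 'overrides' onto 'target' in place."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _deep_update(target[key], value)
        else:
            target[key] = copy.deepcopy(value)


def activate_amendment(config, name):
    """Return a copy of 'config' with the named amendment's values overlaid."""
    amendments = config.get("project_modifiers", {}).get("amendments", {})
    if name not in amendments:
        raise KeyError(f"unknown amendment: {name}")
    result = copy.deepcopy(config)  # the base config is left untouched
    _deep_update(result, amendments[name])
    return result


config = {
    # Hypothetical base values, not taken from the slide.
    "metadata": {
        "sample_annotation": "base_samples.csv",
        "output_dir": "/path/to/output/folder",
    },
    "project_modifiers": {
        "amendments": {
            "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
            "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
        }
    },
}

cancer = activate_amendment(config, "cancer")
print(cancer["metadata"]["sample_annotation"])
print(cancer["metadata"]["output_dir"])
```

Only the sample annotation changes between the two variants; shared settings such as the output directory are written once.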
Thank You
Collaborators
Vince Reuter
Andre Rendeiro
Levi Waldron
Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy
Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi
nsheff · databio.org · nsheffield@virginia.edu