Introduction to PEPkit

Nathan Sheffield, PhD

www.databio.org/slides

We are now in the

Era of Large Biomedical Data

Hypothesis:

The most important advances of the future will come from studies that can integrate data from many sources.

Integrating data introduces 2 major challenges:

Data scale
Data harmonization

Why is data harmonization hard?

Because the work grows quadratically.
Each new dataset adds N additional pairwise comparisons, so N datasets require N(N-1)/2 comparisons in total.
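
The growth in pairwise comparisons can be checked with a short sketch. The function name `pairwise_comparisons` and the dataset labels are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def pairwise_comparisons(n_datasets: int) -> int:
    """Number of distinct dataset pairs: n * (n - 1) / 2."""
    return n_datasets * (n_datasets - 1) // 2


# Cross-check the closed form by enumerating the pairs explicitly.
datasets = ["A", "B", "C", "D", "E"]
enumerated_pairs = list(combinations(datasets, 2))
assert len(enumerated_pairs) == pairwise_comparisons(len(datasets))

# Going from N to N + 1 datasets adds exactly N new pairs.
for n in range(1, 6):
    added = pairwise_comparisons(n + 1) - pairwise_comparisons(n)
    total = pairwise_comparisons(n + 1)
    print(f"{n} -> {n + 1} datasets: +{added} comparisons, {total} total")
```

Each step adds one more comparison than the step before, which is why the total effort outpaces the number of datasets.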

The conundrum

We stand to benefit immensely
from integrating broader and broader data sources.

BUT...the wider our integration effort,
the more challenging the integration.

Pepkit

A structure and toolkit for organizing large-scale,
sample-intensive biological research projects

http://pepkit.github.io/

Sheffield et al. (2021). GigaScience.

1. Metadata management
2. Pipeline development
3. Reproducible computing environments

PEP: Portable Encapsulated Projects

PEP format

Start with a simple CSV with tabular data.

samples.csv

sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz

PEP format

Add a YAML for project-level data.

project_config.yaml

sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value
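
To make the two-file layout concrete, here is a minimal sketch that uses only the Python standard library. It is not the pepkit implementation: the CSV is inlined as a string, and a plain dict stands in for `project_config.yaml` so that no YAML parser is needed. Both are assumptions made for illustration.

```python
import csv
import io

# Contents of samples.csv, inlined so the sketch is self-contained.
SAMPLES_CSV = """\
sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
"""

# Project-level settings: a dict standing in for project_config.yaml.
project_config = {
    "sample_table": "/path/to/samples.csv",
    "output_dir": "/path/to/output/folder",
    "other_variable": "value",
}

# Sample-level data: one dict per row, keyed by column name.
samples = list(csv.DictReader(io.StringIO(SAMPLES_CSV)))

# Project-level values apply to every sample; sample-level values vary by row.
for sample in samples:
    print(sample["sample_name"], sample["input_file"], "->", project_config["output_dir"])
```

The split keeps per-sample facts in the table and shared settings in the config, so either file can change without touching the other.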

Add programmatic sample and project modifiers.

Derived attributes

Implied attributes

Subprojects

Derived attributes

Automatically build new sample attributes from existing attributes.

Without derived attribute:



| sample_name   | t    | protocol        | organism | input_file             |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | /path/to/frog0.gz      |
| frog_1h       | 1    | RNA-seq         | frog     | /path/to/frog1.gz      |
| frog_2h       | 2    | RNA-seq         | frog     | /path/to/frog2.gz      |
| frog_3h       | 3    | RNA-seq         | frog     | /path/to/frog3.gz      |

Using derived attribute:



| sample_name   | t    | protocol        | organism | input_file             |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | my_samples             |
| frog_1h       | 1    | RNA-seq         | frog     | my_samples             |
| frog_2h       | 2    | RNA-seq         | frog     | my_samples             |
| frog_3h       | 3    | RNA-seq         | frog     | my_samples             |
| crab_0h       | 0    | RNA-seq         | crab     | your_samples           |
| crab_3h       | 3    | RNA-seq         | crab     | your_samples           |



Project config file:

sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"

{variable} is replaced by each sample's value in the sample table column of that name.

Benefit: Files can live in different locations, and the project stays portable.
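
The template substitution can be sketched in plain Python. This is an illustration of the idea, not pepkit's own code: the `derive_attributes` helper is hypothetical, and the config is written as a dict mirroring the `derive` section above.

```python
# Dict mirroring the `derive` section of the project config.
derive = {
    "attributes": ["input_file"],
    "sources": {
        "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
        "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
    },
}

# Two rows from the sample table, one per source key.
samples = [
    {"sample_name": "frog_0h", "t": "0", "protocol": "RNA-seq",
     "organism": "frog", "input_file": "my_samples"},
    {"sample_name": "crab_3h", "t": "3", "protocol": "RNA-seq",
     "organism": "crab", "input_file": "your_samples"},
]


def derive_attributes(sample: dict, derive: dict) -> dict:
    """Return a copy of `sample` with derived attributes expanded (hypothetical helper)."""
    result = dict(sample)
    for attr in derive["attributes"]:
        # The cell value (e.g. "my_samples") is a key into `sources`.
        template = derive["sources"].get(sample.get(attr))
        if template is not None:
            # {organism} and {t} are filled from the sample's own columns.
            result[attr] = template.format(**sample)
    return result


derived = [derive_attributes(s, derive) for s in samples]
for s in derived:
    print(s["sample_name"], s["input_file"])
```

Moving the data then means editing one template in the config rather than every row of the table.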

Implied attributes

Add new sample attributes conditioned on values of existing attributes

Before:



| sample_name   | protocol        | organism | 
| ------------- | :-------------: | -------- | 
| human_1       | RNA-seq         | human    | 
| human_2       | RNA-seq         | human    | 
| human_3       | RNA-seq         | human    | 
| mouse_1       | RNA-seq         | mouse    |

After:



| sample_name   | protocol        | organism | genome | 
| ------------- | :-------------: | -------- | ------ |
| human_1       | RNA-seq         | human    | hg38   |
| human_2       | RNA-seq         | human    | hg38   |
| human_3       | RNA-seq         | human    | hg38   |
| mouse_1       | RNA-seq         | mouse    | mm10   |



Project config file:

sample_modifiers:
  imply:
    - if: 
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10

Benefit: Separates project-level metadata from sample-level metadata
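
The if/then rules can be sketched as follows. This is a simplified illustration, not pepkit's implementation: the `imply_attributes` helper is hypothetical, and it assumes each `if` condition is a single value that must match exactly.

```python
# List of rules mirroring the `imply` section of the project config.
imply = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
]

samples = [
    {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
    {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"},
]


def imply_attributes(sample: dict, rules: list) -> dict:
    """Return a copy of `sample` with implied attributes added (hypothetical helper)."""
    result = dict(sample)
    for rule in rules:
        # A rule fires only if every condition in its `if` block matches.
        if all(result.get(key) == value for key, value in rule["if"].items()):
            result.update(rule["then"])
    return result


implied = [imply_attributes(s, imply) for s in samples]
for s in implied:
    print(s["sample_name"], s["organism"], s["genome"])
```

The sample table only has to record the organism; the genome assembly is a project decision that lives in the config.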

Subprojects

Define sets of project attributes that can be activated on demand.

project_modifiers:
  amend:
    diverse:
      sample_table: psa_rrbs_diverse.csv
    cancer:
      sample_table: psa_rrbs_intracancer.csv

Benefit: Defines multiple similar projects in a single file
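
Activating one of these named variants amounts to overlaying its attributes on the base project. The sketch below is illustrative and is not pepkit's own code: the `activate` helper is hypothetical, the base values are borrowed from the `project_config.yaml` shown earlier (including its `sample_table` key), and the overlay is a simple shallow update.

```python
import copy

# Base project settings, as in project_config.yaml.
base_project = {
    "sample_table": "/path/to/samples.csv",
    "output_dir": "/path/to/output/folder",
}

# Named variants; each overrides only the attributes that differ.
variants = {
    "diverse": {"sample_table": "psa_rrbs_diverse.csv"},
    "cancer": {"sample_table": "psa_rrbs_intracancer.csv"},
}


def activate(project: dict, variants: dict, name: str) -> dict:
    """Return a copy of `project` with the named variant applied (hypothetical helper)."""
    amended = copy.deepcopy(project)
    # Shallow overlay: keys in the variant replace keys in the base project.
    amended.update(variants[name])
    return amended


cancer_project = activate(base_project, variants, "cancer")
print(cancer_project["sample_table"], cancer_project["output_dir"])
```

Attributes that a variant does not mention, such as `output_dir` here, are inherited unchanged from the base project.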

Thank You

Collaborators
Vince Reuter
Andre Rendeiro
Levi Waldron

Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy

Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi

Funding:

NIGMS R35-GM128636

nsheff · databio.org · nsheffield@virginia.edu