Refgenie, PEP, and bulker

Nathan Sheffield, PhD

www.databio.org/slides

A full-service reference genome manager.

http://refgenie.databio.org

Stolarczyk et al. (2020). GigaScience.

Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.

The problem

Many tools require genome-related assets (like indexes).
How should we organize these on disk?

## A standard organization simplifies tool interface

Flexible paths must be passed individually:

```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
  --tss_annotation path/to/hg38/tss_annotation.bed \
  --ensembl_anno path/to/hg38/ensembl_v86.gtf
```

A standard establishes expectations:

```
pipeline.py --genome hg38
```
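A minimal sketch of why a standard layout shortens the interface: with a known folder structure, a tool can derive every asset path from the genome name alone. The layout (`<root>/<genome>/<asset>`), the function name `resolve_assets`, and the asset keys are assumptions for illustration, modeled on the example paths above rather than on any real tool.

```python
from pathlib import Path

# Hypothetical standard layout: <root>/<genome>/<asset file or folder>.
# The file names mirror the example paths on this slide.
STANDARD_ASSETS = {
    "bowtie2_index": "bowtie2-index",
    "tss_annotation": "tss_annotation.bed",
    "ensembl_anno": "ensembl_v86.gtf",
}


def resolve_assets(genome: str, root: str = "path/to") -> dict:
    """Map each asset name to its expected path for the given genome."""
    base = Path(root) / genome
    return {name: base / filename for name, filename in STANDARD_ASSETS.items()}


if __name__ == "__main__":
    # One argument (the genome name) replaces three explicit paths.
    for name, path in resolve_assets("hg38").items():
        print(f"{name}: {path}")
```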

## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*. You download a tarball with a standard structure for your genome of interest, then write tools against that structure.

## The 'central repository' approach is limited

- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular.* No access to individual assets.
- *Not programmatic.* Can't access data/metadata via API.

## Refgenie solves these limitations

- *Two ways to retrieve an asset.*
    - `build` any asset from a recipe.
    - `pull` any individual asset from a server.
- *Better discoverability and modularity.*
    - `list/listr` shows assets.
    - `refgenieserver` is a browseable web interface and API.
- *Managed locations.*
    - `seek` returns the local path to assets.
    - `add/remove` to manage your own assets.
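As a rough sketch of how a pipeline might use `seek` instead of hard-coded paths, the helper below shells out to the refgenie CLI. The function names `build_seek_command` and `seek` are hypothetical wrappers, and the asset name `bowtie2_index` is only an example; running `seek` assumes refgenie is installed and configured on the machine.

```python
import subprocess


def build_seek_command(genome: str, asset: str) -> list:
    """Assemble the argument list for `refgenie seek`."""
    return ["refgenie", "seek", "-g", genome, "-a", asset]


def seek(genome: str, asset: str) -> str:
    """Run `refgenie seek` and return the local path it prints.

    Assumes a working refgenie installation; raises
    subprocess.CalledProcessError if the command fails.
    """
    result = subprocess.run(
        build_seek_command(genome, asset),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Only print the command here, so the sketch runs without refgenie.
    print(" ".join(build_seek_command("hg38", "bowtie2_index")))
```

The pipeline then needs only a genome name and an asset name; where the files live on disk stays refgenie's concern.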

Refgenie consists of 3 components

Refgenie splits tasks between CLI and server

## Refgenie CLI example

[http://refgenie.databio.org](http://refgenie.databio.org)

The build/pull method needs provenance checks

Asset provenance:

Genome provenance:

Refget

Refget enables access to reference sequences
using an identifier derived from the sequence itself.
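To make "an identifier derived from the sequence itself" concrete, here is a small sketch. It assumes a truncated SHA-512 scheme (hash the uppercase sequence, keep the first 24 bytes, hex-encode them), which yields a 48-character identifier; the function name `trunc512_digest` is hypothetical, and the specification linked below remains the authority on exact normalization and encoding.

```python
import hashlib


def trunc512_digest(sequence: str) -> str:
    """Return a 48-character hex identifier computed from the sequence content."""
    # Assumption: sequences are normalized to uppercase before hashing.
    normalized = sequence.upper().encode("ascii")
    return hashlib.sha512(normalized).digest()[:24].hex()


if __name__ == "__main__":
    a = trunc512_digest("ACGT")
    b = trunc512_digest("acgt")
    c = trunc512_digest("ACGA")
    print(a)
    print(a == b)  # same content after normalization, so the same identifier
    print(a == c)  # different content is compared here
```

Because the identifier depends only on content, two parties holding the same sequence compute the same identifier without consulting a naming authority.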

http://samtools.github.io/hts-specs/refget.html

From GA4GH

How refget works

## Refgenie implements (collection) refget-like

You first have to have the `fasta` asset:

```
refgenie pull -g hg38 -a fasta
```

Then you can use `getseq`:

```
refgenie getseq -g hg38 -l chr1:50000-50400
AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT
```

Refget v2.0: Collections for genome provenance

Recursive checksums have advantages

Allows getting content list only

Preserves chromosome order

Re-uses the checksum function

Duplicates are stored only once
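The four advantages above can be seen in a toy two-level checksum. This is an illustrative sketch only: the serialization of the content list (`name:digest` pairs joined by commas) and the function names are assumptions, not the format defined by the refget collections work.

```python
import hashlib


def digest(text: str) -> str:
    """The single checksum function, re-used at every level."""
    return hashlib.sha512(text.encode("utf-8")).digest()[:24].hex()


def collection_digest(sequences: list) -> tuple:
    """Digest an ordered list of (name, sequence) pairs.

    Returns (top_level_digest, content_list, store).
    """
    store = {}         # sequence digest -> sequence; duplicates stored once
    content_list = []  # ordered (name, sequence digest) pairs
    for name, seq in sequences:
        seq_digest = digest(seq.upper())
        store[seq_digest] = seq.upper()
        content_list.append((name, seq_digest))
    # Level 2: digest the ordered content list rather than the raw sequences,
    # so chromosome order is part of the identity.
    serialized = ",".join(f"{name}:{d}" for name, d in content_list)
    return digest(serialized), content_list, store


if __name__ == "__main__":
    genome = [("chr1", "ACGT"), ("chr2", "GGCC"), ("chr2_copy", "GGCC")]
    top, content, store = collection_digest(genome)
    print(top)
    print(content)     # the content list is available without any sequence
    print(len(store))  # identical sequences share one stored entry
```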

Go one step further for...

It keeps going... and going...

Asset provenance:

Recipes + containers?

Genome provenance:

Solved by refget v2.0?

Tying human identifiers to a digest:


hg38:
  refget_digest: 32a37a52a377d95bfd4b3d66763e1396a4480f34ab5c318a

Pepkit

A structure and toolkit for organizing large-scale,
sample-intensive biological research projects

http://pepkit.github.io/

Sheffield et al. (2021). GigaScience.

Research is organized in projects

How do we conceptualize a research project?

Each project has 3 components

Organizing multiple projects is a challenge

How do I re-use a component?

A project is a set of edges in a tripartite graph

Enable linking with interfaces

We are building a modular ecosystem

pepkit · geofetch · looper · caravel · pypiper · divvy

PEP: Portable Encapsulated Projects

PEP format

Start with a simple CSV with tabular data.

samples.csv

sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz

Add a YAML for project-level data.

project_config.yaml

sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value
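A minimal sketch of the resulting data model, using only the Python standard library: a project is a set of project-level settings plus one record per sample. The helper names `load_samples` and `build_project` are hypothetical, and the YAML file is written as a plain dict here to avoid a third-party parser.

```python
import csv
import io

# Contents of samples.csv from the slide.
SAMPLES_CSV = """sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
"""

# Project-level settings, written as a dict in place of project_config.yaml.
PROJECT_CONFIG = {
    "sample_table": "/path/to/samples.csv",
    "output_dir": "/path/to/output/folder",
    "other_variable": "value",
}


def load_samples(csv_text: str) -> list:
    """Parse the sample table into one dict per sample (row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def build_project(config: dict, csv_text: str) -> dict:
    """Combine project-level settings with the per-sample records."""
    return {"config": config, "samples": load_samples(csv_text)}


if __name__ == "__main__":
    project = build_project(PROJECT_CONFIG, SAMPLES_CSV)
    for sample in project["samples"]:
        print(sample["sample_name"], sample["input_file"])
```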

Add programmatic sample and project modifiers.

Derived attributes

Implied attributes

Subprojects

Derived attributes

Automatically build new sample attributes from existing attributes.

Without derived attribute:



| sample_name   | t    | protocol        | organism | input_file             |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | /path/to/frog0.gz      |
| frog_1h       | 1    | RNA-seq         | frog     | /path/to/frog1.gz      |
| frog_2h       | 2    | RNA-seq         | frog     | /path/to/frog2.gz      |
| frog_3h       | 3    | RNA-seq         | frog     | /path/to/frog3.gz      |

Using derived attribute:



| sample_name   | t    | protocol        | organism | input_file             |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | my_samples             |
| frog_1h       | 1    | RNA-seq         | frog     | my_samples             |
| frog_2h       | 2    | RNA-seq         | frog     | my_samples             |
| frog_3h       | 3    | RNA-seq         | frog     | my_samples             |
| crab_0h       | 0    | RNA-seq         | crab     | your_samples           |
| crab_3h       | 3    | RNA-seq         | crab     | your_samples           |



Project config file:

sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"

{variable} identifies sample annotation columns

Benefit: Enables distributed files, portability
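The derive step can be sketched in a few lines of Python. This is an illustration of the idea, not the pepkit implementation: the function name `derive_attributes` is hypothetical, and it assumes each `{variable}` in a source template names a column of the sample table.

```python
def derive_attributes(samples: list, attributes: list, sources: dict) -> list:
    """Replace source keys in the listed attributes with filled-in templates."""
    derived = []
    for sample in samples:
        new_sample = dict(sample)  # leave the input rows untouched
        for attr in attributes:
            key = new_sample.get(attr)
            if key in sources:
                # {organism}, {t}, ... are filled from the sample's own columns.
                new_sample[attr] = sources[key].format(**sample)
        derived.append(new_sample)
    return derived


if __name__ == "__main__":
    samples = [
        {"sample_name": "frog_0h", "t": 0, "organism": "frog", "input_file": "my_samples"},
        {"sample_name": "crab_3h", "t": 3, "organism": "crab", "input_file": "your_samples"},
    ]
    sources = {
        "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
        "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
    }
    for s in derive_attributes(samples, ["input_file"], sources):
        print(s["sample_name"], s["input_file"])
```

Moving the data then means editing one template in the config rather than every row of the table.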

Implied attributes

Add new sample attributes conditioned on values of existing attributes

Before:



| sample_name   | protocol        | organism | 
| ------------- | :-------------: | -------- | 
| human_1       | RNA-seq         | human    | 
| human_2       | RNA-seq         | human    | 
| human_3       | RNA-seq         | human    | 
| mouse_1       | RNA-seq         | mouse    |

After:



| sample_name   | protocol        | organism | genome | 
| ------------- | :-------------: | -------- | ------ |
| human_1       | RNA-seq         | human    | hg38   |
| human_2       | RNA-seq         | human    | hg38   |
| human_3       | RNA-seq         | human    | hg38   |
| mouse_1       | RNA-seq         | mouse    | mm10   |



Project config file:

sample_modifiers:
  imply:
    - if: 
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10

Benefit: Divides project from sample metadata
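The imply rules above amount to a simple conditional update, sketched here in Python. The function name `imply_attributes` is hypothetical, and the sketch assumes exact matching on a single value per `if` attribute; it is not the pepkit implementation.

```python
def imply_attributes(samples: list, rules: list) -> list:
    """Add the `then` attributes to every sample matching a rule's `if` block."""
    result = []
    for sample in samples:
        new_sample = dict(sample)  # leave the input rows untouched
        for rule in rules:
            # A rule applies when every `if` attribute matches exactly.
            if all(new_sample.get(k) == v for k, v in rule["if"].items()):
                new_sample.update(rule["then"])
        result.append(new_sample)
    return result


if __name__ == "__main__":
    samples = [
        {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
        {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"},
    ]
    rules = [
        {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
        {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
    ]
    for s in imply_attributes(samples, rules):
        print(s["sample_name"], s["genome"])
```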

Subprojects

Define activatable project attributes.

project_modifiers:
  amendments:
    diverse:
      metadata:
        sample_annotation: psa_rrbs_diverse.csv
    cancer:
      metadata:
        sample_annotation: psa_rrbs_intracancer.csv
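Activating a subproject can be thought of as overlaying its block on the base configuration. The sketch below mirrors the key names from the config above; the helper names `deep_merge` and `activate`, the base file name `psa_rrbs.csv`, and the merge semantics are all assumptions for illustration, not the pepkit implementation.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of `base` with `override` applied recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def activate(config: dict, name: str) -> dict:
    """Overlay the named amendment on the base project configuration."""
    amendments = config.get("project_modifiers", {}).get("amendments", {})
    if name not in amendments:
        raise KeyError(f"unknown amendment: {name}")
    return deep_merge(config, amendments[name])


# Hypothetical base project; the amendments copy the slide's config.
CONFIG = {
    "metadata": {
        "sample_annotation": "psa_rrbs.csv",
        "output_dir": "/path/to/output/folder",
    },
    "project_modifiers": {
        "amendments": {
            "diverse": {"metadata": {"sample_annotation": "psa_rrbs_diverse.csv"}},
            "cancer": {"metadata": {"sample_annotation": "psa_rrbs_intracancer.csv"}},
        }
    },
}

if __name__ == "__main__":
    print(activate(CONFIG, "cancer")["metadata"]["sample_annotation"])
```

Only the attributes named in the amendment change; everything else is inherited from the base project.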