Refgenie and Bioconductor

## The problem

Many tools require genome-related assets (like indexes).
How should we organize these on disk?

## A standard organization simplifies tool interface

```
pipeline.py --genome hg38
```

```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
  --tss_annotation path/to/hg38/tss_annotation.bed \
  --ensembl_anno path/to/hg38/ensembl_v86.gtf
```
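As a rough illustration of why a standard layout shortens the interface, the sketch below shows how a tool might derive every asset path from a single `--genome` argument. The layout (`<root>/<genome>/<file>`), the `ASSET_FILES` table, the `--genome-root` option, and the function names are all hypothetical; only the file names come from the example above.

```python
import argparse
from pathlib import Path

# Hypothetical standard layout: <root>/<genome>/<asset file>.
# File names are taken from the slide's example; a real layout would
# likely vary them per genome (e.g. the Ensembl release).
ASSET_FILES = {
    "bowtie2_index": "bowtie2-index",
    "tss_annotation": "tss_annotation.bed",
    "ensembl_anno": "ensembl_v86.gtf",
}


def resolve_assets(genome, root="path/to"):
    """Map each asset name to its expected path under the standard layout."""
    base = Path(root) / genome
    return {name: base / filename for name, filename in ASSET_FILES.items()}


def parse_args(argv=None):
    """Parse the short interface: one genome name instead of one flag per asset."""
    parser = argparse.ArgumentParser(prog="pipeline.py")
    parser.add_argument("--genome", required=True)
    parser.add_argument("--genome-root", default="path/to")  # hypothetical option
    return parser.parse_args(argv)


# Example: the equivalent of `pipeline.py --genome hg38`
args = parse_args(["--genome", "hg38"])
for name, path in resolve_assets(args.genome, args.genome_root).items():
    print(f"{name}\t{path.as_posix()}")
```

The tool only needs to know the genome name; the layout convention supplies the rest.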

## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.

You download a tarball of a standard structure for your genome of interest, then write tools off that.

## The 'central repository' approach is limited

- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular*. No access to individual assets.
- *Not programmatic*. Can't access data/metadata via API.

## Refgenie solves these limitations

- *Two ways to retrieve an asset.*
    - `build` any asset from a recipe.
    - `pull` any individual asset from a server
- *Better discoverability*.
    - `list/listr` shows assets
    - `refgenieserver` is a browseable web interface and API
- *Managed locations*.
    - `seek` returns the local path to assets
    - `add/remove` to manage your own assets

## Refgenie CLI example

```
refgenie pull hg38/fasta
refgenie build hg38/bowtie2_index
refgenie seek hg38/bowtie2_index
```
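A pipeline could resolve asset paths by calling the CLI shown above. This is a minimal sketch, assuming `refgenie` is on the `PATH`, is already configured (for example through the `REFGENIE` environment variable), and prints the asset path on standard output. The helper names `seek_command` and `seek` are hypothetical.

```python
import subprocess


def seek_command(genome, asset):
    """Build the argument list for `refgenie seek <genome>/<asset>`."""
    return ["refgenie", "seek", f"{genome}/{asset}"]


def seek(genome, asset):
    """Run `refgenie seek` and return the path it prints.

    Assumes refgenie is installed and configured; raises
    subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        seek_command(genome, asset),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

With a configured installation, `seek("hg38", "bowtie2_index")` would return the path reported by `refgenie seek hg38/bowtie2_index`.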

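## Using Refgenie from R

```
# Load the refgenconf Python package through reticulate
mod = reticulate::import("refgenconf", convert=FALSE)
# The REFGENIE environment variable points to the genome configuration file
rgc = mod$RefGenConf(Sys.getenv("REFGENIE"))
rgc$pull("hg38", "bowtie2_index", "default")
rgc$seek("hg38", "bowtie2_index")
```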

## Thank You

These slides: databio.org/slides
Refgenie documentation: refgenie.databio.org
Refgenieserver instance: refgenomes.databio.org
GitHub: github.com/databio/refgenie

nsheff · databio.org · nsheffield@virginia.edu