Refgenie and Bioconductor
## The problem
Many tools require genome-related assets (like indexes).
How should we organize these on disk?
## A standard organization simplifies tool interfaces
With a standard layout, a tool needs only the genome name:
```
pipeline.py --genome hg38
```
Without one, every asset path must be passed explicitly:
```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
  --tss_annotation path/to/hg38/tss_annotation.bed \
  --ensembl_anno path/to/hg38/ensembl_v86.gtf
```
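To make the contrast concrete, here is a minimal Python sketch of how a tool might turn a single genome name into asset paths when the on-disk layout is standardized. The `ASSET_LAYOUT` mapping, the `resolve_assets` function, and the `path/to` root are hypothetical names chosen for illustration; the relative paths simply mirror the ones in the example above and are not a real standard.

```python
from pathlib import Path

# Hypothetical layout: asset name -> path relative to the genome folder.
# These relative paths mirror the example above; they are assumptions.
ASSET_LAYOUT = {
    "bowtie2_index": "bowtie2-index",
    "tss_annotation": "tss_annotation.bed",
    "ensembl_anno": "ensembl_v86.gtf",
}


def resolve_assets(genome, root="path/to"):
    """Map a genome name to asset paths under the assumed standard layout."""
    genome_dir = Path(root) / genome
    return {name: genome_dir / rel for name, rel in ASSET_LAYOUT.items()}


if __name__ == "__main__":
    # Print every asset path a pipeline would need for hg38.
    for name, path in resolve_assets("hg38").items():
        print(f"{name}: {path}")
```

Because the layout lives in one place, the tool's interface can shrink to a single `--genome` argument, at the cost of every user agreeing on that layout.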
## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer
iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.
You download a tarball with a standard structure for your genome of interest, then write tools against that structure.
## The 'central repository' approach is limited
- *Not scripted*. No iGenomes for an arbitrary genome/asset.
- *Not modular*. No access to individual assets.
- *Not programmatic*. Can't access data/metadata via API.
## Refgenie solves these limitations
- *Two ways to retrieve an asset*.
  - `build` any asset from a recipe.
  - `pull` any individual asset from a server.
- *Better discoverability*.
  - `list`/`listr` show local and remote assets.
  - `refgenieserver` is a browsable web interface and API.
- *Managed locations*.
  - `seek` returns the local path to an asset.
  - `add`/`remove` manage your own assets.
## Refgenie CLI example
```
refgenie pull hg38/fasta
refgenie build hg38/bowtie2_index
refgenie seek hg38/bowtie2_index
```
## Using Refgenie from R
```
mod = reticulate::import("refgenconf", convert=FALSE)
rgc = mod$RefGenConf(Sys.getenv("REFGENIE"))
rgc$pull("hg38", "bowtie2_index", "default")
rgc$seek("hg38", "bowtie2_index")
```
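The CLI addresses an asset as a single `genome/asset` string, while the R call above passes the genome, asset, and tag as separate arguments. A small helper can bridge the two forms. The sketch below is in Python and is only an illustration: `split_registry_path` is a hypothetical name, it handles just the simplified `genome/asset[:tag]` form, and the fallback tag `"default"` is taken from the R example rather than from any library.

```python
def split_registry_path(registry_path, default_tag="default"):
    """Split 'genome/asset[:tag]' into a (genome, asset, tag) tuple.

    Hypothetical helper: handles only this simplified form.
    """
    genome, sep, rest = registry_path.partition("/")
    asset, _, tag = rest.partition(":")
    if not sep or not genome or not asset:
        raise ValueError(
            f"expected 'genome/asset[:tag]', got {registry_path!r}"
        )
    # Fall back to the assumed default tag when none is given.
    return genome, asset, tag or default_tag


if __name__ == "__main__":
    print(split_registry_path("hg38/bowtie2_index"))
    # ('hg38', 'bowtie2_index', 'default')
```

The resulting tuple lines up with the three positional arguments used in the `pull` call above.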