Refgenie and refget

Nathan Sheffield, PhD
www.databio.org/slides

The problem

Many tools require genome-related assets (like indexes).
How should we organize these on disk?
## A standard organization simplifies tool interface

```
pipeline.py --genome hg38
```

```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
  --tss_annotation path/to/hg38/tss_annotation.bed \
  --ensembl_anno path/to/hg38/ensembl_v86.gtf
```
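A minimal sketch of why a standard layout lets a tool accept a single `--genome` argument. The `STANDARD_LAYOUT` mapping, the `resolve_assets` helper, and the `path/to` root are hypothetical names that mirror the example paths above; they are not refgenie's actual folder convention.

```python
from pathlib import Path

# Hypothetical layout: <genome_root>/<genome>/<relative asset path>.
# Asset names and file names are copied from the slide's example only.
STANDARD_LAYOUT = {
    "bowtie2_index": "bowtie2-index",
    "tss_annotation": "tss_annotation.bed",
    "ensembl_anno": "ensembl_v86.gtf",
}


def resolve_assets(genome: str, genome_root: str = "path/to") -> dict:
    """Map each asset name to its expected path for one genome."""
    base = Path(genome_root) / genome
    return {name: (base / rel).as_posix() for name, rel in STANDARD_LAYOUT.items()}


assets = resolve_assets("hg38")
print(assets["bowtie2_index"])
```

With a shared layout, the pipeline only needs the genome name; every asset path follows from it.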
## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer

iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*. You download a tarball with a standard structure for your genome of interest, then write tools against that structure.
## The 'central repository' approach is limited

- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular.* No access to individual assets.
- *Not programmatic.* Can't access data/metadata via API.
## Refgenie solves these limitations

- *Two ways to retrieve an asset.*
  - `build` any asset from a recipe
  - `pull` any individual asset from a server
- *Better discoverability.*
  - `list/listr` shows assets
  - `refgenieserver` is a browseable web interface and API
- *Managed locations.*
  - `seek` returns the local path to assets
  - `add/remove` to manage your own assets

Refgenie consists of 3 components

Refgenie splits tasks between CLI and server

## Refgenie CLI example

[http://refgenie.databio.org](http://refgenie.databio.org)
## Refgenie implements a refget-like (collection) interface

You first need the `fasta` asset:

```
refgenie pull -g hg38 -a fasta
```

Then you can use `getseq`:

```
refgenie getseq -g hg38 -l chr1:50000-50400
AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT
```

The build/pull method needs provenance checks

Asset provenance:

Genome provenance:

Collection checksums solve genome provenance

Recursive checksums have advantages

Allows getting content list only

Preserves chromosome order

Re-uses the checksum function

Duplicates are stored only once
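The four properties above can be illustrated with a small sketch of a recursive (collection) checksum. This is an assumption-laden toy, not the linked implementation: the in-memory `store` dict, the tab-separated `name<TAB>checksum` line format, and the use of MD5 are all choices made for illustration.

```python
import hashlib

store = {}  # hypothetical in-memory database: checksum -> content


def checksum(content: str) -> str:
    # Assumption: MD5 hex digest; any stable digest function would do.
    return hashlib.md5(content.encode("utf-8")).hexdigest()


def insert(content: str) -> str:
    """Store content under its checksum; identical content shares one key."""
    digest = checksum(content)
    store[digest] = content
    return digest


def insert_collection(named_seqs) -> str:
    """Store each sequence, then store the ordered name/checksum listing.

    The listing is digested with the same checksum function as the sequences.
    """
    lines = [f"{name}\t{insert(seq)}" for name, seq in named_seqs]
    return insert("\n".join(lines))


def content_list(collection_digest: str):
    """One lookup level: names and sequence checksums, without the sequences."""
    return [tuple(line.split("\t")) for line in store[collection_digest].split("\n")]


def resolve(collection_digest: str):
    """Two lookup levels: names paired with the actual sequences, in order."""
    return [(name, store[digest]) for name, digest in content_list(collection_digest)]


genome_a = insert_collection([("chr1", "ACGT"), ("chr2", "TTGA")])
genome_b = insert_collection([("chr2", "TTGA"), ("chr1", "ACGT")])

print(genome_a != genome_b)  # chromosome order changes the collection checksum
print(len(store))            # 2 sequences + 2 collections; no duplicate sequences
print(resolve(genome_a))
```

Both toy genomes share the same two sequence records, so only the two short listings differ in the store.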

Go one step further for...

It keeps going... and going...

## Final thoughts

- Implementation of lookup algorithm: [github gist](https://gist.github.com/nsheff/3bbb96a6876234e758895e4a35c03dc7#file-refget-py-L51-L68)
- If refget hosted collection checksums, then given any genome checksum, I could re-build the fasta asset for that genome automatically
- Refgenieserver could provide a limited refget database for the genomes it has archived
- Even without a central database, the genome checksums ensure that assets are built from the same base
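To make the second point concrete, here is a hedged sketch of rebuilding a FASTA file from a single genome checksum. It is not the code from the gist: the `database` dict stands in for a refget-style service, and the `name<TAB>checksum` listing format, MD5 digest, and `rebuild_fasta` helper are assumptions made for illustration.

```python
import hashlib


def md5(text: str) -> str:
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Hypothetical stand-in for a database that hosts collection checksums.
database = {}
toy_genome = [("chr1", "ACGTACGTAC"), ("chr2", "TTGACC")]
for _, seq in toy_genome:
    database[md5(seq)] = seq
listing = "\n".join(f"{name}\t{md5(seq)}" for name, seq in toy_genome)
genome_checksum = md5(listing)
database[genome_checksum] = listing


def rebuild_fasta(genome_checksum: str, db: dict, width: int = 60) -> str:
    """Resolve a genome checksum to FASTA text via two levels of lookup."""
    records = []
    for line in db[genome_checksum].split("\n"):
        name, seq_checksum = line.split("\t")
        seq = db[seq_checksum]
        # Wrap sequence lines at a fixed width, as FASTA files commonly do.
        wrapped = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(f">{name}\n{wrapped}")
    return "\n".join(records) + "\n"


print(rebuild_fasta(genome_checksum, database))
```

Only the genome checksum is passed in; the sequence names, their order, and the sequences themselves all come back from the lookups.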

Thank You


nsheff · databio.org · nsheffield@virginia.edu