Many tools require genome-related assets (like indexes).
How should we organize these on disk?
## A standard organization simplifies tool interface
```
pipeline.py --genome hg38
```
```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
--tss_annotation path/to/hg38/tss_annotation.bed \
--ensembl_anno path/to/hg38/ensembl_v86.gtf
```
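A minimal sketch of why the first interface is possible: if every genome folder follows the same layout, a tool can derive all asset paths from the genome name alone. The layout, the `ASSETS` table, the `--genome-root` option, and the function names below are assumptions made for illustration, not the interface of any real pipeline.

```python
import argparse
from pathlib import Path

# Hypothetical standard layout: <genome_root>/<genome>/<asset>
ASSETS = {
    "bowtie2_index": "bowtie2-index",
    "tss_annotation": "tss_annotation.bed",
    "ensembl_anno": "ensembl_v86.gtf",
}

def resolve_assets(genome, genome_root="path/to"):
    """Map each asset name to its conventional path for one genome."""
    base = Path(genome_root) / genome
    return {name: base / relpath for name, relpath in ASSETS.items()}

def parse_args(argv=None):
    """Parse the single --genome option instead of one option per asset."""
    parser = argparse.ArgumentParser(description="Toy pipeline entry point")
    parser.add_argument("--genome", required=True, help="genome name, e.g. hg38")
    parser.add_argument("--genome-root", default="path/to",
                        help="folder that holds all genome folders (assumed)")
    return parser.parse_args(argv)

# Equivalent to running: pipeline.py --genome hg38
args = parse_args(["--genome", "hg38"])
for name, path in resolve_assets(args.genome, args.genome_root).items():
    print(f"{name}\t{path.as_posix()}")
```

The user supplies one argument, and the convention supplies the rest.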
## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer
iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.
You download a tarball with a standard structure for your genome of interest, then write tools against that structure.
## The 'central repository' approach is limited
- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular.* No access to individual assets.
- *Not programmatic.* Can't access data/metadata via API.
## Refgenie solves these limitations
- *Two ways to retrieve an asset.*
  - `build` any asset from a recipe
  - `pull` any individual asset from a server
- *Better discoverability.*
  - `list`/`listr` show local and remote assets
  - `refgenieserver` is a browsable web interface and API
- *Managed locations.*
  - `seek` returns the local path to assets
  - `add`/`remove` to manage your own assets
## Refgenie consists of 3 components
## Refgenie splits tasks between CLI and server
## Refgenie CLI example
[http://refgenie.databio.org](http://refgenie.databio.org)
## Refgenie implements a refget-like interface for sequence collections
You first have to have the `fasta` asset:
```
refgenie pull -g hg38 -a fasta
```
Then you can use `getseq`:
```
refgenie getseq -g hg38 -l chr1:50000-50400
AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT
```
## The build/pull method needs provenance checks
Asset provenance:
Genome provenance:
## Collection checksums solve genome provenance
## Recursive checksums have advantages
- Allows getting content list only
- Preserves chromosome order
- Re-uses the checksum function
- Duplicates are stored only once
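The properties listed above can be illustrated with a small sketch of a two-level (recursive) checksum. Everything here is an assumption made for illustration: the use of MD5, the `name:digest` entry format, the comma delimiter, and the in-memory `store` dict. It is not the algorithm from refget or refgenie.

```python
import hashlib

def checksum(text):
    """Digest a string; MD5 is an arbitrary choice for this sketch."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def collection_checksum(sequences, store):
    """Digest an ordered {name: sequence} mapping, filling `store` as we go.

    Level 1: each sequence is digested and stored under its own digest.
    Level 2: the ordered list of "name:digest" entries is digested with the
    same function, giving a single checksum for the whole collection.
    """
    entries = []
    for name, seq in sequences.items():  # dicts preserve insertion order
        seq = seq.upper()
        seq_digest = checksum(seq)
        store[seq_digest] = seq  # identical sequences share one key
        entries.append(f"{name}:{seq_digest}")
    content = ",".join(entries)
    coll_digest = checksum(content)
    store[coll_digest] = content  # content list, retrievable without sequences
    return coll_digest

store = {}
genome_a = {"chr1": "ACGTACGT", "chr2": "TTGGCCAA"}
genome_b = {"chr2": "TTGGCCAA", "chr1": "ACGTACGT"}  # same sequences, reordered

digest_a = collection_checksum(genome_a, store)
digest_b = collection_checksum(genome_b, store)

print("order changes the digest:", digest_a != digest_b)
print("entries in store:", len(store))
print("content list for genome_a:", store[digest_a])
```

In this toy setup the two genomes share their sequence entries in `store`, so only the two short content lists are added on top of the two sequences.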
## Go one step further for...
## It keeps going... and going...
## Final thoughts
- Implementation of the lookup algorithm: [GitHub gist](https://gist.github.com/nsheff/3bbb96a6876234e758895e4a35c03dc7#file-refget-py-L51-L68)
- If refget hosted collection checksums, then given any genome checksum, I could re-build the fasta asset for that genome automatically
- Refgenieserver could provide a limited refget database for the genomes it has archived
- Even without a central database, the genome checksums ensure that assets are built from the same base
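As a rough illustration of rebuilding a FASTA file from a genome checksum, the sketch below expands a collection digest through a toy key-value database. The database contents, the MD5 digest, the `name:digest` entry format, and the `rebuild_fasta` helper are all hypothetical; the linked gist is the place to look for the actual lookup implementation.

```python
import hashlib

def checksum(text):
    """Digest a string; MD5 is an arbitrary choice for this sketch."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def rebuild_fasta(coll_digest, store):
    """Expand a collection digest into FASTA text.

    The first lookup returns the ordered content list; each entry then
    needs one more lookup to fetch the sequence itself.
    """
    records = []
    for entry in store[coll_digest].split(","):
        name, seq_digest = entry.rsplit(":", 1)
        records.append(f">{name}\n{store[seq_digest]}")
    return "\n".join(records) + "\n"

# Toy database standing in for a hosted checksum service (assumed).
seqs = {"chr1": "ACGTACGT", "chr2": "TTGGCCAA"}
store = {checksum(s): s for s in seqs.values()}
content = ",".join(f"{name}:{checksum(s)}" for name, s in seqs.items())
genome_digest = checksum(content)
store[genome_digest] = content

fasta = rebuild_fasta(genome_digest, store)
print(fasta, end="")
```

Only the genome digest is needed as input, which is the point of the second bullet above.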