RefGenie: Standardized reference genome folder structure for NGS pipelines
NGS pipelines typically require access to reference genome files for alignment. Usually, these files have to be indexed for the aligner, and each aligner has its own indexing method. The result is a set of related folders and files that all belong to a given reference and must be organized somehow on disk, and each lab typically organizes them differently (different folder and file names).
Because pipelines point to these reference files, sharing pipelines across environments and labs is hard: each one has its own way of organizing index files and passing them to the pipelines.
If everyone built pipelines to follow a standard structure for reference genomes, then a pipeline could just take a string describing the genome (e.g. hg38) and would know how to find the indexes it needs for that genome. Pipeline interfaces would be simplified (no need to pass each index directly), and switching environments would be easy: you would just point at a different parent folder, and the pipeline would still know how to find indexes of all types.
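To make the idea concrete, here is a minimal Python sketch of how a pipeline might turn a genome name into an index location. The layout it assumes (parent folder, then genome name, then one subfolder per index type) and the names `index_dir` and `require_index` are hypothetical, chosen only for illustration; they are not taken from RefGenie or iGenomes.

```python
from pathlib import Path


def index_dir(genomes_root, genome, index_type):
    """Build the path where an index is expected to live.

    Assumed (hypothetical) layout: <genomes_root>/<genome>/<index_type>/
    """
    return Path(genomes_root) / genome / index_type


def require_index(genomes_root, genome, index_type):
    """Return the index folder, or raise if it does not exist on disk."""
    path = index_dir(genomes_root, genome, index_type)
    if not path.is_dir():
        raise FileNotFoundError(f"No {index_type} index for {genome} at {path}")
    return path
```

With a convention like this, a pipeline only needs two settings, the parent folder and the genome name. Moving to another environment means changing the parent folder and nothing else.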
Illumina’s iGenomes project sought to solve this problem by providing a standardized reference genome file structure, along with indexes you can download for many public genomes.
However, there remains a problem: what if I want to run my pipeline on a genome that is not included in the iGenomes release? What if I want to run my pipeline on a custom genome, a spike-in control sample, a decoy sequence, or a contaminant assembly? These references would also need indexes for the aligners, and I would need these genomes organized in the same way to fit with the pipeline. It would be much more convenient to have a script that would produce this folder structure given an input fasta file, which I can just run on any custom genome I want.
Enter RefGenie. RefGenie is a Python script that creates a standardized folder structure for reference genome files and indexes. The minimal input is just a fasta file. Make sure the indexers are installed for any indexes you want generated, and you can then produce standardized reference genome folders for every genome you want to align against.
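As a rough illustration of the general idea (not RefGenie's actual code or folder layout), the Python sketch below sets up a skeleton genome folder from a fasta file. The function name `init_genome_folder`, the `<genome>.fa` file name, and the one-subfolder-per-index layout are all assumptions made for this example, and the sketch only creates empty index folders; it does not run any indexer.

```python
import shutil
from pathlib import Path


def init_genome_folder(fasta, genomes_root, genome, index_types=("bowtie2",)):
    """Create a skeleton folder for one genome (hypothetical layout).

    Copies the fasta file to <genomes_root>/<genome>/<genome>.fa and creates
    one empty subfolder per requested index type. Running the indexers
    themselves is left out of this sketch.
    """
    genome_dir = Path(genomes_root) / genome
    genome_dir.mkdir(parents=True, exist_ok=True)

    # Keep a copy of the input sequence under a predictable name.
    shutil.copyfile(fasta, genome_dir / f"{genome}.fa")

    # One folder per index type, to be filled by the matching indexer.
    for index_type in index_types:
        (genome_dir / index_type).mkdir(exist_ok=True)

    return genome_dir
```

Because the same function applies to any fasta file, a custom genome, spike-in, or decoy sequence would end up with the same structure as a public genome.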
RefGenie uses Pypiper. You can find source code and instructions for RefGenie at GitHub. Contributions, pull requests, and comments welcome!