The default path for downloading SRA data
The SRA (Sequence Read Archive) is a public repository of DNA sequence data. When you run sam-dump or fastq-dump from the sratoolkit, it first uses prefetch to download a “temporary” .sra file, which it then converts to either sam or fastq format. By default, sratoolkit downloads .sra files to a subfolder of your home folder ($HOME/ncbi/public/sra). This is a problem because 1) your home folder may have a space quota, and 2) the downloaded files are of no use to others in the group, since they sit in your personal space. It’s better to tell sratoolkit to use a shared filesystem. The advertised way to change the default path uses a graphical interface launched with vdb-config -i, which is not ideal. Luckily, all this GUI does is add a setting to a config file that sratoolkit reads, so we can bypass the GUI completely and edit the config file directly. Here’s how to change your default data storage path (this assumes $DATA holds the path to the shared filesystem):
echo "/repository/user/main/public/root = \"$DATA\"" > "$HOME/.ncbi/user-settings.mkfg"
Now, the huge .sra files will be stored in our shared, huge filesystem instead of in your home directory.
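One caveat with the one-line command above: the `>` redirect replaces the whole config file, so any other settings already in it are lost, and the command fails if $HOME/.ncbi does not exist yet. A more cautious variant might look like the sketch below. The function name set_sra_root and the .bak backup file are made up for illustration and are not part of sratoolkit; the sketch also assumes each setting sits on its own line in the form key = "value", as in the command above.

```shell
# Hypothetical helper (not part of sratoolkit):
#   set_sra_root DATA_DIR [CONFIG_FILE]
# Sets the download root while keeping any other settings in the file.
set_sra_root() {
    data_dir=$1
    config=${2:-$HOME/.ncbi/user-settings.mkfg}
    key="/repository/user/main/public/root"

    if [ -z "$data_dir" ]; then
        echo "usage: set_sra_root DATA_DIR [CONFIG_FILE]" >&2
        return 1
    fi

    # Create the config directory if it is missing.
    mkdir -p "$(dirname "$config")"

    # Back up the existing file, then copy over every line except an
    # older value of the same key. grep exits non-zero when it prints
    # nothing, hence the "|| true".
    if [ -f "$config" ]; then
        cp "$config" "$config.bak"
        grep -v "^$key[[:space:]]*=" "$config.bak" > "$config" || true
    fi

    # Append the new setting in the same key = "value" form.
    printf '%s = "%s"\n' "$key" "$data_dir" >> "$config"
}
```

Calling set_sra_root "$DATA" would then update the default config file in place, and the previous version stays available as user-settings.mkfg.bak.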
A second thing to keep in mind: these .sra files aren’t really temporary; there is no system in place to delete them after a time. They simply accumulate as you download more data from SRA, taking up more and more space until you delete them. Once you’ve converted to sam, bam, or fastq format, the .sra files are no longer needed and can, in principle, be purged.
We can delete all such .sra files that have not been accessed in the past year like this:
find "$DATA/sra" -depth -type f -atime +365 -delete
Source: Thanks to piet in this question.
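Since -delete cannot be undone, it may be worth previewing what the find command above would remove before running it. The sketch below is one way to do that. The function name list_stale_sra is made up for illustration, and the -name '*.sra' filter is an extra restriction that the original command does not have. Also note that -atime relies on the filesystem recording access times, so on a filesystem mounted with noatime the results may not reflect when a file was last actually read.

```shell
# Hypothetical helper: list_stale_sra SRA_DIR [DAYS]
# Prints .sra files not accessed in more than DAYS days (default 365).
# It only lists files; nothing is deleted.
list_stale_sra() {
    sra_dir=$1
    days=${2:-365}
    find "$sra_dir" -type f -name '*.sra' -atime +"$days" -print
}
```

Running list_stale_sra "$DATA/sra" prints the candidates one per line. If the list looks right, the find command with -delete can be run afterwards.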