The SRA (Sequence Read Archive) is a public repository of DNA sequence data. When you run fastq-dump from the sratoolkit, it first uses prefetch to download a “temporary” .sra file, which it then converts to fastq format. By default, sratoolkit downloads .sra files to a subfolder of your home folder ($HOME/ncbi/public/sra). This is a problem because 1) your home folder may have a space quota, and 2) the downloaded files won’t be useful to others in the group, since they’re in your personal space. It’s better to tell sratoolkit to use a shared filesystem. The advertised way to change the default path uses a graphical interface launched with vdb-config -i, which is not ideal. Luckily, all this GUI does is add a setting to a config file that sratoolkit reads, so we can bypass the GUI completely and edit the config file directly. Here’s how to change your default data storage path to the shared location in $DATA:
mkdir -p "$HOME/.ncbi"
echo "/repository/user/main/public/root = \"$DATA\"" > "$HOME/.ncbi/user-settings.mkfg"
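Note that the redirect above replaces the whole config file, so any other settings already stored in it are lost. Here is a minimal sketch of a gentler alternative, assuming the config file keeps the one-setting-per-line `key = "value"` layout shown above. The helper name `set_sra_root` is hypothetical and not part of sratoolkit; it rewrites only the one key and leaves the rest of the file alone.

```shell
#!/usr/bin/env bash
# set_sra_root CONFIG_FILE NEW_ROOT
# Hypothetical helper: sets /repository/user/main/public/root in CONFIG_FILE
# and keeps every other line of the file unchanged.
set_sra_root() {
    local config_file="$1" new_root="$2"
    local key="/repository/user/main/public/root"
    local tmp

    # Make sure the config directory and file exist.
    mkdir -p "$(dirname "$config_file")"
    touch "$config_file"

    # Copy every line except an existing definition of this key.
    # grep exits non-zero when nothing is copied, which is fine here.
    tmp="$(mktemp)"
    grep -v -F -- "$key = " "$config_file" > "$tmp" || true

    # Append the new definition and swap the file into place.
    printf '%s = "%s"\n' "$key" "$new_root" >> "$tmp"
    mv "$tmp" "$config_file"
}

# Example (assumes $DATA points to the shared filesystem):
# set_sra_root "$HOME/.ncbi/user-settings.mkfg" "$DATA"
```

Running the helper twice with different paths leaves a single line for the key, holding the most recent value.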
Now, the huge .sra files will be stored on our large shared filesystem instead of in your home directory.
A second thing to keep in mind: these .sra files aren’t really temporary; nothing deletes them after a set time. They will keep accumulating and taking up space as you download more files from SRA, until you delete them yourself. Once you’ve converted to fastq format, the .sra files are no longer needed and can, in principle, be purged.
We can delete all such .sra files that have not been accessed in the past year like this:
find "$DATA/sra" -depth -type f -atime +365 -delete
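Since -delete cannot be undone, it may be worth previewing the matches first. The sketch below is one way to do that; the helper name `list_stale_sra` is hypothetical, and the `-name '*.sra'` filter is an added assumption that only files ending in .sra should be considered. Keep in mind that access times are only as reliable as the filesystem makes them: on mounts using options such as noatime, they may not reflect real usage.

```shell
#!/usr/bin/env bash
# list_stale_sra DIR [DAYS]
# Hypothetical helper: prints .sra files under DIR whose last access was
# more than DAYS days ago (default 365). Nothing is deleted.
list_stale_sra() {
    local dir="$1" days="${2:-365}"
    find "$dir" -type f -name '*.sra' -atime +"$days" -print
}

# Example (assumes $DATA points to the shared filesystem):
# list_stale_sra "$DATA/sra"
```

If the printed list looks right, the same find expression with -delete in place of -print removes those files.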
Source: Thanks to piet in this question.