The curse of enormity

In machine learning, the curse of dimensionality refers to how geometric concepts that make sense in 3 dimensions (like distance) become unintuitive in high dimensional spaces. Analogously, I present the curse of enormity: storage habits that work for small datasets will break down in the world of big data. On your laptop or desktop workstation, you can afford to have a single hunk of disk space for everything. But when the size of data becomes large, it becomes inefficient to use a single class of disk storage, and it becomes imperative to match different storage systems to different data types.

In every professional large-scale genomic analysis environment, disk space is a major logistical challenge. The typical academic approach to managing data is ad hoc: we stick data in the most convenient place at the time it arrives so we can get on with the time-sensitive analysis as quickly as possible. This works fine when data are reasonably small, but it quickly leads to expensive problems with data organization if the size increases dramatically, which tends to happen in genomics. The result is a hodgepodge of data scattered in random folders across random filesystems, ill-matched and inefficient in the long term. I have learned by experience that disk space is among the most limited resources, and that careful organization is the exception rather than the rule; but a careful approach can mitigate these challenges. Two key challenges arise in the data-intensive genomics research environment:

Balancing cost and disk features

It doesn’t make sense to use expensive high-performance enterprise disks for huge long-term archival data that is rarely accessed. It also doesn’t make sense to use slow, cheap filesystems for storing and processing terabytes of active data. Instead, we need to use fast disks for high-IO tasks and cheap disks for archiving. And because mirrored storage costs twice as much as unmirrored storage, it saves resources to categorize big data by its priority for mirroring. We mirror the raw data we can’t afford to lose, but data we can regenerate is less critical. So, for big data, disk space is complicated by the different cost and performance characteristics of different types of disk hardware. This means we will by necessity have data spread across several filesystems with different strengths and costs. We must learn to deal with this effectively.

Data movement

Because our data will be ever-increasing, data will by necessity end up moving around. We buy new disks, and disk technology changes. We move current projects into the archive, download new data, produce new data, and derive processed data from raw data. Moving big data around is also non-trivial; it can take hours or days to move files around where connections are slow, and disk characteristics again play a role. Altogether, this means a lot of data movement, which can be a challenge. We must think of a data location as a temporary path. We must structure data and build code in a way that can easily adapt to movement.

Overcoming the curse

The ultimate solution to this problem probably will involve some type of object store with per-object feature capability. But this is an evolving concept that still hasn’t become mainstream in genomics. In the meantime, both of these challenges can be mitigated somewhat in the current environment with a simple concept: environment variables. It’s really useful to define a set of environment variables that point to parent directories for data of different classes. If done thoughtfully, environment variables solve both issues: they make it easier to match different hardware to different data needs, and they adapt well to changes. This way, when disks get moved around or upgraded, changing pipelines is as easy as adjusting a single environmental pointer.

So: please don’t use hard-coded paths to root folders; instead, rely on static environment variables. They also have two side benefits: code and annotation become more portable (they can be moved to another compute environment easily), and the variables make it easier to navigate around when you’re exploring in the shell.
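
As a minimal sketch of what this looks like inside a pipeline script, the Python helper below builds every path from an environment variable instead of a hard-coded root. The function names `data_root` and `project_path` and the variable name `PROCESSED` are hypothetical, chosen only for illustration.

```python
import os
from pathlib import Path


def data_root(var_name: str) -> Path:
    """Return the parent directory that an environment variable points to."""
    value = os.environ.get(var_name)
    if not value:
        # Fail loudly rather than silently falling back to a hard-coded path.
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return Path(value)


def project_path(var_name: str, *parts: str) -> Path:
    """Build a path below a data root without hard-coding the root itself."""
    return data_root(var_name).joinpath(*parts)
```

With a variable such as `PROCESSED` set in the shell, a call like `project_path("PROCESSED", "my_project", "aligned.bam")` resolves against whatever directory the variable currently points to, so moving the data only requires changing the variable.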

Storage classes and variable ideas

To match data types with filesystem features, we need to classify the storage possibilities. Following the example above, we can classify data by whether or not it needs fast disks for high IO. Since more features mean more cost, we’d like to get filesystems with the minimum features our data needs. However, there’s also a cost to dividing data across different filesystems, and it only makes sense to do that where data are large enough to warrant it. On a laptop, data are small enough that we can afford to buy a nice disk even for less critical data; the benefits of dividing data by features pay off more as data size grows. Therefore, the bigger the data are, the more it makes sense to add new storage products with different features. Employing different filesystems with different characteristics only makes sense to a point, so in any real situation the answer is going to be a tradeoff that depends on the size of the data and how different its needs are.

A key requirement is therefore going to be an analysis of what data I have, and what my storage needs really are for that data. In the process of classifying the data I deal with on a daily basis, I’ve come up with a set of attributes of data and disk that I need to align. Here they are, posed as the questions I ask of my data as I try to categorize it and determine what kind of storage it should use:

mirror: Should the files be stored in more than 1 copy?
IO: How much read/write is there?
cluster: Does the filesystem need to be mountable/visible on a cluster?
access: Does the filesystem need high-speed interconnect to compute?
size: How much data will be in this class?
export: Will this data need to be served externally via http or other protocols?
block: Does the system need to be POSIX-y (block storage), or is object storage acceptable?

Now, given these options, the number of possible combinations (and therefore potential filesystems) becomes impossibly large, so we must consider data as classes that share some of these characteristics to balance cost and features. For my purposes, I’ve divided all the data I typically deal with into these 5 classes:

class	description	mirror?	IO	cluster?	access	size	export?	block?
1	mirrored network	yes	low	ideally	high	med	no	yes
2	unmirrored network, high	no	high	required	highest	large	no	yes
3	unmirrored network, low	no	low	ideally	high	med	no	yes
4	unmirrored exported storage	no	low	no	low	med	yes	yes
5	long-term glacier archive	yes	low	no	low	large	no	no
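
To make the table easier to query from code, here is one possible encoding of it as a small Python data structure. The values are transcribed from the table above; the names `StorageClass`, `STORAGE_CLASSES`, and `matching_classes` are assumptions made for this sketch, not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageClass:
    number: int
    description: str
    mirror: bool   # stored in more than one copy?
    io: str        # "low" or "high"
    cluster: str   # "required", "ideally", or "no"
    access: str    # "low", "high", or "highest"
    size: str      # "med" or "large"
    export: bool   # served externally?
    block: bool    # True = needs POSIX-style block storage


# One entry per row of the table above.
STORAGE_CLASSES = [
    StorageClass(1, "mirrored network", True, "low", "ideally", "high", "med", False, True),
    StorageClass(2, "unmirrored network, high", False, "high", "required", "highest", "large", False, True),
    StorageClass(3, "unmirrored network, low", False, "low", "ideally", "high", "med", False, True),
    StorageClass(4, "unmirrored exported storage", False, "low", "no", "low", "med", True, True),
    StorageClass(5, "long-term glacier archive", True, "low", "no", "low", "large", False, False),
]


def matching_classes(**required):
    """Return the classes whose attributes equal every requested value."""
    return [
        sc for sc in STORAGE_CLASSES
        if all(getattr(sc, name) == value for name, value in required.items())
    ]
```

For example, `matching_classes(mirror=True, block=True)` narrows the choice to class 1, which is where data that must be mirrored and POSIX-accessible would go.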

Some examples of how I classify data of different types:

Class 1: Raw data, shared resources, software

Class 2: processing or processed data (output from my tools)

Class 3: $HOME folders, code, downloaded data

Class 4: web sites, any published data

Class 5: archived data

Environment variables

I then define different environment variables to refer to folders on these filesystems. This is not a one-to-one relationship; I have multiple environment variables pointing at subfolders within one filesystem if that makes sense. By using the environment variable, I’m insulated from data movement; nothing changes except my variable assignment. This gives me portability and flexibility for future changes. At first, several of these classes may live on the same filesystem, but as they grow, they will be easy to divide if they are kept separate from the start.
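
The sketch below shows one way to keep these assignments in a single place and generate the shell `export` lines from them. Every mount point and variable name here (`/mnt/shared`, `RAWDATA`, `PROCESSED`, and so on) is hypothetical. Note that classes 1 and 3 start out on the same filesystem, and that two variables point into class 1.

```python
import shlex

# Hypothetical mount points, one per storage class.
# Classes 1 and 3 share a filesystem for now.
CLASS_MOUNTS = {
    1: "/mnt/shared",
    2: "/mnt/fast",
    3: "/mnt/shared",
}

# Hypothetical variable names, each mapped to (storage class, subfolder).
VARIABLES = {
    "RAWDATA": (1, "raw"),
    "RESOURCES": (1, "resources"),
    "PROCESSED": (2, "processed"),
    "CODE": (3, "code"),
}


def export_lines(class_mounts, variables):
    """Return shell 'export' lines, one per variable, sorted by name."""
    lines = []
    for name, (storage_class, subfolder) in sorted(variables.items()):
        path = f"{class_mounts[storage_class].rstrip('/')}/{subfolder}"
        lines.append(f"export {name}={shlex.quote(path)}")
    return lines


if __name__ == "__main__":
    # Print lines suitable for pasting into a shell startup file.
    print("\n".join(export_lines(CLASS_MOUNTS, VARIABLES)))
```

If class 1 later moves to its own mirrored filesystem, only its entry in `CLASS_MOUNTS` changes; the variable names used by pipelines stay the same.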