Many common bioinformatics tasks are embarrassingly parallel, and there are many ways to parallelize them. How you choose to parallelize affects both performance and developer cost, and the best choice in practice depends on the specifics of the project. Implementing parallelism usually requires significant developer resources, and in my opinion those resources are not always well spent. This post distills my experience with parallelism in bioinformatics into a few useful concepts to guide how to employ it for the greatest benefit, in terms of both computational efficiency and respect for developer time.
To set the context, the problem we want to solve is generic: run a set of biological samples through a pipeline. By pipeline I mean a series of analysis steps; for example, DNA alignment, peak calling, quality control, what have you. Of course, the number and size of the samples vary by project, as do the complexity and resource cost of the analysis steps. In a parallel-naive project, we simply run each sample through each task sequentially, but this quickly becomes impractical as both sample count and pipeline complexity increase. At some point the project becomes sufficiently complex that it makes sense to think about parallelism. How should we parallelize?
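To make that baseline concrete, here is a minimal sketch of the parallel-naive approach, with a stand-in `run_task` function (hypothetical; a real pipeline would shell out to actual tools) and every sample passing through every task in order:

```python
def run_task(task, sample):
    # Stand-in for shelling out to a real tool (aligner, peak caller, ...).
    return f"{task}({sample})"

samples = ["sample1", "sample2", "sample3"]
tasks = ["align", "filter", "call_peaks"]

log = []
for sample in samples:      # outer loop over samples
    for task in tasks:      # inner loop over pipeline steps
        log.append(run_task(task, sample))

print(log[:3])
# → ['align(sample1)', 'filter(sample1)', 'call_peaks(sample1)']
```

Total runtime here grows as samples × tasks, which is exactly what makes the sequential approach impractical at scale.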
The possible ways to parallelize may be divided into three major strategies: parallel by sample, within-task parallelism, and parallel tasks. These concepts are not mutually exclusive, and typically a pipeline will employ some combination of them. But separating them conceptually reveals some interesting properties about the benefits and challenges of each and how they interact with one another. In particular, we’ll evaluate each strategy in terms of two properties: 1) scalability (how well it scales as project size increases); and 2) ease of implementation.
Parallel by sample
For simplicity, let’s say each pipeline run corresponds to a single biological sample. Really, the concept of sample here can be generalized a bit; for example, in a differential analysis, we could consider each comparison as a separate unit, but for the sake of simplicity we’ll just stick with calling each unit a sample. For the basic preprocessing steps like sequence alignment, filtering, quality control, signal aggregation, peak calling, etc., each sample is processed independently, and can therefore be run in parallel. It seems obviously useful to run each sample in parallel, but what are some of the properties of that choice?
Scalability. Parallelization by sample scales well with project size: as the number of samples grows, it becomes increasingly useful. Each parallel process is also fully independent of the others, so different runs can be submitted across different compute nodes without issue.
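As a sketch of that independence, one could generate a separate cluster submission per sample. The `sbatch` flags below are real SLURM options (`--wrap` submits a one-line shell command as a job), but the pipeline script name is hypothetical:

```python
def submission_cmd(sample, pipeline_cmd):
    # One fully independent cluster job per sample; no job needs to know
    # about any other, so they can land on any node.
    return f"sbatch --job-name={sample} --wrap='{pipeline_cmd} {sample}'"

samples = ["sample1", "sample2"]
cmds = [submission_cmd(s, "run_pipeline.sh") for s in samples]
print(cmds[0])
# → sbatch --job-name=sample1 --wrap='run_pipeline.sh sample1'
```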
Ease of implementation. Parallelizing by sample is also straightforward to implement: an outer loop around the samples simply runs the same series of commands on each one independently. Even more interesting, that outer loop is not pipeline-specific; a single implementation could, in principle, drive parallelization by sample for any pipeline. Furthermore, changes to the pipeline do not affect parallelization by sample, so pipeline updates need not account for it.
Parallel within individual tasks
Both compute power and data size are increasing, and in response, more tools are supporting multi-core processing of individual tasks. For instance, almost every major DNA aligner can make use of multiple CPU cores. Parallelizing individual tasks is obviously useful but has two major limitations: first, it is node-threaded, which limits parallelization to the number of cores on a single machine; and second, the pipeline developer can only use it with tools whose authors have implemented it internally.
Scalability. Parallelizing within an individual task can only scale to as many processors as are present on a single node, so it scales to a point but quickly saturates. Multicore processing requires the job's threads to share memory on a single machine, so it cannot be spread across a large compute cluster. This makes within-task parallelization less scalable than parallelization by sample.
Ease of implementation. For multicore tools, using parallelism is usually as simple as passing an argument with the number of cores. This makes it even simpler than parallelizing by sample. However, many tools do not implement it. This is out of the hands of the pipeline developer and in the hands of the tool author.
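For example, a pipeline might expose a thread count and pass it straight through to the tool. The sketch below builds (but does not execute) a `bwa mem` command line; bwa's `-t` flag sets the thread count, and most aligners offer an equivalent option:

```python
import shlex

def alignment_cmd(reference, fastq, threads):
    # bwa mem's -t flag requests multi-core alignment on a single node;
    # the whole "implementation" of within-task parallelism is one argument.
    return f"bwa mem -t {threads} {shlex.quote(reference)} {shlex.quote(fastq)}"

cmd = alignment_cmd("hg38.fa", "sample1.fastq", threads=8)
print(cmd)
# → bwa mem -t 8 hg38.fa sample1.fastq
```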
Parallel tasks
By parallel tasks I mean running two separate commands not sequentially but simultaneously. For this to make sense, the two tasks cannot depend on one another. Many bioinformatics pipelines are a mix of dependent and independent tasks: most downstream tasks depend on sequence alignment, for instance, but some of those downstream steps may not depend on each other and could theoretically be processed in parallel. Encoding such task-level dependency comes at a cost: a pipeline author who wants optimal concurrency must encode the parallel task structure within the pipeline itself.
Scalability. As the number of samples increases, parallelizing by task does not scale at all: it depends entirely on the structure of the pipeline's dependencies, not on the size of the project. It also becomes useless once limited compute resources are already consumed by parallelization by sample. Finally, because the most computationally intensive tasks are also the ones most likely to offer node-threaded within-task parallelism, the speed gains from implementing parallel tasks are minimal.
Ease of implementation. While encoding task dependency may at first seem simple, the web of independent and differently dependent tasks makes it more challenging than within-task parallelism. Implementing it requires writing multi-threaded code or adopting a third-party workflow framework, each of which brings its own challenges. Even more problematic, parallelization by task is not generic: it requires a specific implementation for each pipeline. Furthermore, changes to a pipeline can alter its task dependency structure, requiring the parallelization to be tweaked in turn and increasing the cost of maintaining a pipeline parallelized by dependency. Taken together, parallel tasks is clearly the most expensive of the three parallelization strategies in terms of total developer effort.
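To illustrate the kind of structure that must be hand-encoded, here is a toy sketch using Python's `concurrent.futures`: alignment runs first, then two downstream tasks that depend only on the alignment run concurrently. All function names are hypothetical stand-ins:

```python
from concurrent.futures import ThreadPoolExecutor

def align(sample):
    return f"{sample}.bam"      # stand-in for sequence alignment

def call_peaks(bam):
    return f"peaks({bam})"      # depends only on the alignment

def compute_qc(bam):
    return f"qc({bam})"         # also depends only on the alignment

with ThreadPoolExecutor(max_workers=2) as pool:
    # Alignment must finish before anything downstream can start.
    bam = align("sample1")
    # call_peaks and compute_qc are independent of each other,
    # so they can run concurrently once the bam exists.
    peaks_future = pool.submit(call_peaks, bam)
    qc_future = pool.submit(compute_qc, bam)
    peaks, qc = peaks_future.result(), qc_future.result()
```

Even this two-branch toy is pipeline-specific: add or reorder a step and the submit/wait logic must change with it, which is exactly the maintenance cost described above.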
Given these cost-to-benefit profiles, a few concrete conclusions follow:
Parallelizing within tasks is the simplest to use, and should be done wherever possible. However, it is neither universal nor scalable. Thus, it does not solve all our parallelization needs.
Parallelizing by sample is not difficult to implement and scales essentially without bound. It should be implemented well once and then reused across all pipelines.
Parallelizing by task is asymptotically useless as sample size increases. Furthermore, it is the most difficult to build and maintain. It should be used sparingly and only when absolutely necessary.
Our project management system, looper, adheres to this philosophy. It encourages users to focus on parallelizing by sample and on node-threaded parallelism within individual tasks, which together provide the greatest speed benefit at the least development cost. In turn, it completely ignores the possibility of encoding dependent tasks, giving up a relatively minor (and asymptotically vanishing) speed increase in exchange for vastly reduced pipeline development complexity.
When should I parallelize by task? Well – I can’t think of many situations where it works out great in the long term, but maybe it’s useful if you can satisfy the following criteria:
- your project is medium sized (and you’re not planning to process lots of samples)
- you’re in a hurry, and it’s worth spending extra developer time to get results faster
- your compute resources are not a limiting factor
- you do not intend for others to further develop or maintain your pipeline