Want to switch from legacy HPC environment modules to linux containers, so you can re-use your environments on multiple servers? Here’s how I did it.

Modules are great…except when they aren’t

I’ve used HPCs at almost a dozen different academic institutions, and most of them employed some variation of environment modules, such as lmod. I love these environment module systems because they make it so easy to share software among group members. I’ve written my own modules that load up sets of underlying modules provided by the HPC administrators to create a sort of “lab environment”, so my group members can just say module load databio to get access to all the tools we use frequently. It’s so much easier than having every lab member sort through individual installs, and even better because we can rely on the expertise of university-wide installations provided by the HPC administrators.

But there’s one major inconvenience that has bothered me for years: the modules are specific to the particular cluster where they are installed. So, I found myself investing a lot of time learning about the modules available on a system, creating or organizing my own, and getting used to a setup on a particular HPC, but then I can’t take these with me when I change environments. For example, I can’t use those modules when I’m using a collaborator’s standalone server that’s not part of the cluster. I can’t use those modules on my desktop when I want to do a little test. I can’t use them on another HPC I have access to at a different university. I can’t use them on a cloud instance I’ve spun up to test. Modules, unfortunately, aren’t a universal solution – they only solve the issue for a single computing environment, which is getting more problematic as my computing becomes more diverse.

The problem gets even worse when you throw in the desire to share the environment with others outside my group. With environment modules, I can’t say to a collaborator at a different university who wants to use my software, “just run module load xyz, and then you’ll have everything you need for my pipeline to work!” All I can do is say, “I hope you have modules for all these things set up, too…can’t help you there.” I’d love to be able to just say, “here’s the computing environment you need to run this.” This problem kept me unsatisfied with sharing computing environments for years. Until now.

Introducing bulker

For the past year or so, I’ve been working on a tool called bulker (Sheffield, 2019) that is my attempt to solve this problem. After testing this out for several months now, I can say that bulker has finally completely replaced the need for environment modules for my lab. I don’t use environment modules anymore, and can rely completely on bulker to manage our computing software. In my opinion, bulker is superior to environment modules even for one computing environment – and the best part about it is that bulker works across platforms – so I can use the same system on my laptop, or in the cloud. I even have it sort-of working with GitHub Actions.

Bulker is built on top of linux containers (docker or singularity). It adds a few critical components that make it behave more like a modules system. If you’re familiar with docker, you are probably used to some of the things that make it hard to use. It’s awesome for the portability it provides, but it’s a pain to write docker run blahblahblahblah all the time, especially when you’re dealing with mounting volumes. Singularity simplifies some of that, but comes with its own set of challenges, like keeping track of the singularity images, and having to learn both singularity and docker, since each one is deployed in different types of environments (e.g. HPC vs cloud). And both docker and singularity operate on a tool-by-tool basis; with environment modules, you can load an individual tool, but you can also write module load toolset and voila! I have available 16 different tools I need for a complicated interactive analysis or pipeline. So, after using both docker and singularity for a while, it was clear to me that they have the potential to solve this cross-computing-environment conundrum, but they seemed built for ephemeral computing and individual tools. That’s great, but it’s not the scenario that environment modules solve, which is also useful: interactive, multi-tool computing environments on an HPC.
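To make that “docker run blahblahblahblah” pain concrete, here’s the kind of invocation running a single containerized tool requires by hand. The image tag, mounts, and input file here are purely illustrative:

```shell
# Running one tool from a container the manual way. You have to remember
# the user mapping, volume mounts, and working directory every single time
# (image tag and input file are illustrative):
docker run --rm -it \
  --user "$(id -u):$(id -g)" \
  --volume "$PWD:$PWD" \
  --workdir "$PWD" \
  quay.io/biocontainers/samtools:1.9--h8571acd_11 \
  samtools view -H input.bam
```

This is exactly the boilerplate that bulker generates for you behind the scenes, once per command, so you never type it yourself.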

So, I realized what I was looking for was a combination of environment modules and linux containers: the ability to load a set of interactive, native installed tools with module load xyz, but also to be able to use the exact same command to create the same interactive environment on any computing environment – and to be able to share it with others!

Bulker is my solution to this need. Once it’s set up, all I have to do is say bulker activate toolset, and I have the magic environment I desire. I can then run samtools or refgenie or bowtie2 (or whatever) from the command line, just as if I had loaded a native tool with a module. Except it’s not running a native tool – under the hood, bulker is running it in a container for me, but bulker handles all the annoying docker --volume or singularity image stuff for me. But the best part is that a single configuration can be used in any environment. Yep, you read that right – so if I say bulker activate databio/toolset, it will give me that environment on my HPC… or on my desktop, or on a standalone server, or on a cloud instance, or on my collaborator’s computer… or… you get it!

All you need to make this work is an underlying container system (either docker or singularity will do – bulker overlays either, which means I can use the same system on the singularity-aware HPC or my docker-aware desktop). Oh yeah, and you need python3 installed natively. Because bulker is written in python.

Quick guide on how to set up bulker on an HPC

Interested? If you want to try setting up a shared bulker system for your research group on an HPC, here’s all you have to do:

  1. Satisfy the prereqs. The only prerequisites are a working, native installation of python3, and either docker or singularity. If you can satisfy those two requirements, you can use bulker to recreate a portable computing environment.
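A quick way to check whether a machine satisfies both prerequisites (a sketch; adjust for your shell):

```shell
# Check for a native python3 and for at least one container engine.
python3 --version
if command -v docker >/dev/null 2>&1; then
  echo "container engine: docker"
elif command -v singularity >/dev/null 2>&1; then
  echo "container engine: singularity"
else
  echo "no container engine found: install docker or singularity" >&2
fi
```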

  2. Install bulker (that’s as easy as pip install bulker).

  3. Set up a shared bulker configuration. In a group-writable space, initialize a bulker configuration file (with bulker init). Assuming you use singularity on a shared HPC, you’ll also need to dedicate a shared folder for your singularity image files, which bulker will manage. Set that up in your configuration file.
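For a shared setup, the idea is to keep the configuration (and, if you use singularity, the image folder) in group-writable space, and point everyone’s BULKERCFG environment variable at it. A sketch, with an illustrative path – check bulker init --help for the exact flags in your version:

```shell
# Shared, group-writable location for the bulker config (illustrative path).
export BULKERCFG=/project/shared/bulker/bulker_config.yaml

# Initialize a new configuration file there. If you use singularity, then
# edit the resulting config so the singularity image folder also points to
# a shared location that bulker will manage.
bulker init -c "$BULKERCFG"
```

Group members then add the same export line to their shell startup files, so everyone reads the one shared configuration.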

  4. Load up a manifest. In bulker, a manifest is a description of which commands map to which containers. You can use existing manifests, which are available on bulkerhub, or you can create your own. If you’re just getting started, go ahead and use biobase, a default manifest with a bunch of common bioinformatics tools: type bulker load biobase to load it, and then bulker activate biobase to activate it. Now, use the tools in your interactive environment, exactly as if you had used module load – the only difference is that bulker is using singularity under the hood, rather than native installs, so they’re portable.
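Putting that step into commands (assuming bulker is installed and configured, and using samtools as an example of a tool the manifest provides):

```shell
# Fetch the biobase manifest and prepare its containers.
bulker load biobase

# Activate it: the manifest's commands are now on your PATH, each one
# transparently wrapped in a container.
bulker activate biobase

# Use the tools as if they were natively installed, e.g.:
samtools --version
```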

  5. Customize your manifests (if you want). You can create your own set of software for your customized computing environment. All you have to do is write a bulker manifest file. Containers for a lot of popular tools are already available in the biocontainers project (Veiga Leprevost et al., 2017). You can also create your own. The payoff you get from installing your custom tool into a container and adding it to a manifest is that you’ll only do that once – it will then work on any system.
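As a sketch, a minimal manifest looks something like this – the name, tool, and image below are illustrative, and you should check the bulker documentation for the exact schema your version expects:

```yaml
# A minimal bulker manifest: each entry maps a command name to a
# container image and the command to run inside it.
manifest:
  name: demo
  version: 1.0.0
  commands:
  - command: cowsay
    docker_image: nsheff/cowsay
    docker_command: cowsay
```

Once written, you load and activate it just like any other manifest.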

  6. Tell your friends!

That’s it! Now, you can do the same thing on your local computer, or another server, and you’ll benefit from being able to configure your software just once, and then re-using it wherever you go.

References

Sheffield,N.C. (2019) Bulker: A multi-container environment manager. OSF Preprints.

Veiga Leprevost,F. da et al. (2017) BioContainers: An open-source and community-driven framework for software standardization. Bioinformatics, 33, 2580–2582.