Using Docker Bioconductor containers for portable, sharable, and reproducible R analysis and package development

Nathan Sheffield, PhD
www.databio.org/slides

Outline


Docker basics

Bioconductor containers

Use Cases and Demo

What is Docker?

Docker allows you to package an application with all of its dependencies into a standardized unit ... that contains everything it needs to run: code, runtime, system tools, system libraries ... This guarantees that it will always run the same, regardless of the environment

Sounds like a virtual machine?

Docker vs Virtual Machines

Virtual machines include the application, the necessary binaries and libraries and an entire guest operating system - which may be tens of GBs in size.

Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system.

How is Docker useful?

Version controlled environments
Increased reproducibility
Environment sharing and distribution (DockerHub)

Some Terminology

  • Image: A read-only template for containers. Think "Class"
  • Container: An instance of an image (it is created from an image). Think "Object" or "Instance"
  • Layer: An image consists of a series of layers, which are merged in the container.
  • Dockerfile: Instructions used to build an image.
  • Registry: An image storage center, holding public or private images which can be uploaded or downloaded (DockerHub).
  • Repository: A named collection of version-controlled (tagged) images, analogous to a GitHub repository.
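
To make the registry, repository, and tag terms concrete, here is a minimal sketch that splits an image reference into its repository and tag. `split_image_ref` is a hypothetical helper, not a Docker command. It assumes the `name[:tag]` form only (digest references with `@sha256:` are not handled) and relies on Docker's default tag, `latest`.

```shell
# Split an image reference into "repository tag".
# A colon only marks a tag when it appears in the last path component,
# so a registry port such as localhost:5000 is left alone.
split_image_ref() {
  ref="$1"
  last="${ref##*/}"   # final path component, e.g. "devel_core:latest"
  case "$last" in
    *:*) repo="${ref%:*}"; tag="${last##*:}" ;;
    *)   repo="$ref";      tag="latest" ;;   # Docker's default tag
  esac
  printf '%s %s\n' "$repo" "$tag"
}
```

For example, `split_image_ref bioconductor/devel_core` prints the repository followed by the default tag.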

Dockerfiles

Dockerfiles contain instructions for building an image. You can link Dockerfiles on GitHub to DockerHub to trigger auto-builds.
FROM bioconductor/devel_core
MAINTAINER Nathan Sheffield <nathan@code.databio.org>

# Updating is required before any apt-get install
RUN apt-get update && apt-get install -y --force-yes \
  # Required for R package XML
  libxml2-dev \
  # Curl; required for RCurl, but already present in upstream images
  # libcurl4-gnutls-dev \
  # GNU Scientific Library; required by MotIV
  libgsl0-dev \
  # OpenSSL is used by, for example, the devtools dependency git2r
  libssl-dev \
  # R CMD check uses qpdf to check PDF size
  qpdf

# Boost libraries are helpful for some R packages
RUN apt-get update && apt-get install -y --force-yes \
  libboost-all-dev

COPY Rprofile .Rprofile

COPY Rsetup/install_fonts.R Rsetup/install_fonts.R
COPY Rsetup/fonts Rsetup/fonts
RUN Rscript Rsetup/install_fonts.R

# Install packages
COPY Rsetup/Rsetup.R Rsetup/Rsetup.R
RUN Rscript Rsetup/Rsetup.R
COPY Rsetup/rpack_basic.txt Rsetup/rpack_basic.txt
COPY Rsetup/rpack_bio.txt Rsetup/rpack_bio.txt
RUN Rscript Rsetup/Rsetup.R --packages=Rsetup/rpack_basic.txt
RUN Rscript Rsetup/Rsetup.R --packages=Rsetup/rpack_bio.txt

# If you want to develop R packages on this machine (need biocCheck):
COPY Rsetup/rpack_biodev.txt Rsetup/rpack_biodev.txt
RUN Rscript Rsetup/Rsetup.R --packages=Rsetup/rpack_biodev.txt


# R CMD check uses qpdf to check PDF size
RUN apt-get install -y --force-yes qpdf

# Copy over the stuff in Rpack and add it to path
COPY Rpack/ Rpack/
ENV PATH Rpack:$PATH
You can find some examples in my Dockerfile repository on GitHub.
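
A Dockerfile like the one above becomes an image with `docker build`. The following is a minimal sketch of wrapping that call so it can be previewed on a machine without Docker. The `run` and `build_image` helpers and the `DRY_RUN` variable are hypothetical names introduced here. The tag `sheffien/rdev` is the one used elsewhere in these slides, and the build context directory is an assumption.

```shell
# Run a command, or only print it when DRY_RUN=1.
# Note: printing with $* loses the original quoting; it is for inspection only.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build an image from the Dockerfile found in a context directory.
#   $1: image tag, e.g. sheffien/rdev
#   $2: build context directory (defaults to the current directory)
build_image() {
  tag="$1"
  context="${2:-.}"
  run docker build -t "$tag" "$context"
}
```

With `DRY_RUN=1` set, `build_image sheffien/rdev .` prints the `docker build` command instead of executing it; unset the variable to run the real build.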

Some basic commands

user@host$ docker
Commands:
    build     Build an image from a Dockerfile
    commit    Create a new image from a container's changes
    images    List images
    info      Display system-wide information
    ps        List containers
    pull      Pull an image or a repository from a Docker registry server
    push      Push an image or a repository to a Docker registry server
    rm        Remove one or more containers
    rmi       Remove one or more images
    run       Run a command in a new container
    (and lots more)...

Docker meets Bioconductor

Examples of available images:
bioconductor/release_core
bioconductor/devel_core
bioconductor/release_sequencing
bioconductor/devel_sequencing
More information at Bioconductor's Docker page.
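
The image names listed above follow a `bioconductor/<branch>_<flavor>` pattern. Here is a small sketch that composes a name from its two parts, assuming that pattern holds. `bioc_image` is a hypothetical helper; only the combinations named in these slides are known to exist, so treat any other output as a guess to verify against the registry.

```shell
# Compose a Bioconductor image name from a branch and a flavor.
#   branch: release | devel
#   flavor: base | core | sequencing
bioc_image() {
  branch="$1"
  flavor="$2"
  case "$branch" in
    release|devel) ;;
    *) echo "unknown branch: $branch" >&2; return 1 ;;
  esac
  case "$flavor" in
    base|core|sequencing) ;;
    *) echo "unknown flavor: $flavor" >&2; return 1 ;;
  esac
  printf 'bioconductor/%s_%s\n' "$branch" "$flavor"
}
```

The result can be passed straight to `docker pull`, for example `docker pull "$(bioc_image devel core)"`.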

3 Example Use Cases

1. Containerize R CMD check and BiocCheck

2. Containerize an analysis as a deployable application

3. Maintain a personal/team R container to work from anywhere

Use case 1



An R CMD check container
Rpack.sh
roxygenize.sh -i "$1"

R --no-save <<END
devtools::install_deps("$1");
END

a=$(R CMD build "$1")
echo "Building..$a"

# Get the name of the built tarball from the "* building ‘<tarball>’" line
regex="building [^[:alnum:]]+([[:alnum:]._]+\.tar\.gz)"
[[ $a =~ $regex ]]
name="${BASH_REMATCH[1]}"

echo "R CMD check $name..."
R CMD check "$name"

echo "R CMD BiocCheck $name..."
R CMD BiocCheck "$name"
This script roxygenizes a package, builds it, then runs R CMD check and R CMD BiocCheck.
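
The tarball-name step is the most fragile part of this script, because it parses human-readable output from `R CMD build`. Below is a minimal sketch of that step in isolation, so it can be checked without R or Docker. It uses `sed` rather than the bash regex, and `tarball_name` is a hypothetical helper that is not part of Rpack.sh.

```shell
# Read `R CMD build` output on stdin and print the tarball name, if any.
# Matches lines such as:  * building ‘LOLA_0.99.9.tar.gz’
# The quote character differs between locales, so any non-alphanumeric
# characters in front of the file name are skipped.
tarball_name() {
  sed -n 's/.*building [^[:alnum:]]*\([[:alnum:]._]*\.tar\.gz\).*/\1/p'
}
```

Inside the script this could be used as `name=$(R CMD build "$1" | tarball_name)`, at the cost of no longer echoing the build log.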
dockrpack.sh (outside the container)
#!/bin/bash
echo "$1"
docker run -it -v "$1:$1" sheffien/rdev bash -c "Rpack.sh $1"
Now you can run R CMD check and BiocCheck in a container with all requirements, in a single command.
dockrpack.sh $HOME/code/LOLA
Building...
* checking for file ‘/home/nsheffield/code/LOLA/DESCRIPTION’ ... OK
* preparing ‘LOLA’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* looking to see if a ‘data/datalist’ file should be added
* building ‘LOLA_0.99.9.tar.gz’
Built tarball: LOLA_0.99.9.tar.gz
R CMD check LOLA_0.99.9.tar.gz...
* using log directory ‘//LOLA.Rcheck’
* using R version 3.2.2 (2015-08-14)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘LOLA/DESCRIPTION’ ... OK

...

R CMD BiocCheck LOLA_0.99.9.tar.gz...
* This is BiocCheck, version 1.5.8.
* BiocCheck is a work in progress. Output and severity of issues may
  change.
* Installing package...
* Checking for version number mismatch...

...

Summary:
REQUIRED count: 0
RECOMMENDED count: 0
CONSIDERATION count: 4
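
The dockrpack.sh wrapper mounts the package directory at the same path inside the container. The container-side path of a `-v` mount must be absolute, so a relative argument such as `.` will not work as written. Here is a sketch of resolving the argument first. `abs_dir` and `dockrpack_cmd` are hypothetical helpers, and the function prints the command rather than running it.

```shell
# Resolve a directory to an absolute path; fails if it does not exist.
abs_dir() {
  ( cd "$1" 2>/dev/null && pwd )
}

# Print the docker command the wrapper would run for a package directory.
dockrpack_cmd() {
  pkg="$(abs_dir "$1")" || { echo "not a directory: $1" >&2; return 1; }
  printf 'docker run -it -v "%s:%s" sheffien/rdev bash -c "Rpack.sh %s"\n' \
    "$pkg" "$pkg" "$pkg"
}
```

Run from `$HOME/code`, `dockrpack_cmd LOLA` prints the same command as the absolute-path call shown above.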

Use case 2



Package up your application to make distribution easy
Add an ENTRYPOINT to configure a container as an executable.
# Dockerfile for sheffien/lola
FROM sheffien/rdev

RUN wget http://big.databio.org/regionDB/LOLACoreCaches_latest.tgz
RUN tar -xf LOLACoreCaches_latest.tgz
RUN wget http://big.databio.org/regionDB/lola_vignette_data_150505.tgz
RUN tar -xf lola_vignette_data_150505.tgz

COPY LOLA bin/LOLA

ENTRYPOINT ["LOLA", "-d", "LOLACore/hg19", "-u", "data/activeDHS_universe.bed"]
Any additional command-line arguments to `docker run` are passed to the ENTRYPOINT executable, like so:
docker run -v $HOME:/data sheffien/lola -i /data/setA_100.bed -o /data
We're running a Bioconductor package in a portable, version-controlled, and self-contained environment (!)
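
To see what actually executes inside the container, it helps to write out how the ENTRYPOINT words and the `docker run` arguments combine. The sketch below only simulates that concatenation in the shell, with no Docker involved. `entrypoint_argv` is a hypothetical helper, and the fixed words are copied from the ENTRYPOINT line above.

```shell
# Print the command line the container would execute: the fixed ENTRYPOINT
# words from the Dockerfile, followed by whatever comes after the image name
# on the `docker run` command line.
entrypoint_argv() {
  set -- LOLA -d LOLACore/hg19 -u data/activeDHS_universe.bed "$@"
  printf '%s\n' "$*"
}
```

Calling `entrypoint_argv -i /data/setA_100.bed -o /data` prints the full LOLA invocation for the `docker run` example above.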

Use case 3



Switch your R production environment to a container
There are two ways to do this:
1. Use a Dockerfile
Rebuild container with each Dockerfile update.
2. Commit changes GitHub-style
Push interactive changes to DockerHub.

Both require that your production compute environment allows running Docker.

DEMO

Try it!

# Grab the latest Bioc devel image (may take a while)
docker pull bioconductor/devel_base
# Create and start a container running R (starts instantly!)
docker run --name myR -it bioconductor/devel_base R --save --restore
Now, from inside R in the container:
# Install some new packages, change the environment
> install.packages("data.table")
> biocLite("LOLA")
> variable = 12345
# Now, exit (Ctrl+D) and view the containers (-n 5 lists the last 5 created, including stopped ones)
docker ps -n 5

# start it up again and see your changes
docker start -i myR

# Commit and share!
docker commit -m "Added LOLA" myR sheffien/newrepo
docker images
docker push sheffien/newrepo
Thanks for listening!

Slides at http://databio.org/slides/docker_bioconductor.html