Using Docker Bioconductor containers for portable, sharable, and reproducible R analysis and package development
Nathan Sheffield, PhD
www.databio.org/slides
Outline
Docker basics
Bioconductor containers
Use Cases and Demo
What is Docker?
Docker allows you to package an application with all of its dependencies into a standardized unit ... that contains everything it needs to run: code, runtime, system tools, system libraries ... This guarantees that it will always run the same, regardless of the environment.
Sounds like a virtual machine?
Docker vs Virtual Machines
Virtual machines include the application, the necessary binaries and libraries, and an entire guest operating system, which may be tens of GBs in size.
Containers include the application and all of its dependencies, but share the kernel with other containers. They run as an isolated process in userspace on the host operating system.
How is Docker useful?
Version controlled environments
Increased reproducibility
Environment sharing and distribution (DockerHub)
Some Terminology
Image: A read-only template for containers. Think "Class"
Container: An instance of an image (it is created from an image). Think "Object" or "Instance"
Layer: An image consists of a series of layers, which are merged in the container.
Dockerfile: Instructions used to build an image.
Registry: An image storage center, holding public or private images which can be uploaded or downloaded (DockerHub).
Repository: A storage area of version-controlled images, like GitHub repositories.
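To make the repository and tag vocabulary concrete, here is a minimal sketch that splits a DockerHub-style image reference into its repository and tag. The helper name `parse_image_ref` and the tag `v2` are hypothetical, and the sketch assumes references of the form `namespace/name[:tag]` with no registry host or port prefix.

```shell
# Split a DockerHub-style reference "namespace/name[:tag]" into its parts.
# Hypothetical helper for illustration; it is not a docker command.
parse_image_ref() {
  ref=$1
  case $ref in
    *:*) tag=${ref##*:}; repo=${ref%:*} ;;
    *)   tag=latest;     repo=$ref ;;   # docker falls back to the "latest" tag
  esac
  printf 'repository=%s tag=%s\n' "$repo" "$tag"
}

parse_image_ref bioconductor/devel_core
parse_image_ref sheffien/newrepo:v2
```

When no tag is given, as in most commands on these slides, docker uses `latest`.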
Dockerfiles
Dockerfiles contain instructions for building an image.
You can link Dockerfiles on GitHub to DockerHub to trigger auto-builds.
FROM bioconductor/devel_core
MAINTAINER Nathan Sheffield <nathan@code.databio.org>
# Updating is required before any apt-gets
RUN sudo apt-get update && sudo apt-get install -y --force-yes \
# Required for R Package XML
libxml2-dev \
# Curl; required for RCurl; but present in upstream images
# libcurl4-gnutls-dev \
# GNU Scientific Library; required by MotIV
libgsl0-dev \
# OpenSSL is used by, for example, the devtools dependency git2r
libssl-dev \
# R CMD check requires qpdf to check PDF size
qpdf
# Boost libraries are helpful for some R packages
RUN sudo apt-get update && sudo apt-get install -y --force-yes \
libboost-all-dev
COPY Rprofile .Rprofile
COPY Rsetup/install_fonts.R Rsetup/install_fonts.R
COPY Rsetup/fonts Rsetup/fonts
RUN Rscript Rsetup/install_fonts.R
# Install packages
COPY Rsetup/Rsetup.R Rsetup/Rsetup.R
RUN Rscript Rsetup/Rsetup.R
COPY Rsetup/rpack_basic.txt Rsetup/rpack_basic.txt
COPY Rsetup/rpack_bio.txt Rsetup/rpack_bio.txt
RUN Rscript Rsetup/Rsetup.R --packages=Rsetup/rpack_basic.txt
RUN Rscript Rsetup/Rsetup.R --packages=Rsetup/rpack_bio.txt
# If you want to develop R packages on this machine (need BiocCheck):
COPY Rsetup/rpack_biodev.txt Rsetup/rpack_biodev.txt
RUN Rscript Rsetup/Rsetup.R --packages=Rsetup/rpack_biodev.txt
# R CMD check requires qpdf to check PDF size
RUN sudo apt-get install -y --force-yes qpdf
# Copy over the stuff in Rpack and add it to path
COPY Rpack/ Rpack/
ENV PATH Rpack:$PATH
Some basic commands
user@host$ docker
Commands:
    build     Build an image from a Dockerfile
    commit    Create a new image from a container's changes
    images    List images
    info      Display system-wide information
    ps        List containers
    pull      Pull an image or a repository from a Docker registry server
    push      Push an image or a repository to a Docker registry server
    rm        Remove one or more containers
    rmi       Remove one or more images
    run       Run a command in a new container
(and lots more)...
Docker meets Bioconductor
Examples of available images:
bioconductor/release_core
bioconductor/devel_core
bioconductor/release_sequencing
bioconductor/devel_sequencing
3 Example Use Cases
1. Containerize R CMD check and BiocCheck
2. Containerize an analysis as a deployable application
3. Maintain a personal/team R container to work from anywhere
Use case 1
An R CMD check container
Rpack.sh
roxygenize.sh -i $1
R --no-save << END
devtools::install_deps("$1");
END
a=$(R CMD build $1)
echo "Building.. $a"
# Get the name of the built tarball
regex="building '(.*)'"
[[ $a =~ $regex ]]
name="${BASH_REMATCH[1]}"
echo "R CMD check $name..."
R CMD check $name
echo "R CMD BiocCheck $name..."
R CMD BiocCheck $name
This script roxygenizes a package, builds it, then runs R CMD check and R CMD BiocCheck.
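The tarball-name step is the fragile part of the script: the regex expects straight quotes, while the build log on these slides prints curly ones. A minimal sketch of a more tolerant extraction follows. It swaps `BASH_REMATCH` for `sed`, and the function name `tarball_name` is hypothetical.

```shell
# Extract the tarball name from `R CMD build` output read on stdin.
# Skips any quote characters (straight or curly) before the file name.
tarball_name() {
  sed -n 's/.*building [^[:alnum:]]*\([[:alnum:]._-]*\.tar\.gz\).*/\1/p'
}

# Sample line copied from the build log in this deck
printf '%s\n' "* building ‘LOLA_0.99.9.tar.gz’" | tarball_name
```

In Rpack.sh this could replace the regex lines with something like `name=$(echo "$a" | tarball_name)`.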
dockrpack.sh (outside the container)
#!/bin/bash
echo $1
docker run -it -v $1:$1 sheffien/rdev bash -c "Rpack.sh $1"
Now you can run R CMD check and BiocCheck in a container with all requirements, in a single command.
dockrpack.sh $HOME/code/LOLA
Building...
* checking for file ‘/home/nsheffield/code/LOLA/DESCRIPTION’ ... OK
* preparing ‘LOLA’:
* checking DESCRIPTION meta-information ... OK
* installing the package to build vignettes
* creating vignettes ... OK
* checking for LF line-endings in source and make files
* checking for empty or unneeded directories
* looking to see if a ‘data/datalist’ file should be added
* building ‘LOLA_0.99.9.tar.gz’
Built tarball: LOLA_0.99.9.tar.gz
R CMD check LOLA_0.99.9.tar.gz...
* using log directory ‘//LOLA.Rcheck’
* using R version 3.2.2 (2015-08-14)
* using platform: x86_64-pc-linux-gnu (64-bit)
* using session charset: UTF-8
* checking for file ‘LOLA/DESCRIPTION’ ... OK
...
R CMD BiocCheck LOLA_0.99.9.tar.gz...
* This is BiocCheck, version 1.5.8.
* BiocCheck is a work in progress. Output and severity of issues may
change.
* Installing package...
* Checking for version number mismatch...
...
Summary:
REQUIRED count: 0
RECOMMENDED count: 0
CONSIDERATION count: 4
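The dockrpack.sh wrapper mounts the package directory at the same path inside the container, which only works when it is given an absolute path: docker treats a non-absolute `-v` source as a volume name rather than a host directory. Here is a minimal dry-run sketch that resolves the path first and prints the command instead of running it. The function name `dockrpack_cmd` is hypothetical, `--rm` is added so the container is removed after the check, and paths containing spaces are not handled.

```shell
# Print (rather than run) the docker command for checking the package in $1.
# Hypothetical dry-run helper; image and script names come from the slides.
dockrpack_cmd() {
  # Resolve $1 to an absolute path so the bind mount points at a host directory
  pkg_dir=$(cd "$1" && pwd) || return 1
  printf 'docker run --rm -it -v %s:%s sheffien/rdev bash -c "Rpack.sh %s"\n' \
    "$pkg_dir" "$pkg_dir" "$pkg_dir"
}

dockrpack_cmd .
```

With this, a relative path such as `dockrpack_cmd LOLA` from inside the code directory would produce the same command as the full path.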
Use case 2
Package up your application to make distribution easy
Add an ENTRYPOINT to configure a container as an executable.
# Dockerfile for sheffien/lola
FROM sheffien/rdev
RUN wget http://big.databio.org/regionDB/LOLACoreCaches_latest.tgz
RUN tar -xf LOLACoreCaches_latest.tgz
RUN wget http://big.databio.org/regionDB/lola_vignette_data_150505.tgz
RUN tar -xf lola_vignette_data_150505.tgz
COPY LOLA bin/LOLA
ENTRYPOINT ["LOLA", "-d", "LOLACore/hg19", "-u", "data/activeDHS_universe.bed"]
Any additional command-line arguments to `docker run` are passed to the ENTRYPOINT executable, like so:
docker run -v $HOME:/data sheffien/lola -i /data/setA_100.bed -o /data
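To see what actually executes inside the container, it helps to write out the combined command line. The sketch below only illustrates the appending rule: docker does this internally, and the function name `effective_command` is hypothetical.

```shell
# Compose the command line the container would execute: the ENTRYPOINT
# array from the Dockerfile first, then everything that follows the image
# name on the `docker run` command line.
effective_command() {
  entrypoint='LOLA -d LOLACore/hg19 -u data/activeDHS_universe.bed'
  printf '%s %s\n' "$entrypoint" "$*"
}

effective_command -i /data/setA_100.bed -o /data
```

The `-v $HOME:/data` option is not part of that command line. It is consumed by `docker run` itself and is what makes the `/data` paths resolve inside the container.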
We're running a Bioconductor package in a portable, version-controlled, and self-contained environment (!)
Use case 3
Switch your R production environment to a container
There are two ways to do this:
1. Use a Dockerfile
Rebuild the image with each Dockerfile update.
2. Commit changes GitHub-style
Push interactive changes to DockerHub.
Both require your production compute environment to allow running Docker.
Try it!
# Grab the latest Bioc devel image (may take a while)
docker pull bioconductor/devel_base
# Create and start a container running R (starts instantly!)
docker run --name myR -it bioconductor/devel_base R --save --restore
Now, from inside R in the container:
# Install some new packages, change the environment
> install.packages("data.table")
> biocLite("LOLA")
> variable = 12345
# Now, exit (Ctrl+D) and view the containers (-n 5 lists the last 5 created, including stopped ones)
docker ps -n 5
# start it up again and see your changes
docker start -i myR
# Commit and share!
docker commit -m "Added LOLA" myR sheffien/newrepo
docker images
docker push sheffien/newrepo