You can find most of our software spread across several GitHub organizations: databio, refgenie, and pepkit. Here is a list of our polished software, aggregated in one place and sorted by purpose:

Project management and pipeline development

Pypiper is a development-oriented pipeline framework. It is a Python package that helps you write robust pipelines directly in Python, handling mundane tasks like restartability, monitoring for time and memory use, monitoring job status, copious log output, robust error handling, easy debugging tools, and guaranteed file output integrity. Language: Python
A pipeline submission engine. Looper deploys any command-line pipeline for each sample in a project organized in standard sample metadata format (PEP). You can think of looper as providing a single user interface for running, summarizing, monitoring, and otherwise managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used. Language: Python
Sample metadata validation and conversion tool. Eido takes standardized metadata in PEP format and converts it to any output format. It also provides a validation engine based on JSON-schema for sample metadata. Language: Python
A multi-container computing environment manager. A bulker environment consists of an individual container image for each command. Bulker environments are portable, interactive, and independent of any specific workflow. Bulker simplifies both interactive analysis and workflow development by building drop-in replacements to command-line tools that act like native tools, but run in containers. Think of bulker as a lightweight wrapper for docker/singularity to simplify sharing complete, containerized environments. Language: Python
Caravel provides a web interface to interact with your PEP-formatted projects. Caravel lets you submit jobs to any cluster resource manager, monitor jobs, summarize results, and browse project summary web pages. Caravel is a local web GUI for looper built with Flask. Language: Python
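The pattern behind looper, one pipeline command per sample in a metadata table, can be sketched in a few lines of plain Python. This is not looper's API: the sample table, the `build_commands` helper, and the `pipeline.py` command template are all hypothetical and serve only to illustrate the idea.

```python
import csv
import io

# Hypothetical sample table; a real project would keep this in a CSV file.
SAMPLE_TABLE = """sample_name,protocol,file
frog_1,ATAC,data/frog_1.fastq.gz
frog_2,RNA,data/frog_2.fastq.gz
"""

# Hypothetical command template; fields in braces are sample attributes.
TEMPLATE = "pipeline.py --name {sample_name} --input {file}"


def build_commands(table_text, template):
    """Return one populated command string per row of the sample table."""
    rows = csv.DictReader(io.StringIO(table_text))
    # Attributes the template does not use (here, protocol) are ignored.
    return [template.format(**row) for row in rows]


commands = build_commands(SAMPLE_TABLE, TEMPLATE)
for command in commands:
    print(command)
```

Because the commands are derived from the table alone, the same metadata can drive a different pipeline simply by swapping the template.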


A Python package that provides an API for handling standardized project and sample metadata. If you define your project in Portable Encapsulated Project (PEP) format, you can use the peppy package to instantiate an in-memory representation of your project and sample metadata. You can then use peppy for interactive analysis, or to develop Python tools so you don't have to handle sample processing. Peppy is useful to tool developers and data analysts who want a standard way of representing sample-intensive research project metadata. Language: Python


Pipestat Language: Python
Divvy is a computing resource configuration manager. It organizes your computing resources and populates job submission templates. It makes it easy for users to toggle among any computing resource (laptop, cluster, cloud). Divvy provides both an interactive Python API and a command-line interface. Language: Python
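As a rough illustration of what "populating job submission templates" means, the sketch below switches between two made-up compute packages and fills in the chosen template. It uses only the Python standard library and does not reflect divvy's actual API or configuration format; `COMPUTE_PACKAGES` and `render_submission` are hypothetical names.

```python
from string import Template

# Hypothetical compute packages: each one names a submission template.
COMPUTE_PACKAGES = {
    "local": {"template": "sh ${script}"},
    "slurm": {
        "template": "sbatch --mem=${mem} --cpus-per-task=${cores} ${script}",
    },
}


def render_submission(package_name, **variables):
    """Fill the chosen package's template with job-specific variables."""
    package = COMPUTE_PACKAGES[package_name]
    # substitute() ignores variables the template does not mention and
    # raises KeyError if a required one is missing.
    return Template(package["template"]).substitute(variables)


job = {"script": "job.sh", "mem": "8G", "cores": 4}
print(render_submission("local", **job))
print(render_submission("slurm", **job))
```

The job description stays the same; only the package name changes when moving from a laptop to a cluster.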


The formal PEP specification. Language: various
A command-line tool that downloads sequencing data and metadata from GEO and SRA and creates standard sample metadata tables in PEP format. Language: Python


An R package for interfacing with sample metadata in PEP format. Language: R


An R package for integrating PEPs with other data structures. Language: R


A project initialization manager. Language: R


A web API and database for biological sample metadata. Language: Python


A tool that provides a command-line interface and a Python API for PEPhub. Language: Python


Database for storing PEP projects. Language: Python

Data sharing and API software

Yacman is a YAML configuration manager. It provides convenience functions for Python developers dealing with YAML configuration files. Language: Python
Refgenie is a full-service reference genome manager that organizes storage, access, and transfer of reference genomes. It provides command-line and Python interfaces to download pre-built reference genome "assets" like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Language: Python


Refgenieserver is containerized code that hosts genome assets that can be automatically downloaded by the refgenie command-line interface. Language: Python


Henge is a Python package that builds backends for generic decomposable recursive unique identifiers. It provides a way to store arbitrary data, defined with JSON-schema, and to retrieve that data using object-derived unique identifiers. Language: Python
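To make "object-derived unique identifiers" concrete, here is a minimal sketch in which an item's identifier is a digest of its own content, and a collection is stored as a list of its members' identifiers. It is an illustration under stated assumptions, not henge's implementation: the in-memory `store`, the function names, and the choice of SHA-256 over canonical JSON are all hypothetical.

```python
import hashlib
import json

store = {}  # in-memory backend; a real backend could be a database


def insert(item):
    """Store a JSON-serializable item under a digest of its content."""
    # Sorting keys makes the serialization, and so the digest, deterministic.
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    store[digest] = canonical
    return digest


def retrieve(digest):
    """Return the item stored under the given digest."""
    return json.loads(store[digest])


def insert_collection(items):
    """Store each member, then store the list of member digests."""
    return insert([insert(item) for item in items])


def retrieve_collection(digest):
    """Resolve a collection digest back into its member items."""
    return [retrieve(member) for member in retrieve(digest)]


record = {"name": "chr1", "length": 1000}
record_id = insert(record)
collection_id = insert_collection([record, {"name": "chr2", "length": 500}])
```

Because identifiers come from content, inserting the same data twice yields the same identifier, and a collection can be decomposed into members that are each retrievable on their own.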


The refget package provides a Python interface to both remote and local use of the refget protocol. Language: Python


The seqcol package provides a Python interface to sequence collections. Language: Python


A web API and database for biological sample metadata. Language: Python


A tool that provides a command-line interface and a Python API for PEPhub. Language: Python


Database for storing PEP projects. Language: Python

Data analysis software

Coordinate Covariation Analysis. Identifies sources of inter-sample variation using PCA and region sets for genomic coordinate-based data. Language: R


Calculate and plot distributions of genomic ranges. Language: R


A Bioconductor package for inferring regulatory activity from DNA methylation data. Language: R
Genomic Locus Overlap Analysis. An R package for enrichment analysis of genomic ranges. Given an input set of genomic regions and a database of genomic region sets, LOLA will compute overlaps and return a list of database region sets ranked by similarity. Language: R
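The overlap-and-rank workflow described for LOLA can be illustrated with a deliberately simplified stand-in. The sketch below is in Python rather than R, ranks database sets by raw overlap count instead of an enrichment statistic, and uses hypothetical function names and toy regions; it is not LOLA's method or API.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open regions share any base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(query, region_set):
    """Number of query regions that overlap at least one region in the set."""
    return sum(any(overlaps(q, r) for r in region_set) for q in query)


def rank_region_sets(query, database):
    """Return (name, overlap_count) pairs, highest count first."""
    counts = {
        name: count_overlapping(query, regions)
        for name, regions in database.items()
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Toy input: three query regions and three database region sets.
query = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
database = {
    "set_a": [("chr1", 150, 250), ("chr2", 100, 120)],
    "set_b": [("chr1", 590, 700)],
    "set_c": [("chr3", 0, 1000)],
}
ranking = rank_region_sets(query, database)
print(ranking)
```

The all-pairs comparison here is quadratic and only suitable for toy inputs; real region sets call for an indexed overlap search.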


An R package providing functions for caching R objects. Its purpose is to encourage writing reusable, restartable, and reproducible analysis pipelines for projects with massive data and computational requirements. Language: R
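The caching idea that makes a pipeline restartable is simple: compute an object once, save it to disk under a name, and load it instead of recomputing on later runs. The sketch below shows that pattern in Python with `pickle`; it is an analogy to the R package, not its interface, and `simple_cache` and `expensive` are hypothetical names.

```python
import os
import pickle
import tempfile


def simple_cache(name, compute, cache_dir):
    """Load `name` from cache_dir if present; otherwise compute and save it."""
    path = os.path.join(cache_dir, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive():
    calls.append(1)
    return sum(range(1000))


cache_dir = tempfile.mkdtemp()
first = simple_cache("total", expensive, cache_dir)   # computed and saved
second = simple_cache("total", expensive, cache_dir)  # loaded from disk
```

Rerunning a script built this way skips every step whose cached result already exists, so an interrupted analysis resumes where it stopped.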


A Python package that simplifies parallel processing of DNA sequencing reads (BAM or SAM files), by parallelizing across chromosomes. Pararead is built for developers of Python scripts that process data read-by-read. It enables you to quickly and easily parallelize your script. Language: Python
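Parallelizing across chromosomes amounts to grouping reads by chromosome, processing each group independently, and combining the results. The sketch below shows that pattern on toy `(chromosome, position)` tuples; it does not use pararead's API or read BAM/SAM files, and all names in it are hypothetical. Threads are used only to keep the example self-contained; CPU-bound work would more likely use separate processes.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

# Toy reads as (chromosome, position); real reads would come from a BAM file.
READS = [("chr1", 10), ("chr2", 5), ("chr1", 42), ("chr3", 7), ("chr2", 99)]


def group_by_chromosome(reads):
    """Split reads into one list of positions per chromosome."""
    groups = defaultdict(list)
    for chrom, position in reads:
        groups[chrom].append(position)
    return groups


def process_chromosome(item):
    """Per-chromosome work; here it just counts the reads."""
    chrom, positions = item
    return chrom, len(positions)


def count_reads_parallel(reads, workers=2):
    """Process each chromosome's reads in a worker and merge the results."""
    groups = group_by_chromosome(reads)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process_chromosome, groups.items()))


print(count_reads_parallel(READS))
```

Only `process_chromosome` needs to change to perform a different read-by-read computation; the grouping and merging stay the same.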


Augmented Interval List is a data structure with the fastest currently known algorithm for searching for genomic overlaps between two sets of genomic ranges with high containment. Language: C
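The core idea of an augmented list can be sketched as follows: sort intervals by start, store the running maximum of the end coordinates alongside them, and use that maximum to stop scanning early during a query. This Python sketch is a simplified illustration, not the C library's implementation, and it leaves out the further optimizations the real data structure uses; the class name and half-open coordinate convention are assumptions.

```python
from bisect import bisect_left


class SimpleIntervalList:
    """Intervals sorted by start, augmented with a running maximum end."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [start for start, _ in self.intervals]
        self.max_end = []
        running = float("-inf")
        for _, end in self.intervals:
            running = max(running, end)
            self.max_end.append(running)

    def query(self, q_start, q_end):
        """Return stored half-open intervals overlapping [q_start, q_end)."""
        hits = []
        # Intervals starting at or after q_end cannot overlap the query.
        i = bisect_left(self.starts, q_end) - 1
        # Walk left; once the running maximum end is not past q_start,
        # no earlier interval can overlap, so the scan stops.
        while i >= 0 and self.max_end[i] > q_start:
            start, end = self.intervals[i]
            if end > q_start:
                hits.append((start, end))
            i -= 1
        return sorted(hits)


index = SimpleIntervalList([(1, 5), (3, 8), (10, 12), (2, 20), (15, 18)])
print(index.query(9, 11))
```

A long interval such as `(2, 20)` keeps the running maximum high and forces the scan to continue past shorter intervals, which is why containment is the hard case for this kind of list.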


Genomic interval machine learning toolkit. Does interesting stuff to BED files. Language: Python


PEPPRO is a pipeline designed to process PRO-seq data. It is optimized on unique features of PRO-seq to be fast and accurate. It performs adapter removal (including UMIs of variable length), read deduplication, trimming, and mapping, and it produces signal tracks (bigWig) for plus and minus strands using scaled (based on mappability information) or unscaled read count patterns. Language: Python
PEPATAC is an ATAC-seq pipeline. It trims adapters, maps reads, calls peaks, and creates bigWig tracks, TSS enrichment files, and other outputs. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches. Language: Python


Pipelines for Whole Genome and Reduced Representation Bisulfite-seq. Language: Python


Pipeline for RNA-seq data. Language: Python

Web resources and services

Papers that published raw or processed data

| Year | Journal | Title | Data |
|------|---------|-------|------|
| 2017 | Nature Medicine | DNA methylation heterogeneity defines a disease spectrum in Ewing sarcoma | Data site |
| 2015 | Cell Reports | Epigenome mapping reveals distinct modes of gene regulation and widespread enhancer reprogramming by the oncogenic fusion protein EWS-FLI1 | Data site |
| 2015 | Cell Reports | Single-cell DNA methylome sequencing and bioinformatic inference of epigenomic cell-state dynamics | GEO:GSE65196 |
| 2013 | PLoS Genetics | Extensive evolutionary changes in regulatory element activity during human origins are associated with altered gene expression and positive selection | GEO:GSE54908 |

Training and skills resources