Most of our software lives in two GitHub organizations: databio and pepkit. Below is a list of our polished software, gathered in one place and sorted by purpose:

Project management and pipeline development

name language description source docs
pypiper python Pypiper is a development-oriented pipeline framework. It is a python package that helps you write robust pipelines directly in python, handling mundane tasks like restartability, monitoring for time and memory use, monitoring job status, copious log output, robust error handling, easy debugging tools, and guaranteed file output integrity.
looper python Looper is a pipeline submitting engine. Looper deploys any command-line pipeline for each sample in a project organized in standard PEP format. You can think of looper as providing a single user interface to running, summarizing, monitoring, and otherwise managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used.
caravel python Caravel provides a web interface to interact with your PEP-formatted projects. Caravel lets you submit jobs to any cluster resource manager, monitor jobs, summarize results, and browse project summary web pages. Caravel is a local web GUI for looper built with flask.
peppy python peppy is a python package that provides an API for handling standardized project and sample metadata. If you define your project in Portable Encapsulated Project (PEP) format, you can use the peppy package to instantiate an in-memory representation of your project and sample metadata. You can then use peppy for interactive analysis, or to develop python tools so you don’t have to handle sample processing. peppy is useful to tool developers and data analysts who want a standard way of representing sample-intensive research project metadata.
pepkit various A software suite of tools that create or read PEPs.
geofetch python geofetch is a command-line tool that downloads sequencing data and metadata from GEO and SRA and creates standard PEPs.
pepr R An R package for interfacing with PEPs.
BiocProject R An R package for integrating PEPs with other data structures.
projectInit R A project initialization manager.
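
As a rough illustration of the per-sample idea behind looper and PEP sample tables, the sketch below reads a small sample table and builds one pipeline command per sample. It uses only the Python standard library and is not looper's or peppy's actual API. The column names (`sample_name`, `protocol`, `read1`), the `build_commands` helper, and the `pipeline.py` command are all hypothetical.

```python
import csv
import io
import shlex

# Hypothetical sample table, inlined here so the example is self-contained.
SAMPLE_TABLE = """sample_name,protocol,read1
frog_1,ATAC-seq,data/frog_1.fastq.gz
frog_2,RNA-seq,data/frog_2.fastq.gz
"""


def build_commands(table_text, template):
    """Return one shell command per sample row by filling in the template."""
    rows = csv.DictReader(io.StringIO(table_text))
    commands = []
    for row in rows:
        # Quote every value so unusual characters in paths cannot break the command.
        quoted = {key: shlex.quote(value) for key, value in row.items()}
        commands.append(template.format(**quoted))
    return commands


commands = build_commands(
    SAMPLE_TABLE,
    "python pipeline.py --name {sample_name} --input {read1}",
)
for command in commands:
    print(command)
```

Keeping the sample metadata in a table and the command in a template means the same project description can drive any command-line pipeline, which is the separation the tools above are built around.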

Data sharing and API software

name language description source docs
refgenie python Refgenie is a full-service reference genome manager that organizes storage, access, and transfer of reference genomes. It provides command-line and python interfaces to download pre-built reference genome “assets” like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies.
refgenieserver python Refgenieserver is containerized code that hosts genome assets that can be automatically downloaded by the refgenie command-line interface.

Data analysis software

name language description source docs
COCOA R Coordinate Covariation Analysis. Identifying sources of intersample variation using PCA and region sets for genomic coordinate-based data.
MIRA R Bioconductor package for inferring regulatory activity from DNA methylation.
LOLA R Genomic Locus Overlap Analysis. Enrichment of genomic ranges.
simpleCache R simpleCache is an R package providing functions for caching R objects. Its purpose is to encourage writing reusable, restartable, and reproducible analysis pipelines for projects with massive data and computational requirements.
pararead python Pararead is a python package that simplifies parallel processing of DNA sequencing reads (BAM or SAM files), by parallelizing across chromosomes. Pararead is built for developers of python scripts that process data read-by-read. It enables you to quickly and easily parallelize your script.
AIList C Augmented Interval List is a data structure with the fastest currently known algorithm for searching for genomic overlaps between two sets of genomic ranges with high containment.
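
The caching idea behind simpleCache (compute an object once, then reload it on later runs) can be sketched in Python. This is only an analogue under stated assumptions: simpleCache itself is an R package, and the `cached` helper, the `.pkl` file naming, and the `slow_sum` example are hypothetical, not part of its API.

```python
import os
import pickle
import tempfile


def cached(name, compute, cache_dir):
    """Load the object saved under `name`, or compute and save it first."""
    path = os.path.join(cache_dir, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    # Write to a temporary file, then rename, so an interrupted run
    # never leaves a half-written cache file behind.
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as handle:
        pickle.dump(result, handle)
    os.replace(tmp_path, path)
    return result


calls = []


def slow_sum():
    calls.append(1)  # record that the expensive step actually ran
    return sum(range(1000))


cache_dir = tempfile.mkdtemp()
first = cached("total", slow_sum, cache_dir)
second = cached("total", slow_sum, cache_dir)  # served from the cache file
print(first, second, len(calls))
```

The second call never runs `slow_sum`, which is what makes a long analysis script restartable: rerunning it skips every step whose result is already on disk.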


name language description source docs
PEPPRO python PEPPRO is a pipeline designed to process PRO-seq data. It is optimized on unique features of PRO-seq to be fast and accurate. It performs adapter removal (including UMIs of variable length), read deduplication, trimming, and mapping, and it produces signal tracks (bigWig) for plus and minus strands using either scaled (based on mappability information) or unscaled read counts.
PEPATAC python PEPATAC is an ATAC-seq pipeline. It trims adapters, maps reads, calls peaks, and creates bigWig tracks, TSS enrichment files, and other outputs. It is optimized on unique features of ATAC-seq data to be fast and accurate and provides several unique analytical approaches.
dnameth python Pipelines for Whole Genome and Reduced Representation Bisulfite-seq.
rnapipe python Pipeline for RNA-seq data.

Web resources and services

Papers that published raw or processed data

Year Journal Title Data
2017 Nature Medicine DNA methylation heterogeneity defines a disease spectrum in Ewing sarcoma Data site
2015 Cell Reports Epigenome mapping reveals distinct modes of gene regulation and widespread enhancer reprogramming by the oncogenic fusion protein EWS-FLI1 Data site
2015 Cell Reports Single-cell DNA methylome sequencing and bioinformatic inference of epigenomic cell-state dynamics GEO:GSE65196
2013 PLoS Genetics Extensive evolutionary changes in regulatory element activity during human origins are associated with altered gene expression and positive selection GEO:GSE54908

Training and skills resources