Software & Data

You can find most of our software spread across several GitHub organizations: databio, refgenie, and pepkit. Here is a list of polished software, aggregated in one place and sorted by purpose:

Project management and pipeline development

source

docs

Pypiper is a development-oriented pipeline framework. It is a Python package that helps you write robust pipelines directly in Python, handling mundane tasks like restartability, monitoring for time and memory use, monitoring job status, copious log output, robust error handling, easy debugging tools, and guaranteed file output integrity. Language: Python

source

docs

A pipeline submission engine. Looper deploys any command-line pipeline for each sample in a project organized in the standard sample metadata format (PEP). You can think of looper as providing a single user interface for running, summarizing, monitoring, and otherwise managing all of your sample-intensive research projects the same way, regardless of data type or pipeline used. Language: Python
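To make the "one command per sample" idea concrete, here is a minimal sketch using only the Python standard library. It does not use looper's actual API or configuration format; the sample table, the `pipeline.py` command, and its flags are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical sample table; a real project would keep this in a CSV file.
SAMPLE_TABLE = """sample_name,protocol,file
frog_1,ATAC-seq,data/frog_1.fastq.gz
frog_2,RNA-seq,data/frog_2.fastq.gz
"""

# Hypothetical command template; the pipeline name and flags are made up.
COMMAND_TEMPLATE = "pipeline.py --sample-name {sample_name} --input {file}"


def build_commands(table_text, template):
    """Return one populated command string per row of the sample table."""
    rows = csv.DictReader(io.StringIO(table_text))
    # Columns the template does not mention (here, protocol) are ignored.
    return [template.format(**row) for row in rows]


commands = build_commands(SAMPLE_TABLE, COMMAND_TEMPLATE)
for command in commands:
    print(command)
```

The sketch only builds the command strings; submitting, monitoring, and summarizing the resulting jobs is the part a tool like looper adds on top.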

source

docs

Sample metadata validation and conversion tool. Eido takes standardized metadata in PEP format and converts it to any output format. It also provides a validation engine for sample metadata based on JSON Schema. Language: Python

source

docs

A multi-container computing environment manager. A bulker environment consists of an individual container image for each command. Bulker environments are portable, interactive, and independent of any specific workflow. Bulker simplifies both interactive analysis and workflow development by building drop-in replacements for command-line tools that act like native tools but run in containers. Think of bulker as a lightweight wrapper for Docker/Singularity that simplifies sharing complete, containerized environments. Language: Python

source

docs

Caravel provides a web interface to interact with your PEP-formatted projects. Caravel lets you submit jobs to any cluster resource manager, monitor jobs, summarize results, and browse project summary web pages. Caravel is a local web GUI for looper built with Flask. Language: Python

source

docs

peppy

A Python package that provides an API for handling standardized project and sample metadata. If you define your project in Portable Encapsulated Project (PEP) format, you can use the peppy package to instantiate an in-memory representation of your project and sample metadata. You can then use peppy for interactive analysis, or to develop Python tools so you don't have to handle sample processing. Peppy is useful to tool developers and data analysts who want a standard way of representing sample-intensive research project metadata. Language: Python

source

docs

pipestat

Pipestat Language: Python

source

docs

Divvy is a computing resource configuration manager. It organizes your computing resources and populates job submission templates. It makes it easy for users to toggle among any computing resource (laptop, cluster, cloud). Divvy provides both an interactive Python API and a command-line interface. Language: Python
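The idea of populating job submission templates and toggling among computing resources can be sketched with the standard library's `string.Template`. This is not divvy's API or configuration format; the package names, template variables, and `render_job` helper below are hypothetical.

```python
from string import Template

# Hypothetical compute packages: each pairs a submission template with
# the command that would be used to submit the rendered script.
COMPUTE_PACKAGES = {
    "local": {
        "template": Template("#!/bin/bash\n$code\n"),
        "submit": "sh",
    },
    "slurm": {
        "template": Template(
            "#!/bin/bash\n"
            "#SBATCH --job-name=$jobname\n"
            "#SBATCH --mem=$mem\n"
            "#SBATCH --cpus-per-task=$cores\n"
            "$code\n"
        ),
        "submit": "sbatch",
    },
}


def render_job(package_name, **variables):
    """Fill in the template of the chosen package; return (submit command, script)."""
    package = COMPUTE_PACKAGES[package_name]
    # safe_substitute ignores variables the template does not use.
    script = package["template"].safe_substitute(variables)
    return package["submit"], script


job = dict(jobname="align", mem="8G", cores=4, code="echo hello")
submit_command, script = render_job("slurm", **job)
local_command, local_script = render_job("local", **job)
print(submit_command)
print(script)
```

Switching from the cluster to a laptop changes only the package name passed to `render_job`; the job description stays the same.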

source

docs

pepspec

The formal PEP specification. Language: various

source

docs

A command-line tool that downloads sequencing data and metadata from GEO and SRA and creates standard sample metadata tables in PEP format. Language: Python

source

docs

pepr

An R package for interfacing with sample metadata in PEP format. Language: R

source

docs

BiocProject

An R package for integrating PEPs with other data structures. Language: R

source

docs

projectInit

A project initialization manager. Language: R

source

docs

pephub

A web API and database for biological sample metadata. Language: Python

source

docs

pephubclient

A tool that provides a command-line interface and a Python API for PEPhub. Language: Python

source

docs

pepdbagent

A database for storing PEP projects. Language: Python

source

docs

Yacman is a YAML configuration manager. It provides convenience functions for Python developers dealing with YAML configuration files. Language: Python

source

docs

Refgenie is a full-service reference genome manager that organizes storage, access, and transfer of reference genomes. It provides command-line and Python interfaces to download pre-built reference genome "assets" like indexes used by bioinformatics tools. It can also build assets for custom genome assemblies. Language: Python

source

docs

refgenieserver

Refgenieserver is containerized code that hosts genome assets that can be automatically downloaded by the refgenie command-line interface. Language: Python

source

docs

henge

Henge is a Python package that builds backends for generic decomposable recursive unique identifiers. It provides a way to store arbitrary data, defined with JSON Schema, and to retrieve the data using object-derived unique identifiers. Language: Python
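As a rough illustration of object-derived, recursively decomposable identifiers, the sketch below stores values in a dictionary keyed by a digest of their content, and stores lists as lists of their elements' digests. It is not henge's API: the `ContentStore` class, the record layout, and the choice of SHA-256 over canonical JSON are all assumptions made for this example.

```python
import hashlib
import json


def digest(obj):
    """Derive an identifier from content: SHA-256 of canonical JSON (an assumption)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ContentStore:
    """Hypothetical in-memory backend keyed by content-derived identifiers."""

    def __init__(self):
        self._db = {}

    def insert(self, obj):
        """Store obj and return its identifier; lists are decomposed recursively."""
        if isinstance(obj, list):
            record = {"type": "list", "items": [self.insert(x) for x in obj]}
        else:
            record = {"type": "value", "value": obj}
        key = digest(record)
        self._db[key] = record
        return key

    def retrieve(self, key):
        """Rebuild the original object from its identifier."""
        record = self._db[key]
        if record["type"] == "list":
            return [self.retrieve(k) for k in record["items"]]
        return record["value"]


store = ContentStore()
collection = ["ACGT", ["TTGA", "CCC"]]
key = store.insert(collection)
print(key)
print(store.retrieve(key))
```

Because identifiers depend only on content, inserting the same object twice returns the same key, and a nested element can be retrieved on its own by its own identifier.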

source

docs

refget

The refget package provides a Python interface to both remote and local use of the refget protocol. Language: Python

source

docs

seqcol

The seqcol package provides a Python interface to sequence collections. Language: Python

Data analysis software

source

docs

Coordinate Covariation Analysis. Identifying sources of intersample variation using PCA and region sets for genomic coordinate-based data. Language: R

source

docs

GenomicDistributions

Calculate and plot distributions of genomic ranges. Language: R

source

docs

MIRA

A Bioconductor package for inferring regulatory activity from DNA methylation data. Language: R

source

docs

Genomic Locus Overlap Analysis. An R package for enrichment analysis of genomic ranges. Given an input set of genomic regions and a database of genomic region sets, LOLA will compute overlaps and return a list of database region sets ranked by similarity. Language: R
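The overlap-and-rank idea can be sketched in a few lines of Python. This is a deliberately simplified stand-in, not LOLA's method or its R interface: it ranks database sets by the raw number of query regions they overlap instead of computing an enrichment statistic, and the region sets and their names are invented.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def count_overlapping(query, region_set):
    """Number of query regions that overlap at least one region in region_set."""
    return sum(any(overlaps(q, r) for r in region_set) for q in query)


def rank_region_sets(query, database):
    """Rank database region sets by overlap count, highest first."""
    scores = {name: count_overlapping(query, rs) for name, rs in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented example data.
query = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)]
database = {
    "enhancers": [("chr1", 150, 250), ("chr2", 100, 120)],
    "promoters": [("chr1", 590, 700)],
    "silencers": [("chr3", 1, 10)],
}

ranking = rank_region_sets(query, database)
print(ranking)
# [('enhancers', 2), ('promoters', 1), ('silencers', 0)]
```

The nested loop is quadratic and only suitable for toy inputs; real region sets call for an indexed interval search.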

source

docs

simpleCache

An R package providing functions for caching R objects. Its purpose is to encourage writing reusable, restartable, and reproducible analysis pipelines for projects with massive data and computational requirements. Language: R
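simpleCache itself is an R package, so the following Python sketch only illustrates the underlying pattern: compute a result once, save it to disk under a name, and load it on later runs. The `simple_cache` helper and the pickle file layout are hypothetical and do not mirror the package's functions.

```python
import os
import pickle
import tempfile


def simple_cache(name, compute, cache_dir):
    """Load the named result from cache_dir if present; otherwise compute and save it."""
    path = os.path.join(cache_dir, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            return pickle.load(handle)
    result = compute()
    with open(path, "wb") as handle:
        pickle.dump(result, handle)
    return result


calls = []  # records how many times the expensive step actually runs


def expensive():
    calls.append(1)
    return sum(range(1000))


with tempfile.TemporaryDirectory() as cache_dir:
    first = simple_cache("total", expensive, cache_dir)
    second = simple_cache("total", expensive, cache_dir)  # served from disk

print(first, second, len(calls))
```

A pipeline built from steps like this can be restarted after a failure without recomputing the steps that already finished.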

source

docs

pararead

A Python package that simplifies parallel processing of DNA sequencing reads (BAM or SAM files), by parallelizing across chromosomes. Pararead is built for developers of Python scripts that process data read-by-read. It enables you to quickly and easily parallelize your script. Language: Python
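The strategy of splitting read-by-read work by chromosome can be sketched with the standard library alone. This is not pararead's API and it does not read BAM or SAM files: reads are plain hypothetical `(chrom, position, sequence)` tuples, and a thread pool stands in for whatever worker mechanism a real tool would use.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def group_by_chromosome(reads):
    """Split (chrom, position, sequence) reads into one list per chromosome."""
    groups = defaultdict(list)
    for read in reads:
        groups[read[0]].append(read)
    return groups


def count_gc(reads):
    """Example per-chromosome worker: total number of G and C bases."""
    return sum(seq.count("G") + seq.count("C") for _, _, seq in reads)


def process_by_chromosome(reads, worker, max_workers=4):
    """Run worker on each chromosome's reads concurrently; return {chrom: result}."""
    groups = group_by_chromosome(reads)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {chrom: pool.submit(worker, chunk) for chrom, chunk in groups.items()}
        return {chrom: future.result() for chrom, future in futures.items()}


# Invented reads for illustration.
reads = [
    ("chr1", 10, "ACGT"),
    ("chr1", 50, "GGCC"),
    ("chr2", 5, "ATAT"),
    ("chr2", 9, "CGCG"),
]

gc_counts = process_by_chromosome(reads, count_gc)
print(gc_counts)
```

For CPU-bound work in CPython, a process pool would be the more realistic choice; threads are used here only to keep the sketch short and portable.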

source

docs

AIList

Augmented Interval List is a data structure with the fastest currently known algorithm for searching for genomic overlaps between two sets of genomic ranges with high containment. Language: C

source

docs

geniml

Genomic interval machine learning toolkit. Provides machine learning methods for genomic interval data, such as BED files. Language: Python

Pipelines

source

docs

PEPPRO is a pipeline designed to process PRO-seq data. It is optimized on unique features of PRO-seq to be fast and accurate. It performs adapter removal (including UMIs of variable length), read deduplication, trimming, and mapping, and it produces signal tracks (bigWig) for plus and minus strands using scaled (based on mappability information) or unscaled read count patterns. Language: Python

source

docs

PEPATAC is an ATAC-seq pipeline. It trims adapters, maps reads, calls peaks, and creates bigWig tracks, TSS enrichment files, and other outputs. It is optimized on unique features of ATAC-seq data to be fast and accurate, and it provides several unique analytical approaches. Language: Python

source

dnameth

Pipelines for Whole Genome and Reduced Representation Bisulfite-seq. Language: Python

source

rnapipe

Pipeline for RNA-seq data. Language: Python

Web resources and services

LOLAweb - A server that publicly hosts our Shiny interface to the LOLA R package.
Refgenie reference genome asset server - Implementation of refgenieserver, hosting various genome-related resources.
Regulatory Elements Database - A database of DNase hypersensitivity data.
Ewing Sarcoma Epigenome Resources - Comprehensive epigenome mapping of Ewing sarcoma.
Region Databases - Curated databases of region sets for use with LOLA and other tools.

Papers that published raw or processed data

Year	Journal	Title	Data
2017	Nature Medicine	DNA methylation heterogeneity defines a disease spectrum in Ewing sarcoma	Data site GEO:GSE88826
2015	Cell Reports	Epigenome mapping reveals distinct modes of gene regulation and widespread enhancer reprogramming by the oncogenic fusion protein EWS-FLI1	Data site
2015	Cell Reports	Single-cell DNA methylome sequencing and bioinformatic inference of epigenomic cell-state dynamics	GEO:GSE65196
2013	PLoS Genetics	Extensive evolutionary changes in regulatory element activity during human origins are associated with altered gene expression and positive selection	GEO:GSE54908
