We develop and apply computational methods
to organize, analyze, and understand large epigenomic data.
Biological motivation
Cells alter phenotype by using DNA differently.
Breakdowns lead to disease
Full-stack bioinformatics
Full-stack bioinformatics
Full-stack bioinformatics
Full-stack bioinformatics
Augmented Interval List
Full-stack bioinformatics
Full-stack bioinformatics
Analysis of DNA methylation in Ewing sarcoma
Full-stack bioinformatics
Recent projects in the Sheffield lab
- Analysis methods for genomic region sets
- [LOLA](/slides/adventure.html#/LOLA): Enrichment of genomic ranges
- [MIRA](/slides/adventure.html#/MIRA): DNA methyation inference of regulatory activity
- [COCOA](/slides/adventure.html#/COCOA): Covariation analysis of epigenetic heterogeneity
- [PEPATAC](/slides/adventure.html#/PEPATAC): Serial alignments in ATAC-seq data processing
- Scientific computing and metadata management
- [Refgenie](/slides/adventure.html#/refgenie): Standardizing reference assembly assets
- [PEP](/slides/adventure.html#/pepkit): A management structure for sample metadata
- [bulker](/slides/adventure.html#/bulker): Using docker containers for desktop computing
Epigenomic data: high-dimensional and low-interpretable
Dimensionality reduction
Even with known groups
How can we annotate the source of variation?
COCOA Overview
John Lawson
Coordinate Covariation Analysis
Quantify variation into a 'target variable'
Supervised (e.g. clincial variable).
Unsupervised (e.g. PCA)
Annotate target variable with region sets.
What is epigenetic signal covariation?
What is epigenetic signal covariation?
What is epigenetic signal covariation?
What is epigenetic signal covariation?
What is epigenetic signal covariation?
Covariation informs source of observed variation
1. Choose target variable
What is the variation we'd like to explain?
Supervised target
Unsupervised target
2. Quantify correlation with target variable
2. Quantify correlation with target variable
Permutation tests establish significance
Case studies
Breast cancer DNA methylation (Unsupervised) Breast cancer ATAC-seq (Unsupervised) Kidney cancer DNA methylation (Supervised) Pan-cancer EZH2 analysis
Breast cancer DNA methylation PCA
COCOA results for PC1
ER-related regions have higher loadings on PC1
Raw DNA Methylation in ER binding regions
COCOA results for PC1-4
COCOA results for PC1-4
COCOA meta-region plots for PC1-4
Case studies
Breast cancer DNA methylation (Unsupervised) Breast cancer ATAC-seq (Unsupervised) Kidney cancer DNA methylation (Supervised) Pan-cancer EZH2 analysis
Breast cancer ATAC-seq PCA
COCOA results for ATAC-seq
ER-related regions have higher loadings on PC1
COCOA results for ATAC-seq
Case studies
Breast cancer DNA methylation (Unsupervised) Breast cancer ATAC-seq (Unsupervised) Kidney cancer DNA methylation (Supervised) Pan-cancer EZH2 analysis
Kidney cancer DNA methylation (Supervised)
Rank region sets for methylation that correlates with cancer stage
COCOA results for cancer stage
COCOA results for cancer stage
COCOA results for cancer stage
Case studies
Breast cancer DNA methylation (Unsupervised) Breast cancer ATAC-seq (Unsupervised) Kidney cancer DNA methylation (Supervised) Pan-cancer EZH2 analysis
DNA methylation in EZH2 regions and survival
DNA methylation in EZH2-binding regions most often positively correlated with risk of death.
#### Conclusions: COCOA...
- can an interpret continuous regulatory variation
- can use any signal that annotates genomic coordinates
- can work on supervised and unsupervised data
- is available from Bioconductor
- depends critically on the database of region sets...
An optimized ATAC-seq pipeline
with serial alignments
Smith et al (2021, In press). NAR Genomics and Bioinformatics.
Jason Smith
PEPATAC strengths
Modular system
Prealignments
Flexibility and portability
Outputs
Command-line interface with only 3 required arguments
$ /pipelines/pepatac.py -h
usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
[-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
[-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
[INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
[--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
[--anno-name ANNO_NAME] [--keep]
[--peak-caller {fseq,macs2}]
[--trimmer {trimmomatic,skewer}]
[--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]
PEPATAC version 0.7.3
optional arguments:
-h, --help show this help message and exit
-R, --recover Overwrite locks to recover from previous failed run
-N, --new-start Overwrite all results to start a fresh run
-D, --dirty Don't auto-delete intermediate files
-F, --force-follow Always run 'follow' commands
-C CONFIG_FILE, --config CONFIG_FILE
Pipeline configuration file (YAML). Relative paths are
with respect to the pipeline script.
-O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
Parent output directory of project
-M MEMORY_LIMIT, --mem MEMORY_LIMIT
Memory limit (in Mb) for processes accepting such
-P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
Number of cores for parallelized processes
-I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
Secondary input files, such as read2
-Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
Single- or paired-end sequencing protocol
-gs GENOME_SIZE, --genome-size GENOME_SIZE
genome size for MACS2
--frip-ref-peaks FRIP_REF_PEAKS
Reference peak set for calculating FRiP
--TSS-name TSS_NAME Name of TSS annotation file
--anno-name ANNO_NAME
Name of reference bed file for calculating FRiF
--keep Keep prealignment BAM files
--peak-caller {fseq,macs2}
Name of peak caller
--trimmer {trimmomatic,pyadapt,skewer}
Name of read trimming program
--prealignments PREALIGNMENTS [PREALIGNMENTS ...]
Space-delimited list of reference genomes to align to
before primary alignment.
-V, --version show program's version number and exit
required named arguments:
-S SAMPLE_NAME, --sample-name SAMPLE_NAME
Name for sample to run
-I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
One or more primary input files
-G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
Identifier for genome assembly
Portable Encapsulated Projects (PEP) provide interoperability
Portable Encapsulated Projects (PEP) provide interoperability
Nuclear-mitochondrial DNA (NuMts) confuse aligners
Problems with region masking
Inaccurate alignment statistics
Requires pre-defined NuMt locations
Wastes compute power
### Advantages of serial alignments
- Accuracy (better rates plus no blacklist needed).
- Speed.
- Modular reference assemblies.
PEPATAC strengths
Modular system
Prealignments
Flexibility and portability
Outputs
### Flexibility and Portability
- trimmer options: `skewer` and `trimmomatic`
- peak caller options: `macs2` and `fseq`
- aligner options: `bowtie2` and `bwa`
```
./pepatac.py --trimmer trimmomatic --peak-caller fseq
```
Flexibility and Portability
parameterization via config file pepatac.yaml
### Flexibility and Portability
Running options:
- natively
- conda
- containers using `docker` or `singularity`.
- use bulker to manage containers for your (http://bulker.io)
```
git clone github.com/databio/pepatac
docker pull databio/pepatac
docker run --rm -it databio/pepatac pipelines/pepatac.py
```
Many tools require genome-related assets (like indexes).
How should we organize these on disk?
## A standard organization simplifies tool interface
Flexible paths must be passed individually:
```
pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \
--tss_annotation path/to/hg38/tss_annotation.bed \
--ensembl_anno path/to/hg38/ensembl_v86.gtf
```
A standard establishes expectations:
```
pipeline.py --genome hg38
```
## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer
iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*.
You download a tarball of a standard structure for your genome of interest, then write tools off that.
## The 'central repository' approach is limited
- *Not scripted.* No iGenomes for an arbitrary genome/asset.
- *Not modular*. No access to individual assets.
- *Not programmatic*. Can't access data/metadata via API.
## Refgenie solves these limitations
- *Two ways to retrieve an asset.*
- `build` any asset from a recipe.
- `pull` any individual asset from a server
- *Better discoverability and modularity*.
- `list/listr` shows assets
- `refgenieserver` is a browseable web interface and API
- *Managed locations*.
- `seek` returns the local path to assets
- `add/remove` to manage your own assets
Refgenie consists of 3 components
Refgenie splits tasks between CLI and server
## Refgenie CLI example
[http://refgenie.databio.org](http://refgenie.databio.org)
The build/pull method needs provenance checks
Asset provenance:
Genome provenance:
Refget
Refget enables access to reference sequences
using an identifier derived from the sequence itself.
## Refgenie implements (collection) refget-like
You first have to have the `fasta` asset:
```
refgenie pull -g hg38 -a fasta
```
Then you can use `getseq`:
```
refgenie getseq -g hg38 -l chr1:50000-50400
AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT
```
Sheffield lab Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi