Methods for analyzing non-coding genomic intervals and their applications in cancer biology
Nathan Sheffield, PhD
www.databio.org/slides
Outline
- Background, LOLA/MIRA (20%)
- COCOA (35%)
- RegionSet2vec (35%)
- Other projects (10%)
◁ Questions ▷
Biological motivation

Cells alter phenotype by using DNA differently.

Breakdowns lead to disease
Region pooling
Locus Overlap Analysis
Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.
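A minimal sketch of how an overlap enrichment of this kind could be scored, assuming a one-sided Fisher's exact test on a 2x2 table of region counts. The function name `enrichment_p_value` and the counts are illustrative only; this is not the LOLA package API.

```python
from math import comb

def enrichment_p_value(a, b, c, d):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    a: query regions that overlap the database region set
    b: non-query universe regions that overlap it
    c: query regions that do not overlap it
    d: non-query universe regions that do not overlap it
    """
    n_total = a + b + c + d
    n_overlap = a + b          # universe regions overlapping the database set
    n_query = a + c            # size of the query set
    upper = min(n_overlap, n_query)
    # Probability of seeing at least `a` overlaps by chance.
    tail = sum(
        comb(n_overlap, k) * comb(n_total - n_overlap, n_query - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(n_total, n_query)

# Made-up counts: 8 of 10 query regions overlap, versus 2 of 10 background regions.
p_enriched = enrichment_p_value(8, 2, 2, 8)
# Made-up counts with no enrichment: half overlap in both groups.
p_null = enrichment_p_value(5, 5, 5, 5)
```

A small p-value suggests the query set overlaps the database set more often than the universe would predict.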
Methylation-based Inference of Regulatory Activity (MIRA)
Lawson et al. (2018). Bioinformatics.

Sheffield et al. (2017). Nature Medicine.
Coordinate Covariation Analysis (COCOA)
Lawson et al. (2020). Genome Biology.
Goal: understand variation among individuals
Supervised differential analysis
Supervised continuous analysis
Unsupervised analysis
Epigenomic data: high-dimensional and hard to interpret

Dimensionality reduction

Even with known groups
How can we annotate the source of variation?
COCOA Overview

John Lawson

Coordinate Covariation Analysis
- Quantify variation as a 'target variable'
  - Supervised (e.g. clinical variable)
  - Unsupervised (e.g. PCA)
- Annotate the target variable with region sets
What is epigenetic signal covariation?

Covariation informs source of observed variation
1. Choose target variable
What is the variation we'd like to explain?
Supervised target
Unsupervised target
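As a rough illustration of the two kinds of target variable, the sketch below takes a supervised target from a made-up clinical variable and an unsupervised target from first-principal-component scores. It uses plain power iteration rather than a PCA library; the helper names and the toy matrix are hypothetical.

```python
import math

def center_columns(matrix):
    """Subtract each feature's mean across samples (rows are samples)."""
    n_samples = len(matrix)
    means = [sum(col) / n_samples for col in zip(*matrix)]
    return [[value - mean for value, mean in zip(row, means)] for row in matrix]

def first_pc_scores(matrix, n_iter=200):
    """Per-sample scores on the first principal component, via power iteration."""
    x = center_columns(matrix)
    n_features = len(x[0])
    weights = [1.0 / math.sqrt(n_features)] * n_features  # arbitrary unit start vector
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, weights)) for row in x]
        # Multiply by X^T to get the next (unnormalized) loading vector.
        updated = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(n_features)]
        norm = math.sqrt(sum(v * v for v in updated))
        weights = [v / norm for v in updated]
    return [sum(a * b for a, b in zip(row, weights)) for row in x]

# Toy methylation matrix: 4 samples (rows) x 3 CpGs (columns); values are made up.
methylation = [
    [0.9, 0.8, 0.5],
    [0.8, 0.9, 0.5],
    [0.1, 0.2, 0.5],
    [0.2, 0.1, 0.5],
]

supervised_target = [3, 4, 1, 1]                      # hypothetical per-sample cancer stage
unsupervised_target = first_pc_scores(methylation)    # PC1 score per sample
```

Either list has one value per sample, so the downstream steps do not need to know which kind of target was chosen.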
2. Quantify correlation with target variable
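One way to picture this step: correlate each feature's signal with the target variable across samples, then summarize the features that fall inside a region set. The sketch below averages absolute Pearson correlations, which is an assumption made for illustration; the function names and toy data are hypothetical, and this is not the COCOA package API.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def feature_correlations(signal, target):
    """signal maps (chrom, position) -> per-sample values; returns coord -> r."""
    return {coord: pearson(values, target) for coord, values in signal.items()}

def region_set_score(correlations, region_set):
    """Mean absolute correlation of features inside any (chrom, start, end) region."""
    hits = [
        abs(r)
        for (chrom, pos), r in correlations.items()
        if any(chrom == c and start <= pos < end for c, start, end in region_set)
    ]
    return sum(hits) / len(hits) if hits else float("nan")

# Toy data: 4 samples, 3 CpGs; all values are made up.
target = [1.0, 2.0, 3.0, 4.0]
signal = {
    ("chr1", 100): [0.1, 0.2, 0.3, 0.4],   # rises with the target
    ("chr1", 150): [0.4, 0.3, 0.2, 0.1],   # falls with the target
    ("chr1", 900): [0.5, 0.1, 0.1, 0.5],   # unrelated to the target
}
correlations = feature_correlations(signal, target)
score_related = region_set_score(correlations, [("chr1", 50, 200)])
score_unrelated = region_set_score(correlations, [("chr1", 800, 1000)])
```

Using the absolute value lets features that move in opposite directions both count toward the region set's score.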

Permutation tests establish significance
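A minimal sketch of a permutation test for a region set score, assuming the null distribution is built by shuffling the target variable across samples and that significance is reported as a simple empirical p-value. The scoring function, names, and data are illustrative and may differ from what the COCOA package does.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def score(features, target):
    """Mean absolute correlation between each feature and the target."""
    return sum(abs(pearson(f, target)) for f in features) / len(features)

def permutation_p_value(features, target, n_perm=999, seed=0):
    """Empirical p-value: how often a shuffled target scores at least as high."""
    rng = random.Random(seed)
    observed = score(features, target)
    shuffled = list(target)
    at_least_as_high = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks the sample-to-target pairing
        if score(features, shuffled) >= observed:
            at_least_as_high += 1
    # Add-one correction so the p-value is never exactly zero.
    return (at_least_as_high + 1) / (n_perm + 1)

# Made-up signal for the features in one region set, across 8 samples.
target = [1, 2, 3, 4, 5, 6, 7, 8]
features = [
    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],
]
p_value = permutation_p_value(features, target)
```

Shuffling the target rather than the signal keeps the correlation structure among features intact under the null.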

Case studies
Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis
Breast cancer DNA methylation PCA

COCOA results for PC1

ER-related regions have higher loadings on PC1
Raw DNA Methylation in ER binding regions

COCOA results for PC1-4

COCOA meta-region plots for PC1-4

Breast cancer ATAC-seq PCA

COCOA results for ATAC-seq

ER-related regions have higher loadings on PC1
Kidney cancer DNA methylation (Supervised)

Rank region sets for methylation that correlates with cancer stage
COCOA results for cancer stage
DNA methylation in EZH2 regions and survival

DNA methylation in EZH2-binding regions is most often positively correlated with risk of death.
#### Conclusions: COCOA...
- can interpret continuous regulatory variation
- can use any signal that annotates genomic coordinates
- can work on supervised and unsupervised data
- is available from Bioconductor
- depends critically on the database of region sets...
What does it mean for two region sets (BED files) to be similar?
Overlap makes some sense... but what about:
- degree of overlap?
- weighting of specific regions?
- biological similarity of regions?
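To make the overlap baseline concrete, here is a small sketch that scores two region sets by the Jaccard index of the base pairs they cover, which accounts for degree of overlap but not for weighting or biology. It assumes BED-style 0-based, half-open `(chrom, start, end)` tuples; the function names and intervals are made up.

```python
def merge(intervals):
    """Sort intervals and merge any that overlap or touch on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return merged

def covered_bp(merged):
    """Total bases covered by a merged interval list."""
    return sum(end - start for _, start, end in merged)

def shared_bp(a, b):
    """Bases covered by both merged, sorted interval lists (two-pointer sweep)."""
    i = j = total = 0
    while i < len(a) and j < len(b):
        chrom_a, start_a, end_a = a[i]
        chrom_b, start_b, end_b = b[j]
        if chrom_a == chrom_b:
            total += max(0, min(end_a, end_b) - max(start_a, start_b))
            # Advance whichever interval ends first.
            if end_a <= end_b:
                i += 1
            else:
                j += 1
        elif chrom_a < chrom_b:
            i += 1
        else:
            j += 1
    return total

def jaccard(set_a, set_b):
    """Shared bases divided by bases covered by either set."""
    a, b = merge(set_a), merge(set_b)
    shared = shared_bp(a, b)
    union = covered_bp(a) + covered_bp(b) - shared
    return shared / union if union else 0.0

bed_a = [("chr1", 100, 200), ("chr1", 300, 400)]
bed_b = [("chr1", 150, 250), ("chr2", 10, 20)]
similarity = jaccard(bed_a, bed_b)   # 50 shared bp out of 260 covered bp
```

Two sets that bind nearby but non-overlapping sites score 0 here, which is one reason overlap alone can miss biological similarity.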
The bag-of-words model for text classification
Zheng and Casari (2018), Feature Engineering for Machine Learning
The bag-of-intervals model for genomic intervals
Advantages
- Vector representation of a region set
- Similarity metrics among vectors
- Space and time complexity
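A minimal sketch of the bag-of-intervals idea, assuming a fixed "universe" of regions plays the role of the vocabulary and a region set is encoded as 1 wherever it overlaps a universe region. The universe, the region sets, and the function names are hypothetical.

```python
import math

def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def bag_of_intervals(region_set, universe):
    """Binary vector with one entry per universe region."""
    return [
        1 if any(overlaps(region, u) for region in region_set) else 0
        for u in universe
    ]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Made-up universe (the "vocabulary") and two made-up region sets.
universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 200, 300), ("chr2", 0, 100)]
set_a = [("chr1", 10, 50), ("chr1", 120, 180)]
set_b = [("chr1", 150, 250)]

vec_a = bag_of_intervals(set_a, universe)   # [1, 1, 0, 0]
vec_b = bag_of_intervals(set_b, universe)   # [0, 1, 1, 0]
similarity = cosine(vec_a, vec_b)
```

Once every region set is a vector over the same universe, any standard vector similarity metric can be applied.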
Limitations of the bag of words vector approach
hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
- Sparsity
- Curse of dimensionality
- No concept of relationships among words
- Space and time complexity
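The hotel/motel example can be checked directly: two different one-hot vectors never share a nonzero position, so their cosine similarity is zero no matter how related the words are. The `cosine` helper below is written for this illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# One-hot vectors as shown above.
hotel = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
motel = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

hotel_vs_motel = cosine(hotel, motel)   # 0.0: no shared dimension
hotel_vs_hotel = cosine(hotel, hotel)   # 1.0: identical vectors
```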
Decreasing space/time complexity
Genomic interval sets
↓
High-dimensional vectors
↓
Low-dimensional vectors
Word embeddings
http://suriyadeepan.github.io
Word2vec model
Word context
You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
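To show what "context" means in practice, the sketch below lists the (target, context) pairs that a skip-gram style model would see for a fixed window size. It covers pair generation only, not training; the function name and window size are illustrative choices.

```python
def context_pairs(tokens, window=2):
    """Pair each token with every neighbor within `window` positions."""
    pairs = []
    for i, target in enumerate(tokens):
        first = max(0, i - window)
        last = min(len(tokens), i + window + 1)
        for j in range(first, last):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "you shall know a word by the company it keeps".split()
pairs = context_pairs(sentence, window=1)
# Neighbors seen for the token "word" with a window of 1.
word_contexts = [context for target, context in pairs if target == "word"]
```

Tokens that keep appearing with the same neighbors produce similar training pairs, which is what pushes their vectors together.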
Image credit: Shubham Agarwal
Genomic Interval Embeddings
Evaluation
We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?
Evaluation 1: Classification performance
Evaluation 2: Perturbation similarity detection
Evaluation 3: Peak threshold robustness
Conclusion
- RegionSet2vec uses an adapted word2vec model to train vectors for genomic regions
- RegionSet2vec embeddings capture expected biological annotations
- RegionSet2vec reflects known simulated perturbations
- RegionSet2vec is robust to missing data
- NLP approaches can be adapted for applications in genomic interval analysis
Future applications
Cancer mutations
Single-cell
A high-performance server and API for genomic interval data.
- Data spans projects (40,000 GEO accessions, 100,000+ files)
- Programmatic API for metadata, statistics, and data chunks
- Human browsing of statistical and biological attributes
- Human-friendly search
Thank You
Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Mikhail Dozmorov
Fran Garrett-Bakelman
Christoph Bock
Eleni Tomazou
Sheffield lab
Erfaneh Gharavi
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi
Alumni
Aaron Gu
Jianglin Feng
Tessa Danehy
Michal Stolarczyk
John Lawson
Jason Smith
nsheff · databio.org · nsheffield@virginia.edu