Methods for analyzing non-coding genomic intervals and their applications in cancer biology

Nathan Sheffield, PhD

www.databio.org/slides

Outline

COCOA

20%

35%

10%

Background, LOLA/MIRA

RegionSet2vec

Other projects


Biological motivation

Cells alter phenotype by using DNA differently.

Breakdowns lead to disease

Region pooling

Locus Overlap Analysis

http://code.databio.org/LOLA/

Sheffield and Bock (2016). Bioinformatics.

Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.
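
As a rough illustration of the idea behind locus overlap analysis, the sketch below counts how many query regions overlap a reference region set and asks how surprising that count is relative to a background universe. It is a minimal sketch, not LOLA's implementation: the function names, the toy coordinates, the half-open interval convention, and the one-sided hypergeometric tail are all assumptions made for this example, and the query is assumed to be a subset of the universe.

```python
from math import comb

def overlaps(a, b):
    """True if two half-open intervals (chrom, start, end) share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def count_hits(regions, reference):
    """Number of regions that overlap any interval in the reference set."""
    return sum(any(overlaps(r, ref) for ref in reference) for r in regions)

def enrichment_p_value(query, universe, reference):
    """One-sided hypergeometric tail: chance of at least the observed overlap count."""
    n_universe = len(universe)
    n_query = len(query)
    k_universe = count_hits(universe, reference)  # universe regions hitting the reference
    k_query = count_hits(query, reference)        # query regions hitting the reference
    upper = min(k_universe, n_query)
    tail = sum(
        comb(k_universe, i) * comb(n_universe - k_universe, n_query - i)
        for i in range(k_query, upper + 1)
    )
    return k_query, tail / comb(n_universe, n_query)

# Toy data (hypothetical coordinates); the query is a subset of the universe.
universe = [("chr1", i * 100, i * 100 + 50) for i in range(10)]
query = universe[:4]
reference = [("chr1", 0, 10), ("chr1", 120, 130), ("chr1", 210, 260), ("chr1", 900, 1000)]

support, p_value = enrichment_p_value(query, universe, reference)
print(support, round(p_value, 4))
```

In practice the same test would be repeated for every region set in a database and the results ranked, which is why the choice of universe matters for interpretation.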

Methylation-based Inference of Regulatory Activity (MIRA)

http://code.databio.org/MIRA/

Lawson et al. (2018). Bioinformatics.

Sheffield et al. (2017). Nature Medicine.

Coordinate Covariation Analysis (COCOA)

http://code.databio.org/COCOA/

Lawson et al. (2020). Genome Biology.

Goal: understand variation among individuals

Supervised differential analysis

Supervised continuous analysis

Unsupervised analysis

Epigenomic data: high-dimensional
and hard to interpret

Dimensionality reduction

Even with known groups

How can we annotate the source of variation?

COCOA Overview

John Lawson

Coordinate Covariation Analysis

Quantify variation into a 'target variable'

Supervised (e.g. clinical variable)
Unsupervised (e.g. PCA)

Annotate target variable with region sets.

What is epigenetic signal covariation?

Covariation informs source of observed variation

1. Choose target variable

What is the variation we'd like to explain?

Supervised target

Unsupervised target
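
To make the two kinds of target concrete, the sketch below builds one of each from a small samples-by-features signal matrix. The supervised target is just a per-sample value supplied by the analyst; the unsupervised target is the per-sample score on the first principal component, computed here with plain power iteration to keep the example dependency-free. The data, the variable names, and the choice of power iteration are assumptions for illustration only.

```python
from math import sqrt

def center_columns(matrix):
    """Subtract each feature's mean across samples."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[value - means[j] for j, value in enumerate(row)] for row in matrix]

def first_pc_scores(matrix, n_iter=100):
    """Per-sample scores on the first principal component (power iteration)."""
    x = center_columns(matrix)
    n_features = len(x[0])
    v = [1.0 / sqrt(n_features)] * n_features  # deterministic starting direction
    for _ in range(n_iter):
        scores = [sum(a * b for a, b in zip(row, v)) for row in x]  # X v
        v = [sum(s * row[j] for s, row in zip(scores, x)) for j in range(n_features)]  # X^T (X v)
        norm = sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
    return [sum(a * b for a, b in zip(row, v)) for row in x]

# Hypothetical signal matrix: 6 samples (rows) x 3 genomic features (columns).
signal = [
    [0.90, 0.80, 0.5],
    [0.80, 0.90, 0.4],
    [0.85, 0.85, 0.6],
    [0.10, 0.20, 0.5],
    [0.20, 0.10, 0.6],
    [0.15, 0.15, 0.4],
]

# Supervised target: a known per-sample variable (hypothetical stage values).
supervised_target = [3, 3, 2, 1, 1, 2]

# Unsupervised target: the dominant axis of variation in the data itself.
unsupervised_target = first_pc_scores(signal)
print([round(s, 3) for s in unsupervised_target])
```

Either vector has one value per sample, so the downstream correlation step can treat them identically.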

2. Quantify correlation with target variable

Permutation tests establish significance
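
The two steps above can be sketched as follows: correlate each genomic feature's signal with the target variable, average the absolute correlations of the features that fall inside a region set, then shuffle the target across samples to see how often a score that large arises by chance. This is a simplified sketch under stated assumptions, not COCOA's actual scoring code: the mean absolute Pearson correlation, the empirical p-value formula, the function names, and the toy data are all choices made for this example.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0

def region_set_score(signal, coords, regions, target):
    """Mean |correlation with target| over features located inside the region set."""
    inside = [
        i for i, (chrom, pos) in enumerate(coords)
        if any(c == chrom and s <= pos < e for c, s, e in regions)
    ]
    return sum(abs(pearson(signal[i], target)) for i in inside) / len(inside)

def permutation_test(signal, coords, regions, target, n_perm=200, seed=0):
    """Empirical p-value from shuffling the target across samples."""
    rng = random.Random(seed)
    observed = region_set_score(signal, coords, regions, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if region_set_score(signal, coords, regions, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical data: 3 features (rows) measured in 8 samples (columns).
signal = [
    [0.10, 0.20, 0.15, 0.10, 0.80, 0.90, 0.85, 0.90],
    [0.20, 0.10, 0.20, 0.15, 0.90, 0.80, 0.90, 0.85],
    [0.50, 0.40, 0.60, 0.50, 0.50, 0.60, 0.40, 0.50],
]
coords = [("chr1", 150), ("chr1", 180), ("chr2", 500)]  # feature positions
regions = [("chr1", 100, 200)]                          # one toy region set
target = [0, 0, 0, 0, 1, 1, 1, 1]                       # e.g. a group label or PC score

observed, p_value = permutation_test(signal, coords, regions, target)
print(round(observed, 3), round(p_value, 3))
```

Shuffling the target rather than the regions keeps the correlation structure among features intact, so the null distribution reflects only the loss of the sample-to-target link.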

Case studies

Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis

Breast cancer DNA methylation PCA

COCOA results for PC1

ER-related regions have higher loadings on PC1

Raw DNA Methylation in ER binding regions

COCOA results for PC1-4

COCOA meta-region plots for PC1-4

Breast cancer ATAC-seq PCA

COCOA results for ATAC-seq

ER-related regions have higher loadings on PC1

Kidney cancer DNA methylation (Supervised)

Rank region sets for methylation
that correlates with cancer stage

COCOA results for cancer stage

DNA methylation in EZH2 regions and survival

DNA methylation in EZH2-binding regions is most often positively correlated with risk of death.

Conclusions: COCOA...

can interpret continuous regulatory variation
can use any signal that annotates genomic coordinates
can work on supervised and unsupervised data
is available from Bioconductor
depends critically on the database of region sets...

Region-set 2 Vec

Embeddings of genomic region sets
in lower dimensions.

https://github.com/databio/regionset-embedding

Gharavi et al. (2021). Bioinformatics.

Erfaneh Gharavi

Word embeddings

http://suriyadeepan.github.io

Word2vec model

Mikolov et al. (2013). arXiv:1301.3781v3.

Word context

You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.

Image credit: Shubham Agarwal

Genomic context

A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
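
The analogy to word context can be sketched by treating each BED file as a "sentence" whose "words" are genomic regions, then generating the (center, context) pairs a skip-gram model would train on. This is a minimal sketch, not the Regionset2vec pipeline: it assumes regions have already been mapped to a shared vocabulary so that identical tokens recur across files, it shuffles regions within a file on the assumption that BED row order carries no contextual meaning, and the function names and window size are hypothetical. No model is trained here.

```python
import random

def region_token(chrom, start, end):
    """Turn one region into a word-like token."""
    return f"{chrom}:{start}-{end}"

def bed_to_sentence(bed_text, rng):
    """Parse BED-formatted text into a shuffled list of region tokens."""
    tokens = []
    for line in bed_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or line.startswith(("#", "track", "browser")):
            continue  # skip blank, comment, and header lines
        tokens.append(region_token(fields[0], int(fields[1]), int(fields[2])))
    rng.shuffle(tokens)  # assumption: row order is not meaningful context
    return tokens

def skipgram_pairs(sentence, window=2):
    """All (center, context) pairs within the given window."""
    pairs = []
    for i, center in enumerate(sentence):
        lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

# Hypothetical BED content with three regions.
bed_text = """# toy example
chr1\t100\t200
chr1\t500\t650
chr2\t1000\t1200
"""

sentence = bed_to_sentence(bed_text, random.Random(0))
pairs = skipgram_pairs(sentence, window=2)
print(sentence)
print(len(pairs))
```

Pairs like these could be passed to any word2vec-style trainer; regions that repeatedly co-occur in the same files would then end up with nearby vectors.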

Genomic Interval Embeddings

Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?

Evaluation 1: Classification performance
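
One simple way to ask whether embedding vectors reflect biology is to check whether a region set's nearest neighbor in embedding space shares its label. The sketch below computes leave-one-out 1-nearest-neighbor accuracy with cosine similarity. It is an illustrative stand-in, not the classifier used in the evaluation: the 3-dimensional vectors (the real embeddings are 100-dimensional), the cell-type labels, and the function names are all hypothetical.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def loo_1nn_accuracy(vectors, labels):
    """Fraction of items whose most similar other item has the same label."""
    correct = 0
    for i, v in enumerate(vectors):
        others = [j for j in range(len(vectors)) if j != i]
        nearest = max(others, key=lambda j: cosine(v, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)

# Hypothetical embeddings for four region sets, with made-up labels.
vectors = [
    [1.0, 0.1, 0.0],
    [0.9, 0.2, 0.1],
    [0.0, 1.0, 0.1],
    [0.1, 0.9, 0.0],
]
labels = ["cell_type_A", "cell_type_A", "cell_type_B", "cell_type_B"]

accuracy = loo_1nn_accuracy(vectors, labels)
print(accuracy)
```

Because the embeddings are learned without labels, accuracy well above the majority-class baseline would suggest the vectors carry biological signal.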

Conclusion

Regionset2vec adapts word2vec to learn genomic region embeddings
Regionset2vec embeddings capture biological information
NLP approaches can be adapted for applications in genomic interval analysis

Future applications

Cancer mutations

Single-cell

A high-performance server and API
for genomic interval data.

http://bedbase.org

Data spans projects (e.g. all data on GEO; 40,000 accessions, 100,000+ BED files)
Programmatic API for metadata, statistics, and data chunks
Human browsing of statistical and biological attributes
Aware of similarities among BED files
Human-friendly search
Shaped into 'non-redundant' sets for analysis

Reference genome manager

http://refgenie.databio.org

Stolarczyk et al. (2020). GigaScience.

Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.

refgenie pull hg38/bowtie2_index

Portable Encapsulated Projects (PEP)

A structure and toolkit for organizing large-scale,
sample-intensive biological research projects

Sheffield et al. (2021). GigaScience. http://pep.databio.org/

Thank You

Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Mikhail Dozmorov
Fran Garrett-Bakelman
Christoph Bock
Eleni Tomazou

Sheffield lab
Erfaneh Gharavi
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi

Alumni
Aaron Gu
Jianglin Feng
Tessa Danehy
Michal Stolarczyk
John Lawson
Jason Smith

Funding:

NIGMS R35-GM128636

nsheff · databio.org · nsheffield@virginia.edu
