Methods for analyzing non-coding genomic intervals and their applications in cancer biology

Nathan Sheffield, PhD
www.databio.org/slides

Outline

Background, LOLA/MIRA (20%)
COCOA (35%)
RegionSet2vec (35%)
Other projects (10%)
◁ Questions ▷

Biological motivation




Cells alter phenotype by using DNA differently.

Breakdowns lead to disease

Region pooling


Locus Overlap Analysis

Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

Methylation-based Inference of Regulatory Activity (MIRA)

Lawson et al. (2018). Bioinformatics.

Sheffield et al. (2017). Nature Medicine.

Coordinate Covariation Analysis (COCOA)


Lawson et al. (2020). Genome Biology.

Goal: understand variation among individuals


Supervised differential analysis

Supervised continuous analysis

Unsupervised analysis

Epigenomic data: high-dimensional
and hard to interpret


Dimensionality reduction

Even with known groups
How can we annotate the source of variation?

COCOA Overview

John Lawson

Coordinate Covariation Analysis

  1. Quantify variation as a 'target variable'
    1. Supervised (e.g. clinical variable)
    2. Unsupervised (e.g. PCA)
  2. Annotate target variable with region sets
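
The two steps above can be sketched in a few lines of Python. This is a simplified illustration, not the COCOA package itself (which is an R/Bioconductor package with several scoring options): it assumes the target variable is already chosen, correlates each epigenetic feature with it across samples, and scores a region set by the mean absolute correlation of the features it overlaps. All names and toy values here (`features`, `signal`, `er_like`, `other`) are hypothetical.

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant feature carries no covariation
    return cov / (vx * vy) ** 0.5

def overlaps(feature, region):
    """True if a (chrom, pos) feature falls inside a (chrom, start, end) region."""
    chrom, pos = feature
    r_chrom, start, end = region
    return chrom == r_chrom and start <= pos < end

def score_region_set(features, signal, target, region_set):
    """Mean absolute feature-target correlation over features in the region set."""
    scores = [
        abs(pearson(signal[i], target))
        for i, feature in enumerate(features)
        if any(overlaps(feature, region) for region in region_set)
    ]
    return mean(scores) if scores else None

# Toy data: 3 features (e.g. CpGs) measured in 4 samples.
features = [("chr1", 100), ("chr1", 150), ("chr2", 500)]
signal = [
    [0.1, 0.2, 0.3, 0.4],  # rises with the target
    [0.4, 0.3, 0.2, 0.1],  # falls with the target
    [0.5, 0.1, 0.5, 0.1],  # weakly related to the target
]
target = [0, 1, 2, 3]      # e.g. a PC score or a clinical variable

er_like = [("chr1", 50, 200)]   # hypothetical region set covering the first two features
other = [("chr2", 400, 600)]    # hypothetical region set covering the third feature

er_like_score = score_region_set(features, signal, target, er_like)
other_score = score_region_set(features, signal, target, other)
```

Using the absolute correlation means a region set scores highly whenever its features track the target, regardless of direction.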

What is epigenetic signal covariation?


 

Covariation informs source of observed variation

1. Choose target variable

What is the variation we'd like to explain?
Supervised target
Unsupervised target

 

2. Quantify correlation with target variable


 


Permutation tests establish significance
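
A minimal sketch of the permutation idea, assuming the region set score is the mean absolute correlation between its features and the target variable. Sample labels are shuffled to break the link between signal and target, the score is recomputed, and an empirical p-value is taken from the resulting null distribution. The published method may summarize the null distribution differently; the function names and toy data below are hypothetical.

```python
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5

def region_set_score(signal_rows, target):
    """Mean absolute correlation of each feature (row) with the target."""
    return mean(abs(pearson(row, target)) for row in signal_rows)

def permutation_p_value(signal_rows, target, n_perm=1000, seed=0):
    """Empirical p-value: how often a shuffled target scores at least as high."""
    rng = random.Random(seed)
    observed = region_set_score(signal_rows, target)
    shuffled = list(target)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute sample labels
        if region_set_score(signal_rows, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (hits + 1) / (n_perm + 1)

# Toy region set: two features across 8 samples, both tracking the target.
target = list(range(8))
signal_rows = [list(range(8)), list(range(7, -1, -1))]

observed, p_value = permutation_p_value(signal_rows, target, n_perm=1000, seed=0)
```

The smallest p-value this can report is 1 / (n_perm + 1), so the number of permutations sets the resolution of the test.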


 


Case studies

Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis

Breast cancer DNA methylation PCA


COCOA results for PC1


ER-related regions have higher loadings on PC1

Raw DNA Methylation in ER binding regions


COCOA results for PC1-4


COCOA meta-region plots for PC1-4


Breast cancer ATAC-seq PCA


COCOA results for ATAC-seq


ER-related regions have higher loadings on PC1

Kidney cancer DNA methylation (Supervised)


Rank region sets for methylation
that correlates with cancer stage

COCOA results for cancer stage

DNA methylation in EZH2 regions and survival


DNA methylation in EZH2-binding regions is most often positively correlated with risk of death.

Conclusions: COCOA...

  • can interpret continuous regulatory variation
  • can use any signal that annotates genomic coordinates
  • can work on supervised and unsupervised data
  • is available from Bioconductor
  • depends critically on the database of region sets...

RegionSet2vec

Embeddings of genomic region sets
in lower dimensions.
Gharavi et al. (2021). Bioinformatics.

Erfaneh Gharavi

Word embeddings

http://suriyadeepan.github.io

Word2vec model


Mikolov et al. (2013). arXiv:1301.3781v3.

Word context

You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
Image credit: Shubham Agarwal

Genomic context

A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
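
One way to make this analogy concrete is to treat each BED file as a document and each region as a word, then generate (target, context) pairs of the kind a skip-gram word2vec model trains on. The sketch below assumes regions have already been mapped to a shared vocabulary, and it shuffles regions within a file because genomic order is not sentence order. These choices, and the helper names `bed_to_tokens` and `context_pairs`, are illustrative assumptions rather than a description of the published implementation.

```python
import random

def bed_to_tokens(bed_lines):
    """Turn BED lines into 'chrom:start-end' tokens, skipping headers and comments."""
    tokens = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        tokens.append(f"{chrom}:{start}-{end}")
    return tokens

def context_pairs(tokens, window=2, seed=0):
    """Shuffle a region 'document', then emit (target, context) pairs within a window."""
    rng = random.Random(seed)
    doc = list(tokens)
    rng.shuffle(doc)
    pairs = []
    for i, target in enumerate(doc):
        lo, hi = max(0, i - window), min(len(doc), i + window + 1)
        pairs.extend((target, doc[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy BED file with three regions and one comment line.
example_bed = [
    "# hypothetical peaks",
    "chr1\t100\t200",
    "chr1\t500\t600\tpeak2",
    "chr2\t50\t90",
]

tokens = bed_to_tokens(example_bed)
pairs = context_pairs(tokens, window=2)
```

Regions that repeatedly share BED files end up as each other's context, which is what pushes their vectors together during training.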

Genomic Interval Embeddings

Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?
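
A minimal sketch of how relationships among region-set vectors could be compared. It assumes a region set's embedding is the average of its region vectors and that similarity is measured with cosine similarity. Both are assumptions made for illustration, and the 3-dimensional toy vectors stand in for the 100-dimensional embeddings.

```python
import math

def mean_vector(vectors):
    """Element-wise average of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical region vectors grouped by the BED file they came from.
set_a = mean_vector([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
set_b = mean_vector([[0.9, 0.1, 0.0], [1.0, 0.1, 0.1]])
set_c = mean_vector([[0.0, 0.1, 0.9], [0.1, 0.0, 1.0]])

sim_ab = cosine(set_a, set_b)  # region sets with similar region vectors
sim_ac = cosine(set_a, set_c)  # region sets with dissimilar region vectors
```

If the embeddings reflect biology, region sets from the same cell type or antibody should score closer to each other than to unrelated sets, as `sim_ab` and `sim_ac` do in this toy case.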

Evaluation 1: Classification performance

Evaluation 2: Perturbation similarity detection

Evaluation 3: Peak threshold robustness

Conclusion

  • Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
  • Regionset2vec embeddings capture expected biological annotations
  • Regionset2vec reflects known simulated perturbations
  • Regionset2vec is robust to missing data
  • NLP approaches can be adapted for applications in genomic interval analysis

Future applications

Cancer mutations
Single-cell

A high-performance server and API
for genomic interval data.
  • Data spans projects (e.g. all data on GEO; 40,000 accessions, 100,000+ BED files)
  • Programmatic API for metadata, statistics, and data chunks
  • Human browsing of statistical and biological attributes
  • Aware of similarities among BED files
  • Human-friendly search
  • Shaped into 'non-redundant' sets for analysis

Reference genome manager

Stolarczyk et al. (2020). GigaScience.
Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.
refgenie pull hg38/bowtie2_index

Portable Encapsulated Projects (PEP)

A structure and toolkit for organizing large-scale,
sample-intensive biological research projects
Sheffield et al. (2021). GigaScience. http://pep.databio.org/

Thank You

Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Mikhail Dozmorov
Fran Garrett-Bakelman
Christoph Bock
Eleni Tomazou
Sheffield lab
Erfaneh Gharavi
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi

Alumni
Aaron Gu
Jianglin Feng
Tessa Danehy
Michal Stolarczyk
John Lawson
Jason Smith

Funding:
NIGMS R35-GM128636

nsheff · databio.org · nsheffield@virginia.edu