Epigenomes and intervals

Nathan Sheffield, PhD
www.databio.org/slides

Mission statement

We develop and apply computational methods
to organize, analyze, and understand large epigenomic data.


Full-stack bioinformatics



Biological motivation




Cells alter phenotype by using DNA differently.

Breakdowns lead to disease

What is epigenetics?

There has always been a place in biology for words that have different meanings for different people. Epigenetics is an extreme case, because it has several meanings with independent roots. (Bird 2007)
The meaning of the term "epigenetics" has itself undergone an evolution. (Felsenfeld 2014)
the causal study of embryological development (Waddington 1957, The strategy of the genes)
The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence (Riggs et al. 1996)
a change in the state of expression of a gene that does not involve a mutation, but that is nevertheless inherited in the absence of the signal (or event) that initiated the change. (Ptashne and Gann 2002)
the structural adaptation of chromosomal regions so as to register, signal or perpetuate altered activity states. (Bird 2007)

What is epigenetics?

Epigenetics refers to changes in gene regulation brought about through modifications to the DNA's packaging proteins or the DNA molecules themselves without changing the underlying sequence. (Lord and Cruchaga 2014, Nature Neuroscience)
the study of the mechanisms that allow cells to translate the nearly constant genome content of a multicellular organism into multiple functional and stable cellular conditions (Schwartzman and Tanay 2015)
Epigenetic processes are a means by which endogenous and exogenous cues exert long-term control over gene expression (Nugent et al. 2015)

What is epigenetics?

The pop definition:
The word literally means "on top of genetics," and it's the study of how individual genes can be activated or deactivated by life experiences. (The Week, 2013)

What is epigenomics?

epigenomics is the study of the physical modifications, associations and conformations of genomic DNA sequences (Schwartzman and Tanay 2015)
epigenomics is the study of the chemical modification and physical conformation of cellular DNA and bound proteins (Sheffield 2017)
The word "epigenome" lacks the baggage of heritability.

Rosa et al. 2013

Histone variants


https://en.wikipedia.org/wiki/Histone_octamer

Histone modification (PTM)


https://en.wikipedia.org/wiki/Histone

DNA Methylation


Chromatin conformation


Genomic intervals
## What can be represented as an interval?

- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc.)
- Non-coding DNA annotation (promoters, enhancers)
- Protein domains
- Anything else?

## Peaks

![](/images/presentations/epigenomics/peaks.svg)

Genomic intervals are often colloquially referred to as 'peaks'.

## SNPs

SNPs are intervals of width 1.

![](/images/presentations/epigenomics/snps.svg)

## Genes and gene components

![](/images/presentations/epigenomics/brca2-gene-model.png)

## Non-coding DNA annotation

![](/images/presentations/epigenomics/regulatory-build.png)

## Protein domains

![](/images/presentations/epigenomics/ets-domains.jpeg)

# Key point

Because of the linear nature of DNA and RNA, many biological entities can be conceptualized as genomic intervals. Genomic intervals are often a simplified abstraction of genomic sequence. Interval operations are fundamental in genomics.
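As a minimal sketch of the interval abstraction, the snippet below represents a peak, a SNP, and an exon with the same structure and tests them for overlap. The `Interval` type, the `overlaps` helper, the coordinates, and the BED-style half-open convention (0-based start, exclusive end) are assumptions chosen for illustration.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (assumed BED-style convention)
    end: int    # exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two half-open intervals on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Hypothetical coordinates, for illustration only.
peak = Interval("chr1", 1000, 1500)
snp = Interval("chr1", 1200, 1201)   # a SNP is an interval of width 1
exon = Interval("chr1", 1500, 1700)  # starts exactly where the peak ends

print(overlaps(peak, snp))   # True
print(overlaps(peak, exon))  # False: half-open intervals that only touch
```

The same overlap test applies regardless of what the interval stands for, which is why interval operations can be shared across so many kinds of genomic data.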

Locus Overlap Analysis

Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

Augmented Interval List (AIList)

A novel data structure for efficiently computing overlaps
across genomic interval data.
Feng et al. (2020). Bioinformatics.

Jianglin Feng

If subject list has no containment,
identifying overlaps is fast

Binary search on interval start positions, followed by backward steps:
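The search can be sketched roughly as follows, assuming half-open intervals sorted by start and a subject list in which no interval contains another (so the ends are sorted too). The function name and the toy coordinates are hypothetical.

```python
from bisect import bisect_left


def overlaps_no_containment(starts, ends, q_start, q_end):
    """Return indices of subject intervals overlapping [q_start, q_end).

    Assumes intervals are sorted by start and none contains another,
    which implies the ends are also in non-decreasing order.
    """
    hits = []
    # Binary search: last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    # Backward steps: stop at the first interval ending at or before the
    # query start; with sorted ends, nothing earlier can overlap either.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]


starts = [1, 4, 8, 12]
ends = [5, 9, 13, 16]
print(overlaps_no_containment(starts, ends, 6, 10))  # [1, 2]
```

The early stop is what makes this fast, and it is also the step that fails once an interval is allowed to contain its neighbors.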

The problem arises with contained interval overlaps

How can we improve efficiency
without guaranteeing no containment?

Many approaches to solve the 'containment' issue:

- Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
- R-trees (bedtools) [@Kent2002; @Quinlan2010]
- Augmented interval trees [@Cormen2001]

These methods try to structure the data to provide non-containment guarantees.

Methods provide non-containment guarantees

R-trees

Annotates tree nodes with a minimum bounding rectangle of elements. A query that does not intersect the bounding rectangle will not intersect any child element.

Nested Containment Lists

Augmented Interval List

1. Augment the list with the running maximum *end* value. *solves the problem for lowly-contained lists*
2. Decompose the list to minimize containment. *extends the solution to highly-contained lists*

Augment with the running maximum end value, `maxE`

Provides a local guarantee of no containment.
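A minimal sketch of how the running maximum might be used during a query, assuming half-open intervals sorted by start. The function names and the toy data are hypothetical, and this is an illustration of the idea rather than the published implementation.

```python
from bisect import bisect_left
from itertools import accumulate


def build_max_e(ends):
    """Running maximum of the end values, in start-sorted order."""
    return list(accumulate(ends, max))


def query(starts, ends, max_e, q_start, q_end):
    """Return indices of intervals overlapping [q_start, q_end)."""
    hits = []
    i = bisect_left(starts, q_end) - 1  # last interval with start < q_end
    # Keep stepping back while some interval at or before i could still
    # reach the query; max_e[i] <= q_start rules all of them out at once.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append(i)
        i -= 1
    return hits[::-1]


# The first interval contains the next two.
starts = [1, 3, 4, 10]
ends = [20, 5, 6, 12]
max_e = build_max_e(ends)
print(max_e)                               # [20, 20, 20, 20]
print(query(starts, ends, max_e, 7, 9))    # [0]
```

Stopping on `ends[i]` alone would halt at index 2 and miss the containing interval at index 0; stopping on `max_e[i]` keeps the result correct, at the cost of scanning past non-overlapping intervals wherever `max_e` stays constant.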

AIList works on contained lists

But long containment runs are problematic

Decompose long runs with constant `maxE`
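One way to picture the decomposition, as a simplified sketch and not the published AIList procedure: intervals that contain several of their immediate successors are moved into a separate sublist, and each sublist then gets its own `maxE` and is queried independently. The names `decompose`, `min_cover`, and `max_components`, the lookahead window of `2 * min_cover`, and the toy data are all assumptions made for illustration.

```python
def decompose(intervals, min_cover=2, max_components=10):
    """Split (start, end) tuples into sublists with less containment.

    An interval is moved to the next sublist when it contains at least
    `min_cover` of the `2 * min_cover` intervals that follow it.
    """
    components = []
    current = sorted(intervals)
    while current:
        if len(components) == max_components - 1:
            components.append(current)  # last sublist takes what remains
            break
        keep, moved = [], []
        for i, (start, end) in enumerate(current):
            following = current[i + 1 : i + 1 + 2 * min_cover]
            # Followers start at or after `start`, so an end within
            # `end` means the follower is contained.
            covered = sum(1 for _, e in following if e <= end)
            (moved if covered >= min_cover else keep).append((start, end))
        components.append(keep)
        current = moved
    return components


intervals = [(1, 100), (2, 4), (3, 5), (6, 8), (9, 11), (12, 14)]
for component in decompose(intervals):
    print(component)
```

In this toy example the long interval `(1, 100)` ends up in its own sublist, so the remaining sublist no longer has a long run of constant `maxE`.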

Performance

  • How does the `maxE` minimum run length affect performance?
  • How does it compare to existing approaches?
  • How does it scale with increasing size of subject?

Datasets

How does the `maxE` minimum run length affect performance?

How does it compare to existing approaches?

How does it scale with increasing size of subject?

Conclusion

  • Augmented Interval Lists add the running maximum end value to a list of intervals
  • The data structure is simpler than other methods
  • AILists improve performance, particularly in highly contained interval sets

Region-set 2 Vec

Embeddings of genomic region sets
in lower dimensions.
Gharavi et al. (2021). Bioinformatics.

Erfaneh Gharavi
What does it mean for two region sets (BED files) to be similar?
Overlap makes some sense... but what about:
  • degree of overlap?
  • weighting of specific regions?
  • biological similarity of regions?

The bag-of-words model for text classification


Zheng and Casari (2018), Feature Engineering for Machine Learning

The bag-of-intervals model for genomic intervals

    Advantages
  • Vector representation of a region set
  • Similarity metrics among vectors
  • Space and time complexity
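A bag-of-intervals vector could be built along these lines, assuming a fixed "universe" of reference regions that plays the role of the vocabulary. The universe, the two region sets, and the helper names are hypothetical; cosine similarity is used here as one example of a similarity metric among vectors.

```python
import math


def overlaps(a, b):
    """Half-open overlap test for (chrom, start, end) tuples."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def bag_of_intervals(region_set, universe):
    """Binary vector: 1 where any region in the set overlaps a universe region."""
    return [int(any(overlaps(r, u) for r in region_set)) for u in universe]


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


# Hypothetical universe and region sets, for illustration only.
universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 200, 300), ("chr2", 0, 100)]
set_a = [("chr1", 50, 120), ("chr2", 10, 20)]
set_b = [("chr1", 150, 160), ("chr2", 90, 95)]

vec_a = bag_of_intervals(set_a, universe)
vec_b = bag_of_intervals(set_b, universe)
print(vec_a)  # [1, 1, 0, 1]
print(vec_b)  # [0, 1, 0, 1]
print(round(cosine(vec_a, vec_b), 3))
```

Each vector has one dimension per universe region, so its length grows with the size of the universe.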

Limitations of the bag of words vector approach

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
  • Sparsity
  • Curse of dimensionality
  • No concept of relationships among words
  • Space and time complexity
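The missing notion of relatedness can be checked directly with the two vectors from the slide. The `cosine` helper is an illustrative assumption.

```python
import math


def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


hotel = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
motel = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(cosine(hotel, motel))  # 0.0
print(cosine(hotel, hotel))  # 1.0
```

Any two distinct one-hot vectors are orthogonal, so "hotel" is exactly as dissimilar to "motel" as it is to any other word in the vocabulary.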

Decreasing space/time complexity

Genomic interval sets

High-dimensional vectors

Low-dimensional vectors

Word embeddings

http://suriyadeepan.github.io

Word2vec model


Mikolov et al. (2013). arXiv:1301.3781v3.

Word context

You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
Image credit: Shubham Agarwal
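As a rough illustration of "the company a word keeps", the sketch below lists (center, context) pairs within a fixed window, which is the kind of training pair a skip-gram style word2vec model consumes. The function name and the window size are assumptions, and no model is trained here.

```python
def context_pairs(tokens, window=2):
    """List (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


sentence = "you shall know a word".split()
for pair in context_pairs(sentence, window=1):
    print(pair)
```

Words that repeatedly appear with the same context words produce similar training pairs, which is what pushes their learned vectors close together.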

Genomic Interval Embeddings

Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?

Evaluation 1: Classification performance

Evaluation 2: Perturbation similarity detection

Evaluation 3: Peak threshold robustness

Conclusion

  • Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
  • Regionset2vec embeddings capture expected biological annotations
  • Regionset2vec reflects known simulated perturbations
  • Regionset2vec is robust to missing data
  • NLP approaches can be adapted for applications in genomic interval analysis

Thank You

Collaborators
Vince Reuter
Andre Rendeiro
Levi Waldron

Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy
Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi
Funding:



NIGMS R35-GM128636

nsheff · databio.org · nsheffield@virginia.edu