Mission statement
We develop and apply computational methods
to organize, analyze, and understand large epigenomic data.
Full-stack bioinformatics
Biological motivation
Cells alter phenotype by using DNA differently.
Breakdowns lead to disease
There has always been a place in biology for words that have different meanings for different people. Epigenetics is an extreme case, because it has several meanings with independent roots. (Bird 2007)
The meaning of the term "epigenetics" has itself undergone an evolution. (Felsenfeld 2014)
the causal study of embryological development (Waddington 1957, The strategy of the genes)
The study of mitotically and/or meiotically heritable changes in gene function that cannot be explained by changes in DNA sequence
(Riggs et al. 1996)
a change in the state of expression of a gene that does not involve a mutation, but that is nevertheless inherited in the absence of the signal (or event) that initiated the change. (Ptashne and Gant 2002)
the structural adaptation of chromosomal regions so as to register, signal or perpetuate altered activity states. (Bird 2007)
What is epigenetics?
Epigenetics refers to changes in gene regulation brought about through modifications to the DNA's packaging proteins or the DNA molecules themselves without changing the underlying sequence.
(Lord and Cruchaga 2014, Nature Neuroscience)
the study of the mechanisms that allow cells to translate the nearly constant genome content of a multicellular organism into multiple functional and stable cellular conditions (Schwartzman and Tanay 2015)
Epigenetic processes are a means by which endogenous and exogenous cues exert long-term control over gene expression (Nugent et al. 2015)
What is epigenetics?
The pop definition:
The word literally means "on top of genetics," and it's the study of how individual genes can be activated or deactivated by life experiences. (The Week, 2013)
What is epigenomics?
epigenomics is the study of the physical modifications, associations and conformations of genomic DNA sequences (Schwartzman and Tanay 2015)
epigenomics is the study of the chemical modification and physical conformation of cellular DNA and bound proteins (Sheffield 2017)
The word "epigenome" lacks the baggage of heritability.
Rosa et al. 2013
Histone variants
https://en.wikipedia.org/wiki/Histone_octamer
Histone modification (PTM)
https://en.wikipedia.org/wiki/Histone
DNA Methylation
Chromatin conformation
## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
## Peaks
![](/images/presentations/epigenomics/peaks.svg)
Genomic intervals are often colloquially referred to as 'peaks'.
## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
## SNPs
SNPs are intervals of width 1.
![](/images/presentations/epigenomics/snps.svg)
## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc)
## Genes and gene components
![](/images/presentations/epigenomics/brca2-gene-model.png)
## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc)
- Non-coding DNA annotation (promoters, enhancers)
## Non-coding DNA annotation
![](/images/presentations/epigenomics/regulatory-build.png)
## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc)
- Non-coding DNA annotation (promoters, enhancers)
- Protein domains
## Protein domains
![](/images/presentations/epigenomics/ets-domains.jpeg)
## What can be represented as an interval?
- ChIP-seq or ATAC-seq peaks
- Single-Nucleotide Polymorphisms (SNPs)
- Genes and gene components (TSS, exons, introns, etc)
- Non-coding DNA annotation (promoters, enhancers)
- Protein domains
- Anything else?
# Key point
Because of the linear nature of DNA and RNA, many biological entities can be conceptualized as genomic intervals.
Genomic intervals are often a simplified abstraction of genomic sequence.
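As a minimal sketch of this abstraction, the entities listed earlier can all share one record type. The `Interval` class, its field names, the half-open 0-based coordinate convention (as in BED files), and the coordinates below are illustrative assumptions, not taken from any specific tool.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval, assuming 0-based half-open coordinates."""
    chrom: str
    start: int      # inclusive
    end: int        # exclusive
    name: str = ""

    @property
    def width(self) -> int:
        return self.end - self.start


def overlaps(a: Interval, b: Interval) -> bool:
    """True when both intervals are on one chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


# Hypothetical coordinates, for illustration only.
peak = Interval("chr1", 1000, 1500, "peak_1")
snp = Interval("chr1", 1200, 1201, "snp_1")    # a SNP has width 1
exon = Interval("chr1", 1500, 1800, "exon_1")  # starts where the peak ends
```

Here `overlaps(peak, snp)` is true, while `overlaps(peak, exon)` is false because half-open intervals that only touch share no base.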
Interval operations are fundamental in genomics
Locus Overlap Analysis
Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.
If the subject list has no containment,
identifying overlaps is fast:
binary search on the interval start positions, followed by backward steps.
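The search can be sketched as follows, assuming half-open intervals stored as parallel `starts` and `ends` lists sorted by start. The function name and data layout are illustrative assumptions. Without containment, the end positions are sorted as well, so the backward scan can stop at the first miss.

```python
from bisect import bisect_left


def query_no_containment(starts, ends, q_start, q_end):
    """Return indices of intervals overlapping [q_start, q_end).

    Assumes the intervals are sorted by start and that no interval
    contains another, which implies `ends` is sorted too.
    """
    # Last interval that starts before the query ends.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Step backward; the first interval ending at or before q_start
    # guarantees that all earlier intervals end too early as well.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]


# Hypothetical subject list without containment.
starts = [1, 5, 10, 20]
ends = [4, 12, 15, 25]
result = query_no_containment(starts, ends, 11, 21)
```

The cost is one binary search plus one step per reported overlap.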
The problem arises with contained interval overlaps
How can we improve efficiency
without guaranteeing no containment?
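A small illustration of why containment breaks the early stop, using hypothetical intervals: a long interval that contains its successors is hidden behind shorter ones, so a backward scan that stops at the first miss never reaches it. Both function names are assumptions for this sketch.

```python
from bisect import bisect_left


def early_stop_query(intervals, q_start, q_end):
    """Backward scan that stops at the first non-overlapping interval."""
    ordered = sorted(intervals)
    starts = [s for s, _ in ordered]
    i = bisect_left(starts, q_end) - 1
    hits = []
    while i >= 0 and ordered[i][1] > q_start:
        hits.append(ordered[i])
        i -= 1
    return hits[::-1]


def linear_query(intervals, q_start, q_end):
    """Exhaustive scan, used here only as the reference answer."""
    return [(s, e) for s, e in sorted(intervals) if s < q_end and e > q_start]


# (1, 100) contains every other interval in this list.
contained = [(1, 100), (5, 8), (10, 12), (20, 25)]
fast = early_stop_query(contained, 15, 22)
exact = linear_query(contained, 15, 22)
```

The early-stop scan halts at `(10, 12)` and misses `(1, 100)`, which the exhaustive scan reports.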
Many approaches to solve the 'containment' issue:
- Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
- R-trees (bedtools) [@Kent2002; @Quinlan2010], Augmented interval trees [@Cormen2001]
These methods try to structure the data
to provide non-containment guarantees
Methods provide non-containment guarantees
R-trees
An R-tree annotates each tree node with the minimum bounding rectangle of its elements. A query that does not intersect a node's bounding rectangle cannot intersect any of its child elements.
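The pruning idea can be shown in one dimension with a single level of grouping. This is a deliberate simplification, not an R-tree implementation and not how any particular tool is written: the "bounding rectangle" becomes a bounding interval, and `block_size` is a hypothetical parameter.

```python
def build_blocks(intervals, block_size=2):
    """Group start-sorted intervals into blocks with a bounding interval."""
    ordered = sorted(intervals)
    blocks = []
    for i in range(0, len(ordered), block_size):
        members = ordered[i:i + block_size]
        bound = (min(s for s, _ in members), max(e for _, e in members))
        blocks.append((bound, members))
    return blocks


def query_blocks(blocks, q_start, q_end):
    """Return (hits, number of blocks whose members were inspected)."""
    hits = []
    visited = 0
    for (b_start, b_end), members in blocks:
        if b_start >= q_end or b_end <= q_start:
            continue  # query misses the bound, so it misses every member
        visited += 1
        hits.extend((s, e) for s, e in members if s < q_end and e > q_start)
    return hits, visited


# Hypothetical intervals, for illustration only.
blocks = build_blocks([(1, 4), (2, 6), (10, 12), (11, 15), (20, 25), (22, 30)])
hits, visited = query_blocks(blocks, 11, 13)
```

Only the block bounded by `(10, 15)` is opened for this query; the other two are skipped by their bounds alone.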
Nested Containment Lists
Augmented Interval List
1. Augment the list with the running maximum *end* value. *Solves the problem for lists with little containment.*
2. Decompose the list to minimize containment. *Extends the solution to highly contained lists.*
Augment with the running maximum end value, `maxE`
Provides a local guarantee of no containment.
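A minimal sketch of the augmentation, assuming half-open `(start, end)` tuples; the function names are illustrative and this is not the AIList source code. `max_e[i]` holds the largest end among intervals `0..i`, so once `max_e[i]` falls at or below the query start, nothing earlier in the list can overlap.

```python
from bisect import bisect_left
from itertools import accumulate


def build_augmented(intervals):
    """Sort by start and attach the running maximum end value."""
    ordered = sorted(intervals)
    starts = [s for s, _ in ordered]
    ends = [e for _, e in ordered]
    max_e = list(accumulate(ends, max))
    return starts, ends, max_e


def query_max_e(starts, ends, max_e, q_start, q_end):
    """Return intervals overlapping [q_start, q_end)."""
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Stop only when maxE proves that no earlier interval reaches the query.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append((starts[i], ends[i]))
        i -= 1
    return hits[::-1]


# Hypothetical list in which (1, 100) contains all the others.
starts, ends, max_e = build_augmented([(1, 100), (5, 8), (10, 12), (20, 25)])
found = query_max_e(starts, ends, max_e, 15, 22)
```

The contained list is now answered correctly, but `max_e` is constant at 100 here, so the scan walks all the way back to the first interval.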
AIList works on contained lists
But long containment runs are problematic
Decompose long runs with constant `maxE`
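One way to sketch the decomposition, as a simplified assumption rather than the published algorithm: an interval that covers at least `min_cov_len` of the next `2 * min_cov_len` intervals is deferred to a later sublist, and each sublist gets its own `maxE`. The parameter names, defaults, and stopping rule are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate


def decompose(intervals, min_cov_len=2, max_components=10):
    """Split intervals into sublists that each have little containment."""
    remaining = sorted(intervals)
    components = []
    while remaining:
        last_slot = len(components) == max_components - 1
        if last_slot or len(remaining) <= min_cov_len:
            components.append(remaining)
            break
        kept, deferred = [], []
        for i, (s, e) in enumerate(remaining):
            window = remaining[i + 1:i + 1 + 2 * min_cov_len]
            covered = sum(1 for _, e2 in window if e2 <= e)
            (deferred if covered >= min_cov_len else kept).append((s, e))
        components.append(kept)
        remaining = deferred
    return components


def query_components(components, q_start, q_end):
    """Query every sublist with its own running maximum end value."""
    hits = []
    for comp in components:
        starts = [s for s, _ in comp]
        # Rebuilt per query for brevity; a real index would precompute it.
        max_e = list(accumulate((e for _, e in comp), max))
        i = bisect_left(starts, q_end) - 1
        while i >= 0 and max_e[i] > q_start:
            if comp[i][1] > q_start:
                hits.append(comp[i])
            i -= 1
    return sorted(hits)


# Hypothetical list in which (1, 100) contains all the others.
components = decompose([(1, 100), (5, 8), (10, 12), (20, 25), (30, 35)])
found = query_components(components, 15, 22)
```

With `(1, 100)` moved to its own sublist, the backward scan in the main sublist stops after two steps instead of walking to the start of the list.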
Performance
- How does the `maxE` minimum run length affect performance?
- How does it compare to existing approaches?
- How does it scale with increasing size of subject?
Datasets
How does the `maxE` minimum run length affect performance?
How does it compare to existing approaches?
How does it scale with increasing size of subject?
Conclusion
- Augmented Interval Lists add the running maximum end value (`maxE`) to a list of intervals
- The data structure is simpler than other methods
- AILists improve performance, particularly in highly contained interval sets
Word embeddings
http://suriyadeepan.github.io
Word2vec model
Word context
You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
Image credit: Shubham Agarwal
Genomic context
A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
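To make the analogy concrete, the sketch below treats each BED file as a document and each region as a word, then counts how often two regions keep each other's company. The file names and regions are invented, and co-occurrence counting only illustrates the notion of context; it is not the embedding model itself.

```python
from collections import Counter
from itertools import combinations

# Hypothetical region sets keyed by BED file name.
# Each region is written as a "chrom:start-end" token.
region_sets = {
    "tf_a_rep1.bed": ["chr1:100-200", "chr1:500-600", "chr2:50-150"],
    "tf_a_rep2.bed": ["chr1:100-200", "chr1:500-600"],
    "histone_b.bed": ["chr2:50-150", "chr3:10-90"],
}


def cooccurrence(region_sets):
    """Count how many files contain each unordered pair of regions."""
    counts = Counter()
    for regions in region_sets.values():
        for a, b in combinations(sorted(set(regions)), 2):
            counts[(a, b)] += 1
    return counts


counts = cooccurrence(region_sets)
```

Regions that appear together in both replicates get the highest count, which is the kind of signal a context-based model can learn from.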
Genomic Interval Embeddings
Evaluation
We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?
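One common way to compare embedding vectors is cosine similarity. The sketch below uses invented 3-dimensional vectors in place of the 100-dimensional ones; the labels and values are hypothetical, and the slides do not state which similarity measure was used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings of three region sets.
embeddings = {
    "tf_a_rep1": [0.9, 0.1, 0.0],
    "tf_a_rep2": [0.8, 0.2, 0.1],
    "histone_b": [0.0, 0.2, 0.9],
}

same_factor = cosine_similarity(embeddings["tf_a_rep1"], embeddings["tf_a_rep2"])
different = cosine_similarity(embeddings["tf_a_rep1"], embeddings["histone_b"])
```

If the embeddings reflect biology, region sets with related annotations should score higher with each other than with unrelated ones, as `same_factor` and `different` do in this toy case.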
Evaluation 1: Classification performance
Evaluation 2: Perturbation similarity detection
Evaluation 3: Peak threshold robustness
Conclusion
- Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
- Regionset2vec embeddings capture expected biological annotations
- Regionset2vec reflects known simulated perturbations
- Regionset2vec is robust to missing data
- NLP approaches can be adapted for applications in genomic interval analysis
Thank You
Collaborators
Vince Reuter
Andre Rendeiro
Levi Waldron
Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy
Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi
nsheff ·
databio.org ·
nsheffield@virginia.edu