Recent advances in genomic interval analysis

Nathan Sheffield, PhD
www.databio.org/slides

Outline

Augmented Interval Lists
Regionset2vec
Integrated Genome Database
BEDbase
◁ Questions ▷

Augmented Interval List (AIList)

A novel data structure for efficiently computing overlaps
across genomic interval data.
Feng et al. (2020). Bioinformatics.

Jianglin Feng

Locus Overlap Analysis (LOLA) overview

Sheffield and Bock (2016). Bioinformatics. Nagraj et al. (2018). Nucleic Acids Research.

LOLA requires comparing sets of intervals

Can we improve the efficiency to enable faster,
larger-scale analysis?

If subject list has no containment,
identifying overlaps is fast

Binary search on interval start coordinates, followed by backward steps:
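A minimal sketch of this search, assuming half-open intervals stored as two parallel lists sorted by start. The function name and the example coordinates are illustrative and do not come from the AIList code.

```python
from bisect import bisect_left

def find_overlaps_no_containment(starts, ends, q_start, q_end):
    """Return indices of intervals overlapping the query [q_start, q_end).

    Assumes the list is sorted by start and that no interval contains
    another, which means the end coordinates are sorted as well.
    """
    # Last interval whose start lies before the query end; everything to
    # its right starts too late to overlap.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Step backward. Because ends are sorted, the first interval that ends
    # at or before q_start guarantees that all earlier ones do too.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]

starts = [1, 5, 10, 20]
ends = [4, 12, 15, 30]
hits = find_overlaps_no_containment(starts, ends, 11, 21)
print(hits)
```

The early exit in the `while` loop is exactly what breaks when one interval contains another, because a short interval can then hide a longer overlapping one further to the left.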

The problem arises with contained interval overlaps

How can we improve efficiency
without guaranteeing no containment?

Many approaches to solve the 'containment' issue:

  • Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
  • R-trees (bedtools) [@Kent2002; @Quinlan2010]
  • Augmented interval trees [@Cormen2001]

These methods try to structure the data to provide non-containment guarantees.

Methods provide non-containment guarantees

R-trees

Annotates tree nodes with a minimum bounding rectangle of elements. A query that does not intersect the bounding rectangle will not intersect any child element.

Nested Containment Lists

Augmented Interval List

1. Augment the list with the running maximum *end* value. *Solves the problem for lowly-contained lists.*
2. Decompose the list to minimize containment. *Extends the solution to highly-contained lists.*

Augment with the running maximum end value, `maxE`

Provides a local guarantee of no containment.
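The idea can be sketched as follows, assuming half-open intervals given as `(start, end)` tuples. The function names `build_ailist` and `query_ailist` are made up for this illustration and are not the names used in the published implementation.

```python
from bisect import bisect_left
from itertools import accumulate

def build_ailist(intervals):
    """Sort by start and attach the running maximum end value (maxE)."""
    ivs = sorted(intervals)
    starts = [s for s, _ in ivs]
    ends = [e for _, e in ivs]
    max_e = list(accumulate(ends, max))  # max_e[i] = max(ends[0..i])
    return starts, ends, max_e

def query_ailist(starts, ends, max_e, q_start, q_end):
    """Return intervals overlapping [q_start, q_end)."""
    i = bisect_left(starts, q_end) - 1
    hits = []
    # max_e[i] bounds every end at or left of i, so once it drops to
    # q_start or below, nothing further left can overlap the query.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append((starts[i], ends[i]))
        i -= 1
    return hits[::-1]

starts, ends, max_e = build_ailist([(1, 20), (3, 5), (6, 8), (10, 12), (15, 30)])
print(max_e)
print(query_ailist(starts, ends, max_e, 25, 26))  # stops after one backward step
print(query_ailist(starts, ends, max_e, 9, 11))   # must walk past (6, 8) and (3, 5)
```

In the second query the long interval `(1, 20)` keeps `maxE` high, so the scan has to visit every interval it contains. That is the cost a long containment run imposes.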

AIList works on contained lists

But long containment runs are problematic

Decompose long runs with constant `maxE`
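One way to picture the decomposition is the simplified sketch below. It assumes a rule in which an interval is moved to a separate sub-list when it contains at least `min_cov` of the next `2 * min_cov` intervals; the parameter names, the defaults, and the exact rule are assumptions made for illustration, and the published implementation may differ in its details.

```python
def decompose(intervals, min_cov=2, max_components=10):
    """Split an interval list into sub-lists with short containment runs.

    Simplified, hypothetical rule: an interval is moved to the next
    component if it contains at least `min_cov` of the following
    `2 * min_cov` intervals in start-sorted order.
    """
    components = []
    current = sorted(intervals)
    while current:
        if len(components) == max_components - 1:
            components.append(current)  # last allowed component takes the rest
            break
        keep, moved = [], []
        for i, (s, e) in enumerate(current):
            window = current[i + 1 : i + 1 + 2 * min_cov]
            # Starts are sorted, so a later interval is contained iff its end <= e.
            covered = sum(1 for _, e2 in window if e2 <= e)
            (moved if covered >= min_cov else keep).append((s, e))
        components.append(keep)
        current = moved
    return components

intervals = [(0, 100), (1, 3), (4, 6), (7, 9), (10, 12), (50, 60)]
components = decompose(intervals, min_cov=2)
print(components)
```

Each component would then get its own `maxE` array and be searched independently, so the long interval `(0, 100)` no longer forces a scan through the short intervals it contains.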

Performance

  • How does the `maxE` minimum run length affect performance?
  • How does it compare to existing approaches?
  • How does it scale with increasing size of subject?

Datasets

How does the `maxE` minimum run length affect performance?

How does it compare to existing approaches?

How does it scale with increasing size of subject?

Conclusion

  • Augmented Interval Lists add the maximum running end value to a list of intervals
  • The data structure is simpler than other methods
  • AILists improve performance, particularly in highly contained interval sets

Integrated Genome Database (IGD)

A high-performance search engine
for large-scale genomic interval datasets.
Feng et al. (2021). Bioinformatics.

Jianglin Feng

Expanding the search space

An integrated data structure

GIGGLE

GIGGLE indexes many interval sets with a B+ tree.
Layer et al. (2018). Nature Methods.

IGD uses linear binning

  • The genome is divided into equal-size bins
  • Database intervals are placed in any bins they overlap
  • Intervals are sorted by start coordinate within a bin
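The three steps above can be sketched in a few lines. This is an in-memory illustration only: the record layout `(chrom, start, end, set_id)`, the half-open coordinate convention, and the function name are assumptions, and the real IGD stores its bins on disk.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # 2**14

def build_bins(intervals, bin_size=BIN_SIZE):
    """Place each interval in every bin it overlaps, then sort each bin.

    `intervals` is an iterable of (chrom, start, end, set_id) with
    half-open coordinates; `set_id` says which region set it came from.
    """
    bins = defaultdict(list)
    for chrom, start, end, set_id in intervals:
        first = start // bin_size
        last = (end - 1) // bin_size  # bin holding the last covered base
        for b in range(first, last + 1):
            bins[(chrom, b)].append((start, end, set_id))
    for records in bins.values():
        records.sort()  # sorted by start coordinate within a bin
    return bins

bins = build_bins([
    ("chr1", 100, 200, 0),
    ("chr1", 16_000, 17_000, 1),  # crosses the boundary at 16,384
    ("chr1", 50, 60, 1),
])
print(dict(bins))
```

The interval that crosses a bin boundary is stored twice, which is the duplication that the challenges on the following slides deal with.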

Advantages

  • Single-layer data structure has less overhead
  • Bins are independent

Challenges

  • Duplication = bigger database
  • Duplication = possible double-counting

Challenge 1: Database size

  • Adjustable with bin size
  • In practice: 5-20% bigger than raw, unduplicated data
  • Can be 2x or more if you have smaller bins than regions
  • Default bin size: 16,384 (2^14)

Challenge 2: Double-counting

Occurs only when both query and subject interval cross the same bin boundary.
Rule: If the query crosses the left boundary of the bin, then any region in the bin that also crosses the left boundary is skipped.
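A small sketch of the rule, assuming a single chromosome, half-open coordinates, and bins held in a plain dictionary keyed by bin index. The function name and the toy bin size of 100 are illustrative only.

```python
from collections import Counter

def count_overlaps(bins, q_start, q_end, bin_size):
    """Count overlaps per region set for the query [q_start, q_end).

    `bins` maps a bin index to a start-sorted list of (start, end, set_id).
    """
    counts = Counter()
    first = q_start // bin_size
    last = (q_end - 1) // bin_size
    for b in range(first, last + 1):
        left = b * bin_size
        query_crosses_left = q_start < left
        for start, end, set_id in bins.get(b, []):
            if start >= q_end:
                break  # sorted by start: nothing further can overlap
            if end <= q_start:
                continue  # ends before the query begins
            if query_crosses_left and start < left:
                continue  # already counted in the bin to the left
            counts[set_id] += 1
    return counts

# Toy database: (90, 110) and (150, 250) are each duplicated across two bins.
bins = {
    0: [(90, 110, "a")],
    1: [(90, 110, "a"), (120, 130, "b"), (150, 250, "a")],
    2: [(150, 250, "a")],
}
counts = count_overlaps(bins, 95, 205, bin_size=100)
print(counts)
```

The skip is safe because a query and a region that both cross the same left boundary necessarily overlap in the previous bin as well, so the pair has already been seen there.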

Question: Within a bin, how are overlaps calculated?

Can we use the AIList search algorithm?
Yes, but it doesn't help much because the bin size restricts the excess comparisons.

Performance

Conclusion

  • IGD computes overlaps between a query and database of indexed interval sets
  • IGD uses linear binning to index collections of region sets
  • Because bins are independent, IGD uses little memory, and could be parallelized
  • IGD reduces database size and increases performance

Regionset2vec

Embeddings of genomic region sets
in lower dimensions.
Gharavi et al. (2021). Bioinformatics.

Erfaneh Gharavi

What does it mean for two region sets (BED files) to be similar?
Overlap makes some sense... but what about:
  • degree of overlap?
  • weighting of specific regions?
  • biological similarity of regions?

The bag-of-words model for text classification


Zheng and Casari (2018), Feature Engineering for Machine Learning

The bag-of-intervals model for genomic intervals

Advantages
  • Vector representation of a region set
  • Similarity metrics among vectors
  • Space and time complexity
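A minimal sketch of a bag-of-intervals vector, assuming a fixed "universe" of reference regions that plays the role of the vocabulary. The universe, the binary encoding, and the choice of Jaccard similarity are assumptions made for this example.

```python
def to_vector(region_set, universe):
    """Binary vector: 1 where the region set overlaps a universe region."""
    return [
        int(any(c == uc and s < ue and us < e for c, s, e in region_set))
        for uc, us, ue in universe
    ]

def jaccard(u, v):
    """Shared positions divided by positions set in either vector."""
    both = sum(a & b for a, b in zip(u, v))
    either = sum(a | b for a, b in zip(u, v))
    return both / either if either else 0.0

universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr1", 200, 300), ("chr2", 0, 100)]
set_a = [("chr1", 50, 120)]
set_b = [("chr1", 150, 160), ("chr2", 10, 20)]

vec_a = to_vector(set_a, universe)
vec_b = to_vector(set_b, universe)
print(vec_a, vec_b, jaccard(vec_a, vec_b))
```

With a realistic universe of hundreds of thousands of regions, each vector is long and mostly zeros.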

Limitations of the bag of words vector approach

hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
  • Sparsity
  • Curse of dimensionality
  • No concept of relationships among words
  • Space and time complexity
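The "no concept of relationships" point can be checked directly with the two vectors above. The sketch uses cosine similarity, which is one common choice of metric and is an assumption here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

hotel = [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
motel = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

print(cosine(hotel, motel))  # 0.0: the vectors are orthogonal
print(cosine(hotel, hotel))  # 1.0
```

Any two distinct one-hot vectors are orthogonal, so "hotel" is exactly as far from "motel" as from any unrelated word.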

Decreasing space/time complexity

Genomic interval sets

High-dimensional vectors

Low-dimensional vectors

Word embeddings

http://suriyadeepan.github.io

Word2vec model


Mikolov et al. (2013). arXiv:1301.3781v3.

Word context

You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
Image credit: Shubham Agarwal
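To make "context" concrete, here is a sketch of how (target, context) training pairs can be generated with a sliding window, as in skip-gram style models. The function name and the window size are illustrative; how regions are ordered into "sentences" for genomic intervals is not specified on these slides.

```python
def context_pairs(tokens, window=2):
    """Return (target, context) pairs for every token and its neighbours."""
    pairs = []
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

sentence = "you shall know a word".split()
pairs = context_pairs(sentence, window=1)
for target, context in pairs:
    print(target, "->", context)
```

For genomic intervals the tokens would be region identifiers instead of words, so regions that tend to co-occur would end up with similar vectors.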

Genomic Interval Embeddings

Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?

Evaluation 1: Classification performance

Evaluation 2: Perturbation similarity detection

Evaluation 3: Peak threshold robustness

Conclusion

  • Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
  • Regionset2vec embeddings capture more expected biological annotations
  • Regionset2vec reflects known simulated perturbations
  • Regionset2vec is robust to missing data
  • NLP approaches can be adapted for applications in genomic interval analysis

BEDbase

A high-performance server and API
for genomic interval data.


Michal Stolarczyk

Jose Verdezoto

Bingjie Xue

BEDbase goals

  • Human browsing of statistical and biological attributes
  • Human-friendly search
  • Programmatic API for 1) metadata 2) statistics 3) data chunks
  • Data spans projects (*e.g.* all data on GEO)

BEDbase architecture

Human browsing of BED file splash pages

http://dev1.bedbase.org/bedsplash/78c0e4753d04b238fc07e4ebe5a02984

BEDsets allow comparison of BED files

http://dev1.bedbase.org/bedsetsplash/48a1a8c1476fecb1961894f81d1afadd

Human-friendly search

Co-embedded metadata and region sets: http://dev1.bedbase.org/

Programmatic API to all metadata

  • OpenAPI: http://dev1.bedbase.org/docs
  • All metadata: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/data
  • Promoter frequency: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/data?ids=promotercore_percentage
  • Genome: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/data?ids=genome
  • Number of regions: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/data?ids=regions_no

Programmatic API to data chunks

You can use the API to extract entries in a defined region:
http://dev1.bedbase.org/api/bed/a8f498b373d3e3fd85880754c01873bb/regions/chr1?start=1000000&end=1217614
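The URL patterns above can be assembled programmatically. The sketch below only builds the URLs and makes no request; the helper names are hypothetical, the paths are copied from the example links, and the development server and its response format may have changed since the talk.

```python
from urllib.parse import urlencode

# Base path taken from the example links; dev1 is a development server.
BASE_URL = "http://dev1.bedbase.org/api/bed"

def metadata_url(digest, ids=None):
    """URL for all metadata of a BED file, or for one attribute via `ids`."""
    url = f"{BASE_URL}/{digest}/data"
    return f"{url}?{urlencode({'ids': ids})}" if ids else url

def regions_url(digest, chrom, start, end):
    """URL for the entries of a BED file that fall in a chromosome range."""
    query = urlencode({"start": start, "end": end})
    return f"{BASE_URL}/{digest}/regions/{chrom}?{query}"

print(metadata_url("78c0e4753d04b238fc07e4ebe5a02984", ids="genome"))
print(regions_url("a8f498b373d3e3fd85880754c01873bb", "chr1", 1_000_000, 1_217_614))
```

The resulting strings can be passed to any HTTP client, for example `urllib.request.urlopen`, if the server is reachable.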

Conclusion

  • BEDbase provides both human and machine interfaces to BED data
  • Statistical and biological visualization
  • Human-friendly search
  • Programmatic access to data chunks

Thank You

Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Vince Carey
Mikhail Dozmorov

Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy

Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi

Funding: NIGMS R35-GM128636

nsheff · databio.org · nsheffield@virginia.edu