Recent advances in genomic interval analysis

Nathan Sheffield, PhD

www.databio.org/slides

Outline

Augmented Interval Lists

Regionset2vec

Integrated Genome Database

BEDbase

◁ Questions ▷

Augmented Interval List (AIList)

A novel data structure for efficiently computing overlaps
across genomic interval data.

http://ailist.databio.org/

Feng et al. (2020). Bioinformatics.

Jianglin Feng

If subject list has no containment,
identifying overlaps is fast

Binary search on interval start coordinates, followed by backward steps.
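
A minimal sketch of this search, assuming half-open intervals held in two parallel lists sorted by start. The function name and list layout are illustrative, not taken from the AIList code.

```python
from bisect import bisect_left


def overlaps_no_containment(starts, ends, q_start, q_end):
    """Return indices of intervals overlapping the query [q_start, q_end).

    Assumes intervals are sorted by start and none contains another,
    which means the end coordinates are sorted as well.
    """
    # Last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Step backward. Ends are sorted too, so the first miss ends the scan.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]


starts = [1, 5, 10, 20]
ends = [4, 8, 15, 25]
print(overlaps_no_containment(starts, ends, 6, 12))
```

The early stop is only safe without containment. With `starts = [1, 2, 10]` and `ends = [100, 3, 15]`, a query of `[50, 60)` stops at the third interval and never reaches the long first one.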

The problem arises with contained interval overlaps

How can we improve efficiency
without guaranteeing no containment?

Many approaches to solve the 'containment' issue:

- Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
- R-trees (bedtools) [@Kent2002; @Quinlan2010]
- Augmented interval trees [@Cormen2001]

These methods try to structure the data to provide non-containment guarantees.

Methods provide non-containment guarantees

R-trees

Annotates tree nodes with a minimum bounding rectangle of elements. A query that does not intersect the bounding rectangle will not intersect any child element.

Nested Containment Lists

Augmented Interval List

1. Augment the list with the running maximum *end* value. *solves the problem for lowly-contained lists*
2. Decompose the list to minimize containment. *extends the solution to highly-contained lists*

Augment with the running maximum end value, `maxE`

Provides a local guarantee of no containment.
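
A minimal sketch of how `maxE` can drive the backward scan, assuming half-open intervals in parallel lists sorted by start. The function names are illustrative, not the AIList API.

```python
from bisect import bisect_left
from itertools import accumulate


def build_max_e(ends):
    """Running maximum of end values: max_e[i] = max(ends[0..i])."""
    return list(accumulate(ends, max))


def query_with_max_e(starts, ends, max_e, q_start, q_end):
    """Return indices of intervals overlapping [q_start, q_end)."""
    # Last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Once max_e[i] <= q_start, no interval at or before i can reach the query.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append(i)
        i -= 1
    return hits[::-1]


starts = [1, 2, 10]
ends = [100, 3, 15]
max_e = build_max_e(ends)
print(query_with_max_e(starts, ends, max_e, 50, 60))
```

The stop condition tests `max_e` rather than the individual end, so a long interval early in the list is still found. The cost is that every interval in a run of constant `maxE` gets compared.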

AIList works on contained lists

But long containment runs are problematic

Decompose long runs with constant `maxE`
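
A simplified sketch of one way to decompose a list. The rule used here (move an interval to the next sublist when it contains at least `min_cov` of the following `lookahead` intervals) and both parameter names are illustrative assumptions, not necessarily the published AIList rule.

```python
def decompose(intervals, min_cov=2, lookahead=4):
    """Split (start, end) intervals into sublists with short containment runs.

    Assumes half-open intervals and min_cov >= 1. Each returned sublist is
    sorted by start and can be augmented with its own running maximum end.
    """
    components = []
    current = sorted(intervals)
    while current:
        keep, moved = [], []
        for i, (start, end) in enumerate(current):
            window = current[i + 1 : i + 1 + lookahead]
            # Later intervals start at or after `start`, so end2 <= end means contained.
            covered = sum(1 for _, end2 in window if end2 <= end)
            (moved if covered >= min_cov else keep).append((start, end))
        components.append(keep)
        # The last interval always stays in `keep`, so `current` shrinks every pass.
        current = moved
    return components


parts = decompose([(0, 100), (1, 2), (3, 4), (5, 6), (7, 8)])
print(parts)
```

A query then searches each sublist separately and merges the hits. Long intervals end up in their own short sublist, so they no longer force long backward scans in the main one.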

Performance

How does the `maxE` minimum run length affect performance?
How does it compare to existing approaches?
How does it scale with increasing size of subject?

Datasets

How does the `maxE` minimum run length affect performance?

How does it compare to existing approaches?

How does it scale with increasing size of subject?

Conclusion

Augmented Interval Lists add the running maximum end value to a list of intervals
The data structure is simpler than other methods
AILists improve performance, particularly in highly contained interval sets

Integrated Genome Database (IGD)

A high-performance search engine
for large-scale genomic interval datasets.

https://github.com/databio/IGD

Feng et al. (2021). Bioinformatics.

Jianglin Feng

Expanding the search space

An integrated data structure

GIGGLE

GIGGLE indexes many interval sets with a B+ tree.

Layer et al. (2018). Nature Methods.

IGD uses linear binning

The genome is divided into equal-size bins
Database intervals are placed in any bins they overlap
Intervals are sorted by start coordinate within a bin
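
The three steps above can be sketched as follows. This is an in-memory illustration under assumed names and a made-up record layout `(start, end, set_id)`, not the IGD on-disk format.

```python
from collections import defaultdict


def build_bins(intervals, bin_size):
    """Index half-open intervals (chrom, start, end, set_id) by linear bins."""
    bins = defaultdict(list)
    for chrom, start, end, set_id in intervals:
        first_bin = start // bin_size
        last_bin = (end - 1) // bin_size
        # An interval is copied into every bin it overlaps.
        for b in range(first_bin, last_bin + 1):
            bins[(chrom, b)].append((start, end, set_id))
    # Within a bin, records are sorted by start coordinate.
    for records in bins.values():
        records.sort()
    return bins


bins = build_bins(
    [("chr1", 120, 130, 1), ("chr1", 50, 250, 0)],
    bin_size=100,
)
for key in sorted(bins):
    print(key, bins[key])
```

Each bin is a self-contained sorted list, so a query only needs to load the bins it touches.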

Advantages

Single-layer data structure has less overhead
Bins are independent

Challenges

Duplication = bigger database
Duplication = potential for double-counting

Challenge 1: Database size

Adjustable with bin size
In practice: 5-20% bigger than raw, unduplicated data
Can be 2x or more if you have smaller bins than regions
Default bin size: 16,384 (2¹⁴)
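
A back-of-envelope sketch of why bin size controls duplication. It assumes half-open intervals of a fixed length whose start offset is uniform within a bin; this is a simplifying assumption for illustration, not a measurement from IGD.

```python
def bins_touched(start, end, bin_size):
    """Number of bins overlapped by the half-open interval [start, end)."""
    return (end - 1) // bin_size - start // bin_size + 1


def expected_copies(length, bin_size):
    """Average copies stored per interval under the uniform-offset assumption."""
    return 1 + (length - 1) / bin_size


for length in (1_000, 16_384, 50_000):
    print(length, round(expected_copies(length, 16_384), 3))
```

Under this model, a 1,000 bp region with the default 16,384 bp bins is stored about 1.06 times on average, while a region as long as a bin is stored about twice.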

Challenge 2: Double-counting

Occurs only when both query and subject interval cross the same bin boundary.

Rule: If the query crosses the left boundary of the bin, then any region in the bin that also crosses the left boundary will be skipped.
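
A minimal sketch of the skip rule on a toy in-memory index. The helper names and the dictionary layout are assumptions made for illustration, not IGD's implementation.

```python
from collections import defaultdict


def build_bins(intervals, bin_size):
    """Copy each half-open (chrom, start, end) interval into every bin it overlaps."""
    bins = defaultdict(list)
    for chrom, start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[(chrom, b)].append((start, end))
    return bins


def count_overlaps(bins, chrom, q_start, q_end, bin_size):
    """Count database intervals overlapping [q_start, q_end), each exactly once."""
    count = 0
    for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
        left_edge = b * bin_size
        query_crosses_left = q_start < left_edge
        for start, end in bins.get((chrom, b), []):
            # Both cross the left boundary: this pair also meets in the
            # previous bin, so it is skipped here.
            if query_crosses_left and start < left_edge:
                continue
            if start < q_end and end > q_start:
                count += 1
    return count


db = build_bins(
    [("chr1", 50, 250), ("chr1", 120, 130), ("chr1", 300, 310)],
    bin_size=100,
)
print(count_overlaps(db, "chr1", 80, 220, bin_size=100))
```

The interval `(50, 250)` sits in three bins that the query also touches, but it is counted only in the first one. No deduplication set is needed, which keeps the bins independent.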

Question: Within a bin, how are overlaps calculated?

Can we use the AIList search algorithm?

Yes, but it doesn't help much because the bin size restricts the excess comparisons.

Performance

Conclusion

IGD computes overlaps between a query and database of indexed interval sets
IGD uses linear binning to index collections of region sets
Because bins are independent, IGD uses little memory, and could be parallelized
IGD reduces database size and increases performance

Region-set 2 Vec

Embeddings of genomic region sets
in lower dimensions.

https://github.com/databio/regionset-embedding

Gharavi et al. (2021). Bioinformatics.

Erfaneh Gharavi

Word embeddings

http://suriyadeepan.github.io

Word2vec model

Mikolov et al. (2013). arXiv:1301.3781v3.

Word context

You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.

Image credit: Shubham Agarwal

Genomic context

A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
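
A rough sketch of the analogy, assuming each region is turned into a token and each BED file into a "sentence". Training the word2vec model itself is omitted. The token format, function names, and the idea of averaging region vectors into a file vector are illustrative assumptions.

```python
import random


def bed_to_sentence(regions, seed=0):
    """Turn one BED file's (chrom, start, end) regions into a list of tokens.

    Tokens are shuffled because the order of lines in a BED file reflects
    genomic position, not context in the word2vec sense.
    """
    tokens = [f"{chrom}:{start}-{end}" for chrom, start, end in regions]
    random.Random(seed).shuffle(tokens)
    return tokens


def mean_vector(tokens, region_vectors):
    """Average the vectors of the known tokens to represent the whole file."""
    vecs = [region_vectors[t] for t in tokens if t in region_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


sentence = bed_to_sentence([("chr1", 100, 200), ("chr1", 500, 650)])
# Hypothetical 2-dimensional vectors standing in for trained embeddings.
toy_vectors = {"chr1:100-200": [1.0, 0.0], "chr1:500-650": [3.0, 2.0]}
print(mean_vector(sentence, toy_vectors))
```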

Genomic Interval Embeddings

Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?

Evaluation 1: Classification performance

Conclusion

Regionset2vec adapts word2vec to learn genomic region embeddings
Regionset2vec embeddings capture biological information
NLP approaches can be adapted for applications in genomic interval analysis

BEDbase

A high-performance server and API
for genomic interval data.

bedbase.org

Michal Stolarczyk

Jose Verdezoto

Bingjie Xue

Oleksandr Khoroshevskyi

BEDbase goals

- Human browsing of statistical and biological attributes
- Human-friendly, *intelligent* search
- Programmatic API for metadata, statistics, and data chunks
- Integrative analytical results
- Data spans projects (all data on GEO)

BEDbase architecture

BEDbase is a microservice for data interoperability,
not another cloud platform

Web interface (front-end)
Clients (front-end)
API (back-end)
Database and files (back-end/infrastructure)
Processing pipelines (infrastructure)
Data served (content)

Human browsing of BED file splash pages

https://bedbase.org/bed/bd2578e70c0efe3674d0d39c782fe9e1

Reference genome compatibility

Donald Campbell

BEDsets allow comparison of BED files

https://bedbase.org/bedset/gse246900

Human-friendly search

Co-embedded metadata and region sets: https://bedbase.org/search?q=brain

Search by BED file

Co-embedded metadata and region sets: https://bedbase.org/search?view=b2b

BEDbase R client

```
library("bedbaser")
bedbase <- BEDbase(tempdir())
bb_to_granges(bedbase, "ab446df9a043222067863cfd536ee8e0")
```

```
GRanges object with 37 ranges and 5 metadata columns:
       seqnames            ranges strand |                   name     score
          [Rle]         [IRanges]  [Rle] |            [character] [integer]
   [1]    chr17 38083473-38083801      * | O-8A-H3K27ac_peak_11..        21
   [2]    chr17 38108871-38110066      * | O-8A-H3K27ac_peak_11..        25
   [3]    chr17 38137142-38137795      * | O-8A-H3K27ac_peak_11..        33
   [4]    chr17 38210828-38211063      * | O-8A-H3K27ac_peak_11..        17
   [5]    chr17 38218030-38220186      * | O-8A-H3K27ac_peak_11..        19
   ...      ...               ...    ... .                    ...       ...
  [33]    chr17 38603620-38604355      * | O-8A-H3K27ac_peak_11..        15
  [34]    chr17 38647047-38648053      * | O-8A-H3K27ac_peak_11..        28
  [35]    chr17 38708445-38710424      * | O-8A-H3K27ac_peak_11..        20
  [36]    chr17 38716283-38717201      * | O-8A-H3K27ac_peak_11..        23
  [37]    chr17 38803702-38804538      * | O-8A-H3K27ac_peak_11..        46
            field8      field9     field10
       [character] [character] [character]
   [1]     3.91024     4.82245     2.12195
   [2]     4.44410     5.31600     2.51785
   [3]     4.30183     6.25865     3.33754
   [4]     3.94862     4.34253     1.75554
   [5]     3.74929     4.57115     1.93712
   ...         ...         ...         ...
  [33]     3.80116     4.07741     1.55623
  [34]     4.35182     5.72967     2.89524
  [35]     3.88836     4.73101     2.06784
  [36]     4.24210     5.08399     2.33889
  [37]     5.09106     7.93609     4.68279
  -------
```

BEDbase Python client

bedhost

A FastAPI application following JAMstack philosophy.

JAMstack forces you to build a comprehensive API.

OpenAPI interface

https://api.bedbase.org/v1/docs

BED info via API

https://api.bedbase.org/v1/bed/bd2578e70c0efe3674d0d39c782fe9e1/metadata?full=true
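
A small sketch of calling this endpoint from Python with only the standard library. The URL pattern is copied from the slide. The helper names are illustrative, and the shape of the returned JSON is not assumed here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_BASE = "https://api.bedbase.org/v1"


def metadata_url(bed_id, full=True):
    """Build the metadata URL for one BED file identifier."""
    query = urlencode({"full": str(full).lower()})
    return f"{API_BASE}/bed/{bed_id}/metadata?{query}"


def fetch_metadata(bed_id, full=True, timeout=30):
    """Fetch and decode the JSON metadata (requires network access)."""
    with urlopen(metadata_url(bed_id, full), timeout=timeout) as response:
        return json.load(response)


print(metadata_url("bd2578e70c0efe3674d0d39c782fe9e1"))
```

Calling `fetch_metadata("bd2578e70c0efe3674d0d39c782fe9e1")` would return the decoded response. Inspect its keys before relying on any particular field.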

BEDbase data layer

1. BED files stored in Backblaze B2 (S3-compatible object store)
   - BED files n=21,438 (stats from 2025-05)
   - 346,071 total objects, 186.4 GB ($6/TB/month)
2. B2 interface is routed through the Cloudflare CDN (free egress!)
3. File metadata stored in a PostgreSQL database on AWS managed Relational Database Service

→ Minimal maintenance cost

- `bbconf`: BEDbase configuration object, connection to database
- `bedqc`: a pipeline for QC of BED files
- `bedmaker`: a pipeline to convert non-BED files into BED files
- `bedstat`: a pipeline to calculate stats for a BED file
- `bedbuncher`: a pipeline to create BEDsets
- `bedembed`: a pipeline to create BED file embeddings

Docs: http://code.databio.org/GenomicDistributions/
Code: http://github.com/databio/GenomicDistributions/

bioconductor.org/packages/GenomicDistributions

Kristyna Kupkova

Jose Verdezoto

Kupkova et al. (2022). BMC Genomics.

Connects the Gene Expression Omnibus (GEO)
and Sequence Read Archive (SRA)
with PEP format

Oleksandr Khoroshevskyi

geofetch.databio.org

```
geofetch --filter="bed|bigBed|narrowPeak|broadPeak"
```

Conclusion

BEDbase provides BED data for humans and machines
Output includes statistical and biological visualization
Upcoming human-friendly search is powerful
Programmatic access to data chunks improves interoperability

Thank You

Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Vince Carey
Mikhail Dozmorov

Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy

Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi

Funding:

NIGMS R35-GM128636

nsheff ·

databio.org ·

nsheffield@virginia.edu
