Outline
Augmented Interval Lists
Regionset2vec
Integrated Genome Database
BEDbase
If the subject list has no containment,
identifying overlaps is fast:
binary search on interval starts, followed by backward steps.
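A minimal sketch of this search, assuming half-open `(start, end)` tuples sorted by start with no interval contained in another. The function name is illustrative and not taken from any library.

```python
from bisect import bisect_left

def overlaps_no_containment(intervals, q_start, q_end):
    """Return the intervals that overlap the half-open query [q_start, q_end).

    Assumes `intervals` is sorted by start and has no containment, so the
    end coordinates are sorted as well.
    """
    # In practice the start array would be built once, not per query.
    starts = [s for s, _ in intervals]
    # Last interval that begins before the query ends; later ones cannot overlap.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Step backward. Because ends are sorted, the first interval that ends
    # at or before q_start terminates the scan.
    while i >= 0 and intervals[i][1] > q_start:
        hits.append(intervals[i])
        i -= 1
    hits.reverse()
    return hits

subject = [(1, 5), (3, 8), (6, 10), (12, 15)]
result = overlaps_no_containment(subject, 7, 13)
```

The early stop is only valid without containment: a long interval earlier in the list could still reach the query after a short one has already ended.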
The problem arises with contained interval overlaps
How can we improve efficiency
without guaranteeing no containment?
Many approaches to solve the 'containment' issue:
- Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
- R-trees (bedtools) [@Kent2002; @Quinlan2010], Augmented interval trees [@Cormen2001]
These methods try to structure the data
to provide non-containment guarantees
Methods provide non-containment guarantees
R-trees
Annotates tree nodes with a minimum bounding rectangle of elements. A query that does not intersect the bounding rectangle will not intersect any child element.
Nested Containment Lists
Augmented Interval List
1. Augment the list with the running maximum *end* value. *solves the problem for lists with little containment*
2. Decompose the list to minimize containment. *extends the solution to highly contained lists*
Augment with the running maximum end value, `maxE`
Provides a local guarantee of no containment.
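A rough sketch of the augmentation and the search it enables, assuming half-open `(start, end)` tuples. The names `augment` and `query` are illustrative, not the AIList implementation.

```python
from bisect import bisect_left
from itertools import accumulate

def augment(intervals):
    """Sort by start and compute maxE, the running maximum end value."""
    ivs = sorted(intervals)
    max_e = list(accumulate((e for _, e in ivs), max))
    return ivs, max_e

def query(ivs, max_e, q_start, q_end):
    """Return intervals overlapping [q_start, q_end) using maxE to stop early."""
    starts = [s for s, _ in ivs]  # would be precomputed in practice
    i = bisect_left(starts, q_end) - 1
    hits = []
    # If max_e[i] <= q_start, no interval at or before i reaches the query.
    while i >= 0 and max_e[i] > q_start:
        if ivs[i][1] > q_start:
            hits.append(ivs[i])
        i -= 1
    hits.reverse()
    return hits

ivs, max_e = augment([(1, 4), (2, 9), (3, 5), (10, 14), (11, 13), (16, 20)])
hits = query(ivs, max_e, 12, 17)
```

Here the scan stops at `(3, 5)` because `maxE` there is 9, which is below the query start, even though `(11, 13)` is contained in `(10, 14)`.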
AIList works on contained lists
But long containment runs are problematic
Decompose long runs with constant `maxE`
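One way to picture the decomposition, as a simplified sketch: an interval is moved to a separate list when it contains at least `min_cov_len` of the intervals that follow it in a short look-ahead window. The rule, the window size, and the parameter name are assumptions for illustration and may differ from the actual AIList code.

```python
from itertools import accumulate

def decompose(intervals, min_cov_len=3):
    """Split intervals into (main, extracted) lists.

    An interval is extracted when it contains at least `min_cov_len` of the
    next 2 * min_cov_len intervals (simplified, assumed rule).
    """
    ivs = sorted(intervals)
    main, extracted = [], []
    for i, (s, e) in enumerate(ivs):
        window = ivs[i + 1 : i + 1 + 2 * min_cov_len]
        # Sorted by start, so a later interval is contained when its end <= e.
        covered = sum(1 for _, e2 in window if e2 <= e)
        (extracted if covered >= min_cov_len else main).append((s, e))
    return main, extracted

def running_max_end(ivs):
    """maxE for a list already sorted by start."""
    return list(accumulate((e for _, e in ivs), max))

subject = [(0, 100), (5, 10), (12, 18), (20, 25), (30, 35), (40, 45)]
main, extracted = decompose(subject)
before = running_max_end(sorted(subject))
after = running_max_end(main)
```

Before decomposition `maxE` is constant at 100, so every backward scan runs to the start of the list. After the long interval is extracted, `maxE` in the main list tracks the local ends again. Each list is then searched separately, and the same step could be applied again to the extracted list.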
Performance
- How does the `maxE` minimum run length affect performance?
- How does it compare to existing approaches?
- How does it scale with increasing size of subject?
Datasets
How does the `maxE` minimum run length affect performance?
How does it compare to existing approaches?
How does it scale with increasing size of subject?
Conclusion
- Augmented Interval Lists add the maximum running end value to a list of intervals
- The data structure is simpler than other methods
- AILists improve performance, particularly in highly contained interval sets
Expanding the search space
An integrated data structure
IGD uses linear binning
- The genome is divided into equal-size bins
- Database intervals are placed in any bins they overlap
- Intervals are sorted by start coordinate within a bin
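The binning steps above can be sketched as follows. This is an illustration only: it ignores chromosomes and source files, assumes half-open coordinates, and uses made-up function names. The bin size is the default quoted in this talk.

```python
from collections import defaultdict

BIN_SIZE = 16_384  # default bin size quoted in the talk

def bins_for(start, end, bin_size=BIN_SIZE):
    """Indices of every bin overlapped by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)

def build_index(intervals, bin_size=BIN_SIZE):
    """Place each interval in every bin it overlaps, sorted by start per bin."""
    index = defaultdict(list)
    for start, end in intervals:
        for b in bins_for(start, end, bin_size):
            index[b].append((start, end))
    for members in index.values():
        members.sort()
    return dict(index)

index = build_index([(16_000, 17_000), (100, 200), (40_000, 40_100)])
```

The interval `(16_000, 17_000)` crosses the boundary at 16,384, so it is stored in both bin 0 and bin 1.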
Advantages
- Single-layer data structure has less overhead
- Bins are independent
Challenges
- Duplication = bigger database
- Duplication = potential for double-counting
Challenge 1: Database size
- Adjustable with bin size
- In practice: 5-20% bigger than raw, unduplicated data
- Can be 2x or more if you have smaller bins than regions
- Default bin size: 16,384 (2^14)
Challenge 2: Double-counting
Occurs only when both query and subject interval cross the same
bin boundary.
Rule: If the query crosses the left boundary of a bin, then any region in that bin that also crosses the left boundary is skipped, because the overlap was already counted in the previous bin.
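A small sketch of the skip rule, assuming a bin index stored as a dict from bin number to a list of half-open `(start, end)` intervals. The layout and the function name are hypothetical and are not IGD's on-disk format.

```python
def count_overlaps(index, q_start, q_end, bin_size):
    """Count regions overlapping [q_start, q_end) without double-counting."""
    first_bin = q_start // bin_size
    last_bin = (q_end - 1) // bin_size
    count = 0
    for b in range(first_bin, last_bin + 1):
        left_edge = b * bin_size
        query_crosses_left = q_start < left_edge
        for start, end in index.get(b, []):
            if start >= q_end or end <= q_start:
                continue  # no overlap
            if query_crosses_left and start < left_edge:
                continue  # both cross this boundary: counted in an earlier bin
            count += 1
    return count

# Toy index with bin size 10; (8, 13) is duplicated in bins 0 and 1.
toy_index = {0: [(2, 4), (8, 13)], 1: [(8, 13), (15, 18)]}
n = count_overlaps(toy_index, 7, 16, bin_size=10)
```

Without the rule, `(8, 13)` would be counted once in each bin. With it, the copy in bin 1 is skipped because both the query and the region extend left of position 10.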
Question: Within a bin, how are overlaps calculated?
Can we use the AIList search algorithm?
Yes, but it doesn't help much because the bin size restricts the excess comparisons.
Performance
Conclusion
- IGD computes overlaps between a query and database of indexed interval sets
- IGD uses linear binning to index collections of region sets
- Because bins are independent, IGD uses little memory, and could be parallelized
- IGD reduces database size and increases performance
Word embeddings
http://suriyadeepan.github.io
Word2vec model
Word context
You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
Image credit: Shubham Agarwal
Genomic context
A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
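To make the analogy concrete, here is a hedged sketch that treats each BED file as a "sentence" whose "words" are region tokens, then lists the (target, context) pairs a skip-gram style model would train on. The token format, the shuffling step, and the window size are assumptions for illustration and are not taken from the Regionset2vec code.

```python
import random

def region_token(chrom, start, end):
    """Turn a region into a word-like token (assumed format)."""
    return f"{chrom}_{start}_{end}"

def bed_to_sentence(regions, rng):
    """One BED file becomes one sentence of tokens.

    Regions are shuffled on the assumption that co-occurrence in the same
    file, not genomic order, defines context.
    """
    tokens = [region_token(*r) for r in regions]
    rng.shuffle(tokens)
    return tokens

def context_pairs(sentence, window=2):
    """(target, context) pairs within `window` positions of each token."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((target, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

bed_file = [("chr1", 100, 200), ("chr1", 500, 650), ("chr2", 40, 90)]
sentence = bed_to_sentence(bed_file, random.Random(0))
pairs = context_pairs(sentence)
```

Regions that keep appearing in the same files end up in many shared pairs, which is what pulls their vectors together during training.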
Genomic Interval Embeddings
Evaluation
We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?
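A common way to compare such vectors is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; they are not real embeddings or results.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for region-set embeddings.
set_a = [1.0, 2.0, 0.0]
set_b = [2.0, 4.0, 0.0]   # same direction as set_a
set_c = [0.0, 0.0, 3.0]   # orthogonal to set_a

sim_ab = cosine_similarity(set_a, set_b)
sim_ac = cosine_similarity(set_a, set_c)
```

If the embeddings reflect biology, region sets from similar experiments should score closer to 1 than unrelated ones.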
Evaluation 1: Classification performance
Evaluation 2: Perturbation similarity detection
Evaluation 3: Peak threshold robustness
Conclusion
- Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
- Regionset2vec embeddings capture expected biological annotations
- Regionset2vec reflects known simulated perturbations
- Regionset2vec is robust to missing data
- NLP approaches can be adapted for applications in genomic interval analysis
### BEDbase goals
- Human browsing of statistical and biological attributes
- Human-friendly, *intelligent* search
- Programmatic API for metadata, statistics, and data chunks
- Integrative analytical results
- Data spans projects (all data on GEO)
BEDbase architecture
BEDbase is a microservice for data interoperability,
not another cloud platform
- Web interface (front-end)
- Clients (front-end)
- API (back-end)
- Database and files (back-end/infrastructure)
- Processing pipelines (infrastructure)
- Data served (content)
```
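# Load the BEDbase R client
library("bedbaseRClient")
# Query one BED file by its identifier, restricted to a window on chr17
query_bb("ab446df9a043222067863cfd536ee8e0",
         which = GenomicRanges::GRanges("chr17:38000000-39000000"))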
```
```
GRanges object with 37 ranges and 5 metadata columns:
seqnames ranges strand | name score
[Rle] [IRanges] [Rle] | [character] [integer]
[1] chr17 38083473-38083801 * | O-8A-H3K27ac_peak_11.. 21
[2] chr17 38108871-38110066 * | O-8A-H3K27ac_peak_11.. 25
[3] chr17 38137142-38137795 * | O-8A-H3K27ac_peak_11.. 33
[4] chr17 38210828-38211063 * | O-8A-H3K27ac_peak_11.. 17
[5] chr17 38218030-38220186 * | O-8A-H3K27ac_peak_11.. 19
... ... ... ... . ... ...
[33] chr17 38603620-38604355 * | O-8A-H3K27ac_peak_11.. 15
[34] chr17 38647047-38648053 * | O-8A-H3K27ac_peak_11.. 28
[35] chr17 38708445-38710424 * | O-8A-H3K27ac_peak_11.. 20
[36] chr17 38716283-38717201 * | O-8A-H3K27ac_peak_11.. 23
[37] chr17 38803702-38804538 * | O-8A-H3K27ac_peak_11.. 46
field8 field9 field10
[character] [character] [character]
[1] 3.91024 4.82245 2.12195
[2] 4.44410 5.31600 2.51785
[3] 4.30183 6.25865 3.33754
[4] 3.94862 4.34253 1.75554
[5] 3.74929 4.57115 1.93712
... ... ... ...
[33] 3.80116 4.07741 1.55623
[34] 4.35182 5.72967 2.89524
[35] 3.88836 4.73101 2.06784
[36] 4.24210 5.08399 2.33889
[37] 5.09106 7.93609 4.68279
-------
```
bedhost
A FastAPI application following JAMstack philosophy.
JAMstack forces you to build a comprehensive API.
OpenAPI docs
### Programmatic API to all metadata
- All metadata: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata
- Promoter frequency: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata?ids=promotercore_percentage
- Genome: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata?ids=genome
- Number of regions: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata?ids=regions_no
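The URL pattern in these examples can be wrapped in a small helper. This is a sketch: the function names are illustrative, the response is assumed to be JSON, and the development server may not be reachable.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://dev1.bedbase.org/api"  # host taken from the examples above

def metadata_url(bed_id, ids=None):
    """Build the metadata URL for a BED file, optionally for one attribute."""
    url = f"{BASE_URL}/bed/{bed_id}/metadata"
    if ids:
        url += "?" + urlencode({"ids": ids})
    return url

def fetch_json(url, timeout=10):
    """Fetch and decode a response, assuming the endpoint returns JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

bed_id = "78c0e4753d04b238fc07e4ebe5a02984"
genome_url = metadata_url(bed_id, "genome")
all_url = metadata_url(bed_id)
# fetch_json(genome_url) would perform the request; it is not called here.
```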
### Programmatic API to data chunks
You can use the API to extract entries in a defined region
http://dev1.bedbase.org/api/bed/a8f498b373d3e3fd85880754c01873bb/regions/chr1?start=1000000&end=1217614
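A sketch of building such a region query programmatically. The helper name is hypothetical, and only the URL pattern shown above is assumed.

```python
from urllib.parse import urlencode

def regions_url(bed_id, chrom, start=None, end=None,
                base="http://dev1.bedbase.org/api"):
    """Build a URL for the entries of one BED file within a genomic window."""
    url = f"{base}/bed/{bed_id}/regions/{chrom}"
    # Only include the bounds that were actually given.
    params = {k: v for k, v in (("start", start), ("end", end)) if v is not None}
    if params:
        url += "?" + urlencode(params)
    return url

chunk_url = regions_url("a8f498b373d3e3fd85880754c01873bb", "chr1",
                        start=1_000_000, end=1_217_614)
```

Because the request names a chromosome and a coordinate window, a client can pull a small slice instead of downloading the whole file.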
## BEDbase data layer:
1. S3 for BED files (stats from 2022-09)
- BED files n=65,137
- 6.78 TB
- 2.39 million total objects
2. PostgreSQL database on the AWS-managed Relational Database Service (RDS) for file metadata
- `bbconf`: BEDbase configuration object; manages the connection to the database
- `bedqc`: a pipeline for quality control (QC) of BED files
- `bedmaker`: a pipeline to convert non-BED files into BED files
- `bedstat`: a pipeline to calculate statistics for a BED file
- `bedbuncher`: a pipeline to create bedsets
- `bedembed`: a pipeline to create BED file embeddings
geofetch connects the Gene Expression Omnibus (GEO)
and Sequence Read Archive (SRA)
with PEP format
Oleksandr Khoroshevskyi
geofetch.databio.org
```
geofetch --filter="bed|bigBed|narrowPeak|broadPeak"
```
Conclusion
- BEDbase provides BED data for humans and machines
- Output includes statistical and biological visualization
- Upcoming human-friendly search is powerful
- Programmatic access to data chunks improves interoperability
Thank You
Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Vince Carey
Mikhail Dozmorov
Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy
Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi
nsheff · databio.org · nsheffield@virginia.edu