Augmented Interval Lists
Integrated Genome Database
If the subject list has no containment,
identifying overlaps is fast:
binary search on start coordinates, followed by backward steps:
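The search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes half-open `[start, end)` intervals on a single chromosome, stored as two parallel lists sorted by start, and the function name `find_overlaps` is hypothetical.

```python
from bisect import bisect_left

def find_overlaps(starts, ends, q_start, q_end):
    """Return indices of intervals overlapping [q_start, q_end).

    Assumes the list is sorted by start and has no containment,
    so the end coordinates are sorted as well.
    """
    # Intervals starting at or after q_end cannot overlap the query,
    # so begin at the last interval that starts before q_end.
    i = bisect_left(starts, q_end) - 1
    hits = []
    # Step backward; without containment, the first interval that ends
    # at or before q_start proves that all earlier ones do too.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]

starts = [1, 5, 10, 15]
ends = [4, 8, 13, 20]
print(find_overlaps(starts, ends, 6, 12))  # [1, 2]
```

The early stop in the backward loop is only valid because no interval contains another.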
The problem arises with contained interval overlaps
How can we improve efficiency
without guaranteeing no containment?
Many approaches to solve the 'containment' issue:
- Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
- R-trees (bedtools) [@Kent2002; @Quinlan2010], Augmented interval trees [@Cormen2001]
These methods try to structure the data
to provide non-containment guarantees
Methods provide non-containment guarantees
Annotates tree nodes with a minimum bounding rectangle of elements. A query that does not intersect the bounding rectangle will not intersect any child element.
Nested Containment Lists
Augmented Interval List
1. Augment the list with the running maximum *end* value. *solves the problem for lowly-contained lists*
2. Decompose the list to minimize containment. *extends the solution to highly-contained lists*
Augment with the running maximum end value, `maxE`
Provides a local guarantee of no containment.
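A minimal sketch of the augmentation, assuming half-open intervals sorted by start. The helper names `build_max_e` and `query_with_max_e` are hypothetical and chosen for illustration only.

```python
from bisect import bisect_left
from itertools import accumulate

def build_max_e(ends):
    """Running maximum of the end values (the maxE column)."""
    return list(accumulate(ends, max))

def query_with_max_e(starts, ends, max_e, q_start, q_end):
    """Return indices of intervals overlapping [q_start, q_end)."""
    i = bisect_left(starts, q_end) - 1
    hits = []
    # max_e[i] <= q_start means no interval at or before i can reach
    # the query, so the backward scan can stop safely.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:  # contained intervals may still miss
            hits.append(i)
        i -= 1
    return hits[::-1]

starts = [1, 2, 5]
ends = [100, 4, 7]          # the first interval contains the other two
max_e = build_max_e(ends)   # [100, 100, 100]
print(query_with_max_e(starts, ends, max_e, 50, 60))  # [0]
```

In this example, stopping on `ends[i]` alone would miss interval 0; stopping on `max_e[i]` finds it, at the cost of checking the two contained intervals along the way.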
AIList works on contained lists
But long containment runs are problematic
Decompose long runs with constant `maxE`
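One way to picture the decomposition is the simplified heuristic below. It is a sketch under stated assumptions, not the published AIList procedure: an interval is moved to a separate component when it contains at least `min_run` of the intervals that immediately follow it, and each component then gets its own `maxE`. The names `split_long_containers`, `decompose`, `query`, `min_run`, and `max_components` are hypothetical, and `min_run` is assumed to be at least 1.

```python
from bisect import bisect_left
from itertools import accumulate

def split_long_containers(intervals, min_run):
    """One pass: move intervals that contain >= min_run direct followers."""
    keep, moved = [], []
    n = len(intervals)
    for i, (start, end) in enumerate(intervals):
        j = i + 1
        # Count consecutive followers that would leave maxE unchanged,
        # stopping early once min_run of them have been seen.
        while j < n and j - i <= min_run and intervals[j][1] <= end:
            j += 1
        (moved if j - i - 1 >= min_run else keep).append((start, end))
    return keep, moved

def decompose(intervals, min_run=3, max_components=10):
    """Split start-sorted intervals into components with shorter runs."""
    components, rest = [], sorted(intervals)
    while rest and len(components) < max_components - 1:
        keep, rest = split_long_containers(rest, min_run)
        components.append(keep)
    if rest:
        components.append(rest)
    return components

def query(components, q_start, q_end):
    """Search every component with the maxE early stop."""
    hits = []
    for comp in components:
        # Recomputed per query for brevity; a real index would store these.
        starts = [s for s, _ in comp]
        max_e = list(accumulate((e for _, e in comp), max))
        i = bisect_left(starts, q_end) - 1
        while i >= 0 and max_e[i] > q_start:
            if comp[i][1] > q_start:
                hits.append(comp[i])
            i -= 1
    return sorted(hits)

intervals = [(0, 1000), (1, 5), (2, 6), (3, 7), (4, 8),
             (10, 12), (11, 900), (12, 14), (13, 15), (14, 16)]
components = decompose(intervals, min_run=3)
print(components[1])  # [(0, 1000), (11, 900)]
```

Because every component is searched, the result is the same as for the undecomposed list; the decomposition only changes how many intervals the backward scan has to visit.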
- How does the `maxE` minimum run length affect performance?
- How does it compare to existing approaches?
- How does it scale with increasing size of subject?
How does the `maxE` minimum run length affect performance?
How does it compare to existing approaches?
How does it scale with increasing size of subject?
- Augmented Interval Lists add the maximum running end value to a list of intervals
- The data structure is simpler than other methods
- AILists improve performance, particularly in highly contained interval sets
Expanding the search space
An integrated data structure
IGD uses linear binning
- The genome is divided into equal-size bins
- Database intervals are placed in any bins they overlap
- Intervals are sorted by start coordinate within a bin
- Single-layer data structure has less overhead
- Bins are independent
- Duplication = bigger database
- Duplication = potential for double-counting
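The binning scheme above can be sketched in a few lines. This is an illustration only, assuming half-open intervals on one chromosome; the function name `build_bins` and the toy bin size of 100 are hypothetical choices, not IGD's on-disk layout.

```python
from collections import defaultdict

def build_bins(intervals, bin_size):
    """Place each (start, end) interval in every bin it overlaps."""
    bins = defaultdict(list)
    for start, end in intervals:
        first = start // bin_size
        last = (end - 1) // bin_size  # end is exclusive
        for b in range(first, last + 1):
            bins[b].append((start, end))
    # Sort by start coordinate within each bin.
    for b in bins:
        bins[b].sort()
    return bins

bins = build_bins([(10, 20), (90, 210), (150, 160)], bin_size=100)
print(dict(bins))
# {0: [(10, 20), (90, 210)], 1: [(90, 210), (150, 160)], 2: [(90, 210)]}
```

The interval `(90, 210)` is stored three times, which shows both trade-offs listed above: the database grows, and a query spanning several bins can meet the same interval more than once.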
Challenge 1: Database size
- Adjustable with bin size
- In practice: 5-20% bigger than raw, unduplicated data
- Can be 2x or more if you have smaller bins than regions
- Default bin size: 16,384 (2^14)
Challenge 2: Double-counting
Occurs only when both query and subject interval cross the same
bin boundary.
Rule: If the query crosses the left boundary of a bin, then any region in that bin that also crosses the left boundary is skipped.
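The skip rule can be illustrated with a small counting function. This is a sketch under the same assumptions as a simple in-memory binning (half-open intervals, one chromosome); `build_bins` and `count_overlaps` are hypothetical names and the within-bin scan is a plain linear pass.

```python
from collections import defaultdict

def build_bins(intervals, bin_size):
    """Place each interval in every bin it overlaps, sorted by start."""
    bins = defaultdict(list)
    for start, end in intervals:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins[b].append((start, end))
    for b in bins:
        bins[b].sort()
    return bins

def count_overlaps(bins, q_start, q_end, bin_size):
    """Count each interval overlapping [q_start, q_end) exactly once."""
    count = 0
    for b in range(q_start // bin_size, (q_end - 1) // bin_size + 1):
        bin_left = b * bin_size
        query_crosses_left = q_start < bin_left
        for start, end in bins.get(b, []):
            if start >= q_end:
                break      # sorted by start: nothing later can overlap
            if end <= q_start:
                continue   # ends before the query begins
            if query_crosses_left and start < bin_left:
                continue   # both cross the boundary: counted in an earlier bin
            count += 1
    return count

bins = build_bins([(10, 20), (90, 210), (150, 160), (250, 260)], bin_size=100)
print(count_overlaps(bins, 95, 205, bin_size=100))  # 2
```

When both the query and a region cross a bin's left boundary, they already overlapped in the previous bin, so skipping the region there loses nothing.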
Question: Within a bin, how are overlaps calculated?
Can we use the AIList search algorithm?
Yes, but it doesn't help much because the bin size restricts the excess comparisons.
- IGD computes overlaps between a query and database of indexed interval sets
- IGD uses linear binning to index collections of region sets
- Because bins are independent, IGD uses little memory, and could be parallelized
- IGD reduces database size and increases performance
Word embeddings
Word2vec model
Word context
You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.
Image credit: Shubham Agarwal
Genomic context
A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.
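To make the analogy concrete, the sketch below treats a BED file as a "document" and each region as a "word", then lists the (center, context) pairs a skip-gram style model would train on. It is an illustration under assumptions, not the published pipeline: the token format, the shuffling step, the window size, and the names `region_token`, `bed_to_document`, and `context_pairs` are all hypothetical.

```python
import random

def region_token(chrom, start, end):
    """Turn a genomic interval into a word-like token."""
    return f"{chrom}:{start}-{end}"

def bed_to_document(regions, seed=0):
    """Treat one BED file as a document of region tokens.

    Tokens are shuffled (an assumption) so that context reflects
    co-occurrence in the file rather than genomic order.
    """
    tokens = [region_token(*r) for r in regions]
    random.Random(seed).shuffle(tokens)
    return tokens

def context_pairs(document, window=2):
    """List (center, context) pairs within a sliding window."""
    pairs = []
    for i, center in enumerate(document):
        lo = max(0, i - window)
        hi = min(len(document), i + window + 1)
        pairs.extend((center, document[j]) for j in range(lo, hi) if j != i)
    return pairs

doc = bed_to_document([("chr17", 38083473, 38083801),
                       ("chr17", 38108871, 38110066),
                       ("chr17", 38137142, 38137795)])
print(len(context_pairs(doc, window=1)))  # 4
```

Regions that share many BED files end up in many of the same pairs, which is the signal a word2vec-style model uses to place them close together.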
Genomic Interval Embeddings
We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?
Evaluation 1: Classification performance
Evaluation 2: Perturbation similarity detection
Evaluation 3: Peak threshold robustness
- Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
- Regionset2vec embeddings capture expected biological annotations
- Regionset2vec reflects known simulated perturbations
- Regionset2vec is robust to missing data
- NLP approaches can be adapted for applications in genomic interval analysis
### BEDbase goals
- Human browsing of statistical and biological attributes
- Human-friendly, *intelligent* search
- Programmatic API for metadata, statistics, and data chunks
- Integrative analytical results
- Data spans projects (all data on GEO)
BEDbase architecture
BEDbase is a microservice for data interoperability,
not another cloud platform
- Web interface (front-end)
- Clients (front-end)
- API (back-end)
- Database and files (back-end/infrastructure)
- Processing pipelines (infrastructure)
- Data served (content)
GRanges object with 37 ranges and 5 metadata columns:
seqnames ranges strand | name score
[Rle] [IRanges] [Rle] | [character] [integer]
[1] chr17 38083473-38083801 * | O-8A-H3K27ac_peak_11.. 21
[2] chr17 38108871-38110066 * | O-8A-H3K27ac_peak_11.. 25
[3] chr17 38137142-38137795 * | O-8A-H3K27ac_peak_11.. 33
[4] chr17 38210828-38211063 * | O-8A-H3K27ac_peak_11.. 17
[5] chr17 38218030-38220186 * | O-8A-H3K27ac_peak_11.. 19
... ... ... ... . ... ...
[33] chr17 38603620-38604355 * | O-8A-H3K27ac_peak_11.. 15
[34] chr17 38647047-38648053 * | O-8A-H3K27ac_peak_11.. 28
[35] chr17 38708445-38710424 * | O-8A-H3K27ac_peak_11.. 20
[36] chr17 38716283-38717201 * | O-8A-H3K27ac_peak_11.. 23
[37] chr17 38803702-38804538 * | O-8A-H3K27ac_peak_11.. 46
field8 field9 field10
[character] [character] [character]
[1] 3.91024 4.82245 2.12195
[2] 4.44410 5.31600 2.51785
[3] 4.30183 6.25865 3.33754
[4] 3.94862 4.34253 1.75554
[5] 3.74929 4.57115 1.93712
... ... ... ...
[33] 3.80116 4.07741 1.55623
[34] 4.35182 5.72967 2.89524
[35] 3.88836 4.73101 2.06784
[36] 4.24210 5.08399 2.33889
[37] 5.09106 7.93609 4.68279
A FastAPI application following JAMstack philosophy.
JAMstack forces you to build a comprehensive API.
OpenAPI docs
### Programmatic API to all metadata
- All metadata: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata
- Promoter frequency: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata?ids=promotercore_percentage
- Genome: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata?ids=genome
- Number of regions: http://dev1.bedbase.org/api/bed/78c0e4753d04b238fc07e4ebe5a02984/metadata?ids=regions_no
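The URLs above share one pattern, which a small client could wrap as follows. This is a sketch that assumes the endpoint returns JSON and that the development host shown in the links is reachable; `metadata_url` and `fetch_metadata` are hypothetical helper names, not part of a BEDbase client library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://dev1.bedbase.org/api"  # host taken from the links above

def metadata_url(bed_id, ids=None):
    """Build the metadata URL for one BED file, optionally for one field."""
    url = f"{BASE_URL}/bed/{bed_id}/metadata"
    if ids:
        url += "?" + urlencode({"ids": ids})
    return url

def fetch_metadata(bed_id, ids=None, timeout=10):
    """Fetch and decode the response (assumed to be JSON)."""
    with urlopen(metadata_url(bed_id, ids), timeout=timeout) as resp:
        return json.load(resp)

bed_id = "78c0e4753d04b238fc07e4ebe5a02984"
print(metadata_url(bed_id, ids="genome"))
```

Calling `fetch_metadata(bed_id, ids="regions_no")` would issue the request for the number of regions, provided the server is available.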
### Programmatic API to data chunks
You can use the API to extract entries in a defined region
## BEDbase data layer:
1. S3 for BED files (stats from 2022-09)
   - BED files n=65,137
   - 6.78 TB
   - 2.39 million total objects
2. PostgreSQL database on AWS managed Relational Database Service for file metadata
- `bbconf`: bedbase configuration object, connection to database
- `bedqc`: a pipeline for QC of BED files
- `bedmaker`: a pipeline to convert non-BED files into BED files
- `bedstat`: a pipeline to calculate stats for a BED file
- `bedbuncher`: a pipeline to create bedsets
- `bedembed`: a pipeline to create BED file embeddings
Connects the Gene Expression Omnibus (GEO)
and Sequence Read Archive (SRA)
with PEP format
Oleksandr Khoroshevskyi
geofetch --filter="bed|bigBed|narrowPeak|broadPeak"
- BEDbase provides BED data for humans and machines
- Output includes statistical and biological visualization
- Upcoming human-friendly search is powerful
- Programmatic access to data chunks improves interoperability
