Sheffield and Bock (2016). Bioinformatics. Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.
If the subject list has no containment, identifying overlaps is fast:
binary search on interval start positions, followed by backward steps.
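The search above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the half-open `[start, end)` convention, and the parallel `starts`/`ends` lists are assumptions made for the example. It relies on the list being sorted by start and free of containment, so that end values are sorted too.

```python
from bisect import bisect_left

def overlaps_no_containment(starts, ends, q_start, q_end):
    """Return indices of intervals overlapping the half-open query [q_start, q_end).

    Assumes intervals are sorted by start and none contains another,
    which implies the end values are also in ascending order.
    """
    hits = []
    # Binary search: last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    # Backward steps: stop at the first interval that ends before the query starts.
    # With no containment, every earlier interval ends even sooner.
    while i >= 0 and ends[i] > q_start:
        hits.append(i)
        i -= 1
    return hits[::-1]

# Hypothetical example data with no containment.
starts = [1, 4, 8, 12, 20]
ends = [5, 9, 11, 15, 25]
print(overlaps_no_containment(starts, ends, 6, 13))  # [1, 2, 3]
```

The early stop is what breaks once an interval contains its successors: a long interval further back may still overlap the query after a short one has already failed the test.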
The problem arises with contained interval overlaps
How can we improve efficiency without guaranteeing no containment?
Many approaches to solve the 'containment' issue:
- Nested Containment Lists (GRanges) [@Alekseyenko2007; @Aboyoun2012]
- R-trees (bedtools) [@Kent2002; @Quinlan2010], Augmented interval trees [@Cormen2001]
These methods structure the data to provide non-containment guarantees
R-trees
Annotates tree nodes with a minimum bounding rectangle of elements. A query that does not intersect the bounding rectangle will not intersect any child element.
Nested Containment Lists
Augmented Interval List
1. Augment the list with the running maximum *end* value. *solves the problem for lists with little containment*
2. Decompose the list to minimize containment. *extends the solution to highly contained lists*
Augment with the running maximum end value, `maxE`
Provides a local guarantee of no containment.
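A minimal sketch of the augmentation, assuming half-open intervals stored as parallel `starts`/`ends` lists sorted by start. The function names are hypothetical and chosen for illustration. The backward scan stops as soon as `maxE` falls to or below the query start, because no earlier interval can reach the query.

```python
from bisect import bisect_left
from itertools import accumulate

def build_max_e(ends):
    """Running maximum of the end values, one entry per interval."""
    return list(accumulate(ends, max))

def query_with_max_e(starts, ends, max_e, q_start, q_end):
    """Return indices of intervals overlapping the half-open query [q_start, q_end)."""
    hits = []
    # Last interval whose start lies before the query end.
    i = bisect_left(starts, q_end) - 1
    # Step backward while some interval at or before i could still reach the query.
    while i >= 0 and max_e[i] > q_start:
        if ends[i] > q_start:
            hits.append(i)
        i -= 1
    return hits[::-1]

# Hypothetical example: [3, 20) contains the three intervals that follow it.
starts = [1, 3, 4, 8, 12]
ends = [5, 20, 6, 10, 15]
max_e = build_max_e(ends)
print(max_e)                                              # [5, 20, 20, 20, 20]
print(query_with_max_e(starts, ends, max_e, 16, 18))      # [1]
```

In this example the scan visits indices 4, 3, and 2 without a hit before reaching the containing interval at index 1. That run of constant `maxE` is the cost that grows when containment is heavy.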
AIList works on contained lists
But long containment runs are problematic
Decompose long runs with constant `maxE`
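One way the decomposition could work is sketched below. The rule used here is an assumption for illustration: an interval is moved to a separate sub-list when it contains at least `min_cov` of the next `2 * min_cov` intervals. The threshold, the window size, and the function name are illustrative choices, not values taken from the slides. Each resulting sub-list would then get its own `maxE` array and be queried independently.

```python
def decompose(intervals, min_cov=2):
    """Split (start, end) tuples into sub-lists with short containment runs.

    Assumed rule: an interval that contains at least `min_cov` of the next
    2 * min_cov intervals (in start order) is moved to the next sub-list.
    Requires min_cov >= 1.
    """
    components = []
    current = sorted(intervals)
    while current:
        keep, moved = [], []
        for i, (start, end) in enumerate(current):
            window = current[i + 1 : i + 1 + 2 * min_cov]
            # In start order, a later interval is contained if it ends no later.
            covered = sum(1 for _, other_end in window if other_end <= end)
            if covered >= min_cov:
                moved.append((start, end))
            else:
                keep.append((start, end))
        components.append(keep)
        # The last interval is always kept, so `moved` shrinks on every pass.
        current = moved
    return components

# Hypothetical example: [3, 20) contains three of the intervals after it.
intervals = [(1, 5), (3, 20), (4, 6), (8, 10), (12, 15), (22, 25)]
for component in decompose(intervals):
    print(component)
```

With these inputs the long interval `(3, 20)` ends up alone in a second sub-list, so the first sub-list no longer has a long run of constant `maxE`.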
Performance
How does the `maxE` minimum run length affect performance?
How does it compare to existing approaches?
How does it scale with increasing size of subject?
Datasets
Conclusion
Augmented Interval Lists add the running maximum end value to a list of intervals
The data structure is simpler than other methods
AILists improve performance, particularly for highly contained interval sets
Conclusion
Pepkit provides a start-to-finish toolkit for processing epigenome data.