New computational tools for epigenome analysis

Nathan Sheffield, PhD
www.databio.org/slides

Outline

PEPATAC
Regionset2vec
|
|

40%
40%
20%
|

GA4GH Sequence Collections
◁ Questions ▷

An optimized ATAC-seq pipeline
with serial alignments Smith et al (2021, In press). NAR Genomics and Bioinformatics.

Jason Smith

PEPATAC strengths

Modular system

Prealignments
Flexibility and portability

Outputs
Command-line interface with only 3 required arguments
$ /pipelines/pepatac.py -h

usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
                  [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
                  [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
                  [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
                  GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                  [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                  [--anno-name ANNO_NAME] [--keep]
                  [--peak-caller {fseq,macs2}]
                  [--trimmer {trimmomatic,skewer}]
                  [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

PEPATAC version 0.7.3

optional arguments:
  -h, --help            show this help message and exit
  -R, --recover         Overwrite locks to recover from previous failed run
  -N, --new-start       Overwrite all results to start a fresh run
  -D, --dirty           Don't auto-delete intermediate files
  -F, --force-follow    Always run 'follow' commands
  -C CONFIG_FILE, --config CONFIG_FILE
                        Pipeline configuration file (YAML). Relative paths are
                        with respect to the pipeline script.
  -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                        Parent output directory of project
  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                        Memory limit (in Mb) for processes accepting such
  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                        Number of cores for parallelized processes
  -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                        Secondary input files, such as read2
  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                        Single- or paired-end sequencing protocol
  -gs GENOME_SIZE, --genome-size GENOME_SIZE
                        genome size for MACS2
  --frip-ref-peaks FRIP_REF_PEAKS
                        Reference peak set for calculating FRiP
  --TSS-name TSS_NAME   Name of TSS annotation file
  --anno-name ANNO_NAME
                        Name of reference bed file for calculating FRiF
  --keep                Keep prealignment BAM files
  --peak-caller {fseq,macs2}
                        Name of peak caller
  --trimmer {trimmomatic,pyadapt,skewer}
                        Name of read trimming program
  --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                        Space-delimited list of reference genomes to align to
                        before primary alignment.
  -V, --version         show program's version number and exit

required named arguments:
  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name for sample to run
  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        One or more primary input files
  -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                        Identifier for genome assembly
Portable Encapsulated Projects (PEP) provide interoperability
Portable Encapsulated Projects (PEP) provide interoperability

PEP specification for sample metadata

1. Configuration file: config.yaml
pep_version: 2.0.0
sample_table: "path/to/sample_table.csv"
2. Tabular sample annotation table: sample_table.csv:
"sample_name", "protocol", "file"
"frog_1", "ATAC-seq", "frog1.fq.gz"
"frog_2", "ATAC-seq", "frog2.fq.gz"
"frog_3", "ATAC-seq", "frog3.fq.gz"
"frog_4", "ATAC-seq", "frog4.fq.gz"
pep.databio.org

MapReduce or Scatter/Gather

1. Map/Scatter PEPATAC across individual samples
looper run config.yaml
2. Gather results and do cross-sample analysis
looper runp config.yaml

PEPATAC strengths

Modular system

Prealignments
Flexibility and portability

Outputs



Nuclear-mitochondrial DNA (NuMts) confuse aligners
Problems with region masking
  • Inaccurate alignment statistics
  • Requires pre-defined NuMt locations
  • Wastes compute power
  • ### Advantages of serial alignments - Accuracy (better rates plus no blacklist needed). - Speed. - Modular reference assemblies.

    PEPATAC strengths

    Modular system

    Prealignments
    Flexibility and portability

    Outputs
    ### Flexibility and Portability - trimmer options: `skewer` and `trimmomatic` - peak caller options: `macs2` and `fseq` - aligner options: `bowtie2` and `bwa` ``` ./pepatac.py --trimmer trimmomatic --peak-caller fseq ```

    Flexibility and Portability

  • parameterization via config file pepatac.yaml
  • # basic tools 
    tools:  # absolute paths to required tools
      java: java
      python: python
      samtools: samtools
      bedtools: bedtools
      bowtie2: bowtie2
      fastqc: fastqc
      macs2: macs2
      picard: ${PICARD}
      skewer: skewer
      perl: perl
      # ucsc tools
      bedGraphToBigWig: bedGraphToBigWig
      wigToBigWig: wigToBigWig
      bigWigCat: bigWigCat
      bedSort: bedSort
      bedToBigBed: bedToBigBed
      # optional tools
      fseq: fseq  
      trimmo: ${TRIMMOMATIC}
      Rscript: Rscript 
    
    # user configure 
    resources:
      genomes: ${GENOMES}
      adapters: null  # Set to null to use default adapters
    
    parameters:  # parameters passed to bioinformatic tools
      samtools:
        q: 10
      macs2: 
        f: BED
        q: 0.01
        shift: 0
      fseq:
        of: npf    # narrowPeak as output format
        l: 600     # feature length
        t: 4.0     # "threshold" (standard deviations)
        s: 1       # wiggle track step
    ### Flexibility and Portability Running options: - natively - conda - containers using `docker` or `singularity`. - use bulker to manage containers for your (http://bulker.io) ``` git clone github.com/databio/pepatac docker pull databio/pepatac docker run --rm -it databio/pepatac pipelines/pepatac.py ```

    PEPATAC strengths

    Modular system

    Prealignments
    Flexibility and portability

    Outputs

    Output


    http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html

    PEPATAC in practice

  • O'Connor et al. (2021). bioRxiv. DOI: 10.1101/2021.07.15.452570
  • Ram-Mohan et al. (2021). Life Science Alliance. DOI: 10.26508/lsa.202000976
  • Robertson et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00880-5
  • Cheung et al. (2021). DOI: 10.1038/s41590-021-00928-y
  • Hasegawa et al. (2021). bioRxiv. DOI: 10.1101/2021.04.28.441728
  • Weber et al. (2021). Science. DOI: 10.1126/science.aba1786
  • Tovar et al. (2021). bioRxiv. DOI: 10.1101/2021.01.29.428733
  • Granja et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00790-6
  • Fan et al. (2020). Cell Reports. DOI: 10.1016/j.celrep.2020.108473
  • Smith and Sheffield (2020). Current Protocols in Human Genetics. DOI: 10.1002/cphg.101
  • Liu (2020). DOI: 10.18632/oncotarget.27584
  • Zhou et al. (2020). bioRxiv. DOI: 10.1101/2020.05.16.099325
  • Cai et al. (2020). DOI: 10.1186/s12920-020-0695-0
  • Li et al. (2020). DOI: 10.1038/s41419-020-2303-9
  • Liang et al. (2019). DOI: 10.1002/1873-3468.13549
  • Corces et al. (2018). Science. DOI: 10.1126/science.aav1898

  • Unique identifiers and API for sequence collections. Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.

    The problem: Who is the authoritative provider of the reference genome?

    • NCBI
    • UCSC
    • Ensembl
    Variation includes:
    • hard, soft, or no repeat masking?
    • are alternative scaffolds included?
    • are haplotypes included?
    • how are chromosomes named (chr1, 1, or NC_000001.11)?
    • how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
    • Are any decoy sequences included (like EBV)?
    Andy Yates' "Genome provider analysis"
    Provider Chr1 name Chr1 length Chr1 md5 Num chroms
    Ensembl primary 1 248956422 2648ae1bacce4ec4b6cf337dcae37816 195
    Ensembl toplevel 1 248956422 2648ae1bacce4ec4b6cf337dcae37816 649
    NCBI NC_000001.11 248956422 6aef897c3d6ff0c78aff06ac189178dd 640
    UCSC chr1 248956422 2648ae1bacce4ec4b6cf337dcae37816 456
    https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6
    Subtle differences in reference assembly lead to:
    1. Lack of reproducibility of analysis
    2. Lack of reusability of results

    Solution


    Refget -> Sequence collections

    Refget

    Refget enables access to reference sequences
    using an identifier derived from the sequence itself.

    How refget works

    Limitations

    • only handles a single sequence
    • excludes chromosome names

    Extending to sequence collections

    We need:
    • 1. An algorithm to create a deterministic, unique digest from a collection of sequences
    • 2. A server capable of retrieving sequences given an identifier

    First pass: Refgenie approach

    Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.
    refgenomes.databio.org

    Limitations and discussion

    • Should we include sequence topology in the digest?
    • What other attributes could we include?
    • Are there better delimiters?
    • How do we construct the 'string-to-digest'?
    • How do we handle order of sequences?

    Current proposal: Array-based protocol

    Current proposal: Array-based protocol

    Current proposal: Array-based protocol

    seqcol = { 'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
               'lengths': ['186739', '176608', '161147'],
               'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99',
                             '725009a7e3f5b78752b68afa922c090c',
                             '1d708b54644c26c7e01c2dad5426d38c'] }
    
    seqcol = { 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
               'lengths':   '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
               'names':     '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43'}
    

    Advantages

    • Accommodates new attributes with backwards-compatibility
    • Additional layer of recursion to assess individual attributes
    • Requires only a single delimiter

    How should we refer to reference genomes in practice?

    refgenie alias get

    Comparison function

    Provider Chr1 name Chr1 length Chr1 md5 Num chroms
    Ensembl primary 1 248956422 2648ae1bacce4ec4b6cf337dcae37816 195
    Ensembl toplevel 1 248956422 2648ae1bacce4ec4b6cf337dcae37816 649
    NCBI NC_000001.11 248956422 6aef897c3d6ff0c78aff06ac189178dd 640
    UCSC chr1 248956422 2648ae1bacce4ec4b6cf337dcae37816 456
    • seqcol 1: 047c6e1eda552b50c5add59ff0995
    • seqcol 2: 2230c535660fb4774114bfa966a62

    How compatible are they?

    API Endpoint: POST /compare; GET /compare
    Return value:
      
    {
      "lengths": {
        "any-elements-shared": true,
        "all-a-in-b": true,
        "all-b-in-a": true,
        "order-match": true,
      },
      "names": {
        "any-elements-shared": true,
        "all-a-in-b": true,
        "all-b-in-a": true,
        "order-match": true,
      },
      "sequences": {
        "any-elements-shared": false,
        "all-a-in-b": false,
        "all-b-in-a": false,
        "order-match": false,
      }
    }
    

    Seqcol API demonstration

    http://seqcolapi.databio.org

    Conclusions

    • Refget provides universal IDs for individual sequences
    • Sequence collections extends this to reference genomes
    • Using a deterministic algorithm, you can find the identifier
    • A lookup service can retrieve the original sequence
    • A comparison function allows fine-grained compatibility tests
    • Please follow along: https://github.com/ga4gh/seqcol-spec

    Region-set 2 Vec

    Embeddings of genomic region sets
    in lower dimensions.
    Gharavi et al. (2021). Bioinformatics.

    Erfaneh Gharavi
    What does it mean for two region sets (BED files) to be similar?
    Overlaps makes some sense...but what about:
    degree of overlap?
    weighting of specific regions?
    biological similarity of regions?

    The bag-of-words model for text classification


    Zheng and Casari (2018), Feature Engineering for Machine Learning

    The bag-of-intervals model for genomic intervals

      Advantages
    • Vector representation of a region set
    • Similarity metrics among vectors
    • Space and time complexity

    Limitations of the bag of words vector approach

    hotel = [0 0 0 0 0 0 0 1 0 0 0 0 0 0]
    motel = [0 0 0 0 0 0 0 0 0 0 0 0 1 0]
    
    • Sparsity
    • Curse of dimensionality
    • No concept of relationships among words
    • Space and time complexity

    Decreasing space/time complexity

    Genomic interval sets

    High-dimensional vectors

    Low-dimensional vectors

    Word embeddings

    http://suriyadeepan.github.io

    Word2vec model

    Word2vec model


    Mikolov et al. (2013). arXiv:1301.3781v3.

    Word context

    You shall know a word by the company it keeps. (Firth 1957)
    Words that occur in similar contexts tend to have similar meanings.
    Image credit: Shubham Agarwal

    Genomic Interval Embeddings

    Evaluation

    We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
    Do relationships among vectors reflect biology?

    Evaluation 1: Classification performance

    Evaluation 1: Classification performance

    Evaluation 2: Perturbation similarity detection

    Evaluation 3: Peak threshold robustness

    Conclusion

    • Regionset2vec uses an adapted word2vec model to train vectors for genomic regions
    • Regionset2vec embeddings capture expected biological annotations
    • Regionset2vec reflects known simulated perturbations
    • Regionset2vec is robust to missing data
    • NLP approaches can be adapted for applications in genomic interval analysis

    Next steps

    Region embeddings help with:
    Given a BED file, find a similar BED file.
    (LOLA, GIGGLE, IGD)

    What about:
    Given a human search term, find a BED file.

    A high-performance server and API
    for genomic interval data.
    • Data spans projects (e.g. all data on GEO; 40,000 accessions, 100,000+ BED files)
    • Programmatic API for metadata, statistics, and data chunks
    • Human browsing of statistical and biological attributes
    • Aware of similarities among BED files
    • Human-friendly search
    • Shaped into 'non-redundant' sets for analysis

    Thank You

    Collaborators
    Aakrosh Ratan
    Aidong Zhang
    Guangtao Zheng
    Don Brown
    Hyun Jae Cho
    Vince Carey
    Mikhail Dozmorov
    GA4GH Refget group

    Alumni
    Aaron Gu
    Jianglin Feng
    Ognen Duzlevski
    Tessa Danehy
    Sheffield lab
    Erfaneh Gharavi
    Michal Stolarczyk
    John Lawson
    Jason Smith
    Kristyna Kupkova
    John Stubbs
    Bingjie Xue
    Jose Verdezoto
    Nathan LeRoy
    Oleksandr Khoroshevskyi
    Funding:



    NIGMS R35-GM128636

    nsheff · databio.org · nsheffield@virginia.edu