New computational tools for epigenome analysis

Nathan Sheffield, PhD

www.databio.org/slides

Outline

PEPATAC

Regionset2vec

40%

20%

GA4GH Sequence Collections

◁ Questions ▷

An optimized ATAC-seq pipeline
with serial alignments

http://pepatac.databio.org

Smith et al (2021, In press). NAR Genomics and Bioinformatics.

Jason Smith

PEPATAC strengths

Modular system

Prealignments

Flexibility and portability

Outputs

Command-line interface with only 3 required arguments

$ /pipelines/pepatac.py -h


usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
                  [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
                  [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
                  [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
                  GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                  [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                  [--anno-name ANNO_NAME] [--keep]
                  [--peak-caller {fseq,macs2}]
                  [--trimmer {trimmomatic,skewer}]
                  [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

PEPATAC version 0.7.3

optional arguments:
  -h, --help            show this help message and exit
  -R, --recover         Overwrite locks to recover from previous failed run
  -N, --new-start       Overwrite all results to start a fresh run
  -D, --dirty           Don't auto-delete intermediate files
  -F, --force-follow    Always run 'follow' commands
  -C CONFIG_FILE, --config CONFIG_FILE
                        Pipeline configuration file (YAML). Relative paths are
                        with respect to the pipeline script.
  -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                        Parent output directory of project
  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                        Memory limit (in Mb) for processes accepting such
  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                        Number of cores for parallelized processes
  -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                        Secondary input files, such as read2
  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                        Single- or paired-end sequencing protocol
  -gs GENOME_SIZE, --genome-size GENOME_SIZE
                        genome size for MACS2
  --frip-ref-peaks FRIP_REF_PEAKS
                        Reference peak set for calculating FRiP
  --TSS-name TSS_NAME   Name of TSS annotation file
  --anno-name ANNO_NAME
                        Name of reference bed file for calculating FRiF
  --keep                Keep prealignment BAM files
  --peak-caller {fseq,macs2}
                        Name of peak caller
  --trimmer {trimmomatic,pyadapt,skewer}
                        Name of read trimming program
  --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                        Space-delimited list of reference genomes to align to
                        before primary alignment.
  -V, --version         show program's version number and exit

required named arguments:
  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name for sample to run
  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        One or more primary input files
  -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                        Identifier for genome assembly

Portable Encapsulated Projects (PEP) provide interoperability

PEP specification for sample metadata

1. Configuration file: config.yaml

pep_version: 2.0.0
sample_table: "path/to/sample_table.csv"

2. Tabular sample annotation table: sample_table.csv:

"sample_name", "protocol", "file"
"frog_1", "ATAC-seq", "frog1.fq.gz"
"frog_2", "ATAC-seq", "frog2.fq.gz"
"frog_3", "ATAC-seq", "frog3.fq.gz"
"frog_4", "ATAC-seq", "frog4.fq.gz"

pep.databio.org

MapReduce or Scatter/Gather

1. Map/Scatter PEPATAC across individual samples looper run config.yaml 2. Gather results and do cross-sample analysis looper runp config.yaml

PEPATAC strengths

Modular system

Prealignments

Flexibility and portability

Outputs

Nuclear-mitochondrial DNA (NuMts) confuse aligners

Problems with region masking

Inaccurate alignment statistics

Requires pre-defined NuMt locations

Wastes compute power

### Advantages of serial alignments - Accuracy (better rates plus no blacklist needed). - Speed. - Modular reference assemblies.

PEPATAC strengths

Modular system

Prealignments

Flexibility and portability

Outputs

### Flexibility and Portability - trimmer options: `skewer` and `trimmomatic` - peak caller options: `macs2` and `fseq` - aligner options: `bowtie2` and `bwa` ``` ./pepatac.py --trimmer trimmomatic --peak-caller fseq ```

Flexibility and Portability

parameterization via config file pepatac.yaml

# basic tools 
tools:  # absolute paths to required tools
  java: java
  python: python
  samtools: samtools
  bedtools: bedtools
  bowtie2: bowtie2
  fastqc: fastqc
  macs2: macs2
  picard: ${PICARD}
  skewer: skewer
  perl: perl
  # ucsc tools
  bedGraphToBigWig: bedGraphToBigWig
  wigToBigWig: wigToBigWig
  bigWigCat: bigWigCat
  bedSort: bedSort
  bedToBigBed: bedToBigBed
  # optional tools
  fseq: fseq  
  trimmo: ${TRIMMOMATIC}
  Rscript: Rscript 

# user configure 
resources:
  genomes: ${GENOMES}
  adapters: null  # Set to null to use default adapters

parameters:  # parameters passed to bioinformatic tools
  samtools:
    q: 10
  macs2: 
    f: BED
    q: 0.01
    shift: 0
  fseq:
    of: npf    # narrowPeak as output format
    l: 600     # feature length
    t: 4.0     # "threshold" (standard deviations)
    s: 1       # wiggle track step

### Flexibility and Portability Running options: - natively - conda - containers using `docker` or `singularity`. - use bulker to manage containers for your (http://bulker.io) ``` git clone github.com/databio/pepatac docker pull databio/pepatac docker run --rm -it databio/pepatac pipelines/pepatac.py ```

PEPATAC strengths

Modular system

Prealignments

Flexibility and portability

Outputs

Output

http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html

PEPATAC in practice

O'Connor et al. (2021). bioRxiv. DOI: 10.1101/2021.07.15.452570

Ram-Mohan et al. (2021). Life Science Alliance. DOI: 10.26508/lsa.202000976

Robertson et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00880-5

Cheung et al. (2021). DOI: 10.1038/s41590-021-00928-y

Hasegawa et al. (2021). bioRxiv. DOI: 10.1101/2021.04.28.441728

Weber et al. (2021). Science. DOI: 10.1126/science.aba1786

Tovar et al. (2021). bioRxiv. DOI: 10.1101/2021.01.29.428733

Granja et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00790-6

Fan et al. (2020). Cell Reports. DOI: 10.1016/j.celrep.2020.108473

Smith and Sheffield (2020). Current Protocols in Human Genetics. DOI: 10.1002/cphg.101

Liu (2020). DOI: 10.18632/oncotarget.27584

Zhou et al. (2020). bioRxiv. DOI: 10.1101/2020.05.16.099325

Cai et al. (2020). DOI: 10.1186/s12920-020-0695-0

Li et al. (2020). DOI: 10.1038/s41419-020-2303-9

Liang et al. (2019). DOI: 10.1002/1873-3468.13549

Corces et al. (2018). Science. DOI: 10.1126/science.aav1898

Sequence Collections
Unique identifiers and API for sequence collections.

https://seqcol.readthedocs.io

Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.

Problem

Who is the authoritative provider of the reference genome?

NCBI?
UCSC?
Ensembl?

Variation includes:

hard, soft, or no repeat masking?
are alternative scaffolds included?
are haplotypes included?
how are chromosomes named (chr1, 1, or NC_000001.11)?
how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
Are any decoy sequences included (like EBV)?

Andy Yates' "Genome provider analysis"

Provider	Chr1 name	Chr1 length	Chr1 md5	Num chroms
Ensembl primary	1	248956422	2648ae1bacce4ec4b6cf337dcae37816	195
Ensembl toplevel	1	248956422	2648ae1bacce4ec4b6cf337dcae37816	649
NCBI	NC_000001.11	248956422	6aef897c3d6ff0c78aff06ac189178dd	640
UCSC	chr1	248956422	2648ae1bacce4ec4b6cf337dcae37816	456

https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6

Subtle differences in reference assembly lead to:

Lack of reproducibility of analysis
Lack of reusability of results

Solution

Refget -> Sequence collections

Refget

Refget enables access to reference sequences
using an identifier derived from the sequence itself.

http://samtools.github.io/hts-specs/refget.html

How refget works

Limitations

only handles a single sequence
excludes chromosome names
no capacity for annotation

Extending to sequence collections

We need:

1. An algorithm to create a deterministic, unique digest from a collection of sequences
2. A server capable of retrieving sequences given an identifier

First pass: Refgenie approach

Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.

refgenomes.databio.org

Limitations and discussion

Should we include sequence topology in the digest?
What other attributes could we include?
Are there better delimiters?
How do we construct the 'string-to-digest'?
How do we handle order of sequences?
How should the API respond to requests?

Project goal:

to standardize unique identifiers for collections of sequences

can be used to identify genomes, transcriptomes, or proteomes -- anything that can be represented as a collection of sequences

The project specifies:

an algorithm for computing sequence identifiers from collections

a lookup API to retrieve a collection given an identifier

a comparison API to assess compatibility of two collections

How do we digest a sequence collection?

JSON object: each sequence collection attribute is a property

{
  "lengths": [
    4,
    4,
    8
  ],
  "names": [
    "chr1",
    "chr2",
    "chrX"
  ],
  "sequences": [
    "31fc6ca291a32fb9df82b85e5f077e31",
    "92c6a56c9e9459d8a42b96f7884710bc",
    "5f63cfaa3ef61f88c9635fb9d18ec945"
  ]
}

← length of the sequences

← names of the sequences

← refget digests

You can drop the sequences attribute:

{
  "lengths": [
    4,
    4,
    8
  ],
  "names": [
    "chr1",
    "chr2",
    "chrX"
  ],
  "sequences": [
    "31fc6ca291a32fb9df...",
    "92c6a56c9e9459d8a4...",
    "5f63cfaa3ef61f88c9..."
  ]
}

{
  "lengths": [
    4,
    4,
    8
  ],
  "names": [
    "chr1",
    "chr2",
    "chrX"
  ]
}

Or add a topology attribute

{
  "lengths": [
    4,
    4,
    8
  ],
  "names": [
    "chr1",
    "chr2",
    "chrX"
  ],
  "sequences": [
    "31fc6ca291a32fb9df...",
    "92c6a56c9e9459d8a4...",
    "5f63cfaa3ef61f88c9..."
  ],
  "topologies" [ 
    "linear",
    "linear",
    "circular"
  ]
}

Digest algorithm

Canonicalize each attribute following RFC-8785 (JSON Canonicalization Scheme)
Digest each string (GA4GH digest: SHA512 truncated to 24 bits, converted to base64)
Canonicalize the entire object
Digest the canonicalized string

Example

Tim Cezard

Advantages

Accommodates new attributes with backwards-compatibility
Additional layer of recursion to assess individual attributes
Relies on existing JCS standard for string encoding

What gets digested?

Inherent attributes are included in the calculation of the identifier

Non-inherent attributes enables storing additional metadata, comparison helpers, etc

These are specified using a schema

Comparison function

Provider	Chr1 name	Chr1 length	Chr1 md5	Num chroms
Ensembl primary	1	248956422	2648ae1bacce4ec4b6cf337dcae37816	195
Ensembl toplevel	1	248956422	2648ae1bacce4ec4b6cf337dcae37816	649
NCBI	NC_000001.11	248956422	6aef897c3d6ff0c78aff06ac189178dd	640
UCSC	chr1	248956422	2648ae1bacce4ec4b6cf337dcae37816	456

seqcol 1: 047c6e1eda552b50c5add59ff0995
seqcol 2: 2230c535660fb4774114bfa966a62

How compatible are they?

Comparison endpoint

  
{
  "digests": {
    "a": "59319772d1bcf2e0dd4b8a296f2d9682",
    "b": "2e7bc302a54ecec62d8155e19fbf2748"
  },
  "arrays": {
    "a-only": [],
    "b-only": [],
    "a-and-b": [
      "lengths",
      "names",
      "sequences",
      "names_lengths"
    ]
  },
  "elements": {
    "total": {
      "a": 3,
      "b": 3
    },
    "a-and-b": {
      "lengths": 3,
      "names": 3,
      "sequences": 3,
      "names_lengths": 3
    },
    "a-and-b-same-order": {
      "lengths": false,
      "names": false,
      "sequences": false,
      "names_lengths": true
    }
  }
}

Seqcol API demonstration

https://seqcolapi.databio.org/

API endpoints

GET /service-info

GET /collection/:digest

GET /comparison/:digest1/:digest2

POST /comparison/:digest1

Conclusions

Refget provides universal IDs for individual sequences
Sequence collections extends this to reference genomes
Using a deterministic algorithm, you can find the identifier
A lookup service can retrieve the original sequence
A comparison function allows fine-grained compatibility tests
Please follow along: https://github.com/ga4gh/seqcol-spec

Region-set 2 Vec

Embeddings of genomic region sets
in lower dimensions.

https://github.com/databio/regionset-embedding

Gharavi et al. (2021). Bioinformatics.

Erfaneh Gharavi

Word embeddings

http://suriyadeepan.github.io

Word2vec model

Mikolov et al. (2013). arXiv:1301.3781v3.

Word context

You shall know a word by the company it keeps. (Firth 1957)
Words that occur in similar contexts tend to have similar meanings.

Image credit: Shubham Agarwal

Genomic context

A genomic interval is more likely to appear in a BED file with other genomic intervals of a similar function.

Genomic Interval Embeddings

Evaluation

We have created unsupervised 100-dimensional vector representations (embeddings) of region sets.
Do relationships among vectors reflect biology?

Evaluation 1: Classification performance

Conclusion

Regionset2vec adapts word2vec to learn genomic region embeddings
Regionset2vec embeddings capture biological information
NLP approaches can be adapted for applications in genomic interval analysis

Next steps

Region embeddings help with:
Given a BED file, find a similar BED file.
(LOLA, GIGGLE, IGD)

What about:
Given a human search term, find a BED file.

A high-performance server and API
for genomic interval data.

http://bedbase.org

Data spans projects (e.g. all data on GEO; 40,000 accessions, 100,000+ BED files)
Programmatic API for metadata, statistics, and data chunks
Human browsing of statistical and biological attributes
Aware of similarities among BED files
Human-friendly search
Shaped into 'non-redundant' sets for analysis

Thank You

Collaborators
Aakrosh Ratan
Aidong Zhang
Guangtao Zheng
Don Brown
Hyun Jae Cho
Vince Carey
Mikhail Dozmorov
GA4GH Refget group

Alumni
Aaron Gu
Jianglin Feng
Ognen Duzlevski
Tessa Danehy

Sheffield lab
Erfaneh Gharavi
Michal Stolarczyk
John Lawson
Jason Smith
Kristyna Kupkova
John Stubbs
Bingjie Xue
Jose Verdezoto
Nathan LeRoy
Oleksandr Khoroshevskyi

Funding:

NIGMS R35-GM128636

nsheff ·

databio.org ·

nsheffield@virginia.edu