Choose your own adventure through Sheffield lab research

Nathan Sheffield, PhD
www.databio.org/slides

Mission statement

We develop and apply computational methods
to organize, analyze, and understand large epigenomic data.


Biological motivation




Cells alter phenotype by using DNA differently.

Breakdowns lead to disease

Full-stack bioinformatics


Full-stack bioinformatics


Full-stack bioinformatics






Full-stack bioinformatics


Augmented Interval List

Full-stack bioinformatics





Full-stack bioinformatics


Analysis of DNA methylation in Ewing sarcoma

Full-stack bioinformatics



Bonus (unpublished) topics

- [bedbase.org](http://bedbase.org)

Locus Overlap Analysis

Sheffield and Bock (2016). Bioinformatics.
Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

Methylation-based Inference of Regulatory Activity (MIRA)

Lawson et al. (2018). Bioinformatics.

Sheffield et al. (2017). Nature Medicine.

Coordinate Covariation Analysis (COCOA)


Lawson et al. (2020). Genome Biology.

Goal: understand variation among individuals


Supervised differential analysis

Supervised continuous analysis

Unsupervised analysis

Epigenomic data: high-dimensional
and low-interpretable


Dimensionality reduction

Even with known groups
How can we annotate the source of variation?

COCOA Overview

John Lawson

Coordinate Covariation Analysis

  1. Quantify variation into a 'target variable'
    1. Supervised (e.g. clincial variable).
    2. Unsupervised (e.g. PCA)
  2. Annotate target variable with region sets.

What is epigenetic signal covariation?


 

What is epigenetic signal covariation?


 

What is epigenetic signal covariation?


 

What is epigenetic signal covariation?


 

What is epigenetic signal covariation?


Covariation informs source of observed variation

1. Choose target variable

What is the variation we'd like to explain?
Supervised target
Unsupervised target

 

2. Quantify correlation with target variable


 

2. Quantify correlation with target variable


 

Permutation tests establish significance


 


Case studies

Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis

Breast cancer DNA methylation PCA


COCOA results for PC1


ER-related regions have higher loadings on PC1

Raw DNA Methylation in ER binding regions


COCOA results for PC1-4


COCOA results for PC1-4


COCOA meta-region plots for PC1-4


Case studies

Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis

Breast cancer ATAC-seq PCA


COCOA results for ATAC-seq


ER-related regions have higher loadings on PC1

COCOA results for ATAC-seq


Case studies

Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis

Kidney cancer DNA methylation (Supervised)


Rank region sets for methylation
that correlates with cancer stage

COCOA results for cancer stage

COCOA results for cancer stage

COCOA results for cancer stage

Case studies

Breast cancer DNA methylation (Unsupervised)
Breast cancer ATAC-seq (Unsupervised)
Kidney cancer DNA methylation (Supervised)
Pan-cancer EZH2 analysis

DNA methylation in EZH2 regions and survival


DNA methylation in EZH2-binding regions most often positively correlated with risk of death.
#### Conclusions: COCOA... - can an interpret continuous regulatory variation - can use any signal that annotates genomic coordinates - can work on supervised and unsupervised data - is available from Bioconductor - depends critically on the database of region sets...
#### Introducing BEDbase: a comprehensive source of region sets - Will contain every BED file on GEO (40,000 accessions, estimated 200,000-300,000 BED files). - Includes a web API for filtering and searching BED files - Shaped into 'non-redundant' sets for COCOA analysis

An optimized ATAC-seq pipeline
with serial alignments Smith et al (2021). bioRxiv.

Comparison

PEPATAC strengths

Modular system

Prealignments
Flexibility and portability

Outputs
$ /pipelines/pepatac.py -h

usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
                  [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
                  [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
                  [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
                  GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                  [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                  [--anno-name ANNO_NAME] [--keep]
                  [--peak-caller {fseq,macs2}]
                  [--trimmer {trimmomatic,skewer}]
                  [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

PEPATAC version 0.7.3

optional arguments:
  -h, --help            show this help message and exit
  -R, --recover         Overwrite locks to recover from previous failed run
  -N, --new-start       Overwrite all results to start a fresh run
  -D, --dirty           Don't auto-delete intermediate files
  -F, --force-follow    Always run 'follow' commands
  -C CONFIG_FILE, --config CONFIG_FILE
                        Pipeline configuration file (YAML). Relative paths are
                        with respect to the pipeline script.
  -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                        Parent output directory of project
  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                        Memory limit (in Mb) for processes accepting such
  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                        Number of cores for parallelized processes
  -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                        Secondary input files, such as read2
  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                        Single- or paired-end sequencing protocol
  -gs GENOME_SIZE, --genome-size GENOME_SIZE
                        genome size for MACS2
  --frip-ref-peaks FRIP_REF_PEAKS
                        Reference peak set for calculating FRiP
  --TSS-name TSS_NAME   Name of TSS annotation file
  --anno-name ANNO_NAME
                        Name of reference bed file for calculating FRiF
  --keep                Keep prealignment BAM files
  --peak-caller {fseq,macs2}
                        Name of peak caller
  --trimmer {trimmomatic,pyadapt,skewer}
                        Name of read trimming program
  --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                        Space-delimited list of reference genomes to align to
                        before primary alignment.
  -V, --version         show program's version number and exit

required named arguments:
  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name for sample to run
  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        One or more primary input files
  -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                        Identifier for genome assembly

Modular system

  • Command-line interface with only 3 required arguments.
  • No concept of structuring data inputs for multiple samples.

PEPATAC strengths

Modular system

Prealignments
Flexibility and portability

Outputs



Nuclear-mitochondrial DNA (NuMts) confuse aligners
  • Inaccurate alignment statistics
  • Requires pre-defined NuMt locations
  • Wastes compute power
  • ### Advantages of serial alignments - Accuracy (better rates plus no blacklist needed). - Speed. - Modular reference assemblies.

    PEPATAC strengths

    Modular system

    Prealignments
    Flexibility and portability

    Outputs
    ### Flexibility and Portability - trimmer options: `skewer` and `trimmomatic` - peak caller options: `macs2` and `fseq` ``` ./pepatac.py --trimmer trimmomatic --peak-caller fseq ```

    Flexibility and Portability

  • parameterization via config file pepatac.yaml
  • # basic tools 
    tools:  # absolute paths to required tools
      java: java
      python: python
      samtools: samtools
      bedtools: bedtools
      bowtie2: bowtie2
      fastqc: fastqc
      macs2: macs2
      picard: ${PICARD}
      skewer: skewer
      perl: perl
      # ucsc tools
      bedGraphToBigWig: bedGraphToBigWig
      wigToBigWig: wigToBigWig
      bigWigCat: bigWigCat
      bedSort: bedSort
      bedToBigBed: bedToBigBed
      # optional tools
      fseq: fseq  
      trimmo: ${TRIMMOMATIC}
      Rscript: Rscript 
    
    # user configure 
    resources:
      genomes: ${GENOMES}
      adapters: null  # Set to null to use default adapters
    
    parameters:  # parameters passed to bioinformatic tools
      samtools:
        q: 10
      macs2: 
        f: BED
        q: 0.01
        shift: 0
      fseq:
        of: npf    # narrowPeak as output format
        l: 600     # feature length
        t: 4.0     # "threshold" (standard deviations)
        s: 1       # wiggle track step
    ### Flexibility and Portability - Run `pepatac` in a container using either `docker` or `singularity`. ``` git clone github.com/databio/pepatac docker pull databio/pepatac docker run --rm -it databio/pepatac pipelines/pepatac.py ```

    PEPATAC strengths

    Modular system

    Prealignments
    Flexibility and portability

    Outputs

    Output


    http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html

    A full-service reference genome manager. Stolarczyk et al. (2020). GigaScience.
    Stolarczyk, Xue, and Sheffield (2021). bioRxiv.

    The problem

    Many tools require genome-related assets (like indexes).
    How should we organize these on disk?
    ## A standard organization simplifies tool interface Flexible paths must be passed individually: ``` pipeline.py --bowtie2-index path/to/hg38/bowtie2-index \ --tss_annotation path/to/hg38/tss_annotation.bed \ --ensembl_anno path/to/hg38/ensembl_v86.gtf ``` A standard establishes expectations: ``` pipeline.py --genome hg38 ```
    ## Illumina's [iGenomes](https://support.illumina.com/sequencing/sequencing_software/igenome.html) is one answer iGenomes is *a collection of reference sequences and annotation files for commonly analyzed organisms*. You download a tarball of a standard structure for your genome of interest, then write tools off that.
    ## The 'central repository' approach is limited - *Not scripted.* No iGenomes for an arbitrary genome/asset. - *Not modular*. No access to individual assets. - *Not programmatic*. Can't access data/metadata via API.
    ## Refgenie solves these limitations - *Two ways to retrieve an asset.* - `build` any asset from a recipe. - `pull` any individual asset from a server - *Better discoverability and modularity*. - `list/listr` shows assets - `refgenieserver` is a browseable web interface and API - *Managed locations*. - `seek` returns the local path to assets - `add/remove` to manage your own assets

    Refgenie consists of 3 components

    Refgenie splits tasks between CLI and server

    ## Refgenie CLI example [http://refgenie.databio.org](http://refgenie.databio.org)

    The build/pull method needs provenance checks

    Asset provenance:

    Genome provenance:

    Refget

    Refget enables access to reference sequences
    using an identifier derived from the sequence itself.

    How refget works

    ## Refgenie implements (collection) refget-like You first have to have the `fasta` asset: ``` refgenie pull -g hg38 -a fasta ``` Then you can use `getseq`: ``` refgenie getseq -g hg38 -l chr1:50000-50400 AAACAGGTTAATCGCCACGACATAGTAGTATTTAGAGTTACTAGTAAGCCTGATGCCACTACACAATTCTAGCTTTTCTCTTTAGGATGATTGTTTCATTCAGTCTTATCTCTTTTAGAAAACATAGGAAAAAATTATTTAATAATAAAATTTAATTGGCAAAATGAAGGTATGGCTTATAAGAGTGTTTTCCTATTGTTTTCAGTGTAGGACTCACTGTTCTAAATAACTGGGACACCCAAGGATTCTGTAAAATGCCATCCAGTTATCATTTATATTCCCTAACTCAAAATTCATTCACATGTATTCATTTTTTTCTAAACAAATTAGCATGTAGAATTCTGGTTAAAATTTGGCATAGAACACCCGGGTATTTTTTCATAATGCACCCAATAACTGT ```

    Refget v2.0: Collections for genome provenance

    Recursive checksums have advantages

    Allows getting content list only

    Preserves chromosome order

    Re-uses the checksum function

    Duplicates are stored only once

    Go one step further for...

    It keeps going... and going...

    Asset provenance:


    Recipes + containers?

    Genome provenance:


    Solved by refget v2.0?

    Tying human identifiers to a digest:

    
    hg38:
      refget_digest: 32a37a52a377d95bfd4b3d66763e1396a4480f34ab5c318a
    

    Pepkit

    A structure and toolkit for organizing large-scale,
    sample-intensive biological research projects
    Sheffield et al. (2021). bioRxiv.

    Research is organized in projects

    How do we conceptualize a research project?

    Each project has 3 components


    Organizing multiple projects is a challenge


    How do I re-use a component?


    A project is a set of edges in a tripartite graph



    Enable linking with interfaces



    We are building a modular ecosystem


    pepkit · geofetch · looper · caravel · pypiper · divvy

    PEP: Portable Encapsulated Projects

    PEP format

    project_config.yaml
    metadata:
      sample_annotation: /path/to/samples.csv
      output_dir: /path/to/output/folder
    

    samples.csv
    sample_name, protocol, organism, data_source
    frog_0h, RNA-seq, frog, /path/to/frog0.gz
    frog_1h, RNA-seq, frog, /path/to/frog1.gz
    frog_2h, RNA-seq, frog, /path/to/frog2.gz
    frog_3h, RNA-seq, frog, /path/to/frog3.gz
    

    PEP portability features

    Derived attributes
    Implied attributes
    Subprojects
    Derived attributes
    Automatically build new sample attributes from existing attributes.
    Without derived attribute:
    | sample_name | t | protocol | organism | data_source | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz | | frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz | | frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz | | frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
    Using derived attribute:
    | sample_name | t | protocol | organism | data_source | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | my_samples | | frog_1h | 1 | RNA-seq | frog | my_samples | | frog_2h | 2 | RNA-seq | frog | my_samples | | frog_3h | 3 | RNA-seq | frog | my_samples | | crab_0h | 0 | RNA-seq | crab | your_samples | | crab_3h | 3 | RNA-seq | crab | your_samples |
    | sample_name | t | protocol | organism | data_source | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | my_samples | | frog_1h | 1 | RNA-seq | frog | my_samples | | frog_2h | 2 | RNA-seq | frog | my_samples | | frog_3h | 3 | RNA-seq | frog | my_samples | | crab_0h | 0 | RNA-seq | crab | your_samples | | crab_3h | 3 | RNA-seq | crab | your_samples |
    Project config file:
    derived_columns: [data_source]
    data_sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
    {variable} identifies sample annotation columns
    Benefit: Enables distributed files, portability
    Implied attributes
    Add new sample attributes conditioned on values of existing attributes
    Before:
    | sample_name | protocol | organism | | ------------- | :-------------: | -------- | | human_1 | RNA-seq | human | | human_2 | RNA-seq | human | | human_3 | RNA-seq | human | | mouse_1 | RNA-seq | mouse |
    After:
    | sample_name | protocol | organism | genome | | ------------- | :-------------: | -------- | ------ | | human_1 | RNA-seq | human | hg38 | | human_2 | RNA-seq | human | hg38 | | human_3 | RNA-seq | human | hg38 | | mouse_1 | RNA-seq | mouse | mm10 |
    | sample_name | protocol | organism | | ------------- | :-------------: | -------- | | human_1 | RNA-seq | human | | human_2 | RNA-seq | human | | human_3 | RNA-seq | human | | mouse_1 | RNA-seq | mouse |
    Project config file:
    implied_columns:
      organism:
        human:
          genome: hg38
        mouse:
          genome: mm10
    Benefit: Divides project from sample metadata
    Subprojects
    Define activatable project attributes.
    subprojects:
      diverse:
        metadata:
          sample_annotation: psa_rrbs_diverse.csv
      cancer:
        metadata:
          sample_annotation: psa_rrbs_intracancer.csv
    Benefit: Defines multiple similar projects in a single file
    How is this portable and encapsulated?

    Encapsulated:
    A project is an extensible object, with samples and settings as attributes.
    Portable:
    1. A project can be moved from one analysis tool to another
    2. A project can be moved from one computing environment to another


    A multi-container environment manager.
    Sheffield (2019). OSF Preprints.

    Reproducibility

    data + code
    + environment

    Containers

    A promising solution, but how should we use them?
    Combined
    Individual
     
    easy to deploy
    easy to use
    reusable
    combinable
    subsetable
    space efficient
    Combined
    Individual
    Bulker

    How bulker does it

    Two conceptual advances:

    Containerized executables
    Distribute containers in sets

    Thank You

    Collaborators
    Vince Reuter
    Andre Rendeiro
    Levi Waldron

    Recent Alumni
    Aaron Gu
    Jianglin Feng
    Ognen Duzlevski
    Sheffield lab
    Erfaneh Gharavi
    Michal Stolarczyk
    John Lawson
    Jason Smith
    Kristyna Kupkova
    John Stubbs
    Bingjie Xue
    Jose Verdezoto
    Tessa Danehy
    Funding:



    NIGMS R35-GM128636

    nsheff · databio.org · nsheffield@virginia.edu