ATAC-seq analysis via pipeline

Nathan Sheffield, PhD
www.databio.org/slides

Outline

Pipelines
ATAC-seq QC
ATAC-seq pipeline
◁ Questions ▷

Analysis spectrum

Advantages: interactive analysis vs pipelines

Interactive
- More universal learning curve
- Direct control
- Quicker to get started
- Easier for simple analysis
Pipeline
- Easier for high volume
- More robust error handling
- Automatic logging
- Restartable
- Built-in monitoring
- More repeatable
- More reproducible

What should I use?


The combination that best fits your project requirements.

ATAC-seq pipelines

There is a growing need for integrated pipelines to process ATAC-seq data. Several have been developed by stitching together the previously discussed tools, but they differ in their focus for downstream analysis (Yan et al. 2020).
  • ENCODE ATAC-seq pipeline
  • PEPATAC
  • esATAC
  • More pipelines

PEPATAC strengths

  • Modular system
  • Prealignments
  • Flexibility and portability
  • Outputs

    Command-line interface with only 3 required arguments
    $ /pipelines/pepatac.py -h
    
    usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
                      [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
                      [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
                      [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
                      GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                      [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                      [--anno-name ANNO_NAME] [--keep]
                      [--peak-caller {fseq,macs2}]
                      [--trimmer {trimmomatic,skewer}]
                      [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]
    
    PEPATAC version 0.7.3
    
    optional arguments:
      -h, --help            show this help message and exit
      -R, --recover         Overwrite locks to recover from previous failed run
      -N, --new-start       Overwrite all results to start a fresh run
      -D, --dirty           Don't auto-delete intermediate files
      -F, --force-follow    Always run 'follow' commands
      -C CONFIG_FILE, --config CONFIG_FILE
                            Pipeline configuration file (YAML). Relative paths are
                            with respect to the pipeline script.
      -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                            Parent output directory of project
      -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                            Memory limit (in Mb) for processes accepting such
      -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                            Number of cores for parallelized processes
      -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                            Secondary input files, such as read2
      -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                            Single- or paired-end sequencing protocol
      -gs GENOME_SIZE, --genome-size GENOME_SIZE
                            genome size for MACS2
      --frip-ref-peaks FRIP_REF_PEAKS
                            Reference peak set for calculating FRiP
      --TSS-name TSS_NAME   Name of TSS annotation file
      --anno-name ANNO_NAME
                            Name of reference bed file for calculating FRiF
      --keep                Keep prealignment BAM files
      --peak-caller {fseq,macs2}
                            Name of peak caller
      --trimmer {trimmomatic,pyadapt,skewer}
                            Name of read trimming program
      --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                            Space-delimited list of reference genomes to align to
                            before primary alignment.
      -V, --version         show program's version number and exit
    
    required named arguments:
      -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                            Name for sample to run
      -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                            One or more primary input files
      -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                            Identifier for genome assembly
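
As a minimal sketch of what "only 3 required arguments" means in practice, the helper below assembles an argument list from a sample name, input files, and a genome identifier, using only flags that appear in the help text above. The function name `build_pepatac_command`, the script path default, and the genome identifier `hg38` are illustrative assumptions, not part of PEPATAC.

```python
import shlex


def build_pepatac_command(sample_name, input_files, genome,
                          input_files2=None, output_parent=None,
                          script="pipelines/pepatac.py"):
    """Return an argv list built only from flags listed in the help text.

    This is a hypothetical helper for illustration; the script path is an
    assumption and depends on where the pipeline is installed.
    """
    if not input_files:
        raise ValueError("at least one primary input file is required")
    # The three required named arguments: -S, -I, -G
    cmd = [script, "-S", sample_name, "-I", *input_files, "-G", genome]
    if input_files2:          # optional secondary input files, such as read2
        cmd += ["-I2", *input_files2]
    if output_parent:         # optional parent output directory
        cmd += ["-O", output_parent]
    return cmd


# "hg38" is a placeholder genome identifier chosen for the example.
cmd = build_pepatac_command("frog_1", ["frog1.fq.gz"], "hg38")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` once the pipeline and its reference assets are installed.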
    
    Portable Encapsulated Projects (PEP) provide interoperability

    PEP specification for sample metadata

    1. Configuration file: config.yaml
    pep_version: 2.0.0
    sample_table: "path/to/sample_table.csv"
    
    2. Tabular sample annotation table: sample_table.csv
    "sample_name", "protocol", "file"
    "frog_1", "ATAC-seq", "frog1.fq.gz"
    "frog_2", "ATAC-seq", "frog2.fq.gz"
    "frog_3", "ATAC-seq", "frog3.fq.gz"
    "frog_4", "ATAC-seq", "frog4.fq.gz"
    
    pep.databio.org
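
To illustrate how a tool might consume the sample table, here is a small sketch that parses it with Python's standard `csv` module. It is not the official PEP tooling; the helper name `read_sample_table` is hypothetical, and the table text is inlined (first two rows only) so the example is self-contained.

```python
import csv
import io

# Sample table as written above; note the space after each comma.
SAMPLE_TABLE = '''"sample_name", "protocol", "file"
"frog_1", "ATAC-seq", "frog1.fq.gz"
"frog_2", "ATAC-seq", "frog2.fq.gz"
'''


def read_sample_table(handle):
    """Return one dict per sample row, keyed by the column headers."""
    # skipinitialspace=True drops the blank after each comma so that the
    # quoted fields are recognized as quoted.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return list(reader)


samples = read_sample_table(io.StringIO(SAMPLE_TABLE))
for sample in samples:
    print(sample["sample_name"], sample["protocol"], sample["file"])
```

In a real project, the path to the table would come from the `sample_table` key of `config.yaml` rather than from an inlined string.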

    MapReduce or Scatter/Gather

    1. Map/Scatter PEPATAC across individual samples
    looper run config.yaml
    2. Gather results and do cross-sample analysis
    looper runp config.yaml
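
The scatter/gather pattern itself can be sketched in a few lines. This is a conceptual illustration only, not how looper is implemented: `process_sample` stands in for one per-sample pipeline run, `gather` stands in for a cross-sample step, and both names and the returned fields are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def process_sample(sample):
    """Scatter step: stand-in for running the pipeline on one sample."""
    return {"sample_name": sample["sample_name"],
            "n_files": len(sample["files"])}


def gather(results):
    """Gather step: combine per-sample results into one project summary."""
    return {"n_samples": len(results),
            "total_files": sum(r["n_files"] for r in results)}


samples = [
    {"sample_name": "frog_1", "files": ["frog1.fq.gz"]},
    {"sample_name": "frog_2", "files": ["frog2.fq.gz"]},
    {"sample_name": "frog_3", "files": ["frog3.fq.gz"]},
]

# Samples are independent, so the scatter step can run in parallel.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(process_sample, samples))

summary = gather(results)
print(summary)
```

Because each sample is processed independently, the scatter step scales to a cluster by submitting one job per sample, while the gather step runs once per project.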


    Nuclear-mitochondrial DNA (NuMts) confuse aligners
    Problems with region masking
  • Inaccurate alignment statistics
  • Requires pre-defined NuMt locations
  • Wastes compute power

    Advantages of serial alignments
  • Accuracy (better rates plus no blacklist needed)
  • Speed
  • Modular reference assemblies
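
The idea of serial alignment can be shown with a toy model: reads are offered to each reference in order, and only reads that fail to align are passed on to the next one. In this sketch a set-membership lookup stands in for a real aligner, and the function name `serial_align`, the reference names, and the read IDs are all hypothetical.

```python
def serial_align(read_ids, references):
    """Assign each read to the first reference it aligns to.

    references is an ordered list of (name, hits) pairs, where hits is the
    set of read IDs that would align to that reference (a stand-in for
    actually running an aligner).
    """
    remaining = list(read_ids)
    assigned = {}
    for name, hits in references:
        assigned[name] = [r for r in remaining if r in hits]
        # Only reads that did not align move on to the next reference.
        remaining = [r for r in remaining if r not in hits]
    return assigned, remaining


reads = ["r1", "r2", "r3", "r4", "r5", "r6"]
references = [
    ("chrM", {"r1", "r2", "r3"}),   # prealignment: mitochondrial reference
    ("hg38", {"r3", "r4", "r5"}),   # primary: r3 also matches a NuMt-like region
]

assigned, unaligned = serial_align(reads, references)
print(assigned)
print(unaligned)
```

Read `r3` could align to either reference, but because the mitochondrial prealignment runs first, it never reaches the primary genome. The primary alignment also sees fewer reads, which is where the speed advantage comes from.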

    Flexibility and Portability

  • trimmer options: `skewer` and `trimmomatic`
  • peak caller options: `macs2` and `fseq`
  • aligner options: `bowtie2` and `bwa`

    ```
    ./pepatac.py --trimmer trimmomatic --peak-caller fseq
    ```

    Flexibility and Portability

  • Parameterization via config file pepatac.yaml

    # basic tools
    tools:  # absolute paths to required tools
      java: java
      python: python
      samtools: samtools
      bedtools: bedtools
      bowtie2: bowtie2
      fastqc: fastqc
      macs2: macs2
      picard: ${PICARD}
      skewer: skewer
      perl: perl
      # ucsc tools
      bedGraphToBigWig: bedGraphToBigWig
      wigToBigWig: wigToBigWig
      bigWigCat: bigWigCat
      bedSort: bedSort
      bedToBigBed: bedToBigBed
      # optional tools
      fseq: fseq  
      trimmo: ${TRIMMOMATIC}
      Rscript: Rscript 
    
    # user configure 
    resources:
      genomes: ${GENOMES}
      adapters: null  # Set to null to use default adapters
    
    parameters:  # parameters passed to bioinformatic tools
      samtools:
        q: 10
      macs2: 
        f: BED
        q: 0.01
        shift: 0
      fseq:
        of: npf    # narrowPeak as output format
        l: 600     # feature length
        t: 4.0     # "threshold" (standard deviations)
        s: 1       # wiggle track step
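
Entries such as `${PICARD}` and `${GENOMES}` refer to environment variables. The sketch below shows one way such references could be resolved after the YAML is loaded into nested dictionaries. Whether PEPATAC resolves them exactly this way is an assumption; the helper name `expand_env` and the paths are hypothetical.

```python
import os


def expand_env(node):
    """Recursively expand ${VAR} references in every string of a config."""
    if isinstance(node, dict):
        return {key: expand_env(value) for key, value in node.items()}
    if isinstance(node, list):
        return [expand_env(item) for item in node]
    if isinstance(node, str):
        return os.path.expandvars(node)
    return node  # numbers, None, and booleans pass through unchanged


# A small excerpt of the config above, as it would look once parsed.
config = {
    "tools": {"samtools": "samtools", "picard": "${PICARD}"},
    "resources": {"genomes": "${GENOMES}", "adapters": None},
    "parameters": {"samtools": {"q": 10}},
}

# Hypothetical locations, set here only so the example is reproducible.
os.environ["PICARD"] = "/opt/picard/picard.jar"
os.environ["GENOMES"] = "/data/genomes"

expanded = expand_env(config)
print(expanded["tools"]["picard"])
print(expanded["resources"]["genomes"])
```

Keeping machine-specific paths in environment variables lets the same config file be reused across laptops, clusters, and containers.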

    Flexibility and Portability

    Running options:
  • natively
  • conda
  • containers using `docker` or `singularity`
  • use bulker to manage containers (http://bulker.io)

    ```
    git clone https://github.com/databio/pepatac
    docker pull databio/pepatac
    docker run --rm -it databio/pepatac pipelines/pepatac.py
    ```

    Output


    http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html

    PEPATAC in practice

  • O'Connor et al. (2021). bioRxiv. DOI: 10.1101/2021.07.15.452570
  • Ram-Mohan et al. (2021). Life Science Alliance. DOI: 10.26508/lsa.202000976
  • Robertson et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00880-5
  • Cheung et al. (2021). DOI: 10.1038/s41590-021-00928-y
  • Hasegawa et al. (2021). bioRxiv. DOI: 10.1101/2021.04.28.441728
  • Weber et al. (2021). Science. DOI: 10.1126/science.aba1786
  • Tovar et al. (2021). bioRxiv. DOI: 10.1101/2021.01.29.428733
  • Granja et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00790-6
  • Fan et al. (2020). Cell Reports. DOI: 10.1016/j.celrep.2020.108473
  • Smith and Sheffield (2020). Current Protocols in Human Genetics. DOI: 10.1002/cphg.101
  • Liu (2020). DOI: 10.18632/oncotarget.27584
  • Zhou et al. (2020). bioRxiv. DOI: 10.1101/2020.05.16.099325
  • Cai et al. (2020). DOI: 10.1186/s12920-020-0695-0
  • Li et al. (2020). DOI: 10.1038/s41419-020-2303-9
  • Liang et al. (2019). DOI: 10.1002/1873-3468.13549
  • Corces et al. (2018). Science. DOI: 10.1126/science.aav1898

    Thank You


    nsheff · databio.org · nsheffield@virginia.edu