ATAC-seq pipeline processing

Nathan Sheffield, PhD
www.databio.org/slides

PEPATAC strengths

Modular system

Prealignments
Flexibility and portability

Outputs
Command-line interface with only 3 required arguments
$ /pipelines/pepatac.py -h

usage: pepatac.py [-h] [-R] [-N] [-D] [-F] [-C CONFIG_FILE]
                  [-O PARENT_OUTPUT_FOLDER] [-M MEMORY_LIMIT]
                  [-P NUMBER_OF_CORES] -S SAMPLE_NAME -I INPUT_FILES
                  [INPUT_FILES ...] [-I2 [INPUT_FILES2 [INPUT_FILES2 ...]]] -G
                  GENOME_ASSEMBLY [-Q SINGLE_OR_PAIRED] [-gs GENOME_SIZE]
                  [--frip-ref-peaks FRIP_REF_PEAKS] [--TSS-name TSS_NAME]
                  [--anno-name ANNO_NAME] [--keep]
                  [--peak-caller {fseq,macs2}]
                  [--trimmer {trimmomatic,skewer}]
                  [--prealignments PREALIGNMENTS [PREALIGNMENTS ...]] [-V]

PEPATAC version 0.7.3

optional arguments:
  -h, --help            show this help message and exit
  -R, --recover         Overwrite locks to recover from previous failed run
  -N, --new-start       Overwrite all results to start a fresh run
  -D, --dirty           Don't auto-delete intermediate files
  -F, --force-follow    Always run 'follow' commands
  -C CONFIG_FILE, --config CONFIG_FILE
                        Pipeline configuration file (YAML). Relative paths are
                        with respect to the pipeline script.
  -O PARENT_OUTPUT_FOLDER, --output-parent PARENT_OUTPUT_FOLDER
                        Parent output directory of project
  -M MEMORY_LIMIT, --mem MEMORY_LIMIT
                        Memory limit (in Mb) for processes accepting such
  -P NUMBER_OF_CORES, --cores NUMBER_OF_CORES
                        Number of cores for parallelized processes
  -I2 [INPUT_FILES2 [INPUT_FILES2 ...]], --input2 [INPUT_FILES2 [INPUT_FILES2 ...]]
                        Secondary input files, such as read2
  -Q SINGLE_OR_PAIRED, --single-or-paired SINGLE_OR_PAIRED
                        Single- or paired-end sequencing protocol
  -gs GENOME_SIZE, --genome-size GENOME_SIZE
                        genome size for MACS2
  --frip-ref-peaks FRIP_REF_PEAKS
                        Reference peak set for calculating FRiP
  --TSS-name TSS_NAME   Name of TSS annotation file
  --anno-name ANNO_NAME
                        Name of reference bed file for calculating FRiF
  --keep                Keep prealignment BAM files
  --peak-caller {fseq,macs2}
                        Name of peak caller
  --trimmer {trimmomatic,pyadapt,skewer}
                        Name of read trimming program
  --prealignments PREALIGNMENTS [PREALIGNMENTS ...]
                        Space-delimited list of reference genomes to align to
                        before primary alignment.
  -V, --version         show program's version number and exit

required named arguments:
  -S SAMPLE_NAME, --sample-name SAMPLE_NAME
                        Name for sample to run
  -I INPUT_FILES [INPUT_FILES ...], --input INPUT_FILES [INPUT_FILES ...]
                        One or more primary input files
  -G GENOME_ASSEMBLY, --genome GENOME_ASSEMBLY
                        Identifier for genome assembly
Portable Encapsulated Projects (PEP) provide interoperability
Portable Encapsulated Projects (PEP) provide interoperability

PEP specification for sample metadata

1. Configuration file: config.yaml
pep_version: 2.0.0
sample_table: "path/to/sample_table.csv"
2. Tabular sample annotation table: sample_table.csv:
"sample_name", "protocol", "file"
"frog_1", "ATAC-seq", "frog1.fq.gz"
"frog_2", "ATAC-seq", "frog2.fq.gz"
"frog_3", "ATAC-seq", "frog3.fq.gz"
"frog_4", "ATAC-seq", "frog4.fq.gz"
pep.databio.org

MapReduce or Scatter/Gather

1. Map/Scatter PEPATAC across individual samples
looper run config.yaml
2. Gather results and do cross-sample analysis
looper runp config.yaml

PEPATAC strengths

Modular system

Prealignments
Flexibility and portability

Outputs



Nuclear-mitochondrial DNA (NuMts) confuse aligners
Problems with region masking
  • Inaccurate alignment statistics
  • Requires pre-defined NuMt locations
  • Wastes compute power
  • ### Advantages of serial alignments - Accuracy (better rates plus no blacklist needed). - Speed. - Modular reference assemblies.

    PEPATAC strengths

    Modular system

    Prealignments
    Flexibility and portability

    Outputs
    ### Flexibility and Portability - trimmer options: `skewer` and `trimmomatic` - peak caller options: `macs2` and `fseq` - aligner options: `bowtie2` and `bwa` ``` ./pepatac.py --trimmer trimmomatic --peak-caller fseq ```

    Flexibility and Portability

  • parameterization via config file pepatac.yaml
  • # basic tools 
    tools:  # absolute paths to required tools
      java: java
      python: python
      samtools: samtools
      bedtools: bedtools
      bowtie2: bowtie2
      fastqc: fastqc
      macs2: macs2
      picard: ${PICARD}
      skewer: skewer
      perl: perl
      # ucsc tools
      bedGraphToBigWig: bedGraphToBigWig
      wigToBigWig: wigToBigWig
      bigWigCat: bigWigCat
      bedSort: bedSort
      bedToBigBed: bedToBigBed
      # optional tools
      fseq: fseq  
      trimmo: ${TRIMMOMATIC}
      Rscript: Rscript 
    
    # user configure 
    resources:
      genomes: ${GENOMES}
      adapters: null  # Set to null to use default adapters
    
    parameters:  # parameters passed to bioinformatic tools
      samtools:
        q: 10
      macs2: 
        f: BED
        q: 0.01
        shift: 0
      fseq:
        of: npf    # narrowPeak as output format
        l: 600     # feature length
        t: 4.0     # "threshold" (standard deviations)
        s: 1       # wiggle track step
    ### Flexibility and Portability Running options: - natively - conda - containers using `docker` or `singularity`. - use bulker to manage containers for your (http://bulker.io) ``` git clone github.com/databio/pepatac docker pull databio/pepatac docker run --rm -it databio/pepatac pipelines/pepatac.py ```

    PEPATAC strengths

    Modular system

    Prealignments
    Flexibility and portability

    Outputs

    Output


    http://pepatac.databio.org/en/latest/files/examples/gold/gold_summary.html

    PEPATAC in practice

  • O'Connor et al. (2021). bioRxiv. DOI: 10.1101/2021.07.15.452570
  • Ram-Mohan et al. (2021). Life Science Alliance. DOI: 10.26508/lsa.202000976
  • Robertson et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00880-5
  • Cheung et al. (2021). DOI: 10.1038/s41590-021-00928-y
  • Hasegawa et al. (2021). bioRxiv. DOI: 10.1101/2021.04.28.441728
  • Weber et al. (2021). Science. DOI: 10.1126/science.aba1786
  • Tovar et al. (2021). bioRxiv. DOI: 10.1101/2021.01.29.428733
  • Granja et al. (2021). Nature Genetics. DOI: 10.1038/s41588-021-00790-6
  • Fan et al. (2020). Cell Reports. DOI: 10.1016/j.celrep.2020.108473
  • Smith and Sheffield (2020). Current Protocols in Human Genetics. DOI: 10.1002/cphg.101
  • Liu (2020). DOI: 10.18632/oncotarget.27584
  • Zhou et al. (2020). bioRxiv. DOI: 10.1101/2020.05.16.099325
  • Cai et al. (2020). DOI: 10.1186/s12920-020-0695-0
  • Li et al. (2020). DOI: 10.1038/s41419-020-2303-9
  • Liang et al. (2019). DOI: 10.1002/1873-3468.13549
  • Corces et al. (2018). Science. DOI: 10.1126/science.aav1898
  • Conclusion

    If you're doing ATAC-seq analysis

    Try pepatac!

    code.databio.org/PEPATAC
    If you're developing pipelines

    Try looper!

    looper.readthedocs.io
    Everyone else

    Eat chicken nuggets!

    Thank You


    Sheffield lab
    John Lawson
    Vince Reuter
    Jason Smith
    Jianglin Feng
    Michal Stolarczyk
    Aaron Gu

    Christoph Bock
    Andre Rendeiro
    Johanna Klughammer

    Howard Chang
    Ryan Corces
    Yuning Wei
    Jin Xu
    Funding:




    nsheff · databio.org · nsheffield@virginia.edu

    Parallelism Philosophy


    by process
    by sample
    by dependence

    Very easy
    Easy
    Hard