Bioinformatics data management and epigenome analysis methods

Nathan Sheffield, PhD
www.databio.org/slides

outline

Motivation and background
Epigenome analysis
|
|

20%
50%
30%
|

Data management
◁ Questions ▷

Most pipelines require individual metadata organization

What if?

Why is this hard to do?
Because of microwave syndrome....

Microwave syndrome

In user interface design, prioritizing easy access to integrated functions over their individual components.

The UNIX philosophy

[T]he power of a system comes more from the relationships among programs than from the programs themselves.

Many UNIX programs do quite trivial tasks in isolation, but, combined with other programs, become general and useful tools.

- Kernighan and Pike, The UNIX Programming Environment (1983, p. viii)

Problem

Solution

PEP structure

A modern structure for organizing large-scale,
sample-intensive biological research projects

PEP: Portable Encapsulated Projects

Research is organized in projects

PEP format

project_config.yaml
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder

samples.csv
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz

PEP portability features

Derived attributes
Implied attributes
Subprojects
Derived attributes
Automatically build new sample attributes from existing attributes.
Without derived attribute:
| sample_name | t | protocol | organism | data_source | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | /path/to/frog0.gz | | frog_1h | 1 | RNA-seq | frog | /path/to/frog1.gz | | frog_2h | 2 | RNA-seq | frog | /path/to/frog2.gz | | frog_3h | 3 | RNA-seq | frog | /path/to/frog3.gz |
Using derived attribute:
| sample_name | t | protocol | organism | data_source | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | my_samples | | frog_1h | 1 | RNA-seq | frog | my_samples | | frog_2h | 2 | RNA-seq | frog | my_samples | | frog_3h | 3 | RNA-seq | frog | my_samples | | crab_0h | 0 | RNA-seq | crab | your_samples | | crab_3h | 3 | RNA-seq | crab | your_samples |
| sample_name | t | protocol | organism | data_source | | ------------- | ---- | :-------------: | -------- | ---------------------- | | frog_0h | 0 | RNA-seq | frog | my_samples | | frog_1h | 1 | RNA-seq | frog | my_samples | | frog_2h | 2 | RNA-seq | frog | my_samples | | frog_3h | 3 | RNA-seq | frog | my_samples | | crab_0h | 0 | RNA-seq | crab | your_samples | | crab_3h | 3 | RNA-seq | crab | your_samples |
Project config file:
derived_columns: [data_source]
data_sources:
  my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
  your_samples: "/path/to/your/samples/{organism}_{t}h.gz"
{variable} identifies sample annotation columns
Benefit: Enables distributed files, portability
Implied attributes
Add new sample attributes conditioned on values of existing attributes
Before:
| sample_name | protocol | organism | | ------------- | :-------------: | -------- | | human_1 | RNA-seq | human | | human_2 | RNA-seq | human | | human_3 | RNA-seq | human | | mouse_1 | RNA-seq | mouse |
After:
| sample_name | protocol | organism | genome | | ------------- | :-------------: | -------- | ------ | | human_1 | RNA-seq | human | hg38 | | human_2 | RNA-seq | human | hg38 | | human_3 | RNA-seq | human | hg38 | | mouse_1 | RNA-seq | mouse | mm10 |
| sample_name | protocol | organism | | ------------- | :-------------: | -------- | | human_1 | RNA-seq | human | | human_2 | RNA-seq | human | | human_3 | RNA-seq | human | | mouse_1 | RNA-seq | mouse |
Project config file:
implied_columns:
  organism:
    human:
      genome: hg38
    mouse:
      genome: mm10
Benefit: Divides project from sample metadata
Subprojects
Define activatable project attributes.
subprojects:
  diverse:
    metadata:
      sample_annotation: psa_rrbs_diverse.csv
  cancer:
    metadata:
      sample_annotation: psa_rrbs_intracancer.csv
Benefit: Defines multiple similar projects in a single file
How is this portable and encapsulated?

Encapsulated:
A project is an extensible object, with samples and settings as attributes.
Portable:
  1. A project can be moved from one analysis tool to another
  2. A project can be moved from one computing environment to another

Looper

Deploys pipelines across samples by connecting
samples to any command-line tool
pipeline_interface.yaml
protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2
  • maps protocols to pipelines
  • maps sample attributes (columns) to pipeline arguments
  • Looper features

    Single-input runs
    Flexible pipelines
    Flexible resources
    Flexible compute
    Job status-aware
    Single-input runs
    Run your entire project with one line:
    looper run project_config.yaml
    Flexible pipelines
    protocol_mappings:
      RRBS: rrbs
      WGBS: wgbs
      EG: wgbs.py
      SMART-seq: rnaBitSeq -f; rnaTopHat -f
      ATAC-SEQ: atacseq
      DNase-seq: atacseq
      CHIP-SEQ: chipseq
    Many-to-many mappings
    Flexible resources
    pipeline_key:
      name: pipeline_name
      arguments:
        "--option" : value
      resources:
        default:
          file_size: "0"
          cores: "2"
          mem: "6000"
          time: "01:00:00"
        large_input:
          file_size: "2000"
          cores: "4"
          mem: "12000"
          time: "08:00:00"
    Resources can vary by input file size
    Flexible compute
    compute:
      slurm:
        submission_template: templates/slurm_template.sub
        submission_command: sbatch
      localhost:
        submission_template: templates/localhost_template.sub
        submission_command: sh

    Adjust compute package on-the-fly:
    > looper run project_config.yaml --compute localhost
    Job status-aware
    Looper only submits jobs for samples not already flagged as running, completed, or failed.
    looper check project_config.yaml
    looper summarize project_config.yaml

    A robust ATAC-seq pipeline
    built on the PEP toolkit

    Comparison

    Prealignments

    Nuclear-mitochondrial DNA (NuMts) confuse aligners
  • Inaccurate alignment statistics
  • Requires pre-defined NuMt locations
  • Wastes compute power
  • ### Advantages of serial alignments - Accuracy (better rates plus no blacklist needed). - Speed. - Modular reference assemblies.

    Output


    http://code.databio.org/PEPATAC/files/examples/gold/summary.html

    Questions

    Epigenome analysis methods

    LOLA
    Locus Overlap Analysis
    MIRA
    Methylation-based Inference of Regulatory Activity

    Locus Overlap Analysis

    Sheffield and Bock (2016). Bioinformatics.
    Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

    A shiny app and server for interactive LOLA analysis.
    Public server: http://lolaweb.databio.org
    GitHub: https://github.com/databio/LOLAweb

    DEMO

    Methylation-based Inference of Regulatory Activity (MIRA)

    Lawson et al. (2018). Bioinformatics.

    DNA methylation

    DNA methylation

    Bisulfite-seq

    Region pooling


    Sheffield et al. (2017). Nature Medicine.

    Thank You


    Sheffield lab
    John Lawson
    Vince Reuter
    Ognen Duzlevski
    Jason Smith
    Jianglin Feng
    Michal Stolarczyk
    Aaron Gu
    Anant Tewari

    Christoph Bock
    Andre Rendeiro
    Johanna Klughammer

    Howard Chang
    Ryan Corces
    Yuning Wei
    Jin Xu
    Funding:




    nsheff · databio.org · nsheffield@virginia.edu
    Postdoc positions available!