Bioinformatics data management and epigenome analysis methods

Nathan Sheffield, PhD

www.databio.org/slides

outline

Motivation and background

Epigenome analysis

20%

50%

30%

Data management

◁ Questions ▷

Most pipelines require individual metadata organization

What if?

Why is this hard to do?
Because of microwave syndrome...

Microwave syndrome

In user interface design, prioritizing easy access to integrated functions over their individual components.

The UNIX philosophy

[T]he power of a system comes more from the relationships among programs than from the programs themselves.

Many UNIX programs do quite trivial tasks in isolation, but, combined with other programs, become general and useful tools.

- Kernighan and Pike, The UNIX Programming Environment (1983, p. viii)

Problem

Solution

PEP: Portable Encapsulated Projects

PEP format

Start with a simple CSV with tabular data.

samples.csv

sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
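
As a rough illustration of why a plain CSV is a convenient starting point, the sketch below reads the sample table into one dictionary per sample using only the Python standard library. The function name `load_samples` is made up for this example and is not part of any PEP tool; the CSV text is copied inline so the snippet runs without files.

```python
import csv
import io

# Inline copy of the sample table from the slide, so no file is needed.
SAMPLES_CSV = """sample_name,protocol,organism,input_file
frog_0h,RNA-seq,frog,/path/to/frog0.gz
frog_1h,RNA-seq,frog,/path/to/frog1.gz
frog_2h,RNA-seq,frog,/path/to/frog2.gz
frog_3h,RNA-seq,frog,/path/to/frog3.gz
"""


def load_samples(text):
    """Hypothetical helper: return one dict per row, keyed by the CSV header."""
    return list(csv.DictReader(io.StringIO(text)))


samples = load_samples(SAMPLES_CSV)
for sample in samples:
    print(sample["sample_name"], sample["input_file"])
```

Each row becomes a sample, and each column becomes a sample attribute that later steps can refer to by name.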

PEP format

Add a YAML for project-level data.

project_config.yaml

sample_table: /path/to/samples.csv
output_dir: /path/to/output/folder
other_variable: value

Add programmatic sample and project modifiers.

Derived attributes

Implied attributes

Subprojects

Derived attributes

Automatically build new sample attributes from existing attributes.

Without derived attribute:



| sample_name   | t    | protocol        | organism | input_file             |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | /path/to/frog0.gz      |
| frog_1h       | 1    | RNA-seq         | frog     | /path/to/frog1.gz      |
| frog_2h       | 2    | RNA-seq         | frog     | /path/to/frog2.gz      |
| frog_3h       | 3    | RNA-seq         | frog     | /path/to/frog3.gz      |

Using derived attribute:



| sample_name   | t    | protocol        | organism | input_file             |
| ------------- | ---- | :-------------: | -------- | ---------------------- |
| frog_0h       | 0    | RNA-seq         | frog     | my_samples             |
| frog_1h       | 1    | RNA-seq         | frog     | my_samples             |
| frog_2h       | 2    | RNA-seq         | frog     | my_samples             |
| frog_3h       | 3    | RNA-seq         | frog     | my_samples             |
| crab_0h       | 0    | RNA-seq         | crab     | your_samples           |
| crab_3h       | 3    | RNA-seq         | crab     | your_samples           |



Project config file:

sample_modifiers:
  derive:
    attributes: [input_file]
    sources:
      my_samples: "/path/to/my/samples/{organism}_{t}h.gz"
      your_samples: "/path/to/your/samples/{organism}_{t}h.gz"

{variable} identifies sample annotation columns

Benefit: Enables distributed files, portability
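
A minimal sketch of how derivation could work, assuming each source is an ordinary template whose `{variable}` fields are filled from the sample's own attributes. The function `derive` is written for this example only and is not the actual PEP implementation.

```python
def derive(samples, attributes, sources):
    """Illustrative only: swap source keys for paths built from sample attributes."""
    derived = []
    for sample in samples:
        new = dict(sample)  # leave the input rows untouched
        for attr in attributes:
            key = new.get(attr)
            if key in sources:
                # {organism} and {t} are looked up in the sample itself
                new[attr] = sources[key].format(**sample)
        derived.append(new)
    return derived


samples = [
    {"sample_name": "frog_0h", "t": 0, "organism": "frog", "input_file": "my_samples"},
    {"sample_name": "crab_3h", "t": 3, "organism": "crab", "input_file": "your_samples"},
]
sources = {
    "my_samples": "/path/to/my/samples/{organism}_{t}h.gz",
    "your_samples": "/path/to/your/samples/{organism}_{t}h.gz",
}

result = derive(samples, ["input_file"], sources)
for sample in result:
    print(sample["sample_name"], sample["input_file"])
```

Moving the data to a new location then only requires changing the template in the project config, not every row of the sample table.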

Implied attributes

Add new sample attributes conditioned on values of existing attributes

Before:



| sample_name   | protocol        | organism | 
| ------------- | :-------------: | -------- | 
| human_1       | RNA-seq         | human    | 
| human_2       | RNA-seq         | human    | 
| human_3       | RNA-seq         | human    | 
| mouse_1       | RNA-seq         | mouse    |

After:



| sample_name   | protocol        | organism | genome | 
| ------------- | :-------------: | -------- | ------ |
| human_1       | RNA-seq         | human    | hg38   |
| human_2       | RNA-seq         | human    | hg38   |
| human_3       | RNA-seq         | human    | hg38   |
| mouse_1       | RNA-seq         | mouse    | mm10   |



Project config file:

sample_modifiers:
  imply:
    - if: 
        organism: human
      then:
        genome: hg38
    - if:
        organism: mouse
      then:
        genome: mm10

Benefit: Divides project from sample metadata
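
The rule list can be read as a series of if/then checks per sample. The sketch below is a simplified reading that assumes every `if` condition is an exact match on a single value; the function `imply` is written for illustration and is not the actual PEP implementation.

```python
def imply(samples, rules):
    """Illustrative only: add the 'then' attributes when all 'if' attributes match."""
    result = []
    for sample in samples:
        new = dict(sample)
        for rule in rules:
            if all(new.get(key) == value for key, value in rule["if"].items()):
                new.update(rule["then"])
        result.append(new)
    return result


rules = [
    {"if": {"organism": "human"}, "then": {"genome": "hg38"}},
    {"if": {"organism": "mouse"}, "then": {"genome": "mm10"}},
]
samples = [
    {"sample_name": "human_1", "protocol": "RNA-seq", "organism": "human"},
    {"sample_name": "mouse_1", "protocol": "RNA-seq", "organism": "mouse"},
]

implied = imply(samples, rules)
for sample in implied:
    print(sample["sample_name"], sample["genome"])
```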

Subprojects

Define activatable project attributes.

project_modifiers:
  amend:
    diverse:
      sample_table: psa_rrbs_diverse.csv
    cancer:
      sample_table: psa_rrbs_intracancer.csv

Benefit: Defines multiple similar projects in a single file
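
One way to picture activation is as an overlay: the keys of the chosen subproject replace the matching keys of the base project, and everything else is inherited. The sketch below assumes that behavior; the function `apply_amendment` and the dictionary layout are written for this example only.

```python
import copy


def apply_amendment(base, amendment):
    """Illustrative only: overlay amendment keys on a copy of the base config."""
    merged = copy.deepcopy(base)
    for key, value in amendment.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_amendment(merged[key], value)  # merge nested sections
        else:
            merged[key] = copy.deepcopy(value)
    return merged


base = {
    "sample_table": "/path/to/samples.csv",
    "output_dir": "/path/to/output/folder",
}
amendments = {
    "diverse": {"sample_table": "psa_rrbs_diverse.csv"},
    "cancer": {"sample_table": "psa_rrbs_intracancer.csv"},
}

active = apply_amendment(base, amendments["cancer"])
print(active["sample_table"], active["output_dir"])
```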

Looper

Deploys pipelines across samples by connecting
samples to any command-line tool

https://looper.databio.org

pipeline_interface.yaml

protocol_mappings:
  RNA-seq: rna-seq 

pipelines:
  rna-seq:
    name: RNA-seq_pipeline
    path: path/to/rna-seq.py
    arguments:
      "--option1": sample_attribute
      "--option2": sample_attribute2

maps protocols to pipelines

maps sample attributes (columns) to pipeline arguments
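
The two mappings can be combined into a command line per sample. The sketch below is not looper's code; it assumes the interface has already been parsed into a dictionary, and the flags `--input` and `--organism` are hypothetical stand-ins for `--option1` and `--option2`.

```python
import shlex


def build_command(sample, interface):
    """Illustrative only: protocol -> pipeline, then attributes -> arguments."""
    key = interface["protocol_mappings"][sample["protocol"]]
    pipeline = interface["pipelines"][key]
    parts = [pipeline["path"]]
    for flag, attribute in pipeline["arguments"].items():
        parts += [flag, str(sample[attribute])]  # value comes from the sample's column
    return shlex.join(parts)


interface = {
    "protocol_mappings": {"RNA-seq": "rna-seq"},
    "pipelines": {
        "rna-seq": {
            "name": "RNA-seq_pipeline",
            "path": "path/to/rna-seq.py",
            # hypothetical flags mapped to sample attributes
            "arguments": {"--input": "input_file", "--organism": "organism"},
        }
    },
}
sample = {
    "sample_name": "frog_0h",
    "protocol": "RNA-seq",
    "organism": "frog",
    "input_file": "/path/to/frog0.gz",
}

command = build_command(sample, interface)
print(command)
```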

Looper features

Single-input runs

Flexible pipelines

Flexible resources

Flexible compute

Job status-aware

Single-input runs
Run your entire project with one line:

looper run project_config.yaml

Flexible pipelines

protocol_mappings:
  RRBS: rrbs
  WGBS: wgbs
  EG: wgbs.py
  SMART-seq: rnaBitSeq -f; rnaTopHat -f
  ATAC-SEQ: atacseq
  DNase-seq: atacseq
  CHIP-SEQ: chipseq

Many-to-many mappings
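
A small sketch of what many-to-many means here, assuming (from the `SMART-seq` entry) that a semicolon separates several pipelines for one protocol. Both helper functions are hypothetical and only illustrate the lookup in each direction.

```python
MAPPINGS = {
    "RRBS": "rrbs",
    "WGBS": "wgbs",
    "SMART-seq": "rnaBitSeq -f; rnaTopHat -f",
    "ATAC-SEQ": "atacseq",
    "DNase-seq": "atacseq",
    "CHIP-SEQ": "chipseq",
}


def pipelines_for(protocol, mappings):
    """One protocol can fan out to several pipelines (split on ';')."""
    entry = mappings.get(protocol, "")
    return [part.strip() for part in entry.split(";") if part.strip()]


def protocols_for(pipeline, mappings):
    """Several protocols can share one pipeline."""
    return [proto for proto in mappings if pipeline in pipelines_for(proto, mappings)]


print(pipelines_for("SMART-seq", MAPPINGS))
print(protocols_for("atacseq", MAPPINGS))
```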

Flexible resources

pipeline_key:
  name: pipeline_name
  arguments:
    "--option" : value
  resources:
    default:
      file_size: "0"
      cores: "2"
      mem: "6000"
      time: "01:00:00"
    large_input:
      file_size: "2000"
      cores: "4"
      mem: "12000"
      time: "08:00:00"

Resources can vary by input file size
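
One plausible selection rule, sketched below, is to pick the package with the largest `file_size` threshold that the input still meets. This is an assumption about the behavior rather than looper's actual code, and the input size is taken to be in the same units as `file_size`.

```python
RESOURCES = {
    "default": {"file_size": "0", "cores": "2", "mem": "6000", "time": "01:00:00"},
    "large_input": {"file_size": "2000", "cores": "4", "mem": "12000", "time": "08:00:00"},
}


def select_resources(packages, input_size):
    """Illustrative only: highest threshold that does not exceed the input size."""
    eligible = [p for p in packages.values() if float(p["file_size"]) <= input_size]
    return max(eligible, key=lambda p: float(p["file_size"]))


small = select_resources(RESOURCES, 150)
large = select_resources(RESOURCES, 2500)
print(small["cores"], small["time"])
print(large["cores"], large["time"])
```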

Flexible compute

compute:
  slurm:
    submission_template: templates/slurm_template.sub
    submission_command: sbatch
  localhost:
    submission_template: templates/localhost_template.sub
    submission_command: sh

Adjust compute package on-the-fly:

> looper run project_config.yaml --compute localhost
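
The idea behind a compute package can be sketched as a template plus a submission command. The template text and the placeholder names (`{JOBNAME}`, `{CORES}`, `{MEM}`, `{TIME}`, `{CODE}`) below are hypothetical and written for this example; the real template files may differ.

```python
# Hypothetical templates standing in for the .sub files named in the config.
SLURM_TEMPLATE = """#!/bin/bash
#SBATCH --job-name={JOBNAME}
#SBATCH --cpus-per-task={CORES}
#SBATCH --mem={MEM}
#SBATCH --time={TIME}

{CODE}
"""
LOCAL_TEMPLATE = "#!/bin/bash\n\n{CODE}\n"

COMPUTE = {
    "slurm": {"template": SLURM_TEMPLATE, "submission_command": "sbatch"},
    "localhost": {"template": LOCAL_TEMPLATE, "submission_command": "sh"},
}


def render_job(package, variables, script_path):
    """Illustrative only: fill the template and build the submit command."""
    settings = COMPUTE[package]
    script = settings["template"].format(**variables)
    submit = f'{settings["submission_command"]} {script_path}'
    return script, submit


variables = {
    "JOBNAME": "frog_0h",
    "CORES": "2",
    "MEM": "6000",
    "TIME": "01:00:00",
    "CODE": "path/to/rna-seq.py --option1 value",
}
slurm_script, slurm_submit = render_job("slurm", variables, "frog_0h.sub")
local_script, local_submit = render_job("localhost", variables, "frog_0h.sub")
print(slurm_submit)
print(local_submit)
```

The same job description is reused; only the wrapper and the command that launches it change.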

Job status-aware
Looper only submits jobs for samples not already flagged as running, completed, or failed.

looper check project_config.yaml

looper summarize project_config.yaml
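
A minimal sketch of status-aware submission, assuming each job records its state as a flag file in the results folder. The file naming used here (`<sample>_<status>.flag`) is hypothetical and chosen only for this example.

```python
import tempfile
from pathlib import Path

BLOCKING_STATUSES = ("running", "completed", "failed")


def samples_to_submit(sample_names, results_dir):
    """Illustrative only: keep samples that have no blocking flag file."""
    todo = []
    for name in sample_names:
        flagged = any(
            (Path(results_dir) / f"{name}_{status}.flag").exists()
            for status in BLOCKING_STATUSES
        )
        if not flagged:
            todo.append(name)
    return todo


with tempfile.TemporaryDirectory() as results_dir:
    # Pretend two samples were handled by an earlier run.
    (Path(results_dir) / "frog_0h_completed.flag").touch()
    (Path(results_dir) / "frog_1h_running.flag").touch()
    todo = samples_to_submit(["frog_0h", "frog_1h", "frog_2h", "frog_3h"], results_dir)

print(todo)
```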

A robust ATAC-seq pipeline
built on the PEP toolkit

http://code.databio.org/PEPATAC

Comparison

Prealignments

Nuclear mitochondrial DNA segments (NuMts) confuse aligners

Inaccurate alignment statistics

Requires pre-defined NuMt locations

Wastes compute power

Advantages of serial alignments

- Accuracy (better rates plus no blacklist needed)
- Speed
- Modular reference assemblies
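
The serial idea can be shown with a toy model in which "alignment" is just an exact lookup in a set of sequences. This is a deliberate simplification for illustration, not how PEPATAC or any real aligner works; the reference names and sequences are made up.

```python
def serial_align(reads, references):
    """Toy model: each reference only sees reads left unaligned by earlier ones."""
    assigned = {}
    remaining = list(reads)
    for name, sequences in references:
        assigned[name] = [read for read in remaining if read in sequences]
        remaining = [read for read in remaining if read not in sequences]
    return assigned, remaining


# "ACGT" occurs in both references, like a mitochondrial segment copied into
# the nuclear genome. The prealignment step claims it first.
references = [
    ("chrM", {"ACGT", "TTGA"}),    # prealignment reference
    ("genome", {"ACGT", "GGCC"}),  # primary reference
]
reads = ["ACGT", "GGCC", "TTGA", "NNNN"]

assigned, unaligned = serial_align(reads, references)
print(assigned)
print(unaligned)
```

Because mitochondrial reads are removed before the primary alignment, the nuclear copy never has to compete for them, and the primary aligner processes fewer reads.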

Output

http://code.databio.org/PEPATAC/files/examples/gold/summary.html

Questions

Epigenome analysis methods

LOLA
Locus Overlap Analysis

MIRA
Methylation-based Inference of Regulatory Activity

Locus Overlap Analysis

http://code.databio.org/LOLA/

Sheffield and Bock (2016). Bioinformatics.

Nagraj, Magee, and Sheffield (2018). Nucleic Acids Research.

A shiny app and server for interactive LOLA analysis.
Public server: http://lolaweb.databio.org
GitHub: https://github.com/databio/LOLAweb

DEMO

Methylation-based Inference of Regulatory Activity (MIRA)

http://code.databio.org/MIRA/

Lawson et al. (2018). Bioinformatics.

DNA methylation

Bisulfite-seq

Region pooling

Sheffield et al. (2017). Nature Medicine.

Thank You

Sheffield lab
John Lawson
Vince Reuter
Ognen Duzlevski
Jason Smith
Jianglin Feng
Michal Stolarczyk
Aaron Gu
Anant Tewari

Christoph Bock
Andre Rendeiro
Johanna Klughammer

Howard Chang
Ryan Corces
Yuning Wei
Jin Xu

Funding:

nsheff · databio.org · nsheffield@virginia.edu