BiocProject: a Bioconductor-oriented project management package

Nathan Sheffield, PhD

www.databio.org/slides

Most workflows require individual metadata organization

What if?

The solution

PEP: A standard format for project metadata

PEP format

project_config.yaml

```
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder
```

samples.csv

```
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
```
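Because the sample table is plain CSV, any tool can read it. As a minimal sketch in base R (not the pepr or BiocProject API), the table from this slide is inlined as text so the example runs without files; the variable names `csv_text` and `samples` are illustrative only.

```r
# Sample table from the slide, inlined so the sketch is self-contained.
csv_text <- "sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz"

# strip.white = TRUE drops the spaces that follow each comma.
samples <- read.csv(text = csv_text, strip.white = TRUE,
                    stringsAsFactors = FALSE)

# One row per sample, one column per attribute.
print(nrow(samples))
print(samples$sample_name)
```

Each row describes one sample and each column one attribute, so per-sample inputs such as `data_source` can be looked up by sample name.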

#### BiocProject integrates PEP into Bioconductor

It provides:

- automated data loading
- functions for interacting with project metadata
- PEP-annotated Bioconductor data objects

#### Install

```
devtools::install_github(repo='pepkit/pepr')
devtools::install_github(repo='pepkit/BiocProject')
```

#### Load an example PEP with a `bioconductor` section

Here's a demo included with the package:

```
metadata:
  sample_table: sample_table.csv
bioconductor:
  readFunName: readBedFiles
  readFunPath: readBedFiles.R
```

`readFunName` is the name of an R function that reads in the data for your PEP.
`readFunPath` is the path to an R file that contains that function.
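To make the two keys concrete, here is a hedged sketch of how a loader could resolve a function from a file path and a function name using base R only. It is an illustration, not BiocProject's actual implementation; `readDemo`, `cfg`, and the temporary file are hypothetical stand-ins for `readBedFiles`, the `bioconductor` section, and `readBedFiles.R`.

```r
# Write a tiny reader function to a temporary R file
# (a stand-in for readBedFiles.R).
funPath <- tempfile(fileext = ".R")
writeLines("readDemo = function(project) { paste('read', project) }", funPath)

# Hypothetical config values mirroring the bioconductor section.
cfg <- list(readFunName = "readDemo", readFunPath = funPath)

# Source the file into a fresh environment, then fetch the function by name.
env <- new.env()
source(cfg$readFunPath, local = env)
readFun <- get(cfg$readFunName, envir = env, mode = "function")

# Call the resolved function on a placeholder project.
result <- readFun("myPEP")
print(result)
```

Sourcing into a separate environment keeps the user-supplied function from overwriting objects in the global workspace.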

#### sample_table

sample_name | file_path
------------- | --------------
laminB1Lads | data/laminB1Lads.bed
vistaEnhancers | data/vistaEnhancers.bed

#### readBedFiles.R

```
readBedFiles = function (project) {
  cwd = getwd()
  paths = pepr::samples(project)$file_path
  sampleNames = pepr::samples(project)$sample_name
  # File paths are relative to the config file, so read from its directory
  setwd(dirname(project@file))
  result = lapply(paths, function(x) {
    df = read.table(x)
    colnames(df) = c("chr", "start", "end")
    gr = GenomicRanges::GRanges(df)
  })
  # Restore the original working directory
  setwd(cwd)
  names(result) = sampleNames
  return(GenomicRanges::GRangesList(result))
}
```
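The read-function pattern (loop over sample paths, read each file, name the results by sample) can be tried without Bioconductor installed. The sketch below is a simplified, hypothetical variant: `readBedTables` takes a plain data.frame instead of a pepr `Project`, returns a list of data.frames instead of a `GRangesList`, and the toy BED files are created in a temporary directory.

```r
# Hypothetical stand-in for a PEP read function: takes a plain data.frame
# with sample_name and file_path columns instead of a pepr Project.
readBedTables <- function(sampleTable, baseDir) {
  # Resolve relative paths against baseDir instead of changing directory.
  paths <- file.path(baseDir, sampleTable$file_path)
  result <- lapply(paths, function(x) {
    read.table(x, col.names = c("chr", "start", "end"))
  })
  names(result) <- sampleTable$sample_name
  result
}

# Toy data written to a temporary directory for the demo.
baseDir <- tempfile("pep_demo_")
dir.create(baseDir)
writeLines(c("chr1\t100\t200", "chr2\t300\t400"), file.path(baseDir, "a.bed"))
writeLines("chrX\t10\t20", file.path(baseDir, "b.bed"))
sampleTable <- data.frame(sample_name = c("a", "b"),
                          file_path = c("a.bed", "b.bed"),
                          stringsAsFactors = FALSE)

beds <- readBedTables(sampleTable, baseDir)
str(beds)
```

Building full paths with `file.path` avoids calling `setwd`, so the working directory is left untouched even if one of the reads fails.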

```
library(BiocProject)
configFile = system.file(
  "extdata", "example_peps-master", "example_BiocProject",
  "project_config.yaml", package = "BiocProject"
)
bp = BiocProject(file=configFile)
#> Loaded config file: .../example_BiocProject/project_config.yaml
#> The 'bioconductor' key found in the Project config
#> Used function 'readBedFiles' from the environment
```

```
bp = BiocProject(file=configFile)
bp
#> GRangesList object of length 2:
#> $laminB1Lads
#> GRanges object with 1302 ranges and 0 metadata columns:
#>          seqnames              ranges strand
#>      [1]     chr1   11401198-11694590      *
#>      [2]     chr1   14877629-15246452      *
#>      [3]     chr1   18229570-19207602      *
#>      [4]     chr1   29618442-31162049      *
#>      [5]     chr1   33943885-35623392      *
#>      ...      ...                 ...    ...
#>   [1298]     chrX 154066672-154251301      *
#>   [1299]     chrY     2880166-7112793      *
#>   [1300]     chrY   15047033-15333970      *
#>   [1301]     chrY   15603977-16627892      *
#>   [1302]     chrY   16966225-21013116      *
#> 
#> ...
#> <1 more element>
#> -------
#> seqinfo: 24 sequences from an unspecified genome; no seqlengths
#> 
#> metadata: PEP project object. Class:  Project
#>   file: example_BiocProject/project_config.yaml
#>   samples:  2
```

```
samples(bp)
#>       sample_name               file_path
#> 1:    laminB1Lads    data/laminB1Lads.bed
#> 2: vistaEnhancers data/vistaEnhancers.bed
```

```
config(bp)
#> Config object. Class: Config
#>  metadata:
#>    sample_table: example_BiocProject/sample_table.csv
#>  bioconductor:
#>    readFunName: readBedFiles
#>    readFunPath: example_BiocProject/readBedFiles.R
#>  name: example_BiocProject
```

BiocProject in action

Zero to hero in 3 lines of code

1. Create a PEP:

```
geofetch -i GSE129383 -P /pepatac/pipeline_interface.yaml
...
Finished processing 1 accessions
Creating complete project annotation sheets and config file...
  Sample annotation sheet:${SRAMETA}/GSE129383/GSE129383_annotation.csv
  Sample subannotation sheet:${SRAMETA}/GSE129383/GSE129383_subannotation.csv
  Config file: ${SRAMETA}/GSE129383/GSE129383_config.yaml
```

This downloads raw data and creates your PEP.

2. Run the PEPATAC ATAC-seq pipeline:


```
looper run GSE129383_config.yaml --sp sra_convert
looper run GSE129383_config.yaml
```

This runs PEPATAC on your newly-created PEP.

3. Load processed data into R with BiocProject:

```
bp = BiocProject::BiocProject("GSE129383_config.yaml")
bp
```

This loads your BED files into R.

Conclusion

PEP is a language-agnostic standard for representing projects.
BiocProject loads the metadata and data for a PEP into R.
Add geofetch, looper, and PEPATAC to go from raw data through analysis.

More information at pepkit.github.io; code.databio.org/BiocProject.

Thank You

Sheffield lab
Ognen Duzlevski
Jianglin Feng
Aaron Gu
Kristyna Kupkova
John Lawson
Vince Reuter
Jason Smith
Michal Stolarczyk

Bioconductor
Levi Waldron
Sean Davis

Funding:

NIGMS 1R35GM128636

nsheff · databio.org · nsheffield@virginia.edu