Managing project metadata with a standard project format


Nathan Sheffield, PhD

Most pipelines require their own metadata organization

What if?

PEP: A standard format for project metadata

PEP format

project_config.yaml
metadata:
  sample_annotation: /path/to/samples.csv
  output_dir: /path/to/output/folder

samples.csv
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
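
As a rough illustration of how a tool might read the sample annotation sheet above, the sketch below parses the same table with Python's standard csv module. This is not how peppy is implemented: the inline string stands in for samples.csv, and load_samples is a hypothetical helper name.

```python
import csv
import io

# Inline stand-in for samples.csv (same rows as the slide).
SAMPLES_CSV = """\
sample_name, protocol, organism, data_source
frog_0h, RNA-seq, frog, /path/to/frog0.gz
frog_1h, RNA-seq, frog, /path/to/frog1.gz
frog_2h, RNA-seq, frog, /path/to/frog2.gz
frog_3h, RNA-seq, frog, /path/to/frog3.gz
"""


def load_samples(handle):
    """Return one dict per sample row, keyed by column name (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in the sheet.
    reader = csv.DictReader(handle, skipinitialspace=True)
    return [dict(row) for row in reader]


samples = load_samples(io.StringIO(SAMPLES_CSV))
for sample in samples:
    print(sample["sample_name"], sample["protocol"], sample["data_source"])
```

Each row becomes one record whose keys are the column headers, which is the shape the Python and R packages expose through their sample objects.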

$ python geofetch.py -i GSE502503
peppy package
import peppy

# Project lives in the peppy namespace, so it must be qualified
prj = peppy.Project("pep_config.yaml")
samples = prj.get_samples()

for sample in samples:
    print(sample.name)
    # do further analysis on each sample
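
The "further analysis" step often begins by deciding which pipeline handles each sample. A minimal sketch of that idea, assuming samples are plain dicts rather than peppy objects. The protocol-to-pipeline mapping, the script name rnaseq_pipeline.py, and the --input and --name flags are all made up for illustration and do not correspond to a real pepkit pipeline.

```python
# Sample records as plain dicts (stand-ins for peppy sample objects).
samples = [
    {"sample_name": "frog_0h", "protocol": "RNA-seq", "data_source": "/path/to/frog0.gz"},
    {"sample_name": "frog_1h", "protocol": "RNA-seq", "data_source": "/path/to/frog1.gz"},
    {"sample_name": "frog_2h", "protocol": "RNA-seq", "data_source": "/path/to/frog2.gz"},
    {"sample_name": "frog_3h", "protocol": "RNA-seq", "data_source": "/path/to/frog3.gz"},
]

# Hypothetical mapping from protocol to pipeline script; the name is invented.
PIPELINES = {"RNA-seq": "rnaseq_pipeline.py"}


def build_commands(samples, pipelines):
    """Build one command string per sample whose protocol has a pipeline."""
    commands = []
    for sample in samples:
        script = pipelines.get(sample["protocol"])
        if script is None:
            continue  # no pipeline registered for this protocol
        commands.append(
            f"python {script} --input {sample['data_source']} "
            f"--name {sample['sample_name']}"
        )
    return commands


commands = build_commands(samples, PIPELINES)
for command in commands:
    print(command)
```

Because the metadata is in one standard shape, the same loop works for any project; only the mapping changes.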
Project API
pepr package
library("pepr")

prj = pepr::Project("pep_config.yaml")
samples = pepr::pepSamples(prj)

for (sample in samples) {
	message(pepr::sampleName(sample))
	# do further analysis on each sample
}

Conclusion

  • The PEP format is a new approach to standardizing project metadata.
  • Initial tools like geofetch and looper build PEP projects and connect them to pipelines.
  • Python and R packages provide a universal interface to PEP metadata for tools and analysis.
More information at pepkit.github.io.
Slides at databio.org.