Introduction to Sequence Collections

Nathan Sheffield, PhD
www.databio.org/slides

Unique identifiers and API for sequence collections. Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.

The problem: Who is the authoritative provider of the reference genome?

  • NCBI
  • UCSC
  • Ensembl
Variation includes:
  • hard, soft, or no repeat masking?
  • are alternative scaffolds included?
  • are haplotypes included?
  • how are chromosomes named (chr1, 1, or NC_000001.11)?
  • how is the assembly named (hg38, GRCh38, or GCF_000001405.39)?
  • are any decoy sequences included (like EBV)?
Andy Yates' "Genome provider analysis"
Provider          Chr1 name     Chr1 length  Chr1 md5                          Num chroms
Ensembl primary   1             248956422    2648ae1bacce4ec4b6cf337dcae37816  195
Ensembl toplevel  1             248956422    2648ae1bacce4ec4b6cf337dcae37816  649
NCBI              NC_000001.11  248956422    6aef897c3d6ff0c78aff06ac189178dd  640
UCSC              chr1          248956422    2648ae1bacce4ec4b6cf337dcae37816  456
https://gist.github.com/andrewyatz/692f81baab1bebaf09c481937f2ad6c6
Subtle differences in reference assembly lead to:
  1. Lack of reproducibility of analysis
  2. Lack of reusability of results

Solution


Refget -> Sequence collections

Refget

Refget enables access to reference sequences
using an identifier derived from the sequence itself.
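
As a minimal sketch of what "derived from the sequence itself" means, the snippet below computes two content-based digests with Python's standard library: an MD5 hex digest (32 characters, like the md5 column in the provider table) and a SHA-512 digest truncated to 24 bytes (48 hex characters). The function names and the normalization step (strip whitespace, uppercase) are assumptions made for illustration; consult the refget specification for the exact rules.

```python
import hashlib


def normalize(sequence):
    """Remove whitespace and uppercase, so formatting does not change the digest (assumed rule)."""
    return "".join(sequence.split()).upper()


def md5_digest(sequence):
    """Hex MD5 of the normalized sequence (32 hex characters)."""
    return hashlib.md5(normalize(sequence).encode("ascii")).hexdigest()


def trunc512_digest(sequence):
    """Hex of the first 24 bytes of SHA-512 of the normalized sequence (48 hex characters)."""
    return hashlib.sha512(normalize(sequence).encode("ascii")).digest()[:24].hex()


# The same bases give the same identifier regardless of case or line breaks.
example_md5 = md5_digest("ACGT")
example_trunc512 = trunc512_digest("acgt\n")
```

Because the identifier depends only on the bases, two providers serving the same sequence under different names would arrive at the same digest.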

How refget works

Limitations

  • only handles a single sequence
  • excludes chromosome names

Extending to sequence collections

We need:
  1. An algorithm to create a deterministic, unique digest from a collection of sequences
  2. A server capable of retrieving sequences given an identifier
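
The two requirements can be illustrated with a toy sketch. Everything here is an assumption for illustration: `toy_digest` (a truncated SHA-512 over a canonical JSON serialization) is a stand-in and not the refgenie or seqcol algorithm, and `ToyStore` is a hypothetical in-memory substitute for a real server.

```python
import hashlib
import json


def toy_digest(collection):
    """Stand-in digest: hash a canonical JSON serialization of the collection."""
    # sort_keys and fixed separators make the serialized string deterministic.
    canonical = json.dumps(collection, sort_keys=True, separators=(",", ":"))
    return hashlib.sha512(canonical.encode("utf-8")).digest()[:24].hex()


class ToyStore:
    """Hypothetical in-memory lookup keyed by digest."""

    def __init__(self):
        self._records = {}

    def add(self, collection):
        """Store a collection and return its identifier."""
        digest = toy_digest(collection)
        self._records[digest] = collection
        return digest

    def get(self, digest):
        """Retrieve the collection stored under an identifier."""
        return self._records[digest]


store = ToyStore()
collection = {"names": ["chr1", "chr2"], "sequences": ["ACGT", "GGCC"]}
identifier = store.add(collection)
retrieved = store.get(identifier)
```

The point of the sketch is the round trip: anyone holding the same collection computes the same identifier, and anyone holding the identifier can ask the store for the collection.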

First pass: Refgenie approach

Stolarczyk, Xue, and Sheffield (2021). NAR Genomics and Bioinformatics.
refgenomes.databio.org

Limitations and discussion

  • Should we include sequence topology in the digest?
  • What other attributes could we include?
  • Are there better delimiters?
  • How do we construct the 'string-to-digest'?
  • How do we handle order of sequences?

Current proposal: Array-based protocol

seqcol = { 'names': ['chrUn_KI270742v1', 'chrUn_GL000216v2', 'chrUn_GL000218v1'],
           'lengths': ['186739', '176608', '161147'],
           'sequences': ['2f31c013a4a8301deb8ab7ed1ca1cd99',
                         '725009a7e3f5b78752b68afa922c090c',
                         '1d708b54644c26c7e01c2dad5426d38c'] }

seqcol = { 'sequences': '8dd93796fa0225e92eb159a8779f1b254776557f748f8bfb',
           'lengths':   '501fd98e2fdcc276c47306bd72c9155489ed2b23123ddfa2',
           'names':     '7bc90a07cf25f2f64f33baee3d420ad1ae5f442055280d43'}
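
The relationship between the two representations above can be sketched as follows. This is a hedged illustration, not the specified algorithm: the comma delimiter, the truncated SHA-512 hash, and the helper names `digest_string`, `level1`, and `level0` are all assumptions, so the digests it produces are not expected to match the values shown on the slide.

```python
import hashlib

DELIMITER = ","  # assumed single delimiter; the real choice is defined by the spec


def digest_string(text):
    """Hex of the first 24 bytes of SHA-512 (48 hex characters); an assumed hash choice."""
    return hashlib.sha512(text.encode("utf-8")).digest()[:24].hex()


def level1(seqcol):
    """Digest each attribute array on its own, giving one digest per attribute."""
    return {attr: digest_string(DELIMITER.join(values)) for attr, values in seqcol.items()}


def level0(seqcol):
    """Digest the attribute digests, taken in sorted attribute order, into one identifier."""
    digests = level1(seqcol)
    return digest_string(DELIMITER.join(digests[attr] for attr in sorted(digests)))


seqcol = {
    "names": ["chrUn_KI270742v1", "chrUn_GL000216v2", "chrUn_GL000218v1"],
    "lengths": ["186739", "176608", "161147"],
    "sequences": [
        "2f31c013a4a8301deb8ab7ed1ca1cd99",
        "725009a7e3f5b78752b68afa922c090c",
        "1d708b54644c26c7e01c2dad5426d38c",
    ],
}
attribute_digests = level1(seqcol)  # same shape as the second dict on the slide
top_digest = level0(seqcol)         # a single identifier for the whole collection
```

Digesting each array separately is what makes the extra layer of recursion possible: two collections with identical sequences but different names share a `sequences` digest even though their top-level identifiers differ.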

Advantages

  • Accommodates new attributes with backwards-compatibility
  • Additional layer of recursion to assess individual attributes
  • Requires only a single delimiter

How should we refer to reference genomes in practice?

refgenie alias get

Comparison function

Provider          Chr1 name     Chr1 length  Chr1 md5                          Num chroms
Ensembl primary   1             248956422    2648ae1bacce4ec4b6cf337dcae37816  195
Ensembl toplevel  1             248956422    2648ae1bacce4ec4b6cf337dcae37816  649
NCBI              NC_000001.11  248956422    6aef897c3d6ff0c78aff06ac189178dd  640
UCSC              chr1          248956422    2648ae1bacce4ec4b6cf337dcae37816  456
  • seqcol 1: 047c6e1eda552b50c5add59ff0995
  • seqcol 2: 2230c535660fb4774114bfa966a62

How compatible are they?

API Endpoint: POST /compare; GET /compare
Return value:
  
{
  "lengths": {
    "any-elements-shared": true,
    "all-a-in-b": true,
    "all-b-in-a": true,
    "order-match": true
  },
  "names": {
    "any-elements-shared": true,
    "all-a-in-b": true,
    "all-b-in-a": true,
    "order-match": true
  },
  "sequences": {
    "any-elements-shared": false,
    "all-a-in-b": false,
    "all-b-in-a": false,
    "order-match": false
  }
}
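
One way the four flags could be computed for each attribute is sketched below. The function names and the example collections are hypothetical, and the definition of "order-match" (at least one shared element, and the shared elements appear in the same relative order in both arrays) is an assumption; the server's actual semantics may differ.

```python
def compare_arrays(a, b):
    """Compare two attribute arrays and return the four compatibility flags."""
    set_a, set_b = set(a), set(b)
    shared = set_a & set_b
    return {
        "any-elements-shared": bool(shared),
        "all-a-in-b": set_a <= set_b,
        "all-b-in-a": set_b <= set_a,
        # Assumed definition: shared elements exist and keep the same relative order.
        "order-match": bool(shared)
        and [x for x in a if x in shared] == [x for x in b if x in shared],
    }


def compare(seqcol_a, seqcol_b):
    """Apply compare_arrays to every attribute present in both collections."""
    common = sorted(set(seqcol_a) & set(seqcol_b))
    return {attr: compare_arrays(seqcol_a[attr], seqcol_b[attr]) for attr in common}


# Placeholder collections: same names and lengths, different sequence digests.
seqcol_a = {"names": ["chr1", "chr2"], "lengths": ["100", "80"], "sequences": ["aaa", "bbb"]}
seqcol_b = {"names": ["chr1", "chr2"], "lengths": ["100", "80"], "sequences": ["ccc", "ddd"]}
result = compare(seqcol_a, seqcol_b)
```

For these placeholder inputs the result follows the same pattern as the return value above: every flag is true for `lengths` and `names`, and every flag is false for `sequences`.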

Seqcol API demonstration

http://seqcolapi.databio.org

Conclusions

  • Refget provides universal IDs for individual sequences
  • Sequence collections extend this to whole reference genomes
  • A deterministic algorithm computes the identifier from the collection itself
  • A lookup service can retrieve the original sequences from an identifier
  • A comparison function allows fine-grained compatibility tests
  • Please follow along: https://github.com/ga4gh/seqcol-spec

Thank You


nsheff · databio.org · nsheffield@virginia.edu