If you want to do parallel processing on big genomics data in python, you have to have a way to feed each process just a subset of the data. If you just load a bunch of data (say, a BED file) into memory and then try to split your process out to multiple processors, you will duplicate all that data for every process – whoops! Can’t do that for a big file.
One convenient way to solve this is to use a file format that can be indexed, enabling indexed access so you can instruct each process to just pull the data into memory that it needs; in this way, you only load each data point into memory once. Instead of passing data directly to the process, you pass a key so the process knows what data to pull from the indexed file. Then you could also even distribute this across machines if you wanted.
This is the way BAM files work; using pysam, you can pull out just a subset of reads from an (aligned) BAM file. So, to run a python process in parallel across a BAM file, you should map across chromosome strings, and then read the reads in from inside the subprocess; don’t load the BAM file and then farm out chunks of it to different processes.
But what if your data aren’t in a BAM file?
You can do this with TABIX files, also by Heng Li. Here’s how to create them.
You need a bed-like (TSV) file with columns for (at least) chromosome and start position:
chr1 10468 annotation1 chr1 10469 annotation2 chr1 10470 annotation3
Now compress this file with
And index the file with tabix (this creates a tbi index file):
tabix -s 1 -b 2 -e 2 file.tsv.gz
Now you can pull stuff out of this file with random access, sweet!
tabix file.tsv.gz.tbi chr5:50000-100000
Files in Tabix format are easy to process efficiently in parallel in python, using [pysam]. Here’s a pseudocode example:
import pysam import multiprocessing chromosome_list = ['chr1', 'chr2'] # Make a list of chromosomes f = pysam.TabixFile(filename) # Create a handle that indexes the file # Now write a function that will take a chromosome name, and pull out all entries on that chromosome def processing_function(chr): reads = f.fetch(chrom, multiple_iterators=True) for read in reads: # Do your thing to each read # Map this function across each chromosome... map(processing_function, chromosome_list) # Reduce results if necessary
And there you go! Parallel processing of a big genomic data file, using efficient compressed disk storage with Tabix, and low memory consumption in python!
You have to use
multiple_iterators=True if you want to read the file in parallel, so that the iterators for different processes don’t step on each others’ toes.