This specification relates to biological sequence data processing.
A sequencing machine can generate sequence data derived from multiple types of biological molecules, including, for example, ribonucleic acid (RNA) and deoxyribonucleic acid (DNA). The biological sequence data is often referred to as reads. A single sequencing run can create between thousands and billions of reads. The reads can be mapped to a reference genome (e.g., DNA reads to the reference genome) and stored in sequence alignment/map (SAM) files, in binary sequence alignment/map (BAM) files, or in any other file format that records the genomic coordinates to which each read has been mapped, or an indication that the read is unmapped, together with additional details, e.g., sequence quality, mate-pair information, or both. Such files frequently reach a size of tens of gigabytes each.
Utility programs specialized in processing sequence data, e.g., the Genome Analysis Toolkit (GATK®) or SAMtools®, can be used to analyze the SAM or BAM files to identify various patterns in the reads. During processing, these utility programs can sort and index the sequence data, extract particular information from the sequence data, and convert data formats. An individual utility program can execute on a stand-alone computer to perform a processing task. A task of identifying a specific kind of pattern may require sorting, indexing, or converting the data in multiple ways. Even though a specialized utility program can be multi-threaded, each task can take one or more hours because of the amount of data to be processed. In addition, each utility program can easily have a memory footprint of several gigabytes.
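By way of illustration only, the kind of record held in a SAM file, and the coordinate sort that a utility program might perform on such records, can be sketched as follows. This is a minimal sketch and not the implementation of any particular utility program. The names `Alignment`, `parse_sam_line`, `read_alignments`, and `coordinate_sort` are hypothetical, the sample reads are invented for the example, and references are ordered by name here, which is a simplification of what an actual tool may do.

```python
from typing import Iterable, List, NamedTuple

UNMAPPED_FLAG = 0x4  # SAM FLAG bit indicating that the read is unmapped


class Alignment(NamedTuple):
    """Subset of the mandatory tab-separated fields of one SAM record."""
    read_name: str   # QNAME
    flag: int        # FLAG (bitwise flags)
    reference: str   # RNAME, "*" when unavailable
    position: int    # POS, 1-based leftmost coordinate, 0 when unavailable
    mapq: int        # MAPQ
    sequence: str    # SEQ
    quality: str     # QUAL

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & UNMAPPED_FLAG)


def parse_sam_line(line: str) -> Alignment:
    """Parse one alignment line; a SAM record has 11 mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return Alignment(fields[0], int(fields[1]), fields[2], int(fields[3]),
                     int(fields[4]), fields[9], fields[10])


def read_alignments(text: str) -> List[Alignment]:
    """Parse SAM text, skipping header lines (which start with '@')."""
    return [parse_sam_line(line) for line in text.splitlines()
            if line and not line.startswith("@")]


def coordinate_sort(records: Iterable[Alignment]) -> List[Alignment]:
    """Order mapped reads by (reference, position); unmapped reads go last.

    Assumption: references are compared by name for simplicity.
    """
    return sorted(records,
                  key=lambda r: (r.is_unmapped, r.reference, r.position))


# Invented sample data: three mapped reads and one unmapped read.
SAM_TEXT = "\n".join([
    "@HD\tVN:1.6\tSO:unsorted",
    "read1\t0\tchr2\t150\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII",
    "read3\t16\tchr1\t900\t60\t4M\t*\t0\t0\tGGCA\tIIII",
    "read4\t0\tchr1\t100\t60\t4M\t*\t0\t0\tCCAT\tIIII",
])

if __name__ == "__main__":
    for record in coordinate_sort(read_alignments(SAM_TEXT)):
        print(record.read_name, record.reference, record.position)
```

The sketch holds every record in memory at once. For files of tens of gigabytes that approach does not scale, which is consistent with the long run times and multi-gigabyte memory footprints noted above.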