Technical Field
This disclosure relates generally to next-generation sequencing (NGS) technologies and, in particular, to technologies for storing, transmitting, and processing genomic data.
Background of the Related Art
Advances in next-generation sequencing (NGS) technologies have produced a deluge of genomic information, outpacing even increases in our computational resources. In addition to facilitating the widespread use of NGS data throughout biotechnology, this avalanche of data enables novel, large-scale population studies (e.g., maps of human genetic variation, reconstruction of human population history, and uncovering cell lineage relationships). To fully capitalize on these advances, however, better technologies to store, transmit, and process genomic data need to be developed.
The bulk of NGS data typically consists of read sequences in which each base call is associated with a corresponding quality score; the quality scores consume at least as much storage space as the base calls themselves. Quality scores are widely used, and often essential, for assessing sequence quality, filtering low-quality reads, assembling genomic sequences, mapping reads to a reference sequence, and performing accurate genotyping. Storing and handling quality scores is a major bottleneck in any sequence analysis pipeline, impacting genomic medicine, environmental genomics, and the ability to find signatures of selection within large sets of closely related sequenced individuals.
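The pairing of base calls and quality scores can be illustrated with a minimal sketch. It assumes the reads are stored in the widely used FASTQ format with Phred+33 quality encoding, which this disclosure does not itself specify; the record contents and the function names are hypothetical and chosen only for illustration.

```python
# Hypothetical four-line FASTQ record: header, bases, separator, qualities.
# Assumption: Phred+33 encoding (each quality is one ASCII character).
record = "@read1\nGATTACA\n+\nIIIIH#5\n"

def parse_fastq_record(text):
    """Split one four-line FASTQ record into (name, bases, quality string)."""
    header, bases, _separator, quals = text.strip().split("\n")
    return header[1:], bases, quals

def phred33_to_scores(quals):
    """Decode Phred+33 ASCII characters into integer quality scores."""
    return [ord(ch) - 33 for ch in quals]

def error_probability(q):
    """Convert a Phred score Q to an error probability, p = 10^(-Q/10)."""
    return 10 ** (-q / 10)

name, bases, quals = parse_fastq_record(record)
scores = phred33_to_scores(quals)

# One quality character is stored per base call, so the uncompressed
# quality string occupies as many bytes as the base calls themselves.
print(len(bases), len(quals))
print(scores)
```

Because the quality alphabet is larger and less regular than the four-letter base alphabet, the quality string is generally the harder of the two to compress.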
At the expense of downstream analysis, biomedical researchers have typically discarded quality scores or turned to compression, which has been moderately successful when applied to genomic sequence data. Quality score compression is usually lossy, meaning that maximum compression is achieved at the expense of the ability to reconstruct the original quality values. Because of the resulting decline in downstream accuracy, such methods are unsuitable for both transmission and indefinite storage of quality scores. To address these limitations, several recently developed methods exploit sequence data to boost quality score compression, either by using alignments to a reference genome or by operating on raw read datasets without reference alignment. Such solutions have not proven satisfactory. In particular, reference-based compression requires runtime-costly whole-genome alignments of the NGS dataset, while alignment-free compression applies costly indexing methods directly to the read dataset.
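What "lossy" means for quality scores can be shown with a short, generic sketch. The uniform binning below is an assumption made purely for illustration; it is not the method of this disclosure, nor any of the reference-based or alignment-free methods mentioned above, and the function name and bin width are hypothetical.

```python
def quantize(scores, bin_width=10):
    """Map each score to the lower edge of its bin (a lossy transform)."""
    return [(q // bin_width) * bin_width for q in scores]

original = [40, 40, 39, 2, 20, 37, 12]
binned = quantize(original)

# Scores 39 and 37 both become 30, so the original values can no longer
# be reconstructed from the binned data.
print(binned)

# Fewer distinct symbols remain, which is what makes the binned stream
# easier to compress than the original.
print(len(set(original)), len(set(binned)))
```

Any downstream step that relies on the difference between the collapsed values, such as genotyping near a decision threshold, sees less information after this kind of transform.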
There remains a need for an efficient and scalable method that handles very large (e.g., terabyte-sized) NGS datasets and that addresses the degradation of downstream genotyping accuracy that results from lossy compression.