Technical Field
This disclosure relates generally to next-generation sequencing (NGS) technologies and, in particular, technologies to store, transmit and process genomic data.
Background of the Related Art
Next-generation sequencing (NGS) read data present challenges in data analysis and storage unrivalled in any other biological problem domain. Massive amounts of NGS reads (50-150 base pair fragments of genomic sequence) are generated every day by sequencing machines, growing annually at a greater exponential rate than computing power; these reads form the bulk of genomic sequences that must be stored and analyzed (e.g., with GATK) to obtain genomic information. Months of computing time are often required to process data for novel, large-scale sequencing studies that enable us to catalog human genetic variation, provide new insights into our evolutionary history, and promise to revolutionize the study of cell lineages in higher organisms. These computational challenges are at present a barrier to widespread use of NGS data throughout biotechnology, which impacts genomic medicine, environmental genomics, and the ability to detect signatures of selection within large sets of closely related read data.
A critical step in most sequence analysis pipelines is aligning, or mapping, the reads onto a reference genome to determine their loci in the genome. Read mapping is the costliest data processing step in sequence analysis pipelines. Existing read mapping methods (e.g., those based on an FM-index or a hash table) require many sequence comparison steps, iterating over each read in order to map it onto reference genomes, which is expensive even when references are stored and organized efficiently. Thus, the time requirements of these known methods scale linearly with the size of the full read dataset, and these methods require exponentially more runtime each year to process the exponentially growing read data.
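By way of illustration only, the per-read cost of a hash-table-based mapper can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of any particular alignment tool: the function names (`build_kmer_index`, `map_read`, `map_all_reads`), the seed length `k`, and the use of Hamming distance as the similarity measure are all hypothetical choices made for this example. Non-overlapping k-mers of each read are looked up in an index of the reference, and each candidate locus is then verified by direct comparison.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read, reference, index, k, max_mismatches):
    """Return every reference start where the read fits within the mismatch budget."""
    hits = set()
    # Use non-overlapping k-mers of the read as seeds (illustrative choice).
    for offset in range(0, len(read) - k + 1, k):
        for pos in index.get(read[offset:offset + k], ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Verification step: one full sequence comparison per candidate.
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


def map_all_reads(reads, reference, k=4, max_mismatches=1):
    """Map each read independently; total work grows with the number of reads."""
    index = build_kmer_index(reference, k)
    return [map_read(r, reference, index, k, max_mismatches) for r in reads]
```

In this sketch the reference index is built once, but the seed lookups and verification comparisons are repeated for every read, so the total work grows in proportion to the number of reads regardless of how compactly the reference is stored.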
Identification of all possible mappings of a read within a given similarity threshold (denoted as “all-mapping”) is of particular importance to many downstream analyses and is the most robust way to comprehensively analyze structural variants, transposons, copy-number variants, and other repeat elements within the genome. Even single nucleotide polymorphism (SNP) genotyping accuracy is shown to substantially improve through the use of multiply-mapped reads during SNP-calling. However, all-mapping is often not utilized by current sequence analysis pipelines due to its high computational cost when performed by existing alignment software.
The term “compressive genomics” describes compression of data in such a way that it can be searched efficiently and accurately without first being decompressed. Compressive genomics exploits the redundancy of genomic sequences to enable parsimonious storage and fast access. It has been demonstrated previously that search algorithms such as BLAST and BLAT could be adapted to run in a compressive genomics framework. A compressive genomics framework is useful for mapping NGS read data because it allows the search or mapping of each read to a compressed collection of similar genomes. While redundancy of reference genomes can currently be exploited in read mapping, these advances have not yet been applied to large NGS read datasets. One key to capitalizing on read redundancy is the observation that raw NGS read datasets, such as those from the Genome 10K and 1000 Genomes projects, are also redundant due to the high similarity across and within individuals' read data. This similarity, however, is not easy to mine.
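The simplest form of read redundancy, exact duplication, can be illustrated with a short sketch. This is an assumption-laden toy example and not the technique of this disclosure: the function names `find_all` and `map_with_deduplication` are hypothetical, the mapper handles exact matches only, and real read datasets exhibit near-duplicate rather than strictly identical reads, which is why the similarity is harder to mine than this example suggests.

```python
from collections import Counter


def find_all(reference, read):
    """Toy exact-match mapper: every start position of the read in the reference."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def map_with_deduplication(reads, reference):
    """Map each distinct read once, then reuse the result for its duplicates.

    Returns the per-read mappings (in input order) and the number of
    mapper invocations actually performed.
    """
    counts = Counter(reads)
    mapped = {read: find_all(reference, read) for read in counts}
    return [mapped[read] for read in reads], len(counts)
```

For example, given the four reads `["ACGT", "ACGT", "TTGA", "ACGT"]`, the mapper is invoked only twice, once per distinct read, while every read still receives its full set of loci.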
Thus, fully realizing the conceptual advance of compressive genomics requires the development of novel computational techniques that take advantage of this redundancy to compress large read datasets, as well as the reference, in such a way that they can be mapped without first being decompressed. The technique of this disclosure addresses this need.