The past two decades have seen an exponential increase in genomic sequencing capabilities, outstripping advances in computing power. Extracting new insights from the datasets currently being generated will require not only faster computers but also smarter algorithms. Most genomes currently sequenced, however, are highly similar to ones already collected; thus, the amount of novel sequence information is growing much more slowly than the total amount of sequence data.
Successive generations of sequencing technology have exponentially increased the availability of genomic data. In the decade since the publication of the first draft of the human genome (a 10-year, $400 million effort), technologies have been developed that can sequence a human genome in one week for less than $10,000, and the 1000 Genomes Project is well on its way to building a library of over 2500 human genomes.
These leaps in sequencing technology promise to enable corresponding advances in biology and medicine, but they will require more efficient ways to store, access, and analyze large genomic data sets. Indeed, the scientific community is becoming aware of the fundamental challenges in analyzing such data. Difficulties with large data sets arise in a number of settings in which one analyzes genomic libraries, including finding sequences similar to a given query (e.g., from environmental or medical samples), or finding signatures of selection within large sets of closely related genomes.
Currently, the total amount of available genomic data is increasing approximately ten-fold every year, a rate much faster than Moore's Law for computational processing power. Any computational analysis, such as sequence search, that runs on the full genomic library (or even a constant fraction thereof) scales at least linearly in time with respect to the library size, and therefore effectively becomes exponentially slower every year. To achieve sub-linear analysis, one must take advantage of the redundancy inherent in the data. Intuitively, given two highly similar genomes, any analysis based on sequence similarity that is performed on one should already have done much of the work toward the same analysis on the other. While efficient algorithms such as the Basic Local Alignment Search Tool (BLAST) have been developed for individual genomes, large genomic libraries have additional structure: they are highly redundant. For example, as human genomes differ on average by only 0.1%, one thousand human genomes contain less than twice the unique information of one genome. Thus, while individual genomes are not very compressible, collections of related genomes should, in theory, be highly compressible.
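The arithmetic behind the two claims above can be checked with a short back-of-the-envelope sketch. It is illustrative only and rests on assumptions not stated in the text: that each additional genome contributes at most a 0.1% fraction of novel sequence relative to what is already in the library, and that computing power doubles roughly every 18 months (a commonly quoted reading of Moore's Law). The function and parameter names are hypothetical.

```python
def unique_information(num_genomes: int, divergence: float = 0.001) -> float:
    """Rough upper bound on the unique content of a library, in units of one genome.

    Assumption: the first genome counts as 1.0, and every further genome adds
    at most a `divergence` fraction of novel sequence.
    """
    if num_genomes < 1:
        raise ValueError("num_genomes must be at least 1")
    return 1.0 + (num_genomes - 1) * divergence


def relative_runtime(years: float,
                     data_growth_per_year: float = 10.0,
                     compute_doubling_years: float = 1.5) -> float:
    """Running time of a linear-time analysis after `years`, relative to today.

    Assumption: data grows by `data_growth_per_year` each year while compute
    doubles every `compute_doubling_years` years.
    """
    data_factor = data_growth_per_year ** years
    compute_factor = 2.0 ** (years / compute_doubling_years)
    return data_factor / compute_factor


if __name__ == "__main__":
    # One thousand genomes at 0.1% divergence stay below two genomes' worth.
    print(f"unique content of 1000 genomes: {unique_information(1000):.3f}")
    # A linear scan falls further behind each year despite faster hardware.
    for year in range(0, 6):
        print(f"year {year}: relative runtime {relative_runtime(year):,.1f}x")
```

Under these assumptions the unique content of one thousand genomes stays just under two genomes' worth, while the running time of any linear-time analysis grows by several orders of magnitude within a few years. This gap is what motivates analysis that scales with the novel content rather than with the total library size.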
Numerous algorithms exist for the compression of genomic datasets purely to reduce the space required for storage and transmission. Existing techniques, however, require decompression prior to computational analysis. Thus, while these techniques achieve a significant improvement in storage efficiency, they do not mitigate the computational bottleneck: in order to perform analysis, the original uncompressed data set must be reconstructed.
There have also been efforts to accelerate exact search via indexing techniques. While mapping short re-sequencing reads to a small number of genomes is already handled extensively by known algorithms, known techniques have not proven satisfactory for matching reads of unknown origin to a large database (e.g., in a medical or forensic context). Realizing acceleration becomes harder when one wishes to perform inexact search (e.g., with BLAST or the BLAST-Like Alignment Tool (BLAT)). Using compression effectively to accelerate inexact search requires a compression scheme that respects the metric on which similarity is scored.
There remains a need to provide new computational techniques that address these and other deficiencies in the known art.