Data compression is the process of encoding information in a more compact form. It is an important and well-studied topic with clear economic benefits: compressed data is cheaper to store, whether on a computer's hard drive or in its random access memory (RAM), and requires less bandwidth to transmit.
In choosing an appropriate compression strategy, the computational overheads of both compressing and decompressing the data (such as the amount of CPU time used or the amount of available RAM needed) must be taken into account alongside the level of compression achieved. In some applications (for example, compressing high-resolution images for display on a web page), a lossy compression approach may be appropriate, allowing greater compression to be achieved at the expense of some loss of information during the compression process. In other applications, it is important that a perfect copy of the original data can be extracted from the compressed data. A lossless compression strategy is the appropriate choice for such cases.
For many applications of data compression, including many of those used in biological sequence analysis, it is important that the original data be retrievable in its original, uncompressed form. Compression of data, and its reversal, therefore becomes a trade-off among several factors, chiefly the degree of compression achieved versus the computational resources and time required to compress and decompress the data.
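The trade-off described above can be illustrated with a minimal sketch. It assumes Python's standard `zlib` module as a stand-in for a general-purpose lossless compressor, and a synthetic, randomly generated nucleotide string in place of real sequence data; the helper names `make_sequence` and `compress_report` are hypothetical and are introduced only for this illustration.

```python
import random
import time
import zlib


def make_sequence(length, seed=0):
    """Return a synthetic nucleotide sequence as ASCII bytes (illustrative data only)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


def compress_report(data, levels=(1, 6, 9)):
    """Compress `data` at several zlib levels.

    Returns a list of (level, compressed_size_in_bytes, seconds_to_compress).
    """
    results = []
    for level in levels:
        start = time.perf_counter()
        packed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        # Lossless: decompression must reproduce the input exactly.
        assert zlib.decompress(packed) == data
        results.append((level, len(packed), elapsed))
    return results


if __name__ == "__main__":
    data = make_sequence(200_000)
    for level, size, seconds in compress_report(data):
        ratio = size / len(data)
        print(f"level={level} size={size} ratio={ratio:.3f} time={seconds:.4f}s")
```

Higher levels generally spend more CPU time searching for a smaller output, although the exact sizes and timings depend on the input and the machine, so none are quoted here.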
Technology for determining the sequence of an organism's DNA has progressed dramatically since its genesis in the 1970s, when DNA was first sequenced (Maxam-Gilbert sequencing). With the development of dye-terminator-based sequencing (Sanger sequencing) and related automated technologies, the field of nucleic acid sequencing took a giant step forward. The advent of dye-based technologies, instrumentation, and automated sequencing methods required the development of related software and data processes to deal with the generated data.
Much of the early work on the compression of DNA sequences was motivated by the notion that the compressibility of a DNA sequence could serve as a measure of its information content and hence as a tool for sequence analysis. This concept was applied to topics such as feature detection in genomes and alignment-free methods of sequence comparison. A comprehensive review of the field up to 2009 is found, for example, in Giancarlo et al. (2009, Bioinformatics 25:1575-1586, incorporated herein by reference in its entirety). However, the exponential growth in the size of nucleotide sequence databases is a reason to be interested in compression for its own sake. The recent and rapid evolution of DNA sequencing technology has given the topic more practical relevance than ever.
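The idea that compressibility can act as a rough proxy for information content can be sketched as follows. This is an illustration under stated assumptions, not any of the methods surveyed in the cited review: it uses Python's standard `zlib` as a generic compressor, synthetic sequences rather than genomic data, and hypothetical helper names (`compression_ratio`, `random_sequence`).

```python
import random
import zlib


def compression_ratio(sequence):
    """Compressed size divided by raw size; lower values suggest more redundancy."""
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


def random_sequence(length, seed=0):
    """Synthetic sequence with each base drawn uniformly at random (illustrative only)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


if __name__ == "__main__":
    repetitive = "ACGT" * 2500           # highly redundant tandem repeat
    unstructured = random_sequence(10_000)  # no repeats beyond chance
    print("repetitive  :", round(compression_ratio(repetitive), 4))
    print("unstructured:", round(compression_ratio(unstructured), 4))
```

The tandem repeat compresses to a far smaller fraction of its raw size than the random sequence, reflecting its lower information content. A general-purpose compressor gives only a loose estimate of that content.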
The high demand for high-throughput, low-cost nucleic acid sequencing methods and systems is driving the state of the art, leading to technologies that parallelize the sequencing process and produce very large amounts of sequence data at one time. Fueled by the commercial availability of a variety of high-throughput sequencing platforms, current large-scale sequencing projects generate data in the gigabyte and terabyte range.
Computer systems and data processing software for data analysis associated with current sequencing technologies have advanced considerably. Programs are available for compressing, indexing, analyzing, and storing generated sequence data. However, computational analysis of large data sets, such as those generated by current and future sequencing technologies, where data in the terabyte range is conceivable, remains a confounding issue: the amount of generated data is so large that analyzing and interpreting it presents a bottleneck for many investigators. Further, current computational sequence analysis requires an enormous amount of computer capacity and is not easily practiced on a typical desktop personal computer or laptop. As such, what are needed are methods and systems for computational analysis that can analyze very large datasets in a time-efficient manner and that are easily managed on a typical desktop or laptop computer system, providing efficiencies in both computer resource usage and time.