Genomics, the study of the genomes of organisms, is an active area of biotechnology that has been used to identify targets for pharmaceutical compounds and to gain insight into disease mechanisms. As the field has matured, automated nucleotide sequencing techniques and advanced assembly algorithms have enabled the complete sequencing of the genomes of hundreds of organisms, including the human genome. Recently developed sequencing devices use massively parallel techniques to produce gigabases of data in a single machine run. These devices have greatly reduced the time and cost of sequencing the genomes of individual humans, making personalized medicine increasingly feasible. Personalized medicine uses an individual's sequenced genome to determine that individual's susceptibility to diseases and responsiveness to drug regimens.
Typically, automated sequencing technologies cleave the nucleotide sample to be processed into a set of smaller nucleotide strands, which are then sequenced, yielding a read sequence set containing all of the reads from the sample. Each read of the read sequence set is the nucleotide sequence of one of the smaller nucleotide strands, as determined by the automated sequencing device. The read sequence set is then assembled into a complete genome: overlapping portions of the reads are aligned, and the reads are pieced together into a continuous nucleotide sequence.
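The piecing together of two overlapping reads can be illustrated with a minimal Python sketch. It assumes error-free reads and exact suffix-to-prefix matches; the function names `overlap_length` and `merge_reads` and the `min_overlap` parameter are hypothetical and chosen only for illustration, not taken from any particular assembler.

```python
def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no overlap of at least `min_overlap` exists."""
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3):
    """Join two reads on their overlap; return None if they do not overlap."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep `left` whole and append only the non-overlapping tail of `right`.
    return left + right[size:]


contig = merge_reads("ATGCGT", "CGTTAG")  # the reads share "CGT"
```

Real reads contain sequencing errors, so practical assemblers score approximate rather than exact overlaps; the sketch shows only the basic idea of aligning overlapping portions.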
A computationally challenging task in assembling the read sequence data into a genome is the alignment of the sequences. The sequences in the read sequence data set may be compared to one another and arranged so that sequences sharing the same series of nucleotide bases overlap (de novo assembly), or they may be aligned to a data set of an existing template genome, such as a data set in a publicly available genetic database (reference-guided assembly). Regardless of the method used, assembling the sequences into a genome entails the pairwise comparison of the sequences of one data set to the sequences of another data set, which in de novo assembly is the read sequence data set itself.
During the pairwise comparison process, each nucleotide sequence of one data set is compared individually to every nucleotide sequence of the other data set. When both data sets are small enough to be held entirely in the volatile memory of a computing device, it is reasonably efficient to load the nucleotide sequences from a nonvolatile memory device, such as a hard disk, into volatile memory and make the pairwise comparisons there.
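For small data sets that fit in volatile memory, the all-against-all comparison can be sketched as follows. This is an illustrative sketch only: the comparison criterion (sharing at least one substring of length `k`) and the names `kmers` and `pairwise_hits` are assumptions made for the example, not a method described in this document.

```python
from itertools import product


def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def pairwise_hits(set_a, set_b, k: int = 4):
    """Compare every sequence in `set_a` with every sequence in `set_b`.

    Both data sets are ordinary in-memory lists, so each of the
    len(set_a) * len(set_b) comparisons touches volatile memory only.
    Returns the index pairs (i, j) whose sequences share a k-mer.
    """
    hits = []
    for (i, a), (j, b) in product(enumerate(set_a), enumerate(set_b)):
        if kmers(a, k) & kmers(b, k):
            hits.append((i, j))
    return hits


hits = pairwise_hits(["ATGCGT", "TTTTTT"], ["GCGTAA", "CCCCCC"])
```

The number of comparisons grows with the product of the two data set sizes, which is why this direct approach is practical only while both sets remain small.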
Automated sequencing devices, however, can generate enormous data sets containing large numbers of nucleotide sequences, which typically are stored in nonvolatile memory during the genome assembly process. Conventional gene assembly systems process entire data sets in a single operation, which requires careful management of the transfer of nucleotide sequence data from nonvolatile memory to volatile memory for processing. Despite the measures taken to minimize processing times, such systems typically have long processing times. For example, typical existing techniques organize the pairwise comparison of sequences in a raster-type scan, which incurs a great deal of computational overhead for larger data sets. Assembling such large data sets typically requires advanced computational systems with large volumes of volatile memory, operating at maximum capacity for long periods of time.
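One plausible reading of a raster-type scan is a row-by-row sweep of the comparison matrix, in which the second data set is streamed from nonvolatile storage again for every sequence of the first. The Python sketch below illustrates that reading under stated assumptions: each data set is a plain text file with one sequence per line, and the names `write_sequences`, `read_sequences`, `raster_scan`, and `demo` are hypothetical. It counts how many sequences are loaded from disk to make the overhead visible.

```python
import os
import tempfile


def write_sequences(path, sequences):
    """Store one sequence per line (assumed file layout for this sketch)."""
    with open(path, "w") as handle:
        for seq in sequences:
            handle.write(seq + "\n")


def read_sequences(path):
    """Stream sequences from disk one at a time, skipping blank lines."""
    with open(path) as handle:
        for line in handle:
            seq = line.strip()
            if seq:
                yield seq


def raster_scan(path_a, path_b, compare):
    """Sweep the comparison matrix row by row.

    For each sequence of data set A (one row), data set B is re-read
    from disk in full. Returns the matching index pairs and the total
    number of sequences loaded from nonvolatile storage.
    """
    hits, loads = [], 0
    for i, a in enumerate(read_sequences(path_a)):
        loads += 1
        for j, b in enumerate(read_sequences(path_b)):
            loads += 1
            if compare(a, b):
                hits.append((i, j))
    return hits, loads


def demo():
    """Run the scan on two tiny data sets written to a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path_a = os.path.join(tmp, "set_a.txt")
        path_b = os.path.join(tmp, "set_b.txt")
        write_sequences(path_a, ["ATGCGT", "TTTTTT"])
        write_sequences(path_b, ["CGTTAG", "CCCCCC"])
        # Toy criterion: last three bases of a equal first three of b.
        return raster_scan(path_a, path_b, lambda a, b: a[-3:] == b[:3])
```

With N sequences in the first set and M in the second, this sweep loads N + N × M sequences from disk, so the transfer cost grows with the product of the data set sizes rather than their sum.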