1. Field of the Invention
The invention relates generally to string matching techniques, specifically to a system and method for searching sequences, such as genomes, for a specified query, such as a DNA fragment.
2. Description of Prior and Related Art
The identification of sequence homology—a kind of match—between an unknown genetic test sample and a known genomic sequence often provides clues about the function, identity, or evolutionary relatedness of the sequence in question. This information can be used, for example, in determining genetic diseases, defects, and other hereditary characteristics. With the explosive rise in the amount of genetic data made possible by new sequencing technologies, the ability to screen newly discovered DNA sequences against large databases of known genes has become a particularly important—and computationally intensive—aspect of modern biology.
Another computationally and data intensive task of modern biology is sequence assembly: the process of reconstructing the original DNA sequence from the small DNA fragments output by sequencing equipment. DNA sequencing requires duplicating a target sequence and then randomly fracturing these duplicates. This random fracturing ensures that the subsequence fragments produced will contain enough overlapping information to facilitate sequence reconstruction. The subsequent chemical readout process requires these fragments to be at most 500 bases long to keep the error rate below 1%. The end result is a large number of DNA fragments. To assemble the original sequence after readout, each fragment, now represented digitally by the characters A (Adenine), T (Thymine), G (Guanine), and C (Cytosine), is compared (matched) to all other fragments in search of a best-fit overlap. Overlapping fragments are merged, and the process repeats until the entire sequence of the input strand has been determined.
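The match-and-merge procedure described above can be illustrated with a minimal sketch. It assumes error-free reads, exact suffix/prefix overlaps, and a greedy choice of the largest overlap at each step; the function names `overlap_len` and `greedy_assemble` and the `min_len` threshold are hypothetical and chosen only for illustration. Practical assemblers must also tolerate read errors, repeats, and fragments contained within other fragments, none of which this sketch handles.

```python
def overlap_len(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases and
    returns whatever contigs are left.
    """
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        # Compare every fragment against every other fragment.
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap_len(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return frags
        # Merge the best pair: keep `a` whole, append the non-overlapping tail of `b`.
        merged = frags[best_i] + frags[best_j][best_k:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)


if __name__ == "__main__":
    reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
    print(greedy_assemble(reads))
```

Every pass compares each fragment against every other fragment, which is the all-against-all matching that makes assembly so expensive as the number of fragments grows.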
This match-and-merge method takes considerable computational resources. Genomics companies, for example, use supercomputing clusters for the task. Algorithmic analysis shows that the full reassembly process is θ(N2 log2 N) or higher, where N is the number of bases (nucleotides) in the DNA being sequenced. For a sequence the size of the human genome, full reconstruction through a pair-wise comparison of all reads has been estimated at several thousand years of compute time on a general-purpose computer, and therefore represents a formidable computational bottleneck.
Special purpose machines for accelerated genetic sequence analysis were first developed in the late 1980s, when it was recognized that the growth of genetic databases outpaced Moore's law and the power of general-purpose processors. With few exceptions, these machines have used a systolic array configuration consisting of hundreds to thousands of small processing elements, as this is the direct hardware adaptation of the dynamic programming algorithms used in biosequence analysis. These algorithms are typically based on a string edit distance optimization as first presented by Needleman and Wunsch (“A general method applicable to the search for similarities in the amino acid sequences of two proteins.” J. Molecular Biology vol. 48, pgs. 443-453, 1970), and later expanded and extended by Smith and Waterman (“Comparison of Biosequences.” Adv. in Appl. Math. 2, pgs. 482-489, 1981). The main variations between systolic architectures have involved trading ASIC speed for FPGA flexibility, using general-purpose rather than specialized systolic arrays, and experimentation with intra-element communication and bus architecture. Systems have ranged from several chips on a PCI board to server-sized machines, but all suffer from a number of disadvantages:
(a) Systolic arrays consist of hundreds to thousands of processing elements, where each processing element may require thousands of logic elements. The resulting systems are large (as measured in silicon area or number of chips), costly, and power hungry. The preferred embodiment of the systolic processor detailed in U.S. Pat. Nos. 5,964,860 and 5,632,041 to Peterson (1999 and 1997, respectively), for example, shows that the 16 processing elements which fit on a chip require a total of 400,000 transistors, and that multiple chips are likely needed for a working system.
(b) Communication between thousands of processing elements, control units, and possibly multiple chips greatly reduces the theoretical processing power of systolic arrays. For example, in the preferred embodiment section of U.S. Pat. No. 5,706,498 to Fujimiya et al. (1998) it is noted that compute time is greatly restricted by bus transactions between units.
(c) Systolic arrays require specialized languages, compilers, and a non-standard programming model to fully utilize their capabilities.
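For context, the dynamic programming formulation that these systolic arrays adapt to hardware can be sketched in software as follows. This is a minimal illustration of a Needleman-Wunsch style global alignment score, assuming a linear gap penalty; the function name `global_alignment_score` and the scoring values (`match`, `mismatch`, `gap`) are hypothetical choices for the example and are not taken from the cited references.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of `a` against `b` (linear gap penalty).

    Only the previous row of the score table is kept, so memory use is
    proportional to len(b) while time is proportional to len(a) * len(b).
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b` costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` aligned against nothing
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in `b`
            left = curr[j - 1] + gap  # gap inserted in `a`
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("GATTACA", "GATTACA"))
    print(global_alignment_score("ACGT", "AGT"))
```

Each table cell depends only on its left, upper, and upper-left neighbors, so all cells on one anti-diagonal can be computed at the same time. This is why one processing element per cell column maps naturally onto a systolic array, and also why such arrays need so many elements.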
The linear shift register method of matching first presented in U.S. Pat. No. 5,724,253 to Skovira of IBM (1998) alleviates some of these problems but still has several disadvantages:
(d) The IBM system uses an encoding where more chemically similar nucleotides have a shorter Hamming distance between their constituent bits. Using this encoding, the system adds the result of every XOR nucleotide comparison in order to obtain a sum score quantifying the dissimilarity of a particular sequence alignment. Determining a match through a sequential addition of each nucleotide comparison takes much longer than performing a match-determining operation in parallel, and means that compute time scales linearly with the query size.
(e) Only one alignment between the genome and query sequence is checked during an iteration.
(f) Considerable memory bandwidth is used, as a dissimilarity score is written to memory for every alignment checked.
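The scoring scheme criticized in items (d) through (f) can be modeled in software roughly as follows. This is a sketch only: the 2-bit encoding in `CODE` is a hypothetical Gray-code style assignment chosen for illustration and is not the encoding of the cited patent, and the names `dissimilarity` and `scan` are likewise assumptions.

```python
# Hypothetical 2-bit encoding, NOT taken from the cited patent. The purines
# (A, G) differ in one bit, as do the pyrimidines (C, T).
CODE = {"A": 0b00, "G": 0b01, "T": 0b11, "C": 0b10}


def dissimilarity(window: str, query: str) -> int:
    """Sum the Hamming distances of the per-nucleotide XOR comparisons.

    The sum is accumulated one base at a time, so the work per alignment
    grows linearly with the query length (item (d)).
    """
    score = 0
    for w, q in zip(window, query):
        score += bin(CODE[w] ^ CODE[q]).count("1")
    return score


def scan(genome: str, query: str) -> list[int]:
    """Check one alignment per iteration (item (e)) and store one
    dissimilarity score for every alignment checked (item (f))."""
    m = len(query)
    return [dissimilarity(genome[i:i + m], query)
            for i in range(len(genome) - m + 1)]


if __name__ == "__main__":
    scores = scan("ACGTAC", "GTA")
    print(scores)                     # one score per alignment
    print(scores.index(min(scores)))  # offset of the least dissimilar alignment
```

A score of zero marks an exact match at that offset. The inner loop in `dissimilarity` is the sequential addition, and the list returned by `scan` stands in for the per-alignment memory writes.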