When a novel stretch of nucleic acid, e.g., DNA, or protein is first sequenced, its sequence is typically compared against a database of known DNA and protein information to provide a preliminary indication of the new DNA or protein's function. A researcher can then design experiments to assay the results of the database search.
As a result of the enormous improvement of DNA sequencing technology, the rate of growth of the DNA database has grown exponentially over the last decade from 1.5 million nucleotides per year in 1989 to over 1.6 billion nucleotides per year in 1999. Since 1999, entire genomes have been sequenced, including those of drosophila, mouse, and human. As the amount of known genetic sequence information increases exponentially, it becomes increasingly important to develop high speed methods to search the expanding sequence databases.
For example, GenBank, a public repository of genomic information, currently has nearly 16 GB of data, having grown from a mere 680 K in 1982 (Benson et al., Nucleic Acids Research, 28(1):15-18 (2000) (See also www.ncbi.nlm.nih.gov/Genbank/genbankstats.html.)). At that rate, the amount of data is doubling every 16.5 months. In 2001 alone, 3.5 million sequences totaling 3 GB of new sequence were entered into GenBank. Both public and private sequencing facilities consist of warehouse-sized factories generating data around the clock, limited only by the availability of reagents and the speed of the sequencing machines.
As more sequence data becomes available, one of the most basic problems is sequence alignment—how does a newly discovered nucleotide or amino acid sequence relate to previously known and studied data? Note that whenever the amount of sequence data doubles, the number of possible comparisons quadruples. Therefore, despite similarly impressive exponential doubling of computer speed on roughly the same time course, sequence comparison algorithms are falling farther and farther behind in their ability to find homologies for any significant part of the available data.
Much of the current research in homology searching focuses on algorithms that are more sensitive and selective—algorithms that use sophisticated techniques such as Hidden Markov Models to reduce the rate of false positives and false negatives, in other words to correctly identify weak homologies (e.g., genes in two very distantly related organisms), as described in, for example Eddy, “Profile Hidden Markov Models,” Bioinformatics, 14(9):755-763 (1998), the entirety of which is incorporated by reference herein. Such algorithms push the envelope of mining the maximum amount of information from the data available, and are of enormous research interest. This invention takes a different track. For many applications, existing algorithms are good enough. In particular, the BLAST (Basic Local Alignment Search Tool) family of algorithms, developed by NCBI is sufficiently sensitive and selective for most searches. See, for example, Altschul, et al., “Gapped BLAST and PSI-BLAST: new generation of protein database search programs,” Nucleic Acids Research, 25(17):389-3402 (1997), the entirety of which is incorporated by reference herein.
Basic Local Alignment Search Tool
Currently, the most common methods of high speed database are variants of the Basic Local Alignment Search Tool (“BLAST”) described by Altschul, et al., J. Mol. Biol., 215(3):403-10 (1990), the entirety of which is incorporated by reference herein. BLAST and its variants provide a fast, local alignment of sequences. BLAST programs have been written to compare protein or DNA queries with protein or DNA databases in any combination, with DNA sequences often undergoing conceptual translation before any comparison is performed.
BLAST is a heuristic that attempts to optimize a similarity measure. Specifically, BLAST identifies all local regions of similarity (called High-scoring Segment Pairs or “HSPs”) which can statistically be distinguished from random alignments. BLAST permits a tradeoff between speed and sensitivity, with the setting of a “threshold” parameter, T. A higher value of T yields greater speed, but also an increased probability of missing weak similarities. The BLAST program requires time proportional to the product of the lengths of the query sequence and the database searched.
The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words. BLAST first scans the database for words (e.g., 3 for proteins and 12 for DNA) that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a “hit.” The second step of the algorithm checks whether each hit lies within an alignment with a score sufficient to be reported. This is done by extending a hit in both directions, until the running alignment's score has dropped more than X below the maximum score yet attained.
In its most fundamental implementation, the BLAST algorithm performs three steps: (1) compile a list of high scoring words of length w from the query sequence; (2) for each sequence in the database, scan for word hits—i.e. words from the query sequence matching words in the database sequence—with scores greater than a threshold T; (3) for each word hit, extend it in both directions to form a High Scoring Pair (“HSP”) of score greater than or equal to S. The to following paragraphs describe these steps in greater detail.
In a typical BLAST implementation, the list of high scoring words is compiled into a lookup table with i rows and j columns, where i is the number of all possible words of length w, and j is the number of elements in the query sequence minus w. Because the value i represents every possible word of length w, each row in the lookup table corresponds to one word of length w. The row number corresponds to the lexical order of its corresponding word, and can be considered the “row number” for that word. For DNA sequences i=4w; for protein sequences, i=20w. The value j represents the number of starting positions of words of length w in the query sequence. The lookup table is initialized with all zeroes, and then populated as follows: For every word of length w in the query, look up its corresponding row. Call this row x. Call the position of the word in the query sequence y. Set the (x,y) element of the lookup table to y. Once the lookup table is populated, it is then trimmed. Rows with all zeroes, representing words not present in the query, are removed. The remaining words are then screened for significance by examining them to see if their self-similarity score meets a minimum threshold T. Similarity scores are typically calculated using a substitution matrix such as PAM120 and BLOSUM62, as described in Dayhoff et al., Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, Ed. M. O. Dayhoff (1978); Henikoff and Henikoff, Proc. Natl. Acad. Sci. USA, 89: 10915-10519 (1992); and Henikoff and Henikoff, Proteins, 17: 49-61 (1993). These publications are incorporated by reference herein in their entireties. Rows representing words with self-scores less than T are eliminated. Finally, all columns with zeroes are eliminated. The resulting lookup table is indexed by the lexical word number of significant words, and returns the position of significant words in the query sequence.
BLAST is a heuristic algorithm, and values of w and T are chosen for an optimal mix of sensitivity, specificity, and speed. As w increases for a given value of T, specificity is increased but sensitivity is decreased. Similarly, as T decreases for a given value of w, sensitivity is increased, but specificity is decreased and run time is increased. Exemplary settings of w and T are w=4, T=17 for proteins and w=12, T=60 for DNA.
Once the lookup table is constructed, it is used for comparing the query sequence against the database. Specifically, the database is searched for all words of length w that are present in the query sequence whose self-scores are higher than the threshold T. Every word thus found is called a “hit.” The output of the database search comprises a list of hits. Because the database is searched repeatedly, it is first preprocessed into a lookup table similar to that generated for the query sequence. Thus, the search process can be performed quickly by comparing the two lookup tables.
The final step is to extend each hit to form a High-scoring Segment Pair (“HSP”). Typically, the hit is extended in both directions until the similarity score between the corresponding stretches of query sequence and database sequence fall below a predetermined threshold S.
The output of a typical BLAST implementation includes descriptions of hits found in the database. For each hit, the output includes information about the sequence from which the hit came, the hit's score in bits, the probability that a given HSP or HSPs would have been found by chance, and the E-value, which is a measure of the number of matches expected to have been found by chance in a database of this size for the given score.
Previous Improvements to BLAST
Implementations of the BLAST algorithm have undergone improvements since the algorithm was first introduced. Three major improvements include the “two-hit” method for hit extension, the ability to generate gapped alignments, and a position-specific, iterated BLAST (“PSI-BLAST”) that in many cases is more sensitive to weak but biologically relevant sequence similarities. These improvements are described in detail, for example, in Altschul, “Gapped BLAST and PSI-BLAST: a new generation of protein database search programs,” Nucleic Acids Research, 25(17):3389-3402 (1997).
Performance data of the original BLAST implementation indicate that the extension step, where hits are extended to form HSPs, consumed the greatest amount of processing time. The “two-hit” method for hit extension is a refinement on this step, which results in the generation of fewer time-consuming extension steps. Experimental evidence indicates that a typical HSP of interest is much longer than a single word pair, and may therefore comprise multiple hits on the same diagonal and within a relatively short distance of one another. The two-hit method exploits this observation by requiring the existence of two non-overlapping word pairs on the same diagonal, and within a distance A of one another, before an extension is invoked. Any hit that overlaps the most recent one is ignored. Because this method requires two hits rather than one to invoke an extension, the threshold parameter T must be lowered to retain comparable sensitivity. The effect is that many more single hits are found, but only a small fraction have an associated second hit on the same diagonal that triggers an extension. Thus, the great majority of hits may be dismissed after the minor calculation of looking up, for the appropriate diagonal, the coordinate of the most recent hit, checking whether it is within distance A of the current hit's coordinate, and finally replacing the old with the new coordinate. Empirically, the computation saved by requiring fewer extensions more than offsets the extra computation required to process the larger number of hits.
Another method of performing the extension step has been described, for example, in “Multiple sequence alignment using block chaining,” Zheng Zhang, PhD dissertation, The Pennsylvania State University, UMI Dissertation Services, Ann Arbor (1996); and Zhang et al., “Chaining multiple-alignment blocks,” J. of Computational Biology, 1:217-226 (1994). This method is based on the extension step's similarity to a known class of problems in computer science known as “block chaining,” a special case of the classic optimal-path algorithm for directed, acyclic graphs. The method described above adopts a well-known technique of higher-dimensional computational geometry called K-D trees. Generally, K-D trees are used to decompose multidimensional space hierarchically into a relatively small number of cells such that no cell contains too many input objects (see Bentley, Communications of the ACM, 18:509-517, (1975), the entirety of which is incorporated by reference herein). In Zhang, K-D trees are used to decompose the space of possible block chains into regions, thus reducing the block chaining problem to the less computationally intensive problem of chaining regions. The use of K-D trees provides significant computational gains for multiple sequence comparison, i.e., more than two sequences are being compared to each other.
The ability to generate gapped alignments allows for significant improvements in BLAST's performance. The original BLAST program often finds several alignments involving a single database sequence which are statistically significant only when considered together. Overlooking any one of these alignments would compromise the combined result. By introducing an algorithm for generating gapped alignments, it becomes necessary to find only one rather than all the ungapped alignments subsumed in a significant, combined result. This allows the T parameter to be raised, increasing the speed of the initial database scan. One method of generating a gapped alignment employs a heuristic approach which is a simple generalization of BLAST's method for constructing HSPs. The central idea of this approach is to trigger a gapped extension for any HSP that exceeds a moderate score Sg, chosen so that no more than about one extension is invoked per 50 database sequences. Statistical analysis indicates that Sg should be set at approximately 22 bits for a typical-length protein query. A gapped extension takes much longer to execute than an ungapped extension, but by performing very few of them the fraction of the total running time they consume can be kept relatively low. Furthermore, by seeking a single gapped alignment, rather than a collection of ungapped ones, only one of the constituent HSPs need be located for the combined result to be generated successfully. This means that a much higher chance of missing any single moderately scoring HSP can be tolerated. This permits the T parameter for the hit-stage of the algorithm to be raised substantially while retaining comparable sensitivity. For example, T can be raised from 11 to 13 for the one-hit heuristic of the original BLAST implementation.
The iterated application of BLAST to position-specific score matrices, also known as profiles or motifs, allows for database searches that are often much better able to detect weak but biologically significant relationships. One implementation of position-specific, iterated BLAST called PSI-BLAST constructs a position-specific score matrix from the output of a primary BLAST run and uses this matrix as a query for a subsequent BLAST run.
Although several refinements have increased the speed, sensitivity, and specificity of current implementations of BLAST more than three-fold when compared to the original, the exponential rate of sequence database expansion mandates continuous improvement to implementations of the algorithm.
Parallel Processing
One approach for speeding up large database searches is to use parallel processing. High performance parallel computing is accomplished by splitting up large and complex tasks across multiple processors. In a simple example, a sequence database can be subdivided into several parts, with each part assigned to a specific processing unit. The same query can then be run on all processing units simultaneously, with each processing unit having to search only a fraction of the database. In a more complex example, a task is divided into subtasks. For instance, the extension step in BLAST requires an exhaustive examination of alternative chains of HSPs. In this example, the space of possible chains is subdivided into a plurality of subspaces and each subspace is allocated to a separate processor.
While a variety of methods can be used to achieve improved performance on multiple computers, the most common way to organize and orchestrate parallel processing is to write code which automatically decomposes the problem at hand and allows the processors to communicate with each other when necessary while performing their work. In the first example above, if a processor finds a match to the query in its part of the database, it can signal that event to the other processors performing their searches. In the second example above, if a processor finds a new maximum score, it can signal that event so that other processors raise their thresholds.
Most applications of parallel processing usually require some amount of interaction between processors, so they must be able to communicate with each other to exchange information. For example, values for cells on a map may depend on the values of their nearest neighboring cells. If the map is decomposed into two pieces, each being processed on a separate CPU, the processors must exchange cell values on adjacent edges of the map. Similarly, if a sequence alignment is decomposed into several smaller local alignments, each processor handling each local alignment must be able to communicate with processors handling adjacent alignments in order to extend an alignment past its original bounds.
Several contrasting approaches to parallel processing exist. For example, parallel processing can be performed on highly specialized hardware or on commercial off-the-shelf (“COTS”) hardware. BLAST and other sequence comparison and database searching algorithms have been implemented in highly-specialized, parallel processing hardware, such as PARACEL'S GENE MATCHER machine. As described in Ullner, U.S. Pat. No. 6,112,288, the entirety of which is incorporated by reference herein, PARACEL'S GENE MATCHER employs a programmable, special-purpose pipeline processing system, which includes a plurality of accelerator chips coupled in series. Each of the accelerator chips includes an instruction processor. Each of the pipeline processor segments includes a plurality of pipeline processors coupled in series. Such specialized hardware can provide significant speedups, especially for dynamic programming algorithms.
In contrast to the specialized hardware approach is the commodity parallel processing approach, which uses inexpensive, commercial off-the-shelf (“COTS”) hardware. One popular approach to commodity parallel processing is the Beowulf cluster, described in Becker, et al., Beowulf: A Parallel Workstation for Scientific Computation. Proceedings, International Conference on Parallel Processing (1995); and M. S. Warren, et al., Parallel supercomputing with commodity components; H. R. Arabnia, editor, Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA '97), pages 1372-1381, 1997. These publications are incorporated by reference herein in their entireties. A Beowulf-class cluster is a high-performance, massively parallel computer built primarily out of commodity hardware components, running a free-software operating system like Linux or FreeBSD, interconnected by a private, high-speed network. A typical Beowulf-class cluster comprises a plurality of interconnected PCs or workstations dedicated to running high-performance computing tasks. It is usually connected to the outside world through only a single node. A typical Beowulf-class cluster employs a popular method of inter-processor communication called message passing. Popular implementations of message passing include the Parallel Virtual Machine (“PVM”) and the Message Passing Interface (“MPI”).
Because commodity parallel processing hardware provides substantial performance at reasonable cost, and because exponential database growth requires substantial improvements in database searching techniques, there exists a need to improve the BLAST algorithm to increase its performance on commodity parallel processing hardware.