Sequence similarity is an observable quantity that may be expressed as, for example, a percentage. Comparison of newly identified sequences against known sequences often provides clues about the function of the sequences. If the sequence is a protein sequence, the sequence comparison may also provide clues as to the three-dimensional structure adopted by the protein sequence. Sequence similarity may also lead to inferences on the evolutionary relatedness, or the homology, of the sequences.
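As an illustration of expressing similarity as a percentage, the following minimal sketch computes percent identity for two sequences that have already been aligned. It assumes that gaps are written as "-" and that columns containing a gap are excluded from the denominator; other definitions of percent identity are in use, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumption: gap-containing columns are skipped rather than counted as mismatches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Four of the five gap-free columns match.
    print(percent_identity("ACGT-A", "ACCTTA"))
```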
Current sequence databases are already immense and continue to grow at an exponential rate. For example, the human genome project and other large-scale nucleotide sequencing objectives have resulted in a large amount of sequence information available in both private and public databases. Sequence similarity searching is not simply used to compare a single sequence against the sequences in a single database, but is also used to compare or screen large numbers of new sequences against multiple databases. Moreover, sequence alignment and database searches are performed tens of thousands of times per day around the world. Therefore, the ability to quickly and precisely compare new sequence data against such sequence databases is becoming more and more important.
There are many different methods for comparing sequences. Some methods, such as those based on the analysis of transformational grammars (cf. Durbin, et al., Biological Sequence Analysis, Cambridge University Press (1998), Chapter 9), compare sequences by comparing the properties of the mathematical algorithms that may be used to generate the sequences in question. However, most common methods involve the use of sequence alignment at some point in the comparison process. Sequence alignment provides an explicit mapping between the residues of two or more sequences. When only two sequences are compared, the process is called pairwise alignment, but there are also methods of constructing multiple alignments that involve aligning more than two sequences.
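The explicit residue-to-residue mapping produced by pairwise alignment may be sketched with a simple global alignment by dynamic programming (the Needleman-Wunsch approach). The text above does not name a particular algorithm, so this choice, the function name, and the scoring values (match +1, mismatch -1, gap -1) are assumptions made only for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Global pairwise alignment by dynamic programming (illustrative scoring)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


if __name__ == "__main__":
    top, bottom, total = global_align("ACGT", "AGT")
    print(top)
    print(bottom)
    print(total)
```

Each column of the two returned strings pairs a residue with a residue or with a gap, which is the mapping referred to above. Database search programs generally use faster, heuristic local alignment rather than this exhaustive form.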
The production of a sequence alignment result may be generically divided into two separate problems. The first problem is the alignment of the query sequence with the sequences in the databases being searched. The second problem is ranking or scoring of the aligned sequences. The results of the sequence alignment search are then reported as a ranked hit list followed by a series of individual sequence alignments, plus various scores and statistics.
There are various programs and algorithms available for performing database sequence similarity searching. For a basic discussion of bioinformatics and sequence similarity searching, see BIOINFORMATICS: A Practical Guide to the Analysis of Genes and Proteins, Baxevanis and Ouellette eds., Wiley-Interscience (1998) and Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin et al., Cambridge University Press (1998). One of the first algorithms used for sequence alignment searching was incorporated into the FASTA program. (Lipman and Pearson, “Rapid and sensitive protein similarity searches,” Science, Vol. 227, pp. 1435-1441 (1985); Pearson and Lipman, “Improved tools for biological sequence comparison,” Proc. Natl. Acad. Sci., Vol. 85, pp. 2444-2448 (1988)). The FASTA program performs optimized searches for local alignments using a substitution matrix. In order to improve the speed of the search, the program uses an observed pattern of small matches, termed “word” hits, to identify potential matches before performing the more time-consuming optimization search.
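The use of “word” hits as an inexpensive pre-filter may be sketched as follows. The example indexes every word of length k in the query, scans a database sequence for the same words, and counts hits per diagonal (subject position minus query position); a diagonal with many hits marks a region worth passing to the slower optimization step. This is a simplified illustration and not the FASTA implementation; the function names and the default word length are assumptions.

```python
from collections import Counter, defaultdict


def word_index(seq, k):
    """Map each length-k word of seq to the positions at which it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count exact word hits per diagonal (subject offset minus query offset)."""
    index = word_index(query, k)
    hits = Counter()
    for j in range(len(subject) - k + 1):
        # .get avoids adding empty entries to the index for unseen words.
        for i in index.get(subject[j:j + k], ()):
            hits[j - i] += 1
    return hits


def best_diagonal(query, subject, k=2):
    """Return (diagonal, hit_count) for the most populated diagonal, or None."""
    hits = diagonal_hits(query, subject, k)
    return hits.most_common(1)[0] if hits else None


if __name__ == "__main__":
    print(best_diagonal("ACGTAC", "TTACGTAC", k=3))
```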
A popular algorithm for sequence similarity searching is the BLAST (Basic Local Alignment Search Tool) algorithm, which is employed in programs such as blastp, blastn, blastx, tblastn, and tblastx. (Altschul et al., “Local alignment statistics,” Methods Enzymol., Vol. 266, pp. 460-480 (1996); Altschul et al., “Gapped BLAST and PSI-BLAST: A new generation of protein database search programs,” Nucl. Acids Res., Vol. 25, pp. 3389-3402 (1997); Karlin et al., “Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes,” Proc. Natl. Acad. Sci., Vol. 87, pp. 2264-2268 (1990); Karlin et al., “Applications and statistics for multiple high-scoring segments in molecular sequences,” Proc. Natl. Acad. Sci., Vol. 90, pp. 5873-5877 (1993)). The approach used by the BLAST program is to first identify segments, with or without gaps, that are similar in a query sequence and a database sequence, then to evaluate the statistical significance of all such matches that are identified, and finally to summarize only those matches that satisfy a preselected threshold of significance.
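The significance evaluation and thresholding steps may be illustrated with the expected-value relation commonly associated with the Karlin and Altschul statistics cited above, E = K·m·n·exp(-λS), where S is the raw score, m and n are the query and database lengths, and λ and K depend on the scoring system. The passage above does not state this formula, so its use here is an assumption, as are the function names, the threshold default, and the parameter values in the usage example, which are placeholders rather than values for any real substitution matrix.

```python
import math


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance matches scoring at least raw_score."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score, so that E = query_len * db_len * 2 ** (-bit_score)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def significant_hits(hits, query_len, db_len, lam, k, threshold=10.0):
    """Keep (subject_id, raw_score) hits with E <= threshold, most significant first."""
    scored = [(sid, s, e_value(s, query_len, db_len, lam, k)) for sid, s in hits]
    kept = [h for h in scored if h[2] <= threshold]
    return sorted(kept, key=lambda h: h[2])


if __name__ == "__main__":
    # Placeholder statistical parameters, chosen only for illustration.
    LAM, K = 0.3, 0.1
    hits = [("subject_1", 80), ("subject_2", 20), ("subject_3", 55)]
    for sid, score, e in significant_hits(hits, 250, 1_000_000, LAM, K):
        print(sid, score, e)
```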
The blastp program compares an amino acid query sequence against a protein sequence database, while the blastn program compares a nucleotide query sequence against a nucleotide sequence database. The blastx program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. A protein query sequence is compared against a nucleotide sequence database dynamically translated in all six reading frames (both strands) by the tblastn program, and tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. The program blastall, one of the implementations of BLAST, can be used to perform all five flavors of the BLAST comparison.
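The six-frame conceptual translation used by blastx, tblastn, and tblastx may be sketched as follows: three reading frames are taken from the given strand and three from its reverse complement. The sketch assumes the standard genetic code, upper- or lower-case A, C, G, and T input, and "X" for any codon containing another symbol; the function names and frame labels are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six conceptual translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for name, protein in six_frames("ATGGCCTAA").items():
        print(name, protein)
```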
The BLAST program can be downloaded from the NCBI and run locally as a full executable. It can be used to run BLAST searches against private local databases or downloaded copies of the NCBI databases. Versions 1.4 and later of BLAST are capable of being run in parallel on shared-memory multiprocessors. (N. Camp, “High-Throughput BLAST,” Silicon Graphics, Inc., September 1988, www.sgi.com/chembio/resources/papers/HTBlast/HT_Whitepaper.html).
Silicon Graphics, Inc. (“SGI”) has developed an alternative parallel system for running multiple BLAST searches. (N. Camp, “High-Throughput BLAST,” Silicon Graphics, Inc., September 1988, www.sgi.com/chembio/resources/papers/HTBlast/HT_Whitepaper.html). The system, called High-Throughput BLAST (“HT BLAST”), consists of a modified BLAST executable and a driver. HT BLAST allows multiple sequences to be compared against multiple databases with only a single invocation of code. The output of HT BLAST is a summary of the High Scoring Pair information generated during the search. Because it is invoked only once, HT BLAST saves on startup overhead by reusing data structures and eliminating the need to remap the databases. HT BLAST also removes all parallel constructs from BLAST, allowing for increased single-processor speed. Parallelism is instead relocated to the driver, which distributes blocks of sequences to multiple processors running HT BLAST. HT BLAST uses a dynamically scheduled loop to maintain load balance. Because the independent tasks are blocks of sequences compared to multiple databases, the parallel grain size can be much greater than it is for unmodified BLAST. Thus, scaling to large numbers of processors is accomplished even for short sequences and small databases.
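The driver arrangement described above, in which blocks of sequences are handed out by a dynamically scheduled loop, may be sketched as follows. Each worker takes the next unclaimed block as soon as it finishes its previous one, so faster workers naturally process more blocks. This is not the HT BLAST driver: the function names, the block size, and the `search_block` callable (a stand-in for one invocation of a search executable on a block of sequences) are assumptions, and Python threads are used only to show the scheduling pattern.

```python
import queue
import threading


def make_blocks(sequences, block_size):
    """Split the query sequences into consecutive blocks."""
    return [sequences[i:i + block_size]
            for i in range(0, len(sequences), block_size)]


def run_dynamic(sequences, databases, search_block, n_workers=4, block_size=2):
    """Distribute blocks to workers on demand and return results in block order."""
    tasks = queue.Queue()
    for block_id, block in enumerate(make_blocks(sequences, block_size)):
        tasks.put((block_id, block))
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                # Dynamic scheduling: claim the next block only when idle.
                block_id, block = tasks.get_nowait()
            except queue.Empty:
                return
            out = search_block(block, databases)
            with lock:
                results[block_id] = out

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in sorted(results)]


if __name__ == "__main__":
    # Hypothetical stand-in for a real search: pair every sequence with every database.
    def fake_search(block, databases):
        return [(seq, db) for seq in block for db in databases]

    print(run_dynamic(["q1", "q2", "q3", "q4", "q5"], ["dbA", "dbB"], fake_search))
```

Larger blocks reduce scheduling overhead, while smaller blocks balance load more evenly; the grain size is therefore a tuning choice.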
HT BLAST, however, is run on a single multiprocessor mainframe. The method and apparatus of the instant invention allow a sequence similarity searching program, such as the BLAST executable, to be run on multiple, networked, heterogeneous machines. Moreover, HT BLAST does not allow collections of databases to be divided up both by treating individual databases separately and by partitioning the individual databases. The method and apparatus of the instant invention do not require a shared disk architecture, whereas HT BLAST assumes shared database storage and requires memory mapping. Finally, the method and apparatus of the instant invention manage multiple BLAST job requests through a queuing system.
The Blackstone Technology Group has developed a parallel processing system that allows for BLAST processing on a compute farm. (“SmartBlast™—Version 1.0,” Blackstone Technology Group, http://www.computefarm.com/compute/SmartBlast2.pdf (2001)). Compute farms are large groups of servers that merge computing power into a single resource that is mainly used for long-running and memory-intensive applications, such as those that handle vast amounts of genetic information. The system, SmartBlast™, distributes previously created segments of BLAST reference datasets to servers in the compute farm, based on demand. The segments are created using a proprietary data segmentation tool, SmartCache™. (“SmartCache™—Version 2.0,” Blackstone Technology Group, http://www.computefarm.com/compute/SmartCache2.pdf). Results are then collected, merged, and sorted by high scoring pair and presented in a single document.
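The collection and merging of results from separately searched database segments may be sketched as a k-way merge of per-segment hit lists ordered by score. This is a generic illustration and not the SmartBlast™ implementation; the function name and the (score, subject identifier) record layout are assumptions, and scores are assumed to be directly comparable across segments.

```python
import heapq
from itertools import islice


def merge_hits(per_segment_hits, top_n=None):
    """Merge per-segment lists of (score, subject_id) into one list, best score first."""
    # Sort each segment's hits by descending score so the merge precondition holds.
    ordered = [sorted(hits, key=lambda h: h[0], reverse=True)
               for hits in per_segment_hits]
    merged = heapq.merge(*ordered, key=lambda h: h[0], reverse=True)
    return list(islice(merged, top_n))


if __name__ == "__main__":
    segment_1 = [(50, "seqA"), (10, "seqB")]
    segment_2 = [(70, "seqC"), (20, "seqD")]
    print(merge_hits([segment_1, segment_2]))
    print(merge_hits([segment_1, segment_2], top_n=2))
```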
The method and apparatus of the instant invention, as noted above, may be run on a wider class of machines/operating systems, including Windows and Macintosh, whereas the SmartBlast™ backend system only runs in a UNIX/Linux environment. In addition, in contrast to the apparatus and method disclosed herein, SmartBlast™ does not appear to divide up the input sequences. Finally, the apparatus and method of the instant invention allow for automatic partitioning of the databases during the search process, as well as in advance, based on the capabilities of the machines used for searching.
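Partitioning a database according to machine capability may be illustrated with a simple greedy heuristic: each database sequence, longest first, is assigned to the machine whose load relative to its capacity would be lowest after the assignment. The passage above does not specify how partitions are chosen, so this heuristic, the function name, and the use of a single relative capacity weight per machine are assumptions made for illustration.

```python
def partition_by_capacity(seq_lengths, capacities):
    """Assign sequence ids to machines roughly in proportion to capacity.

    seq_lengths: dict mapping sequence id to length in residues.
    capacities:  dict mapping machine name to a positive relative weight.
    """
    load = {machine: 0 for machine in capacities}
    parts = {machine: [] for machine in capacities}
    # Place the longest sequences first; ties are broken by id for determinism.
    for sid, length in sorted(seq_lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(capacities,
                     key=lambda m: ((load[m] + length) / capacities[m], m))
        parts[target].append(sid)
        load[target] += length
    return parts


if __name__ == "__main__":
    lengths = {"s1": 300, "s2": 300, "s3": 300}
    machines = {"big": 2, "small": 1}  # hypothetical relative capacities
    print(partition_by_capacity(lengths, machines))
```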