The invention relates to a method for searching multiple query sequences against one or more sequence databases. More specifically, the invention relates to a computer-implemented method and apparatus that provide high-performance, high-speed, remotely accessible sequence comparison searches.
Sequence similarity is an observable quantity that may be expressed as, for example, a percentage. Comparison of newly identified sequences against known sequences often provides clues about the function of the sequences. If the sequence is a protein sequence, the sequence comparison may also provide clues as to the three-dimensional structure adopted by the protein sequence. Sequence similarity may also lead to inferences on the evolutionary relatedness, or the homology, of the sequences.
Current sequence databases are already immense and have continued to grow at an exponential rate. For example, the human genome project and other large scale nucleotide sequencing objectives have resulted in a large amount of sequence information available in both private and public databases. Sequence similarity searching is not simply used to compare a single sequence against the sequences in a single database, but is also used to compare or screen large numbers of new sequences against multiple databases. Moreover, sequence alignment and database searches are performed tens of thousands of times per day around the world. Therefore, the ability to quickly and precisely compare new sequence data against such sequence databases is becoming more and more important.
There are many different methods for comparing sequences. Some methods, such as those based on the analysis of transformational grammars (cf. Durbin, et al., Biological Sequence Analysis, Cambridge University Press (1998), Chapter 9), compare sequences by comparing the properties of the mathematical algorithms that may be used to generate the sequences in question. However, most common methods involve the use of sequence alignment at some point in the comparison process. Sequence alignment provides an explicit mapping between the residues of two or more sequences. When only two sequences are compared, the process is called pairwise alignment, but there are also methods of constructing multiple alignments that involve aligning more than two sequences.
The production of a sequence alignment result may be generically divided into two separate problems. The first problem is the alignment of the query sequence with the sequences in the databases being searched. The second problem is ranking or scoring of the aligned sequences. The results of the sequence alignment search are then reported as a ranked hit list followed by a series of individual sequence alignments, plus various scores and statistics.
There are various programs and algorithms available for performing database sequence similarity searching. For a basic discussion of bioinformatics and sequence similarity searching, see BIOINFORMATICS: A Practical Guide to the Analysis of Genes and Proteins, Baxevanis and Ouellette eds., Wiley-Interscience (1998) and Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Durbin et al., Cambridge University Press (1998). One of the first algorithms used for sequence alignment searching was incorporated into the FASTA program. (Lipman and Pearson, "Rapid and sensitive protein similarity searches," Science, Vol. 227, pp. 1435-1441 (1985); Pearson and Lipman, "Improved tools for biological sequence comparison," Proc. Natl. Acad. Sci., Vol. 85, pp. 2444-2448 (1988)). The FASTA program performs optimized searches for local alignments using a substitution matrix. To improve the speed of the search, the program uses an observed pattern of small matches, termed "word" hits, to identify potential matches before performing the more time-consuming optimization search.
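The word-hit pre-filter described above can be illustrated with a minimal Python sketch. This is not the FASTA implementation: the function names index_words and word_hits and the default word length of 3 are assumptions made for illustration. The sketch indexes every fixed-length word of a database sequence and reports where words of the query occur exactly, which is the kind of inexpensive screen that can precede a slower optimization step.

```python
from collections import defaultdict


def index_words(sequence, k=3):
    """Map each length-k word in `sequence` to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def word_hits(query, subject, k=3):
    """Return (query_pos, subject_pos) pairs where a length-k word matches exactly.

    Hypothetical helper for illustration; real programs also score and
    extend the hits, which is omitted here.
    """
    subject_index = index_words(subject, k)
    hits = []
    for q in range(len(query) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for s in subject_index.get(query[q:q + k], []):
            hits.append((q, s))
    return hits
```

For example, word_hits("ACGT", "TACGA") reports a single hit pairing query position 0 with subject position 1, where the word ACG occurs in both sequences.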
A popular algorithm for sequence similarity searching is the BLAST (Basic Local Alignment Search Tool) algorithm, which is employed in programs such as blastp, blastn, blastx, tblastn, and tblastx. (Altschul et al., "Local alignment statistics," Methods Enzymol., Vol. 266, pp. 460-480 (1996); Altschul et al., "Gapped BLAST and PSI-BLAST: A new generation of protein database search programs," Nucl. Acids Res., Vol. 25, pp. 3389-3402 (1997); Karlin et al., "Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes," Proc. Natl. Acad. Sci., Vol. 87, pp. 2264-2268 (1990); Karlin et al., "Applications and statistics for multiple high-scoring segments in molecular sequences," Proc. Natl. Acad. Sci., Vol. 90, pp. 5873-5877 (1993)). The approach used by the BLAST program is to first identify segments, with or without gaps, that are similar in a query sequence and a database sequence, then to evaluate the statistical significance of all such matches that are identified, and finally to summarize only those matches that satisfy a preselected threshold of significance.
The blastp program compares an amino acid query sequence against a protein sequence database, while the blastn program compares a nucleotide query sequence against a nucleotide sequence database. The blastx program compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database. A protein query sequence is compared against a nucleotide sequence database dynamically translated in all six reading frames (both strands) by the tblastn program, and tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. The program blastall, one of the implementations of BLAST, can be used to perform all five flavors of the BLAST comparison.
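The six-frame conceptual translation used by blastx, tblastn, and tblastx can be sketched in Python as follows. This is an illustrative sketch and not code from BLAST: the function names are hypothetical, the standard genetic code is assumed, and any codon containing a character other than A, C, G, or T is rendered as X.

```python
from itertools import product

# Standard genetic code, listed with bases in TCAG order (assumed here).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unrecognized codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for frames +1, +2, +3, then -1, -2, -3."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]
```

Under these assumptions, six_frame_translations("ATGGCC") yields the three forward-strand translations MA, W, and G, followed by the three reverse-strand translations GH, A, and P.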
The BLAST program can be downloaded from the NCBI and run locally as a full executable. It can be used to run BLAST searches against private local databases or downloaded copies of the NCBI databases. Versions 1.4 and later of BLAST are capable of being run in parallel using shared memory multiprocessors. (N. Camp, "High-Throughput BLAST," Silicon Graphics, Inc., September 1988, www.sgi.com/chembio/resources/papers/HTBlast/HT_Whitepaper.html)
Silicon Graphics, Inc. ("SGI") has developed an alternative parallel system for running multiple BLAST searches. (N. Camp, "High-Throughput BLAST," Silicon Graphics, Inc., September 1988, sgi.com/chembio/resources/papers/HTBlast/HT_Whitepaper.html). The system, called High-Throughput BLAST ("HT BLAST"), consists of a modified BLAST executable and a driver. HT BLAST allows multiple sequences to be compared against multiple databases with only a single invocation of code. The output of HT BLAST is a summary of the High Scoring Pair information generated during the search. Through a single invocation of code, HT BLAST saves on startup overhead through the reuse of data structures and elimination of the need to remap the databases. HT BLAST also removes all parallel constructs from BLAST, allowing for increased single-processor speed. Parallelism is instead relocated to the driver, which distributes blocks of sequences to multiple processors running HT BLAST. HT BLAST uses a dynamically scheduled loop to maintain load balance. Because the independent tasks are blocks of sequences compared to multiple databases, the parallel grain size can be much greater than it is for unmodified BLAST. Thus, scaling to large numbers of processors is accomplished even for short sequences and small databases.
HT BLAST, however, is run on a single multiprocessor mainframe. The method and apparatus of the instant invention allow a sequence similarity searching program, such as the BLAST executable, to be run on multiple, networked, heterogeneous machines. Moreover, HT BLAST does not allow for dividing up collections of databases both by treating individual databases separately and by partitioning the individual databases. The method and apparatus of the instant invention do not require a shared disk architecture, whereas HT BLAST assumes shared database storage and requires memory mapping. Finally, the method and apparatus of the instant invention manage multiple BLAST job requests through their queuing system.
The Blackstone Technology Group has developed a parallel processing system that allows for BLAST processing on a compute farm. ("SmartBlast(trademark) - Version 1.0," Blackstone Technology Group, computfarm.com/compute/SmartBlast2.pdf (2001)). Compute farms are large groups of servers that merge computing power into a single resource that is mainly used for long-running and memory-intensive applications, such as those that handle vast amounts of genetic information. The system, SmartBlast(trademark), distributes previously created segments of BLAST reference datasets to servers in the compute farm, based on demand. The segments are created using a proprietary data segmentation tool, SmartCache(trademark) ("SmartCache(trademark) - Version 2.0," Blackstone Technology Group, computefarm.com/compute/SmartCache2.pdf). Results are then collected, merged, and sorted by high scoring pair and presented in a single document.
The method and apparatus of the instant invention, as noted above, may be run on a wider class of machines/operating systems, including Windows and Macintosh, whereas the SmartBlast(trademark) backend system only runs in a UNIX/Linux environment. In addition, in contrast to the apparatus and method disclosed herein, SmartBlast(trademark) does not appear to divide up the input sequences. Finally, the apparatus and method of the instant invention allow for automatic partitioning of the databases during the search process, as well as in advance, based on the capabilities of the machines used for searching.
The invention relates to a computer-implemented method and apparatus for searching a plurality of query sequences against at least one sequence database containing a plurality of sequence records. The method comprises the steps of:
a. partitioning the plurality of query sequences into a set of smaller subsets of query sequences;
b. partitioning the at least one sequence database into a set of smaller subdatabases;
c. designating searching tasks to be performed by associating each of said subsets of query sequences with one or more of said subdatabases, assigning each searching task to one of a group of computers operating in parallel, wherein each member of the group of computers operating in parallel has at least one searching task assigned thereto, and executing at least some of the assigned searching tasks using the group of computers operating in parallel; and
d. collecting search results from the executed searching tasks and generating a unified sequence search result in accordance with the collected search results.
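Steps a through d above can be sketched in Python under simplifying assumptions. The sketch is illustrative only and is not the claimed implementation: fixed-size chunking, round-robin assignment, the toy_score function, and all function names are hypothetical, and the "computers" are simulated by a sequential loop rather than networked machines.

```python
from itertools import product


def chunk(items, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def designate_tasks(queries, database, query_chunk, db_chunk):
    """Steps a-c: pair every subset of queries with every subdatabase."""
    return list(product(chunk(queries, query_chunk), chunk(database, db_chunk)))


def assign_round_robin(tasks, computers):
    """Step c: give each task to one computer (assumed round-robin policy)."""
    assignment = {name: [] for name in computers}
    for i, task in enumerate(tasks):
        assignment[computers[i % len(computers)]].append(task)
    return assignment


def toy_score(query, record):
    """Stand-in for a real search: count of positions with identical residues."""
    return sum(a == b for a, b in zip(query, record))


def run_search(queries, database, computers, query_chunk=2, db_chunk=2):
    """Execute all tasks (sequentially here) and build one unified hit list."""
    tasks = designate_tasks(queries, database, query_chunk, db_chunk)
    collected = []
    for _name, its_tasks in assign_round_robin(tasks, computers).items():
        for query_subset, subdatabase in its_tasks:
            collected.extend(
                (q, r, toy_score(q, r)) for q in query_subset for r in subdatabase
            )
    # Step d: unify by ranking every collected hit on its score, best first.
    return sorted(collected, key=lambda hit: hit[2], reverse=True)
```

With three queries, three database records, and chunk sizes of two, the sketch designates four searching tasks, so two simulated computers each receive two tasks and the unified result contains all nine query-record comparisons.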
Also disclosed is an apparatus for performing the above method, wherein the apparatus comprises:
a. means for partitioning the plurality of query sequences into a set of smaller subsets of query sequences;
b. means for partitioning the at least one sequence database into a set of smaller subdatabases;
c. means for designating searching tasks to be performed by associating each of said subsets of query sequences with one or more of said subdatabases;
d. means for assigning each searching task to one of a group of computers operating in parallel, wherein each member of the group of computers operating in parallel has at least one searching task assigned thereto;
e. means for executing at least some of the assigned searching tasks using the group of computers operating in parallel;
f. means for collecting search results from the executed searching tasks; and
g. means for generating a unified sequence search result in accordance with the collected search results.
The invention also relates to the above method and apparatus, wherein the partitioning of the query sequences and the partitioning of the sequence database are done by each member of the group of computers operating in parallel. In addition, the method may also be performed wherein the partitioning of the query sequences and the partitioning of the sequence database are based on the processing capacity of each member of the group of computers operating in parallel, and each member of the group of computers operating in parallel may assign to itself which searching tasks it will perform. Each of the group of computers operating in parallel may perform one, two, or more searching tasks during the execution of the search, and each member may assign to itself another task once it finishes a searching task. The process may be repeated until all of the searching tasks are performed.
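The self-assignment of searching tasks can be illustrated with a shared task list from which each worker pulls its next task as soon as it finishes the previous one. The following Python sketch is an assumption-laden illustration, not the disclosed apparatus: threads within one process stand in for networked computers, and the names worker and run_self_scheduled are hypothetical.

```python
import queue
import threading


def worker(name, tasks, results, search):
    """Repeatedly take the next available task until the task list is empty."""
    while True:
        try:
            task = tasks.get_nowait()  # the worker assigns itself a task
        except queue.Empty:
            return  # no searching tasks remain
        results.put((name, task, search(task)))


def run_self_scheduled(task_list, worker_names, search):
    """Run `search` over every task using self-scheduling workers.

    Returns a list of (worker_name, task, result) tuples in completion order.
    """
    tasks = queue.Queue()
    for task in task_list:
        tasks.put(task)
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(name, tasks, results, search))
        for name in worker_names
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    # All workers have exited, so the result count is stable.
    return [results.get() for _ in range(results.qsize())]
```

Because each worker takes a new task only when it is free, faster workers naturally complete more tasks, which is one simple way of matching the work to the capacity of each machine.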
Each of the group of computers operating in parallel may be the same or different, and each of the group may have the same or different operating systems. Moreover, if one of the computers operating in parallel should fail, the correctness and/or precision of the search results will not be affected.
One or more of the sequence databases against which the query sequence is being compared may be derived from the databases maintained by the National Center for Biotechnology Information (NCBI). The plurality of query sequences are searched against one or more sequence databases, and each of the sequence databases may or may not be split into a set of smaller databases. The sequence databases may be searched using any desired algorithm, such as the BLAST algorithm. The unified sequence search result may be a sequence alignment. If the unified sequence search result is a sequence alignment, a raw score may be reported as part of the result. In addition, an e-score may also be reported as part of the search result, and the e-score may be normalized for each database searched as part of the generation of the unified search result. Moreover, the unified search result may be reported as a unified relevance ranked result list based on the normalized e-score.
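One possible way to normalize e-scores across subdatabases is sketched below in Python. The sketch assumes that the e-score grows in proportion to the size of the database searched, as in Karlin-Altschul statistics where the expected number of chance hits is proportional to the product of query length and database length. The rescaling rule, the function names, and the (hit_id, e_score, searched_size) tuple layout are assumptions for illustration and are not taken from the disclosure.

```python
def normalize_e_score(e_score, searched_size, total_size):
    """Rescale an e-score computed against a subdatabase to the full database.

    Assumes the e-score is proportional to the size of the database searched.
    """
    return e_score * total_size / searched_size


def unified_ranking(hits, total_size):
    """Rank hits from different subdatabases by normalized e-score.

    `hits` holds (hit_id, e_score, searched_size) tuples; the most
    significant hit (smallest normalized e-score) is listed first.
    """
    normalized = [
        (hit_id, normalize_e_score(e_score, searched_size, total_size))
        for hit_id, e_score, searched_size in hits
    ]
    return sorted(normalized, key=lambda hit: hit[1])
```

Under this assumption, a hit with an e-score of 0.5 against a subdatabase one quarter the size of the whole database is treated as having an e-score of 2.0 in the unified ranking, so hits found in small partitions are not ranked as artificially significant.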
The search results of each individual task may be collected by a single computer or by two or more computers of the group of computers operating in parallel. The unified search result may then be generated by interleaving the search results from the executed searching tasks on the basis of raw scores generated during the executed searching tasks. The method and the apparatus of the invention allow for superlinear speedup in the production of the unified search result, based on the total time required to execute all searching tasks and produce the unified search result. This total time is equal to the duration of the period starting when the entire searching task is placed on a list of searching tasks accessible to all of the one or more computers operating in parallel and ending when the unified result for the entire searching task is placed on a list of results and a signal to exit has been sent to all of the computers operating in parallel. Superlinear speedup occurs when an increase in the number of computers operating in parallel causes a greater than pro rata reduction in the total time, as when the time required using four computers operating in parallel is less than one-half of the time required with two computers operating in parallel.
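Interleaving by raw score and the superlinear speedup criterion can both be expressed compactly. The Python sketch below is illustrative only: it assumes each task returns its hits as (raw_score, hit_id) tuples already sorted from highest to lowest raw score, and the function names are hypothetical.

```python
import heapq


def interleave_by_raw_score(result_lists):
    """Merge per-task hit lists into one list ordered by descending raw score.

    Each input list must already be sorted from highest to lowest raw score.
    """
    return list(heapq.merge(*result_lists, key=lambda hit: hit[0], reverse=True))


def is_superlinear(time_with_n, n, time_with_m, m):
    """Return True if moving from n to m computers (m > n) reduces the total
    time by more than the pro rata factor n / m."""
    return time_with_m < time_with_n * n / m
```

Using the example in the text, if two computers need 100 seconds, four computers show superlinear speedup only if they finish in less than 50 seconds, so is_superlinear(100.0, 2, 45.0, 4) is true while is_superlinear(100.0, 2, 60.0, 4) is false.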