Certain analytical processes that operate on data stored in massive databases can require queries against the entire database even though only segments of the database are being analyzed at any given moment. One significant problem associated with such analytical processes is the sheer size of the massive databases. For sequential processing of such massive databases, the time to complete the analysis is often too long to be of practical use. For distributed processing of such massive databases, the time to communicate the typically required information to a plurality of processing systems is often too long, making the process inefficient or infeasible.
One such analytical process in the field of life sciences that utilizes a massive database is a processing algorithm known as BLAST (Basic Local Alignment Search Tool), which is available from NIH. BLAST is a heuristic search algorithm that analyzes gene sequences that are part of a massive gene library. The BLAST software code forms the analytical basis of a number of search programs, namely blastp, blastn, blastx, tblastn and tblastx. The following is a brief summary of these BLAST program variations:
blastp: compares an amino acid query sequence against a protein sequence database;
blastn: compares a nucleotide query sequence against a nucleotide sequence database;
blastx: compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database;
tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands); and
tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.
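The "six-frame translation" used by blastx, tblastn and tblastx can be illustrated with a minimal Python sketch. It is not taken from the BLAST source code. It assumes the standard genetic code, and the function names (reverse_complement, translate, six_frame_translations) and the frame labels are hypothetical, chosen only for this illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq: str) -> dict:
    """Translate three reading frames on each of the two strands."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Hypothetical labels: "+1".."+3" forward, "-1".."-3" reverse.
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Each nucleotide sequence thus yields six candidate protein sequences, which is why the translated searches compare considerably more data than blastn or blastp for the same input.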
These BLAST programs ascribe significance to their findings using statistical methods and are tailored for sequence similarity searching, for example, to identify homologs to a query sequence. BLAST can use a few different input file formats, one of which is the FASTA format. The FASTA files for BLAST typically contain nucleotide, protein or amino acid data in the form of sequences. BLAST needs two sets of files for a typical run: the query and the database. The query is typically a standard FASTA file, and the database is typically a set of three files that are created from a single FASTA file by the format database (formatdb) utility that is part of the BLAST software available from NIH. All four of these files are used by BLAST to produce a result file. Depending on the version of BLAST being run, a scoring file may also be required. The BLAST result file has an application-specific format that consists of header information, sequence scoring summaries, sequence details and some overall scoring data. Additional information concerning BLAST processing is available on the NIH website at www.ncbi.nlm.nih.gov/blast (URL as of February 2002).
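As an illustration of the FASTA input described above, the following minimal Python sketch reads FASTA-formatted text into a mapping of identifiers to sequences. It assumes the common convention that each record begins with a ">" header line followed by one or more sequence lines. The function name parse_fasta and the sample records are hypothetical, and the sketch does not reproduce the formatdb utility or any BLAST file format other than plain FASTA.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records as {identifier: sequence}."""
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token as the identifier.
            parts = line[1:].split()
            header = parts[0] if parts else ""
            records[header] = ""
        elif header is not None:
            # Sequence data may be wrapped across several lines.
            records[header] += line
    return records


# Hypothetical sample input: one nucleotide and one protein record.
example = """>query1 hypothetical example record
ATGGCGTACG
TTAGC
>query2
MKVLAAGIV
""".splitlines()

records = parse_fasta(example)
```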
Modified BLAST processing algorithms are also available from NIH, such as a processing algorithm known as PSI-BLAST (Position Specific Iterative BLAST). PSI-BLAST refers to a feature of BLAST 2.0 in which a profile (or position specific scoring matrix, PSSM) is automatically constructed from a multiple alignment of the highest scoring hits in an initial BLAST search. The PSSM is generated by calculating position-specific scores for each position in the alignment. Highly conserved positions receive high scores and weakly conserved positions receive scores near zero. The profile is used to perform a second BLAST search and any further searches, and the results of each “iteration” are used to refine the profile. This iterative searching strategy results in increased sensitivity.
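The idea of position-specific scoring can be sketched in Python as follows. This is a simplified illustration and not the PSI-BLAST implementation: it assumes a four-letter nucleotide alphabet, a uniform background frequency, a single pseudocount, and log-odds scores, and it omits the sequence weighting and other refinements of the actual program. The function name build_pssm and the sample alignment are hypothetical.

```python
import math
from typing import Dict, List

ALPHABET = "ACGT"  # assumption: nucleotides, to keep the example small


def build_pssm(alignment: List[str], pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Return one {letter: log-odds score} dict per alignment column."""
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")

    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    n = len(alignment)
    pssm = []
    for col in range(length):
        column = [seq[col] for seq in alignment]
        scores = {}
        for letter in ALPHABET:
            # Smoothed observed frequency of the letter in this column.
            freq = (column.count(letter) + pseudocount * background) / (n + pseudocount)
            scores[letter] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


# Hypothetical alignment: column 0 is fully conserved, column 1 is not.
alignment = ["AACG", "ACCG", "AGCG", "ATCG"]
pssm = build_pssm(alignment)
```

In this sketch the conserved first column gives a clearly positive score for "A" and negative scores for the letters that never appear, while the second column, where every letter appears once, gives scores of zero, which matches the behavior described above.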
During the BLAST or PSI-BLAST sequence analysis, the software code analyzes a query sequence against a particular segment of the gene library and makes queries that rely upon the entire gene library. Problematically, the entire BLAST sequence database will often exceed 2–3 gigabytes of data. Two results of the BLAST processing that are often used by the scientists and the non-profit or for-profit organizations that conduct the BLAST processing are the score and the “expectation value.” The score represents a scoring mechanism that accounts for the length of an identified pair of similar sequences, balanced by any differences between the two sequences (as in an imperfect, but still related, matching pair). The expectation value is generally of greater interest and represents the expected number of pair-wise alignments of related sequences with a given score. The expectation value offers a measure of significance for a pair of related sequences compared to other pairs of related sequences.
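The text above does not give a formula for the expectation value. For illustration only, the following Python sketch uses the widely cited Karlin-Altschul form, E = K·m·n·e^(−λS), where S is the score, m is the query length and n is the total database length. The default values of the parameters lam and k are illustrative placeholders, and the function name expectation_value and the lengths used are hypothetical. The sketch is meant to show one way in which a significance measure can depend on the size of the entire database rather than on a single segment.

```python
import math


def expectation_value(score: float, query_len: int, db_len: int,
                      lam: float = 0.318, k: float = 0.134) -> float:
    """Karlin-Altschul style estimate: E = K * m * n * exp(-lambda * S).

    lam and k are placeholder values for illustration; real values
    depend on the scoring system in use.
    """
    return k * query_len * db_len * math.exp(-lam * score)


# Same score and query, evaluated against a hypothetical segment and
# against a hypothetical full database one hundred times larger.
segment_e = expectation_value(60, 250, 1_000_000)
full_e = expectation_value(60, 250, 100_000_000)
ratio = full_e / segment_e
```

Under these assumptions the expectation value grows in proportion to the database length, so a value computed from a segment alone would differ from the value computed against the whole database.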
FIG. 1A (prior art) is a block diagram that represents one prior technique for decreasing the processing time by utilizing a plurality of different client machines to help process the data segments. Within the system 100, each client 112A, 112B . . . 112C receives a respective segment 116A, 116B . . . 116C of the sequence database 110 and processes that segment using the BLAST software code or some modification of that code. During this BLAST processing, each of the clients 112A, 112B . . . 112C makes queries 118A, 118B . . . 118C that require having access to the entire BLAST sequence database 110, which is available to each of the clients 112A, 112B . . . 112C. After processing each segment, the clients 112A, 112B . . . 112C provide results 120A, 120B . . . 120C to a result database 114. These results include the expectation values that are typically utilized, as indicated above. One significant problem with this technique is that each client must have direct access to the entire BLAST sequence database 110 during the BLAST processing. Because of the massive size of the entire BLAST sequence database, it becomes prohibitive to consider downloading a copy of the database to each client machine. Thus, this multiple-client configuration typically requires the use of a relatively small number of closely interconnected client machines that can rapidly access the entire BLAST sequence database.
FIG. 1B (prior art) is a block diagram that represents a network-based technique for using numerous broadly distributed computers to perform partial calculations without requiring direct access to the entire BLAST database. Within the system 140, the pre-processing and server systems 154 have access to the entire BLAST sequence database through interface 162 and can generate segment and query sequence (QS) databases 158 that include sequence segments and query sequences that will be processed by the clients 112A, 112B . . . 112C. The server systems 154 communicate with the clients 112A, 112B . . . 112C through the network 152, which can be any of a wide variety of networks or interconnected network structures, including the Internet. The server systems 154 transfer segments and query sequences (QS) 116 through the network so that each client receives segments and query sequences represented by lines 116A, 116B and 116C, respectively. The clients 112A, 112B . . . 112C can then conduct partial BLAST processing on these segments, but cannot complete the processing because they lack the ability to perform the necessary queries to the entire BLAST sequence database 110. Thus, only partial results 150A, 150B . . . 150C are sent back through the network and ultimately to the server systems 154, as represented by line 150. These partial BLAST calculations can then be stored in a result database 160. As represented by line 164, the partial result data can be passed along to an additional BLAST processing system 156, which has direct access to the entire BLAST sequence database. By using the partial calculations and by making queries along line 118 associated with those partial calculations, the additional BLAST processing system 156 can then derive the desired results of the BLAST processing, such as producing the expectation value associated with a given gene sequence.
One problem with this approach is that it requires significant additional BLAST processing to be conducted with respect to the partial result data produced by the client systems.