Homology search is a process of finding biologically similar sequences (that is, highly homologous sequences) within an existing group of sequences when, for example, an unknown DNA base sequence (hereinafter simply referred to as a “sequence”) has been found. In terms of general retrieval, the unknown sequence corresponds to a query and the existing group of sequences corresponds to a database.
In the homology search, a plurality of sequences are often input as queries in a single retrieval. The sequences do not depend on one another, and hence the retrieval can be performed independently on a sequence-by-sequence basis. Specifically, if N retrieval servers are installed, a query can be partitioned into N parts in units of sequences and the partitioned queries can be input separately into the respective servers. In this case, if the sequence lengths of the partitioned queries are almost equal to one another, retrieval performance higher by a factor of N can be obtained in theory, and a retrieval result is expected in 1/N of the retrieval time. Therefore, in many cases, the homology search is accelerated by partitioning a query and using a plurality of servers.
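The length-based partitioning described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `partition_by_length`, the greedy longest-first assignment, and the sample sequences are hypothetical and do not appear in the related art.

```python
import heapq

def partition_by_length(sequences, n_servers):
    """Split query sequences into n_servers groups of similar total length.

    Hypothetical greedy heuristic: take the sequences longest first and
    assign each one to the group whose total length is currently smallest.
    """
    # Heap entries are (total_length_so_far, group_index).
    heap = [(0, i) for i in range(n_servers)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_servers)]
    for seq in sorted(sequences, key=len, reverse=True):
        total, idx = heapq.heappop(heap)
        groups[idx].append(seq)
        heapq.heappush(heap, (total + len(seq), idx))
    return groups

# Illustrative sequences only (not real data).
queries = ["ACGTACGTAC", "GGGTTT", "ACGT", "TTGACA", "CCGGAATT"]
parts = partition_by_length(queries, 2)
part_lengths = [sum(len(s) for s in p) for p in parts]
print(part_lengths)
```

Each group would then be input to a separate retrieval server. The partition is made in units of sequences, so no sequence is split.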
For example, FIG. 1 is a diagram illustrating theoretical retrieval times before and after partition in the case that a query has been partitioned into two parts. In the drawing, partitioned queries 1 and 2 are queries obtained by partitioning a not-yet-partitioned query into two parts in units of sequences. Here, in the case that the retrieval time for the not-yet-partitioned query is T, if the sequence length of the partitioned query 1 is equal to the sequence length of the partitioned query 2, the retrieval times for the partitioned query 1 and the partitioned query 2 will each be T/2 in theory, and the retrieving process as a whole is expected to be accomplished in a time of T/2.
Incidentally, many algorithms exist for the homology search, such as Smith-Waterman, FASTA, and BLAST. Although each of these algorithms has its merits and demerits, BLAST is the most frequently used. This is because, although BLAST is lower in retrieval accuracy than the other algorithms, its retrieval time is shorter by one or more orders of magnitude. Even in the homology search based on the BLAST algorithm (hereinafter simply referred to as the “BLAST”), acceleration by partitioning a query and using a plurality of servers is common.
Japanese Laid-open Patent Publication No. 09-50438, Japanese Laid-open Patent Publication No. 2005-84973, and International Publication Pamphlet No. WO2002/090978 disclose related techniques.
However, with the BLAST, the retrieval time sometimes varies greatly depending on the details of each sequence even when the sequence lengths are the same, and it is difficult to predict the retrieval time in advance. According to an investigation conducted by the applicants of the present invention, the correlation between the length of the query sequence and the retrieval time was as low as about 0.6 to 0.7. In general, a correlation cannot be said to be high unless its value exceeds 0.9. Therefore, in the case that the query is partitioned on the basis of the sequence length, a considerable time difference may arise between the respective retrieving processes based on the partitioned queries.
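The text does not state how the correlation value was calculated. As a sketch, assuming the usual Pearson correlation coefficient is meant, such a value could be computed from pairs of sequence length and measured retrieval time as follows. The function name `pearson` and the sample figures are hypothetical illustrations, not the applicants' measurements.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustrative values: sequence lengths and retrieval times (seconds).
seq_lengths = [100, 200, 300, 400, 500]
retrieval_times = [1.2, 0.9, 3.5, 1.8, 3.0]
r = pearson(seq_lengths, retrieval_times)
print(round(r, 2))
```

With these made-up values the coefficient falls in the 0.6 to 0.7 range mentioned above, which illustrates how longer sequences can tend to take longer while individual retrieval times remain poorly predictable from length alone.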
FIG. 2 is a diagram illustrating examples of retrieval times in the case that two sequences having the same sequence length have been used as queries. In the drawing, it is assumed that although a sequence A is different from a sequence B in details, they are equal to each other in sequence length.
In the case that these sequences have been used as queries and retrieval has been performed on a sequence database X, it is not rare for the retrieval times of the sequences A and B to differ from each other by a factor of two or more (independence of the retrieval time from the sequence length). In addition, it may occur that the sequence A is shorter in retrieval time when the retrieval is performed on the sequence database X, while the sequence B is shorter in retrieval time when the retrieval is performed on another sequence database (a sequence database Y) (sequence-database-dependent inversion of retrieval time). As described above, since the retrieval time varies greatly depending on the combination of a query sequence and a sequence database, it is difficult with the BLAST to predict the retrieval time with high accuracy from the features of the query sequence.
However, since no method that is appropriate from a practical viewpoint is available, in many cases a query is partitioned such that the sequence lengths of the partitioned queries are equal to one another. As a result, a problem occurs in that a high parallelization effect cannot be obtained even when the number of retrieval servers is increased.
For example, FIG. 3 is a diagram illustrating retrieval times by the BLAST before and after partition in the case that a query has been partitioned into two parts on the basis of the sequence length. In the drawing, a retrieval time t2 of a partitioned query 2 is shorter than T/2. However, a retrieval time t1 of a partitioned query 1 is longer than T/2. Since the two retrievals are executed in parallel, the retrieval time as a whole is determined by the slower one and is t1, which is longer than T/2.
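The effect of such an imbalance can be illustrated with a short sketch. It assumes that the partitioned queries run fully in parallel on separate servers, so that the overall time equals the longest individual retrieval time. The function names and the numerical values are hypothetical.

```python
def parallel_time(part_times):
    """Overall time when all partitioned queries run in parallel."""
    return max(part_times)

def speedup(unpartitioned_time, part_times):
    """Speedup relative to running the unpartitioned query on one server."""
    return unpartitioned_time / parallel_time(part_times)

T = 100.0                 # hypothetical retrieval time before partition
ideal = [50.0, 50.0]      # FIG. 1 case: t1 = t2 = T/2
skewed = [70.0, 30.0]     # FIG. 3 case: t1 > T/2 > t2

print(speedup(T, ideal))   # 2.0
print(speedup(T, skewed))  # about 1.43
```

Although both cases use two servers, the skewed case falls well short of the theoretical factor of two.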
Incidentally, there is also a method of partitioning a query using the number of HSPs (High Scoring Pairs) instead of the sequence length. With the BLAST, a part where a query sequence perfectly matches a sequence in a sequence database in units of a specified character string length (HSP length) is found, and the retrieval is performed using that part as a base. This perfectly matched part is called an HSP, and the number of HSPs in a sequence database is called the HSP number.
The correlation between the HSP number and the retrieval time is very high, at about 0.95, and if query partition is performed on the basis of the HSP number (such that the HSP numbers are equal to each other), the variation in execution time between the partitioned queries can be reduced. However, in order to acquire the HSP number, it is necessary to collate each character string of the specified character string length taken out of the query sequence with each sequence in the sequence database. Therefore, a problem occurs in that the process of acquiring the HSP number becomes overhead (tens of percent of the homology search time in some cases).
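The collation described above can be sketched as follows, treating the HSP number simply as the count of exact matches of the specified character string length between the query and the database sequences. This is a simplified illustration of that definition and not the actual BLAST implementation. The names `words` and `hsp_number` and the sample sequences are hypothetical.

```python
from collections import Counter

def words(seq, w):
    """All substrings of length w in seq, in order of position."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]

def hsp_number(query, database, w):
    """Count exact length-w matches between a query and database sequences.

    Every word of every database sequence must be examined, which is why
    acquiring this number adds overhead before the search itself starts.
    """
    db_counts = Counter()
    for subject in database:
        db_counts.update(words(subject, w))
    # Counter returns 0 for words that never occur in the database.
    return sum(db_counts[word] for word in words(query, w))

# Illustrative sequences only (not real data).
database = ["TACGTT", "GGACGA"]
print(hsp_number("ACGTAC", database, 3))  # 4
```

In this sketch the database words are indexed once and can be reused for every query sequence, yet the whole database still has to be scanned, which corresponds to the overhead pointed out above.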
FIG. 4 is a diagram illustrating processing times by the BLAST before and after partition in the case that a query has been partitioned into two parts on the basis of the HSP number. In this case, the retrieval time of the partitioned query 1 and the retrieval time of the partitioned query 2 are each nearly T/2. However, since the partitioning process itself takes considerable time, the total processing time exceeds T/2.
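The two cases of FIG. 3 and FIG. 4 can be summarized in the following expressions. The symbol t_p for the time of the partitioning process and the primed symbols for the retrieval times after HSP-number-based partition are introduced here only for illustration and do not appear in the drawings.

```latex
% Partition by sequence length (FIG. 3): negligible partitioning time, unbalanced retrieval
T_{\mathrm{length}} = \max(t_1, t_2) = t_1 > \frac{T}{2}

% Partition by HSP number (FIG. 4): balanced retrieval, non-negligible partitioning time t_p
T_{\mathrm{HSP}} = t_p + \max(t_1', t_2') \approx t_p + \frac{T}{2} > \frac{T}{2}
```

In either case, the theoretical processing time of T/2 illustrated in FIG. 1 is not attained.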