Proteins in the natural world have been selected through evolution and have come to exhibit specific functions. These functions are known to depend on the three-dimensional structures of the proteins. Thus, if the three-dimensional structure of a protein can be predicted, its functions become predictable as well.
To investigate a protein about which no information has yet been obtained, methods have long been sought for inferring or predicting its tertiary structure by computationally determining its similarity to proteins whose tertiary structures are already known. One powerful method of this kind is the comparison of protein profile matrices (Rychlewski L, Jaroszewski L, Li W, Godzik A. Protein Sci. February (2000); 9(2): 232-41).
Here, a protein profile matrix is a matrix obtained by converting, for every amino acid residue location, the occurrence frequencies of the amino acid types in related proteins (a protein family etc.) into numerical values. The matrix is usually formed through the following steps. First, given a multiple alignment in which the amino acid sequences of a plurality of related proteins are juxtaposed, the number of occurrences of each of the 20 amino acid types is counted at every residue location in the multiple alignment. The counts are then normalized into occurrence probabilities. In doing so, the counts are revised with weights that depend on the mutual similarities between the amino acid sequences that are members of the given multiple alignment. A profile matrix is thereby formed.
Here, a multiple alignment is obtained by juxtaposing the amino acid sequences of a plurality of biologically related proteins so that the amino acid residues considered to correspond to each other are aligned. A multiple alignment may be readily prepared, for example, by using the existing program PSI-BLAST (Altschul et al., Nucleic Acids Res. (1997) 25(17):3389-3402) to search a sequence database with a certain sequence as a query, or by using the existing program CLUSTALW (Thompson J. D., Higgins D. G., Gibson T. J. (1994) Nucleic Acids Res. 22:4673-4680) with a group of amino acid sequences of a plurality of biologically related proteins as input. It may also be prepared based on the results of tertiary structure comparison and the like.
Table 1 schematically shows a multiple alignment prepared on the basis of a protein whose amino acid sequence has a length n (the number of amino acid residues). In Table 1, the first column shows the names of the proteins, the numbers "1 to n" in the first row designate the locations of amino acid residues in the multiple alignment, and each letter is the one-letter code of an amino acid type.
TABLE 1
                   1  2  3  4  5  6  7  8  . . .  n
20807455/14-218    M  I  D  H  T  L  L  K  . . .  G
19551629/13-215    I  L  D  Y  T  L  L  G  . . .  A
16974933/15-229    L  M  D  L  T  T  L  N  . . .  A
16120769/20-234    L  M  D  L  T  T  L  N  . . .  A
Although amino acids are present at all of the illustrated residue locations in the example of Table 1, a gap may be designated by "•" (a dot) where a location is not occupied by a corresponding amino acid residue. Table 2 schematically shows a profile matrix of length n formed from the multiple alignment of Table 1. In Table 2, the first column shows the types of amino acids (which may include gaps) and the numbers "1 to n" in the first row designate the locations of amino acid residues in the profile matrix.
TABLE 2
AA/Pos.   1     2     3     . . .  n
A         0.00  0.00  0.00  . . .  0.71
R         0.00  0.00  0.00  . . .  0.00
N         0.00  0.00  0.00  . . .  0.00
D         0.00  0.00  0.96  . . .  0.00
C         0.00  0.00  0.00  . . .  0.00
Q         0.00  0.00  0.00  . . .  0.00
E         0.00  0.00  0.04  . . .  0.00
G         0.00  0.00  0.00  . . .  0.29
H         0.00  0.00  0.00  . . .  0.00
I         0.29  0.29  0.00  . . .  0.00
L         0.41  0.29  0.00  . . .  0.00
K         0.00  0.00  0.00  . . .  0.00
M         0.29  0.41  0.00  . . .  0.00
F         0.00  0.00  0.00  . . .  0.00
P         0.00  0.00  0.00  . . .  0.00
S         0.00  0.00  0.00  . . .  0.00
T         0.00  0.00  0.00  . . .  0.00
W         0.00  0.00  0.00  . . .  0.00
Y         0.00  0.00  0.00  . . .  0.00
V         0.01  0.01  0.00  . . .  0.00
Each column in the profile matrix shows the distribution of occurrence probabilities of all types of amino acids at one residue location in a plurality of related proteins. Table 3 schematically shows the profile column at residue location "2" in the profile matrix shown in Table 2.
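TABLE 3
Location 2 (components listed in the amino acid order of Table 2)
A     R     N     D     C     Q     E     G     H     I
0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.00  0.29
L     K     M     F     P     S     T     W     Y     V
0.29  0.00  0.41  0.00  0.00  0.00  0.00  0.00  0.00  0.01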
Thus, at residue location "2" in the profile matrix shown in Table 2, for example, the revised occurrence probability of alanine (A) is 0.00 and the revised occurrence probability of methionine (M) is 0.41.
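The counting and normalization steps described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce Table 2: the function name `build_profile` and its `weights` parameter are hypothetical, the similarity-dependent weighting scheme is not specified in the text and so uniform weights are assumed by default, and only the first eight columns of the four sequences in Table 1 are used. The resulting values therefore differ from those in Table 2.

```python
# Amino acid order follows the rows of Table 2.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_profile(alignment, weights=None):
    """Return a profile as a list of columns, one per residue location.

    Each column is a list of 20 occurrence probabilities in AMINO_ACIDS order.
    `weights` (hypothetical parameter) holds one non-negative weight per
    sequence; uniform weights are assumed when it is omitted.
    """
    if weights is None:
        weights = [1.0] * len(alignment)
    length = len(alignment[0])
    profile = []
    for pos in range(length):
        counts = dict.fromkeys(AMINO_ACIDS, 0.0)
        for seq, w in zip(alignment, weights):
            residue = seq[pos]
            if residue in counts:  # gaps and unknown symbols are skipped
                counts[residue] += w
        total = sum(counts.values())
        # Normalize the weighted counts into occurrence probabilities.
        profile.append([counts[a] / total if total else 0.0 for a in AMINO_ACIDS])
    return profile


# First eight columns of the four sequences shown in Table 1.
alignment = ["MIDHTLLK", "ILDYTLLG", "LMDLTTLN", "LMDLTTLN"]
profile = build_profile(alignment)
column_1 = dict(zip(AMINO_ACIDS, profile[0]))
print(column_1["M"], column_1["I"], column_1["L"])  # 0.25 0.25 0.5
```

Passing a non-uniform `weights` list down-weights near-duplicate sequences such as the last two rows of Table 1, which is the role the text assigns to the similarity-dependent revision.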
Conventionally, Dynamic Programming (Needleman S B, Wunsch C D, J Mol Biol. (1970) March; 48(3): 443-53) has been employed to compare and/or align two profile matrices or two amino acid sequences. When preparing an alignment of a pair of amino acid sequences or a pair of profile matrices, it is necessary to determine which residues or profile columns should be placed in correspondence with each other (a residue may also be placed against a gap), and there are a great number of possible ways of doing so. Dynamic Programming is an algorithm that automatically and efficiently finds, among these, the correspondence that maximizes the similarity score. The result of that correspondence is the alignment that is ultimately sought.
In the case of a usual amino acid sequence comparison, Dynamic Programming takes as inputs the two amino acid sequences to be compared and a score matrix consisting of similarity scores (marks indicating degrees of similarity) for the respective pairs of residues of the two sequences. In the case of a profile matrix comparison, it takes as inputs the two model amino acid sequences to be compared and a score matrix consisting of similarity scores for the respective pairs of profile columns of the two profile matrices. From these inputs, Dynamic Programming outputs an alignment of the pair of amino acid sequences (or, for a profile matrix comparison, the pair of model amino acid sequences) together with its final score, that is, the score obtained by finding the optimal path that gives the maximum similarity score.
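As an illustration of how Dynamic Programming consumes a score matrix, the following is a minimal sketch of a Needleman-Wunsch style global alignment. It is not taken from the cited reference: the function name `global_align` is hypothetical, and the linear gap penalty `gap` is an assumption, since the text does not state how gaps are scored.

```python
def global_align(score, gap=-1.0):
    """Global alignment over a precomputed score matrix.

    score[p][q] is the similarity between position p of the first input
    and position q of the second. `gap` is an assumed linear gap penalty.
    Returns (final_score, pairs), where pairs lists (p, q) index pairs and
    None marks a gap on that side.
    """
    n, m = len(score), len(score[0])
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Border cells: a prefix aligned entirely against gaps.
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + gap
    # Fill: each cell keeps the best of match, gap in second, gap in first.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + score[i - 1][j - 1],
                best[i - 1][j] + gap,
                best[i][j - 1] + gap,
            )
    # Traceback from the bottom-right cell recovers one optimal path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or best[i][j] == best[i - 1][j] + gap):
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return best[n][m], pairs


# Toy score matrix: 2 positions against 3 positions.
final_score, pairs = global_align([[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]], gap=-0.5)
print(final_score, pairs)  # 1.5 [(0, 0), (None, 1), (1, 2)]
```

The same routine serves both cases described above, because it only sees the score matrix: whether the entries came from residue pairs or from profile column pairs is decided by whoever builds `score`.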
Thus, in order to compare profile matrices by a method employing Dynamic Programming, it is necessary to form a score matrix which measures with high accuracy the similarity between two profile matrices to be compared.
One known method for calculating a score matrix that indicates the degree of similarity between two profiles is the method developed by Rychlewski et al. (Rychlewski et al., Protein Sci. (2000) 9: 232-241). This method comprises calculating, as the similarity score between a pair of profile columns to be aligned, the dot product of that pair of profile columns, and then forming a score matrix between the two profile matrices to be compared.
For example, given two profile matrices $X = x_1 x_2 \ldots x_p \ldots x_n$ (wherein $x_p$ designates the profile column at residue location $p$) and $Y = y_1 y_2 \ldots y_q \ldots y_m$ (wherein $y_q$ designates the profile column at residue location $q$), the similarity score $D_{pq}$ between profile column $x_p$ and profile column $y_q$, which is a component of a score matrix of n rows and m columns, is represented by the following equation:

$$D_{pq} = \sum_{a=1}^{j} x_{pa} \, y_{qa}$$

wherein $x_{pa}$ designates a component of the profile column $x_p$,
$y_{qa}$ designates a component of the profile column $y_q$, and
$j$ is the number of components in a profile column (usually 20).
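The dot-product score matrix can be sketched in a few lines. This is only an illustration of the equation above: the helper names `column` and `dot_product_scores` are hypothetical, and the two columns used are the columns at locations 2 and 3 of Table 2.

```python
# Amino acid order follows the rows of Table 2.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column(**probs):
    """Build a 20-component profile column; unspecified components are 0.0."""
    return [probs.get(a, 0.0) for a in AMINO_ACIDS]


def dot_product_scores(profile_x, profile_y):
    """Return D with D[p][q] = sum over a of x[p][a] * y[q][a]."""
    return [
        [sum(xa * ya for xa, ya in zip(x_col, y_col)) for y_col in profile_y]
        for x_col in profile_x
    ]


location_2 = column(I=0.29, L=0.29, M=0.41, V=0.01)  # Table 2, location 2
location_3 = column(D=0.96, E=0.04)                  # Table 2, location 3

# Compare a two-column profile with itself: a 2 x 2 score matrix.
scores = dot_product_scores([location_2, location_3], [location_2, location_3])
print(round(scores[0][0], 4))  # 0.3364
print(round(scores[0][1], 4))  # 0.0
print(round(scores[1][1], 4))  # 0.9232
```

Even when a column is compared with itself, the score depends on how concentrated its distribution is: the nearly invariant location 3 scores 0.9232, whereas the more variable location 2 scores only 0.3364.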
According to the above-described method, when both profile columns of a pair to be aligned contain only a very limited set of residue types and show little amino acid substitution, the dot product takes a high value and thus gives a high similarity score. Such a residue location, which is highly conserved because only a few residue types occur and little amino acid substitution takes place, is considered to be conserved because of functional or physicochemical requirements in vivo, and hence to be a biologically important location. In such regions, the above-described method is considered capable of measuring similarity with high accuracy.
However, while the above-described method can measure with high accuracy a location where the types of occurring residues are limited, it cannot measure with high accuracy a region that undergoes heavy amino acid substitution but appears to share a common substitution pattern, even though such a region is biologically important. Examples include a non-conserved location within a motif, a location that is exposed in the tertiary structure of a protein and is significant only in its polarity, and, conversely, a location that is buried in the tertiary structure and is conserved only with regard to its hydrophobicity.
In addition, because the average of all the components (similarity scores) in a score matrix was required to be negative and their standard deviation to be almost constant, the similarity scores had to be normalized, which made the method cumbersome.
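The limitation can be made concrete with a small constructed example. The profile columns below are invented for illustration and do not come from the tables or from the cited reference, and the helper names are hypothetical. Two buried locations that both admit only hydrophobic residues, but different ones, receive the same dot-product score as a hydrophobic location paired with an acidic one.

```python
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def column(**probs):
    """Build a 20-component profile column; unspecified components are 0.0."""
    return [probs.get(a, 0.0) for a in AMINO_ACIDS]


def dot(x_col, y_col):
    """Dot-product similarity score between two profile columns."""
    return sum(xa * ya for xa, ya in zip(x_col, y_col))


# Illustrative columns (assumed values, not taken from Table 2).
conserved_asp = column(D=1.0)       # strictly conserved aspartate
buried_a = column(I=0.5, L=0.5)     # hydrophobic residues only
buried_b = column(V=0.5, M=0.5)     # hydrophobic, but different residues
exposed = column(D=0.5, E=0.5)      # acidic residues only

print(dot(conserved_asp, conserved_asp))  # 1.0
print(dot(buried_a, buried_b))            # 0.0
print(dot(buried_a, exposed))             # 0.0
```

Because the dot product rewards only identical residue types, the shared hydrophobic character of `buried_a` and `buried_b` contributes nothing to the score.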
Therefore, it has been desired to develop a highly accurate and simple method for measuring the similarity between profile matrices at not only conservative regions but also non-conservative regions.