1. Field of the Invention
The present invention relates to a method for structural alignment using a double dynamic programming algorithm. The method is suitable for the comparative analysis of protein structures to obtain information about the structure, function, and evolution of proteins.
2. Description of the Related Art
Proteins are major components of living organisms and are involved in various aspects of biological activity. Living organisms ordinarily use 20 kinds of amino acids as the building blocks of proteins. The amino acids are connected sequentially by peptide bonds to form a protein. The amino acid sequence of a protein folds into a tertiary structure, through which the protein exerts its activity.
The databases of amino acid sequences and tertiary structures have grown rapidly, owing to recent developments in techniques for determining the nucleotide sequences of DNA and the tertiary structures of proteins. To manage and analyze this huge amount of sequence and structure data, computers have been introduced into the field of molecular biology. An interdisciplinary area between information science and molecular biology, so-called "computational molecular biology" or "bioinformatics", has thus developed. In this area, comparative analysis of proteins occupies an important position as a method for extracting structural and functional information about proteins. It is known that two proteins are similar to each other in sequence and/or structure when they share a common ancestral gene or common functional constraints. Conversely, functional, structural, and/or evolutionary information can be obtained by comparing similar sequences or similar structures.
Alignment is a basic operation in comparative studies; it produces a residue-to-residue correspondence among similar biological macromolecules. In the procedure, the residues of the proteins are arranged in parallel rows so that corresponding residues occupy the same column. Residues without an equivalent are aligned with an empty mark called a "gap". Alignment is classified into three types: (1) sequence alignment, (2) threading (comparison between a sequence and a structure), and (3) structural alignment. Sequence alignment is a major tool for sequence analysis and is widely used in molecular biology. Threading was invented relatively recently and is used to search for sequences that fit a given tertiary structure; however, the method still has many problems with the accuracy of the alignment and the reliability of the prediction. In both approaches, the residue-to-residue correspondence is produced by a method called the "dynamic programming algorithm (DP)", which is described in detail below. Structural alignment was also invented relatively recently, and DP likewise plays an important role in generating the residue-to-residue correspondence in structural alignment. Considering the rapid growth of structure databases, structural alignment is expected to become an important tool for structure analysis.
FIG. 1 shows the idea of structural alignment. Consider the two proteins shown in FIG. 1, proteins A and B. Protein A has the amino acid sequence N-terminus-A-C-E-L-S-I-S-R-N-Y-D-T-I-P-D-C-terminus (SEQ ID no:1). The capital letters are the one-letter codes of the amino acid residues. The amino acid sequence folds into the structure shown in FIG. 1(a). Similarly, protein B, whose amino acid sequence is N-terminus-V-A-S-Q-I-G-W-D-E-D-I-H-L-E-P-I-G-E-S-C-terminus (SEQ ID no:2), folds into the structure shown in FIG. 1(b). The figures suggest that the fold of protein A is similar to that of protein B. Structural alignment automatically detects the structurally equivalent residues between the proteins and produces the residue-to-residue correspondence as follows:
A-CELSISR--NYD-TIPD   (SEQ ID no:1)
VASQIGWDEDIHLEPIGES   (SEQ ID no:2)
where `-` indicates a gap.
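To make the notion of residue-to-residue correspondence concrete, the following minimal Python sketch reads the two gapped rows above and lists the residue pairs that occupy the same column. The function name `aligned_pairs` and the 1-based position convention are illustrative assumptions and are not taken from the source.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return (residue_a, pos_a, residue_b, pos_b) for every alignment column
    in which neither row holds a gap. Positions are 1-based sequence indices.
    (Hypothetical helper, for illustration only.)"""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = []
    i = j = 0  # number of residues consumed so far in each sequence
    for a, b in zip(row_a, row_b):
        if a != gap:
            i += 1
        if b != gap:
            j += 1
        if a != gap and b != gap:
            pairs.append((a, i, b, j))
    return pairs


ROW_A = "A-CELSISR--NYD-TIPD"  # protein A (SEQ ID no:1) with gaps
ROW_B = "VASQIGWDEDIHLEPIGES"  # protein B (SEQ ID no:2)
pairs = aligned_pairs(ROW_A, ROW_B)
```

Here every one of the 15 residues of protein A is paired with a residue of protein B, while four residues of protein B face a gap.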
Many methods have been devised for structural alignment, some of which do not use DP. However, all of these methods share a common problem: constructing a structural alignment requires a huge amount of computational time. The present inventors have developed a technique that reduces the computational time by introducing two approximations into the double dynamic programming algorithm (DDP).
DDP is an algorithm for structural alignment that was invented by Taylor and Orengo in 1989. The algorithm is regarded as an extension of the DP used for sequence alignment and threading. To facilitate the understanding of DDP, the explanation begins with a description of DP. Consider two similar amino acid sequences. To align the sequences, a two-dimensional matrix, D, is required. FIG. 2 shows the matrix D. The upper left corner of the matrix corresponds to the N-termini of the proteins. Each residue of protein A corresponds to a row of the matrix, in the order of the primary structure. Similarly, each residue of protein B corresponds to a column of the matrix. The elements of the matrix are determined successively by solving the following recurrence equation:
$$D(i,j) = \max\{\, s(i,j) + D(i-1,j-1),\; D(i-1,j) - \beta,\; D(i,j-1) - \beta \,\}$$
where $\beta$ is a gap penalty, and s(i, j) is the similarity between amino acid residue i of protein A and residue j of protein B. The set of numerical values indicating the similarity between every pair of amino acid residues is called a "score table". The greater the similarity between a pair of amino acids, the larger the value. The value of s(i, j) is obtained from the score table. The three arguments in the recurrence equation correspond to three different operations: (1) connecting residue pairs in a diagonal direction without inserting a gap, (2) inserting a gap in a corresponding row, and (3) inserting a gap in a corresponding column. These operations, (1)-(3), also indicate movements in the diagonal, horizontal, and vertical directions on the matrix. By solving the equation, the numerical values are accumulated from the upper left toward the lower right of the matrix D. At the same time, the argument selected in the max operation, that is, the movement on the matrix, is stored in another two-dimensional matrix of the same size as the matrix D. This matrix is called the "path matrix", and it makes backtracking easy. The numerical value in the lower right corner of the matrix D indicates the similarity between the two amino acid sequences. Starting from that corner, backtracking is performed using the path matrix, which yields an optimal alignment, that is, a residue-to-residue correspondence. The time complexity of the calculation is $O(L^2 M + L M^2)$, where L and M are the lengths of proteins A and B.
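The recurrence, the path matrix, and the backtracking step can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation described in the source: the boundary values of D (leading gaps charged at the same penalty), the tie-breaking order, and the toy `identity_score` function standing in for a score table are all assumptions.

```python
def identity_score(a, b):
    """Toy stand-in for a score table (assumption): +1 match, -1 mismatch."""
    return 1.0 if a == b else -1.0


def align(seq_a, seq_b, score, beta):
    """Fill D with the recurrence, record each chosen move in a path matrix,
    then backtrack from the lower right corner."""
    L, M = len(seq_a), len(seq_b)
    D = [[0.0] * (M + 1) for _ in range(L + 1)]
    path = [[None] * (M + 1) for _ in range(L + 1)]
    # Assumed boundary condition: leading gaps cost beta each.
    for i in range(1, L + 1):
        D[i][0], path[i][0] = D[i - 1][0] - beta, "up"
    for j in range(1, M + 1):
        D[0][j], path[0][j] = D[0][j - 1] - beta, "left"
    for i in range(1, L + 1):
        for j in range(1, M + 1):
            candidates = (
                (D[i - 1][j - 1] + score(seq_a[i - 1], seq_b[j - 1]), "diag"),
                (D[i - 1][j] - beta, "up"),
                (D[i][j - 1] - beta, "left"),
            )
            D[i][j], path[i][j] = max(candidates, key=lambda c: c[0])
    # Backtracking along the stored moves.
    row_a, row_b = [], []
    i, j = L, M
    while i > 0 or j > 0:
        move = path[i][j]
        if move == "diag":
            row_a.append(seq_a[i - 1]); row_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif move == "up":
            row_a.append(seq_a[i - 1]); row_b.append("-")
            i -= 1
        else:
            row_a.append("-"); row_b.append(seq_b[j - 1])
            j -= 1
    return D[L][M], "".join(reversed(row_a)), "".join(reversed(row_b))


total, row_a, row_b = align("ACD", "AD", identity_score, beta=1.0)
```

For the toy input, the lower right value is 1 (two matches minus one gap penalty), and backtracking yields the rows `ACD` and `A-D`.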
DDP is basically the same as DP (see FIG. 3), except that DDP operates on the tertiary structures of proteins. As in DP, DDP requires a two-dimensional matrix, D, and each residue of the structures under consideration corresponds to a row or a column of the matrix. A recurrence equation similar to that for DP described above is then solved. However, s(i, j) is not the similarity between two amino acid residues; instead, it represents the similarity in structural environment between residues i and j.
FIG. 4 shows the definition of the structural environment of a residue, which was given by Taylor and Orengo in 1989. They defined the structural environment of amino acid residue i of protein A as the set of vectors from the β-carbon of residue i to those of all the other residues in the protein. That is, the structural environment of residue i indicates the relative position of residue i in protein A. The similarity in structural environment between two residues is evaluated by DP. As shown in FIG. 5, a two-dimensional matrix is required for the evaluation. As in sequence alignment, each vector constituting a structural environment corresponds to a row or a column of the two-dimensional matrix, in the order of the primary structure. A similar recurrence equation is then solved, and the scores are again accumulated from the upper left toward the lower right. Here, however, the similarity between two vectors, say x and y, is calculated by the equation shown in FIG. 5. By analogy, the value stored in the lower right corner of this matrix is considered to indicate the similarity in structural environment between the two residues. Therefore, this value is used as s(i, j) in FIG. 3.
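The evaluation of environment similarity can be sketched in Python as follows. The equation of FIG. 5 is not reproduced in this text, so the vector score `a / (|x - y|^2 + b)` and the constants `a`, `b`, and `beta` below are hypothetical placeholders chosen only so that the score is large when two vectors nearly coincide. The sketch also assumes the vectors are compared in the coordinate frame in which they are given.

```python
def environment(coords, i):
    """Vectors from the beta-carbon of residue i to those of all the other
    residues, in primary-structure order. coords is a list of (x, y, z)."""
    xi, yi, zi = coords[i]
    return [(x - xi, y - yi, z - zi)
            for k, (x, y, z) in enumerate(coords) if k != i]


def vector_similarity(u, v, a=50.0, b=2.0):
    """Hypothetical stand-in for the FIG. 5 equation (assumption)."""
    d2 = sum((p - q) ** 2 for p, q in zip(u, v))
    return a / (d2 + b)


def environment_similarity(env_a, env_b, beta=1.0):
    """Run the DP recurrence over two vector lists and return the value in
    the lower right corner, which serves as s(i, j)."""
    prev = [-beta * j for j in range(len(env_b) + 1)]  # assumed boundary row
    for i, u in enumerate(env_a, start=1):
        cur = [-beta * i]
        for j, v in enumerate(env_b, start=1):
            cur.append(max(prev[j - 1] + vector_similarity(u, v),
                           prev[j] - beta,
                           cur[j - 1] - beta))
        prev = cur
    return prev[-1]


# Toy coordinates (assumption): a straight chain and a translated copy.
chain_a = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)]
chain_b = [(x + 5, y + 5, z + 5) for x, y, z in chain_a]
s_same = environment_similarity(environment(chain_a, 1), environment(chain_b, 1))
s_diff = environment_similarity(environment(chain_a, 1), environment(chain_b, 0))
```

Because the environment is built from relative vectors, translating the whole chain leaves it unchanged, so residue 1 of both chains scores higher against itself than against residue 0.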
FIG. 6 summarizes the procedure of structural alignment by DDP. As shown in the figure, DP is used at two different stages of the calculation. The DP that evaluates the similarity in structural environment is called the "lower level DP", while the DP that produces the residue-to-residue correspondence is called the "upper level DP". This is why the method is called the Double Dynamic Programming algorithm, DDP. The time complexity of the calculation is estimated to be $O(L^3 M^2 + L^2 M^3)$, which is greater than that of sequence alignment, $O(L^2 M + L M^2)$. That is, computational time is one of the major constraints on structural alignment by DDP. Therefore, Taylor and Orengo, the inventors of DDP, have improved the method with a focus on this point (see FIG. 7).
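One way to read the stated DDP complexity, offered here as an interpretation rather than a derivation given in the source, is as the number of upper level matrix elements multiplied by the cost of one lower level DP, taking the DP cost quoted above at face value:

```latex
\underbrace{L\,M}_{\text{upper level elements}}
\times
\underbrace{O\!\left(L^{2}M + L M^{2}\right)}_{\text{one lower level DP}}
\;=\; O\!\left(L^{3}M^{2} + L^{2}M^{3}\right)
```

The single upper level DP adds only a further $O(L^2 M + L M^2)$ term, which is dominated by the product above.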
First, they introduced a window into the matrix D and applied the DDP calculation only to the residue pairs within the window (FIG. 7(a)). Next, they further restricted the residue pairs by selecting, within the window, those having similar torsional angles and surface areas (FIG. 7(b)). In their latest approach, they first aligned the secondary structures and then selected the residue pairs that lie within the aligned secondary structures and have similar torsional angles and surface areas. Thus, they reduced the computational time by restricting the residue pairs to which the DDP calculation is applied. These improvements have reduced the computational time remarkably. However, the methods include many complicated and time-consuming procedures that precede the actual structural alignment, such as the assignment and alignment of secondary structures and the assignment and comparison of torsional angles and surface areas.
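The effect of restricting the calculation to a window can be illustrated with a small Python sketch. The shape of the window in FIG. 7(a) is not described in this text, so the sketch assumes a band of fixed half-width around the length-scaled main diagonal. The function name `pairs_in_window` and the parameter `half_width` are hypothetical.

```python
def pairs_in_window(L, M, half_width):
    """Residue pairs (i, j), 0-based, lying within half_width columns of the
    length-scaled main diagonal of an L x M matrix (assumed window shape)."""
    pairs = []
    for i in range(L):
        # Column on the scaled diagonal that corresponds to row i.
        center = i * (M - 1) / (L - 1) if L > 1 else 0.0
        for j in range(M):
            if abs(j - center) <= half_width:
                pairs.append((i, j))
    return pairs


# Lengths of proteins A and B from FIG. 1; the half-width is arbitrary.
selected = pairs_in_window(15, 19, half_width=3)
fraction = len(selected) / (15 * 19)
```

Only the selected pairs would receive a lower level DP, so the saving is roughly proportional to the fraction of the matrix that falls outside the band.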