Predicting the three-dimensional structure of a protein from its amino acid sequence is not believed to be theoretically impossible. At present, however, no means of reliably predicting three-dimensional structure from sequence information has been developed, and the structure of a protein can be determined only by experimental methods such as X-ray crystallographic analysis and NMR analysis. Information on the three-dimensional structure of a protein is essential for understanding its function at the atomic level, as well as for designing medicinal molecules that target the protein or useful proteins with excellent functions. Recently, as a result of rapid progress in methods for analyzing genetic information, the number of proteins whose sequences have been elucidated without isolation of the protein has been increasing. The development of effective means of predicting three-dimensional structure and function from sequence information is therefore earnestly desired.
When the existence of a protein with a certain amino acid sequence is revealed, it is common practice to search sequence databases for homologous proteins. When a protein with a reasonable degree of amino acid sequence identity is found, the two sequences are aligned with consideration of homology and gaps, and alignments giving higher homology are searched for further. It can be assumed that when the sequence of the target protein is highly homologous to that of a protein of known function, its function resembles that of the known protein, and that when it is highly homologous to a protein of known three-dimensional structure, its three-dimensional structure resembles that of the known protein. The higher the homology, the higher the likelihood of resemblance in function or three-dimensional structure, and the more reliable the prediction is believed to be.
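As a rough illustration of how the degree of identity between two aligned sequences might be quantified, the following minimal sketch counts identical residues over the aligned columns. The function name `percent_identity`, the use of `-` as the gap symbol, and the choice to exclude gapped columns from the denominator are assumptions made for this example; other conventions for the denominator are also in use.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free aligned columns whose residues are identical.

    Assumption: both inputs are rows of the same pairwise alignment, so they
    have equal length, and columns containing a gap are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


if __name__ == "__main__":
    # Hypothetical aligned fragments, used only to exercise the function.
    print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```

Under this convention a value can be compared directly with a chosen threshold, although the threshold itself depends on how identity is defined.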
When the homology to the sequence of a protein with a known three-dimensional structure reaches a certain level (generally about 30%) or more, homology modeling is performed, in which a three-dimensional structure is constructed using the known structure as a template. Where a residue differs from the corresponding residue in the template, the three-dimensional structure can be constructed virtually by substituting the side chain. Gaps in the alignment must be treated separately, because either the template structure has no residues corresponding to part of the query or the template has excess residues. Since gaps make template-based modeling difficult and lower its reliability, alignment methods that assign a penalty to gaps are recommended so as to keep the number of gaps as small as possible.
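One well-known way of aligning two sequences while penalizing gaps is global dynamic-programming alignment (the Needleman-Wunsch algorithm). The sketch below is a minimal version with a linear gap penalty; it is offered only as an illustration of gap-penalized alignment, not as the specific procedure referred to above. The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are arbitrary assumptions, and practical tools generally use substitution matrices and affine gap penalties instead.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). A more negative `gap` value
    discourages gaps more strongly.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    # Hypothetical query and template fragments.
    print(needleman_wunsch("ACDEFG", "ACDFG"))
```

Making the `gap` argument more negative trades gaps for mismatches, which is the behavior the recommendation above relies on.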
When no protein of known three-dimensional structure is found that has fairly high sequence homology to the amino acid sequence in question, homology modeling is impossible. On the other hand, as crystal structure information on proteins has accumulated, many studies have revealed that multiple proteins with little homology to each other and completely different functions can have similar three-dimensional structures. This fact suggests that a three-dimensional structure suitable as a template might be chosen from among proteins of known structure, even when the amino acid sequence homology is low, by considering the physicochemical factors that allow proteins to form stable three-dimensional structures.
Recently, methods have been developed that use scores reflecting the agreement of physical properties, such as hydrophobicity, at each amino acid residue to choose, from proteins of known three-dimensional structure, template proteins that are highly similar in three-dimensional structure despite low amino acid sequence homology. A typical example is the 3D-1D method of Eisenberg et al. (R. Luthy, J. U. Bowie and D. Eisenberg, Nature, 356, 83, 1992). In addition to considering amino acid sequence homology, this method calculates similarity scores between corresponding amino acid residues using parameters that express the secondary structure to which each residue belongs and the environment in which the residue is located in the protein of known three-dimensional structure, together with parameters assigned to each amino acid residue in each secondary structure for the query sequence. By using known crystal structures as templates, the method avoids the problem of the enormous number of degrees of freedom in folding a peptide chain, and by including physical parameters such as hydrophobicity as evaluation factors, it enables modeling even when sequence homology is low.
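The general idea of scoring a query sequence against the residue environments of a template structure can be sketched as follows. This is a deliberately simplified toy and not the published 3D-1D scoring table: the two environment labels (`B` for buried, `E` for exposed), the set of residues treated as hydrophobic, and all score values are assumptions invented for this example, and the query is only slid along the template without gaps.

```python
# Assumption: a crude hydrophobic/non-hydrophobic split of the residues.
HYDROPHOBIC = set("AVLIMFWC")


def residue_env_score(residue: str, env: str) -> float:
    """Toy compatibility score of one residue with one template environment.

    The numeric values are made up for illustration only.
    """
    hydrophobic = residue in HYDROPHOBIC
    if env == "B":   # buried position in the template
        return 1.0 if hydrophobic else -1.0
    if env == "E":   # solvent-exposed position in the template
        return -0.5 if hydrophobic else 0.5
    raise ValueError(f"unknown environment label: {env!r}")


def placement_score(query: str, envs: str, offset: int) -> float:
    """Sum of scores with query residue k placed at template position offset+k."""
    return sum(residue_env_score(r, envs[offset + k])
               for k, r in enumerate(query))


def best_gapless_placement(query: str, envs: str):
    """Slide the query along the template and return (offset, score) of the best fit."""
    if len(query) > len(envs):
        raise ValueError("query is longer than the template")
    offsets = range(len(envs) - len(query) + 1)
    best = max(offsets, key=lambda off: placement_score(query, envs, off))
    return best, placement_score(query, envs, best)


if __name__ == "__main__":
    # Hypothetical template environment string and query fragment.
    print(best_gapless_placement("LVI", "EEBBBEE"))
```

In this toy the hydrophobic fragment scores best where the template positions are buried, which is the kind of physical compatibility such scores are meant to capture even when sequence identity is low.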
However, even when three-dimensional structures are similar, there are few proteins in which the number of amino acid residues, the secondary structure, or the lengths of the loops are the same, so many problems arise when the 3D-1D method is applied in practice on the basis of correspondence between amino acid sequences. For example, residues must be matched by considering not only a simple slide of one amino acid sequence along the other (threading) but also deletions of partial sequences (gaps) in either sequence, yet the introduction of gaps reduces reliability, as is also observed in homology modeling. When sequence homology is low, how to establish the correspondence between the sequences with the minimum necessary number of gaps is a problem. Furthermore, no improvement in predictive power can be expected from refining the parameters of the aforementioned method, because it depends on numerous parameters such as hydrophobicity and hydrophilicity, as well as parameters assigned to each of the twenty amino acid residues in each secondary structure.
The history of research on predicting the three-dimensional structure of proteins from amino acid sequences began with predicting which segment of a sequence adopts which secondary structure. That is, using parameters that express the propensity of each amino acid residue, or each set of several residues, to adopt an α-helix or β-sheet, obtained statistically from the crystallographic information on many proteins, continuous regions showing a marked tendency are detected in the query amino acid sequence, and a secondary structure is assigned to each region. A typical example is the secondary structure prediction method of Chou and Fasman (P. Y. Chou, & G. D. Fasman, Adv. Enzymol. 47, 45, 1978). However, this kind of method gives no information about the three-dimensional assembly of secondary structures, and since the average agreement between secondary structures predicted from amino acid sequences and those found in crystal structures is approximately 60%, it has almost no value as a method of predicting three-dimensional structures.
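The propensity-based detection of continuous regions described above can be illustrated with a sliding-window average. The sketch below is in the spirit of such methods only: the propensity values in `HELIX_PROPENSITY` are illustrative placeholders rather than the published Chou-Fasman parameters, and the window length, threshold, and function name are likewise assumptions.

```python
# Placeholder helix propensities (NOT the published table); residues that are
# not listed fall back to a neutral value of 1.0.
HELIX_PROPENSITY = {"A": 1.4, "E": 1.5, "L": 1.2, "M": 1.4,
                    "K": 1.1, "S": 0.8, "G": 0.6, "P": 0.6}
NEUTRAL = 1.0


def helix_regions(seq: str, window: int = 4, threshold: float = 1.1):
    """Return half-open (start, end) intervals predicted as helical.

    A residue is flagged if it lies in at least one window whose mean
    propensity reaches the threshold; adjacent flagged residues are merged.
    """
    flagged = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        values = [HELIX_PROPENSITY.get(r, NEUTRAL)
                  for r in seq[start:start + window]]
        if sum(values) / window >= threshold:
            for k in range(start, start + window):
                flagged[k] = True

    # Merge runs of flagged residues into intervals.
    regions, run_start = [], None
    for i, is_flagged in enumerate(flagged):
        if is_flagged and run_start is None:
            run_start = i
        elif not is_flagged and run_start is not None:
            regions.append((run_start, i))
            run_start = None
    if run_start is not None:
        regions.append((run_start, len(seq)))
    return regions


if __name__ == "__main__":
    # Hypothetical sequence with a helix-favoring stretch in the middle.
    print(helix_regions("GPGSAELMAEGPGS"))
```

The output is only a list of segments along the chain, which reflects the limitation noted above: nothing in it describes how the segments pack together in three dimensions.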
Methods of predicting the stable folded structure of a protein by pure calculation without preconceptions (so-called ab initio prediction methods) have also been attempted. However, since proteins are molecules with an extremely large number of degrees of freedom (even for a protein of about 100 residues, more than 400 parameters must be considered), it is impossible with presently available computers to search the possible structures adequately while considering all degrees of freedom. Moreover, because research on the factors that stabilize protein structure (for example, the physicochemical properties of water, hydrophobic interactions, and electrostatic interactions) has not advanced far enough to estimate the stability of candidate three-dimensional structures correctly, this kind of structure prediction is not expected to succeed at present.
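The scale of the search problem can be made concrete with a back-of-the-envelope estimate. The sketch below takes the figure of 400 parameters from the text; the number of discrete values tried per parameter (3) and the evaluation rate (10^9 structures per second) are assumptions chosen purely for illustration.

```python
import math


def log10_conformations(n_parameters: int, states_per_parameter: int) -> float:
    """log10 of states_per_parameter ** n_parameters, computed without overflow."""
    return n_parameters * math.log10(states_per_parameter)


def log10_years_to_enumerate(n_parameters: int, states_per_parameter: int,
                             evaluations_per_second: float) -> float:
    """log10 of the years needed to evaluate every combination once."""
    seconds_per_year = 365.25 * 24 * 3600
    return (log10_conformations(n_parameters, states_per_parameter)
            - math.log10(evaluations_per_second)
            - math.log10(seconds_per_year))


if __name__ == "__main__":
    # 400 parameters (from the text); 3 states each and 1e9 evaluations per
    # second are hypothetical values for this estimate.
    print(f"about 10^{log10_conformations(400, 3):.0f} combinations")
    print(f"about 10^{log10_years_to_enumerate(400, 3, 1e9):.0f} years")
```

Even under these coarse assumptions the number of combinations is on the order of 10^190, so exhaustive enumeration is out of reach regardless of the exact evaluation rate assumed.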
In recent years, the three-dimensional structures of many proteins have been determined, and the results are available from the Protein Data Bank. At present, the structures of about 6,000 proteins and nucleic acids are stored; however, only approximately 400 of these are independent proteins with different functions. The three-dimensional structures of these proteins have revealed that many proteins share the same structural motif even though they have no homology and appear to have no evolutionary or functional relationship to each other.