1. Field of the Invention
This invention relates to methods for constructing models for the secondary structure, supersecondary structure, and tertiary structure of proteins in the absence of any crystallographic information for any member of the protein family. More particularly, the present invention pertains to methods for extracting structural information from a set of aligned sequences of homologous (related by common ancestry) proteins for these purposes. Still more particularly, the present invention pertains to methods that extract structural information concerning a protein fold from the patterns of conservation and divergence of amino acid sequence within a set of homologous protein sequences, where this information is extracted using algorithms that reflect the evolutionary processes by which the protein family emerged.
2. Description of the Related Art
Proteins are linear polypeptide chains composed of 20 different amino acid building blocks. Determining the sequence of amino acids in a protein is now experimentally routine, either by direct chemical analysis of the protein itself or by translation of a gene that encodes the protein. Existing databases contain protein sequences totaling over 10 million amino acids.
The linear polypeptide sequence provides only a small part of the structural information that is important to the biochemist, however. The polypeptide chain folds to give secondary structural units (most commonly alpha helices and beta strands) which then fold to give supersecondary structures (for example, beta sheets) and a tertiary structure. Most of the behaviors of a protein are determined by its secondary and tertiary structure, including those that are important for allowing the protein to function in a living system. Further, the folded structure must be known before pharmaceuticals can be designed to bind to the protein.
The utility of methods able to predict secondary structure of a protein from sequence data alone is obvious to any individual ordinarily skilled in the art. High quality secondary structure predictions are useful for identifying antigenic sites on a protein molecule, as guides for site directed mutagenesis studies, and for understanding the interaction of a protein with other molecules. Further, high quality secondary structure predictions are prerequisites for building tertiary structural models for proteins.
The importance of methods for predicting the folded structure of proteins from sequence data has been appreciated by biochemists for over 30 years, and a corresponding three decades of labor have been devoted to efforts to develop such methods. The problem has proven to be extremely difficult to solve. In the process, a very large number of publications have appeared describing approaches toward a solution to the structure prediction problem. Many of the classical approaches to predicting the folded structure of proteins from sequence data are summarized in a book [G. Fasman, editor, Prediction of Protein Structure and the Principles of Protein Conformation, NY Plenum (1989)], which is incorporated herein by reference.
It is clearly impossible to provide in this disclosure a complete description of the entire body of classical work in the field of protein structure prediction. What is presented is a summary of these classical methods, together with a brief comment on their efficacy, the prior art known to the Applicant that serves as the closest precedent for the method of the present invention, and criteria that allow the present invention to be distinguished from all aspects of the prior art known to the Applicant.
First, the method of the present invention is de novo; that is, it allows the prediction of the folded structure of a protein without the need for any crystallographic data. De novo predictive methods are distinct from methods that rely on crystallographic information, in particular "knowledge-based" methods, where a structural model for a protein whose structure is unknown is built by extrapolation from the structure of a homologous protein whose crystal structure has already been solved [T. L. Blundell, B. L. Sibanda, M. J. E. Sternberg, J. M. Thornton: "Knowledge-based Prediction of Protein Structures and the Design of Novel Molecules", Nature 1987, 326: 347-352].
De novo methods in the prior art that attempt to predict the folded structure of proteins from sequence data fall into three general categories:
(i) Computational methods attempt to model the folded structure of proteins by calculating the relative energies of various possible conformations of the polypeptide chain in the search for the accessible conformation with the lowest energy. These methods have been largely unsuccessful due to the large conformational space that a polypeptide chain can occupy, and difficulties in modeling the interaction of the protein with the solvent, water.
(ii) Statistical methods examine the proteins whose folded structures are known, and tabulate from these structures the probability that each of the 20 proteinogenic amino acids occurs in a particular secondary or supersecondary structure. In predictive work, these statistical structural propensities can be influenced by the sequence of the protein immediately before or after the amino acid residue in question, and can be averaged over an alignment of homologous protein sequences. Such methods normally are only partly successful, in part because the statistical preferences of individual amino acids for particular secondary structures are small.
(iii) Methods based on physical chemical properties of the side chains of different amino acids place amino acid side chains that are hydrophilic outside the folded protein structure, where they are presumed to interact with solvent water, and hydrophobic side chains inside the folded structure. Secondary structures are assigned by patterns in the hydrophobicity or hydrophilicity of the side chains. For example, 3.6 residue periodicity is indicative of an alpha helix, while alternating (two residue) periodicity is indicative of a beta strand. Such methods are only partly successful because evolutionary forces tend to introduce amino acid residues into polypeptide sequences that violate the 3.6 residue periodicity to achieve proteins with the desired level of conformational instability [S. A. Benner, "Patterns of Divergence in Homologous Proteins as Indicators of Tertiary and Quaternary Structure," Adv. Enzym. Regulation, 28, 219-236 (1989)].
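The tabulation step of the statistical methods in category (ii) can be illustrated with a minimal sketch. It is not taken from any of the cited works: the function name `propensities`, the one-letter state codes (H for helix, E for strand, C for coil), and the two toy sequences are all assumptions introduced for illustration. The propensity of a residue for a state is computed here as the fraction of that residue found in the state, divided by the fraction of all residues found in the state.

```python
from collections import Counter

def propensities(examples):
    """Return {(residue, state): propensity} from (sequence, states) pairs.

    propensity = P(state | residue) / P(state); a value above 1 means the
    residue is over-represented in that state within the sample.
    """
    pair_counts = Counter()
    residue_counts = Counter()
    state_counts = Counter()
    total = 0
    for seq, states in examples:
        if len(seq) != len(states):
            raise ValueError("sequence and state strings must align")
        for aa, st in zip(seq, states):
            pair_counts[(aa, st)] += 1
            residue_counts[aa] += 1
            state_counts[st] += 1
            total += 1
    return {
        (aa, st): (n / residue_counts[aa]) / (state_counts[st] / total)
        for (aa, st), n in pair_counts.items()
    }

# Hypothetical toy data, NOT real structures: H = helix, E = strand, C = coil.
TOY = [
    ("AKAVL", "HHHEE"),
    ("GAVVG", "CHEEC"),
]
table = propensities(TOY)
```

With a realistic sample of known structures, most entries of such a table lie close to 1, which is the weakness of this class of methods noted in category (ii).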
From the large body of classical work, it is worth noting the contributions of Lim and of Schiffer and Edmundson, as the method of the present invention derives considerable inspiration from these authors. Schiffer and Edmundson [M. Schiffer, A. B. Edmundson, "Use of helical wheels to represent the structures of proteins and to identify segments with helical potential", Biophys. J. 7, 121-135 (1967)] were the first to use helix wheels to illustrate the properties of alpha helices. Lim [V. I. Lim, "Structural principles of the globular organization of protein chains: A stereochemical theory of globular protein secondary structure", J. Mol. Biol., 88, 857-872 (1974); V. I. Lim, "Algorithms for prediction of α-helical and β-structural regions in globular proteins", J. Mol. Biol., 88, 873-894 (1974)] was one of the first to formalize a method for identifying alpha helices in protein structures that searches for a property of a polypeptide sequence that displays 3.6 residue periodicity.
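A search for 3.6 residue periodicity can be sketched as follows. This is not the algorithm of Lim or of Schiffer and Edmundson; it substitutes a simple periodic sum (in the spirit of a hydrophobic moment) in which each residue is placed on a wheel at a fixed angular step and weighted by a hydrophobicity value. The two-level hydrophobicity scale, the function name `periodic_moment`, and the example peptides are assumptions made for illustration only.

```python
import math

# Hypothetical two-level scale: +1 for hydrophobic side chains, -1 otherwise.
HYDROPHOBIC = set("AVLIMFWC")

def periodic_moment(seq, degrees_per_residue):
    """Magnitude of the hydrophobicity vector sum around a wheel.

    Each residue i is placed at angle i * degrees_per_residue; a large
    result means hydrophobic residues cluster on one face of the wheel.
    """
    turn = math.radians(degrees_per_residue)
    x = y = 0.0
    for i, aa in enumerate(seq):
        h = 1.0 if aa in HYDROPHOBIC else -1.0
        x += h * math.cos(i * turn)
        y += h * math.sin(i * turn)
    return math.hypot(x, y)

ALPHA_TURN = 360.0 / 3.6  # 100 degrees per residue for 3.6 residue periodicity
BETA_TURN = 180.0         # alternating, two residue periodicity

# Invented peptides: a heptad-like pattern and a strictly alternating pattern.
helix_like = "LKKLLKKLKKLLKK"
strand_like = "VKVKVKVK"

for name, seq in (("helix_like", helix_like), ("strand_like", strand_like)):
    print(name,
          round(periodic_moment(seq, ALPHA_TURN), 2),
          round(periodic_moment(seq, BETA_TURN), 2))
```

For the invented peptides, the heptad-like pattern scores higher at the 100 degree step and the alternating pattern scores higher at the 180 degree step. Natural sequences are far less regular, which is consistent with the partial success of periodicity-based methods described above.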
In recent years, a number of papers have appeared that transcend classical approaches, in that they examine the sequences of more than one homologous protein in an effort to extract structural information. These papers have focused on the fact that homologous proteins have similar folded structures [Chothia & Lesk, EMBO J. 5, 823 (1986); N. L. Summers, W. D. Carlson, M. Karplus, J. Mol. Biol. 196, 175 (1987)].
For example, Crawford et al. [I. P. Crawford, T. Niermann, K. Kirschner, "Prediction of Secondary Structure by Evolutionary Comparison: Application to the α Subunit of Tryptophan Synthase", Proteins, 2, 118-129 (1987)] examined a set of aligned homologous sequences of tryptophan synthase, and predicted that this protein folded to yield an eight-fold alpha-beta barrel. This is, to the Applicant's knowledge, the first time that a correct de novo prediction has been made for the folded structure of a protein in advance of crystallographic data. Further, Crawford et al. [(1987) op. cit.] used gaps in the alignment to separate individual secondary structural elements. To assign secondary structure to each of these elements, Crawford et al. used a classical statistical algorithm developed by Garnier et al. [J. Garnier, D. J. Osguthorpe, B. Robson, "Analysis of the Accuracy and Implications of Simple Methods for Predicting the Secondary Structure of Globular Proteins", J. Mol. Biol., 120, 97-120 (1978)], assigned secondary structures to each homologous sequence individually, and then averaged the secondary structural assignments over all of the protein sequences to give an average secondary structure prediction. The secondary structure prediction had eight alpha helices interspersed with eight beta strands, a pattern well known in one class of protein fold, the eight-fold alpha-beta barrel fold. This pattern was used by Crawford et al. to build a corresponding tertiary structure model for tryptophan synthase.
Zvelebil et al. [M. J. Zvelebil, G. J. Barton, W. R. Taylor, M. J. E. Sternberg, "Prediction of Protein Secondary Structure and Active Sites using the Alignment of Homologous Sequences", J. Mol. Biol., 195, 957-961 (1987)] proposed a similar approach of averaging secondary structure predictions made by the method of Garnier et al. [(1978) op. cit.], but did not attempt to predict any unknown structures with this approach. Likewise, Taylor and Green [W. R. Taylor, N. M. Green, "The predicted secondary structures of the nucleotide-binding sites of six cation-transporting ATPases lead to a probable tertiary fold", Eur. J. Biochem. 179, 241-248 (1989)] averaged secondary structure predictions made individually for each of a set of protein sequences to obtain a consensus secondary structure prediction, and used a knowledge-based approach to model a tertiary structure.
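The idea of combining per-sequence secondary structure predictions over an alignment can be sketched as a column-by-column vote. This is a simplified stand-in and not the procedure of any of the cited authors: the single-sequence predictor is not implemented, the averaging is reduced to a majority vote, and the function name `consensus_states`, the state codes (H, E, C), and the aligned prediction strings are hypothetical.

```python
from collections import Counter

def consensus_states(aligned_predictions, gap="-"):
    """Majority vote over aligned per-sequence state strings.

    Gap characters cast no vote. Ties go to the state seen first in the
    column, and a column made only of gaps stays a gap.
    """
    lengths = {len(p) for p in aligned_predictions}
    if len(lengths) != 1:
        raise ValueError("all aligned predictions must have equal length")
    consensus = []
    for column in zip(*aligned_predictions):
        votes = Counter(s for s in column if s != gap)
        consensus.append(votes.most_common(1)[0][0] if votes else gap)
    return "".join(consensus)

# Hypothetical per-sequence predictions, already placed on the alignment.
preds = [
    "HHHHCC-EEE",
    "HHHCCC-EEE",
    "HHHHHCCEEC",
]
print(consensus_states(preds))
```

A vote of this kind can only be as reliable as the single-sequence predictions that feed it, since every input is produced by the same underlying statistical predictor.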
The main disadvantage of the method advocated by Crawford et al., Zvelebil et al., and Taylor and Green is that it does not work well on a wide range of protein structures.
Patthy also suggested that gaps in an alignment of several homologous protein sequences can indicate which residues form surface loops [L. Patthy, "Prediction of surface loops of protein-folds from multiple alignments of homologous sequences," Acta Biochim. Biophys. Hung. 24, 3-13 (1989)], but did not attempt to predict any surface loops in any unknown structures with this approach.
Further, it is generally appreciated in the art that certain residues at the active site of an enzyme are highly conserved during divergent evolution. Zvelebil and Sternberg [M. J. Zvelebil, M. J. E. Sternberg, Prot. Eng. 2, 127-138 (1988)] have reported an algorithm that scans for conserved polar residues in regions of high conservation in a multiple alignment as an indicator of active site residues. Of 16 active site residues in a test sample of alignments, 13 were correctly predicted, with an overprediction of 50 positions. This level of success is inadequate to permit a prediction of a structure using this algorithm as a component of the predictive method.
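The reported figures can be restated as recall and precision. The short calculation below assumes that "an overprediction of 50 positions" means 50 positions predicted as active site residues that are not in fact active site residues; that reading, and the function name `recall_and_precision`, are assumptions and are not stated in the cited work.

```python
def recall_and_precision(true_positives, false_negatives, false_positives):
    """Return (recall, precision) from raw prediction counts."""
    recall = true_positives / (true_positives + false_negatives)
    precision = true_positives / (true_positives + false_positives)
    return recall, precision

# 13 of 16 active site residues found; 50 overpredicted positions (assumed
# here to be false positives).
recall, precision = recall_and_precision(13, 16 - 13, 50)
print(f"recall = {recall:.4f}, precision = {precision:.4f}")
```

Under that assumption, about 81% of the true active site residues are recovered, but only about 21% of the predicted positions (13 of 63) are correct.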
Overington et al. [J. Overington, M. S. Johnson, A. Sali, T. L. Blundell, "Tertiary Structural Constraints on protein evolutionary diversity: Templates, key residues, and structure prediction", Proc. Roy. Soc. B, 241, 132-145 (1990)] have studied patterns of sequence divergence in proteins with known crystal structures. Depiereux and Feytmans [E. Depiereux, E. Feytmans, "Simultaneous and multivariate alignment of protein sequences: Correspondence between physicochemical profiles and structurally conserved regions", Prot. Eng., 4, 603-613 (1991)] have also noticed patterns in conservation of individual residues during divergent evolution. Neither group has provided a method for predicting secondary or tertiary structure.