Proteins (or polypeptides) are linear polymers of amino acids. The polymerization reaction which produces a protein results in the loss of one molecule of water from each amino acid, and hence proteins are often said to be composed of amino acid residues. Natural protein molecules may contain as many as 20 different types of amino acid residues. The particular linear sequence of amino acid residues in a protein defines the primary sequence, or primary structure, of the protein. The primary structure of a protein can be determined with relative ease using known methods.
Proteins fold into a three-dimensional structure. The folding is determined by the sequence of amino acids and by the protein's environment. Examination of the three-dimensional structure of numerous natural proteins has revealed a number of recurring patterns. Patterns known as alpha helices, parallel beta sheets, and anti-parallel beta sheets are the most commonly observed. A description of such protein patterns is provided by Dickerson, R. E., et al. in The Structure and Action of Proteins, W. A. Benjamin, Inc. California (1969). The assignment of each amino acid residue to one of these patterns defines the secondary structure of the protein. The helices, sheets, and turns of a protein's secondary structure pack together to produce the folded three-dimensional, or tertiary, structure of the protein.
In the past, the three-dimensional structure of proteins has been determined in a number of ways. Perhaps the best known way of determining protein structure involves the use of the technique of x-ray crystallography. A general review of this technique can be found in Physical Biochemistry, Van Holde, K. E. (Prentice-Hall, New Jersey 1971), pp. 221-239, or in Physical Chemistry with Applications to the Life Sciences, D. Eisenberg & D. C. Crothers (Benjamin Cummings, Menlo Park 1979). Using this technique, it is possible to elucidate the three-dimensional structure with good precision. Additionally, protein structure may be determined through the use of the techniques of neutron diffraction, or by nuclear magnetic resonance (NMR). See, e.g., Physical Chemistry, 4th Ed. Moore, W. J. (Prentice-Hall, New Jersey 1972) and NMR of Proteins and Nucleic Acids, K. Wuthrich (Wiley-Interscience, New York 1986).
The three-dimensional structure of many proteins may be characterized as having internal surfaces (directed away from the aqueous environment in which the protein is normally found) and external surfaces (which are exposed to the aqueous environment). Through the study of many natural proteins, researchers have discovered that hydrophobic residues (such as tryptophan, phenylalanine, leucine, isoleucine, valine, or methionine) are most frequently found on the internal surface of protein molecules. In contrast, hydrophilic residues (such as aspartate, asparagine, glutamate, glutamine, lysine, arginine, serine, and threonine) are most frequently found on the external protein surfaces. The amino acids alanine, cysteine, glycine, histidine, proline, serine, tyrosine, and threonine are encountered with more nearly equal frequency on both the internal and external protein surfaces.
The biological properties of proteins depend directly on the protein's three-dimensional (3D) conformation. The 3D conformation determines the activity of enzymes, the capacity and specificity of binding proteins, and the structural attributes of receptor molecules. Because the three-dimensional structure of a protein molecule is so significant, it has long been recognized that a means for readily determining a protein's three-dimensional structure from its known amino acid sequence would be highly desirable. However, it has proved extremely difficult to make such a determination. One difficulty is that each protein has an astronomical number of possible conformations (about 10^16 for a small protein of 100 residues; see K. A. Dill, Biochemistry, 24, 1501-1509, 1985), and there is no reliable method for picking the one conformation stable in aqueous solution. A second difficulty is that there are no accurate and reliable force laws for the interaction of one part of a protein with another part, and with water. Proteins exist in a dynamic equilibrium between a folded, ordered state and an unfolded, disordered state. These and other factors have contributed to the enormous complexity of determining the most probable relative 3D location of each residue in a known protein sequence.
Sequence alignment represents one approach that has generated some success at determining a protein's three-dimensional structure from an associated amino acid sequence. Typically, sequence alignment aligns a target residue sequence of unknown three-dimensional protein structure with residue sequences of known three-dimensional protein structures. If a sequence relationship can be found, it can often be inferred that the protein of known sequence but unknown structure adopts a fold similar to the protein of known structure. This strategy works well for closely related sequences, but structural similarities can go undetected as the level of sequence identity drops below about 25 percent. In this case, a similar technique referred to as a threading procedure can be used. In a threading procedure, a target sequence of unknown protein structure is aligned with a one-dimensional representation of a protein structure.
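The sequence identity threshold mentioned above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned (with "-" marking gaps) and counts identity only over positions where both sequences have a residue; other conventions for the denominator exist. The function names percent_identity and suggests_threading, and the use of 25 percent as a hard cutoff, are illustrative assumptions and not part of the method described here.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over ungapped columns of a pairwise alignment.

    Assumption: identity is measured only over columns where neither
    sequence has a gap. This is one common convention among several.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which both sequences carry a residue.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def suggests_threading(identity: float, threshold: float = 25.0) -> bool:
    """Hypothetical helper: flag pairs whose identity falls below the
    approximate level at which sequence alignment alone becomes unreliable."""
    return identity < threshold


if __name__ == "__main__":
    identity = percent_identity("ACDEFGHK", "ACDQFGHR")
    print(identity, suggests_threading(identity))
```

In practice the 25 percent figure is an approximate guide rather than a sharp boundary, so a real pipeline would likely treat it as a tunable parameter.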
In one such threading procedure, target sequences of unknown protein structure are aligned with profiles representing the structural environments of the residues in known three-dimensional protein structures. The method starts with a known three-dimensional protein structure and determines three key features of each residue's environment within the structure: (1) the total area of the residue's side-chain that is buried by other protein atoms and thus inaccessible to solvent; (2) the fraction of the side-chain area that is covered by polar atoms (O, N) or water; and (3) the local secondary structure. Based on these parameters, each residue position is categorized into an environment class. In this manner, a three-dimensional protein structure is converted into a one-dimensional environment string, which represents the environment class of each residue in the folded protein structure. A 3D structure profile table is then created containing score values that represent the frequency of finding any of the 20 common amino acids at each position of the environment string. These frequencies are determined from a database of known protein structures and aligned sequences. The method determines the most favorable alignment of a target protein sequence to the residue positions defined by the environment string by calculating a "best fit" alignment score for the target sequence.
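The steps of this threading procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the published implementation: the burial thresholds, the polar-fraction cutoff, the class labels, the gap penalty, and the toy profile scores are all placeholder values, and the text above does not specify the alignment algorithm, so a standard global dynamic-programming alignment with a linear gap penalty is assumed. The names environment_class, environment_string, and best_fit_score are hypothetical.

```python
from typing import Dict, List, Tuple

# (buried side-chain area, polar fraction, secondary structure code)
ResidueFeatures = Tuple[float, float, str]


def environment_class(buried_area: float, polar_fraction: float,
                      sec_struct: str) -> str:
    """Map the three environment features to a class label.

    All thresholds below are placeholder values chosen for illustration.
    """
    if buried_area < 40.0:
        burial = "E"   # exposed
    elif buried_area < 114.0:
        burial = "P"   # partially buried
    else:
        burial = "B"   # buried
    polarity = "p" if polar_fraction >= 0.5 else "a"  # polar vs. apolar
    return burial + polarity + sec_struct


def environment_string(residues: List[ResidueFeatures]) -> List[str]:
    """Convert a 3D structure (as per-residue features) to a 1D string."""
    return [environment_class(*features) for features in residues]


def best_fit_score(target: str, env: List[str],
                   profile: Dict[str, Dict[str, float]],
                   gap: float = -1.0, default: float = 0.0) -> float:
    """Global alignment score of a target sequence against an environment
    string, using dynamic programming with a linear gap penalty (assumed)."""
    n, m = len(target), len(env)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Score for placing target residue i in environment position j.
            s = profile.get(env[j - 1], {}).get(target[i - 1], default)
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # align residue to position
                           dp[i - 1][j] + gap,     # skip a target residue
                           dp[i][j - 1] + gap)     # skip a structure position
    return dp[n][m]


if __name__ == "__main__":
    # Toy structure of three residues; the numbers are invented.
    structure = [(150.0, 0.1, "H"), (20.0, 0.8, "H"), (150.0, 0.1, "S")]
    env = environment_string(structure)
    # Toy profile; real scores would be derived from observed frequencies.
    toy_profile = {
        "BaH": {"L": 2.0, "D": -2.0},
        "EpH": {"D": 1.5, "L": -1.0},
        "BaS": {"V": 2.0},
    }
    print(env, best_fit_score("LDV", env, toy_profile))
```

In this toy example, a sequence that places a hydrophobic residue in the buried apolar positions and a charged residue in the exposed polar position scores higher than one with the first two residues swapped, which is the behavior the profile is meant to capture.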
The above method has successfully associated protein folds with compatible sequences in some cases but has performed poorly in others. Thus, the method is not reliable enough for widespread application. Accordingly, it would be desirable to develop a method that predicts, with greater reliability, the protein structure of a sequence whose structure is unknown.
An object of the present invention is to provide a method and system that predicts the three-dimensional protein structure that an amino acid sequence folds into.
It is another object of the present invention to utilize structural information inherent in a family of aligned amino acid residue sequences in order to predict the protein structure of a residue sequence of unknown structure.
It is another object of the present invention to utilize residue variability information inherent within homologous protein sequences for protein structure prediction.
Another object of the present invention is to model protein structures as a simplified structural environmental-sequence for use in protein structure prediction.
It is a further object of the present invention to model known protein structures as a one-dimensional environmental string utilizing structural characteristics of the amino acid residues with respect to each residue's degree of exposure within the structure.