1. Field of the Invention
This invention relates to a computer-assisted method for identifying protein sequences that fold into a known three-dimensional structure.
2. Related Art
Proteins (or polypeptides) are linear polymers of amino acids. The polymerization reaction which produces a protein results in the loss of one molecule of water from each amino acid, and hence proteins are often said to be composed of amino acid "residues." Natural protein molecules may contain as many as 20 different types of amino acid residues, each of which contains a distinctive side chain. The particular linear sequence of amino acid residues in a protein defines the primary sequence, or primary structure, of the protein. The primary structure of a protein can be determined with relative ease using known methods.
Proteins fold into a three-dimensional structure. The folding is determined by the sequence of amino acids and by the protein's environment. Examination of the three-dimensional structure of numerous natural proteins has revealed a number of recurring patterns. Patterns known as alpha helices, parallel beta sheets, and anti-parallel beta sheets are the most commonly observed. A description of such protein patterns is provided by Dickerson, R. E., et al. in The Structure and Action of Proteins, W. A. Benjamin, Inc. California (1969). The assignment of each amino acid residue to one of these patterns defines the secondary structure of the protein.
The helices, sheets, and turns of a protein's secondary structure pack together to produce the folded three-dimensional, or tertiary, structure of the protein.
In the past, the three-dimensional structure of proteins has been determined in a number of ways. Perhaps the best known way of determining protein structure involves the technique of x-ray crystallography. A general review of this technique can be found in Physical Biochemistry, Van Holde, K. E. (Prentice-Hall, New Jersey 1971), pp. 221-239, or in Physical Chemistry with Applications to the Life Sciences, D. Eisenberg & D. C. Crothers (Benjamin Cummings, Menlo Park 1979). Using this technique, it is possible to elucidate three-dimensional structure with good precision. Additionally, protein structure may be determined by neutron diffraction or by nuclear magnetic resonance (NMR). See, e.g., Physical Chemistry, 4th Ed. Moore, W. J. (Prentice-Hall, New Jersey 1972) and NMR of Proteins and Nucleic Acids, K. Wuthrich (Wiley-Interscience, New York 1986).
The three-dimensional structure of many proteins may be characterized as having internal surfaces (directed away from the aqueous environment in which the protein is normally found) and external surfaces (which are exposed to the aqueous environment). Through the study of many natural proteins, researchers have discovered that hydrophobic residues (such as tryptophan, phenylalanine, tyrosine, leucine, isoleucine, valine, or methionine) are most frequently found on the internal surface of protein molecules. In contrast, hydrophilic residues (such as aspartate, asparagine, glutamate, glutamine, lysine, arginine, histidine, serine, threonine, glycine, and proline) are most frequently found on the external protein surfaces. The amino acids alanine, glycine, serine, and threonine are encountered with more nearly equal frequency on both the internal and external protein surfaces.
The biological properties of proteins depend directly on the protein's three-dimensional (3D) conformation. The 3D conformation determines the activity of enzymes, the capacity and specificity of binding proteins, and the structural attributes of receptor molecules. Because the three-dimensional structure of a protein molecule is so significant, it has long been recognized that a means for readily determining a protein's three-dimensional structure from its known amino acid sequence would be highly desirable. However, it has proved extremely difficult to make such a determination. One difficulty is that each protein has an astronomical number of possible conformations (about 10^16 for a small protein of 100 residues; see K. A. Dill, Biochemistry, 24, 1501-1509, 1985), and there is no reliable method for picking the one conformation stable in aqueous solution. A second difficulty is that there are no accurate and reliable force laws for the interaction of one part of a protein with another part, and with water. Proteins exist in a dynamic equilibrium between a folded, ordered state and an unfolded, disordered state. These and other factors have contributed to the enormous complexity of determining the most probable relative 3D location of each residue in a known protein sequence.
The protein folding problem, the problem of determining a protein's three-dimensional tertiary structure from its amino acid sequence, or primary structure, has defied solution for over 30 years. In the last decade, however, the increase in the number of known protein sequences, and the fact that many sequences have been found to fold into the same basic three-dimensional structure, have focused attention on a related problem: the inverse protein folding problem. The inverse protein folding problem asks, given a known three-dimensional protein structure, which amino acid sequences fold into that structure?
As a result of the molecular biology revolution, the number of known protein sequences is about 50 times greater than the number of known three-dimensional protein structures. This disparity hinders progress in many areas of biochemistry because a protein sequence has little meaning outside the context of the three-dimensional structure. The disparity is less severe than the numbers might suggest, however, because different proteins often adopt similar three-dimensional folds. As a result, each new protein structure can serve as a model for other protein structures. These structural similarities occur because the current array of protein structures probably evolved from a small number of primordial folds. If the number of folds is indeed limited, it is possible that x-ray crystallographers and NMR spectroscopists may eventually describe examples of essentially every fold. In that event, protein structure prediction theoretically would reduce, at least in crude form, to the inverse protein folding problem--the problem of identifying which fold in this limited repertoire a particular amino acid sequence adopts.
The inverse protein folding problem is most often approached by seeking sequences that are similar to the sequence of a protein whose structure is known. If a sequence relationship can be found, it can often be inferred that the protein of known sequence but unknown structure adopts a fold similar to the protein of known structure. The strategy works well for closely related sequences, but structural similarities can go undetected as the level of sequence identity drops below about 25 percent.
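The level of sequence identity referred to above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned, that gaps are written as "-", and that identity is counted over the aligned, non-gap columns only. Other definitions of percent identity exist, and the function name percent_identity is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned (non-gap) columns in which the residues match.

    Assumption: both inputs are rows of the same pairwise alignment,
    so they have equal length and use '-' for gap positions.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")

    compared = 0  # columns where both sequences have a residue
    matches = 0   # columns where those residues are identical
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        return 0.0
    return 100.0 * matches / compared
```

For example, percent_identity("ACDE", "ACDF") evaluates to 75.0, since three of the four aligned columns match.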
A more direct attack on the inverse protein folding problem has been to search for sequences that are compatible with a given structure. In this "tertiary template" method, the backbone of a known protein structure--the amino acid residues less the side chains--is kept fixed, and the side-chains in the protein core are then replaced and tested combinatorially by computer to find which combinations of new side-chains could fit into the core. A set of core sequences is thereby enumerated that could in principle be tolerated in the protein structure. In this manner, the method of tertiary templates provides a direct link between possible three-dimensional structure and known sequence. See J. W. Ponder, F. M. Richards, J. Mol. Biol., 193, 775-791 (1987).
The rules used to relate one-dimensional amino acid sequences to possible three-dimensional structures in the tertiary template method may be excessively rigid. Proteins that fold into similar structures can have large differences in the size and shape of residues at equivalent positions. These changes are tolerated not only because of replacements or movements in nearby side-chains, but also as a result of shifts in the protein backbone. Moreover, insertions and deletions in the amino acid sequence, which are commonly found in related protein structures, are not considered in the implementation of tertiary templates. To describe realistically the sequence requirements of a particular fold, the constraints of a rigid backbone and a fixed spacing between core residues must somehow be relaxed.
Another approach, suggested by work done by one of the present inventors, is a profile method that characterizes the amino acid sequences of families of proteins aligned by sequence or structural similarities. The profile method builds a table of weighted values that reflect how frequently each amino acid residue is likely to be found at a particular position in the sequence of amino acids forming the proteins. The profile table thus characterizes the entire family of proteins upon which the table is based. A target amino acid sequence is compared to the profile, using a known dynamic programming method, to determine a final "best fit" score. Insertions and deletions of amino acids in the target sequence are provided for by appropriate "gap opening" and "gap extension" penalties that affect the final score. See M. Gribskov, A. D. McLachlan, and D. Eisenberg, Proc. Natl. Acad. Sci. U.S.A., 84, 4355 (1987); M. Gribskov, M. Homyak, J. Edenfield, and D. Eisenberg, CABIOS 4, (1988); M. Gribskov and D. Eisenberg, in "Techniques in Protein Chemistry" (T. E. Hugli, ed.), p. 108. Academic Press, San Diego, Calif., 1989; M. Gribskov, R. Luthy, and D. Eisenberg, Meth. in Enz. 183, 146 (1990).
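The comparison of a target sequence to a profile table can be sketched as follows. This is an illustrative global alignment with affine gap penalties, not the implementation described in the cited references. The profile representation (one dictionary of residue scores per position), the default score for residues missing from a position, the uniform gap penalties, and the name profile_alignment_score are all assumptions made for the example.

```python
NEG_INF = float("-inf")

def profile_alignment_score(profile, sequence, gap_open=3.0, gap_extend=1.0,
                            default_score=-1.0):
    """Best global alignment score of `sequence` against `profile`.

    profile: list of dicts, one per position, mapping residue -> score
             (hypothetical layout chosen for this sketch).
    A gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    P, S = len(profile), len(sequence)
    # M: position i aligned to residue j; X: position i aligned to a gap
    # (deletion in the target); Y: residue j aligned to a gap (insertion).
    M = [[NEG_INF] * (S + 1) for _ in range(P + 1)]
    X = [[NEG_INF] * (S + 1) for _ in range(P + 1)]
    Y = [[NEG_INF] * (S + 1) for _ in range(P + 1)]

    M[0][0] = 0.0
    for i in range(1, P + 1):   # leading gap in the target sequence
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, S + 1):   # leading gap in the profile
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, P + 1):
        column = profile[i - 1]
        for j in range(1, S + 1):
            s = column.get(sequence[j - 1], default_score)
            # Match state: extend any state ending at (i-1, j-1).
            M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            # Open a new gap (gap_open) or extend an existing one (gap_extend).
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_extend,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_extend,
                          X[i][j - 1] - gap_open)

    return max(M[P][S], X[P][S], Y[P][S])


# Toy three-position profile; the scores are invented for illustration only.
toy_profile = [{"A": 2.0, "C": -1.0}, {"C": 2.0, "A": -1.0}, {"D": 2.0}]
full_score = profile_alignment_score(toy_profile, "ACD")
gapped_score = profile_alignment_score(toy_profile, "AD")
```

With the toy profile, the sequence "ACD" matches every position and scores 6.0, while "AD" must skip the middle position and scores 1.0 (two matches of 2.0 less one gap opening penalty of 3.0). Keeping three matrices is what allows opening a gap to cost more than extending one.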
The profile method is useful for learning whether a target protein sequence belongs to a known family of sequences, and some inferences can be made that the target sequence has a three-dimensional structure similar to the structures of the known family of sequences. However, the profile method does not directly take into account specific structural characteristics of the known family of sequences, since the profile table is constructed based only upon alignments of amino acid sequences within selected proteins of known structure. Thus, a large amount of information inherent in a known structure is simply ignored in a sequence profile.
Accordingly, it would be desirable to develop a method for relating a one-dimensional target sequence directly to a known 3D structure, one that effectively utilizes the information about the accommodation of sequence changes that is inherent in a known 3D structure.
The present invention provides such a method, using a novel technique for profiling structural characteristics of families of proteins with known three-dimensional structures, and a computer-assisted search procedure for comparing target amino acid sequences to such profiles.