This invention relates to a computer-assisted method for identifying protein sequences that fold into a known three-dimensional structure, and more particularly to a computer-assisted method for assigning an amino acid probe sequence to a known three-dimensional protein structure.
Proteins (or polypeptides) are linear polymers of amino acids. The polymerization reaction which produces a protein results in the loss of one molecule of water from each amino acid, and hence proteins are often said to be composed of amino acid "residues." Natural protein molecules may contain as many as 20 different types of amino acid residues, each of which contains a distinctive side chain. The particular linear sequence of amino acid residues in a protein defines the primary sequence, or primary structure, of the protein. The primary structure of a protein can be determined with relative ease using known methods.
Proteins fold into a three-dimensional structure. The folding is determined by the sequence of amino acids and by the protein's environment. Examination of the three-dimensional structure of numerous natural proteins has revealed a number of recurring patterns, or secondary structure. Secondary structures known as alpha helices, parallel beta sheets, and anti-parallel beta sheets are the most commonly observed. A description of such secondary structures is provided by Dickerson, R. E., et al. in The Structure and Action of Proteins, W. A. Benjamin, Inc. Calif. (1969). The helices, sheets, and turns of a protein's secondary structure pack together to produce the folded three-dimensional, or tertiary, structure of the protein.
In the past, the three-dimensional structure of proteins has been determined in a number of ways. Perhaps the best known way of determining protein structure involves the use of the technique of x-ray crystallography. A general review of this technique can be found in Physical Biochemistry, Van Holde, K. E. (Prentice-Hall, N.J. 1971), pp. 221-239, or in Physical Chemistry with Applications to the Life Sciences, D. Eisenberg and D. C. Crothers (Benjamin Cummings, Menlo Park 1979). Using this technique, it is possible to elucidate three-dimensional structure with good precision. Additionally, protein structure may be determined through the use of the techniques of neutron diffraction, or by nuclear magnetic resonance (NMR). See, e.g., Physical Chemistry, 4th Ed. Moore, W. J. (Prentice-Hall, N.J. 1972) and NMR of Proteins and Nucleic Acids, K. Wüthrich (Wiley-Interscience, NY 1986).
The biological properties of proteins depend directly on the protein's three-dimensional (3D) conformation. The 3D conformation determines the activity of enzymes, the capacity and specificity of binding proteins, and the structural attributes of receptor molecules. Because the three-dimensional structure of a protein molecule is so significant, it has long been recognized that a means for readily determining a protein's three-dimensional structure from its known amino acid sequence would be highly desirable. However, it has proved extremely difficult to make such a determination. One difficulty is that each protein has an astronomical number of possible conformations (about 10^16 for a small protein of 100 residues; see K. A. Dill, Biochemistry, 24, 1501-1509, 1985), and there is no reliable method for picking the one conformation stable in aqueous solution. A second difficulty is that there are no accurate and reliable force laws for the interaction of one part of a protein with another part, and with water. Proteins exist in a dynamic equilibrium between a folded, ordered state and an unfolded, disordered state. These and other factors have contributed to the enormous complexity of determining the most probable relative 3D location of each residue in a known protein sequence.
The protein folding problem, the problem of determining a protein's three-dimensional tertiary structure from its amino acid sequence, or primary structure, has defied solution for over 30 years. In the last decade, however, the increase in the number of known protein sequences, and the fact that many sequences have been found to fold into the same basic three-dimensional structure, have focused attention on a related problem: the inverse protein folding problem. The inverse protein folding problem asks, given a known three-dimensional protein structure, which amino acid sequences fold into that structure?
As a result of the molecular biology revolution, the number of known protein sequences is about 50 times greater than the number of known three-dimensional protein structures. This disparity hinders progress in many areas of biochemistry because a protein sequence has little meaning outside the context of the three-dimensional structure. The disparity is less severe than the numbers might suggest, however, because different proteins often adopt similar three-dimensional folds. As a result, each new protein structure can serve as a model for other protein structures. These structural similarities occur because the current array of protein structures probably evolved from a small number of primordial folds. If the number of folds is indeed limited, it is possible that x-ray crystallographers and NMR spectroscopists may eventually describe examples of essentially every fold. In that event, protein structure prediction theoretically would reduce, at least in crude form, to the inverse protein folding problem: the problem of identifying which fold in this limited repertoire a particular amino acid sequence adopts. Thus, protein fold recognition aims to assign each new amino acid sequence to the known 3D fold that the sequence most closely resembles.
The inverse protein folding problem is most often approached by seeking sequences that are similar to the sequence of a protein whose structure is known. If a sequence relationship can be found, it can often be inferred that the protein of known sequence but unknown structure adopts a fold similar to the protein of known structure. The strategy works well for closely related sequences, but structural similarities can go undetected as the level of sequence identity drops below about 25 percent.
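The sequence identity figure referred to above can be illustrated with a short calculation. The following is a minimal sketch, assuming the two sequences have already been aligned and that gaps are written as "-". The function name percent_identity and the choice of denominator (columns in which both sequences have a residue) are illustrative assumptions; other conventions for the denominator are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned residue pairs that are identical.

    Assumption: only columns where both sequences have a residue
    (no gap in either) are counted in the denominator.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only the columns in which neither sequence has a gap.
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


# Four residue pairs are compared (the gapped column is skipped);
# three of them are identical.
identity = percent_identity("ACDE-", "ACQEF")
```

Under this convention, a pair of aligned sequences whose identity falls below roughly 25 percent would be in the range where, as noted above, structural similarity can go undetected by sequence comparison alone.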
A more direct attack on the inverse protein folding problem has been to search for sequences that are compatible with a given structure. In this "tertiary template" method, the backbone of a known protein structure (the amino acid residues less the side-chains) is kept fixed and the side-chains in the protein core are then replaced and tested combinatorially by computer, to find which combination of new side-chains could fit into the core. A set of core sequences is thereby enumerated that could in principle be tolerated in the protein structure. In this manner, the method of tertiary templates provides a direct link between possible three-dimensional structure and known sequence. See Ponder and Richards, J. Mol. Biol., 193, 775-791 (1987).
The rules used to relate one-dimensional amino acid sequences to possible three-dimensional structures in the tertiary template method may be excessively rigid. Proteins that fold into similar structures can have large differences in the size and shape of residues at equivalent positions. These changes are tolerated not only because of replacements or movements in nearby side-chains, but also as a result of shifts in the protein backbone. Moreover, insertions and deletions in the amino acid sequence, which are commonly found in related protein structures, are not considered in the implementation of tertiary templates. To describe realistically the sequence requirements of a particular fold, the constraints of a rigid backbone and a fixed spacing between core residues must somehow be relaxed.
Another approach, suggested by work done by one of the present inventors, is a profile method that characterizes the amino acid sequences of families of proteins aligned by sequence or structural similarities. The profile method builds a table of weighted values that reflect the frequency that amino acid residues are likely to be located at a particular position in the sequence of amino acids forming the proteins. The profile table thus characterizes the entire family of proteins upon which the table is based. A target amino acid sequence is compared to the profile, using a known dynamic programming method, to determine a final "best fit" score. Insertions and deletions of amino acids in the target sequence are provided for by appropriate "gap opening" and "gap extension" penalties that affect the final score. See Gribskov et al., Proc. Natl. Acad. Sci. U.S.A., 84, 4355 (1987); Gribskov et al., CABIOS 4, (1988); Gribskov and Eisenberg, in "Techniques in Protein Chemistry" (T. E. Hugli, ed.), p. 108. Academic Press, San Diego, Calif., 1989; Gribskov et al., Meth. in Enz. 183, 146 (1990).
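The idea of a position-specific table of weights can be sketched in a few lines. The example below is a simplified illustration, not the weighting scheme of the cited profile papers: it assumes the weights are log-odds of observed residue frequencies against a uniform background, with a pseudocount so that unseen residues receive a finite score. The names build_profile and score_ungapped are hypothetical, and gaps are omitted here for brevity.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_seqs, pseudocount=1.0):
    """Return one {residue: weight} table per alignment column.

    Assumption: weight = log(observed frequency / uniform background).
    """
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        # Count residues in this column, skipping gaps and unknown symbols.
        counts = Counter(s[col] for s in aligned_seqs if s[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_ungapped(profile, seq):
    """Sum the column weights for a sequence placed on the profile without gaps."""
    return sum(column[aa] for column, aa in zip(profile, seq))


family = ["ACD", "ACE", "ACD"]  # toy aligned family (illustrative data)
profile = build_profile(family)
member_score = score_ungapped(profile, "ACD")
outsider_score = score_ungapped(profile, "WWW")
```

A sequence resembling the family scores higher than an unrelated one because its residues match the columns where the family is conserved. A full comparison would additionally place gaps by dynamic programming, as the text describes.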
The profile method is useful for learning whether a target protein sequence belongs to a known family of sequences, and some inferences can be made that the target sequence has a three-dimensional structure similar to the structures of the known family of sequences. However, the profile method does not directly take into account specific structural characteristics of the known family of target sequences, since the profile table is constructed based only upon alignments of amino acid sequences within selected proteins of known structure. Thus, a large amount of information inherent in a known structure is simply ignored in a sequence profile.
Bowie et al. developed an alternative to the technique of assigning a new sequence to a known 3D fold by establishing similarity between the sequence and some sequence of known structure: score the compatibility of the new sequence against known 3D structures. Bowie et al., "A Method to Identify Protein Sequences that Fold into a Known Three-dimensional Structure", Science 253: 164-170 (1991). See also U.S. Pat. No. 5,436,850, entitled "Method to Identify Protein Sequences that Fold into a Known Three-dimensional Structure", issued Jul. 25, 1995, which is hereby incorporated by reference.
Since then, a variety of fold-recognition methods have been published, and several reviews on these methods have appeared. The approaches used differ in at least one of the components of fold recognition, namely: the representation of the protein, the function used to evaluate the compatibility between the unknown probe sequence and a fold, the algorithm used to search for the optimal alignment, the way ranking is computed, and the way significance is assessed.
More particularly, in sequence comparison fold recognition, compatibility of sequence to structure is assessed by optimally aligning the sequence of interest (the probe sequence) to each structure in a library of known folds (the target structures). The compatibility is computed by adding the compatibility score of each aligned position of probe to target and subtracting a penalty for any gaps in the alignment. The compatibility score can be defined by a one-positional compatibility function, f(pi, tj), where pi represents the ith position of the probe sequence, and tj represents the jth position of the target structure. The value tj refers to some structural properties of residue j in the target structure, such as the local secondary structure, the solvent accessibility, the polarity, or the amino acid type (see, e.g., Bowie et al., supra). The value tj can also encode other information, such as some structural properties of neighboring residues. Thus, various compatibility functions describe a variety of structural properties for tj. However, sequence comparison methods consider only the amino acid sequence of the probe sequence; that is, pi refers exclusively to the amino acid type at position i.
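The scoring scheme just described, summing f(pi, tj) over aligned positions and subtracting gap penalties, can be sketched as a global dynamic programming alignment with separate gap-opening and gap-extension penalties. Everything specific in the example is an assumption made for illustration: the toy function f (which rewards hydrophobic residues at buried positions "B" and polar residues at exposed positions "E"), the penalty values, and the name align_score. Real compatibility functions are derived from observed structures.

```python
NEG_INF = float("-inf")
HYDROPHOBIC = set("AVLIMFWC")  # illustrative grouping, an assumption


def f(aa, env):
    """Toy one-positional compatibility: +1 if residue suits the environment."""
    return 1.0 if (aa in HYDROPHOBIC) == (env == "B") else -1.0


def align_score(probe, target, gap_open=2.0, gap_ext=0.5):
    """Best global alignment score; a gap of length k costs gap_open + (k-1)*gap_ext."""
    n, m = len(probe), len(target)
    # M: probe[i] aligned to target[j]; X: probe[i] against a gap;
    # Y: target[j] against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_ext)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_ext)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = f(probe[i - 1], target[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            X[i][j] = max(M[i - 1][j] - gap_open,
                          X[i - 1][j] - gap_ext,
                          Y[i - 1][j] - gap_open)
            Y[i][j] = max(M[i][j - 1] - gap_open,
                          Y[i][j - 1] - gap_ext,
                          X[i][j - 1] - gap_open)
    return max(M[n][m], X[n][m], Y[n][m])


# Probe "LKL" against a buried/exposed/buried target: every position fits.
perfect = align_score("LKL", "BEB")
# Probe "LL" must skip the exposed middle position, paying one gap opening.
gapped = align_score("LL", "BEB")
```

In this sketch the probe contributes only its amino acid type at each position, which matches the limitation the text attributes to sequence comparison methods; all structural information enters through the target string.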
Thus, the basic difference between straight sequence comparison and sequence comparison fold recognition lies in the type of information used: in straight sequence comparison, a probe sequence is aligned to the sequences of the proteins in a library; in sequence comparison fold recognition, a probe sequence is aligned to some structural properties of the proteins in a library of known target 3D structures.
The inventors have recognized that, with the increasing size of sequence databases, users are likely to find several homologues of a probe sequence to known 3D structures, and thus derive multiple alignments from such databases. From multiple alignments, sequence-derived properties (such as secondary structure, solvent accessibility, and hydrophobic moments) for the probe sequences can be predicted with accuracies greater than 70%. The inventors have recognized that such predictions of sequence-derived properties may be used to improve the accuracy of fold recognition algorithms. Accordingly, the present invention is directed to a method for using the amino acid sequence of a probe plus sequence-derived properties of the probe in making fold assignments. The method includes a computer-assisted procedure for assigning probe amino acid sequences to known target 3D fold structures.
More particularly, in one aspect the invention includes a computer-assisted method for assigning an amino acid probe sequence to a known three-dimensional protein structure, including the steps of inputting into a computer system a string p1, p2 . . . pn describing the amino acid sequence of the probe sequence and at least one sequence-derived property for the probe sequence; inputting into a computer system a string t1, t2 . . . tm of structural properties for each member of a library of known 3D structures; executing an alignment algorithm in the computer to compute an alignment score indicating the optimal alignment of the string p1, p2 . . . pn to each string t1, t2 . . . tm by applying a combined compatibility function g(pi, tj); determining the statistical significance of each alignment score to determine a best-fit alignment score; and applying the best-fit alignment score to indicate or select the corresponding 3D protein structure from the library for output to a user. The invention includes a computer program implementation of such a method.
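The steps recited above can be illustrated with a compact sketch. The form of g(pi, tj), the significance measure, and all names below are assumptions made for illustration, since this summary does not fix them: here each probe position is a pair (amino acid, predicted secondary structure), each target position is a pair (environment, observed secondary structure), g adds a residue-environment term to a weighted secondary-structure agreement term, alignment uses a simple linear gap penalty, and significance is approximated by a Z-score over the library.

```python
from statistics import mean, pstdev

HYDROPHOBIC = set("AVLIMFWC")  # illustrative grouping, an assumption


def g(p, t, ss_weight=1.0):
    """Toy combined compatibility of probe position p with target position t."""
    aa, predicted_ss = p
    env, observed_ss = t
    residue_term = 1.0 if (aa in HYDROPHOBIC) == (env == "B") else -1.0
    ss_term = 1.0 if predicted_ss == observed_ss else -1.0
    return residue_term + ss_weight * ss_term


def align_score(probe, target, gap=2.0):
    """Global alignment score with a linear gap penalty (a simplification)."""
    m = len(target)
    prev = [-gap * j for j in range(m + 1)]
    for i in range(1, len(probe) + 1):
        cur = [-gap * i] + [0.0] * m
        for j in range(1, m + 1):
            cur[j] = max(prev[j - 1] + g(probe[i - 1], target[j - 1]),
                         prev[j] - gap,      # probe residue against a gap
                         cur[j - 1] - gap)   # target position against a gap
        prev = cur
    return prev[m]


def rank_library(probe, library):
    """Score the probe against every fold and convert raw scores to Z-scores."""
    raw = {name: align_score(probe, target) for name, target in library.items()}
    scores = list(raw.values())
    mu, sigma = mean(scores), pstdev(scores)
    z = {name: (s - mu) / sigma if sigma > 0 else 0.0 for name, s in raw.items()}
    best = max(z, key=z.get)
    return best, z


# Toy data: a probe predicted to be helical, and a three-fold library.
probe = list(zip("LKLK", "HHHH"))
library = {
    "helix": list(zip("BEBE", "HHHH")),
    "strand": list(zip("EBEB", "EEEE")),
    "coil": list(zip("EEEE", "CCCC")),
}
best_fold, z_scores = rank_library(probe, library)
```

The design point the sketch tries to convey is that the probe side of g carries more than the amino acid type: a predicted property of the probe is matched against the corresponding observed property of the target, so two folds with similar residue environments can still be distinguished.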
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.