1. Field of the Invention
This invention relates to the area of bioinformatics, more specifically to methods for analyzing the sequences of evolutionarily related proteins, and most specifically for identifying evolutionary and functional relationships between proteins and the genes that encode them.
2. Background
Proteins are linear polypeptide chains composed of 20 different amino acid building blocks. Determining the sequence of amino acids in a protein is now experimentally routine, both by direct chemical analysis of the proteins themselves, and by translation of genes that encode proteins. The size of protein sequence databases will grow explosively over the next decade as genome sequencing projects are completed.
The polypeptide chain in a protein folds to give secondary structural units (most commonly alpha helices and beta strands) which then fold to give supersecondary structures (for example, a beta sheet or a strand-turn-helix) and a tertiary structure. These are collectively termed "conformation" or, more colloquially, the "fold". Most behaviors of a protein are determined by the fold, including those that are important for allowing the protein to function in a living system. The folded structure must be known before pharmaceuticals can be rationally designed to bind to the protein, for example.
In principle, the linear polypeptide sequence, by providing the constitution of the protein, also determines all of its other properties, including secondary and tertiary structure, stability, interaction with other molecules, and through these and other properties, biological activity. The connection between amino acid sequence and these other properties is not transparent, however. For example, some 30 years have been spent developing tools that allow the biochemist to predict the secondary structure of proteins starting from sequence data. Many of the classical approaches to predicting secondary structure from sequence, for example, were summarized in the disclosure of Ser. No. 07/857,224, filed Mar. 25, 1992, which is herein incorporated by reference.
In the mid 1970's, a relationship between evolutionary ancestry and protein conformation was established. Rossman noted that lactate, glyceraldehyde-3-phosphate, and alcohol dehydrogenases acting on quite different substrates all have a domain that folds to give a parallel sheet flanked by helices (a "Rossman fold"). [Rossman, M. G., and Argos, P. (1976). Exploring structural homology of proteins. 105, 75-95].
It is now widely appreciated that homologous proteins may have diverged so much that no significant sequence similarity remains between them, even though their overall folds might be the same. Since 1976, many have attempted to exploit the fact that homologous proteins have the same fold as a tool for predicting fold. For cases where the target protein was sufficiently similar in sequence to a protein with a known conformation to establish homology with reasonable statistical significance, "homology modeling" was used. Homology modeling is best defined strictly as a process for building a model of the conformation of a target protein that begins by identifying a protein with known conformation that is a homolog of the target, and uses the homolog as a starting point to model the conformation of the target (May and Blundell, 1995; Sali, 1995). [May, A. C. W., and Blundell, T. L. (1995). Automated comparative modelling of protein structures. 5, 355-360. Sali, A. (1995). Modeling mutations and homologous proteins. 6, 437-51.]
As is well known to those skilled in the art, sequence analysis becomes ineffective as a tool to establish homology after sequence identity between two homologous proteins drops below approximately 25% for a protein of typical length. At this point (the "twilight zone"), non-homologous sequences share the same level of sequence similarity with a target protein as homologous sequences, making it impossible to determine from sequence data alone whether two proteins are homologous or not. Thus, while a high similarity score (corresponding to a high sequence identity in an alignment with few gaps) is generally a strong indicator of homology, a low score is generally not a reliable indicator of non-homology. Many of the sequence analysis tools presently being developed attempt to extract evidence of homology from sequence data for proteins that have statistically marginal or sub-significant similarities, and to use this to predict conformation.
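The sequence identity figure discussed above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity only over columns where neither sequence has a gap; other denominators are also in use, so this is one convention among several. The function names and the 25% constant are illustrative, the latter taken from the approximate figure in the text.

```python
TWILIGHT_THRESHOLD = 25.0  # approximate percent identity cited in the text


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity over alignment columns where neither sequence is gapped."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the count
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def in_twilight_zone(aligned_a: str, aligned_b: str) -> bool:
    """True when identity falls below the approximate threshold from the text."""
    return percent_identity(aligned_a, aligned_b) < TWILIGHT_THRESHOLD
```

A pair flagged by `in_twilight_zone` is one for which, as the text explains, a low score neither establishes nor excludes homology.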
One approach for identifying long distance homologs when alignment scores are statistically marginal is to do a "profile analysis" [Gribskov, M., McLachlan, A. D., Eisenberg, D. Profile analysis: Detection of distantly related proteins. Proc. Nat. Acad. Sci. 84, 4355-4358 (1987)]. In this approach, a set of sequences of members of a protein family is examined. The sequence similarities in this set of proteins must be sufficient to establish that the proteins in the set are homologous and adopt the same fold. A multiple alignment of the sequences is constructed. Then, for each position in the multiple alignment, a position-specific scoring matrix is constructed using as input the amino acids at that position for each protein in the multiple alignment. A "profile" of the protein is the collection of these matrices for every position in the entire protein sequence alignment. The sequence of a protein that is a possible homolog of the family (but whose sequence is too dissimilar from that of any individual member of the family to give a score that is statistically adequate) is then matched against the profile and scored. If the score is high, the hypothesis that the protein is a possible homolog of the family is strengthened.
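The profile construction and scoring steps described above can be sketched as follows. This is a simplified frequency-based variant (log-odds scores against a uniform background, with pseudocounts), not the weighting scheme of the cited profile analysis paper. It also assumes an ungapped alignment and ungapped scoring, and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score_window(profile, segment):
    """Sum the position-specific scores of a segment as long as the profile."""
    return sum(column.get(res, 0.0) for column, res in zip(profile, segment))


def best_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, offset)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
```

A high value from `best_match` strengthens, but does not establish, the hypothesis that the query belongs to the family.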
In practice, profile analyses identify many proteins in a database that are possible homologs, where the correct "hits" are buried in a large number of false positives. For this reason, profile analysis is virtually useless as a tool for excluding the possibility that two proteins are homologous, or contain the same core fold.
Another approach for identifying long distance homologs when alignment scores are statistically marginal is to search for sequence "templates" or "motifs", short segments of polypeptide chain that might be conserved over long distances [Taylor, W. R. J. Mol. Biol. 188, 233-258 (1986); Taylor, W. R., Thornton, J. M. Mol. Biol. 173, 487-514 (1984); Wierenga, R. K., Terpstra, P., Hol, W. G. J., J. Mol. Biol. 187, 101-107 (1986)]. Here, the presence of analogous motifs in two protein sequences can be used to infer long distance homology between a target protein and a protein with known conformation, and from this inference, a model of the target protein can be built on the structure of the other. As with profile analysis, the presence of a template is not a reliable indicator of long distance homology and similar fold. For example, in the first example presented in Ser. No. 07/857,224 (for protein kinase), several groups had noted that the protein has a sequence motif Gly-Xxx-Gly-Xxx-Xxx-Gly (where Xxx is any amino acid) [Sternberg, M. J. E., Taylor, W. R. Modeling the ATP binding site of oncogene products, the epidermal growth-factor receptor and related proteins FEBS Lett. 1984, 175, 387-392.]. Further, it was noted that a similar motif was found in adenylate kinase, for which a crystal structure was known. Therefore, it was proposed that the two structures are homologous. From this proposal, it was deduced in the literature that protein kinase would adopt the same fold as adenylate kinase. This deduction was argued in Ser. No. 07/857,224 to be incorrect, and was later shown to be incorrect experimentally [Knighton, D. R., Zheng, J., Ten Eyck, L., Ashford, F. V. A., Xuong, N. H. Taylor, S. S., Sowadski, J. M. (1991) Crystal structure of the catalytic subunit of cyclic adenosine-monophosphate dependent protein-kinase. Science 253, 407-414.].
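A search for the Gly-Xxx-Gly-Xxx-Xxx-Gly motif mentioned above can be sketched with a regular expression over one-letter amino acid codes. The function name is hypothetical and the example sequences in the usage note are invented. As the protein kinase case shows, a hit from such a search is not by itself evidence of homology or of a shared fold.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Gly-Xxx-Gly-Xxx-Xxx-Gly, with Xxx restricted to the 20 standard residues.
# The lookahead lets overlapping occurrences be reported as separate hits.
MOTIF = re.compile(rf"(?=(G[{AMINO_ACIDS}]G[{AMINO_ACIDS}]{{2}}G))")


def find_motif(sequence: str):
    """Return (zero-based start, matched segment) for every motif occurrence."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(sequence.upper())]
```

For instance, `find_motif("MKGAGSFGRV")` reports a single occurrence starting at index 2, while a sequence with no glycine returns an empty list.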
Further, motif analysis had not (prior to Ser. No. 07/857,224) been used as part of any tool to infer the absence of homology. The statistics of motif analysis are such that it could not be so used without supporting analysis.
The majority of effort to exploit the relationship between evolutionary history and conformation implicit in Rossman's observation has been applied to attempting to establish homology based on sequence similarity, and then to infer conformation. Very few investigators have pursued the inverse problem, developing tools to use the similarity of two folds as an indicator of distant homology.
Some efforts had been made to use predicted structures (as opposed to experimental structures) to detect long distance homology. For example, Pearl and Taylor [Pearl, L. H., and Taylor, W. R. (1987). A structural model for the retroviral proteases. 329, 351-4] and Bazan and Fletterick [Bazan, J. F., and Fletterick, R. J. (1988). Viral cysteine proteases are homologous to the trypsin-like family of serine proteases: structural and functional implications. 85, 7872-7876] were able to interpret a secondary structure prediction made by consensus GOR prediction for viral proteases with unknown structure to confirm the speculation that these proteases are homologs of aspartic proteases with known experimental structures. Sheridan et al. [Sheridan, R. P., Dixon, J. S., Venkataraghavan, R. Generating plausible protein folds by secondary structure similarity. Int. J. Pept. Prot. Res. 25, 132-143 (1985)] were perhaps the first to suggest that an array of predicted secondary structural elements might be used as a query to search proteins of known conformation to detect possible distant homologs. In none of these studies, however, was it recognized that core secondary structural elements must be weighted strongly in this comparison.
Prior to Ser. No. 07/857,224, no art had concerned itself with the question of how to use predicted structures to show that two proteins were not homologous. While secondary structure predictions, coupled with experimental data, could on occasion detect similar folds (primarily all-helical folds), they were clearly insufficiently reliable to permit the exclusion of homologous folds in proteins that had a potential for distant relationship. Both threading and profile analysis methods usually generate long lists of potential targets, without clearly excluding any as homologs.
Tools able to rule out homology will become more important as genome projects begin to produce large amounts of data. As is well appreciated by those of ordinary skill in the art, genome sequencing projects frequently identify the sequence of a protein about whose physiological function little or nothing is known. Under these circumstances, the most reliable approach for assigning physiological function to a protein is to identify a homologous protein with known function. It is frequently the case that no homolog with known function has a sequence similarity that allows a statistically significant case to be made for homology. In these cases, tools that rule out long distance homology are as useful as tools that establish it, as they limit the number of possible long distance homologs.
A method for making a model for the folded structure of a set of proteins from an evolutionary analysis of a set of aligned homologous protein sequences was claimed in Ser. No. 07/857,224. The instant application concerns methods for using these models. The first method is used to confirm or deny a hypothesis that two proteins are homologous, and comprises comparing a predicted structure model for one family of proteins with a predicted structure model for a second family of proteins, or an experimental structure for the second family, and deducing the presence or absence of homology based on the presence or absence of structural similarity flanking key residues in the polypeptide sequence. The second method identifies mutations during the divergent evolution of a protein sequence that are potentially adaptive by identifying episodes during the divergent evolution of a family of proteins where there is a high absolute rate of amino acid substitution, or a high ratio of non-silent substitutions to silent substitutions. Amino acid substitutions occurring during such an episode are likely to be adaptive. The third is a method for identifying specific in vitro properties of the protein that are likely to play a physiological role in vivo in an organism. This method involves synthesizing in the laboratory proteins having the reconstructed amino acid sequences of a protein before and after a period of rapid sequence evolution that characterizes adaptive substitution, measuring the in vitro properties of the protein before the episode of rapid sequence evolution, and then measuring the in vitro properties of the protein after the episode of rapid sequence evolution. The in vitro behaviors that remained unchanged through this episode are not likely to have adaptive significance physiologically. The in vitro behaviors that changed through this episode are likely to have adaptive significance physiologically.
The fourth concerns a method for organizing genome-sized sequence databases.
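The substitution ratio used by the second method above can be illustrated with a crude counting sketch. It assumes two aligned, in-frame coding sequences of equal length (for example, a reconstructed ancestral gene and a descendant) and the standard genetic code. Each differing codon pair is classified once, as silent or non-silent, by comparing the encoded amino acids. It does not normalize by the number of silent and non-silent sites, nor does it resolve multiple changes within a codon, as more careful estimators do. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(ancestor: str, descendant: str):
    """Return (non_silent, silent) counts of differing codon pairs."""
    if len(ancestor) != len(descendant) or len(ancestor) % 3:
        raise ValueError("sequences must be aligned, in frame, and equal length")
    non_silent = silent = 0
    for i in range(0, len(ancestor), 3):
        a = ancestor[i:i + 3].upper()
        d = descendant[i:i + 3].upper()
        if a == d:
            continue
        if a not in CODON_TABLE or d not in CODON_TABLE:
            continue  # skip codons containing gaps or ambiguous bases
        if CODON_TABLE[a] == CODON_TABLE[d]:
            silent += 1
        else:
            non_silent += 1
    return non_silent, silent


def substitution_ratio(non_silent: int, silent: int) -> float:
    """Ratio of non-silent to silent counts; infinite if only non-silent seen."""
    if silent:
        return non_silent / silent
    return float("inf") if non_silent else 0.0
```

Applied to each branch of an evolutionary tree, a ratio that is high relative to the other branches would flag a candidate episode of adaptive change. The cutoff for "high" is left open here, as it is in the text.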