1. Field of the Invention
The invention relates generally to analyzing biological sequences. This invention relates more particularly to methods for analyzing biological sequences using algorithms, which sequences include, but are not limited to, proteins, ribonucleic acids (RNA), deoxyribonucleic acids (DNA), lipids, and polysaccharides (sugars).
2. Description of Related Art
The ability of all cells to recognize their environment and to make appropriate responses to stimuli depends on the organized activity of networks of proteins that we conventionally refer to as the cellular signal transduction machinery. These protein networks show remarkable signal processing properties such as the ability to extract small signals from noise and to adjust their sensitivity to changes in background stimulation while preserving excellent specificity. As used herein, “specificity” is the ability of proteins or protein networks to selectively respond to one stimulus in the background of other potentially competing stimuli. Defects in signaling proteins are commonly the basis for many human diseases, highlighting the need for a fundamental understanding of the mechanisms of signal recognition and processing.
The basic paradigm of signaling involves the sequential establishment of molecular interactions and the allosteric control of enzyme activities. At an atomic level, these processes reduce to the orderly flow of energy within and between proteins whose structural basis is not generally well understood. For example, the effect of ligand binding at extracellular sites in a transmembrane receptor molecule presumably propagates via the motion of coupled structural elements to induce functional changes in intracellular domains and the subsequent interaction with downstream target proteins. The interaction of one protein with another can be thought of as an energetic perturbation to each binding surface that propagates through the three-dimensional structure to cause specific changes in protein function (Holt, J. M. and Ackers, G. K., Faseb J. 9: 210-218, 1995; Monod, J. et al., J. Mol. Biol. 12: 88-118, 1965; Perry, K. M. et al., Biochem. 28: 7961-7968, 1989; Pettigrew, D. W. et al., Proc. Natl. Acad. Sci. U.S.A. 79: 1849-1853, 1982; LiCata, V. J. and Ackers, G. K., Biochemistry 34: 3133-3139, 1995; Turner, G. J. et al., Proteins 14: 333-350, 1992). The structural basis of this energy propagation is largely unknown, but is likely to be critical in understanding the relationship between protein function and structure.
At specific protein-protein interfaces, large-scale mutagenesis together with structure determination has begun to define some features of energy parsing. (As used herein, “energy parsing” describes the way that energy is parceled out amongst the amino-acid residues at a particular protein-protein interface. Mutagenesis is a method of generating DNA-level changes to a gene encoding a protein in order to change the identity of an amino acid at a chosen position on the protein.) For example, studies of the interaction of human growth hormone with its receptor show that binding energy is not smoothly distributed over the interaction surface; instead, a few residues comprising only a small fraction of the interaction surface account for the majority of the free energy change (Atwell, S. et al., Science 278: 1125-1128, 1997; Clackson, T. and Wells, J. A., Science 267: 383-386, 1995; Wells, J. A., Proc. Natl. Acad. Sci. U.S.A. 93: 1-6, 1996; Wells, J. A., Biotechnol. 13: 647-651, 1995).
Similarly, potassium channel pores interact with peptide scorpion toxins with high affinity, but most of the binding energy depends on only two amino acid positions on the toxin molecule, even though fifteen residues are likely buried upon binding (Goldstein, S. A. et al., Neuron 12: 1377-1388, 1994; Hidalgo, P. and MacKinnon, R., Science 268: 307-310, 1995; Ranganathan, R. et al., Neuron 16: 131-139, 1996; Stampe, P. et al., Biochemistry 33: 443-450, 1994). Thus, protein interaction surfaces contain functional epitopes or “hot spots” of binding energy that are generally not predictable from the atomic structure.
In addition, a large body of evidence suggests that the change in free energy at a protein interaction surface propagates through the tertiary structure in a seemingly arbitrary manner. For example, studies addressing mechanisms of substrate specificity in serine proteases show that many positions distant from the active site contribute to determining the energetics of catalytic residues (Hedstrom, L., Biol. Chem. 377: 465-470, 1996; Hedstrom, L. et al., Science 255: 1249-1253, 1992; Perona, J. J. et al., Biochemistry 34: 1489-1499, 1995).
Indeed, the conversion of trypsin to chymotrypsin specificity required a large set of simultaneous mutations, many at unexpected positions. Similarly, mutations introduced during maturation of antibody specificity have been shown to occur at sites distant in tertiary structure from the antigen-binding site despite substantial increases in binding energy (Patten, P. A. et al., Science 271: 1086-1091, 1996). Thus, protein function appears to depend on the energetic interactions of a set of amino acid positions that are structurally dispersed and that, like binding hot spots, are unpredictable from even high-resolution crystal structures.
One potential approach to mapping these energetic interactions in a protein is through massive mutagenesis. Indeed, thermodynamic mutant cycle analysis (Hidalgo, P. and MacKinnon, R., Science 268: 307-310, 1995; Carter, P. J. et al., Cell 38: 835-840, 1984; Schreiber, G. and Fersht, A. R., J. Mol. Biol. 248: 478-486, 1995), a technique that measures the energetic interaction of two mutations, provides a direct method to systematically probe energetic relationships of protein sites. However, practical considerations, such as the number of mutants that can be reasonably generated and studied per unit time in the laboratory, limit this technique to small-scale studies, precluding a full mapping of all energetic interactions on a complete protein.
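By way of illustration only, the arithmetic behind a double-mutant cycle and the scale of an exhaustive pairwise study can be sketched as follows. The sketch assumes one commonly used convention, in which a coupling coefficient is formed from four measured dissociation constants (wild type, each single mutant, and the double mutant) and converted to a coupling free energy as RT ln(omega); sign conventions differ between reports. The function names, the example constants, and the assumption of a single substitution per position are hypothetical and are not taken from the cited references.

```python
import math

# Gas constant in kcal/(mol*K); temperatures are in kelvin.
R_KCAL = 1.987204e-3


def coupling_energy(kd_wt, kd_m1, kd_m2, kd_m12, temperature_k=298.15):
    """Return (omega, ddG_int in kcal/mol) for one double-mutant cycle.

    omega == 1 (ddG_int == 0) means the two mutations act independently;
    any deviation indicates an energetic interaction between the two sites.
    """
    for kd in (kd_wt, kd_m1, kd_m2, kd_m12):
        if kd <= 0:
            raise ValueError("dissociation constants must be positive")
    omega = (kd_wt * kd_m12) / (kd_m1 * kd_m2)
    return omega, R_KCAL * temperature_k * math.log(omega)


def pairwise_cycles(n_positions):
    """Number of distinct position pairs, i.e. mutant cycles to measure."""
    return n_positions * (n_positions - 1) // 2


def constructs_needed(n_positions):
    """Wild type + all single mutants + all double mutants,
    assuming exactly one substitution is tested per position."""
    return 1 + n_positions + pairwise_cycles(n_positions)


if __name__ == "__main__":
    # Hypothetical dissociation constants (molar), for illustration only.
    omega, ddg = coupling_energy(1e-9, 1e-8, 5e-9, 5e-9)
    print(f"omega = {omega:.3g}, coupling energy = {ddg:.2f} kcal/mol")

    # Scale of an exhaustive study on a hypothetical 100-residue protein.
    print(pairwise_cycles(100), "cycles;", constructs_needed(100), "constructs")
```

Under these assumptions, the number of cycles grows quadratically with protein length, which is the practical limitation noted above: even a single substitution per position in a 100-residue protein would require 4950 double mutants.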
Statistical methods have been reported for the analysis of biological sequences, typically in the determination of homologous protein families and evolutionary conservation.
Ortiz, A. R. et al. (Pac. Symp. Biocomput., 316-327, 1997) describe a method of predicting the low-resolution three-dimensional structure of proteins starting from a multiple sequence alignment. Secondary structure predictions and minimized Monte Carlo energy calculations are used to predict protein structures.
Sunyaev, S. R. et al. (Protein Eng., 12: 387-394, 1999) describe the use of position-specific independent counts at a given position in a sequence alignment in identifying distantly related protein sequences.
Karlin, S. and Brendel, V. (Science, 257: 39-49, 1992) discuss the use of statistical methods for characterizing anomalies in sequences, for determining compositional biases in proteins, and for analyzing spacings of sequence markers. Karlin (Curr. Opin. Struct. Biol., 5: 360-371, 1995; Philos. Trans. R. Soc. Lond. B. Biol. Sci. 344: 391-402, 1994) further describes the use of statistical methods for the identification of common segments between protein sequences, and the use of distributional theory in multiple sequence alignments.
Bailey, T. L. and Gribskov, M. (Bioinformatics, 14: 48-54, 1998) propose the use of the QFAST statistical algorithm for accurate and sensitive sequence homology searches.
Hughey, R. and Krogh, A. (Comput. Appl. Biosci. 12: 95-107, 1996) discuss the use of hidden Markov models (HMMs) to identify protein sequences with a given domain, or to perform a multiple alignment of sequences.
Vingron, M. and Waterman, M. S. (J. Mol. Biol. 235: 1-12, 1994) describe statistical analyses of DNA and protein alignments. Statistics are used to optimize alignment parameters.
Leluk, J. (Comput. Chem. 22(1):123-131, 1998) describes statistical analyses of proteins taking advantage of the correlation between amino acids and their corresponding DNA codons. The analyses are useful for determining corresponding sequences between proteins, and for investigating evolutionary divergence between proteins.
Bohm, G. and Jaenicke, R. (Protein Sci. 1: 1269-1278, 1992) propose the use of statistical methods for discriminating between native three-dimensional protein structures and corresponding misfolded structures.
U.S. Pat. No. 5,523,208 (issued Jun. 4, 1996) discusses the use of amino acid hydropathy values to search protein databases for proteins predicted to interact with each other.
The foregoing shows that a need exists for improved methods for the identification of evolutionarily-conserved and interacting positions in biological sequences, such as interacting amino acid positions in protein sequences. The identification of evolutionarily-conserved amino acid positions may be used to identify key regions in a protein for protein-drug interactions, to identify potential sites in proteins that lead to hereditary mutation diseases, and to identify catalytic sites in order to improve enzyme activities, to name but a few examples. The identification of interacting amino acid positions is useful to predict how a protein folds into a three-dimensional structure, to predict how distant sites may interact to form a catalytic active site in an enzyme, and to predict how a drug interaction at one amino acid position may affect other amino acid positions, to name but a few examples.