The present invention relates generally to the field of molecular biology. More particularly, the invention relates to novel methods of predicting alpha helical segments, beta-sheets, and the tertiary structure of a polypeptide ab initio, provided with only the amino acid sequence of the polypeptide.
Proteins are essential molecules exhibiting complex structural and functional relationships. Biological functionality is defined by the native three-dimensional structure of the protein, which in turn depends on an intricate balance of molecular interactions. It is well known that many proteins fold spontaneously from random disordered states into compact (native) states of unique shape. However, the ability to explain the mechanisms that govern this transformation has not yet been realized. The goal in solving this protein folding problem is to understand this folding process and to predict the three dimensional structure of proteins from their one dimensional amino acid sequence.
Structure prediction of polypeptides and proteins from their amino acid sequences is regarded as a holy grail in the computational chemistry, molecular and structural biology communities. According to the thermodynamic hypothesis, the native structure of a protein in a given environment corresponds to the global minimum free energy of the system. Anfinsen, Science 181, 223 (1973). Recent reviews assess the advances made in this area by researchers across several disciplines. Dill, Prot Sci 8, 1166 (1999); Koehl and Levitt, Nature Structural Biology 6, 108 (1999); Wales and Scheraga, Science 285, 1368 (1999). In spite of pioneering contributions and decades of effort, the ab initio prediction of the folded structure of a protein remains a very challenging problem. The current approaches for the structure prediction of polypeptides can be classified as : (i) homology or comparative modeling methods; (ii) fold recognition or threading methods; (iii) ab initio methods that utilize knowledge-based information from structural databases (e.g., secondary and/or tertiary structure restraints); and (iv) ab initio methods without the aid of knowledge-based information.
Knowledge-based ab initio methods exploit information available from protein databases regarding secondary structure, introduce distance constraints, and extract similar fragments from multiple sequence alignments in an attempt to simplify the prediction of the folded three-dimensional protein structure. Contributions to the art include the work reported in Levitt, J. Mol. Biol. 170, 723 (1983); Hinds and Levitt, J. Mol. Biol. 243, 668 (1994); Ortizet al., Proc. Natl. Acad. Sci. USA 95, 1020 (1998a); Skolnick et al., J. Mol. Biol. 265, 217 (1997); Simons et al., Proteins 34, 82 (1999a); Shortle et al., Proc. Natl. Acad. Sci. USA 95, 11158 (1998); Sun et al., Protein Engineering 8, 769 (1995); Monge et al., Proc. Natl. Acad. Sci. USA 91, 5027 (1994); Monge et al., J. Mol. Biology 247, 995 (1995); M. Standley et al., J Mol Bio 285, 1691 (1999).
Ab initio methods that are not guided by knowledge-based information represent the most challenging category. Important advances include, among others, Scheraga et al., J Global Optimization 15, 235 (1999); Liwo et al., Proc. Natl. Acad. Sci. USA 96, 5482 (1999); Lee et al., Biopolymers 46, 103 (1998); Srinivasan and Rose, PNAS 96, 14258 (1999); Yue and Dill, Protein Science 5, 254 (1996); Dill et al., J. Computational Biology 4, 227 (1997). A recent assessment of the current status of both types of ab initio protein structure prediction approaches may be found in Orengo et al., Proteins Suppl. 3, 149 (1999).
The above methods fail to predict accurately and reliably the tertiary structure of a polypeptide, as determined by lack of agreement with experimental observations of the tertiary structure. Thus, there is a need for more accurate and reliable methods of determining the tertiary structure of polypeptides ab initio.
The present invention provides the novel ASTRO-FOLD approach for the ab initio prediction of the three dimensional structures of proteins. The four stages of the approach are exemplified in FIG. 1. The first stage involves the identification of helical segments. This aspect of the invention is accomplished by first partitioning the amino acid sequence into oligopeptides, for example, pentapeptides, such that consecutive pentapeptides possess an overlap of four aminoacids. Then, atomistic level modeling is performed using a selected force field. Many force field parameterizations exist for protein systems. In one aspect of the invention, the ECEPP/3 force field, which includes non-bonded, hydrogen-bonding, electrostatic, torsional and disulfide loop-closing terms is employed. Nemethy et al., J. Phys. Chem. 96, 6472 (1992). The next steps involve generating an ensemble of low energy conformations, then calculating free energies that include entropic, cavity formation, polarization and ionization contributions for each pentapeptide, and finally the calculation of helix propensities for each residue using equilibrium occupational probabilities of helical clusters.
The partitioning of the amino acid sequence into overlapping oligopeptides is based on the idea that helix nucleation relies on local interactions and positioning within the overall sequence. The explicit consideration of local interactions through overlapping oligopeptides allows for detection of cases in which identical amino acid sequences adopt different conformations in different proteins. See, e.g., Minor and Kim, Nature 380, 730 (1996). This is consistent with the observation that local interactions extending beyond the boundaries of the helical segment retain information regarding conformational preferences. See, e.g., Baldwin and Rose, TIBS 24, 77 (1999). The partitioning pattern is generalizable and can be extended to heptapeptide, nonapeptide, and even longer systems, such as those equivalent in length to the longest known helical segments. See, e.g., Anfinsen and Scheraga, Advances In Protein Chemistry 29, 205 (1975). Other methods have utilized partitioning schemes, but these only provide for discrete states of the central residue of non-overlapping peptides and have not considered the implications of free energy modeling. The overall methodology for the ab initio prediction of helical segments encompasses the following steps:
1. The overlapping pentapeptides are modeled as neutral peptides surrounded by a vacuum environment using the ECEPP/3 force field. An ensemble of low potential energy pentapeptide conformations, along with the global minimum potential energy conformation, is identified using a modification of the xcex1BB global optimization approach [Klepeis and Floudas, J Chem Phys 110, 7491 (1999)] and the conformational space annealing approach [Lee et al., Biopolymers 46, 103 (1998)]. For the set of unique conformers, Z, determined by removing all duplicate and symmetric minima, including the clustering of any two minima that do not differ by more than 50 degrees for at least one dihedral angle, free energies (Fharvac) are calculated using the harmonic approximation for vibrational entropy. See Klepeis and Floudas, J Chem Phys 110, 7491 (1999).
2. The energy for cavity formation in an aqueous environment is modeled using a solvent accessible surface area expression, Fcavity=xcex3A+b, wherein A is the surface area of the protein exposed to the solvent. This macroscopic free energy term is based on a fit to experimental free energies of the transfer of alkane molecules into water. The values for the xcex3 and b parameters are taken to be 0.005 kcal/molAA and 0.860 kcal/mol, respectively.
3. For the set of unique conformers, Z, the total free energy is calculated from the expression
FtotalFvachar+Fcavity+Fsolv+Fionize,
wherein Fsol represents the difference in polarization energies caused by the transition from a vacuum to a solvated environment, and Fionize represents the ionization energy. Solvation and ionization free energies are typically calculated through the solution of the nonlinear Poisson Boltzmann equation, using finite difference or multigrid boundary element solution methods. See, e.g., Gilson et al., J Comp Chem 9, 327 (1987). Here the finite difference solution of the Poisson Boltzmann equation, as implemented in the DELPHI package, is adopted. Honig and Nicholls, Science 268, 11144 (1995).
4. For each oligopeptide, total free energy values (Ftotal) are used to evaluate the equilibrium occupational probability for conformers having three central residues within the helical region of the xcfx86-"psgr" space. Helix propensities for each residue are determined from the average probability of those systems in which the residue constitutes a core position.
In the second stage of tertiary structure prediction, xcex2 sheets and disulfide bridges are identified through a novel superstructure-based mathematical framework originally established for chemical process synthesis problems. Floudas, Nonlinear and mixed-integer optimization (Oxford University Press, New York, 1995). In this aspect of the invention, two types of superstructure are introduced, both of which emanate from the principle that hydrophobic interactions drive the formation of xcex2 structure. The first type, denoted hydrophobic residue-based superstructure, encompasses all potential contacts between pairs of hydrophobic residues (i.e., a contact between two hydrophobic residues may or may not exist) that are not contained in helices (except cystines which are allowed to have cystine-cystine contacts even though they may be in helices). The second type, denoted xcex2 strand-based superstructure, includes all possible xcex2 strand arrangements of interest (i.e., a xcex2 strand may or may not exist) in addition to the potential contacts between hydrophobic residues. The hydrophobic residue-based and xcex2 strand-based superstructures are formulated as mathematical models which feature three types of binary variables : (i) representing the existence or nonexistence of contacts between pairs of hydrophobic residues; (ii) denoting the existence or nonexistence of the postulated xcex2 strands; and (iii) representing the potential connectivity of the postulated xcex2 strands. Several sets of constraints in the model enforce physically legitimate configurations for antiparallel or parallel xcex2 strands and disulfide bridges, while the objective function maximizes the total hydrophobic contact energy. The resulting mathematical models are Integer Linear Programming (ILP) problems which not only can be solved to global optimality, but can also provide a rank ordered list of alternate xcex2 sheet configurations.
In the hydrophobic residue-based superstructure and its mathematical model, all residues not belonging to helices (except cystines) are first classified as hydrophobic, bridge, turn or other residues. The residues, H, are considered to be hydrophobic: H=Leu, Ile, Val, Phe, Met, Cys, Tyr, Trp; the residues, B, are considered to be bridge: B=Ala, Thr; the residues, T, are considered to be turn: T=Asn, Asp, Gly, Pro, Ser; and the residues, N, are considered to be other: N=Arg, Lys, Glu, Gln, His. The xcex2 strand-based superstructure requires the introduction of a protocol for the placement of the H, B, T, and N residues. Each residue is also assigned a position dependent parameter, P(i), which is equal to the position of the hydrophobic residue in the overall sequence and is used extensively to describe allowable contacts between hydrophobic residues. Hydrophobicity parameters, Hi, are taken from hydrophobicity scales derived either from experimental transfer free energies or from statistical data. See, e.g. Karplus, Protein Science 6, 1302 (1997); Lesser and Rose, Proteins 8, 6 (1990). The interaction energy for a potential hydrophobic contact is assumed to be additive, and for cases involving a cystine-cystine contact an additional energy contribution, Hijadd, is included. The additional cystine-cystine contribution, Haddij, is based on the addition of the interaction energy for all hydrophobic residues between the potential disulfide bonding pair, [(xcexa3k, P(i)xe2x89xa6P(k)xe2x89xa6P(j)Hk)/(P(j)xe2x88x92P(i))]. The objective is to maximize the contact energy:   max  ⁢            ∑      i        ⁢                  ∑                  j          ,                                                    P                ⁢                                  (                  i                  )                                            +              2                         less than                           P              ⁢                              (                j                )                                                        ⁢                        (                                    H              i                        +                          H              j                        +                          H              ij              add                                )                ⁢                  y          ij                    
where the binary (0-1) variables, yij, represent potential hydrophobic contacts. The maximization is subject to several sets of constraints which define allowable xcex2 sheet configurations. For antiparallel xcex2 sheets the constraints
yij+yklxe2x89xa61∀P(i)+P(j)xe2x89xa0P(k)+P(l), yij OR ykl ∉{Cys, Cys}
which require that the sum of the contact position parameters must be equal, enforce the formation of symmetric, non-intersecting loops. The constraint is not enforced when the potential contact represents a disulfide bridge. For parallel xcex2 sheets the contacts must involve symmetric intersecting loops, which produces the following set of constraints:
yij+yklxe2x89xa61 ∀P(k)xe2x88x92P(i)xe2x89xa0P(j), yij OR ykl∉{Cys, Cys }
Additional restrictions include: (i) one constraint requiring the formation of at least one contact in which no less than seven residues fall between the contacting residues; (ii) a set of constraints imposing the total number of possible contacts for each residue (initially set to 1 so that each residue may participate in only one contact); and (iii) one inequality constraint to set an upper bound on the number of disulfide bridges.
The third stage of tertiary structure prediction serves as a preparative phase for atomistic-level tertiary structure prediction, and therefore focuses on the determination of pertinent information from the results of the previous two stages. This next aspect of the invention involves the introduction of lower and upper bounds on dihedral angles of residues belonging to predicted helices or xcex2 strands, as well as restraints between the Cxcex1 atoms for residues of the selected xcex2 sheet and disulfide bridge configuration. Furthermore, for segments which are not classified as helices or xcex2 strands, free energy calculations for overlapping oligopeptides are conducted to identify tighter bounds on their dihedral angles.
The fourth and final stage of the approach involves the ultimate prediction of the tertiary structure of the full polypeptide sequence. This next aspect of the invention involves the formulation of the problem relying on dihedral angle and atomic distance restraints acquired from the previous stage, as follows:             min      ϕ        ⁢          xe2x80x83        ⁢          E              ECEPP        /        3                                          subject          ⁢                      xe2x80x83                    ⁢          to          ⁢                      xe2x80x83                    ⁢                                    E              l              distance                        ⁡                          (              ϕ              )                                      ≤                              E            l            ref                    ⁢                      xe2x80x83                    ⁢          l                    =      1        ,    …    ⁢          xe2x80x83        ,          N      CON                  xe2x80x83        ⁢                                        ϕ            i            L                    ≤                      ϕ            i                    ≤                                    ϕ              i              U                        ⁢                          xe2x80x83                        ⁢            i                          =        1            ,      …      ⁢              xe2x80x83            ,              N        ϕ            
wherein i=1, . . . ,N100 refers to the set of dihedral angles, xcfx86i, with xcfx86iL and xcfx86iU and upper bounds on these dihedral angles. The total violations of the l=1, . . . ,NCON distance constraints are controlled by the parameters Erefl. See Klepeis et al., J Comp Chem 20, 1354 (1999). To overcome the multiple minima difficulty, the search is conducted using the xcex1BB global optimization approach, which offers a theoretical guarantee of convergence to an xcex5 global minimum for nonlinear optimization problems with twice-differentiable functions. Androulakis et al., J. Glob. Opt. 7, 337 (1995); Floudas, Deterministic Global Optimization: Theory, Methods and Applications in Nonconvex Optimization and its Applications (Kluwer Academic Publishers, 2000). This global optimization approach effectively brackets the global minimum by developing converging sequences of lower and upper bounds, which are refined by iteratively partitioning the initial domain. Upper bounds correspond to local minima of the original nonconvex problem, while lower bounds belong to the set of solutions of convex lower bounding problems, which are constructed by augmenting the objective and constraint functions by separable quadratic terms. To ensure non-decreasing lower bounds, the prospective region to be bisected is required to contain the infimum of the minima of lower bounds. A non-increasing sequence for the upper bound is maintained by selecting the minimum over all the previously recorded upper bounds. The generation of low energy starting points for constrained minimization is enhanced by introducing torsion angle dynamics [see Gxc3xcntert et al., J. Mol. Biol. 273, 283 (1997)] within the context of the xcex1BB global optimization framework. The xcex1BB has been successfully applied to computational chemistry problems, including microclusters, small acyclic molecules, and isolated and solvated oligopeptides. Floudas, Deterministic Global Optimization: Theory, Methods and Applications Nonconvex Optimization and its Applications (Kluwer Academic Publishers, 2000).
An important question regarding the prediction of the native folded state of a protein is how the formation of secondary and tertiary structure proceeds. Two common viewpoints provide competing explanations to this question. The classical opinion regards folding as hierarchic, implying that the process is initiated by rapid formation of secondary structural elements, followed by the slower arrangement of the tertiary fold. The opposing perspective is based on the idea of a hydrophobic collapse, and suggests that tertiary and secondary features form concurrently. The invention bridges the gap between the two viewpoints by introducing a novel ab initio approach for tertiary structure prediction in which helix nucleation is controlled by local interactions, while non local hydrophobic forces drive the formation of xcex2 structure. The agreement between the experimental and predicted structures validates the use of the ASTRO-FOLD method for generic tertiary structure prediction of polypeptides.
The ability to predict ab initio the tertiary structure of a polypeptide is a valuable contribution to the art. The methods described herein allow those of skill in the art to study conformations of polypeptides of interest, including their biological functionality, but also to advance the investigation of potential ligands and binding partners, such as receptors, the design of competing binding partners, and the discovery of compounds, such as drugs, which affect the native conformation of the polypeptide so as to modulate or regulate the polypeptide""s function.