The determination of protein tertiary structure from computer simulation is a long-standing problem in theoretical chemistry. The complexity of the problem is due to the astronomical number of possible conformations available to a molecule of the size of a typical protein and the delicate balance of forces which stabilizes the native structure. This leads to a very complicated high-dimensional potential energy surface with a large number of local minima. Any attempt to search for the global minimum therefore must be able to efficiently explore a large number of possible configurations by overcoming local barriers which act to trap the system in a local minimum. In addition, such a search must be sufficiently accurate to distinguish the global minimum from other possible solutions at the desired level of resolution.
Detailed models with all of the protein atoms and realistic force fields, such as those commonly used in molecular modeling, are computationally much too expensive to contemplate searching the huge range of conformations necessary to discover the native topology of a complicated protein domain like myoglobin when starting from an unfolded state. Therefore, there have been numerous efforts during the past two decades to construct reduced representations of the protein, and corresponding approximate potential functions. Both lattice and off-lattice models have been employed,[1] and a variety of algorithmic approaches have been used to search for the global minimum, including exact enumeration of all states of the model, Metropolis Monte Carlo with simulated annealing, genetic algorithms, and Brownian dynamics.
Most of the prior work involved relatively small proteins at a very coarse level of representation. When dealing with a larger protein, however, many of the strategies pursued for these systems (e.g. exact enumeration) become infeasible. To treat a system such as myoglobin, additional constraints are required, at least at the current level of available computational power. One approach, taken by Skolnick and coworkers,[2] was to bias the turn regions in the direction of the native topology. While these studies provided many interesting qualitative results concerning pathways of folding, it was difficult to relate the biases to terms in the actual potential energy function; it is also unclear how one might obtain such information for a protein whose structure is not known.
A second strategy is to fix the protein secondary structure. From an energetic viewpoint, one can imagine decomposing the potential function into helix to .beta.-sheet stabilization terms and the remaining terms representing long range hydrophobic, electrostatic, and van der Waals interactions. In practice, one might specify secondary structure either from NMR experimental data or via secondary structure prediction algorithms. Finally, it is reasonable to suppose that when secondary structure is fixed, the number of potential minima is drastically decreased, thus greatly ameliorating the multiple-minimum problem which has been a major barrier to solution of the folding problem. This idea underlies the collision-diffusion model of Karplus and coworkers concerning the folding of proteins in vivo;[3] the basic picture of formation of some secondary structure elements as an early folding event is supported by various experimental results.[4]
Early work in this direction was carried out by Warshel and Levitt[5] and by Cohen and coworkers. [6] While promising results have been reported, the methods used have not been sufficiently general to be applicable to a general protein of arbitrary complexity. To do this, according to the present invention, one must employ statistical mechanical and applied mathematical methods for minimization of functions, as opposed to construction of a small subset of configurations which is easily biased by knowledge of the correct structure.
The design of the present algorithm begins with the use of a hierarchical approach in which representations of the molecule are combined at different levels of detail in order to increase the computational efficiency of the simulation. This technique is an extension of earlier work carried out for simple polymers.[8] A Monte Carlo simulation is performed in which the trial moves consist of changes in the internal coordinates. Since each move can generate a very different structure, it is necessary to re-evaluate the energy and self-avoidance for the entire molecule at each step. However, in a compact structure (such as that of a protein) most trial moves will result in highly unfavorable configurations. Such moves can thus be rejected based on a fairly crude representation of the molecule and only those new structures which are comparable in energy in that representation need evaluated at the more detailed level.
In the case of proteins, there is a natural segmentation of the polymer chain into well-defined secondary structural units (i.e. helices to loops). In particular, .alpha.-proteins are considered which consist of relatively rigid helices connected by flexible loop regions. By constructing a model based on these structural units rather than the individual amino acids, the number of independent polymer segments is reduced by roughly an order of magnitude. The space of possible configurations of such a model is correspondingly much smaller, and can be much more effectively explored by computational means. In addition, this corresponds to the intuitive physical picture of a molecular structure which is dominated by the packaging of the helical regions. This suggests that in addition to reducing the size of the problem, the method of the present invention can simplify the potential surface by eliminating local minima which are due to finer details of the structure.
The use of the crude representation can thus allow for a more efficient exploration of the configurational space, but it lacks the detail necessary to identify correctly the global minimum. For this reason, it is important to maintain the individual residue representation and to periodically evaluate its energy so that the evolution of the configuration is ultimately determined by the more detailed model. The optimization of the crude model thus serves as a way to generate suitable trial moves for a higher level optimization of the more detailed model. With this approach, the evaluation of the lower level configurations need not be a perfect predictor of the higher level energy, but merely a good enough approximation to ensure that the more detailed calculation is only carried out for plausible changes in the structure. There will be a trade-off between the error introduced by minimizing an approximate potential function and the gain in computational speed due to less frequent evaluation of the more detailed model. The goal of the present invention is to maximize the overall efficiency by implementing the two-level representation.
The present invention provides simulations of myoglobin carried out on a 16-node partition of a massively parallel CM-5 supercomputer. The vastly increased computing power of the CM-5 (equivalent to roughly 16 IBM 550 workstations) has been essential in designing a successful methodology. However, this would have been insufficient by itself; major modifications of the potential function and computational algorithm have also been required. On the order of 10.sup.10 structure evaluations are required to produce reasonable low resolution myoglobin structures and that complex simulated annealing strategies are needed to insure the efficient traversing of potential barriers. It is within the scope of the present invention that the same algorithm can be used to fold a wide variety of proteins to the same resolution, including those containing .beta.-sheets.