A long-standing objective in Computational Biology has been to accurately calculate free energy and other thermodynamic properties of a protein under specified chemical and thermodynamic conditions [1]. Furthermore, there is a desire to predict regions within a protein that are flexible and rigid, and locate the correlated motions important to biological function [2]. A fundamental problem that must be overcome by any computational method is to provide ample sampling over the accessible configuration space of a protein, or other molecular system of interest, in thermodynamic equilibrium. Computational methods used to solve these two problems can be classified into either dynamic [3] or thermodynamic [4] approaches. For all but the smallest of systems, computational time required to calculate thermodynamic properties is impractical, using a dynamic approach based on a molecular mechanics force field. Due to the inability of achieving statistically significant predictions of thermodynamic equilibrium properties of proteins, application of force fields remain untrustworthy, as their accuracy cannot be adequately tested.
The dynamic approach typically invokes molecular dynamics to propagate the equations of motion to simulate the evolution of a system to sample its configuration space [5]. Configuration space can be explored more effectively using efficient sampling methods, and by employing simplified force fields that involve coarse-graining. Coarse-graining reduces the number of degrees of freedom (hereinafter, “DOF”) within a system that need to be dynamically considered, but also reduces accuracy of the simulation. Both techniques are routinely used in computational methods available in the field to advance understanding of myriad molecular systems, ranging from bulk water, polymers, nucleotides, proteins and membranes. The dynamic approach also uses Monte Carlo sampling to explore configuration space [6]. In the Monte Carlo method, each configuration is assigned an energy based on the molecular mechanics force field. From a Gibbs ensemble of generated configurations, thermodynamic and other physical properties can be readily calculated. Nevertheless, whether the equations of motion are propagated using molecular dynamics or Monte Carlo sampling is used, predictions of physical properties in thermodynamic equilibrium have not been possible for almost all practical purposes due to incomplete sampling of configuration space.
The thermodynamic approach identifies local conformation states that consist of a sub-ensemble of atomic configurations [7]. A free energy is assigned to these local conformation states, which is split into an enthalpy and entropy contribution based upon a corresponding molecular partition function. For any given conformation of a protein, or other system of interest, the total free energy is estimated as the sum of free energies over all free energy contributions that derive from each local conformation state. This procedure assumes enthalpy and entropy are additive [8]. Although additivity is commonly invoked, this assumption is generally incorrect [9,10] unless all local conformation states are non-interacting.
It is common to transform many interacting particles into a set of collective motions of particles, each with their own identity, and each acting independently. For example, a system of coupled springs between atoms is an excellent model for a solid. This collection of coupled interactions can be transformed into a collection of normal modes of vibration describing small oscillations about equilibrium. Unfortunately, this approach does not work well for macromolecules that are not completely rigid. As the conformation of a flexible molecular system varies over its accessible ensemble of configurations, the transformation into orthogonal normal modes depends on the conformation state [11]. Methods along these lines have been implemented within a dynamic approach, but still have limited ability to explore configuration space. In principle, the best free energy decomposition would be expressed in terms of independent variables to ensure that the assumption of additivity is a good approximation for conformational entropy [12].
Some thermodynamic approaches include explicit conformation dependence by expressing local free energy contributions in terms of local accessibility to solvent [13]. Generalized coordinates are used to keep the free energy contributions additive. The free energy contributions from implicit solvent molecules should be additive, however, because those molecules define the thermodynamic reservoir. Working only with solvent contributions to the free energy is incomplete. The source of non-additivity derives from intramolecular interactions related to conformational entropy. The assumption of additivity applied to just the enthalpy components appears not to pose problems. Non-additivity of conformational entropy enters because configuration space volume is overestimated when there is no account for overlap in the nominal configuration space volume associated with each isolated interaction. The additivity assumption breaks down in applications of proteins, and other biological polymers, because many competitive interactions couple to one another. Unfortunately, it is hard to separate the most important interactions because it is the collection of many weak interactions that is responsible for the observed emergent properties of interest. Consequently, the thermodynamic approach based on an additive free energy decomposition scheme will generally not be accurate, and predictions from these models remain untrustworthy.
Despite the intrinsic problems associated with a thermodynamic approach, it is frequently cast into a coarse-grained statistical mechanics problem, where all accessible conformation states of the molecular system are considered. The free energy of each conformation is estimated by adding the free energy contributions of each local conformation state. In these models, the initial atomic structural details are lost in exchange for a coarse-grained description. As a result, coverage of configuration space to account for the flexibility in protein structure is much better than the dynamic approach. However, the procedure to account for the diversity in configuration space is non-trivial, requiring summing a partition function. In general, it is not possible to sum a partition function exhaustively, even for coarse-grained (or reduced) models. Common techniques for estimating a partition function based on a thermodynamic approach are Monte Carlo sampling, various mean-field approximations or transfer matrix methods for special cases. Another common approximation scheme is to limit the type of configuration states by assuming native-like interactions can break or form, but all other non-native interactions are simply ignored. Using various techniques or combinations thereof, partition functions can be calculated with reasonable accuracy.
Unfortunately, even when a method is used to accurately calculate the partition function for a model using the thermodynamic approach [14], it still returns poor predictions because the additivity assumption of free energy is fundamentally flawed. These models are usually defined in terms of variables that classify local conformations into discrete states void of atomic coordinate information. These models, referred to as Ising-like models, are intrinsically less accurate due to their simplifying approximations. The advantage of these methods is that they provide relatively fast practical computations while giving good coverage of the accessible configuration space. For this reason, a thermodynamic approach is often referred to as an ensemble-based approached. Although the thermodynamic approach has undoubtedly been the most successful way to assess thermodynamic properties of proteins, it remains untrustworthy.
The mechanical properties of a protein that are of particular interest is to determine which parts are flexible and rigid, and which parts influence other parts through cooperative motions [15]. Determination of this type of mechanical information is routinely obtained by using a dynamic approach. The root mean squared deviation in atom positions indicates which regions are flexible or rigid. Principal component analysis of a covariance matrix of atomic motions identifies correlated motions. These results are limited to short time correlations, however, because the statistics are generally not sufficient to be consistent with thermodynamic equilibrium averages. The thermodynamic approach is generally inadequate to describe these mechanical attributes, since they typically employ binary coarse-graining into native-like or disordered states of an entire residue. Correlations at the residue level are readily determined, as well as the average degree that a residue is disordered (locally unfolded) or native-like. The residue-state measures of native-like and disordered are often treated as analogous to the mechanical property of rigid and flexible respectively. Common thermodynamic approaches have no direct connection to mechanics.
A third method exists that models protein structure as a network of distance constraints, from which rigid and flexible regions are identified using network rigidity [16,17]. Although this method provides precise calculation of rigid and flexible regions and correlated motions, it does not account for constraints breaking and new ones forming due to thermodynamic fluctuations. Rather, the distance constraints are determined based on a static protein structure in the native state. Unfortunately this restriction is severe, because proteins undergo many fluctuations in conformation, meaning that an ensemble-based approach is essential to describe thermal equilibrium properties. This mechanical a-thermal modeling of a protein as a fixed distance constraint topology has limited applicability in describing proteins in their native state. This method has been combined with a dynamic approach actively and passively for post-analysis. The former hybrid combination has been implemented as a Monte Carlo geometric simulation [18]. The efficiency in exploring the native state ensemble dramatically increases. Because native constraints do not break and new constraints do not form, however, equilibrium properties cannot be obtained.
A Distance Constraint Model (DCM) is an innovative type of thermodynamic approach that combines free energy decomposition with constraint counting [19,20]. Atomic interactions are modeled by distance constraints, each assigned an enthalpy and entropy value. As the conformation of a protein changes, different sets of constraints will be present. An ensemble of constraint topologies represents all accessible conformations. For a given constraint topology within the ensemble, network rigidity is used to determine which distance constraint is independent or redundant. Total enthalpy is taken to be the sum of enthalpy contributions from all constraints within the network. The sum of entropy contributions from independent constraints gives an upper bound estimate for the conformational entropy. Although the total numbers of independent and redundant constraints are unique, identification of which constraint is independent or redundant is not unique. Therefore, many upper bound estimates for conformational entropy are possible.
Graph algorithms that calculate generic properties of network rigidity, such as identifying independent and redundant constraints, are commonly called pebble games [21]. The operational procedure to identify independent constraints requires recursively placing distance constraints in the network one at a time until all constraints are placed. Giving preference to constraints with lower entropy to be placed in the network before those with greater entropy ensures the lowest possible upper bound estimate for conformational entropy. The pebble game algorithm combined with this preferential rule must be applied to each distinct constraint topology within the ensemble. Finally, the partition function is obtained by summing over all accessible constraint topologies within the ensemble, and there are various ways to solve this. This algorithm is set forth in part in U.S. Pat. No. 6,014,449 (Jacobs et al.) which is incorporated by reference in its entirety.
The DCM was solved using an exact transfer matrix method for the alpha helix to coil transition [19], but this method is not general. The DCM was also solved using Maxwell constraint counting, which is formally a Mean-Field Approximation (MFA) applied to the density of constraints within a network [22]. For proteins, a minimal DCM (mDCM) was solved using a MFA combined with Monte Carlo sampling [20]. This hybrid method exhibits a good balance between calculation speed and accuracy. In this method, the occupation probability of a bar is pre-calculated within a MFA using Lagrange multipliers that enforce certain numbers of native torsion constraints, Nnt, and hydrogen bonds (H-bonds), Nhb, within the protein. Given a set of occupation probabilities, {pb}, for all bars in the network, Monte Carlo sampling is employed, which requires playing the pebble game to place the bars preferentially in the network per Monte Carlo move. A pseudo uniform random number generator on the interval [0,1] is used to determine if a bar is present or not by comparing the random numbers to the occupation probabilities. By repeating the process of building up a constraint network hundreds of times, an ensemble of constraint topologies is generated for different macrostates characterized by (Nnt, Nhb). For each macrostate, average enthalpies and entropies are calculated. Note that the probability for a bar to be independent depends on occupation probabilities of all bars in the network, {pb}. Within this MFA, the set of bar occupation probabilities, {pb}, are assumed not to depend on whether these bars are independent or redundant.
The mDCM is a crude model because it treats all residues in the same way, meaning all residues are considered to have identical properties irrespective of chemical composition and location in the protein. To capture essential effects of network rigidity from the hydrogen bond network, the mDCM employs three empirical parameters determined by fitting the DCM predictions for excess heat capacity against experimentally measured heat capacity. This fitting procedure works well, and the mDCM proves to be the only all-atom model that can successfully reproduce measured heat capacity curves for protein folding while providing a QSFR analysis [23,24,25,26,27]. Although the mDCM has established the fundamental premise that non-additivity of entropy is intrinsically related to network rigidity, predictions are limited in utility and are untrustworthy due to its oversimplifications. Presumably, at the expense of computational efficiency, significant flaws and drawbacks can be eliminated as the DCM becomes more sophisticated. Compared to the state of current art in the field, the unconventional paradigm employed in the DCM for simultaneously calculating emergent thermodynamic and mechanical properties shows much promise.
It appears one can choose between two options. Construct a simple model and solve it well with good coverage in conformation space, but then accept inaccurate predictions due to errors caused by underlying flawed assumptions. Alternatively, apply an accurate model that cannot be solved without simulation, and make inaccurate predictions because not enough conformation space is explored. This invention provides the means for a third option. A computer-implemented system within a DCM is described that removes oversimplifications related to free energy decomposition by; a) improving detailed all-atom representation of intramolecular interactions, b) includes solvent interactions, c) strain energy, d) effect of vibrations, e) more accurately calculates the partition function, and f) does calculations faster than previously possible.
In terms of prior patent publications related to the invention herein, European Patent 1,025,521 (Freire 2007) uses the free energy minima for identifying target binding sites on a molecule of interest and methods for designing ligands which bind to a molecule of interest. Binding targets in the protein are identified and classified according to their optimal affinities. In calculating the Gibbs free energy, a computer program accepts the three-dimensional coordinates of each of the atoms. The computer program then determines the difference between the Gibbs free energy of the complex of the ligand and the macromolecule and the uncomplexed ligand and the uncomplexed macromolecule. The Gibbs energies of all the partially folded states are calculated using the structural parameterization of the energetics in conjunction with a set of structural parameters optimized for the unfolded states. The partially folded state with the highest probability is the one with the lowest Gibbs energy.