Computational protein design is an extremely compute intensive (NP-hard) search problem [1]. Current methods search for a global minimum in energy, not free energy, which are generally pairwise decomposable and (relative to free energies) quick to compute. Myriad conformational search algorithms (i.e. Dead-End Elimination, A*, genetic algorithms, simulated annealing, Monte Carlo sampling, etc.) are used to find the conformation with lowest energy [2]. In the past decade, the science of protein design has made tremendous strides, and several experimental investigations have confirmed the usefulness of the computational predictions [3-6]. For a complete history of the field, see the reviews by Tuchscherer et al. [7], Street and Mayo [8], Bornscheuer and Pohl [9], Pokala and Handel [10], Park et al. [11], and Cozzetto et al. [12]. Recently, a new protein design paradigm, implemented in the software RosettaDesign, has been introduced [13]. Unlike the previous conformational search algorithms, RosettaDesign is based on combined threading/refinement process that also identifies experimentally validated protein structures [6, 13, 14]. In addition to individual protein structures, it has been demonstrated that the above protein design strategies can be extended to engineer specific protein-protein interactions [13, 15, 16].
In all of the above current approaches, the search algorithms attempt to identify the best sequence/conformation pair that minimizes the energy function for a given problem. While these current approaches have enjoyed many successes, they are still not efficient or accurate enough for the majority of real-world applications. Due to computational cost, no method has ever attempted to design proteins with specific targeted free energy. Calculating free energies is prohibitive because current methods estimate conformational entropy from lengthy simulations that attempt to cover accessible phase space.
In this invention, the conformational entropy contributions are calculated without simulation for a given template structure. As such, free energy landscapes are calculated in computationally tractable timescales based on specified order parameters, as FIGS. 1 and 2 illustrate for a beta-hairpin turn. Here the free energy of a small molecular system is expressed as a function of the number of hydrogen bonds and native torsions present in the structure. Free energy landscapes can be calculated based on user-defined selection of order parameters. Order parameters characterize the general properties of a system. Locating where the lowest free energy regions are in the landscape provides information about stable or metastable thermodynamic states. As thermodynamic and solvent conditions change, the shape of the free energy landscape also changes. Multiple minimums will generally appear, as will saddles that are important for describing transitions between different thermodynamic states. The free energy landscape is perhaps the most important element to this invention because all physical observables that are thermodynamic quantities require knowing the free energy landscape. In addition, kinetic properties can be described based on saddle points from barrier heights between minima within the free energy landscape.
Related to conformational entropy is flexibility. Flexibility characteristics are critical to proper enzyme function [17]; when an enzyme becomes too rigid, key catalytic normal modes are restricted, resulting in reduced functional efficiency, or possibly, complete loss of function [18, 19]. This is the main reason that traditional protein design strategies are unlikely to robustly design functioning enzymes. For example, energetic minima are frequently associated with more rigid conformations. Meaning, enzymes with increased stability are frequently too stable (rigid) to catalyze their associated reaction. Conversely, a more flexible enzyme might catalyze a given reaction faster, but would rarely be in the appropriate thermodynamic state, making the rate increase moot.
Flexibility is marginally considered in currently employed state-of-the-art protein design methods [11]. Current approaches allow protein conformations to adjust during refinement steps using routine energy minimization algorithms, such as steepest descent, conjugate gradient or Monte Carlo sampling; for example, see [6, 20, 21]. However, dynamics that occur on much longer (functional) timescales have not been implemented into any of the currently used design algorithms, although they have been envisioned for sometime [22]. In fact, in a recent protein design review [23], Butterfoss and Kuhlman state: “A second major step is the design of dynamic properties into proteins.” They go on to conclude, “The unique challenges faced in the advanced design of dynamic behavior remain to be described.” Despite consensus on the importance of stability and flexibility for protein function, no computational method currently exists to design a protein-molecule-solvent system based on a priori targeted stability/flexibility profiles for a given solvent and/or thermodynamic condition. Moreover, no computational method exists to design optimal solvent/thermodynamic conditions to obtain desired stability/flexibility profiles for a given protein-molecule system. A protein-molecule system is used here to mean either a single protein, or a protein interacting with one or more molecules, for which the molecule may be a protein (i.e. protein-protein interactions are included here).
Employing a Distance Constraint Model (DCM) [22, 24-30] based on network rigidity, free energy decomposition, and ensemble averaging, this invention uses Quantitative Stability/Flexibility Relationships (QSFR), which provides key insight about enzyme function [31, 32], to guide rational protein design efforts. This invention leverages the DCM's unique ability to calculate QSFR metrics fast, and focuses on the high dimensional QSFR metric that encapsulates relevant information about protein function. This invention will allow design of protein-molecule-solvent systems to respond and function in a desired way by optimizing to target QSFR metrics. In the optimization process, a variety of search methods will be employed to find optimal protein-molecule-solvent systems with targeted QSFR metrics.
QSFR represents a high dimensional body of descriptive data that includes both mechanical and thermodynamic metrics. Common QSFR metrics include: (i.) both global and local thermodynamic quantities, (ii.) descriptions of backbone flexibility, (iii.) mechanical susceptibility, and (iv.) correlation functions that describe intramolecular couplings. We have demonstrated previously that several of the QSFR metrics (especially backbone flexibility) are conserved across a protein family [31, 33, 34], thus leading to conservation of function. For example, we have demonstrated that several QSFR metrics are conserved across an E. coli and T. thermophilus RNase H ortholog pair, which are explained by the stability/flexibility requirements mandated by the enzymes' conserved function. A variety of QSFR quantities that can be calculated are described in FIG. 3. All of these quantities can be calculated for specified thermodynamic states, which are associated with particular portions of the free energy landscape. A common thermodynamic state of interest is the native state of a protein, which is described by a minimum within the free energy landscape that represents many native-like conformations.
Exemplar plots of common QSFR metrics are provided in FIG. 42, which include DCM outputs for global, local, and pairwise outputs for the protein thioredoxin. In panel (a), the free energy landscape is provided for three different temperatures. At T=359K, the folded and unfolded minima are equally probable, indicating that this is the melting temperature. At T=349K, the native state basin is more favourable than the unfolded, whereas at T=369K the opposite is true. In addition, the rigid cluster susceptibility is also plotted (at T=359K). In panel (b), the probability to rotate of thioredoxin is plotted for the same three temperatures as in panel (a). Notice that the protein becomes more flexible at higher temperatures. In panel (c), the RS-RS cooperativity correlation plot is provided. The plot highlights residue pairs that have a high likelihood of being correlated in their susceptibility. For example, the dark region centered at {87, 12} indicates that these regions are very likely to be simultaneously susceptible, whereas the white region at {100, 75} indicates that there is no correlation. The solid white bands at residues 34, 40, 64, 68, and 76, which are all Proline residues, occurs because Proline is locked into a native torsion angle state due to its cyclic sidechain.
In addition to being essential to successful design of proteins with targeted stability/flexibility/function characteristics, long-timescale flexibility is also critical in the computational screening of protein-ligand interactions [35-38]. This is because very few protein receptors are consistent with the classical “lock-and-key” model [38]. As a consequence, myriad different strategies for dealing with flexibility have been introduced in the primary literature, including one based on the same graph theoretic methods employed by the DCM [39]. However, few methodologies deal with the thermodynamic consequences of protein-ligand association, despite that pronounced affects are well documented in the primary literature [40-43]. Similar to other commonly employed protein design strategies, this invention will utilize the myriad virtual ligand screening methodologies currently available. The main reason current methods neglect rigorous thermodynamic descriptions is because methods have not been available to accurately calculate free energies in tractable computing timescales. The main barrier to achieving this goal is non-additivity within free energies [44, 45], which the DCM elegantly resolves. As a consequence of the DCM strategy, it is uniquely able to quantitatively reproduce experimental heat capacity curves [25, 30-32].
The DCM also efficiently models environmental affects such as pH, ionic strength, co-solute concentration, and pressure, which are coupled to the various QSFR metrics. For example, proteins become unstable and will unfold at extreme pH, which, compared to proteins at intermediate pH values, results in drastically different QSFR properties. Specifically, the protein is likely to unfold at non-ideal thermodynamic/solvent conditions, resulting in a less compact and more flexible structure with fewer mechanical couplings between sites. Conversely, solvent conditions that stabilize the protein will generally have the opposite effects.
As described above, there are many protein design strategies that focus on finding minimum energy structures to increase stability. However, stability is a property of free energy, which includes both energetic and entropic contributions. Stability is important for reproducibility in molecular recognition and ensuring that desired molecular conformations are populated sufficiently; however, when considering function, stability is only one aspect of consideration. Protein flexibility is also critical for mediating function. Increasing stability by better contacts (the focus of current protein design) frequently will reduce flexibility, which can result in reduction or complete loss of catalytic efficiency. Despite overwhelming consensus in the biophysics/biochemistry communities regarding the importance of flexibility to function, it is ignored in current protein design efforts (by pragmatic necessity). This invention defines an innovative process for incorporating stability, flexibility and their relationships, which lays the foundation for a new protein design paradigm.
The high dimensional QSFR metric characterizes thermodynamic, mechanical and kinetic properties and their relationships in proteins and other macromolecules. There is no limit to the number of quantities that can be calculated and included in the QSFR metric. For example, quantitative characterization of key motions critical to catalysis (related to catalytic normal modes) can be identified in a computationally tractable way. A QSFR metric translates rigidity and flexibility correlation functions into an elastic network model, from which vibrational normal modes can be calculated. In contrast to all currently employed elastic network models, the normal modes calculated in this invention are dependent upon thermodynamic and solvent conditions, which themselves are dependent, as are all other quantities, on the shape of the free energy landscape. As the thermodynamic state of the protein changes, so does its characteristic intramolecular interactions that appear in the structure. Therefore, mechanical response and thermodynamic response are coupled.
QSFR metrics encapsulate relevant information regarding protein stability and molecular cooperativity important to function. Consequently, QSFR measures can be used to distinguish molecular mechanisms that control stability and function. Through judicious selection of QSFR targets, catalytic and functional specificities can be computationally selected in rational design strategies. Some fundamental QSFR measures that can be currently calculated by a prototype DCM, are listed in FIGS. 3 and 4.
It is instructive to focus on one such QSFR descriptor. Consider heat capacity (Cp). Molecular dynamics (MD) and Monte Carlo (MC) all-atom simulations are unable to accurately predict Cp curves. There is of course the issue that current force field parameterization is simply inaccurate. However, even if using current MD/MC all-atom simulations with the most accurate force-field model available could lead to predictions within 30% error from experimental measurements (so far unrealizable), this success would be a moot point due to extreme computational cost. Conservative estimate suggest it would take far more than 100 continuous years running the simulation in parallel using 100,000 CPUs for a moderate size protein of only 200 residues! However, MD/MC methods are commonly employed on reduced models that are simplified variants of all-atom models, such as Go-like models, which do produce a peak in predicted Cp curves that correspond to the protein folding transition. Using reduced models that treat solvent implicitly is essential to make calculations much faster. For example, Go-like models typically take a few months of CPU time using 60 CPUs to produce a Cp curve for small proteins of about 80 residues. Unfortunately, this speed is still far too slow to be of any practical value for protein design applications. Worse still, a more serious issue is the corresponding loss of accuracy. Predicted Cp curves using Go-like models do not resemble experimental measurements. We are unaware of any results yielding quantitative agreement to experimental data. Fast methods are also a moot point if they are not accurate.
Typically, MD/MC predicted Cp curves using reduced models contain errors that are an order of magnitude or more off when considering key features such as the peak location that defines the melting temperature, the peak height and the width of the Cp curve. In contrast, a DCM prototype reproduces experimental Cp curves typically within a few percent error in a matter of 30 days on a single CPU for a moderate size (200 residue) protein. This prototype DCM has major drawbacks because it has been oversimplified, and we usually refer to it as a minimal DCM. The minimal DCM has three phenomenological fitting parameters in the theory. Fortunately, this limitation is not enough to make the overall approach impractical. Practical protein design can be performed even when the employed model requires a few adjustable parameters. Interestingly, most of the 30 days of CPU time is a result of the fitting procedure to reproduce the Cp curve. If model parameters were transferable, fitting would be eliminated, and the total calculation time becomes less than a few hours on a single CPU to produce a high dimensional QSFR metric using the methods in the minimal DCM. Moreover, nontrivial technical improvements (see U.S. patent application Ser. No. 12/344,192, Jacobs 2008) to the minimal DCM will make statistical sampling methods obsolete, thereby speeding calculations up by an estimated factor of 100 or more, while gaining accuracy. It is envisioned that with the help of a more sophisticated modelling scheme that is calculated faster, in a more accurate fashion, a transferable set of parameters will be successfully constructed. Under these conditions, a host of QSFR properties (Cp is just one of them) should be generated in matter of a few minutes on a single CPU for a moderate size protein (200 residues) on today's computers.
FIG. 5 shows the essential components that are utilized to calculate the DCM for sophisticated models that are both more accurate and can be calculated much faster than the minimal DCM. This invention is based on the ability to provide a Flexibility And Stability Test (FAST) in very fast compute times. FAST provides both thermodynamic and mechanical information, and the relationships that are important to characterize the mechanisms that facilitate function. Given a variety of input structures, QSFR metrics are rapidly generated. New structures are explored in a design setting, ranked based on free energies for stability, and other criteria that characterize flexibility, molecular cooperativity, allosteric mechanisms, where the predicted QSFR response is subject to thermodynamic and solvent conditions.