Proteins are the building blocks of all living organisms. A single cell may contain many hundreds of proteins to perform various functions such as digesting food, producing energy, regulating chemical reactions, and building other proteins.
Structurally, proteins are linear polymers of amino acids, also referred to as “polypeptides.” There are 20 different naturally occurring amino acids involved in the biological production of proteins. All amino acids contain carbon, hydrogen, oxygen, and nitrogen. Some also contain sulfur. The amino acids are assembled into a polypeptide chain on the ribosome using the codon sequence on mRNA as a template. As shown in FIG. 1, the resulting linear chain 101 forms secondary structures 103 through the formation of hydrogen bonds between amino acids in the chain. Through further interactions among amino acid side groups, these secondary structures 103 then fold into a three-dimensional structure 105. Protein structure is therefore largely specified by the amino acid sequence, but how one particular set of interactions is selected from the many possible sets is not yet fully understood.
FIG. 2 shows a conventional chemical representation of a part of a polypeptide, consisting of three amino acids. Each amino acid (or residue) consists of a common main chain part, containing the atoms N, C, O, Cα, and two hydrogen atoms, and a specific side chain R, which is also called a “pendant group.” The pendant groups (R1, R2 and R3 in FIG. 2) are always attached to the alpha carbon (Cα) atom. Amino acids can be divided into several classes based on size and other physical and chemical properties of their pendant groups. The main classification concerns the hydropathy of the residues, dividing them into hydrophobic residues, which tend not to interact with the solvating water molecules, and hydrophilic residues, which have the ability to form hydrogen bonds with water. The hydrophilic residues can be further divided into charged residues, which have a net electric charge (either positive or negative), and polar residues, which have no net charge but have a non-uniform charge distribution.
The amino acids are joined through the peptide bond, i.e., the planar CO-NH group. The planar peptide bond may be represented as depicted in FIG. 3. Because the O═C and the C—N atoms lie in a relatively rigid plane, free rotation does not occur about these axes. Hence, a plane 107 or 107′, schematically depicted in FIG. 3 by the dotted lines and sometimes referred to as an “amide plane” or “peptide plane,” is formed, wherein lie the oxygen (O), carbon (C), nitrogen (N), and hydrogen (H) atoms of a given amino acid or residue. At opposite corners of this amide plane 107 are located the alpha-carbon (Cα) atoms, which serve as swivel points or centers for a polypeptide chain. The two dihedral angles, φ and ψ, on each side of the Cα atom are the main degrees of freedom in forming the three-dimensional trace of the polypeptide chain. Due to steric restrictions, these angles can have values only in specific domains in the φ-ψ space. The pendant groups R branch out of the main chain from the Cα atom. These pendant groups, ranging in size from one to 18 atoms, have additional degrees of freedom, called χi angles, which enable them to adjust their local conformation to their environment.
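By way of illustration only, a dihedral angle such as φ or ψ can be computed from the Cartesian coordinates of the four consecutive atoms that define it. The following is a minimal sketch, not part of the original disclosure; the function name `dihedral` and the test coordinates are hypothetical, and the sign convention (positive for a clockwise rotation when looking along the central bond) is an assumption of this sketch.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Return the signed dihedral angle, in degrees, defined by four points.

    The angle is measured about the central bond p1 -> p2.
    Assumption of this sketch: the result is positive for a clockwise
    rotation when looking from p1 toward p2.
    """
    def sub(a, b):
        return tuple(x - y for x, y in zip(a, b))

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    b0 = sub(p1, p0)          # first bond vector
    b1 = sub(p2, p1)          # central bond vector (rotation axis)
    b2 = sub(p3, p2)          # last bond vector
    n1 = cross(b0, b1)        # normal to the plane of p0, p1, p2
    n2 = cross(b1, b2)        # normal to the plane of p1, p2, p3
    b1_len = math.sqrt(dot(b1, b1))
    b1_hat = tuple(c / b1_len for c in b1)
    x = dot(n1, n2)
    y = dot(b1_hat, cross(n1, n2))
    return math.degrees(math.atan2(y, x))

# Hypothetical coordinates: an eclipsed (cis) arrangement and a perpendicular one.
cis_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
perpendicular_angle = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```

For a backbone, φ would be obtained by passing the coordinates of the atoms C, N, Cα, C of consecutive residues, and ψ by passing N, Cα, C, N.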
Thus, a polypeptide structure bends, folds or flexes at each Cα atom swivel point. In a particular environment, and depending upon the particular side chains that may be attached to the polypeptide, some of these bends or folds may be stable, i.e., the φ and ψ angles will not change. In many environments, however, the φ and ψ angles will not be stable, and the polypeptide chain will dynamically fold and bend as it is subjected to external or internal forces. Such forces may originate from numerous sources, such as ions or molecules in the medium within which the polypeptide is located (external forces) that either attract or repel a given atom or group of atoms within the polypeptide. Often, however, these forces originate from within the polypeptide itself, or within one of its pendant groups, as the chain folds back on itself and one residue or pendant group of the polypeptide comes in close proximity to another residue or pendant group of the polypeptide.
In general, just as a flexible rope can assume an infinite number of shapes, a polypeptide chain can conceptually also assume an infinite number of shapes. Many of the possible shapes, however, are unstable, because the internal and external molecular attraction and/or repulsion forces will not permit such shapes to persist. These forces act to move or change the polypeptide conformation away from unstable conformations toward a stable conformation. A stable conformation is one where the internal and external molecular attraction and/or repulsion forces fail to destabilize or push the existing conformation toward another conformation.
Most polypeptide structures exhibit several conformations that are stable, some more so than others. The most stable conformations are the most probable. A conformation may change from one stable conformation to another through the application of sufficient energy to cause the change. Given the opportunity to freely move, fold and/or bend, a given polypeptide chain will eventually assume a stable conformation. The most probable conformation that is assumed is the one that would take the most energy to undo or, in other words, the conformation that has the lowest free energy.
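The relation between stability and probability described above is commonly quantified with Boltzmann weighting, in which the probability of a conformation is proportional to exp(−G/RT). The following minimal sketch is offered for illustration only; the use of Boltzmann statistics, the function name `boltzmann_populations`, and the three free-energy values are assumptions and do not come from the original text.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)

def boltzmann_populations(free_energies, temperature=298.0):
    """Return equilibrium probabilities for conformations.

    free_energies: free energies in kcal/mol (hypothetical values).
    temperature: absolute temperature in kelvin.
    """
    rt = GAS_CONSTANT * temperature
    g_min = min(free_energies)  # shift by the minimum for numerical stability
    weights = [math.exp(-(g - g_min) / rt) for g in free_energies]
    total = sum(weights)
    return [w / total for w in weights]

# Three hypothetical stable conformations, 0, 1 and 3 kcal/mol above the minimum.
populations = boltzmann_populations([0.0, 1.0, 3.0])
```

Under these assumptions, the conformation with the lowest free energy receives the largest probability, and higher-energy conformations become progressively less probable.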
Currently, there are two experimental methods to determine the three-dimensional structure of a protein. The first method is X-ray crystallography. The protein has to be first isolated, highly purified, and incubated under certain conditions to form a crystal. The protein is then exposed to X-ray radiation and the pattern of reflections is recorded. From these reflections, it is possible to deduce the actual three-dimensional electron density of the protein and thus to solve its structure. The second method is nuclear magnetic resonance (NMR), which is currently applicable only to small proteins. The underlying principle is that, by exciting one nucleus and measuring the coupling effect on a neighboring nucleus, one can estimate the distance between these nuclei. A series of such measured pairwise distances is used to reconstruct the protein structure. Both methods are inherently time-consuming and labor-intensive.
With the rapid progress in obtaining genetic information from humans and other organisms, the primary sequences of a large number of proteins have been determined. However, to effectively utilize the ever-expanding database of primary sequence information, it is necessary to predict the three-dimensional structure of a protein based on its primary sequence. For example, in order to design a drug to block an active site in a receptor protein, one has to simulate the interaction of the drug with the amino acid residues in the active site. Such drug design is possible only when the three-dimensional structure of the protein is determined.
A number of computational approaches have been developed to calculate the three-dimensional structure based on the assumption that the native structure of a protein has the lowest free energy among all the possible conformations of the chain. Two principal methods are currently in use, the molecular dynamics method and the Monte Carlo method.
In the molecular dynamics method, an all-atom description is usually used. Forces acting on each atom at a particular state of the system are calculated using an empirical force field. Atoms are then allowed to move with the accelerations resulting from forces according to Newton's second law. Once the atoms have moved far enough for the forces to have changed significantly, the forces are recalculated and new accelerations applied. In practice, forces have to be recalculated approximately every 10⁻¹⁵ second. Even with powerful supercomputers only very short time periods can be simulated, much shorter than the actual folding process. Hence, this method can currently be used only to describe some sub-folding events (e.g., the initiation of the process, or re-folding after a slight perturbation) rather than the whole process.
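The cycle described above (compute forces, move the atoms under the resulting accelerations, recompute forces) can be sketched for a single particle in one dimension. This is an illustrative sketch only: the velocity Verlet integrator, the harmonic toy force standing in for an empirical force field, and all names and parameter values are assumptions not taken from the original text.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's second law with the velocity Verlet scheme.

    x, v: initial position and velocity of one particle (1-D toy system).
    force: function returning the force at a given position.
    dt: time step; the force is recalculated once per step.
    Returns the final position and velocity.
    """
    f = force(x)
    for _ in range(n_steps):
        a = f / mass
        x = x + v * dt + 0.5 * a * dt * dt      # move under current acceleration
        f_new = force(x)                        # recalculate the force
        v = v + 0.5 * (a + f_new / mass) * dt   # average old and new acceleration
        f = f_new
    return x, v

def harmonic_force(x, k=1.0):
    """Hypothetical stand-in for a force field: a spring with stiffness k."""
    return -k * x

# Toy run in arbitrary units: 1000 steps of size 0.01.
x_final, v_final = velocity_verlet(1.0, 0.0, harmonic_force, 1.0, 0.01, 1000)
total_energy = 0.5 * v_final ** 2 + 0.5 * x_final ** 2
```

Because the force must be re-evaluated at every step, the cost of a simulation grows in proportion to the number of steps, which is why a femtosecond-scale time step limits the time span that can be simulated.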
The Monte Carlo method is usually used with simplified models. The procedure starts with an initial conformation and makes a random “move” to another conformation. The energy of the new conformation is compared with the energy of the old one. If the new conformation is better, meaning it has a lower free energy, the new conformation replaces the old one. If the new conformation has a higher energy, it is subject to a non-deterministic decision based on the amount of energy gained, such that a larger energy gain makes acceptance less likely. If the new conformation is not accepted, the old conformation is retained. The current conformation is then subject to another random change and the procedure iterates. Monte Carlo methods have been applied in many protein studies for different tasks with different levels of success. Yet, as a search method, even on simple models, the method is not powerful enough in most cases to find the lowest free energy conformation starting from a random conformation.
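The acceptance procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the acceptance probability exp(−ΔE/kT) is the commonly used Metropolis criterion, which the original text does not name, and the function names, the one-dimensional toy energy, and the parameter values are hypothetical.

```python
import math
import random

def metropolis_accept(e_old, e_new, kt, rng):
    """Decide whether to accept a move from energy e_old to e_new.

    Downhill moves are always accepted; uphill moves are accepted with
    probability exp(-(e_new - e_old) / kt), so larger gains are less likely.
    """
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kt)

def monte_carlo_search(initial, energy, propose, kt, n_steps, seed=0):
    """Run the iterate-and-accept loop and return the best state found."""
    rng = random.Random(seed)
    current, e_current = initial, energy(initial)
    best, e_best = current, e_current
    for _ in range(n_steps):
        candidate = propose(current, rng)       # random "move"
        e_candidate = energy(candidate)
        if metropolis_accept(e_current, e_candidate, kt, rng):
            current, e_current = candidate, e_candidate
            if e_current < e_best:
                best, e_best = current, e_current
        # otherwise the old conformation is retained
    return best, e_best

# Hypothetical toy problem: an integer "conformation" with a single energy well.
def toy_energy(state):
    return (state - 3) ** 2

def toy_move(state, rng):
    return state + rng.choice((-1, 1))

best_state, best_energy = monte_carlo_search(0, toy_energy, toy_move, 0.5, 2000)
```

In a protein application, `energy` would evaluate a conformation of the chain and `propose` would perturb it, for example by changing one dihedral angle or one lattice move.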
Based on the Monte Carlo principle, Lau and Dill developed a simplified two-dimensional square lattice model for protein folding (Lau K. F. and Dill K. A. Proc. Natl. Acad. Sci. USA 87:638–642, 1990). Unger and Moult have described the application of a genetic algorithm to discover the minimum-energy conformation for such a lattice-constrained protein (Unger R. and Moult J. Proceedings of the Fifth International Conference on Genetic Algorithms, Forrest S. ed., pp. 581–588, 1993). However, the problem is of the nondeterministic polynomial (NP) type, which means that no algorithm is known that finds the minimum-energy conformation in polynomial time, so that an exact solution is computationally infeasible for even the fastest computers. Additionally, the prior solutions are software-based emulations, and thus have been very time-consuming when implemented on conventional computers.
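For illustration, the energy of a conformation on a two-dimensional square lattice can be evaluated as sketched below. The sketch assumes the hydrophobic-polar (HP) scoring commonly associated with such lattice models, in which each pair of hydrophobic (H) residues that are lattice neighbors but not adjacent in the chain contributes −1; this scoring rule, the move encoding (U, D, L, R), and the function names are assumptions not stated in the original text.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def fold_coordinates(moves):
    """Convert a string of lattice moves into a list of (x, y) positions."""
    x, y = 0, 0
    coords = [(x, y)]
    for move in moves:
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords

def hp_energy(sequence, moves):
    """Return the HP energy of a fold, or None if the chain overlaps itself.

    sequence: string of 'H' (hydrophobic) and 'P' (polar) residues.
    moves: string of len(sequence) - 1 lattice moves.
    """
    coords = fold_coordinates(moves)
    if len(coords) != len(sequence):
        raise ValueError("need exactly one move per bond")
    if len(set(coords)) != len(coords):
        return None  # two residues occupy the same lattice site
    index_at = {site: i for i, site in enumerate(coords)}
    contacts = 0
    for i, (x, y) in enumerate(coords):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each pair is counted once.
        for dx, dy in ((1, 0), (0, 1)):
            j = index_at.get((x + dx, y + dy))
            if j is not None and abs(i - j) > 1 and sequence[j] == "H":
                contacts += 1
    return -contacts

# Hypothetical four-residue chain: folded into a square versus fully extended.
folded_energy = hp_energy("HPPH", "URD")
extended_energy = hp_energy("HPPH", "RRR")
```

Because each bond of a chain can point in up to three new directions, the number of candidate folds grows exponentially with chain length, which is why exhaustive evaluation of every fold quickly becomes impractical.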