Methods and systems described in this patent generally relate to methods and systems for identifying candidate molecules that may bind to a target molecule. Methods described in this patent may be implemented in a computer system and may be embodied as computer programs in an article of manufacture. The methods and systems described in this patent also relate to methods and systems for determining from a set of candidate molecules those molecules that may bind to a target molecule.
Drug discovery may be conceptually understood in two steps: 1) target identification and 2) lead identification and optimization. In the first step of target identification, a large molecule, which may be but is not limited to a cell-surface receptor or intra-cellular protein, referred to as the target, may be identified with a particular biological pathway or structure of interest. Once a potential target has been identified, it must be screened against a large number of small molecules to determine whether the target is drugablexe2x80x94i.e. whether any small molecules can be found to bind or interact with the target. Lead identification is very expensive and time consuming, often involving experimentally screening a target against many millions of small molecules to determine binding affinities. Once one or more drug leads to a target have been identified, it is still often necessary to optimize a lead in order to improve its pharmacological efficacy. In this step, referred to as lead optimization, synthetic chemists chemically modify the lead in order to increase or decrease its binding to the target, modify its susceptibility to degradative pathways, or modify its pharmacokinetics.
The current combinatorial methods of lead identification and to a lesser extent, lead optimization, could be improved if rationalized. Rational lead optimization, as used herein, refers to using theoretical methods to determine a set of likely lead candidates to a given target before experimentally screening lead-target binding. Rational lead optimization offers the possibility of reducing drug discovery time and expenses by reducing the number of potential lead-target screens.
Early methods of rational lead optimization sought to find a set of potential leads from a database of small molecules chemically similar to a known lead. The earliest of these techniques was performed by the synthetic chemist in the process of producing the next substance to test. The chemist would make a set of small changes to the structure and determine if those changes had a beneficial or detrimental effect on the efficacy. This process, called analog synthesis, is effective, although time-consuming and expensive. Analog synthesis is the basis for medicinal chemistry, and remains an important part of drug discovery today.
Early attempts at using computers to make the lead optimization process more efficient involved determining chemical similarity between a potential lead and a known lead using graph-theoretical treatments. Methods of this type include searching a database of compounds for those that contain the same core structure as the lead compoundxe2x80x94called substructure searching or two-dimensional (2-D) searchingxe2x80x94and searching for compounds that are generally similar based on the presence of a large number of common fragments between the potential lead expansion compound and the lead compoundxe2x80x94called 2-D similarity searching. These techniques are somewhat effective, but are limited to finding compounds that are obviously similar to the lead, thus affording the medicinal chemist few new insights for directing the synthesis project.
Another technique using computers makes use of the knowledge of how the small molecules bind to the large molecule. There are often special groups called pharmacophore groups that are responsible for most of the stabilization energy of the complex. The three-dimensional (3-D) searching techniques try to find from a large database of potential lead compounds those that have essentially the same geometrical arrangement of pharmacophore groups as the lead compound. The 3-D arrangement of the pharmacophore groups required for interaction with the large molecule is called a pharmacophore model or 3-D query. Compounds that are found to contain the correct arrangement are called xe2x80x9chits,xe2x80x9d and are candidates for screening. These hits differ from the substructure or 2-D similarity hits in that the backbone of the structure may be quite different from that of the original lead compound, and often represents an important new area of chemistry to be explored.
The simplest of the 3-D searching techniques compares the position and arrangement of the pharmacophore groups of 3-D structures as stored in the database. This is referred to as static 3-D searching. A fairly obvious shortcoming to determining chemical similarity from a database of small molecules of fixed geometry is that molecules often do not have rigid structures, but rather have a large number of accessible conformations formed from rotating the molecular framework of a molecule about its single bonds. These bonds that can easily be rotated are referred to as rotatable bonds. As used herein, molecular configurations defined by rotations about the dihedral angles of its rotatable bonds will be referred to as rotomers. Small molecules in pharmaceutical database typically contain an average of six to eight rotatable bonds per molecule. This can easily afford a set of accessible rotomers that number in the millions. Searching just one static conformation (rotomer) from among millions of possible rotomers for the correct arrangement of groups will cause many compounds that could be good leads to be missed.
In order to consider energetically accessible rotomers, early small molecule databases often contained a small subset of the accessible rotomers of each small molecule. Herein, this technique is called multi-conformational 3-D searching. However, attempting to define individually each rotomer for a given small molecule quickly becomes impractical as the number of atoms in a given small molecule increases. Accordingly, other early methods of searching small molecule databases for potential target leads attempted to define a conformation space for each small molecule. See U.S. Pat. No. 5,752,019.
A further extension of 3-D search technology involves investigation of the accessible conformational space of the potential hits as part of the searching process. These techniques, called conformationally flexible 3-D searching techniques, adjust the conformation of the potential hit according to the requirements of the 3-D pharmacophore query. One such technique that has been found to be effective uses a method called xe2x80x9cDirected Tweakxe2x80x9d. See Hurst, T. Flexible 3D Searching: The Directed Tweak Technique, 34, J. Chem. Inf. Comput. Sci. 190 (1994). This method is very effective for finding molecules of interest when the geometry of the binding site of the large molecule is not known. Directed Tweak adjusts the conformation of the small molecule by changing the angle values of the rotatable bonds. This method therefore ignores changes in conformation because of bond stretching and bond bending. Bond stretching vibrations for molecules near room temperature typically change the length of a bond by about 0.05 Angstroms (xc3x85). Bond bending between three connected atoms typically moves one of the atoms by about 0.1 xc3x85. Rotation about rotatable bonds often moves atoms by several xc3x85s or tens of xc3x85s. Thus, adjusting only the rotatable bond values includes essentially all of the accessible conformational flexibility of a small molecule.
When the binding site of the target protein is known, another way to identity potential leads is by docking potential leads into the binding site. Docking approaches can be classified based on how they characterize the ligand-binding site of the protein. Grid-search techniques fill the space around the binding site with a 3-D grid, precompute the potentials (van de Waals, electrostatic, etc.) at each grid point without a ligand, then sample different ligand conformations and orientations on the grid and compute the resulting binding energy.
Early attempts of rational lead optimization modeled small molecule docking to a target using molecular mechanics force field methodologies. Molecular mechanics methodologies (also called force field methodologies) attempt to model the short range and long range forces between a target and a small molecule using field representations. U.S. Pat. No. 5,866,343, treats the target and small molecule using separate potential energy fields. Each potential energy field comprises a 1/R long range electrostatic contribution and a short range 6-12 Lennard-Jones contribution. The interaction energy between a target and a small molecule is then calculated for the position of the small molecule relative to the large one. The position is then adjusted iteratively so as to give lower and lower energies. This continues until a low energy arrangement is found. Another example is AutoDock (Morris, G. M., Goodsell, D. S., Halliday, R. S., Huey, R., Hart, W. E., Belew, R. K. and Olson, A. J. xe2x80x9cAutomated Docking Using a Lamarckian Genetic Algorithm and Empirical Binding Free Energy Functionxe2x80x9d, J. Computational Chemistry, 19(1998), 1639-1662.) which uses simulated annealing in its previous releases, but now applies a hybrid genetic algorithm to sample over the feasible binding nodes of the ligand relative to the protein. The advantages of grid-based docking are that a template of favorable interactions in the ligand-binding site does not need to be defined, thus reducing bias in modeling the protein-ligand interactions, and evaluation of binding modes is made more efficient by precomputing the potential energy that results from interaction with the protein at each point on the grid. However, the accuracy and timing of this approach depends on the grid fineness, making this approach too computationally intensive for database screening, in which thousands of molecules (as well as ligand orientations and shapes, or conformers) need to be tested. Furthermore, precomputation of the protein grid potentials limits this approach to rigid binding sites.
When a ligand-binding site in a protein is known, a grid template may be used by constructing a template for ligand binding based upon favorable interaction points in the binding site. During the search for a favorable ligand-binding mode, different conformations of the ligand can be generated and subsets of its atoms matched to complementary template points as a basis for docking the ligand into the binding site. One of the principal advantages to a template-based approach is that a template can incorporate known features of ligand binding (for example, conserved interactions observed experimentally for known ligands). Thus, the docking search space is reduced to matching N ligand atoms onto N template points, rather than the six-dimensional orientational search space (three degrees of rotational freedom and three degrees of translational freedom) required in other approaches for sampling and evaluating ligand binding.
Another well-known docking tool is DOCK (Schoichet, B. K., Bodian, D. L., and Kuntz, I. D., J. Comput. Chem. 13 (1992) 380). In this program the template typically consists of up to 100 spheres that generate a negative image of the binding site. During the search, subsets of ligand atoms are matched to spheres, based on the distances between ligand atoms. DOCK has been extended to consider chemistry and include hydrogen-bonding interaction centers in addition to the shape template. Other current template approaches specify a set of interaction points defining favorable positions for placing polar ligand atoms or hydrophobic (nonpolar) centers, e.g., aromatic rings. Such a template can be generated automatically, e.g., by placing probe points on the solvent accessible surface of the binding site, or interactively by superimposing known protein-ligand complexes to identify favorable interaction points based on observed binding modes for known ligands. Another docking program, FlexX (licensed by Tripos Inc.xe2x80x941699 South Hanley Road, St. Louis, Mo. 63144-2913, phone: 314-647-1099) uses a template of 400 to 800 points when docking small molecules (up to 40 atoms, not including hydrogens) to define positions for favorable interactions of hydrogen-bond donors and acceptors, metal ions, aromatic rings, and methyl groups. The ligand is fragmented and incrementally reconstructed in the binding site and matched to template points based on geometric hashing (indexing) techniques, bond torsional flexibility is modeled discretely, and a tree-search algorithm is used to keep the most promising partially constructed ligand conformations during the search. Hammerhead (Welch W, Ruppert J, Jain AN. 1996. Hammerhead: Fast, fully automated docking of flexible ligands into protein binding sites. Chem Biol 3(1996), 449-462.) uses up to 300 hydrogen-bond donor and acceptor and steric (van der Waals interaction) points to define the template, and the ligand is incrementally constructed, as in FlexX. A fragment is docked based on matching ligand atoms and template points with compatible internal distances, similar to the matching algorithm used in DOCK. If a new fragment is positioned closely enough to the partially constructed ligand, the two parts are merged, and the most promising placements kept.
Other successful docking approaches, such as GOLD (G. Jones, P. Willett, R. C. Glen, A. R. Leach and R. Taylor, J. Mol. Biol 267 (1997) 727-74.), and the method of Oshiro et al. (Oshiro C, Kuntz I, Dixon J. Flexible ligand docking using a genetic algorithm. J Comput Aided Mol Des 1995;9:113-130.) use genetic algorithms to sample over possible matches of conformationally flexible ligands to the template. GOLD uses a template based on hydrogen-bond donors and acceptors of the protein and applies a genetic algorithm to sample over all possible combinations of intermolecular hydrogen bonds and ligand conformations. However, a drawback of genetic algorithm approaches, including AutoDock, is the high computation time, especially in comparison to fragment-based docking approaches.
The UNITY 3-D searching system (licensed by Tripos Inc.xe2x80x941699 South Hanley Road, St. Louis, Mo. 63144-2913, phone: 314-647-1099) has been extended to provide what is essentially a docking tool. In this approach, six parameters corresponding to the six rotational/translational degrees of freedom are added to the rotatable bond list, and these parameters are adjusted to place pharmacophoric groups at the positions giving favorable interactions with the receptor. This method produces acceptable accuracy, but is time consuming because the derivatives needed for the minimization must be calculated numerically. Each derivative calculation requires three or more small adjustments to the geometry of the small molecule.
Another current docking method, SPECITOPE (Volker Schnecke, Craig A. Swanson, Elizabeth D. Getzoff, John A. Tainer, and Leslie A. Kuhn, Screening a Peptidyl Database for Potential Ligands to Proteins with Side-chain Flexibility Proteins: Structure, Function, and Genetics, Vol. 33, No. 1, 1998, 74-87) combines grid methods with adaptive geometry techniques in order to model protein side chain flexibility. SPECITOPE uses a binding site template to limit the orientational search for each prospective ligand and uses distance geometry techniques to avoid the computationally intensive fitting of infeasible ligands into a binding site. The speed gained by distance geometry methods allows the modeling of protein side chain flexibility during docking.
The results of fast docking tools to predict ligand binding modes a priori (i.e. for cases where there is no ligand for which the binding location relative to the protein is known) are less reliable (Dixon, J. S., xe2x80x9cEvaluation of the CASP2 Dockingxe2x80x9d, Proteins: Structure, Function, and Genetics, Supplement-1 (1997), 198-204.). Today, most docking tools model full ligand flexibility, but at least in the faster approaches the binding site is kept rigid. Limited protein side-chain flexibility is exploited by GOLD, which considers rotational flexibility of side chains (Jones et al., J. Mol. Biol., 245, 43-53, 1995), while other approaches use rotomer libraries (Leach, A. R. 1994. Ligand docking to proteins with discrete side-chain flexibility. Journal of Molecular Biology 235:345-356); Jackson, R. M., Gabb, H. A. and Sternberg, M. J. E. 1998. Rapid refinement of protein interfaces incorporating solvation: Application to the docking problem. Journal of Molecular Biology 276:265-285.) Others still are based on molecular dynamics simulations (Wasserman Z R, Hodge C N. 1996. Fitting an inhibitor into the active site of thermolysin: A molecular dynamics study. Proteins 24:227-237.).
Another consideration that influences the accuracy of docking simulations is the solvation of the binding site (Ladbury, J. E. 1996. Just add water! The effect of water on the specificity of protein-ligand binding sites and its potential application to drug design. Chemistry and Biology 3:973-980.). Water bound in the ligand site is known to be a critical determinant of ligand specificity for HIV-1 protease (Lam, P. Y. S.; Jadhav, P. K.; Eyermann, C. J.; Hodge, C. N.; Ru, Y.; Bacheler, L. T.; Meek, J. L.; Otto, M. J.; Rayner, M. M.; Wong, Y. N.; Chang, C. H.; Weber, P. C.; Jackson, D. A.; Sharpe, T. R.; and Erickson-Viitanen, S. 1994. Rational design of potent, bioavailable, nonpeptide cyclic ureas as HIV protease inhibitors. Science 263:380-384.) cholera toxin (Merritt, E. A.; Sarfaty, S.; van den Akker, F.; L""Hoir, C.; Martial, J. A.; and Hol, W. G. 1994, Crystal structure of cholera toxin B-pentamer bound to receptor GM1 pentasaccharide. Protein Science 3(2):166-175), and other proteins, and is a ubiquitous component in molecular recognition (Raymer, M. L.; Sanschagrin, P. C.; Punch, W. F.; Venkataraman, S.; Goodman, E. D.; and Kuhn, L. A. 1997. Predicting conserved water-mediated and polar Ligand interactions in proteins using a k-nearest-neighbors genetic algorithm. Journal of Molecular Biology 265:445-464). For the docking tool FlexX, a technique called the particle concept has been recently proposed, which adds water molecules at favorable positions during the incremental construction of the ligand in the binding site (Rarey, M.; Kramer, B.; and Lengauer, T. 1999. The particle concept: Placing discrete water molecules during protein-ligand docking predictions. Proteins: Structure, Function, and Genetics 34:17-28.). This method was tested for 200 protein-ligand complexes and the accuracy of the predicted binding modes increased for several cases, including HIV protease. A different approach has been introduced for database screening using DOCK: here, the molecules in the database are solvated, which improves the ranking for known ligands and filters out molecules with inappropriate charge states and sizes in comparison to screening without solvation (Shiochet, B. K.; Leach, A. R.; and Kuntz, I. D. 1999. Ligand solvation in molecular docking. Proteins: Structure, Function, and Genetics 34:4-16.).
Most docking programs are still too computationally intensive to be used for practical small molecule databases. If a docking program takes approximately one minute per compound, which is about the lower limit used by the fastest docking tools, it would take approximately one week to screen a database of approximately 10,000 compounds. Since it is more efficacious to screen as diverse a set of molecules as possible, computational screening methods preferably should be effective on databases containing millions of compounds. Hence, there is a need for a computational screening method which can effectively screen a database containing a million compounds within a reasonable period of time.
The inventor has identified new methods for identifying molecules that may bind to a target molecule, and methods described in this patent may be implemented in a computer. In this section we summarize various aspects of these methods and below in the Detailed Description Section we present a more comprehensive description of the methods and their implementation and uses.
One of the methods described in this patent is a method for determining whether a candidate molecule may be docked to a target molecule comprising the steps of: (a) determining a pseudo-energy function, E, between a target molecule and a candidate molecule for one or more of the possible positions Pi of each atom, i, comprising the candidate molecule; (b) determining a function for the partial derivatives of E with respect to Pi for each atom i; (c) determining an initial set of geometric parameters, Qj, describing the conformation of the candidate molecule; (d) determining partial derivatives of Pi with respect to one or more of the Qj; (e) determining the dot product of             ∂      E              ∂              P        i              ·            ∂              P        i                    ∂              Q        j            
for one or more of the Qj; (f) summing each dot product determined in part e over the i atoms comprising the candidate molecule, thereby determining partial derivatives of E as a function of the geometric parameters comprising the set Q; (g) calculating a new set of geometric parameters based on the set of geometric parameters and the partial derivative of E as a function of the geometric parameters calculated in step f; (h) calculating a pseudo-energy for the candidate molecule with a conformation described by the new set of geometric parameters; (i) identifying the candidate molecule as docked if the pseudo-energy is less than a threshold energy; and (j) if the pseudo-energy is not less than a threshold energy, repeating steps d through j until a criteria is satisfied. In one version of this method, the criteria used in step j include (i) the difference in pseudo-energy in going from one iteration of steps d through j to the next is less than a cutoff energy, or (ii) the number of iterations of steps d through j exceeds a cutoff number.
In this method, in general, successful docking may be determined if the pseudo-energy reaches a threshold energy level. Typical threshold energy levels range from xe2x88x925.0 kcal/mol to xe2x88x925.0 kcal/mol. The pseudo-energy and its derivatives may be expressed either as a continuous function or as a finite field. When expressed as a finite field, the pseudo-energy and derivatives can be calculated in advance and used for docking each of the molecules in the set of candidate molecules without recalculation. This allows the computationally intense calculation over of the substituent parts or atoms of the target to be performed only once.
In one version of this method, the conformation space of the candidate molecule is represented in a rotatable bond representation. This representation allows efficient calculation of the derivatives of the positions of the substituent parts of the candidate molecule.
Another of the methods described in this patent is a method for identifying from a set of candidate molecules one or more candidate molecules that may bind to a target molecule, where an individual candidate molecule in the set of candidate molecules comprise one of more substituent parts selected from a set of substituent parts and an individual candidate molecule in the set of candidate molecules is described by one or more geometric parameters, the method comprising the steps of (1) for each substituent part in the set of substituent parts, calculating as a function of possible positions of the substituent part a spatial partial derivative of a pseudo-energy with respect to the position of the substituent part; (2) for each candidate molecule in the set of one or more candidate molecules, identifying an initial set of values for the geometric parameters; (3) for each candidate molecule in the set of one or more candidate molecules, determining if there is a set of values of the geometric parameters for which the candidate molecule and target molecule have a pseudo-energy meeting a pre-selected criteria by varying the values of the geometric parameters, where the values of the geometric parameters are varied depending on the partial derivatives of a pseudo-energy with respect to the values of the geometric parameters, and where the derivatives of the pseudo-energy with respect to the values of the geometric parameters is calculated from the spatial partial derivative of the pseudo-energy with respect to the positions of the substituent parts and derivative of the position of a substituent part relative to the target molecule with respect to the geometric parameters; and (4) for each candidate molecule in the set of one or more candidate molecules, if there exists a set of values of the geometric parameters for which the candidate molecule and target molecule have a pseudo-energy meeting a pre-selected criteria, identifying the candidate molecule as a molecule that may bind to the target molecule.
In one version of this method, representing a geometric parameter by the symbol Qj, representing the positions of the substituent parts in a candidate molecule by Pi, and representing the pseudo-energy by E, the derivative of the pseudo-energy with respect to the values of the geometric parameters i.e., ∂E/∂Qj, is given by             ∂      E              ∂              Q        j              =            ∑      i        ⁢          xe2x80x83        ⁢                            ∂          E                          ∂                      P            i                              ·                        ∂                      P            i                                    ∂                      Q            j                              
where the sum over i is over all substituent parts in a given candidate molecule.
Generally, the set of candidate molecules may be of any size. In one version of the methods described in this patent, the methods may be used for analyzing sets of candidate molecules, including but not limited to sets including from 2 to 100,000 members, sets including from 2 to 500,000 members, sets including from 2 to 1,000,000 members, sets including from 2 to 2,000,000 members, sets including from 2 to more than 2,000,000 members, sets including more than 50,000 members, sets including more than 100,000 members, sets including more than 500,000 members, sets including more than 1,000,000 members, and sets including more than 2,000,000 members. In one version of the methods described in this patent, the set of one or more candidate molecules includes two or more candidate molecules.
Generally the geometric parameters used for describing the candidate molecules may be any set of parameters capable of describing the geometry of the candidate molecule and the position of the candidate molecule relative to the target molecule. In one version of the methods described in this patent and described in more detail in the Detailed Description section, the geometric parameters include translational parameters, rotational parameters, and rotatable bond parameters.
Generally, the substituent parts of the candidate molecules may be any portion of the molecules, including but not limited to, individual atoms in the molecules, groups of atoms in the molecules, individual portions of high electron density in the molecules (for example, lone pairs). In one version of the methods described in this patent, the substituent parts are individual atoms or groups of atoms. In another version, the substituent parts are individual atoms.
The methods described in this patent may be implemented in a computer system and may be embodied in a computer readable medium.