An explanation of conventional drug discovery processes and their limitations is useful for understanding the present invention.
Discovering a new drug to treat or cure a biological condition is a lengthy and expensive process, typically taking an average of 12 years and $800 million per drug, and in some cases up to 15 years or more and $1 billion to complete.
A goal of a drug discovery process is to identify and characterize a chemical compound or ligand biomolecule that affects the function of one or more other biomolecules (i.e., a drug “target”, usually a biopolymer) in an organism via a potential molecular interaction or combination. Herein the term biopolymer refers to a macromolecule that comprises one or more of a protein, nucleic acid (DNA or RNA), peptide, or nucleotide sequence, or any portions or fragments thereof. Herein the term biomolecule refers to a chemical entity that comprises one or more of a biopolymer, carbohydrate, hormone, or other molecule or chemical compound, whether inorganic or organic, including, but not limited to, synthetic, medicinal, drug-like, or natural compounds, or any portions or fragments thereof.
The target molecule is typically a disease-related target protein or nucleic acid for which it is desired to effect a change in function, structure, and/or chemical activity in order to aid in the treatment of a patient disease or other disorder. In other cases, the target is a biomolecule found in a disease-causing organism, such as a virus, bacterium, or parasite, that when affected by the drug will affect the survival or activity of the infectious organism. In yet other cases, the target is a biomolecule of a defective or harmful cell such as a cancer cell. In yet other cases the target is an antigen or other environmental chemical agent that may induce an allergic reaction or other undesired immunological or biological response.
The ligand is typically a small molecule drug or chemical compound with desired drug-like properties in terms of potency, low toxicity, membrane permeability, solubility, chemical/metabolic stability, etc. In other cases, the ligand may be a biologic, such as an injected protein-based or peptide-based drug, or even another full-fledged protein. In yet other cases the ligand may be a chemical substrate of a target enzyme. The ligand may even be covalently bound to the target or may in fact be a portion of the protein, e.g., a protein secondary structure component, a protein domain containing or near an active site, a protein subunit of an appropriate protein quaternary structure, etc.
Throughout the remainder of the background discussion, unless otherwise specifically differentiated, a (potential) molecular combination will feature one ligand and one target, the ligand and target will be separate chemical entities, and the ligand will be assumed to be a chemical compound while the target will be a biological protein (mutant or wild type). Note that the frequency of nucleic acids (both DNA/RNA) as targets will likely increase in coming years as advances in gene therapy and pathogenic microbiology progress. Also the term “molecular complex” will refer to the bound state between the target and ligand when interacting with one another in the midst of a suitable (often aqueous) environment. A “potential” molecular complex refers to a bound state that may occur albeit with low probability and therefore may or may not actually form under normal conditions.
The drug discovery process itself typically includes four different subprocesses: (1) target validation; (2) lead generation/optimization; (3) preclinical testing; and (4) clinical trials and approval.
Target validation includes determination of one or more targets that have disease relevance and usually takes two-and-a-half years to complete. Results of the target validation phase might include a determination that the presence or action of the target molecule in an organism causes or influences some effect that initiates, exacerbates, or contributes to a disease for which a cure or treatment is sought. In some cases a natural binder or substrate for the target may also be determined via experimental methods.
Lead generation typically involves the identification of lead compounds that can bind to the target molecule and thereby alter the effects of the target through either activation, deactivation, catalysis, or inhibition of the function of the target, in which case the lead would be viewed as a suitable candidate ligand to be used in the drug application process. Lead optimization involves the chemical and structural refinement of lead candidates into drug precursors in order to improve binding affinity to the desired target, increase selectivity, and also to address basic issues of toxicity, solubility, and metabolism. Together lead generation and lead optimization typically take about three years to complete and might result in one or more chemically distinct leads for further consideration.
In preclinical testing, biochemical assays and animal models are used to test the selected leads for various pharmacokinetic factors related to drug absorption, distribution, metabolism, excretion, toxicity, side effects, and required dosages. This preclinical testing takes approximately one year. After the preclinical testing period, clinical trials and approval take another six to eight or more years during which the drug candidates are tested on human subjects for safety and efficacy.
Rational drug design generally uses structural information about drug targets (structure-based) and/or their natural ligands (ligand-based) as a basis for the design of effective lead candidate generation and optimization. Structure-based rational drug design generally utilizes a three-dimensional model of the structure for the target. For target proteins or nucleic acids, such structures may result from X-ray crystallography, NMR, or other measurement procedures, or may result from homology modeling, analysis of protein motifs and conserved domains, and/or computational modeling of protein folding or the nucleic acid equivalent. Model-built structures are often all that is available when considering many membrane-associated target proteins, e.g., GPCRs and ion-channels. The structure of a ligand may be generated in a similar manner or may instead be constructed ab initio from a known 2-D chemical representation using fundamental physics and chemistry principles, provided the ligand is not a biopolymer.
Rational drug design may incorporate the use of any of a number of computational components ranging from computational modeling of target-ligand molecular interactions and combinations to lead optimization to computational prediction of desired drug-like biological properties. The use of computational modeling in the context of rational drug design has been largely motivated by a desire to both reduce the required time and to improve the focus and efficiency of drug research and development, by avoiding often time consuming and costly efforts in biological “wet” lab testing and the like.
Computational modeling of target-ligand molecular combinations in the context of lead generation may involve the large-scale in-silico screening of compound libraries (i.e., library screening), whether the libraries are virtually generated and stored as one or more compound structural databases or constructed via combinatorial chemistry and organic synthesis, using computational methods to rank a selected subset of ligands based on computational prediction of bioactivity (or an equivalent measure) with respect to the intended target molecule.
Throughout the text, the term “binding mode” refers to the 3-D molecular structure of a potential molecular complex in a bound state at or near a minimum of the binding energy (i.e., maximum of the binding affinity), where the term “binding energy” (sometimes interchanged with “binding affinity” or “binding free energy”) refers to the change in free energy of a molecular system upon formation of a potential molecular complex, i.e., the transition from an unbound to a (potential) bound state for the ligand and target. Here the term free energy generally refers to both enthalpic and entropic effects as the result of physical interactions between the constituent atoms and bonds of the molecules between themselves (i.e., both intermolecular and intramolecular interactions) and with their surrounding environment. An example of such a free energy is the Gibbs free energy encountered in the canonical or grand canonical ensembles of equilibrium statistical mechanics.
In general, the optimal binding free energy of a given target-ligand pair directly correlates to the likelihood of formation of a potential molecular complex between the two molecules in chemical equilibrium, though, in truth, the binding free energy describes an ensemble of (putative) complexed structures and not one single binding mode. However, in computational modeling it is usually assumed that the change in free energy is dominated by a single structure corresponding to a minimal energy. This is certainly true for tight binders (Kd ~0.1 to 10 nanomolar) but questionable for weak ones (Kd ~10 to 100 micromolar). The dominating structure is usually taken to be the binding mode. In some cases it may be necessary to consider more than one alternative binding mode when the associated system states are nearly degenerate in terms of energy.
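The affinity ranges quoted above can be related to binding free energies through the standard thermodynamic relation ΔG = RT ln(Kd). The following is a minimal illustrative sketch; the constants, function name, and example Kd values are our own, not part of the original text:

```python
import math

# Illustrative conversion between a dissociation constant Kd and the
# standard binding free energy via dG = R*T*ln(Kd). Constants, units,
# and the example Kd values are assumptions for illustration only.
R = 1.987e-3   # gas constant in kcal/(mol*K)
T = 298.15     # approximate room temperature in K

def binding_free_energy(kd_molar):
    """Standard binding free energy in kcal/mol for Kd given in mol/L."""
    return R * T * math.log(kd_molar)

# A tight binder (Kd = 1 nM) binds far more favorably than a weak
# binder (Kd = 100 uM); more negative means tighter binding:
tight = binding_free_energy(1e-9)     # roughly -12.3 kcal/mol
weak = binding_free_energy(100e-6)    # roughly -5.5 kcal/mol
```

The roughly 7 kcal/mol gap between these two regimes illustrates why a single dominant low-energy structure is a safer assumption for tight binders than for weak ones.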
It is desirable in the drug discovery process to identify quickly and efficiently the optimal docking configurations, i.e., binding modes, of two molecules or parts of molecules. Efficiency is especially relevant in the lead generation and lead optimization stages for a drug discovery pipeline, where it is desirable to accurately predict the binding mode for possibly millions of potential molecular complexes, before submitting promising candidates to further analysis.
Binding modes and binding affinity are of direct interest to drug discovery and rational drug design because they often help indicate how well a potential drug candidate may serve its purpose. Furthermore, where the binding mode is determinable, the action of the drug on the target can be better understood. Such understanding may be useful when, for example, it is desirable to further modify one or more characteristics of the ligand so as to improve its potency (with respect to the target), binding specificity (with respect to other targets), or other chemical and metabolic properties.
A number of laboratory methods exist for measuring or estimating affinity between a target molecule and a ligand. Often the target might be first isolated and then mixed with the ligand in vitro and the molecular interaction assessed experimentally, such as in the myriad biochemical and functional assays associated with high throughput screening. However, such methods are most useful where the target is simple to isolate, the ligand is simple to manufacture, and the molecular interaction easily measured, but are more problematic when the target cannot be easily isolated, isolation interferes with the biological process or disease pathway, the ligand is difficult to synthesize in sufficient quantity, or the particular target or ligand is not well characterized ahead of time. In the latter case, many thousands or millions of experiments might be needed for all possible combinations of the target and ligands, making the use of laboratory methods unfeasible.
A number of attempts have been made to resolve this bottleneck by first using specialized knowledge of various chemical and biological properties of the target (or even related targets such as protein family members), and/or one or more already known natural binders or substrates of the target, to reduce the number of combinations required for lab processing. However, this is still impractical and too expensive in most cases. Instead of actually combining molecules in a laboratory setting and measuring experimental results, another approach is to use computers to simulate or characterize molecular interactions between two or more molecules (i.e., molecular combinations modeled in silico). The use of computational methods to assess molecular combinations and interactions is usually associated with one or more stages of rational drug design, whether structure-based, ligand-based, or both.
The computational prediction of one or more binding modes and/or the computational assessment of the nature of a molecular combination and the likelihood of formation of a potential molecular complex is generally associated with the term “docking” in the art. To date, conventional “docking” methods have included a wide variety of computational techniques as described in the forthcoming section entitled “REFERENCES & PRIOR ART”.
Whatever the choice of computational docking method, there are inherent trade-offs between the computational complexity of both the underlying molecular models and the intrinsic numerical algorithms, and the amount of computing resources (time, number of CPUs, number of simulations) that must be allocated to process each molecular combination. For example, while highly sophisticated molecular dynamics simulations (MD) of the two molecules surrounded by explicit water molecules and evolved over trillions of time steps may lead to higher accuracy in modeling the potential molecular combination, the resultant computational cost (i.e., time and computing power) is so enormous that such simulations are intractable for use with more than just a few molecular combinations.
One major distinction amongst docking methods as applied to computational modeling of molecular combinations is whether the ligand and target structures remain rigid throughout the course of the simulation (i.e., rigid-body docking) vs. the ligand and/or target being allowed to change their molecular conformations (i.e., flexible docking). In general, the latter scenario involves more computational complexity, though flexible docking may often achieve higher accuracy than rigid-body docking when modeling various molecular combinations.
That being said, rigid-body docking can provide valuable insight into the nature of a molecular combination and/or the likelihood of formation of a potential molecular complex and has many potential uses within the context of rational drug discovery. For instance, rigid-body docking may be appropriate for docking small, rigid molecules (or molecular fragments) to a simple protein with a well-defined, nearly rigid active site. As another example, rigid-body docking may also be used to more efficiently and rapidly screen out a subset of likely nonactive ligands in a molecule library for a given target, before applying more onerous flexible docking procedures to the surviving candidate molecules. Rigid-body docking may also be suitable for de novo ligand design and combinatorial library design.
Moreover, in order to better predict the binding mode and better assess the nature and/or likelihood of a molecular combination when one or both molecules are likely to be flexible, rigid-body docking can be used in conjunction with a process for generating likely yet distinct molecular conformers of one or both molecules for straightforward and efficient virtual screening of a molecule library against a target molecule. However, as will be discussed, even rigid body docking of molecular combinations can be computationally expensive and thus there is a clear need for better and more efficient computational methods based on rigid body docking when assessing the nature and/or likelihood of molecular combinations.
As outlined in the section entitled “REFERENCES & PRIOR ART”, conventional computational methods for predicting binding modes and assessing the nature and/or likelihood of molecular combinations in the context of rigid-body docking include a wide variety of techniques. These include methods based on pattern matching (often graph-based), maximization of shape complementarity (i.e., shape correlations), geometric hashing, pose clustering, and even the use of one or more flexible docking methods with the simplifying run-time condition that both molecules are rigid.
Of special interest to this invention is a class of rigid-body docking techniques based on the maximization of shape complementarity via evaluation of the spatial correlation between two representative molecular surfaces at different relative positions and orientations. One example is the “Hex” docking software described in Ritchie, D. W. and Kemp, G. J. L., “Protein Docking Using Spherical Polar Fourier Correlations”, (2000), Proteins: Structure, Function, and Genetics, 39, 178-194 (hereinafter, “Ritchie et al”), which is hereby incorporated by reference in its entirety.
Further examples include the “FTDOCK” docking software of the Cambridge Crystallographic Data Center described in Aloy, P., Moont, G., Gabb, H. A., Querol, E., Aviles, F. X., and Sternberg, M. J. E., “Modeling Protein Docking using Shape Complementarity, Electrostatics and Biochemical Information,” (1998), Proteins: Structure, Function, and Genetics, 33(4), 535-549, which is hereby incorporated by reference in its entirety.
Such shape complementarity based methods, while typically treating molecules as rigid and thus perhaps less rigorous than their flexible docking counterparts, especially in the context of flexible molecules, are still potentially valuable for the fast, efficient screening of two molecules in order to make a preliminary assessment of the nature and/or likelihood of formation of a potential molecular complex of the two molecules or to make an initial prediction of the preferred binding mode for the molecular combination. Such a preliminary assessment may significantly reduce the number of candidates that must be further screened in silico by another more computationally costly docking method. Moreover, the utility of computing shape complementarity has been demonstrated with respect to multiple protein-protein systems, including both enzyme-inhibitor and antibody-antigen, as per FTDOCK and Ritchie et al.
Previous formulations for the computation of shape complementarity generally fall into four categories.
The first category involves the calculation of a spatial correlation in the spatial domain between two volumetric functions describing a representative molecular surface for each molecule, where the spatial correlation between two 3-D complex functions, f(r) and g(r), of the position vector r is calculated as follows:

f * g = ∫ f̄(r) g(r + x) dr  [Eqn. 1]

where f̄ denotes the complex conjugate of f, x is the relative displacement vector, and * denotes the correlation operation. In the spatial domain, the spatial correlation is performed by converting the integrals into summations and directly computing over a sampling space comprising three translation variables with a specified sampling resolution. A search is then conducted by reevaluating the spatial correlation for each new and different relative orientation of the molecules. Those configurations that show the highest net spatial correlation are typically selected as possible candidate binding modes.
However, the method of directly computing the spatial correlation in the spatial domain is often very computationally intensive: if the sampling translation grid is an M×M×M grid, the above spatial correlation calculation requires O(M⁶) operations. For instance, when M=256, more than 2.8×10¹⁴ multiplication operations are required. Furthermore, the O(M⁶) calculations must be performed for every relative orientation of the two molecules, making the total number of calculations impractical at best.
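The direct spatial-domain evaluation described above can be sketched as follows. This is an illustrative brute-force implementation only, not the patented method; the function name and the periodic (wrap-around) boundary treatment are our own simplifying assumptions:

```python
import numpy as np

# Illustrative brute-force evaluation of the Eqn. 1 correlation on an
# M x M x M grid, treating the grid as periodic. Every displacement
# (x, y, z) requires a full O(M^3) sum, giving O(M^6) work overall,
# which is why only a tiny M is practical with this approach.
def spatial_correlation_direct(f, g):
    M = f.shape[0]
    corr = np.zeros((M, M, M), dtype=complex)
    for x in range(M):
        for y in range(M):
            for z in range(M):
                # shifted[r] = g[r + (x, y, z)] under periodic wrap-around
                shifted = np.roll(g, shift=(-x, -y, -z), axis=(0, 1, 2))
                corr[x, y, z] = np.sum(np.conj(f) * shifted)
    return corr
```

For a toy grid such as M=4 this runs instantly, but the M⁶ scaling makes the M=256 case quoted above, with its more than 10¹⁴ multiplications, entirely impractical per orientation.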
The second category involves the calculation of a spatial correlation in the frequency domain between two volumetric functions describing a representative molecular surface for each molecule, where the spatial correlation between two 3-D functions, f and g, is still defined as before, but a frequency space decomposition, such as a Fourier transform, is used in order to reduce the number of calculations. For a full description of the Fourier transform refer to Press, W. H., Flannery, B. P., Teukolsky, S. A., and Vetterling, W. T., “Numerical Recipes in C: The Art of Scientific Computing”, Cambridge University Press (1993) (hereinafter, “Press et al”), hereby incorporated by reference in its entirety.
To compute a 3-D spatial correlation in Fourier space, one can use the following relation, also known as the Correlation Theorem (Press et al). The correlation of two complex 3-D functions f(r) and g(r) is given by:

f * g = F⁻¹( F̄(υ) G(υ) )  [Eqn. 2]

where F⁻¹ refers to the inverse Discrete Fourier Transform (Press et al), F̄(υ) is the Discrete Fourier Transform of the complex conjugate of f(r), and G(υ) is the Discrete Fourier Transform of g(r). Similar formulae can be generated for other frequency decompositions besides Fourier, such as a Laplace transform, a Discrete Cosine transform, and others, as described in Arfken, G. B., and Weber, H. J., “Mathematical Methods for Physicists”, Harcourt/Academic Press (2000) (hereinafter, “Arfken et al”), hereby incorporated by reference in its entirety.
For the same M×M×M grid, the frequency-based evaluation of the spatial correlation requires approximately O(3M³ ln(M)) operations, where ln(M) denotes the natural logarithm of M. While the number of operations decreases substantially when DFTs are used as opposed to direct computation in the spatial domain, the amount of memory storage and/or the amount of data that must be read from storage must still be taken into account, i.e., the input/output (I/O) bandwidth requirement.
For example, for M=256 at 16-bit precision, 800 Mbits are required for computing the 3-D spatial correlation using DFTs for just one relative orientation. Generally, this is a very large amount of data for storage directly in memory and would require millions of clock cycles to fetch from one or more DRAM chips with current DRAM and I/O bus technology. Moreover, the O(3M³ ln(M)) calculations and the access of hundreds of millions of data bits must be performed for every relative orientation of the two molecules, making the entire process onerous when considering possibly millions of such relative orientations in the course of a high resolution search of the shape complementarity space. For this reason, Fourier based methods for evaluating shape complementarity often take hours with conventional computer software to complete for large protein systems, for instance, as in FTDOCK, and as such are not suitable for large-scale screening.
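The frequency-domain evaluation of Eqn. 2 can be sketched with NumPy's FFT routines. This is a minimal illustration with random placeholder volumetric data, not real molecular shape functions and not the patented implementation:

```python
import numpy as np

# Sketch of the Eqn. 2 frequency-domain correlation using NumPy's FFT.
# The volumetric "surface" functions here are random placeholders;
# function and variable names are illustrative assumptions.
def spatial_correlation_fft(f, g):
    F = np.fft.fftn(f)                    # DFT of f
    G = np.fft.fftn(g)                    # DFT of g
    return np.fft.ifftn(np.conj(F) * G)   # O(M^3 log M) per orientation

M = 32
rng = np.random.default_rng(1)
f = rng.random((M, M, M))
g = rng.random((M, M, M))
corr = spatial_correlation_fft(f, g)

# The displacement with the highest (real) correlation value is a
# candidate translation for this particular relative orientation:
best_shift = np.unravel_index(np.argmax(corr.real), corr.shape)
```

Even with the O(M³ ln M) operation count, note that the three M³ complex arrays must still be held in or streamed through memory for every orientation, which is the I/O bandwidth burden discussed above.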
The third category involves the least-squares minimization (or equivalent minimization) of separation distances between critical surface and/or fitting points that represent the molecular surfaces of the two molecules. Examples include the PatchDock docking software written by the Nussinov-Wolfson Structural Bioinformatics Group at Tel-Aviv University, based on principles described in Lin, S. L., Nussinov, R., Fischer, D., and Wolfson, H. J., “Molecular surface representations by sparse critical points”, (1994) Proteins: Structure, Function, and Genetics, 18, 94-101, which is hereby incorporated by reference in its entirety.
Such methods often suffer from degraded accuracy, especially when the molecular surface geometry is complex or when the ligand molecule is very small relative to the protein receptor and/or characterized by poor binding affinities. Moreover, computing the surface critical points is often itself very expensive.
The number of computations associated with the three method categories described above renders the process impractical for use with conventional computer software and hardware configurations when performing large-scale screening. Moreover, the above methods are not practical for high accuracy prediction of the binding mode due to the high resolution required of the associated sampling space.
A fourth category has been developed for the efficient estimation of shape complementarity based on the decomposition of two volumetric functions describing a representative molecular surface for each molecule onto an appropriate orthogonal basis set, such as a radial-spherical harmonics expansion, as described in Ritchie et al. The chief advantage of this type of method is that the required number of calculations scales linearly with the desired number of sampled configurations, thus allowing for a dense sampling of the geometric shape complementarity. Moreover, the computing time is roughly invariant with respect to the sizes of the two molecules and is thus suitable for protein-protein docking. However, to achieve high accuracy for complex molecular surface geometries, it is necessary to perform the orthogonal basis expansion with a large expansion order, and as such the total computing time can be quite large. Furthermore, current methods such as those outlined in Ritchie et al are not amenable to implementation in customized or other application specific hardware for use in large-scale screening.
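The core idea of this fourth category, reducing a shape function to a vector of expansion coefficients against an orthogonal basis, can be illustrated with a simple 1-D analogue. The sketch below uses Legendre polynomials purely for illustration; Ritchie et al use a radial/spherical-harmonic basis in 3-D, and all names and the test function here are our own assumptions:

```python
import numpy as np
from numpy.polynomial import legendre

# Toy 1-D analogue of orthogonal basis decomposition: a shape function
# is reduced to a coefficient vector against Legendre polynomials.
# Once shapes are represented by coefficient vectors, per-configuration
# scoring operates on those vectors rather than on full 3-D grids.
def legendre_coefficients(f, order, n_quad=64):
    """Project f(x), x in [-1, 1], onto Legendre polynomials P_0..P_order."""
    x, w = legendre.leggauss(n_quad)          # Gauss-Legendre quadrature nodes
    coeffs = []
    for l in range(order + 1):
        P_l = legendre.Legendre.basis(l)(x)   # evaluate P_l at the nodes
        # c_l = <f, P_l> / <P_l, P_l>, with <P_l, P_l> = 2 / (2l + 1)
        coeffs.append((2 * l + 1) / 2.0 * np.sum(w * f(x) * P_l))
    return np.array(coeffs)

# A smooth placeholder "shape" function and its truncated reconstruction;
# accuracy improves with expansion order, mirroring the accuracy-versus-
# expansion-order trade-off noted in the text.
f = lambda x: np.exp(-3.0 * (x - 0.2) ** 2)
c = legendre_coefficients(f, order=16)
xs = np.linspace(-1.0, 1.0, 201)
max_err = np.max(np.abs(legendre.legval(xs, c) - f(xs)))
```

The same trade-off appears in the 3-D case: a sharper, more complex surface needs a higher expansion order, and the cost of computing and correlating the coefficients grows accordingly.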
In summary, it is desirable in the drug discovery process to identify quickly and efficiently the optimal configurations, i.e., binding modes, of two molecules or parts of molecules. Efficiency is especially relevant in the lead generation and lead optimization stages for a drug discovery pipeline, where it may be desirable to accurately predict the binding mode and binding affinity for possibly millions of potential target-ligand molecular combinations, before submitting promising candidates to further analysis. There is a clear need then to have more efficient systems and methods for computational modeling of the molecular combinations with reasonable accuracy.
In general, the present invention relates to an efficient computational method for analysis of molecular combinations based on maximization of shape complementarity over a set of configurations of a molecular combination through computation of a basis expansion representing molecular shapes in a coordinate system. Here the analysis of the molecular combination may involve the prediction of likelihood of formation of a potential molecular complex, the prediction of the binding mode (or even additional alternative modes) for the combination, the characterization of the nature of the interaction or binding of various components of the molecular combination, or even an approximation of binding affinity for the molecular combination based on a shape complementarity score or an equivalent measure. The invention also addresses and solves the various hardware implementation hurdles and bottlenecks associated with current conventional methods.