An explanation of conventional drug discovery processes and their limitations is useful for understanding the present invention.
Discovering a new drug to treat or cure some biological condition is a lengthy and expensive process, typically taking on average 12 years and $800 million per drug, and possibly taking up to 15 years or more and $1 billion to complete in some cases. The process may include wet lab testing/experiments, various biochemical and cell-based assays, animal models, and also computational modeling in the form of computational tools in order to identify, assess, and optimize potential chemical compounds that either serve as drugs themselves or as precursors to eventual drug molecules.
A goal of a drug discovery process is to identify and characterize a chemical compound or ligand (i.e., a binder or biomolecule) that affects the function of one or more other biomolecules (i.e., a drug “target”), usually a biopolymer, in an organism via a potential molecular interaction or combination. Herein the term biopolymer refers to a macromolecule that comprises one or more of a protein, nucleic acid (DNA or RNA), peptide, or nucleotide sequence, or any portions or fragments thereof. Herein the term biomolecule refers to a chemical entity that comprises one or more of a biopolymer, carbohydrate, hormone, or other molecule or chemical compound, either inorganic or organic, including, but not limited to, synthetic, medicinal, drug-like, or natural compounds, or any portions or fragments thereof. The target molecule is typically a disease-related target protein or nucleic acid for which it is desired to effect a change in function, structure, and/or chemical activity in order to aid in the treatment of a patient disease or other disorder. In other cases, the target is a biomolecule found in a disease-causing organism, such as a virus, bacterium, or parasite, that when affected by the drug will affect the survival or activity of the infectious organism. In yet other cases, the target is a biomolecule of a defective or harmful cell such as a cancer cell. In yet other cases the target is an antigen or other environmental chemical agent that may induce an allergic reaction or other undesired immunological or biological response.
The ligand is typically what is known as a small molecule drug or chemical compound with desired drug-like properties in terms of potency, low toxicity, membrane permeability, solubility, chemical/metabolic stability, etc. In other cases, the ligand may be biologic such as an injected protein-based or peptide-based drug or even another full-fledged protein. In yet other cases, the ligand may be a chemical substrate of a target enzyme. The ligand may even be covalently bound to the target or may in fact be a portion of the protein, e.g., protein secondary structure component, protein domain containing or near an active site, protein sub-unit of an appropriate protein quaternary structure, etc.
Throughout the remainder of the background discussion, unless otherwise specifically differentiated, a (potential) molecular combination will feature one ligand and one target, the ligand and target will be separate chemical entities, and the ligand will be assumed to be a chemical compound while the target will typically be a biological protein (mutant or wild type). Note that the frequency of nucleic acids (both DNA/RNA) as targets will likely increase in coming years as advances in gene therapy and pathogenic microbiology progress. Also the term “molecular complex” will refer to the bound state between the target and ligand when interacting with one another in the midst of a suitable (often aqueous) environment. A “potential” molecular complex refers to a bound state that may occur albeit with low probability and therefore may or may not actually form under normal conditions.
The drug discovery process itself typically includes four different sub-processes: (1) target validation; (2) lead generation/optimization; (3) preclinical testing; and (4) clinical trials and approval.
Target validation includes determination of one or more targets that have disease relevance and usually takes two-and-a-half years to complete. Results of the target validation phase might include a determination that the presence or action of the target molecule in an organism causes or influences some effect that initiates, exacerbates, or contributes to a disease for which a cure or treatment is sought. In some cases a natural binder or substrate for the target may also be determined via experimental methods.
Lead generation typically involves the identification of lead compounds, i.e., ligands that can bind to the target molecule and that may alter the effects of the target through either activation, deactivation, catalysis, or inhibition of the function of the target, in which case the lead would be viewed as a suitable candidate ligand to be used in the drug application process. Lead optimization involves the chemical and structural refinement of lead candidates into drug precursors in order to improve binding affinity to the desired target, increase selectivity, and address basic issues of toxicity, solubility, and metabolism. Together, lead generation and lead optimization typically take about three years to complete and might result in one or more chemically distinct leads for further consideration.
In preclinical testing, biochemical assays and animal models are used to test the selected leads for various pharmacokinetic factors related to drug absorption and membrane permeability, distribution, metabolism, excretion, toxicity, side effects, and required dosages. This preclinical testing takes approximately one year. After the preclinical testing period, clinical trials and approval take another six to eight or more years during which the drug candidates are tested on human subjects for safety and efficacy. Part of the exorbitant expense of today's drug discovery is that many optimized leads still fail in preclinical and clinical testing, due to side effects or other reasons. In fact, the number of drugs that survive clinical trials and are ultimately approved is very low when compared to the projects that are initiated on validated target proteins near the beginning of the drug discovery process.
Rational drug design generally uses structural information about drug targets (structure-based) and/or their natural ligands (ligand-based) as a basis for the design of effective lead candidate generation and optimization. Structure-based rational drug design generally utilizes a three-dimensional model of the structure for the target. For target proteins or nucleic acids, such structures may be obtained as the result of X-ray crystallography, NMR, or other measurement procedures, or may result from homology modeling, analysis of protein motifs and conserved domains, and/or computational modeling of protein folding or the nucleic acid equivalent. Model-built structures are often all that is available when considering many membrane-associated target proteins, e.g., GPCRs and ion channels. The structure of a ligand may be generated in a similar manner or may instead be constructed ab initio from a known 2-D chemical representation using fundamental physics and chemistry principles, provided the ligand is not a biopolymer.
Rational drug design may incorporate the use of any of a number of computational components ranging from computational modeling of target-ligand molecular interactions and combinations to lead optimization to computational prediction of desired drug-like biological and pharmacokinetic properties. The use of computational modeling in the context of rational drug design has been largely motivated by a desire to both reduce the required time and to improve the focus and efficiency of drug research and development, by avoiding often time consuming and costly efforts in biological “wet” lab testing and the like.
Computational modeling of target-ligand molecular combinations in the context of lead generation may involve the large-scale in-silico screening of molecule libraries (i.e., library screening), whether the libraries are virtually generated and stored as one or more structural databases or constructed via combinatorial chemistry and organic synthesis, using computational methods to rank a selected subset of ligands based on computational prediction of bioactivity (or an equivalent measure) with respect to the intended target molecule.
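Conceptually, such a library screen reduces to scoring every candidate ligand against the target and ranking the results. The following minimal sketch illustrates only the ranking step; the `predicted_affinity` function below is a hypothetical stand-in (not part of the present disclosure) for a full docking and scoring pipeline:

```python
from dataclasses import dataclass

@dataclass
class Ligand:
    name: str
    smiles: str  # 2-D chemical representation of the compound

def predicted_affinity(ligand: Ligand, target_id: str) -> float:
    """Hypothetical stand-in for a docking/scoring pipeline.

    Returns a toy 'binding free energy' in kcal/mol (more negative =
    tighter predicted binding). A real implementation would dock the
    ligand against the target structure and score the resulting pose.
    """
    return -((sum(ord(c) for c in ligand.smiles + target_id)) % 100) / 10.0

def screen_library(library, target_id, top_n=10):
    """Rank a ligand library by predicted affinity and keep the best hits."""
    scored = [(predicted_affinity(lig, target_id), lig) for lig in library]
    scored.sort(key=lambda pair: pair[0])  # most negative energy first
    return scored[:top_n]
```

In practice the per-ligand scoring step dominates the cost of such a screen, which is why large-scale library screening is typically distributed across many processors.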
Throughout the text, the term “binding mode” refers to the 3-D molecular structure of a potential molecular complex in a bound state at or near a minimum of the binding energy (i.e., maximum of the binding affinity), where the term “binding energy” (sometimes interchanged with “binding free energy” or with its conceptually antipodal counterpart “binding affinity”) refers to the change in free energy of a molecular system upon formation of a potential molecular complex, i.e., the transition from an unbound to a (potential) bound state for the ligand and target. The term “system pose” is also sometimes used to refer to the binding mode. Here the term free energy generally refers to both enthalpic and entropic effects as the result of physical interactions between the constituent atoms and bonds of the molecules between themselves (i.e., both intermolecular and intramolecular interactions) and with their surrounding environment, meaning the physical and chemical surroundings of the site of reaction between one or more molecules. One example of such a free energy is the Gibbs free energy encountered in the canonical or grand canonical ensembles of equilibrium statistical mechanics.
In general, the optimal binding free energy of a given target-ligand pair directly correlates to the likelihood of combination or formation of a potential molecular complex between the two molecules in chemical equilibrium, though, in truth, the binding free energy describes an ensemble of (putative) complex structures and not one single binding mode. However, in computational modeling it is usually assumed that the change in free energy is dominated by a single structure corresponding to a minimal energy. This is certainly true for tight binders (pK˜0.1 to 10 nanomolar) but questionable for weak ones (pK˜10 to 100 micromolar). The dominating structure is usually taken to be the binding mode. In some cases, it may be necessary to consider more than one alternative-binding mode when the associated system states are nearly degenerate in terms of energy.
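The correlation between binding free energy and likelihood of complex formation follows from the standard thermodynamic relation ΔG = RT ln(Kd). The short sketch below (standard physical constants only, not specific to the present invention) makes the tight-binder vs. weak-binder distinction above concrete:

```python
import math

R_KCAL = 0.0019872  # gas constant, kcal/(mol*K)
T = 298.15          # room temperature, K

def delta_g_from_kd(kd_molar: float) -> float:
    """Binding free energy (kcal/mol) from a dissociation constant (M),
    via dG = RT * ln(Kd); a smaller Kd (tighter binder) gives a more
    negative dG."""
    return R_KCAL * T * math.log(kd_molar)

tight = delta_g_from_kd(1e-9)   # 1 nM binder: about -12.3 kcal/mol
weak  = delta_g_from_kd(10e-6)  # 10 uM binder: about -6.8 kcal/mol
```

The roughly 5 kcal/mol gap between a nanomolar and a 10-micromolar binder is why a single dominant low-energy structure is a reasonable assumption for tight binders but much less so for weak ones.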
Binding affinity is of direct interest to drug discovery and rational drug design because the interaction of two molecules, such as a protein that is part of a biological process or pathway and a drug candidate sought for targeting a modification of the biological process or pathway, often helps indicate how well the drug candidate will serve its purpose. Furthermore, where the binding mode is determinable, the action of the drug on the target can be better understood. Such understanding may be useful when, for example, it is desirable to further modify one or more characteristics of the ligand to improve its potency (with respect to the target), binding specificity (with respect to other target biopolymers), or other chemical and metabolic properties.
A number of laboratory methods exist for measuring or estimating affinity between a target molecule and a ligand. Often the target might be first isolated and then mixed with the ligand in vitro and the molecular interaction assessed experimentally, such as in the myriad biochemical and functional assays associated with high throughput screening. However, such methods are most useful where the target is simple to isolate, the ligand is simple to manufacture, and the molecular interaction is easily measured, but are more problematic when the target cannot be easily isolated, isolation interferes with the biological process or disease pathway, the ligand is difficult to synthesize in sufficient quantity, or the particular target or ligand is not well characterized ahead of time. In the latter case, many thousands or millions of experiments might be needed for all possible combinations of the target and ligands, making the use of laboratory methods unfeasible.
While a number of attempts have been made to resolve this bottleneck by first using specialized knowledge of various chemical and biological properties of the target (or even related targets such as protein family members) and/or one or more already known natural binders or substrates to the target, to reduce the number of combinations required for lab processing, this is still impractical and too expensive in most cases. Instead of actually combining molecules in a laboratory setting and measuring experimental results, another approach is to use computers to simulate or characterize molecular interactions between two or more molecules (i.e., molecular combinations modeled in silico). The use of computational methods to assess molecular combinations and interactions is usually associated with one or more stages of rational drug design, whether structure-based, ligand-based, or both.
When computationally modeling the nature and/or likelihood of a potential molecular combination for a given target-ligand pair, the actual computational prediction of binding mode and affinity is customarily accomplished in two parts: (a) “docking”, in which the computational system attempts to predict the optimal binding mode for the ligand and the target and (b) “scoring”, in which the computational system attempts to estimate the binding affinity associated with the computed binding mode. During library screening, scoring may also be used to predict a relative binding affinity for one ligand vs. another ligand with respect to the target molecule and thereby rank prioritize the ligands or assign a probability for binding.
Docking may involve a search or function optimization algorithm, whether deterministic or stochastic in nature, with the intent to find one or more system poses that have favorable affinity.
Scoring may involve a more refined estimation of an affinity function, where the affinity is represented in terms of a combination of one or more empirical, molecular-mechanics-based, quantum-mechanics-based, or knowledge-based expressions, i.e., a scoring function. Individual scoring functions may themselves be combined to form a more robust consensus-scoring scheme using a variety of formulations. In practice, there are many different docking strategies and scoring schemes employed in the context of today's computational drug design.
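One common consensus formulation (among the variety of formulations noted above) is rank-by-rank voting: each scoring function ranks the ligands independently, and the consensus orders ligands by mean rank, which damps the idiosyncratic failures of any single function. A minimal sketch, with illustrative names that are not part of the present disclosure:

```python
def consensus_rank(scores_by_function):
    """Rank-by-rank consensus scoring.

    `scores_by_function` maps scoring-function name -> {ligand: score},
    where lower (more negative) scores mean tighter predicted binding.
    Each function ranks the ligands; the consensus orders ligands by
    their mean rank across all functions.
    """
    ligands = list(next(iter(scores_by_function.values())))
    mean_rank = {}
    for lig in ligands:
        ranks = []
        for scores in scores_by_function.values():
            ordered = sorted(scores, key=scores.get)  # best score = rank 0
            ranks.append(ordered.index(lig))
        mean_rank[lig] = sum(ranks) / len(ranks)
    return sorted(ligands, key=mean_rank.get)
```

Working on ranks rather than raw scores sidesteps the fact that empirical, molecular-mechanics, and knowledge-based functions report affinities on incommensurate scales.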
Another important area of application for computational docking and/or scoring, beyond the scope of virtual library screening, is in the process of in silico lead optimization, in which lead candidates are computationally examined with more scrutiny with respect to their binding affinity to the target. Most biomolecules are only relevant as potential drugs if they can outcompete other biomolecules that may interact with the target protein in the same or a nearby active site, including the target's usual binding partner, another natural compound or antagonist, or even another drug. The results of high throughput screening usually only identify lead candidates with micromolar or worse binding affinities (i.e., IC50˜1-100 micromolar). Lead optimization involves refining or modifying the lead candidate(s) in order to generate submicromolar and even nanomolar (i.e., IC50˜10−9) drugs. Typical computational methods for estimating binding affinity of modified leads for the purpose of lead optimization include QSAR [53, 54], QM, MM, or QM/MM simulations [55], estimation of the change in free energy of the system using perturbation theory [56, 57], and other methods for structure-based molecular design [58].
However, even if a potential lead candidate binds well to one or more desired target biopolymers, i.e., demonstrates good bioactivity, the candidate molecule must ultimately meet further rigorous requirements regarding metabolism, toxicity, unwanted side effects, host distribution and delivery to the intended target site, inter- and intracellular transport, and excretion. In order to assess the viability of the lead candidate as a potential drug molecule and thereby possibly both shorten the timeline to market and reduce wasted R&D efforts, computational modeling has been employed to generate so-called ADME/Tox (Absorption Distribution Metabolism Excretion/Toxicology) profiles [59].
There are many measures involved in generating an ADME/Tox profile in silico. These include empirically guided predictions or estimations of bioaccumulation, bioavailability, metabolism, pKa, carcinogenicity, mutagenicity, Log D, n-octanol/water partition coefficient or Log P, water solubility, permeability with respect to the blood brain barrier and other membranes, intestinal absorption, skin sensitivity, and even chemical and structural similarity to other known drugs or organic compounds. Some measures involve more precise numerical computation or prediction of physical or energetic properties of the biomolecule, such as pKa, Log D, Log P, etc. [60]. Other measures typically involve usage of knowledge- or rule-based approaches, such as prediction of metabolic properties based on known biological pathways and transformations, qualitative estimation of distribution properties based on Lipinski's empirical “rule of five” [61], and prediction of toxicity using chemoinformatics knowledge databases to assess carcinogenicity, mutagenicity, teratogenicity, skin sensitivity, etc. Yet other measures are based on chemical and structural similarity to known biomolecules that are, for example, permeable with respect to the blood brain barrier or perhaps adversely affect the function of various organs, such as the liver or kidneys.
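For illustration, the qualitative "rule of five" screen mentioned above can be sketched as a simple violation counter (thresholds per Lipinski [61]; the function names and descriptor inputs are illustrative only and would in practice be computed from the compound's structure):

```python
def lipinski_violations(mol_weight: float, log_p: float,
                        h_donors: int, h_acceptors: int) -> int:
    """Count violations of Lipinski's empirical "rule of five":
    MW <= 500 Da, Log P <= 5, H-bond donors <= 5, H-bond acceptors <= 10.
    Compounds violating more than one rule are considered less likely
    to be orally bioavailable."""
    violations = 0
    if mol_weight > 500:
        violations += 1
    if log_p > 5:
        violations += 1
    if h_donors > 5:
        violations += 1
    if h_acceptors > 10:
        violations += 1
    return violations

def passes_rule_of_five(mol_weight, log_p, h_donors, h_acceptors) -> bool:
    """A common convention: at most one violation is tolerated."""
    return lipinski_violations(mol_weight, log_p, h_donors, h_acceptors) <= 1
```

Such rule-based filters are deliberately crude; they serve as inexpensive early triage before the more costly numerical ADME/Tox predictions are attempted.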
However, even with all of these computational tools in place, the failure rate due to safety and efficacy considerations in both preclinical testing and clinical trials featuring human (or other) subjects is still staggering in terms of both cost and lost market opportunity. A key reason for these failures is the existence of adverse cross-reactions between the lead candidate and other biomolecules (often other biopolymers) in the host organism, due to lack of specificity in binding to the desired target. These adverse cross-reactions can often lead to reduced efficacy of the drug candidate, to unwanted side effects, and even illness or death.
For example, the lead candidate while inhibiting the desired target protein may also unfortunately bind to one or more proteins or cell surface receptors in the liver, leading to a chemical imbalance or even a serious medical condition such as cirrhosis. In yet another example, the lead candidate may be misidentified by the host organism's immune system as a pathogen or other antigen leading to allergic reactions and other immune disorders. Ironically, even many of the commercial drugs that make it past clinical trials and approval often have various undesired side effects, such as insomnia, nausea, dehydration, erectile dysfunction, and drowsiness. Others such as cancer fighting drugs or strong antiviral agents have far more serious side effects which are only tolerated by the medical community since the underlying disease is otherwise lethal, e.g., chemotherapy drugs for various cancers, HIV cocktail drugs, etc.
Thus, advance knowledge of potential adverse cross-reactions could not only reduce the time and cost involved in preclinical testing and clinical trials, but could also aid in selecting drug candidates with fewer unwanted side effects and higher efficacy, directly benefiting the commercial viability of the drug.
As of Sep. 30, 2003, there are 20,501 protein structures, 948 protein/nucleic acid complexes, and 1233 nucleic acids (plus 18 carbohydrates) for a total of 22,700 macromolecular structures in the Protein Data Bank (PDB), with an estimated doubling rate of approximately three years, as per the information published at the PDB web site at www.rcsb.org/pdb/[62]. Approximately 83% are obtained via X-ray diffraction, 15% by NMR, and the remainder via other experimental methods. The nearly 20 K protein structures come from more than 1,000 individual species, with 4767 from Homo sapiens, 2411 from E. coli, 1030 from mouse, etc., whereas roughly 1900 are synthetic and for an additional 4658 entries the species is not clearly identified; for a complete breakdown by species please see www.biochem.ucl.ac.uk/bsm/pdbsum/species/. Further analysis and categorization of protein structure information available in the PDB may be found in [65] and [66]. The protein structures available in the PDB for nonhuman proteins often carry a wealth of information for purposes of drug design, especially considering that many proteins from other organisms such as mouse, pig, chicken, C. elegans, yeast, etc. are homologous to human proteins in terms of both structure and function. Furthermore, certain disease targets involve nonhuman proteins such as viral or bacterial proteins. New advances in structural proteomics and functional genomics continue to expand the number of available quality 3-D protein structures with functional annotation. Also, further advancements in homology modeling, protein motifs, and protein threading continue to supplement the 3-D protein and nucleic acid structures obtained via experiment. Thus, the list of well-characterized potential cross-reactants for many drug candidates is already becoming substantial and will only continue to grow in the coming years.
Prior to this invention, there have been no systematic methods for precisely and effectively calculating the adverse reactions of lead molecules (i.e., potential drug candidates) on a computer based system.
Recently, a method named “inverse docking” has focused on a problem of much more limited scope, namely the identification of alternative targets for a single drug-like molecule, and was described in Chen, Y. Z. and Zhi, D. G., “Ligand-Protein Inverse Docking and Its Potential Use in the Computer Search of Protein Targets of a Small Molecule”, Proteins Vol. 43, 217-226 (2001) and further in Chen, Y. Z., Zhi, D. G., Ung, C. Y., “Computational Method for Drug Target Search and Application in Drug Discovery”, Journal of Theoretical and Computational Chemistry, Vol. 1, No. 1, 213-224 (2002), (hereinafter, “Chen et al.”); both of which are hereby incorporated by reference in their entirety. Even in this limited scope, these methods suffer from significant shortcomings, as we will presently describe. The method of Chen et al. relies on a very specific geometric algorithm that essentially utilizes a mixture of the ‘sphgen’ algorithm generally associated with the UCSF software-docking tool DOCK [5, 6, 7] and a hybrid docking method described in Wang et al. [33] for evaluating the interactions between a known drug candidate and potential alternative targets. It has already been demonstrated in the art that such an approach will not be robust in predicting an accurate binding mode, let alone an accurate measure for binding energy, for most ligands.
Such a method relies on a simple molecular mechanics prediction of the binding energy, which is known in the art to predict the change in free energy with an error of ±3 kcal/mol for most systems [34-39]. Yet even a difference of two kcal/mol leads to a nearly 30-fold difference in the dissociation or inhibition constant of a molecular system. Thus, the method of Chen et al. cannot be expected to accurately estimate the absolute change in binding energy of the system. This is in fact supported by closer scrutiny of the authors' own published data. In Table 1 of Chen et al. (2002), four out of the nine systems (1hvr, 4phv, 1dhf, 3cpa) have publicly available experimental binding free energy values (respectively −13.13, −12.86, −3.08, and −5.39 kcal/mol as published at www-mitchell.ch.cam.ac.uk/pld/energy.php). Yet, this is in stark contrast to the reported predictions of respectively −70.2, −94.51, −48.67, and −40.63 kcal/mol. Moreover, the ligands in pdb entries 1hvr and 4phv are experimentally measured to have similar binding affinity for the same HIV-1 protease target protein, yet the corresponding predictions are off by more than 34%, a significant error.
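The 30-fold figure quoted above follows directly from the Boltzmann relation Kd2/Kd1 = exp(ΔΔG/RT). A short numerical check (standard physical constants only, not specific to any cited method):

```python
import math

RT = 0.0019872 * 298.15  # about 0.593 kcal/mol at room temperature

def kd_fold_change(delta_delta_g_kcal: float) -> float:
    """Fold change in dissociation constant produced by a binding
    free-energy difference: Kd2 / Kd1 = exp(ddG / RT)."""
    return math.exp(delta_delta_g_kcal / RT)

fold_2kcal = kd_fold_change(2.0)  # about 29, i.e., nearly 30-fold
fold_3kcal = kd_fold_change(3.0)  # a 3 kcal/mol error spans ~160-fold
```

Because the dissociation constant depends exponentially on the free energy, a ±3 kcal/mol error band in the scoring function corresponds to more than two orders of magnitude of uncertainty in predicted affinity.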
Additionally, the method of Chen et al. relies on accurate prior characterization of the relevant active site of the potential target protein. While such prior information is often available for proteins in relation to their natural binders or agonists/antagonists, such is not often the case when exploring potential cross-reactions between various biopolymers and lead candidates designed for other, specific target proteins.
Furthermore, the method of Chen et al. utilizes an empirical threshold based on fitting a polynomial function of the number of ligand atoms to results generated by their docking protocol on a subset of pdb structures in order to remove certain false positive candidates. Yet, as anyone skilled in the art will recognize, no such empirical measure based on the number of ligand atoms correlates well with the observed experimental binding free energy data compiled to date.
Lastly, for the reasons listed above, the method of Chen et al. requires the existence of at least one pdb entry featuring a complex of the potential target with at least one other ‘competing’ ligand in order to better calibrate their prediction of binding affinity for the ligand in question, further limiting its applicability. However, many pdb structures do not contain a relevant bound ligand, and many others are bound to ligands at entirely different active sites than what may be relevant to the lead candidate. Moreover, in many systems, the target protein undergoes significant conformational changes when in transition from an unbound to a bound state, i.e., an “induced fit”. Thus, lacking a mechanism for appropriate modeling of receptor flexibility, utilization of the bound conformational state will be inappropriate in many systems. Examples include carboxypeptidase A complexed with various small peptides and calmodulin complexed with tamoxifen, both of which are included in the listed results of Chen et al. Together these factors might severely limit the scope of potential targets that can be screened by the method published in Chen et al.
Chen et al.'s method sacrifices both predictive accuracy and robustness in order to be able to rapidly process a given drug candidate against multiple potential protein targets, with the intent of ‘fishing’ for additional secondary therapeutic targets of the lead candidate. But the requirement of at least one competitive ligand in complex with the potential target in the relevant prior characterized active site makes the prospects very unattractive for identification of potential unwanted side effects and other adverse cross reactions against a large collection of alternative target biopolymers.
Moreover, as will be discussed in more detail later, prior art to date has not addressed three other very important issues regarding the prediction of adverse cross-reactions and their application to improving the drug discovery process. The first is how to utilize the wealth of annotational information regarding function, source, homology to other structures, and family classification for macromolecular targets that exists in both the public and proprietary domains in order to make better judgments about potential adverse cross-reactions. The second is how to create an effective comparative evaluation of a collection of lead molecules based on their profiles of potential cross-reactions. The third is how to use the results of sophisticated computational modeling to infer how lead candidates that demonstrate one or more potential adverse cross-reactions due to lack of binding specificity can be reengineered or redesigned in order to increase binding discrimination between the desired therapeutic target and the potentially adverse cross-reactants, thereby making the biomolecule a viable drug candidate for clinical trials.
The importance of advance knowledge of potential cross-reactions in significantly improving the process of drug discovery has already been described above. Such knowledge will clearly help design better drugs with fewer adverse side effects, at a reduced cost and with higher chances of success in clinical testing. The present invention will describe systems and methods that address this critical need.