An explanation of conventional drug discovery processes and their limitations is useful for understanding the present invention.
Discovering a new drug to treat or cure a biological condition is a lengthy and expensive process, taking on average 12 years and $800 million per drug, and in some cases up to 15 years or more and $1 billion to complete. The process may include wet lab testing and experiments, various biochemical and cell-based assays, animal models, and computational modeling in order to identify, assess, and optimize potential chemical compounds that either serve as drugs themselves or as precursors to eventual drug molecules.
A goal of a drug discovery process is to identify and characterize a chemical compound or ligand (i.e., a binder biomolecule) that affects the function of one or more other biomolecules (i.e., a drug “target”) in an organism, usually a biopolymer, via a potential molecular interaction or combination. Herein the term biopolymer refers to a macromolecule that comprises one or more of a protein, nucleic acid (DNA or RNA), peptide or nucleotide sequence or any portions or fragments thereof. Herein the term biomolecule refers to a chemical entity that comprises one or more of a biopolymer, carbohydrate, hormone, or other molecule or chemical compound, either inorganic or organic, including, but not limited to, synthetic, medicinal, drug-like, or natural compounds, or any portions or fragments thereof.
The target molecule is typically a disease-related target protein or nucleic acid for which it is desired to effect a change in function, structure, and/or chemical activity in order to aid in the treatment of a patient disease or other disorder. In other cases, the target is a biomolecule found in a disease-causing organism, such as a virus, bacterium, or parasite, that when affected by the drug will affect the survival or activity of the infectious organism. In yet other cases, the target is a biomolecule of a defective or harmful cell such as a cancer cell. In yet other cases, the target is an antigen or other environmental chemical agent that may induce an allergic reaction or other undesired immunological or biological response.
The ligand is typically what is known as a small molecule drug or chemical compound with desired drug-like properties in terms of potency, low toxicity, membrane permeability, solubility, chemical/metabolic stability, etc. In other cases, the ligand may be a biologic such as an injected protein-based or peptide-based drug or even another full-fledged protein. In yet other cases, the ligand may be a chemical substrate of a target enzyme. The ligand may even be covalently bound to the target or may in fact be a portion of the protein, e.g., a protein secondary structure component, a protein domain containing or near an active site, a protein subunit of an appropriate protein quaternary structure, etc.
Throughout the remainder of the background discussion, unless otherwise specifically differentiated, a (potential) molecular combination will feature one ligand and one target, the ligand and target will be separate chemical entities, and the ligand will be assumed to be a chemical compound while the target will be typically a biological protein (mutant or wild type). Note that the frequency of nucleic acids (both DNA/RNA) as targets will likely increase in coming years as advances in gene therapy and pathogenic microbiology progress. Also the term “molecular complex” will refer to the bound state between the target and ligand when interacting with one another in the midst of a suitable (often aqueous) environment. A “potential” molecular complex refers to a bound state that may occur albeit with low probability and therefore may or may not actually form under normal conditions.
The drug discovery process itself typically includes four different subprocesses: (1) target validation; (2) lead generation/optimization; (3) preclinical testing; and (4) clinical trials and approval.
Target validation includes determination of one or more targets that have disease relevance and usually takes two-and-a-half years to complete. Results of the target validation phase might include a determination that the presence or action of the target molecule in an organism causes or influences some effect that initiates, exacerbates, or contributes to a disease for which a cure or treatment is sought. In some cases a natural binder or substrate for the target may also be determined via experimental methods.
Lead generation typically involves the identification of lead compounds that can bind to the target molecule and thereby alter the effects of the target through either activation, deactivation, catalysis, or inhibition of the function of the target, in which case the lead would be viewed as a suitable candidate ligand to be used in the drug application process. Lead optimization involves the chemical and structural refinement of lead candidates into drug precursors in order to improve binding affinity to the desired target, increase selectivity, and also to address basic issues of toxicity, solubility, and metabolism. Together, lead generation and lead optimization typically take about three years to complete and might result in one or more chemically distinct leads for further consideration.
In preclinical testing, biochemical assays and animal models are used to test the selected leads for various pharmacokinetic factors related to drug absorption, distribution, metabolism, excretion, toxicity, side effects, and required dosages. This preclinical testing takes approximately one year. After the preclinical testing period, clinical trials and approval take another six to eight or more years during which the drug candidates are tested on human subjects for safety and efficacy.
Rational drug design generally uses structural information about drug targets (structure-based) and/or their natural ligands (ligand-based) as a basis for effective lead candidate generation and optimization. Structure-based rational drug design generally utilizes a three-dimensional model of the structure of the target. For target proteins or nucleic acids, such structures may result from X-ray crystallography, NMR, or other measurement procedures, or may result from homology modeling, analysis of protein motifs and conserved domains, and/or computational modeling of protein folding or the nucleic acid equivalent. Model-built structures are often all that is available when considering many membrane-associated target proteins, e.g., GPCRs and ion channels. The structure of a ligand may be generated in a similar manner or may instead be constructed ab initio from a known 2-D chemical representation using fundamental physics and chemistry principles, provided the ligand is not a biopolymer.
Rational drug design may incorporate the use of any of a number of computational components ranging from computational modeling of target-ligand molecular interactions and combinations to lead optimization to computational prediction of desired drug-like biological properties. The use of computational modeling in the context of rational drug design has been largely motivated by a desire to both reduce the required time and to improve the focus and efficiency of drug research and development, by avoiding often time consuming and costly efforts in biological “wet” lab testing and the like.
Computational modeling of target-ligand molecular combinations in the context of lead generation may involve the large-scale in-silico screening of compound libraries (i.e., library screening), whether the libraries are virtually generated and stored as one or more compound structural databases or constructed via combinatorial chemistry and organic synthesis, using computational methods to rank a selected subset of ligands based on computational prediction of bioactivity (or an equivalent measure) with respect to the intended target molecule.
Throughout the text, the term “binding mode” refers to the 3-D molecular structure of a potential molecular complex in a bound state at or near a minimum of the binding energy (i.e., maximum of the binding affinity), where the term ‘binding energy’ (sometimes interchanged with ‘binding free energy’ or with its conceptually antipodal counterpart ‘binding affinity’) refers to the change in free energy of a molecular system upon formation of a potential molecular complex, i.e., the transition from an unbound to a (potential) bound state for the ligand and target.
Binding affinity is of direct interest to drug discovery and rational drug design because the interaction of two molecules, such as a protein that is part of a biological process or pathway and a drug candidate sought for targeting a modification of the biological process or pathway, often helps indicate how well the drug candidate will serve its purpose. Furthermore, where the binding mode is determinable, the action of the drug on the target can be better understood. Such understanding may be useful when, for example, it is desirable to further modify one or more characteristics of the ligand so as to improve its potency (with respect to the target), binding specificity (with respect to other target biopolymers), or other chemical and metabolic properties.
A number of laboratory methods exist for measuring or estimating affinity between a target molecule and a ligand. Often the target is first isolated and then mixed with the ligand in vitro, and the molecular interaction is assessed experimentally, as in the myriad biochemical and functional assays associated with high throughput screening. However, such methods are most useful where the target is simple to isolate, the ligand is simple to manufacture, and the molecular interaction is easily measured. They are more problematic when the target cannot be easily isolated, isolation interferes with the biological process or disease pathway, the ligand is difficult to synthesize in sufficient quantity, or the particular target or ligand is not well characterized ahead of time. In the latter case, many thousands or millions of experiments might be needed for all possible combinations of the target and ligands, making the use of laboratory methods infeasible.
While a number of attempts have been made to resolve this bottleneck by first using specialized knowledge of various chemical and biological properties of the target (or even related targets such as protein family members) and/or one or more already known natural binders or substrates to the target, to reduce the number of combinations required for lab processing, this is still impractical and too expensive in most cases. Instead of actually combining molecules in a laboratory setting and measuring experimental results, another approach is to use computers to simulate or characterize molecular interactions between two or more molecules (i.e., molecular combinations modeled in silico). The use of computational methods to assess molecular combinations and interactions is usually associated with one or more stages of rational drug design, whether structure-based, ligand-based, or both.
When computationally modeling the nature and/or likelihood of a potential molecular combination for a given target-ligand pair, the actual computational prediction of binding mode and affinity is customarily accomplished in two parts: (a) “docking”, in which the computational system attempts to predict the optimal binding mode for the ligand and the target, and (b) “scoring”, in which the computational system attempts to estimate the binding affinity associated with the computed binding mode. During library screening, scoring may also be used to predict a relative binding affinity for one ligand vs. another ligand with respect to the target molecule and thereby rank and prioritize the ligands or assign a probability for binding.
Docking may involve a search or function optimization algorithm, whether deterministic or stochastic in nature, with the intent to find one or more system poses that have favorable affinity. Scoring may involve more refined estimation of an affinity function, where the affinity is represented in terms of a combination of one or more empirical, molecular-mechanics-based, quantum-mechanics-based, or knowledge-based expressions, i.e., a scoring function. Individual scoring functions may themselves be combined to form a more robust consensus-scoring scheme using a variety of formulations. In practice there are many different docking strategies and scoring schemes employed in the context of today's computational drug design.
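As an illustration of how individual scoring functions might be combined into a consensus scheme, the following minimal sketch averages each ligand's rank across several scoring functions. The rank-averaging formulation, the ligand names, and the score values are all assumptions chosen for illustration; they are not taken from any particular docking package, and lower scores are assumed to indicate more favorable binding.

```python
from statistics import mean


def rank_positions(scores):
    """Map each ligand to its rank (0 = best) under one scoring function.

    Assumption: a lower (more negative) score means more favorable binding.
    """
    ordered = sorted(scores, key=scores.get)
    return {ligand: position for position, ligand in enumerate(ordered)}


def consensus_rank(score_tables):
    """Order ligands by their mean rank across all scoring functions."""
    ranks = [rank_positions(table) for table in score_tables]
    ligands = list(ranks[0])
    mean_rank = {lig: mean(r[lig] for r in ranks) for lig in ligands}
    # Ties are broken alphabetically so the ordering is deterministic.
    return sorted(ligands, key=lambda lig: (mean_rank[lig], lig))


# Made-up scores (arbitrary units) for three hypothetical ligands.
empirical = {"lig_a": -7.2, "lig_b": -9.1, "lig_c": -5.0}
force_field = {"lig_a": -40.0, "lig_b": -35.5, "lig_c": -20.3}
knowledge_based = {"lig_a": -1.1, "lig_b": -1.9, "lig_c": -0.4}

ranking = consensus_rank([empirical, force_field, knowledge_based])
print(ranking)
```

Averaging ranks rather than raw scores sidesteps the fact that different scoring functions report values on different scales, which is one reason consensus schemes are often formulated this way.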
Whatever the choice of computational method, there are inherent trade-offs between the computational complexity of both the underlying molecular models and the intrinsic numerical algorithms, and the amount of computing resources (time, number of CPUs, number of simulations) that must be allocated to process each molecular combination. For example, while highly sophisticated molecular dynamics (MD) simulations of the two molecules surrounded by explicit water molecules and evolved over trillions of time steps may lead to higher accuracy in modeling the potential molecular combination, the resultant computational cost (i.e., time and computing power) is so enormous that such simulations are intractable for use with more than just a few molecular combinations. On the other hand, the use of more primitive models for representing molecular interactions, in conjunction with multiple, and often error-prone, modeling shortcuts and approximations, may result in more acceptable computational cost but will invariably cause significant performance degradation in terms of modeling accuracy and predictive power. Currently, even the process of checking a library of drug candidates against a target protein takes too long at the required accuracy on available computational systems.
Trade-offs between accuracy and speed also exist for other computational steps in rational drug design. For example, large virtual libraries need to be clustered both accurately and rapidly into groups of similar molecules for fast virtual screening. In another example, lead refinement requires searching a molecule library accurately and rapidly for molecules similar to ones judged to have docked well in the lead generation stage. Current techniques for library screening and searching are so inaccurate and inefficient that they are not viable as part of a rational drug discovery solution.
This invention is generally concerned with providing a method to generate molecular representations in a manner that enables efficient molecular processing in a variety of scenarios. Nearly all computational processes involved in rational drug design and discovery (library construction, molecular matching, library search, docking, and scoring) can benefit from a method to process molecular representations efficiently. Here, processing molecular representations may mean transforming the structure of the molecules or parts of molecules by rotating bonds, lengthening or contracting bonds, rotating groups of atoms, etc. It may also involve calculating affinity functions between molecules or parts of molecules. Because of the wide variety of potential inputs (tens of millions of molecules of different sizes and structures) and the many different types of molecular processing, demands on a computational system's resources can vary widely. For example, it typically takes fewer computational resources to calculate the binding affinity for a smaller molecule than for a larger molecule against the same target. In another example, it is generally computationally cheaper to calculate spatial transformations for a smaller molecule than for a larger molecule.
It is generally understood by those skilled in the art that variable computational cost tasks tend to be inefficient, whether in software executing on a general purpose microprocessor or in specially designed hardware. When such a task is implemented as software, the unpredictability of its computational cost can result in poor code locality and poor data locality, can result in unpredictable memory accesses (for example, when page faults occur), and limits how much the software can be optimized, which can severely constrain the software's applications. When a variable computational cost task is implemented in specially designed hardware, it greatly increases the complexity of the hardware design, leading to a longer and costlier design process, and the final design tends to be much less efficient than for constant cost tasks. Therefore it is advantageous that a variable cost task be implemented as a collection of one or more constant cost tasks.
FIG. 1 shows an example of a general processing system 100, which consists of a series of processing engines 101, 102, 103, 104 such that the output of each processing engine is the input of the following processing engine. The input 110 for the first engine 101 is from an input block, which may be a database server in one embodiment, a file server in another embodiment, and storage on a system board in yet another embodiment. The output of the final engine 104 goes to an output block 120, which may be a database server in one embodiment, storage on the processor in another embodiment, and storage on the system board in yet another embodiment. Such a series of engines 100 is also known as a pipeline.
The amount of time taken by a pipeline stage to produce output from its input is defined as a pipeline stage interval (or, stage interval). Input to the pipeline stage is read at the start of the stage interval; input data is guaranteed to be available for reading once the stage interval starts, not before. Output from the pipeline stage is guaranteed to be available only after the end of the stage interval, not before.
It should be evident that processing engines 101, 102, 103, and 104 are never idle if the stage interval for each processing engine is of exactly the same duration, i.e., if each stage is performing a constant cost task. The next input is available for processing as soon as a particular processing engine has produced output from an input; no time is spent idle by the engine waiting for the next input. If one or more of the engines take longer than the other engines in the pipeline to produce their output, the faster engines spend some time sitting idle, and their utilization falls below 100%. Processing engine utilization can be improved by reducing the time taken by slower engines to match the time taken by faster engines. In one example, the stage interval for each of engines 101, 103, and 104 is 10 cycles, and the stage interval for engine 102 is 20 cycles. Here a cycle means the fundamental period of time recognized by a computer, generally determined by the system clock rate. In the current example, engines 101, 103, and 104 will be idle for 10 out of every 20 cycles, resulting in only 50% utilization of three out of four engines in the pipeline. Decreasing the stage interval for 102 to 15 cycles improves utilization of 101, 103, and 104 to 66.7%. Decreasing the stage interval for 102 to 10 cycles improves utilization of 101, 103, and 104 to 100%. Further decreasing the stage interval for 102 to 5 cycles keeps utilization of 101, 103, and 104 at 100% but decreases utilization of 102 to 50%. Thus utilization of engines in the pipeline can be improved by designing the engines and their input data such that, as far as possible, each stage interval is of the same duration. Maximal engine utilization is achieved when the stage interval for all engines is of the same duration.
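The utilization figures in the preceding example can be reproduced with a short calculation. The sketch below assumes a simple steady-state model in which the pipeline advances at the pace of its slowest stage; the function name `utilization` is chosen here for illustration only.

```python
def utilization(stage_intervals):
    """Steady-state busy fraction of each engine in a linear pipeline.

    Assumes the pipeline advances at the pace of its slowest stage, so an
    engine with stage interval t is busy for t out of every max(t) cycles.
    """
    bottleneck = max(stage_intervals.values())
    return {engine: interval / bottleneck
            for engine, interval in stage_intervals.items()}


# Stage intervals (in cycles) from the example in the text.
for interval_102 in (20, 15, 10, 5):
    result = utilization({101: 10, 102: interval_102, 103: 10, 104: 10})
    print(interval_102, {engine: f"{u:.1%}" for engine, u in result.items()})
```

Under this model, every engine is fully utilized only when all stage intervals are equal, which is the condition the text identifies as maximal engine utilization.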
In one embodiment, partitioning input data into smaller sets can decrease the stage interval for an engine in the pipeline. Greater efficiency can also be obtained by partitioning the input such that the engine takes approximately the same time for each partition. In an embodiment of the pipeline, it may be desirable to make the pipeline maximally efficient by making the engine take exactly, not approximately, the same amount of time for each partition. Another method of decreasing the duration of a stage interval is to devote more computational units to the pipeline stage for doing the same amount of computational work.
A pipeline can also be made more efficient by increasing the duration of the stage interval for a stage that is faster than other stages in the pipeline. One method of increasing the stage interval duration is to devote fewer computational units to the stage for doing the same amount of computational work. Another method of increasing the duration of a stage interval is to let the engine idle for some time.
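The two adjustments discussed above, partitioning the input of a slow stage and letting a fast stage idle, can be sketched as a single calculation that fits a stage to a common target interval. This is a minimal sketch; the function name `balance_stage` is hypothetical, and it assumes that a stage's work can be divided evenly (to the nearest cycle) across any number of partitions.

```python
import math


def balance_stage(work_cycles, target_interval):
    """Return (partitions, idle_cycles_per_partition) for one stage.

    Assumptions: work_cycles and target_interval are positive integers, and
    the stage's work can be split evenly, to the nearest cycle, across any
    number of partitions.
    """
    # Split the work into just enough partitions to fit the target interval.
    partitions = math.ceil(work_cycles / target_interval)
    cycles_per_partition = math.ceil(work_cycles / partitions)
    # Any remaining cycles in the interval are spent idle.
    idle_cycles = target_interval - cycles_per_partition
    return partitions, idle_cycles


# A 20-cycle stage is split in two; a 5-cycle stage idles for 5 cycles.
print(balance_stage(20, 10))
print(balance_stage(5, 10))
```

In practice the even-split assumption rarely holds exactly, which is why the choice of how to partition the input data matters.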
Recall that a wide variety of potential inputs (for example, tens of millions of molecules of different sizes and structures) can make widely varying demands on the computational system. Demands on the system may include widely varying amounts of storage and of transmission bandwidth for input data. For example, if the system processes molecules in their entirety, then a larger molecule will need more storage on the processor than a smaller molecule. Therefore, in order to be able to process the widest variety of molecules, the processor must be able to store data associated with the largest molecule, even if many of the input molecules are much smaller than the largest molecule. Designing storage to hold the largest molecule is therefore inefficient and wasteful.
Storage and transmission bandwidth requirements can be reduced by partitioning input molecular data into smaller parts, such that each part can be processed in a pipelined manner. In such a case we need to transmit and store only those parts of molecular data that are being processed by the pipeline at any given time, thus obviating the need to transmit and store the entire molecule. Additionally, the size of molecule that the engine can process is no longer determined by the size of storage on the processor or the system board. The processing engine is able to process molecules of any size—small or large—as long as they are partitioned into smaller parts.
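The storage argument can be made concrete with a small comparison. The sketch below assumes, purely for illustration, that storage is proportional to atom count and that every molecule can be cut into partitions of at most a fixed size; the atom counts and the partition size are made-up values, and `buffer_plan` is a hypothetical helper.

```python
import math


def buffer_plan(atom_counts, partition_size):
    """Compare on-processor storage for whole-molecule vs. partitioned input.

    Assumptions: storage scales with atom count, and each molecule can be
    split into partitions holding at most partition_size atoms.
    """
    return {
        # Whole-molecule processing must fit the largest molecule.
        "whole_molecule_slots": max(atom_counts),
        # Partitioned processing only needs to fit one partition at a time.
        "partitioned_slots": partition_size,
        "partitions_per_molecule": [
            math.ceil(n / partition_size) for n in atom_counts
        ],
    }


# Three hypothetical molecules with 12, 35, and 480 atoms.
plan = buffer_plan([12, 35, 480], partition_size=16)
print(plan)
```

Here the largest molecule dictates the buffer size only when molecules are processed whole; with partitioning, a larger molecule simply contributes more partitions to the pipeline.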
We have discussed how pipelined processing can be enabled by partitioning input data into smaller parts. The pipeline implementation itself imposes limits on the size of a partition. It will be understood by those skilled in the art that if the partition size is very small, then a greater number of pipeline stages is needed to perform the desired computations. The stage interval for each pipeline stage will be very short because each stage needs to process only a very small amount of data. But the increased number of pipeline stages implies more complexity in the design of the pipeline. Increased complexity in the pipeline can arise for various reasons, for example, an increased amount of routing between pipeline stages, a possibly increased amount of storage between pipeline stages, etc. Increased complexity generally results in a costlier and longer design cycle, and ultimately a more expensive product.
The invention described in this patent seeks to increase the computational efficiency of molecular processing by providing a method to partition the input, i.e., representation of a molecule, such that each partition makes approximately the same computational demands on the system. In one example, computational demand can be measured by the amount of storage on or off the processors. In another example, computational demand can be measured by the amount of bandwidth needed to transfer data to and from one or more processors. In yet another example, computational demand can be measured by the number of computational units, which in turn is measured by the number of gates, routing requirements, size of compute blocks on the processors, etc.
Current computational methods for ligand-target docking use digital representations of molecules that are designed for their particular docking method. For example, FlexX computes the binding mode of a potential drug molecule by incrementally docking fragments of the molecule. FlexX constructs its fragments by breaking all bonds in the molecule that are deemed to be flexible, thus constructing fragments that are themselves rigid. Another computational docking method, similar to FlexX, that makes use of molecular fragments is the place-and-join method [22]. Molecular fragments used in the place-and-join method are constructed by breaking the molecule at an atom that has two adjacent flexible bonds. The fragments are then ‘placed’ incrementally and ‘joined’ at the break points in an attempt to reconstruct the molecule's binding mode. Incremental docking methods create fragments that are not guaranteed to make approximately the same demand on computational resources; they are therefore unsuitable for a docking implementation that relies on pipelined processing.
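To illustrate why breaking a molecule at its flexible bonds does not yield evenly sized fragments, the following sketch applies that rule to a toy molecular graph. It is not the FlexX implementation; it only mirrors the fragmentation rule as described above, and the toy molecule, the bond flags, and the function name `rigid_fragments` are assumptions made for illustration.

```python
def rigid_fragments(num_atoms, bonds):
    """Split a molecule into rigid fragments by dropping flexible bonds.

    bonds is a list of (atom_i, atom_j, is_flexible) tuples. The fragments
    are the connected components that remain once every flexible bond is
    removed.
    """
    adjacency = {atom: set() for atom in range(num_atoms)}
    for i, j, is_flexible in bonds:
        if not is_flexible:
            adjacency[i].add(j)
            adjacency[j].add(i)

    seen, fragments = set(), []
    for start in range(num_atoms):
        if start in seen:
            continue
        # Depth-first search over the remaining (rigid) bonds.
        stack, fragment = [start], []
        seen.add(start)
        while stack:
            atom = stack.pop()
            fragment.append(atom)
            for neighbor in adjacency[atom]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        fragments.append(sorted(fragment))
    return fragments


# Toy molecule: a rigid six-membered ring (atoms 0-5) with a two-atom
# chain (atoms 6 and 7) attached through flexible bonds.
ring = [(0, 1, False), (1, 2, False), (2, 3, False),
        (3, 4, False), (4, 5, False), (5, 0, False)]
chain = [(5, 6, True), (6, 7, True)]
fragments = rigid_fragments(8, ring + chain)
print([len(fragment) for fragment in fragments])
```

In this toy case one fragment holds six atoms while the others hold one atom each, so a pipeline stage sized for one fragment would be badly matched to the others.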
There also exist some molecular representation schemes that are inspired by physical or chemical properties of molecules rather than the need to speed up certain kinds of computations. RECAP partitions molecules based on a set of chemical rules [58]. RECAP rules are intended to create fragments that can be synthesized chemically. The rules do not depend on the rigidity or flexibility of resulting fragments. RECAP rules are also not intended to facilitate more efficient molecular processing computations, but for providing a guide for combinatorial drug design and synthesis.
This invention enables partitioning of molecules into smaller parts such that the parts can be stored, transmitted, and otherwise processed in specially designed hardware with greater efficiency than the entire molecule. The partitioned representation is constructed by taking into account the structure of the molecule, the processing to be performed on the molecule, and the design of the pipeline. In a preferred embodiment, a graph representation of the molecule is first constructed. The graph representation is then partitioned using an invariant link removal operator such that it produces subgraphs that satisfy certain partitioning criteria. If one or more subgraphs need further partitioning, a node-cleaving operator is applied such that it produces further subgraphs that also satisfy a set of partitioning criteria. Finally, if any subgraphs still need further partitioning, all types of links, not just invariant links, can be removed, and nodes can be cleaved until the resulting subgraphs satisfy a final set of criteria. Graph partitioning results in smaller partitions that are far more efficient to store, transmit, and process than entire molecules. The increase in efficiency makes it possible to design and run applications that require complex molecular processing, such as rational drug discovery, virtual library design, etc.
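The staged control flow of the preferred embodiment can be sketched as follows. This is only an illustrative sketch under several assumptions that are not specified in this section: the partitioning criterion is reduced to a maximum node count (`max_nodes`), which links are invariant is supplied as an input flag, node cleaving is interpreted as giving each resulting piece its own copy of the cleaved node, each operator is applied in a single pass, and the final stage is simplified to removing every remaining link. All function names are hypothetical.

```python
def components(nodes, links):
    """Split (nodes, links) into connected subgraphs."""
    adjacency = {n: set() for n in nodes}
    for u, v, _ in links:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, parts = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, comp = [start], {start}
        while stack:
            for neighbor in adjacency[stack.pop()]:
                if neighbor not in comp:
                    comp.add(neighbor)
                    stack.append(neighbor)
        seen |= comp
        parts.append((comp, {l for l in links if l[0] in comp and l[1] in comp}))
    return parts


def remove_invariant_links(nodes, links):
    """Stage 1: drop links flagged as invariant (third tuple element True)."""
    return components(nodes, {l for l in links if not l[2]})


def cleave_node(nodes, links):
    """Stage 2: cleave the first node whose removal disconnects the subgraph.

    Assumed interpretation: each resulting piece keeps a copy of the node.
    """
    for v in sorted(nodes):
        rest = components(nodes - {v}, {l for l in links if v not in l[:2]})
        if len(rest) > 1:
            pieces = [comp | {v} for comp, _ in rest]
            return [(p, {l for l in links if l[0] in p and l[1] in p})
                    for p in pieces]
    return [(nodes, links)]


def remove_all_links(nodes, links):
    """Stage 3 (simplified): remove every link, leaving single nodes."""
    return components(nodes, set())


def partition(nodes, links, max_nodes):
    """Apply the three operators in turn to subgraphs that are too large."""
    pending = [(set(nodes), set(links))]
    for operator in (remove_invariant_links, cleave_node, remove_all_links):
        next_round = []
        for sub in pending:
            too_large = len(sub[0]) > max_nodes
            next_round.extend(operator(*sub) if too_large else [sub])
        pending = next_round
    return [sorted(sub_nodes) for sub_nodes, _ in pending]


# Toy chain of seven nodes; the link between nodes 2 and 3 is flagged invariant.
toy_links = {(0, 1, False), (1, 2, False), (2, 3, True),
             (3, 4, False), (4, 5, False), (5, 6, False)}
parts = partition(range(7), toy_links, max_nodes=3)
print(parts)
```

In this toy run the first stage separates the chain at the flagged link, and the second stage cleaves node 4 so that it appears in two of the resulting partitions; the third stage is never needed.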