1. Field of the Invention
Embodiments of the present invention are generally related to machine learning. More specifically, embodiments of the present invention are related to machine learning techniques used to predict the biological effects of a molecule.
2. Description of the Related Art
Molecules are continually being introduced into the marketplace or environment (e.g., industrial detergents, industrial discharge, pharmaceuticals, and cosmetics). Sometimes, such molecules may have unknown or undesirable biological effects (e.g., they may have some level of toxicity to humans, flora, or fauna). It is of great benefit for organizations introducing such molecules, and for society in general, to anticipate such effects as early as possible. In this way it may be possible to take remedial action (e.g., not introducing the molecule, re-designing the molecule to remove the effect, or limiting the introduction of the molecule). It is also possible to identify molecules that have desirable biological effects, and the pharmaceutical industry spends billions of dollars each year to test and identify potentially useful molecules.
The high-level effects of a molecule, both desirable (e.g., anti-inflammatory activity) and undesirable (e.g., toxicity), are overwhelmingly related to some lower-level bio-chemical pathway. More specifically, high-level effects often result from the interaction of a molecule with a binding site on a protein present in some bio-chemical pathway. A high-level effect of a molecule may also result from the interaction of the molecule with multiple proteins in multiple pathways. In many cases, the particular protein(s) and pathway(s) may not be fully known or understood, even though the correlation between a high-level effect and the molecule may be well documented. For example, gabapentin (NEURONTIN, from Parke-Davis/Pfizer) is used to treat epilepsy and neuropathic pain; however, the protein targets underlying the actions of this compound are unknown.
Currently, two general approaches are used for identifying the high-level effects of a molecule. The first is to perform laboratory experiments using the molecule. The effects of the molecule may also be analyzed in various clinical trials, including trials with human subjects. For example, the United States Food and Drug Administration requires that a variety of clinical studies be performed before a molecule may be distributed for medical purposes. However, one drawback to this approach is that physical laboratory experiments and clinical trials are typically both costly and time-consuming, making them prohibitive to perform for more than a limited number of candidate molecules. Accordingly, this approach is often used only after a candidate molecule has been identified as potentially beneficial.
A second approach is to perform in silico simulations configured to generate predictions about the properties of a molecule. The term “in silico” is used to reference simulations performed using computer software applications that model the real-world behavior of the molecule. The simulation may be based on the physical characteristics of the molecule (e.g., structure, molecular weight, electron density, etc.) and the characteristics of the simulated environment (e.g., the shape, position, and characteristics of a particular protein receptor). Thus, an in silico simulation may be used to simulate the interaction between a molecule and a single protein target. The output of the simulation may include a prediction regarding a biological effect or property of the molecule, e.g., the binding affinity of the molecule against the protein target. Models have been developed that can predict these kinds of low-level properties with a reasonable degree of accuracy. However, the accuracy of in silico simulations used to predict high-level effects has typically been very poor. Thus, even though some protein/molecule interaction may be known to be related to an observed high-level effect, no one has yet been able to bridge the gap between using an in silico simulation to predict a low-level activity of a molecule and using an in silico simulation to predict whether a molecule is likely to have a given high-level effect when introduced into a biological system (e.g., a human individual).
The state of the art in in silico prediction for low-level effects is to construct models based on a topological representation of a molecule, or based on simple three-dimensional models of a molecule. For example, current in silico simulations typically rely on data that may include the position, orientation, or electrostatic properties of the molecule in 3D space. This approach, however, has typically resulted in inaccurate predictions regarding high-level biological effects. A number of reasons may account for this. For example, the representation of the molecule may be too high-dimensional for the high-level effect being modeled; too few data points may be available to model a high-level effect; or the representation may fail to capture the relevant information, e.g., the “cause” of the biological effect may not be a property (or function) of the orientation or electrostatic properties of the molecule. These and other shortcomings may all contribute to the poor results obtained from current in silico simulations.
Accordingly, there remains a need for improved techniques for predicting the biological effects of molecules in general, and for modeling biological effects that may result from the interaction between a test molecule and a biological system.