1. Field of the Invention
Embodiments of the present invention generally relate to machine learning techniques and, more particularly, to a method, article of manufacture and apparatus for modeling molecular properties using ranked data and ranking algorithms.
2. Description of the Related Art
Many industries use machine learning techniques to construct predictive models of relevant phenomena. For example, machine learning applications have been developed to detect fraudulent credit card transactions, predict creditworthiness, or recognize words spoken by an individual. Machine learning techniques have also been applied to create predictive models of chemical and biological systems. Generally, machine learning techniques are used to construct a software application that improves its ability to perform a task as it analyzes more data related to the task. Often, the task is to predict an unknown attribute or quantity from known information (e.g., credit risk predictions based on prior lending history and payment performance), or to classify an object as belonging to a particular group (e.g., speech recognition software that classifies speech into individual words). Typically, a machine learning application improves its performance using a set of training examples. Each training example may include an example of an object, along with a value for the otherwise unknown classification of the object. By processing a set of training examples that include both an object and a classification for the object, the model “learns” what attributes or characteristics of the object are associated with a particular classification. This “learning” may then be used to predict the attribute or to predict a classification for other objects. For example, speech recognition software may be trained by having a user recite a pre-selected paragraph of text. By examining the attributes of the recited text, the software learns to recognize the words spoken by the individual speaker.
In the fields of bioinformatics and computational chemistry, machine learning applications have been used to develop models of various molecular properties. Oftentimes, such models are built in an attempt to predict whether a particular molecule will exhibit the property being modeled. For example, models may be developed to predict biological properties such as pharmacokinetic or pharmacodynamic properties, physiological or pharmacological activity, toxicity or selectivity. Other examples include models that predict chemical properties such as reactivity, binding affinity, or properties of specific atoms or bonds in a molecule, e.g., bond stability. Similarly, models may be developed that predict physical properties such as the melting point or solubility of a substance. Further, molecular models may also be developed that predict properties useful in physics-based simulations, such as force-field parameters or the free energy states of different possible conformations of a molecule.
The training examples used to train a molecular properties model each typically include a description for a molecule (e.g., the atoms in a particular molecule along with the bonds between them) and data regarding the property of interest for the molecule. Collectively, the training examples are commonly referred to as a “training set” or as “training data.” Data regarding the property of interest typically takes one of two forms: (i) a value from a continuous range (e.g., the solubility of a molecule at a given temperature), or (ii) a label asserting the presence or absence of the property of interest for the molecule included in the training example. In either case, the training examples measure the property of interest relative only to the molecule included in a particular training example.
Using training data in either form has often proved ineffective, however, in training molecular properties models with a useful degree of predictive power. This may occur due to problems with the quality of the training data. First, consider a scenario where the data is a numerical value representing a measurement of the property of interest over a continuous range. The measurement values available for a particular molecule frequently differ depending on the data source. For example, measurements obtained from one lab or using one experimental protocol may consistently assign higher values for a property of interest to a particular molecule than measurements obtained from other labs or using other protocols. These differences often lead to inconsistent values for the property of interest being reported for the same molecule. Additionally, even measurements obtained under “identical” experimental conditions may have enough experimental uncertainty or noise that it becomes unreasonable to assign a precise numerical value to the property of interest. One reasonable observation under these circumstances may be that if the difference in, or relative magnitude of, measurements reported for two different molecules is large enough, then one molecule may be said to have “more” of the property than the other.
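The observation that a sufficiently large difference between two measurements supports an ordering claim can be illustrated with a minimal sketch. The function name, the molecule identifiers, the numerical values, and the `margin` threshold are all hypothetical and chosen only for illustration; they are not taken from any particular data set or from the techniques described herein.

```python
from itertools import combinations


def ordered_pairs(measurements, margin):
    """Return (higher, lower) pairs whose values differ by more than margin.

    measurements: dict mapping a molecule identifier to a measured value.
    margin: assumed noise threshold; pairs closer than this stay unordered.
    """
    pairs = []
    for (mol_a, val_a), (mol_b, val_b) in combinations(measurements.items(), 2):
        if val_a - val_b > margin:
            pairs.append((mol_a, mol_b))
        elif val_b - val_a > margin:
            pairs.append((mol_b, mol_a))
        # Otherwise the difference is within the noise and no claim is made.
    return pairs


# Illustrative values only, not experimental data.
example = {"mol_A": 9.1, "mol_B": 8.9, "mol_C": 4.0}
print(ordered_pairs(example, margin=1.0))
```

In this sketch, mol_A and mol_B are too close to be ordered, while both may reasonably be said to have “more” of the property than mol_C.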
Measurements for a set of molecules may be either relative or absolute, and the two may not be equally reliable for ordering the molecules. This is commonly encountered in molecular modeling calculations, where ranking molecules based on calculated absolute binding energies can be less accurate than ranking them based on relative calculated binding energies.
Training examples that use a label asserting the presence or absence of the property of interest have also proven to be of limited value in training a molecular properties model. Oftentimes, such data has a large bias in that the data is predominantly of one label (e.g., nearly all of the molecules are “inactive” for the property of interest). In this case, it is easy to obtain a model with high accuracy; the model simply predicts the predominant label (e.g., always predicts that a molecule will not have the property of interest). This model, however, is not particularly useful, as it makes the same prediction for every molecule.
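The effect of label bias on accuracy can be shown with a short sketch. The 98-to-2 split between “inactive” and “active” labels is an assumed, illustrative ratio, and the function name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Return the predominant label and the accuracy of always predicting it."""
    counts = Counter(labels)
    majority_label, majority_count = counts.most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Assumed, illustrative label distribution: 98 inactive molecules, 2 active.
labels = ["inactive"] * 98 + ["active"] * 2
label, accuracy = majority_baseline_accuracy(labels)
print(label, accuracy)

# The constant model never predicts "active", so it identifies no actives.
actives_found = sum(1 for true_label in labels if true_label == "active" and label == "active")
print(actives_found)
```

Under these assumptions the constant model is 98% accurate yet identifies none of the active molecules, which is the sense in which high accuracy is not useful here.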
Generally, models built from data will not predict the property of interest with perfect accuracy for all molecules, and there will be some errors. For binary valued data (i.e., training examples that use a label asserting the presence or absence of a property), these errors consist of false positives (i.e., molecules falsely predicted to have the property of interest) or false negatives (i.e., molecules falsely predicted to not have the property of interest). These types of errors have different costs (e.g., in a diamond mine it is far more expensive to falsely predict that a diamond is dirt than it is to predict that dirt is a diamond). In biological and pharmaceutical applications, however, it can be very difficult to assign relative costs to false positives and false negatives, and so it becomes very difficult to trade them off.
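Why the trade-off matters can be seen in a minimal sketch in which the preferred model changes with the assumed costs. The two sets of predictions, the cost values, and the function name are hypothetical and serve only to illustrate the point.

```python
def total_cost(y_true, y_pred, fp_cost, fn_cost):
    """Sum the cost of false positives and false negatives for binary labels."""
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return false_pos * fp_cost + false_neg * fn_cost


# Illustrative labels: True means the molecule has the property of interest.
y_true = [True, True, True, False, False, False, False, False]
model_a = [True, True, False, True, True, True, False, False]    # 3 FP, 1 FN
model_b = [True, False, False, True, False, False, False, False]  # 1 FP, 2 FN

# With equal costs, model_b has the lower total cost.
print(total_cost(y_true, model_a, 1, 1), total_cost(y_true, model_b, 1, 1))

# If a false negative is assumed to cost five times more, model_a is preferred.
print(total_cost(y_true, model_a, 1, 5), total_cost(y_true, model_b, 1, 5))
```

Because the ranking of the two models reverses when the assumed cost ratio changes, a model cannot be selected on this basis unless the relative costs are known.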
As these examples illustrate, it is often easier (and more accurate) to consider the ordering of two molecules relative to a certain property than it is to assert an absolute value for the property for a single molecule. Existing molecular property modeling techniques, however, are not capable of using such ordering information, nor are they capable of dealing with bias in the data or of constructing reasonable models without knowing the optimal trade-off between false positives and false negatives. Accordingly, there is a need for improved methods and apparatus for modeling molecular properties.