1. Field of the Invention
This invention relates to methods and systems of predicting properties of molecules.
2. Description of the Related Art
A physical item's unknown conditions can often be predicted based on the item's known conditions. Disease diagnosis is one simple example. If a patient has symptom A, and has symptom B, but does not have symptom C, then it may be predicted that the patient has some particular disease. In this example, the physical item's (the patient's) three known conditions (has symptom A, has symptom B, does not have symptom C) are used to predict an unknown condition (that the patient has some particular disease). The conditions that are known or easy to measure or calculate are often called descriptors or X variables. The conditions that are unknown or difficult to measure or calculate, and that are believed to be related to the descriptors, are often called properties or Y variables.
Decision trees are one of the most common methods of forming predictions about a property of an item, based on descriptors of the item. The structure of a decision tree can be derived by studying existing items. Each of the existing items has known descriptor values and a known property value. The existing items that are used to formulate the decision tree are called training items. The items that have an unknown property and are not used in formulating the decision tree are called new items. The known descriptor values of the training items, in conjunction with the known property values of the training items, are used to develop a connected series of decision points in the form of the decision tree. The decision tree can then be used to predict the unknown property of a new item. For example, based on the descriptors (e.g., age and blood pressure) and the property (whether the patient suffered a heart attack) of the training items (medical history data of prior patients), a decision tree can be formulated and then used to predict whether a new patient with given descriptors is likely to suffer a heart attack.
Decision trees classify training items by repeatedly splitting the training items into subsets at nodes of the tree. Each split is based on a logic test on one descriptor (e.g., whether the patient is more than fifty years old, whether blood pressure is greater than 100). Each terminal node (i.e., leaf) of the tree corresponds to a classification of the property. The property of a new item is then predicted by running the new item from the root of the tree through the logic tests of the decision tree, based on the descriptors of the new item, until a leaf is reached. The property corresponding to the leaf is the predicted property of the new item. In addition to predicting the property of new items, the decision tree can also aid a user in interpreting relationships between descriptors and the property. For a more detailed description of decision tree methods, please refer to pp. 18–36 of the text of “Classification and Regression Trees”, Breiman, Friedman, Olshen and Stone, Chapman & Hall/CRC 1984 (CART hereinafter). For a description of some of the advantages of a tree-structured approach, please refer to pp. 55–58, CART. The disclosure of the above described book “Classification and Regression Trees” is hereby incorporated by reference in its entirety.
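The root-to-leaf prediction described above can be sketched as follows. This is a minimal illustration only: the Node class, the predict function, the "descriptor <= threshold" form of the logic test, and the two-test example tree with its thresholds and risk labels are all hypothetical and are not taken from any figure.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes carry a logic test of the form "descriptor <= threshold";
    # leaf nodes carry only a predicted property value.
    descriptor: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # followed when the test is true
    right: Optional["Node"] = None   # followed when the test is false
    prediction: Optional[str] = None


def predict(node: Node, item: dict) -> str:
    """Run an item from the root through the logic tests until a leaf is reached."""
    while node.prediction is None:
        if item[node.descriptor] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.prediction


# Hypothetical tree with one test on age and one on blood pressure.
tree = Node("age", 50,
            left=Node(prediction="low risk"),
            right=Node("blood_pressure", 100,
                       left=Node(prediction="low risk"),
                       right=Node(prediction="high risk")))

print(predict(tree, {"age": 62, "blood_pressure": 130}))
```

Each pass through the loop applies exactly one logic test, so the number of tests applied to a new item equals the depth of the leaf it reaches.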
Decision trees may include both classification trees and regression trees. A classification tree's terminal nodes each represent a class of properties. A regression tree's terminal nodes each represent a value of a property. As those ordinarily skilled in the art will appreciate, the disclosed methods and systems can be applied to both classification trees and regression trees. Therefore, terms such as “class”, “classify” and “classification” can be used in the present application to refer to assigning a class to a classification tree terminal node as well as assigning a value to a regression tree terminal node. The term “decision tree” as used in the application refers to both classification trees and regression trees.
FIG. 1 illustrates one example of a decision tree. The decision tree shows that for many compounds, poor absorption or permeation (the property value of NDL) is more likely when:
The total molecular weight (MWT) is over 500; or
The computed octanol/water partition coefficient (ALogP) is over 5; or
There are more than 5 H-bond donors in the molecule (HBD); or
There are more than 10 H-bond acceptors in the molecule (HBA).
This classification rule is also called the Lipinski rule, named after Lipinski, Lombardo, Dominy and Feeney. As shown in FIG. 1, the decision tree consists of a hierarchy of nodes. The top node 101 is called the root node. Nodes that flow downward from a node are called the descendants of that node. For example, nodes 102 and 103 are descendants of node 101, and nodes 104 and 105 are descendants of node 102. Each node, except the nodes with no descendants, contains a logic test (also called a split) on one of the descriptors. For example, at node 101, if descriptor MWT has a value of no greater than 500, then node 101 proceeds to its left descendant node 102. Otherwise node 101 proceeds to its right descendant node 103. A logic test is represented by a circle in FIG. 1. The nodes with no descendants (103, 105, 107, 108 and 109) are called leaf nodes or terminal nodes. The values associated with leaf nodes are the predicted values of the property (DL or NDL).
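The four tests listed above can be written as a chain of logic tests, one per non-leaf node. This is a sketch under stated assumptions. The text confirms only that node 101 tests molecular weight against 500 and leads to leaf 103 on the right. The order of the remaining tests, the assignment of those tests to nodes 102, 104 and 106, and the DL or NDL value of each leaf are assumed here for illustration. The function name predict_drug_likeness and the example descriptor values are hypothetical.

```python
def predict_drug_likeness(mwt: float, alogp: float, hbd: int, hba: int) -> str:
    """Return "NDL" if any of the four tests fires, otherwise "DL"."""
    if mwt > 500:        # node 101; right descendant is leaf 103
        return "NDL"
    if alogp > 5:        # assumed to be the test at node 102 (leaf 105)
        return "NDL"
    if hbd > 5:          # assumed to be the test at node 104 (leaf 107)
        return "NDL"
    if hba > 10:         # assumed to be the test at node 106
        return "NDL"
    return "DL"          # no test fired


# Hypothetical descriptor values, not measurements of any real compound.
print(predict_drug_likeness(mwt=350.0, alogp=2.1, hbd=2, hba=5))
print(predict_drug_likeness(mwt=620.0, alogp=2.1, hbd=2, hba=5))
```

Because the four conditions are joined by "or", reordering the tests would change the shape of the tree but not the predicted value for any molecule.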
In one advantageous application, decision trees are developed and used to predict the likely biochemical and physiological properties of drug candidate compounds. In many cases, there are hundreds of molecules with known descriptors and a known property. In addition, the relationships between the descriptors and the property are typically not known, may be interrelated, may be highly non-linear, and may even differ for different members of the set of known compounds. It is often time-consuming and expensive to test the properties of a large number of drug candidate molecules. Therefore it is often desirable to predict the properties of the new molecules, using one or more decision trees formed by classifying a training set of molecules with known properties. Those new molecules with promising prediction results are then tested to experimentally determine their properties. For these situations, tree generation procedures have been developed which may be computer implemented, which take as inputs the descriptors and properties of the known compounds, and which produce as an output a predictive decision tree that can in some sense be considered the “best” tree in terms of generality and predictive value.
FIG. 2 illustrates a typical decision tree creation process. A start state block 202 proceeds to block 204. At block 204, the tree creation process chooses the root node of the tree as the node to split. At block 206, potential logic tests at the node are rated according to how much each logic test can reduce the “impurity” of the two groups of items which exit the node following a logic test directed to a descriptor. The impurity is reduced by a maximum value when a potential logic test at the node splits training items at the node into equal numbers of each property class. The impurity is reduced by a minimum value when the potential logic test at the node does not split training items at the node into more than one property class. For a more formal definition of impurity, please refer to pp. 24–27, CART. At block 208, the logic test producing the largest drop in impurity is chosen and the tree is split at that node using the chosen logic test.
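The rating and selection of logic tests at blocks 206 and 208 can be sketched as follows. The text does not name a specific impurity function. The Gini index, one of the impurity measures discussed in CART, is assumed here. The function names, the "descriptor <= threshold" form of the test, and the four training items are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 0 for a single-class group, larger for mixed groups."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_drop(items, labels, descriptor, threshold):
    """Impurity at the node minus the size-weighted impurity of the two exit groups."""
    left = [y for x, y in zip(items, labels) if x[descriptor] <= threshold]
    right = [y for x, y in zip(items, labels) if x[descriptor] > threshold]
    n = len(labels)
    return gini(labels) - len(left) / n * gini(left) - len(right) / n * gini(right)


def best_split(items, labels):
    """Rate every candidate test and return (drop, descriptor, threshold) for the best."""
    best = None
    for descriptor in items[0]:
        # The largest value is skipped because it would leave the right group empty.
        for threshold in sorted({x[descriptor] for x in items})[:-1]:
            drop = impurity_drop(items, labels, descriptor, threshold)
            if best is None or drop > best[0]:
                best = (drop, descriptor, threshold)
    return best


# Hypothetical training items: two descriptors, one property (DL or NDL).
items = [{"MWT": 300, "HBD": 2}, {"MWT": 350, "HBD": 7},
         {"MWT": 600, "HBD": 1}, {"MWT": 650, "HBD": 3}]
labels = ["DL", "DL", "NDL", "NDL"]

print(best_split(items, labels))
```

In this hypothetical data set the test "MWT <= 350" separates the two classes completely, so it yields the largest drop in Gini impurity (0.5) and is the test selected.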
The splitting process is repeated until no split can reduce the impurity further or there is only one training item left at a node. At block 210, another node is chosen for evaluation of potential logic tests. At block 212, the chosen node is evaluated to determine whether at least one potential logic test can reduce impurity at the node, and whether the node represents more than one training item. If the node represents more than one training item, and if a potential logic test can reduce impurity at the node, then the process proceeds to block 206, so that the node will be split with the logic test that best reduces impurity. If there is only one training item left at the node, or if no potential logic test can reduce impurity at the node, then the process proceeds from block 212 to block 214.
At block 214, a determination is made as to whether all nodes have been evaluated. If all nodes have been evaluated, then the tree creation process proceeds to end state block 216. If not all nodes have been evaluated, then another node that has not been evaluated is chosen for evaluation at block 210.
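The loop of FIG. 2 can be sketched as a worklist of nodes awaiting evaluation. This is an illustration under assumptions, not the procedure of any particular implementation. Gini impurity is assumed as the impurity function, a leaf is assumed to predict its majority class, and the dictionary-based node layout, the function names and the three training items are hypothetical. The block numbers in the comments indicate the step of FIG. 2 that each line roughly corresponds to.

```python
from collections import Counter


def gini(labels):
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows):
    """rows: list of (descriptors, label). Return (drop, descriptor, threshold) or None."""
    parent = gini([y for _, y in rows])
    best = None
    for d in rows[0][0]:
        for t in sorted({x[d] for x, _ in rows})[:-1]:
            left = [y for x, y in rows if x[d] <= t]
            right = [y for x, y in rows if x[d] > t]
            drop = parent - (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            # Only tests that actually reduce impurity are candidates.
            if drop > 1e-12 and (best is None or drop > best[0]):
                best = (drop, d, t)
    return best


def grow_tree(rows):
    root = {"rows": rows}
    pending = [root]                       # block 204: begin with the root node
    while pending:                         # block 214: loop until all nodes are evaluated
        node = pending.pop()               # block 210: choose another node
        node_rows = node.pop("rows")
        # Block 212: more than one training item, and a test that reduces impurity?
        split = best_split(node_rows) if len(node_rows) > 1 else None
        if split is None:
            node["prediction"] = Counter(y for _, y in node_rows).most_common(1)[0][0]
            continue
        _, d, t = split                    # blocks 206 and 208: rate tests, keep the best
        node["descriptor"], node["threshold"] = d, t
        node["left"] = {"rows": [r for r in node_rows if r[0][d] <= t]}
        node["right"] = {"rows": [r for r in node_rows if r[0][d] > t]}
        pending += [node["left"], node["right"]]
    return root


# Hypothetical training items with one descriptor.
rows = [({"MWT": 300}, "DL"), ({"MWT": 350}, "DL"), ({"MWT": 600}, "NDL")]
tree = grow_tree(rows)
print(tree["descriptor"], tree["threshold"])
```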
The tree created using the above-described process may not be optimal, because there is usually only one training item left at each leaf node. Such a large tree may perfectly classify the training items used to construct the tree, but such trees have been found to be inaccurate when applied to new items with unknown classification. To improve the applicability of the tree model to new items, a pruning process may be applied to reduce the size of the tree. Pruning reduces the number of nodes and thus reduces the number of logic tests. Compared to the original tree, the pruned tree can typically better predict the unknown properties of new items. However, a pruned tree may classify some training items incorrectly. Pruning more nodes may lead to more training items being classified incorrectly. For a more detailed description of the advantages and side effects of pruning, please refer to pp. 59–62, CART.
FIG. 3 illustrates examples of a tree, a branch, and a sub-tree. A branch of a tree includes the root node of the branch and its descendant nodes. In FIG. 3, tree 301 is the tree with root node 301. Branch 302 is a branch of the tree 301, the branch starting at the branch root node 302. Pruning a branch from a tree is the process of deleting from the tree all descendants of the root node of the branch, i.e., cutting off all nodes of the branch except the root node of the branch. A sub-tree is the original tree minus the pruned branch. In FIG. 3, pruning away the branch 302 from the tree 301 results in the sub-tree 301–302.
A number of pruning approaches may be employed to prune a tree. One approach is called minimal cost-complexity pruning. The tree pruning process starts at a leaf node and prunes away branches until a sub-tree of the original tree is left. Since multiple sub-trees may be formed from a tree by a pruning process, minimal cost-complexity pruning selects a sub-tree that minimizes the function R_α = R_0 + α·N_leaf, where R_0 is the misclassification cost on the training data set, N_leaf is the number of leaf nodes, and α is a complexity parameter that controls the size of the tree. Therefore, R_α is a combination of the misclassification cost of the tree and its complexity. In general, misclassification cost is the cost or loss of misclassifying an item as having one property value, when the item in fact has a different property value. For a formal definition of misclassification cost, please refer to pp. 35–36 of CART. Using minimal cost-complexity pruning, the pruning process successively cuts off branches until the function R_α = R_0 + α·N_leaf stops decreasing. For a more detailed description of the pruning process, please refer to pp. 63–81 of CART.
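The selection of a sub-tree by the cost-complexity function described above can be illustrated with a short calculation. This sketch assumes that the training-set cost R0 and the leaf count of each candidate sub-tree are already known. It does not show how the candidate sub-trees are generated. The function names and the four candidate sub-trees, including their costs and leaf counts, are hypothetical.

```python
def cost_complexity(r0: float, n_leaf: int, alpha: float) -> float:
    """Training-set cost plus a penalty of alpha per leaf node."""
    return r0 + alpha * n_leaf


def select_subtree(candidates, alpha):
    """Return the candidate sub-tree with the smallest cost-complexity value."""
    return min(candidates, key=lambda c: cost_complexity(c["r0"], c["n_leaf"], alpha))


# Hypothetical sub-trees obtained by pruning successively more branches.
candidates = [
    {"name": "full tree",    "r0": 0.00, "n_leaf": 8},
    {"name": "pruned once",  "r0": 0.05, "n_leaf": 5},
    {"name": "pruned twice", "r0": 0.10, "n_leaf": 3},
    {"name": "root only",    "r0": 0.40, "n_leaf": 1},
]

for alpha in (0.0, 0.02, 0.5):
    print(alpha, select_subtree(candidates, alpha)["name"])
```

With the complexity parameter at zero, the full tree is selected because only the training-set cost matters. As the parameter grows, each leaf becomes more expensive and smaller sub-trees are selected.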
Although the tree-based prediction of single properties is useful in many contexts, there are often multiple properties of interest associated with each item. For example, with respect to compounds A, B, and C, a molecule may have properties (binds with A), (does not bind with B), and (binds with C). As another common example, a molecule may bind to a target of interest, and may thus be a good pharmaceutical candidate, but may have other properties such as toxicity or poor bioabsorption that are also relevant to the usefulness of the compound as a drug.
A training item set may thus include a number of molecules, each with its own descriptors (such as molecule size, surface area, number of rotatable bonds, system energies, and so forth) and its multiple properties. According to traditional decision tree methods, in order to predict multiple properties, the training item set is split into multiple subsets, each containing the descriptors of the training items and one of the properties. A computing process is then run on each of the subsets to classify its training items, so that a separate decision tree is formed for each of the properties. Using each of the decision trees, the corresponding property of a new item is then predicted. In addition to requiring multiple tree-creating and predicting processes, these traditional methods also produce too much complexity for researchers. Because each of the decision trees created using these traditional methods concerns only one property, the trees inhibit researchers from discovering “generic” descriptors, and logic tests on such descriptors, that are relevant to all properties or to several of the properties. They also inhibit researchers from analyzing the relationships between descriptors, for example, the relationships between more generic descriptors and less generic descriptors.
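The first step of the traditional approach, in which the training item set is divided into one subset per property, can be sketched as follows. Only the division of the data is shown. The tree creation that would then be run separately on each subset is omitted. The function name, the record layout and the two example molecules are hypothetical.

```python
def split_by_property(training_items, property_names):
    """Build one subset per property, each pairing all descriptors with that property."""
    subsets = {}
    for name in property_names:
        subsets[name] = [(item["descriptors"], item["properties"][name])
                         for item in training_items]
    return subsets


# Hypothetical training items, each with two descriptors and two properties.
training_items = [
    {"descriptors": {"MWT": 300, "HBD": 2},
     "properties": {"binds_A": True, "toxic": False}},
    {"descriptors": {"MWT": 620, "HBD": 6},
     "properties": {"binds_A": False, "toxic": True}},
]

subsets = split_by_property(training_items, ["binds_A", "toxic"])
print(sorted(subsets))
```

Every subset repeats the full set of descriptors, and each would yield its own tree, which is why no single tree in this approach shows which logic tests matter for several properties at once.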