1. Field of the Invention
This invention relates to methods and systems of predicting properties of molecules.
2. Description of the Related Art
A physical item's unknown conditions can often be predicted based on the item's known conditions. Disease diagnosis is one simple example. If a patient has symptom A and symptom B, but does not have symptom C, then it may be predicted that the patient has some particular disease. In this example, the physical item's (the patient's) three known conditions (has symptom A, has symptom B, does not have symptom C) are used to predict an unknown condition (that the patient has some particular disease). The conditions that are known or easy to measure or calculate are often called descriptors or X variables. The conditions that are unknown or difficult to measure or calculate, and that are believed to be related to the descriptors, are often called properties, attributes, or Y variables.
Decision trees are a common method of forming predictions about a property of an item, based on descriptors of the item. The structure of a decision tree can be derived by studying existing items. Each of the existing items has known descriptor values and a known property value. The existing items that are used to formulate the decision tree are called training items. The items that have an unknown property and are not used in formulating the decision tree are called new items. The known descriptor values of the training items, in conjunction with the known property values of the training items, are used to develop a connected series of decision points in the form of the decision tree. The decision tree can then be used to predict the unknown property of a new item. For example, based on the descriptors (e.g., age and blood pressure) and the property (whether the patient suffered a heart attack) of the training items (medical history data of prior patients), a decision tree can be formulated and then used to predict whether a new patient with given descriptors is likely to suffer a heart attack.
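By way of illustration only, the derivation of a single decision point from training items can be sketched as follows. The function name best_split, the sample ages, and the use of a simple misclassification count as the selection criterion are assumptions made for this sketch; they are not taken from CART or from any particular tree-building method.

```python
def best_split(values, labels):
    """Return (threshold, errors) for the test `value > threshold` that
    misclassifies the fewest training items (hypothetical criterion)."""
    best = None
    ordered = sorted(set(values))
    # Candidate thresholds are midpoints between adjacent descriptor values,
    # so both sides of every candidate split are non-empty.
    for lo, hi in zip(ordered, ordered[1:]):
        threshold = (lo + hi) / 2
        errors = 0
        for side in (True, False):
            group = [l for v, l in zip(values, labels) if (v > threshold) == side]
            majority = max(set(group), key=group.count)
            errors += sum(1 for l in group if l != majority)
        if best is None or errors < best[1]:
            best = (threshold, errors)
    return best


# Hypothetical training items: descriptor = age, property = heart attack.
ages = [35, 42, 48, 55, 61, 70]
heart_attack = [False, False, False, True, False, True]
threshold, errors = best_split(ages, heart_attack)
```

A full tree would be obtained by applying such a step recursively to the items on each side of the chosen decision point, typically with a more refined criterion than the one assumed here.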
Decision trees classify training items by repeatedly dividing the training items into classes at nodes of the tree. Classification at each node is based on a test on one or more descriptors (e.g., whether the patient is more than fifty years old, whether blood pressure is greater than 100). Each terminal node (i.e., leaf) of the tree corresponds to a classification of the property. The property of a new item is then predicted by running the new item from the root of the tree through the tests of the decision tree, based on the descriptors of the new item, until a leaf is reached. The property corresponding to the leaf is the predicted property of the new item. In addition to predicting the property of a new item, the decision tree can also aid a user in interpreting relationships between descriptors and the property. For a more detailed description of decision tree methods, please refer to pp. 18-36 of “Classification and Regression Trees”, Breiman, Friedman, Olshen and Stone, Chapman & Hall/CRC 1984 (CART hereinafter). For a description of some of the advantages of a tree-structured approach, please refer to pp. 55-58, CART. The disclosure of the above-described book “Classification and Regression Trees” is hereby incorporated by reference in its entirety.
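The prediction procedure described above, in which a new item is run from the root through the tests until a leaf is reached, can be sketched as follows. The Node class, the predict function, and the example tree with its thresholds and "low risk"/"high risk" labels are hypothetical and chosen only to mirror the age and blood pressure example; each node here tests a single descriptor against a threshold, which is one possible form of test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # Internal nodes carry a test; leaves carry only a prediction.
    descriptor: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # taken when value <= threshold
    right: Optional["Node"] = None   # taken when value > threshold
    prediction: Optional[str] = None


def predict(node, item):
    """Walk from the root to a leaf using the item's descriptor values."""
    while node.prediction is None:
        if item[node.descriptor] > node.threshold:
            node = node.right
        else:
            node = node.left
    return node.prediction


# Hypothetical tree: first test age > 50, then blood pressure > 100.
tree = Node(
    "age", 50,
    left=Node(prediction="low risk"),
    right=Node(
        "blood_pressure", 100,
        left=Node(prediction="low risk"),
        right=Node(prediction="high risk"),
    ),
)

new_patient = {"age": 62, "blood_pressure": 130}
result = predict(tree, new_patient)
```

Because every path from the root to a leaf is a short sequence of explicit tests, the same structure that produces the prediction can also be read by a user to interpret how the descriptors relate to the property.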
Decision trees may include both classification trees and regression trees. A classification tree's terminal nodes each represent a class of properties. A regression tree's terminal nodes each represent a value of a property. As those ordinarily skilled in the art will appreciate, the disclosed methods and systems can be applied to both classification trees and regression trees. Therefore, terms such as “class”, “classify” and “classification” can be used in the present application to refer to assigning a class to a classification tree terminal node as well as assigning a value to a regression tree terminal node. The term “decision tree” as used in the present application refers to both classification trees and regression trees.
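The distinction between the two kinds of terminal node can be illustrated with a minimal sketch. It assumes, for illustration only, that a classification leaf is assigned the most common class among the training items that reach it and that a regression leaf is assigned the mean of their property values; the function names and the sample data are hypothetical.

```python
from collections import Counter
from statistics import mean


def classification_leaf(property_values):
    """Assumed rule: assign the most common class among the leaf's items."""
    return Counter(property_values).most_common(1)[0][0]


def regression_leaf(property_values):
    """Assumed rule: assign the mean property value of the leaf's items."""
    return mean(property_values)


# Hypothetical training items that reached one terminal node.
leaf_class = classification_leaf(["active", "inactive", "active"])
leaf_value = regression_leaf([1.0, 2.0, 3.0])
```

In both cases the terminal node is assigned a single output derived from its training items, which is why the term “classification” can cover both operations.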
Decision trees and various methods for their construction are further disclosed in U.S. Pat. No. 7,016,887, issued on Mar. 21, 2006, hereby expressly incorporated by reference in its entirety.
There exists an unmet need in the art for a method of quickly generating improved decision trees which have higher accuracy in predicting the properties of various molecules while simultaneously having a smaller number of leaves. Smaller trees tend to be more predictive on molecules outside the training set, and are also easier to interpret.