This invention relates generally to machine learning techniques, and more specifically to decision-tree learning techniques.
Machine learning techniques are a mechanism by which accumulated data can be used for predictive and other analytical purposes. For example, web site browsing data can be used to determine which web sites are more likely to be viewed by particular types of users. As another example, product purchasing data can be used to determine which products a given consumer is likely to purchase, based on his or her prior purchasing history and other information.
One type of machine learning technique is decision-tree learning. A decision tree is a structure that is used to encode a conditional probability distribution of a target variable, given a set of predictor variables. For example, the predictor variables may correspond to the web sites a user has or has not already viewed, or products a user has or has not already purchased. The target variable may then correspond to a particular web site or product that an analyst is determining whether the user is likely to view or purchase, respectively. Once a decision tree has been constructed, it is navigated using a particular user's data to determine the answer to this potential viewing or purchasing query.
A decision tree is generally constructed by starting with an internal node that corresponds to a predictor variable and splits into two or more branches. Each branch ends in a leaf node, which signifies the end of the tree and holds a probability regarding the target variable. In order to make the tree more accurate, leaf nodes are replaced with other internal nodes also corresponding to predictor variables, such that each of these nodes splits into two or more branches that also end in leaf nodes. Thus, by iteratively replacing leaf nodes with internal nodes, more levels are added to the tree, improving the predictive accuracy of the decision tree.
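The structure described above can be illustrated with a minimal sketch. The class names, the `predict` function, the predictor names, and the probabilities are all hypothetical and chosen only for illustration; they are not taken from any embodiment.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass
class Leaf:
    # Probability that the target variable is true for cases reaching this leaf.
    probability: float


@dataclass
class Internal:
    # Name of the predictor variable tested at this node (hypothetical).
    predictor: str
    # One child per possible value of the predictor.
    children: Dict[bool, "Node"]


Node = Union[Leaf, Internal]


def predict(node, case):
    """Navigate from the root to a leaf using one user's data."""
    while isinstance(node, Internal):
        node = node.children[case[node.predictor]]
    return node.probability


# Illustrative tree: the second-level node replaced what was once a leaf.
tree = Internal("viewed_site_a", {
    False: Leaf(0.05),
    True: Internal("viewed_site_b", {False: Leaf(0.30), True: Leaf(0.80)}),
})
```

Calling `predict(tree, {"viewed_site_a": True, "viewed_site_b": False})` follows the `True` branch at the root and the `False` branch below it, ending at the leaf that holds 0.30.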
For internal nodes corresponding to predictor variables that have continuous values, the branches extending from the internal nodes have corresponding intervals, such that a given branch extending from a node and having a given interval is followed if the predictor variable for the node has a value that falls within the interval. For example, an internal node may have two branches extending from it, one having an interval of less than 7, and the other having an interval of greater than or equal to 7. If navigation of the decision tree results in landing on this node, and if the predictor variable has a value less than 7, then the former branch is followed; if the variable has a value greater than or equal to 7, then the latter branch is followed.
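As a rough sketch of how a branch might be selected at such a node, the intervals can be represented by an ascending list of boundary values. The function name `branch_index` and this representation are assumptions made for illustration only.

```python
from bisect import bisect_right


def branch_index(boundaries, value):
    """Return the index of the branch whose interval contains value.

    boundaries is an ascending list of split points (an assumed
    representation). Branch 0 covers values below the first boundary,
    and each later branch covers values greater than or equal to the
    boundary on its left.
    """
    return bisect_right(boundaries, value)


# The example from the text: one boundary at 7, two branches.
boundaries = [7]
```

With `boundaries = [7]`, a value of 3 selects branch 0 (less than 7), while values of 7 and 7.5 both select branch 1 (greater than or equal to 7).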
Identifying intervals that yield accurate decision trees is typically accomplished within the prior art by sorting the relevant training data used for constructing the tree by the value of every continuous predictor variable. Once an interval has been identified, any new leaf nodes that are also made into internal nodes must have their relevant training data re-sorted as well. Unfortunately, this is a time-consuming process, and can result in delays where on-the-fly, dynamic decision tree construction is required. An alternative approach used in the prior art is to determine a set of static intervals a priori. However, while this allows for quicker decision tree construction, accuracy is reduced because the intervals are not constructed dynamically. For these and other reasons, therefore, there is a need for the present invention.
The invention relates to dynamically determining continuous split intervals for decision trees, without sorting. As a result of the dynamic nature of the determination, embodiments of the invention provide for accurately constructed decision trees on-the-fly. Furthermore, since sorting is not required, the embodiments provide for quickly constructed decision trees as compared to dynamic approaches within the prior art.
In one embodiment, a method for constructing a decision tree using a set of training data starts with a current, or present, tree. A new tree is determined that has a leaf of the present tree replaced by a continuous split on a predictor variable with a number of intervals. The intervals are dynamically determined without sorting. In varying embodiments, the intervals are determined using the mean, the mean and the standard deviation, the median, or a predetermined number of percentiles of a relevant sub-set of the set of training data. If the new tree has better predictive value than the current tree, then the current tree is replaced with the new tree. This process continues until the decision tree has been constructed.
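The iterative process above can be sketched as a greedy loop. This is a simplified illustration under stated assumptions, not the method of any particular embodiment: the tree is represented only by the training rows at each leaf, each candidate split uses two intervals around the mean, and the measure of predictive value (`leaf_score`, a log-likelihood minus a per-leaf penalty) is an assumption, since the text above does not specify one. All function and parameter names are hypothetical.

```python
import math


def mean_split(rows, predictor):
    """Split rows into [< mean, >= mean]; the mean needs no sorting."""
    mean = sum(r[predictor] for r in rows) / len(rows)
    low = [r for r in rows if r[predictor] < mean]
    high = [r for r in rows if r[predictor] >= mean]
    return mean, [low, high]


def leaf_score(rows, target, penalty=1.0):
    """Assumed score: log-likelihood of the target at a leaf, minus a penalty."""
    n = len(rows)
    k = sum(1 for r in rows if r[target])
    ll = 0.0
    for count in (k, n - k):
        if count:
            ll += count * math.log(count / n)
    return ll - penalty


def grow(rows, predictors, target, penalty=1.0):
    """Replace leaves with mean splits for as long as the total score improves."""
    leaves = [rows]  # current tree, as the training rows reaching each leaf
    splits = []      # accepted (predictor, mean) splits, in order
    improved = True
    while improved:
        improved = False
        for i, leaf in enumerate(leaves):
            best = None
            for p in predictors:
                mean, parts = mean_split(leaf, p)
                if any(not part for part in parts):
                    continue  # all rows fell on one side; not a real split
                # The total score is a sum over leaves, so comparing the new
                # tree with the current tree reduces to this difference.
                gain = (sum(leaf_score(part, target, penalty) for part in parts)
                        - leaf_score(leaf, target, penalty))
                if gain > 0 and (best is None or gain > best[0]):
                    best = (gain, p, mean, parts)
            if best:
                _, p, mean, parts = best
                leaves[i:i + 1] = parts  # the new tree replaces the current tree
                splits.append((p, mean))
                improved = True
                break
    return leaves, splits


# Illustrative training data: larger values of "age" go with purchases.
data = [{"age": a, "buys": a >= 10} for a in (1, 2, 3, 4, 10, 11, 12, 13)]
leaves, splits = grow(data, ["age"], "buys")
```

On this illustrative data the mean of `age` is 7, so the first split separates the two groups cleanly, and no further split improves the assumed score.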
As an example using the mean, there may be two intervals: a first interval of less than the mean of the relevant sub-set of the set of training data, and a second interval of greater than or equal to the mean of the relevant sub-set of the set of training data. As another example, using both the mean and the standard deviation of the relevant sub-set of the set of training data, there may be four intervals: a first interval of less than the mean minus a multiple of the standard deviation, a second interval of greater than or equal to the mean minus the multiple of the standard deviation and less than the mean, a third interval of greater than or equal to the mean and less than the mean plus the multiple of the standard deviation, and a fourth interval of greater than or equal to the mean plus the multiple of the standard deviation. Other examples according to different embodiments of the invention could have intervals based on the median and/or a predetermined number of percentiles of the relevant sub-set of the set of training data.
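The four-interval example above can be sketched as follows. The sketch computes the mean and standard deviation in a single pass over unsorted data using Welford's update, which is one possible choice and not necessarily that of any embodiment. The use of the population standard deviation, the function name `mean_std_boundaries`, and the parameter `multiple` are assumptions made for illustration.

```python
import math
from bisect import bisect_right


def mean_std_boundaries(values, multiple=1.0):
    """Return [mean - m*sd, mean, mean + m*sd] from one unsorted pass."""
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    if n == 0:
        raise ValueError("at least one value is required")
    sd = math.sqrt(m2 / n)  # population standard deviation (an assumption)
    return [mean - multiple * sd, mean, mean + multiple * sd]


# Illustrative data with mean 5 and population standard deviation 2.
bounds = mean_std_boundaries([2, 4, 4, 4, 5, 5, 7, 9])

# Index of the interval containing a value: 0 is the first interval
# and 3 is the fourth.
interval = bisect_right(bounds, 6)
```

For this data the boundaries are approximately 3, 5, and 7, so a value of 6 falls in the third interval (greater than or equal to the mean and less than the mean plus one standard deviation).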
The invention includes computer-implemented methods, machine-readable media, computerized systems, and computers of varying scopes. Other aspects, embodiments and advantages of the invention, beyond those described here, will become apparent by reading the detailed description and with reference to the drawings.