The present invention generally relates to machine learning techniques and probabilistic reasoning under uncertainty. More particularly, the present invention relates to learning decision trees from data and using learned decision trees to approximate conditional probabilities.
Machine learning techniques are a mechanism by which accumulated data can be used for prediction and other analytical purposes. For example, web site browsing data can be used to determine web sites more likely to be viewed by particular types of users. As another example, product purchase data can be employed to determine products a consumer is likely to purchase, based on prior product purchase history and other information.
One type of machine learning technique is decision-tree learning. A decision tree is a structure employed to encode a conditional probability distribution of a target attribute given a set of predictor attributes. For example, the set of predictor attributes may correspond to web sites a user has or has not viewed, or products a user has or has not purchased. The target attribute may then correspond to a web site or product that an analyst is examining to determine whether the user is likely to view or purchase, respectively. Once a decision tree has been constructed, it can be navigated by employing a particular target user's data to determine answers to future viewing or purchasing queries concerning the target user.
A decision tree 10 illustrated in Prior Art FIG. 1 was constructed by a decision-tree learning algorithm for the purpose of predicting a person's salary based on various attributes associated with the person. The learning algorithm constructed the decision tree 10 using a set of training data, where each record in the training data corresponded to a person. The set of known attributes in the training data includes Age, Gender, Job and Salary, where Age is a continuous attribute, Job is a categorical attribute with three states {Engineer, Lawyer, Researcher} and Salary is a categorical (binary) attribute with states {High, Low}. Salary is referred to as the target attribute for the tree because the tree is used to predict Salary. Other attributes employed in building the tree 10 are referred to as predictor attributes.
The decision tree 10 in Prior Art FIG. 1 encodes the conditional probability distribution p(Salary|Age,Gender,Job) learned from the training data. In particular, for any assignment of the predictor attributes Age, Gender and Job, the decision tree 10 can be traversed from a root node 12 down to one of a leaf node 18, a leaf node 20 or a leaf node 16. Each of the leaf nodes 18, 20 and 16 stores a probability distribution for Salary.
In general, a decision tree is traversed by starting at the root node and following child links until a leaf node is reached. Each non-leaf node is annotated with the name of a predictor attribute to be examined, and each out-going child link from that node is annotated with a value or a set of values for the predictor attribute. Every value of the predictor attribute corresponds to exactly one out-going child link. When the traversal reaches a non-leaf node, the known value of the corresponding predictor attribute is examined, and the appropriate (unique) child link is followed. Non-leaf nodes are referred to as split nodes (or simply splits) in the decision tree. Each split node is annotated with the name of a predictor attribute X, and the node is thus referred to as a split on X. Splits have at least two children. Prior Art FIG. 1 illustrates splits with two children, which create binary trees. It is to be appreciated by one skilled in the art that, although the application describes binary trees, the more general case of non-binary trees can be employed in accordance with the present invention.
To illustrate how to traverse and extract conditional probabilities from a decision tree, consider again the tree 10 in Prior Art FIG. 1. Assume that an analyst desires to predict the salary of a twenty eight year old female engineer. The analyst desires to use the tree 10 to determine p(Salary|Age=28, Gender=female, Job=Engineer). The traversal starts at the root node 12, which is a split on Age. Consequently, the known value of twenty eight for Age is examined and compared to the values on the out-going edges of the root node 12. Because twenty eight is less than thirty, the left child edge is traversed and the traversal moves to a node 14. The node 14 is a split on Job, and because Job=Engineer for the person in question, the traversal moves next to the node 18. The node 18 is a leaf node, and consequently the traversal completes and the conditional probability for Salary can be obtained. In particular,
p(Salary=High|Age=28, Gender=female, Job=Engineer)=0.65
p(Salary=Low|Age=28, Gender=female, Job=Engineer)=0.35
Note that the decision tree 10 does not contain any splits on Gender. This means that the learning algorithm identified that Gender was not useful for predicting Salary, at least in the context of knowing Age and Job.
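By way of illustration only, the traversal described above can be sketched in Python. The dictionary layout and the names `traverse`, `root12`, `node14`, `leaf16`, `leaf18` and `leaf20` are hypothetical. Only the distribution in leaf node 18 is taken from the text; the distributions in leaf nodes 20 and 16, and the placement of leaf node 16 on the Age >= 30 edge, are assumptions made so that the example runs.

```python
# Leaf 18's distribution comes from the text; leaves 20 and 16 use made-up values.
leaf18 = {"dist": {"High": 0.65, "Low": 0.35}}
leaf20 = {"dist": {"High": 0.30, "Low": 0.70}}  # hypothetical
leaf16 = {"dist": {"High": 0.75, "Low": 0.25}}  # hypothetical

# Each split holds (test, child) pairs; exactly one test matches a known value.
node14 = {"split": "Job", "branches": [
    (lambda job: job == "Engineer", leaf18),
    (lambda job: job in {"Lawyer", "Researcher"}, leaf20),
]}
root12 = {"split": "Age", "branches": [
    (lambda age: age < 30, node14),
    (lambda age: age >= 30, leaf16),  # assumed placement of leaf 16
]}


def traverse(node, record):
    """Follow the unique matching child link at each split until a leaf is reached."""
    while "split" in node:
        value = record[node["split"]]
        node = next(child for test, child in node["branches"] if test(value))
    return node["dist"]


# Gender is supplied but never examined, because no split refers to it.
print(traverse(root12, {"Age": 28, "Gender": "female", "Job": "Engineer"}))
```

For the twenty eight year old female engineer, the sketch reaches leaf node 18 and returns the distribution stored there.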
In general, given a decision tree for a probability distribution p(Y|X1, . . . ,XN), then for values x1, . . . ,xN the values p(Y|X1=x1, . . . ,XN=xN) can be extracted by performing the traversal algorithm as described above, and using the distributions stored in the leaf nodes. One skilled in the art will appreciate that p(Y|X1, . . . ,XN) denotes either a discrete probability distribution or a probability density function, depending on whether Y is a discrete or a continuous attribute, respectively.
Using decision trees as described above presents three problems.
A first problem arises because decision trees are constructed using a finite set of data that may not contain very many examples corresponding to a probability later requested from the decision tree. Since the probability distributions at the leaf nodes are estimated from the training data, conventionally it is possible to extract a probability that may not have a reliable estimate due to this "inadequate training data" problem.
Another problem arises when the requested query does not contain a predictor value that may conventionally be employed to traverse a decision tree and thus retrieve a stored probability. This problem typically occurs when not all of the predictor values (e.g., the values of the attributes that define the splits in the decision tree) are provided in a query, yet a conditional probability of the target attribute is still sought. This problem can arise because the conditional probability distribution p(W|X,Y) does not provide adequate information about the probability distribution p(W|X). That is, if the values for one or more predictors are not known, a conventional decision tree may not extract the desired probability. This is known as the missing predictor problem.
A third problem arises because the domain (e.g., the set of possible values) for predictor attributes may not be known when the decision tree is constructed, and these domains may have to be estimated from data. For example, if a decision tree is constructed for p(W|X,Y) using a set of training data, and in that data the attribute X appears in one of two distinct states, the training algorithm is likely to assume that X is a binary attribute. This is problematic if X has more than two states, and the tree is later used to extract p(W|X,Y) for the third value of X. This is known as the "new value" problem.
These three problems can be illustrated with reference to Prior Art FIG. 1. To illustrate the inadequate training data problem, assume that the training data contained no records for lawyers under thirty. In this case, assume that the split on Job in node 14 was chosen by the learning algorithm because it separates the engineers from the researchers, and this distinction is useful when predicting Salary. By the definition of a split node, Lawyer has to correspond to an out-going edge, and the learning algorithm chose to group Lawyers with Researchers. If the tree 10 is used to answer query 22, the returned probability distribution will be based on Researchers alone, and may not be an accurate distribution for Lawyers.
As another example of the inadequate training data problem, suppose that probabilities are not considered accurate unless at least k records matching the query existed in the training data. A conventional decision tree provides no assurance that a returned probability satisfies this constraint.
The missing predictor problem is illustrated in FIG. 1 by considering an attempt to extract the probability p(Salary|Job=Engineer). That is, the value for the Age predictor is unknown. A conventional decision tree 10 is unable to provide the desired probability because it is not known which child of the root node 12 the traversal should follow to reach a leaf node.
The new value problem is illustrated in FIG. 1 by considering the query p(Salary|Age less than 30, Job=Carpenter). Because the learning algorithm assumed that the values of Job were {Engineer, Lawyer, Researcher} when the tree was built, a conventional decision tree cannot be traversed using the given query because there is no out-going edge from node 14 corresponding to Carpenter, and consequently no conditional probability can be returned.
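By way of illustration only, the following Python sketch shows how a strict, conventional traversal fails on the missing predictor and new value queries discussed above. The function name `conventional_lookup`, the dictionary layout and the exception types are hypothetical, and the distributions in leaf nodes 20 and 16, as well as the placement of leaf node 16 on the Age >= 30 edge, are assumptions.

```python
leaf18 = {"dist": {"High": 0.65, "Low": 0.35}}  # from the text
leaf20 = {"dist": {"High": 0.30, "Low": 0.70}}  # hypothetical
leaf16 = {"dist": {"High": 0.75, "Low": 0.25}}  # hypothetical

node14 = {"split": "Job", "branches": [
    (lambda job: job == "Engineer", leaf18),
    (lambda job: job in {"Lawyer", "Researcher"}, leaf20),
]}
root12 = {"split": "Age", "branches": [
    (lambda age: age < 30, node14),
    (lambda age: age >= 30, leaf16),  # assumed placement of leaf 16
]}


def conventional_lookup(node, query):
    """Strict traversal: fails if a predictor is absent or has an unseen value."""
    while "split" in node:
        attr = node["split"]
        if attr not in query:
            raise KeyError(f"missing predictor: {attr}")
        matches = [child for test, child in node["branches"] if test(query[attr])]
        if not matches:
            raise ValueError(f"new value for {attr}: {query[attr]!r}")
        node = matches[0]
    return node["dist"]


for query in ({"Job": "Engineer"}, {"Age": 25, "Job": "Carpenter"}):
    try:
        conventional_lookup(root12, query)
    except (KeyError, ValueError) as err:
        print(type(err).__name__, err)
```

The first query stops at the root node 12 because Age is unknown; the second stops at node 14 because no out-going edge corresponds to Carpenter.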
In light of the inadequate training data, missing predictor, and new value problems described above, the usefulness of conventional decision trees is limited. Thus, there is a need for a system and method to build and analyze decision trees so that the problems described above are mitigated.
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. This summary is not an extensive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts of the invention in a simplified form as a prelude to the more detailed description that is presented later.
The present invention provides a system and method for using a decision tree to approximate a conditional probability in any of (1) the inadequate training data, (2) the missing predictor, or (3) the new value problem situations. The invention concerns both learning decision trees from data and using the resulting trees to answer queries.
When a decision-tree learning algorithm constructs a tree from training data, it uses counts from the data, known as sufficient statistics, to calculate probabilities within leaf nodes. Conventionally, these statistics are discarded, while the present invention stores such sufficient statistics explicitly in the leaf nodes to facilitate deriving conditional probabilities for problematic queries. The present invention recognizes when such problematic queries occur, aggregates the sufficient statistics contained within a subset of the leaf nodes below a problematic split node, and uses the aggregated statistics to derive an appropriate approximate probability.
The present invention may include a data structure wherein statistics used to generate stored probabilities are not discarded, and are made available to an aggregation algorithm that approximates the probabilities. The aggregation algorithm may utilize the stored statistics to approximate a probability distribution in any of the three problem situations described above. Since such aggregation techniques may not be required for all queries, a program predicting conditional probabilities by analyzing a decision tree may include a component for detecting when aggregation should occur. Further, a component for determining which of the inadequate training data, missing predictor, and/or new value problem situations has triggered the need to aggregate may be included. Different aggregation algorithms may be applied, based, at least in part, on the determination of which problem triggered the need to aggregate. Thus, the problems concerning the three situations described above are mitigated and the usefulness of decision trees in computing conditional probabilities is improved.
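By way of illustration only, a detection component of the kind described above might be sketched in Python as follows. The names `classify_trigger` and `subtree_records`, the `min_records` threshold, and all record counts are hypothetical; the counts in leaf node 18 were merely chosen to be consistent with the 0.65/0.35 distribution given earlier. The rule used here for the inadequate training data case (too few training records below the edge the query selects) is an assumption and is not taken from the text.

```python
# Leaves keep their sufficient statistics (record counts per Salary state).
leaf18 = {"counts": {"High": 13, "Low": 7}}   # hypothetical counts, ratio 0.65/0.35
leaf20 = {"counts": {"High": 6, "Low": 14}}   # hypothetical
leaf16 = {"counts": {"High": 30, "Low": 10}}  # hypothetical

node14 = {"split": "Job", "branches": [
    (lambda job: job == "Engineer", leaf18),
    (lambda job: job in {"Lawyer", "Researcher"}, leaf20),
]}
root12 = {"split": "Age", "branches": [
    (lambda age: age < 30, node14),
    (lambda age: age >= 30, leaf16),  # assumed placement of leaf 16
]}


def subtree_records(node):
    """Total number of training records stored at or below `node`."""
    if "split" not in node:
        return sum(node["counts"].values())
    return sum(subtree_records(child) for _test, child in node["branches"])


def classify_trigger(split_node, query, min_records=1):
    """Return which problem, if any, this split raises for the query."""
    attr = split_node["split"]
    if attr not in query:
        return "missing predictor"
    matches = [child for test, child in split_node["branches"] if test(query[attr])]
    if not matches:
        return "new value"
    if subtree_records(matches[0]) < min_records:  # assumed detection rule
        return "inadequate training data"
    return None


print(classify_trigger(root12, {"Job": "Engineer"}))
print(classify_trigger(node14, {"Age": 25, "Job": "Carpenter"}))
print(classify_trigger(node14, {"Age": 25, "Job": "Engineer"}, min_records=50))
```

A caller could use the returned label to select among different aggregation algorithms, as the paragraph above contemplates.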
The invention implements an aggregation method operable to approximate queries in problematic situations. The aggregation method collects a set of sufficient statistics for nodes below a problematic internal node of a decision tree to facilitate approximating a desired probability. In one example aspect of the present invention, when a split node is encountered during a query-driven traversal of the tree that triggers at least one of the inadequate training data, missing predictor and/or new value problems, the sufficient statistics for all nodes below the triggering split node can be aggregated, facilitating producing a desired probability that conventionally could not be produced. This aggregation technique can be referred to as the "simple aggregation" method.
It is to be appreciated by one skilled in the art that the sufficient statistics collected by such an aggregation method are identical to the sufficient statistics that would correspond to the problematic split node if the decision tree learning algorithm had stopped partitioning the data at the triggering split node, in which case the triggering split node would have been a leaf node. The "simple aggregation" method can be enhanced by caching aggregate statistics corresponding to (internal) split nodes in the split nodes themselves, which facilitates retrieving such aggregation statistics. With the aggregation statistics cached, probabilities can be pre-computed and stored in the (internal) split nodes, eliminating the need to re-derive probabilities, facilitating efficiently retrieving probabilities.
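By way of illustration only, the simple aggregation method and the caching enhancement might be sketched in Python as follows. The names `aggregate`, `normalize` and `approximate`, the dictionary layout and all record counts are hypothetical; the counts in leaf node 18 were chosen to match the 0.65/0.35 distribution given earlier. For brevity the sketch handles only the missing predictor and new value triggers.

```python
leaf18 = {"counts": {"High": 13, "Low": 7}}   # hypothetical counts, ratio 0.65/0.35
leaf20 = {"counts": {"High": 6, "Low": 14}}   # hypothetical
leaf16 = {"counts": {"High": 30, "Low": 10}}  # hypothetical

node14 = {"split": "Job", "branches": [
    (lambda job: job == "Engineer", leaf18),
    (lambda job: job in {"Lawyer", "Researcher"}, leaf20),
]}
root12 = {"split": "Age", "branches": [
    (lambda age: age < 30, node14),
    (lambda age: age >= 30, leaf16),  # assumed placement of leaf 16
]}


def aggregate(node):
    """Sum the counts of all leaves below `node`, caching the sum in split nodes."""
    if "counts" not in node:
        total = {}
        for _test, child in node["branches"]:
            for state, n in aggregate(child).items():
                total[state] = total.get(state, 0) + n
        node["counts"] = total  # cached for later queries
    return node["counts"]


def normalize(counts):
    """Turn record counts into a probability distribution."""
    n = sum(counts.values())
    return {state: c / n for state, c in counts.items()}


def approximate(node, query):
    """Traverse until a leaf or a triggering split, then use that node's statistics."""
    while "split" in node:
        attr = node["split"]
        matches = [child for test, child in node["branches"]
                   if attr in query and test(query[attr])]
        if not matches:  # missing predictor or new value: this split is the trigger
            break
        node = matches[0]
    return normalize(aggregate(node))


print(approximate(root12, {"Age": 25, "Job": "Carpenter"}))  # trigger at node 14
print(approximate(root12, {"Job": "Engineer"}))              # trigger at root node 12
```

In this sketch the Carpenter query is answered from the pooled statistics of leaf nodes 18 and 20, exactly as if node 14 had been a leaf node, and the query with Age missing is answered from the statistics of the entire tree.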
An alternative aspect of the present invention provides the "consistent look-ahead aggregation" method to restrict the sufficient statistics that are included in the aggregation to those statistics that are consistent with a given query. For example, while a query may be missing a predictor value at a trigger node, rather than aggregating all the sufficient statistics for nodes below the trigger node, only nodes consistent with the known conditions in the query may be included in the harvest of sufficient statistics that are aggregated to produce the desired probability. For instance, referring to FIG. 1 and Query 24, the sufficient statistics from node 20 would not be included in the aggregation triggered at the root node 12 because the query specifies that Job=Engineer.
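By way of illustration only, the consistent look-ahead aggregation method might be sketched in Python as follows. The name `consistent_counts`, the dictionary layout and all record counts are hypothetical. The treatment of a predictor value that matches no out-going edge (all children are kept) is an assumption, since the text does not specify that case.

```python
leaf18 = {"counts": {"High": 13, "Low": 7}}   # hypothetical counts, ratio 0.65/0.35
leaf20 = {"counts": {"High": 6, "Low": 14}}   # hypothetical
leaf16 = {"counts": {"High": 30, "Low": 10}}  # hypothetical

node14 = {"split": "Job", "branches": [
    (lambda job: job == "Engineer", leaf18),
    (lambda job: job in {"Lawyer", "Researcher"}, leaf20),
]}
root12 = {"split": "Age", "branches": [
    (lambda age: age < 30, node14),
    (lambda age: age >= 30, leaf16),  # assumed placement of leaf 16
]}


def consistent_counts(node, query):
    """Aggregate counts only from leaves consistent with the query's known values."""
    if "split" not in node:
        return dict(node["counts"])
    attr = node["split"]
    children = [child for _test, child in node["branches"]]
    if attr in query:
        matching = [child for test, child in node["branches"] if test(query[attr])]
        children = matching or children  # assumption: unseen value keeps all children
    total = {}
    for child in children:
        for state, n in consistent_counts(child, query).items():
            total[state] = total.get(state, 0) + n
    return total


# Age is missing, so both children of root node 12 are explored, but below
# node 14 only the Engineer edge (leaf 18) is followed; leaf 20 is excluded.
counts = consistent_counts(root12, {"Job": "Engineer"})
n = sum(counts.values())
print({state: c / n for state, c in counts.items()})
```

With these hypothetical counts, the aggregation combines the statistics of leaf nodes 18 and 16 only, whereas simple aggregation at the root node 12 would also have included leaf node 20.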
Although two aggregating methods, the simple and the consistent look-ahead methods, are described herein, it is to be appreciated by one skilled in the art that the present invention is not intended to be limited to these two aggregation methods, and that other aggregation methods may be employed in connection with the present invention.
To the accomplishment of the foregoing and related ends, certain illustrative aspects of the invention are described herein in connection with the following description and the annexed drawings. These aspects are indicative, however, of but a few of the various ways in which the principles of the invention may be employed and the present invention is intended to include all such aspects and their equivalents. Other advantages and novel features of the invention may become apparent from the following detailed description of the invention when considered in conjunction with the drawings.