The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for detecting interesting decision rules in tree ensembles.
Supervised learning algorithms are commonly described as performing the task of searching through a hypothesis space to find a suitable hypothesis that will make good predictions with regard to a particular problem. Even if the hypothesis space contains hypotheses that are very well-suited for a particular problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form a (hopefully) better hypothesis. The term ensemble is usually reserved for methods that generate multiple hypotheses using the same base learner.
Evaluating the prediction of an ensemble typically requires more determination than evaluating the prediction of a single model, so ensembles may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra determination. Fast algorithms, such as decision trees, are commonly used with ensembles, although slower algorithms can benefit from ensemble techniques as well.
Therefore, tree ensembles are among the most popular and successful machine learning models. Tree ensembles combine predicted values from each tree within the ensemble by voting for categorical targets and by averaging for continuous targets. Different trees in an ensemble are usually generated using bagging, random forests, or boosting methods. Bagging is based on re-sampling data records from training data set, random forests are based on re-sampling both data records and attributes, while boosting is based on dynamically changing the record weights for generating each tree model. These and other similar tree ensemble methods improve model accuracy by reducing the prediction variance inherent to single tree models.
Tree models generate a number of decision rules that are easy to understand and apply. They often provide direct insights into important relationships. Unfortunately, the ease of interpretation for a single tree is lost when they are combined into an ensemble. Tree ensembles are usually more accurate than a single tree, but are very non-transparent from the user perspective. They offer no interpretable insights into important relationships supported by the data.