The present disclosure relates generally to machine learning, and more particularly to computing decision trees or regression trees.
Growing decision and regression trees is a widely used technique in machine learning and data mining for generating predictive tree models. In these tree models, leaves comprise class labels (for decision trees) or numeric target attribute values (for regression trees), and branches represent conjunctions of features that lead to those class labels or target attribute values. Decision trees are data structures which are used for classifying input data into predefined classes. Regression trees are data structures which are used for calculating a predicted data value, e.g. an integer, from input data. Multiple tree models may be used together in an ensemble model for improving accuracy. An ensemble may consist of several thousand trees or more. The prediction results of the individual trees in an ensemble model are combined, e.g. based on a voting or averaging procedure, for generating a final result of said ensemble model.
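The combination step mentioned above can be illustrated with a minimal sketch. It assumes that each tree has already produced its individual prediction; the function names `combine_by_voting` and `combine_by_averaging` and the sample values are hypothetical and serve illustration only.

```python
from collections import Counter
from statistics import mean


def combine_by_voting(predictions):
    """Return the class label predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]


def combine_by_averaging(predictions):
    """Return the arithmetic mean of the numeric predictions of all trees."""
    return mean(predictions)


# Hypothetical per-tree outputs for one input record.
tree_labels = ["spam", "ham", "spam"]   # decision trees
tree_values = [2.0, 3.0, 4.0]           # regression trees

ensemble_label = combine_by_voting(tree_labels)
ensemble_value = combine_by_averaging(tree_values)
```

Voting suits decision trees, whose outputs are class labels, while averaging suits regression trees, whose outputs are numeric.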
The trees in ‘ensemble models’ or ‘ensembles’ are generated from different data bags, i.e. derivative sets or sub-sets of the available training data. It is a common approach to calculate such data bags from the available training data and to generate a decision or regression tree from each of said data bags separately. The resulting ‘ensemble model’ will in general provide more accurate predictions than a single tree created from the totality of the available training data.
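One possible way of calculating data bags is sketched below. The sketch assumes sampling with replacement (bootstrap sampling), which is only one of several conceivable ways to derive bags; the function name `make_data_bags` and its parameters are hypothetical.

```python
import random


def make_data_bags(records, num_bags, bag_size, seed=0):
    """Draw num_bags data bags of bag_size records each.

    Assumption: records are drawn with replacement, so a record may
    appear several times in one bag and in several bags.
    """
    rng = random.Random(seed)
    return [rng.choices(records, k=bag_size) for _ in range(num_bags)]


# Ten hypothetical training records, identified by their index.
training_records = list(range(10))
bags = make_data_bags(training_records, num_bags=3, bag_size=10)
```

Each bag would then be passed separately to a tree growing algorithm.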
In ensemble tree modelling, the growing of multiple trees may be computationally much more expensive than growing a single tree on the totality of training data. This is because the growing of each node in each of the trees generally involves a heavy computation of statistics on attribute values of a large number of training data records. Data bags may be distributed to different processing nodes of a grid, such that each processing node of said grid grows one tree. As the data bags may overlap to a large extent, a huge amount of data has to be moved to the respective grid nodes and processed there for growing the trees. This results in increased network traffic and a high processing load of the individual grid nodes.
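The effect of overlapping data bags on the amount of transferred data can be made concrete with a small sketch. It assumes that each bag is drawn with replacement and has the size of the training data set, and that every bag is shipped in full to its own grid node; the function name `transfer_volume` is hypothetical.

```python
import random


def transfer_volume(num_records, num_bags, seed=0):
    """Compare the records shipped to grid nodes with the distinct records.

    Assumption: one bag per grid node, each bag drawn with replacement
    and containing num_records entries.
    """
    rng = random.Random(seed)
    bags = [rng.choices(range(num_records), k=num_records)
            for _ in range(num_bags)]
    shipped = sum(len(bag) for bag in bags)   # records moved over the network
    distinct = len(set().union(*bags))        # different records among them
    return shipped, distinct


shipped, distinct = transfer_volume(num_records=1000, num_bags=5)
```

Under these assumptions the number of shipped records grows linearly with the number of bags, whereas the number of distinct records can never exceed the size of the training data set.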
The computational costs of ensemble tree growing are also an obstacle for implementing such algorithms in (analytical) databases, which have to provide sufficient processing capacity for executing complex joins over multiple database tables and other computationally demanding tasks and therefore must not spend all available processing capacity on tree growing.
In standard tree growing techniques, multiple data bags are calculated from an original training data set. A single-tree growing algorithm is applied to each of the data bags separately. Therefore, the cost of growing an ensemble of N trees by using standard tree growing approaches is approximately N times the cost of growing a single tree. Thus, the creation of an ensemble model is computationally much more expensive than the growing of a single tree model. If trees are grown in a parallel in-database analytics environment such as Netezza Analytics™, which already provides some algorithms for decision and regression trees, the overhead for executing the stored procedures and user-defined functions and for creating temporary tables will further slow down the calculation of ensemble models.
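The linear growth of the cost with the number of trees can be illustrated by counting how many records are scanned. In the sketch below, `grow_tree` is a hypothetical placeholder for a single-tree growing algorithm: it merely computes the mean target value of a bag, which corresponds to a regression tree consisting of a single leaf. It is not an actual tree growing algorithm.

```python
def grow_tree(bag, counter):
    """Placeholder for a single-tree algorithm: one full pass over the bag."""
    total = 0.0
    for value in bag:
        total += value
        counter["records_scanned"] += 1
    return total / len(bag)


def grow_ensemble(bags):
    """Apply the single-tree algorithm to every bag, one bag after another."""
    counter = {"records_scanned": 0}
    trees = [grow_tree(bag, counter) for bag in bags]
    return trees, counter["records_scanned"]


# Four hypothetical bags of three target values each.
bags = [[1.0, 2.0, 3.0] for _ in range(4)]
trees, records_scanned = grow_ensemble(bags)
```

With N bags of equal size, the placeholder scans N times as many records as a single run; a real tree growing algorithm would repeat such scans for every node it grows.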