The present disclosure relates generally to computer-implemented prediction approaches, and more specifically to the use of decision or regression trees for automated prediction.
Decision and regression trees are widely used predictive models. Decision trees are data structures which may be used for classifying input data into different, predefined classes. Regression trees are data structures which may be used for calculating a prediction result in the form of a data value, e.g. an integer, from input data. In the following, both the calculation of a result data value and the classification into predefined classes from input data will be referred to as ‘prediction’.
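A single-tree prediction of either kind can be sketched as a root-to-leaf traversal. The following is a minimal illustrative sketch; the `Node` layout and the `predict` function are hypothetical assumptions for exposition, not the data structures of any particular product:

```python
class Node:
    """One node of a binary decision or regression tree (hypothetical layout)."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # index of the feature tested at this inner node
        self.threshold = threshold  # split threshold for that feature
        self.left = left            # subtree taken when feature value <= threshold
        self.right = right          # subtree taken when feature value > threshold
        self.value = value          # prediction stored at a leaf (None for inner nodes)

def predict(node, row):
    """Walk from the root to a leaf; the leaf holds the prediction
    (a class label for a decision tree, a numeric value for a regression tree)."""
    while node.value is None:
        node = node.left if row[node.feature] <= node.threshold else node.right
    return node.value
```

In this sketch the only difference between a decision tree and a regression tree is the type of the value stored at the leaves.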
In order to increase accuracy, it is a common approach to use combinations of multiple decision trees or of multiple regression trees for calculating a prediction. Said collections of trees are known as ‘tree ensemble models’ or ‘ensemble models’. The predictions of the individual trees in the ensemble model need to be combined using an appropriate combination scheme, e.g. an unweighted or weighted voting function for decision tree ensembles and an unweighted or weighted averaging function for regression tree ensembles.
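The two combination schemes mentioned above can be sketched as follows. This is an illustrative sketch only; the function names and the weight handling are assumptions for exposition:

```python
from collections import Counter

def combine_classification(votes, weights=None):
    """Unweighted or weighted voting for a decision tree ensemble:
    return the class label with the highest total vote weight."""
    weights = weights or [1.0] * len(votes)
    tally = Counter()
    for label, weight in zip(votes, weights):
        tally[label] += weight
    return tally.most_common(1)[0][0]

def combine_regression(values, weights=None):
    """Unweighted or weighted averaging for a regression tree ensemble."""
    weights = weights or [1.0] * len(values)
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```

With unit weights these reduce to plain majority voting and plain averaging, respectively.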
Applying a single tree model for prediction is typically a fast process, even for refined tree models. Unfortunately, this is not the case for ensemble models, which may comprise thousands of individual trees: the time needed to predict a result using an ensemble of N trees is roughly N times the prediction time of a single tree model. Thus, the gain in accuracy achieved by using multiple trees comes at relatively high computational cost.
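The linear scaling of the prediction cost can be made explicit in a short sketch (illustrative only; `ensemble_predict` and `average` are hypothetical names, and the trees are modeled as plain callables): every one of the N member trees must be evaluated before the combination scheme can be applied.

```python
def ensemble_predict(trees, row, combine):
    """trees: list of N callables, each mapping an input row to a per-tree result;
    combine: the combination scheme, e.g. averaging for a regression ensemble.
    The list comprehension evaluates all N trees, so the prediction time
    grows linearly with the ensemble size."""
    return combine([tree(row) for tree in trees])

def average(results):
    """Unweighted averaging, as used for regression tree ensembles."""
    return sum(results) / len(results)
```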
The computational costs of ensemble tree based prediction are also an obstacle to implementing such algorithms in (analytical) databases: these systems must reserve sufficient processing capacity for executing complex joins over multiple database tables and other computationally demanding tasks, and therefore cannot afford to spend excessive CPU power on tree based prediction.
Some in-database analytics environments, such as IBM Netezza Analytics™, already comprise some decision and regression tree based prediction logic, implemented by means of stored procedures and user-defined functions or aggregates. An overhead is associated with applying said tree-based prediction logic: the input data set on which the different trees of an ensemble model operate has to be stored redundantly, so that tables and index structures for the input data set have to be created and maintained redundantly as well, which increases the handling costs and slows down tree-based predictions in current in-database analytical solutions. Alternatively, the existing single-tree logic may be used to predict with multiple trees by applying all of them sequentially (one at a time); in this case the input data set does not have to be stored redundantly, but overhead is still incurred by the temporary table creation and the stored procedure calls repeated for each tree. Moreover, the input data sets used are often of small or medium size; in this case the handling costs of the input data set and its copies are particularly expensive in terms of memory and CPU consumption relative to the computational costs of the actual prediction.