When learning decision trees for continuous variables (or other types), the scoring criterion used to evaluate how well a tree fits a set of data is often a function of a prior distribution over the parameters of the tree. In many instances, it is desirable that this prior information have as little effect on the resulting tree as possible. For example, assuming normally distributed data, when modeling linear regressions in the leaves of a decision tree, a Bayesian scoring criterion generally requires a prior mean on each input (regressor), a prior mean on the target, and a prior covariance matrix over the inputs and the target. Without knowing the domain of the problem, it is often difficult to anticipate what a useful set of priors will be. For instance, the data could lie in a 0.001 to 0.01 range or in a 100 to 1000 range, and the prior distributions with the least effect are very different in the two cases.
One solution to the above problem is to pre-standardize the data so that each variable has a mean of zero and a standard deviation of one, and then to use a prior mean of zero for all variables and assume a diagonal prior covariance matrix (i.e., assume a priori that all variables are independent). One problem with this solution is that after splitting on a variable in the decision tree, the data that ends up in different leaf nodes may have very different ranges, and therefore the original problem with the parameter prior is merely postponed until later in the learning algorithm. It is also not favorable to shift or scale the data each time a new split is considered in the tree, as the additional shifting and scaling operations will generally cause an enormous reduction in runtime performance. To illustrate scaling and shifting of data, the following example is provided.
A variable x_k can be employed to denote a variable in some domain. The vector x^i denotes the values of a set of variables in the ith case in the data, and x_k^i denotes the value of the variable x_k in the ith case. For example, if the data is:

          x_1   x_2   x_3
Case 1     1     4     5
Case 2     9     8     7

then x^1 = (1, 4, 5) and x_2^2 = 8.
Shifting the n cases x^1, . . . , x^n is defined as subtracting, for each variable x_k, the mean m_k = (1/n) Σ_{i=1}^{n} x_k^i from the value of x_k in each case.
Shifting the data above yields:

          x_1   x_2   x_3
Case 1     −4    −2    −1
Case 2      4     2     1

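The shifting step can be sketched in a few lines of Python. This is a minimal illustration only, assuming the data is held as a list of cases (one list of values per case). The function name `shift` is hypothetical and does not come from the original text.

```python
def shift(cases):
    """Subtract each variable's mean from that variable's value in every case."""
    n = len(cases)
    # zip(*cases) walks the data one variable (column) at a time;
    # means[k] is the mean of variable k over all n cases.
    means = [sum(column) / n for column in zip(*cases)]
    return [[value - m for value, m in zip(case, means)] for case in cases]


data = [[1, 4, 5],   # Case 1
        [9, 8, 7]]   # Case 2

shifted = shift(data)
print(shifted)  # [[-4.0, -2.0, -1.0], [4.0, 2.0, 1.0]]
```

Note that every value in every case is visited once to compute the means and once more to subtract them, so the work grows with the number of cases times the number of variables.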
Scaling the cases is defined as dividing the value of each variable in each case by the standard deviation for that variable. The standard deviation for a variable is defined as:
$$\mathrm{SD}(x_k) = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_k^i - m_k\right)^2}$$
(As can be appreciated, there are alternative formulas for standard deviation.)
In this example, SD(x_1)=4, SD(x_2)=2, and SD(x_3)=1, and thus the scaled data is:

          x_1   x_2   x_3
Case 1    1/4   4/2    5
Case 2    9/4   8/2    7

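The scaling step can be sketched as follows, using the divide-by-n form of the standard deviation that yields the values 4, 2, and 1 above. As in the table, the scaling is applied to the original (unshifted) data. The helper names `population_sd` and `scale` are hypothetical, and the sketch assumes no variable is constant (a standard deviation of zero would cause a division by zero).

```python
import math


def population_sd(values):
    """Standard deviation using the divide-by-n formula."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)


def scale(cases):
    """Divide each variable's value in every case by that variable's SD."""
    # Assumption: every variable has a nonzero standard deviation.
    sds = [population_sd(column) for column in zip(*cases)]
    return [[value / sd for value, sd in zip(case, sds)] for case in cases]


data = [[1, 4, 5],   # Case 1
        [9, 8, 7]]   # Case 2

sds = [population_sd(column) for column in zip(*data)]
scaled = scale(data)
print(sds)     # [4.0, 2.0, 1.0]
print(scaled)  # [[0.25, 2.0, 5.0], [2.25, 4.0, 7.0]]
```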
Generally, standardizing the cases is defined as first shifting and then scaling the data. The result is that each variable has a mean of zero and a standard deviation of one in the data. As can be appreciated, with larger data sets and numbers of cases, and as decision trees grow in complexity, standardizing operations can be quite burdensome in terms of system performance because of the large number of computations required to perform them.
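The sketch below combines both steps and illustrates the difficulty noted earlier: once a split partitions the cases, the subset that reaches a leaf is generally no longer standardized. The four-case data set, the split on the first variable at zero, and the helper names `standardize` and `column_stats` are all assumptions made for illustration. The standard deviation is again the divide-by-n form, which Python's standard library provides as `statistics.pstdev`.

```python
from statistics import fmean, pstdev


def standardize(cases):
    """Shift, then scale, so each variable has mean 0 and SD 1 over all cases."""
    columns = list(zip(*cases))
    means = [fmean(column) for column in columns]
    sds = [pstdev(column) for column in columns]  # divide-by-n SD
    return [[(v - m) / s for v, m, s in zip(case, means, sds)]
            for case in cases]


def column_stats(cases):
    """Return a (mean, SD) pair for each variable."""
    return [(fmean(column), pstdev(column)) for column in zip(*cases)]


# Hypothetical data: four cases, two variables.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
standardized = standardize(data)

# Hypothetical split on the first variable at zero.
left_leaf = [case for case in standardized if case[0] <= 0.0]
right_leaf = [case for case in standardized if case[0] > 0.0]

print(column_stats(standardized))  # mean ~0 and SD ~1 for every variable
print(column_stats(left_leaf))     # nonzero means, SDs well below 1
```

Restoring a mean of zero and a standard deviation of one inside each leaf would require calling `standardize` again on every candidate split, which is the repeated cost described above.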