Techniques for predicting the value of an objective numeric attribute of data in a database (DB) have broad application, for example in the calculation of insurance ratios, the prediction of stock values, and health diagnosis. One such technique uses a tree structure called a regression tree, which is constructed from a large amount of stored data in order to predict the objective numeric attribute. A regression tree is constructed by recursively splitting a data set into subsets according to a specific rule.
FIG. 1 shows an example regression tree for predicting salaries in a client DB. In FIG. 1, n denotes the number of data instances and m denotes the average salary. In this example, 1000 data instances are handled and the average salary is 4800. First, the data are split according to whether or not the age (Age) of the client is greater than 30. The data set for ages greater than 30 contains 650 data instances with an average salary of 5200. The data set for ages equal to or less than 30 contains 350 data instances with an average salary of 3250. Next, the data set for ages greater than 30 is split according to whether or not the balance is smaller than 2000, and the data set for ages equal to or less than 30 is split according to whether or not the number of years employed (Years employed) is greater than 10.
By employing such a tree structure, DB data can be analyzed with respect to a specific numeric attribute, such as salary, and the value of that attribute in future data can be predicted.
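The use of such a tree for prediction can be illustrated with a minimal Python sketch. It encodes only the first split of FIG. 1 (Age greater than 30), because the values of n and m for the deeper nodes are not given in the text. The names `Node`, `predict`, `if_true`, and `if_false`, and the dictionary key `"age"`, are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    n: int                                         # number of data instances in the node
    m: float                                       # average salary in the node
    rule: Optional[Callable[[dict], bool]] = None  # splitting rule; None for an end node
    if_true: Optional["Node"] = None               # subset satisfying the rule
    if_false: Optional["Node"] = None              # subset not satisfying the rule


def predict(node: Node, record: dict) -> float:
    """Follow the splitting rules down to an end node and return its average."""
    while node.rule is not None:
        node = node.if_true if node.rule(record) else node.if_false
    return node.m


# First level of FIG. 1 only; deeper splits are omitted (assumption of this sketch).
root = Node(
    n=1000,
    m=4800,
    rule=lambda r: r["age"] > 30,
    if_true=Node(n=650, m=5200),
    if_false=Node(n=350, m=3250),
)

older = predict(root, {"age": 42})    # falls in the Age > 30 subset
younger = predict(root, {"age": 30})  # falls in the Age <= 30 subset
```

In this truncated tree, a client aged 42 is assigned the average salary of the Age > 30 subset, and a client aged exactly 30 is assigned that of the other subset.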
Generally, a regression tree that satisfies the following conditions is regarded as appropriate: (1) the depth of the tree is small; (2) the number of vertexes (nodes) is small; and (3) at each end node, the mean-square sum of the differences between the value of the numeric attribute to be calculated (hereinafter referred to as the objective numeric attribute) for the data belonging to that node and a representative value for the node (for example, the average) is small. Since the generation of such a tree is very difficult, approximate solutions are required. A representative heuristic method generates the tree from the root. At each step, when a rule for splitting a node is to be selected, the distribution of the objective numeric attribute in each subset that would result from splitting the data set according to each candidate rule is calculated. For example, the mean-squared error after the split is calculated for each rule, and the rule providing the smallest mean-squared error is selected as the splitting rule (or rule for splitting).
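The rule selection step of this heuristic can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes binary rules, uses the average as the representative value of each subset, and takes the mean-squared error over all data in the node being split. The function names, the sample records, and the two candidate rules are hypothetical.

```python
from statistics import fmean


def sum_squared_error(values):
    """Sum of squared differences from the average of values (0.0 if empty)."""
    if not values:
        return 0.0
    mean = fmean(values)
    return sum((v - mean) ** 2 for v in values)


def split_error(rows, rule, target):
    """Mean-squared error of target after splitting rows into two subsets by rule."""
    satisfied = [r[target] for r in rows if rule(r)]
    unsatisfied = [r[target] for r in rows if not rule(r)]
    return (sum_squared_error(satisfied) + sum_squared_error(unsatisfied)) / len(rows)


def best_rule(rows, rules, target):
    """Return the name of the candidate rule with the smallest post-split error."""
    return min(rules, key=lambda name: split_error(rows, rules[name], target))


# Hypothetical sample data, for illustration only.
rows = [
    {"age": 25, "balance": 3000, "salary": 3000},
    {"age": 28, "balance": 1000, "salary": 3200},
    {"age": 40, "balance": 1500, "salary": 5000},
    {"age": 45, "balance": 2500, "salary": 5400},
]
rules = {
    "age > 30": lambda r: r["age"] > 30,
    "balance < 2000": lambda r: r["balance"] < 2000,
}

chosen = best_rule(rows, rules, "salary")
```

For this sample, splitting on age separates the low and high salaries cleanly, so its error is far smaller than that of the balance rule and it is the rule selected. Applying `best_rule` recursively to each resulting subset, until an end condition is met, would grow the tree from the root.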
The final end condition of the tree is determined by the number of data instances, the depth of the tree, and the dispersion of the objective numeric attribute. When the value of an objective numeric attribute W of a tuple t in a database is denoted t[W], the dispersion of a data subset D whose objective numeric attribute has the average $\mu_D$ is expressed as