The present application is related to the following application even dated herewith: Docket number Y0999-085, entitled “Generating Decision-Tree Classifiers With Oblique Hyperplanes,” by inventor Vijay Iyengar, which is incorporated herein by reference in its entirety.
This invention relates to the field of data processing. It is more specifically directed to the field of computer data mining. More particularly, the invention relates to methods and apparatus for generating a regression tree with oblique hyperplanes from data records.
Data mining is the search for valuable information in data. Regression is a form of data mining in which relationships are learned between a set of attributes and a dependent variable. This dependent variable is sometimes referred to as a continuous label to contrast it with the discrete labels used in classification. The learned relationships are then used to predict the value of the dependent variable given values for the set of attributes. Various phenomena can be represented by such relationships. Examples can be found in the financial, insurance and medical domains. The dependence of the total claim amount from an automobile insurance policy on various characteristics like age of drivers, type of car, geographical location, driving history and so on is an example of such a phenomenon. Characteristics like age and type of car are attributes. In the medical domain, the dependence of a diabetic patient's blood glucose test results on the amount of insulin taken, food consumed and exercise performed is another example of such a phenomenon.
The process of generating a regression model uses input data, herein referred to as a training set, which includes multiple records. Each record has values for various attributes, and has a value for the dependent variable. The number of attributes is referred to as the dimensionality of the attribute space. Generally each attribute is also referred to as a dimension. Attributes can be categorical or numeric in nature.
This invention relates to numeric attributes. Regression has wide applications in various domains.
Regression has been studied extensively within several disciplines, including statistics and machine learning. Known regression techniques include statistical algorithms, regression trees, and neural networks. The desired qualities for a regression model include prediction accuracy, speed of model generation, and understandability and intuitiveness of the result.
The tree-based method is chosen as a basis for this invention because of its superior speed of model generation and scalability to high-dimensional problems with large training sets. Regression trees can be separated into two forms depending on the nature of the test at each node of the tree. The simplest form of regression tree has a test of the form (x_i ≤ b), where x_i is the value in the i-th numeric dimension and b is some constant. A more complex form of regression tree allows linear combinations of the attributes in the test at each node. In this case, the test is of the form
(a_1·x_1 + a_2·x_2 + ... + a_n·x_n ≤ b).
These trees, also called oblique trees or trees using oblique hyperplanes, produce better results for some problem domains. This was discussed and demonstrated for classification in “Classification and Regression Trees,” Breiman et al., Chapman and Hall/CRC, 1984, which is hereinafter referred to as “CART”. When applicable, oblique trees produce compact solutions with higher accuracy. While these properties are advantageous, generating oblique trees is difficult because the equation for the complex test at each node is hard to determine.
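The two forms of node test described above can be illustrated with a minimal Python sketch. The function names `axis_parallel_test` and `oblique_test` and the sample record are hypothetical and chosen only for illustration; note that the code uses zero-based attribute indices.

```python
from typing import Sequence


def axis_parallel_test(x: Sequence[float], i: int, b: float) -> bool:
    """Simple test: is the value of the i-th attribute (zero-based) at most b?"""
    return x[i] <= b


def oblique_test(x: Sequence[float], a: Sequence[float], b: float) -> bool:
    """Oblique test: is the linear combination a_1*x_1 + ... + a_n*x_n at most b?"""
    if len(a) != len(x):
        raise ValueError("coefficient vector and record must have the same length")
    return sum(a_k * x_k for a_k, x_k in zip(a, x)) <= b


record = [2.0, 3.0]  # illustrative record with two numeric attributes
print(axis_parallel_test(record, 0, 2.5))      # 2.0 <= 2.5
print(oblique_test(record, [1.0, -1.0], 0.0))  # 2.0 - 3.0 = -1.0 <= 0.0
```

The simple test is the special case of the oblique test in which the coefficient vector has a single 1 and zeros elsewhere, which is why an axis-parallel tree can be viewed as a restricted oblique tree.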
Oblique tree generation methods have been proposed for classification. Some of these oblique tree generation methods use a particular form of an optimization technique to determine the test at each node. These methods are complex and tend to be computationally intensive without any guarantee of improved accuracy.
It is therefore an aspect of the present invention to present a method and apparatus for generating a regression tree with oblique hyperplanes from data records. In an embodiment, the regression tree is generated using an iterative method wherein for each iteration a set of vectors is provided to a regression tree generating process. The regression tree generated uses hyperplanes orthogonal to the vectors provided in the set to separate records at each of its nodes. The iterative process starts out with the set of numeric attribute axes as the set of vectors. At the end of each iteration, pairs of leaf nodes in the generated tree are considered and analyzed to determine new vectors. The set of vectors for the next iteration is determined using a filter process. This iterative process generates multiple regression trees from which one tree is chosen as a solution meeting a particular criterion.
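As a rough illustration of how a tree generating process might use a supplied set of vectors at a single node, the following Python sketch projects each record onto every vector and picks the threshold that gives the lowest total squared error. The squared-error criterion is an assumption borrowed from common regression-tree practice, since this passage does not state the criterion used. The helper names (`dot`, `sse`, `best_split`) and the toy data are hypothetical. The leaf-pair analysis and the filter process are not shown because their details are not given here.

```python
from typing import List, Optional, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def sse(values: List[float]) -> float:
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def best_split(records, labels, vectors) -> Optional[Tuple[float, int, float]]:
    """Return (error, vector_index, threshold) of the lowest-error split.

    Each candidate split is the test dot(v, x) <= b, i.e. a hyperplane
    orthogonal to one of the supplied vectors. Assumed criterion: total
    squared error of the two resulting groups.
    """
    best = None
    for k, v in enumerate(vectors):
        projected = sorted((dot(v, x), y) for x, y in zip(records, labels))
        for j in range(1, len(projected)):
            lo, hi = projected[j - 1][0], projected[j][0]
            if lo == hi:
                continue  # no threshold separates equal projections
            left = [y for _, y in projected[:j]]
            right = [y for _, y in projected[j:]]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, k, (lo + hi) / 2.0)
    return best


# Toy data: the label depends on x_1 + x_2, so no single axis separates it cleanly.
records = [[0.0, 3.0], [1.0, 1.0], [3.0, 0.0], [2.0, 3.0], [3.0, 2.0], [1.0, 4.0]]
labels = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
axes = [[1.0, 0.0], [0.0, 1.0]]      # starting set: the numeric attribute axes
with_oblique = axes + [[1.0, 1.0]]   # hypothetical vector added in a later iteration

print(best_split(records, labels, axes))
print(best_split(records, labels, with_oblique))  # (0.0, 2, 4.0)
```

On this toy data the axis vectors alone cannot reach zero error, whereas adding the vector (1, 1) allows the single test x_1 + x_2 ≤ 4 to separate the two groups exactly, which illustrates why enlarging the vector set between iterations can yield more compact trees.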