1. Field of the Invention
The present invention relates to a system, method, computer program product, and database statement for building decision trees in a database system.
2. Description of the Related Art
Data mining is a technique by which hidden patterns may be found in a group of data. True data mining does not merely change the presentation of data, but actually discovers previously unknown relationships among the data. Data mining is typically implemented as software in, or in association with, database systems. Data mining includes several major steps. First, data mining models are generated by one or more data analysis algorithms. Initially, the models are “untrained”, but are “trained” by processing training data and generating information that defines the model. The generated information is then deployed for use in data mining, for example, by providing predictions of future behavior based on specific past behavior.
One important form of data mining model is the decision tree. Decision trees are an efficient form for representing decision processes for classifying entities into categories or for constructing piecewise constant functions in nonlinear regression. A tree functions in a hierarchical arrangement: data flowing “down” a tree encounters one decision at a time until a terminal node is reached. A particular variable enters the calculation only when it is required at a particular decision node, and only one variable is used at each decision node.
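The traversal described above may be illustrated by the following minimal sketch. The class names, the variable names (income, age), the thresholds, and the labels are all hypothetical and chosen only for illustration; the sketch also assumes numeric threshold tests, which is only one of several possible kinds of decision.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal node: holds the category assigned to the record."""
    label: str


@dataclass
class Decision:
    """Decision node: tests exactly one variable against a threshold."""
    variable: str
    threshold: float
    below: "Node"
    at_or_above: "Node"


Node = Union[Leaf, Decision]


def classify(node, record):
    """Walk down the tree one decision at a time until a leaf is reached.

    Returns the leaf label and the list of variables actually consulted.
    """
    visited = []
    while isinstance(node, Decision):
        visited.append(node.variable)
        value = record[node.variable]  # variable is read only when needed
        node = node.below if value < node.threshold else node.at_or_above
    return node.label, visited


# Hypothetical two-level tree (illustrative values only).
tree = Decision(
    "income", 50000.0,
    below=Leaf("decline"),
    at_or_above=Decision("age", 30.0, below=Leaf("review"), at_or_above=Leaf("approve")),
)

# "age" is absent from this record, yet classification succeeds because
# the path taken never reaches the node that tests it.
short_path = classify(tree, {"income": 40000.0})
long_path = classify(tree, {"income": 60000.0, "age": 45.0})
```

In this sketch the first record is classified without the age variable ever being read, which reflects the property that a variable enters the calculation only at the decision node that requires it.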
Classification is a well-known and extensively researched problem in the realm of data mining. It has found diverse applications in areas such as targeted marketing, customer segmentation, fraud detection, and medical diagnosis. Among the methods proposed, decision trees are popular for modeling data for classification purposes. The primary goal of classification methods is to learn the relationship between a target attribute and many predictor attributes in the data. Given instances (records) of data where the predictors and targets are known, the modeling process attempts to glean any relationships between the predictor and target attributes. Subsequently, the model is used to provide a prediction of the target attribute for data instances where the target value is unknown and some or all of the predictors are available.
Some of the problems in the classification process (or in machine learning generally) arise from noisy and/or irrelevant predictors and from very high-cardinality predictors, that is, predictors having a large number of distinct values. Noisy or irrelevant predictors can often mask the real predictors, resulting in useless or, worse, misleading models. High-cardinality categorical predictors can impose significant computational demands and can also result in over-fitting, a problem in which the models learn all the quirks in the data used for learning but generalize very poorly and are useless for other instances of data.
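The over-fitting problem may be illustrated by the following minimal sketch, which is not a decision tree but a simple lookup model that memorizes the majority target value for each distinct predictor value. The attribute names (customer_id, region, churn), the records, and the fallback value "no" are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def train_lookup(rows, predictor, target):
    """Memorize the majority target value for each distinct predictor value."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[predictor]][row[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def accuracy(model, rows, predictor, target, default):
    """Fraction of rows predicted correctly; unseen values fall back to default."""
    hits = sum(model.get(row[predictor], default) == row[target] for row in rows)
    return hits / len(rows)


def make_rows(triples):
    return [{"customer_id": i, "region": r, "churn": c} for i, r, c in triples]


# Hypothetical data: customer_id is unique per row (very high cardinality),
# whereas region takes only two distinct values.
train = make_rows([(1, "N", "yes"), (2, "N", "yes"), (3, "N", "no"),
                   (4, "S", "no"), (5, "S", "no"), (6, "S", "yes")])
held_out = make_rows([(7, "N", "yes"), (8, "N", "yes"),
                      (9, "S", "no"), (10, "S", "no")])

id_model = train_lookup(train, "customer_id", "churn")
region_model = train_lookup(train, "region", "churn")

id_train = accuracy(id_model, train, "customer_id", "churn", "no")
id_held_out = accuracy(id_model, held_out, "customer_id", "churn", "no")
region_train = accuracy(region_model, train, "region", "churn", "no")
region_held_out = accuracy(region_model, held_out, "region", "churn", "no")
```

Under these assumptions the model built on the unique identifier reproduces the training data perfectly, yet on the held-out records it can only return the fallback value, because none of their identifiers were seen during training. The model built on the low-cardinality predictor fits the training data less closely but carries over to the held-out records.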
Various approaches have been researched and proposed to deal with noisy predictors. Most of these involve some form of pre-filtering based on relevance. For dealing with high-cardinality predictors, some form of discretization or binning is generally employed. These schemes more often than not result in some loss of information. A need arises for a technique by which binning can be performed that provides useful models, but which reduces the information loss of the model and reduces the introduction of false information artifacts.
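For illustration only, two conventional binning schemes of the kind referred to above may be sketched as follows: merging infrequent categories into a single bin, and dividing a numeric range into equal-width intervals. These are generic schemes and are not the technique of the present invention; the function names and sample values are hypothetical.

```python
from collections import Counter


def top_n_bins(values, n, other="OTHER"):
    """Keep the n most frequent categories; merge all others into one bin."""
    keep = {v for v, _ in Counter(values).most_common(n)}
    return [v if v in keep else other for v in values]


def equal_width_bins(values, bin_count):
    """Map each numeric value to the index of an equal-width interval."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bin_count or 1.0  # guard against a zero-width range
    # The maximum value would land in bin `bin_count`, so clamp it.
    return [min(int((v - lo) / width), bin_count - 1) for v in values]


# Hypothetical sample data.
cities = ["NYC", "NYC", "NYC", "LA", "LA", "Austin", "Boise"]
ages = [18, 25, 40, 41, 79, 80]

binned_cities = top_n_bins(cities, 2)
binned_ages = equal_width_bins(ages, 3)

# Information loss: distinct values that collapse into the same bin can no
# longer be told apart by any model trained on the binned data.
distinct_before = len(set(cities))
distinct_after = len(set(binned_cities))
```

In this sketch, "Austin" and "Boise" become indistinguishable after binning, as do the ages 18 and 25. If those values happened to relate differently to the target attribute, that distinction is lost, which is the information loss noted above.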