1. Field of the Invention
The present invention relates to a system, method, computer program product, for representing, and a representation of, decision trees in a relational database system.
2. Description of the Related Art
Data mining is a technique by which hidden patterns may be found in a group of data. True data mining doesn't just change the presentation of data, but actually discovers previously unknown relationships among the data. The patterns thus discovered are represented as models. Data mining is typically implemented as software in or in association with database systems. Data mining includes several major steps. First, data mining models are generated based on one or more data analysis algorithms. Initially, the models are “untrained”, but are “trained” by processing training data and extracting information that defines the model. The extracted information represented as a model is then deployed for use in data mining, for example, by providing predictions of future behavior based on patterns of past behavior.
One important form of data mining model is the decision tree. Decision trees are an efficient form of representing decision processes for classifying entities into categories, or constructing piecewise constant functions in nonlinear regression. A tree functions in a hierarchical arrangement; data flowing “down” a tree encounters one decision at a time until a terminal node is reached. A particular variable enters the calculation only when it is required at a particular decision node.
Classification is a well-known and extensively researched problem in the realm of Data Mining. It has found diverse applications in areas of targeted marketing, customer segmentation, fraud detection, and medical diagnosis among others. Among the methods proposed, decision trees are popular for modeling data for classification purposes. The primary goal of classification methods is to learn the relationship between a target attribute and many predictor attributes in the data. Given instances (records) of data where the predictors and targets are known, the modeling process attempts to glean any relationships between the predictor and target attributes. Subsequently, the model is used to provide a prediction of the target attribute for data instances where the target value is unknown and some or all of the predictors are available.
Classification using decision trees is a well-known technique that has been around for a long time. However, expressing this functionality in standard Structured Query Language (SQL), the native language of the Relational Database Management System (RDBMS), is difficult, and it naturally leads to extremely inefficient execution by making use of operations that are not designed to handle this particular type of workload. In addition, current systems require the user to extract the data from the RDBMS into a data mining specific engine and then invoke decision tree algorithms. A need arises for a technique by which classification functionality using decision trees may be expressed in SQL that provides improved ease of use and implementation, as well as improved efficiency of execution.