1. Field of the Invention
The present invention relates to a method, system, and computer program product for counting predictor-target pairs for a decision tree model, which provides the capability to generate count tables more quickly and efficiently than previous techniques.
2. Description of the Related Art
Data mining is a technique by which hidden patterns may be found in a group of data. True data mining does not just change the presentation of data, but actually discovers previously unknown relationships among the data. Data mining is typically implemented as software in, or in association with, database systems. Data mining includes several major steps. First, data mining models are generated by one or more data analysis algorithms. Initially, the models are “untrained”, but are “trained” by processing training data and generating information that defines the model. The generated information is then deployed for use in data mining, for example, by providing predictions of future behavior based on specific past behavior.
One important form of data mining model is the decision tree. Decision trees are an efficient form for representing decision processes for classifying entities into categories or constructing piecewise constant functions in nonlinear regression. A tree functions in a hierarchical arrangement; data flowing “down” a tree encounters one decision at a time until a terminal node is reached. A particular variable enters the calculation only when it is required at a particular decision node and only one variable is used at each decision node.
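The traversal described above can be illustrated with a minimal Python sketch. The class names `Node` and `Leaf`, the function `classify`, and the sample attributes `age` and `income` are hypothetical and chosen only for illustration; the sketch also assumes numeric predictors with a simple threshold test at each decision node.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Leaf:
    """Terminal node: holds the predicted category (hypothetical structure)."""
    prediction: str


@dataclass
class Node:
    """Decision node: tests exactly one variable against a threshold."""
    attribute: str
    threshold: float
    left: Union["Node", Leaf]   # followed when value <= threshold
    right: Union["Node", Leaf]  # followed when value > threshold


def classify(tree, record):
    """Send a record "down" the tree, one decision at a time."""
    node = tree
    while isinstance(node, Node):
        # The variable is read only when this decision node requires it.
        value = record[node.attribute]
        node = node.left if value <= node.threshold else node.right
    return node.prediction


# Illustrative tree: split on age first, then on income for older records.
tree = Node("age", 30.0,
            Leaf("low"),
            Node("income", 50000.0, Leaf("medium"), Leaf("high")))

# "income" is never consulted for this record, so it may be absent.
young = classify(tree, {"age": 25})
older = classify(tree, {"age": 45, "income": 80000})
```

In this sketch the first record reaches a terminal node after a single decision, so its `income` value is never needed, which reflects the point that a variable enters the calculation only at the decision node that uses it.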
Classification is a well-known and extensively researched problem in the realm of data mining. It has found diverse applications in areas such as targeted marketing, customer segmentation, fraud detection, and medical diagnosis, among others. Among the methods proposed, decision trees are popular for modeling data for classification purposes. The primary goal of classification methods is to learn the relationship between a target attribute and many predictor attributes in the data. Given instances (records) of data where the predictors and targets are known, the modeling process attempts to glean any relationships between the predictor and target attributes. Subsequently, the model is used to provide a prediction of the target attribute for data instances where the target value is unknown and some or all of the predictors are available.
Classification using decision trees is a well-known technique that has been around for a long time. However, the early decision tree algorithms worked well only on small amounts of data and did not scale to large datasets. Most of the well-known algorithms for building decision trees, such as SLIQ, SPRINT, RainForest, and BOAT, construct count tables to find splitting attributes and split points. Count tables store record counts for every (predictor value, target value) pair at every node in the tree. As the build process goes deeper in the tree, constructing these count tables becomes very expensive in terms of computing resources and time. A need arises for a technique by which such counting can be performed more quickly and efficiently.
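To make the notion of a count table concrete, the following Python sketch counts (predictor value, target value) pairs for the records at a single tree node. It is a straightforward illustration only, not the technique of the present invention and not the implementation used by any of the algorithms named above; the function name `build_count_tables` and the sample attributes `outlook`, `windy`, and `play` are hypothetical.

```python
from collections import Counter, defaultdict


def build_count_tables(records, predictors, target):
    """Return one count table per predictor for the records at one node.

    Each table maps a (predictor value, target value) pair to the number
    of records at the node that carry that pair.
    """
    tables = defaultdict(Counter)
    for record in records:
        target_value = record[target]
        for predictor in predictors:
            tables[predictor][(record[predictor], target_value)] += 1
    return tables


# Illustrative records assumed to have reached one node of the tree.
records = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "sunny", "windy": "no",  "play": "yes"},
]

tables = build_count_tables(records, ["outlook", "windy"], "play")
```

Each record is visited once per predictor, so the cost of this direct approach is proportional to the number of records times the number of predictors at a node, and the work is repeated for every node as the tree is built.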