Data mining is the exploration and analysis of large quantities of data, in order to discover correlations, patterns, and trends in the data. Data mining may also be used to create models that can be used to predict future data or classify existing data.
For example, a business may amass a large collection of information about its customers. This information may include purchasing information and any other information available to the business about the customer. The predictions of a model associated with customer data may be used, for example, to control customer attrition, to perform credit-risk management, to detect fraud, or to make decisions on marketing.
Intelligent cross-selling support may be provided. For example, the data mining functionality may be used to suggest items that a user might be interested in by correlating properties of the user, or items the user has ordered, with a database of items that other users have ordered previously. Users may be segmented based on their behavior or profile. Data mining allows the analysis of segment models to discover the characteristics that partition users into population segments. Additionally, missing values in user profile data may be predicted. For example, where a user did not supply data, the value for that data may be predicted.
To create and test a data mining model, available data may be divided into two parts. One part, the training data set, may be used to create models. The rest of the data, the testing data set, may be used to test the model, and thereby determine the accuracy of the model in making predictions. Once a data mining model has been created, it may be used to make predictions regarding data in other data sets.
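By way of illustration only, the division of available data into a training data set and a testing data set may be sketched as follows. The function name `split_cases`, the `customer_id` field, the fixed random seed, and the default training fraction of 0.7 are assumptions made for this sketch and are not required by any particular data mining tool.

```python
import random


def split_cases(cases, train_fraction=0.7, seed=0):
    """Shuffle a copy of the cases and split it into (training, testing).

    train_fraction and seed are illustrative defaults, not prescribed values.
    """
    shuffled = list(cases)                 # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: ten cases, one per customer.
cases = [{"customer_id": i} for i in range(10)]
training, testing = split_cases(cases)
```

Shuffling before the split is one way to avoid a training data set that reflects only the order in which cases were collected.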
Data within data sets is grouped into cases. For example, with customer data, each case may correspond to a different customer. Data in a case describes or is otherwise associated with one customer. One type of data that may be associated with a case (for example, with a given customer) is a categorical variable. A categorical variable categorizes the case into one of several pre-defined states. For example, one such variable may correspond to the educational level of a customer. The possible values of a categorical variable are known as states. For instance, the states of a marital status variable may be “married” or “unmarried” and may correspond to the marital state of the customer. Another kind of variable is a continuous variable. A continuous variable is one with a range of possible values. For example, one such variable may correspond to the age of a customer. Associated with the age variable is a range of possible values for the variable.
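The distinction between categorical and continuous variables may be illustrated with a minimal sketch of a single case. The class name `CustomerCase`, the field names, and the age bounds of 0 to 130 are hypothetical and chosen only for this example; the two marital status states are those named above.

```python
from dataclasses import dataclass

# The pre-defined states of the categorical variable.
MARITAL_STATES = ("married", "unmarried")

# Assumed bounds for the continuous variable (illustrative only).
AGE_RANGE = (0.0, 130.0)


@dataclass
class CustomerCase:
    marital_status: str  # categorical: must be one of MARITAL_STATES
    age: float           # continuous: any value within AGE_RANGE

    def __post_init__(self):
        if self.marital_status not in MARITAL_STATES:
            raise ValueError(f"unknown state: {self.marital_status!r}")
        low, high = AGE_RANGE
        if not (low <= self.age <= high):
            raise ValueError(f"age out of range: {self.age!r}")


case = CustomerCase(marital_status="married", age=42.5)
```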
As mentioned, available data is partitioned into two groups: a training data set and a testing data set. Often 70% of the data is used for training and 30% for testing. Suppose, for example, that a model is to predict the age of a customer. The model may be trained on the training data set, which includes the age data along with the other data for each case. Once a model is trained, it may be run on the testing data set for evaluation. During this testing, the model will be given all of the data except the age data, and asked to predict the customer's age given the other data. After training and evaluation, the model may be used on other data sets.
When the model has been run on the testing data set, the results produced by the model are compared to the actual testing data to see how successful the model was at correctly predicting the age of the customer.
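The evaluation described above may be sketched as follows. The model used here, which predicts the mean training age for each marital status, is a deliberately simple stand-in chosen for this illustration and is not the data mining model of any particular tool. The function names, the sample cases, and the use of mean absolute error as the measure of success are likewise assumptions.

```python
from collections import defaultdict


def train_mean_age_model(training):
    """Learn the mean age for each marital status state in the training data."""
    totals = defaultdict(lambda: [0.0, 0])  # state -> [sum of ages, count]
    for case in training:
        entry = totals[case["marital_status"]]
        entry[0] += case["age"]
        entry[1] += 1
    return {state: total / count for state, (total, count) in totals.items()}


def mean_absolute_error(model, testing, fallback_age):
    """Predict each held-out age and compare it with the actual value."""
    errors = []
    for case in testing:
        # The model sees every field except the age it is asked to predict.
        inputs = {key: value for key, value in case.items() if key != "age"}
        predicted = model.get(inputs["marital_status"], fallback_age)
        errors.append(abs(predicted - case["age"]))
    return sum(errors) / len(errors)


# Hypothetical cases, invented for this sketch.
training = [
    {"marital_status": "married", "age": 40},
    {"marital_status": "married", "age": 50},
    {"marital_status": "unmarried", "age": 20},
    {"marital_status": "unmarried", "age": 30},
]
testing = [
    {"marital_status": "married", "age": 47},
    {"marital_status": "unmarried", "age": 21},
]

model = train_mean_age_model(training)
error = mean_absolute_error(model, testing, fallback_age=35.0)
```

With these sample cases the model predicts 45 for married customers and 25 for unmarried customers, so the errors on the two testing cases are 2 and 4 years, for a mean absolute error of 3 years.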
When the model has been run, a graphical representation of the model as applied to the data set may be produced. FIG. 1 is an example of a decision tree graph displaying the result of applying a data mining model to a data set. The graph displays the results of applying a data mining model in order to predict the ages for a specific group of cases from the data set. Each case in the model exists in one or more nodes of the graph. For example, the root node 1142 of the graph is labeled “all” and contains all of the cases in the graph. Nodes can be described in terms of “levels” where the leaves with the longest path from root to leaf are level zero nodes in the decision tree, and the parent of a level n node is a level n+1 node. With this terminology, root node 1142 is a level four node.
One level below the root node 1142 are level three nodes 1132 and 1134. The cases are divided among these nodes based on the marital status in each case. The groups of cases represented by the nodes are further subdivided based on a value for a “Capitalgain” variable into four level two nodes 1122, 1124, 1126, and 1128. A further division is made to the cases represented in level two node 1122 based on an “Educationnum” variable into two level one nodes 1112 and 1113. A further division is made to the cases represented in level two node 1126 based on the “Educationnum” variable into two level one nodes 1114 and 1116. And a further division is made to the cases represented in level two node 1128 based on a “Hoursperweek” variable into two level one nodes 1118 and 1119. Cases in the level one nodes 1114 and 1116 are further divided on the basis of an age variable into level zero nodes 1102 and 1104 (for level one node 1114) and into level zero nodes 1106 and 1108 (for level one node 1116).
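The level terminology may be made concrete with a short sketch that computes the level of each node from the parent-to-child structure of the graph. The node numerals are those of FIG. 1 as described above. The assignment of level two nodes 1122 and 1124 to node 1132, and of nodes 1126 and 1128 to node 1134, is an assumption made for this sketch, since the description does not state which level three node is the parent of which level two node. The function name `node_levels` is hypothetical.

```python
# Parent -> children, using the reference numerals of FIG. 1.
CHILDREN = {
    1142: [1132, 1134],  # root "all", split on marital status
    1132: [1122, 1124],  # assumed pairing (not stated in the description)
    1134: [1126, 1128],  # assumed pairing (not stated in the description)
    1122: [1112, 1113],  # split on "Educationnum"
    1126: [1114, 1116],  # split on "Educationnum"
    1128: [1118, 1119],  # split on "Hoursperweek"
    1114: [1102, 1104],  # split on age
    1116: [1106, 1108],  # split on age
}


def node_levels(children, root):
    """Return {node: level}, where the deepest leaves are level zero and
    the parent of a level n node is a level n+1 node."""
    depth = {root: 0}
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, ()):
            depth[child] = depth[node] + 1
            stack.append(child)
    max_depth = max(depth.values())
    # A node's level is its distance above the deepest leaves.
    return {node: max_depth - d for node, d in depth.items()}


levels = node_levels(CHILDREN, root=1142)
```

Under this definition a leaf is not necessarily a level zero node: node 1124 has no children but is a level two node, because its level is fixed by its distance from the root rather than by its own descendants.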
This graph presents a visual representation of the application of a mining model to a data set. Other graphs, such as cluster maps, also present such visual representations of the application of a mining model to a data set. In some graphical displays, each node includes an informational bar or other display which contains information regarding the cases contained in the node.
While this information may be useful, there may be a need to find more information regarding the cases contained in the node. Some programs which implement the graphing of the results of the application of a data mining model to a data set allow a user to access data from a node. Such existing solutions are closed and proprietary to the data mining program being used. They provide no extensibility or generality for such access, and the functionality is tied to the tool being used to generate and display the graph. However, providing a user with the ability to use a broad range of applications to store data sets, apply data mining models, and display data mining graphs is desirable, in order to provide flexibility to the user. Thus, there is a need for the ability to access data from a data set corresponding to data graphically displayed for a data mining model as applied to the data set, regardless of the application being used to store data sets, apply data mining models, and display data mining graphs.