1. Field of the Invention
The present invention relates generally to data visualization and data mining.
2. Related Art
Many data mining tasks require classification of data into classes. Typically, a classifier classifies the data into the classes. For example, loan applications can be classified into either "approve" or "disapprove" classes. The classifier provides a function that maps (classifies) a data item (instance) into one of several predefined classes. More specifically, the classifier predicts one attribute of a set of data given one or more attributes. For example, in a database of iris flowers, a classifier can be built to predict the type of iris (iris-setosa, iris-versicolor or iris-virginica) given the petal length, petal width, sepal length and sepal width. The attribute being predicted (in this case, the type of iris) is called the label, and the attributes used for prediction are called the descriptive attributes.
A classifier is generally constructed by an inducer. The inducer is an algorithm that builds the classifier from a training set. The training set consists of records with labels. The training set is used by the inducer to "learn" how to construct the classifier as shown in FIG. 1. Once the classifier is built, it can be used to classify unlabeled records as shown in FIG. 2.
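The relationship between an inducer and the classifier it produces can be illustrated with a minimal sketch. The example below assumes a simple nearest-neighbor inducer, which is chosen only for brevity and is not the inducer of any particular product. The function name induce_nearest_neighbor, the attribute names, and the record values are hypothetical.

```python
import math

def induce_nearest_neighbor(training_set, label):
    """Hypothetical inducer: builds a classifier (a function) from labeled records."""
    # Separate each training record into its descriptive attributes and its label.
    examples = [
        ({k: v for k, v in record.items() if k != label}, record[label])
        for record in training_set
    ]

    def classifier(record):
        # Predict the label of the closest training record (Euclidean distance).
        def distance(attrs):
            return math.sqrt(sum((attrs[k] - record[k]) ** 2 for k in attrs))
        _, predicted = min(examples, key=lambda example: distance(example[0]))
        return predicted

    return classifier

# Made-up training records with labels (illustrative values only).
training_set = [
    {"petal_length": 1.4, "petal_width": 0.2, "iris_type": "iris-setosa"},
    {"petal_length": 4.7, "petal_width": 1.4, "iris_type": "iris-versicolor"},
    {"petal_length": 6.0, "petal_width": 2.5, "iris_type": "iris-virginica"},
]

classify = induce_nearest_neighbor(training_set, "iris_type")
prediction = classify({"petal_length": 1.5, "petal_width": 0.3})
```

In this sketch the inducer is the outer function and the classifier is the function it returns, so the training set is consulted only through the built classifier.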
Inducers require a training set, which is a database table containing attributes, one of which is designated as the class label. The label attribute type must be discrete (e.g., binned values, character string values, or a small number of integer values). FIG. 3 shows several records from a sample training set.
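As one illustration of how a continuous attribute might be made discrete so that it can serve as a label, the following sketch bins a numeric value into named ranges. The function name bin_value, the bin edges, and the bin names are assumptions introduced for this example only.

```python
import bisect

def bin_value(value, edges, names):
    """Map a continuous value to a discrete bin name.

    edges must be ascending; a value equal to an edge falls in the higher bin.
    names must contain exactly one more entry than edges.
    """
    return names[bisect.bisect_right(edges, value)]

# Hypothetical bin edges and names for an income attribute.
edges = [30000, 60000]
names = ["low", "medium", "high"]

labels = [bin_value(v, edges, names) for v in (25000, 45000, 60000)]
```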
Once a classifier is built, it can classify new records as belonging to one of the classes. These new records must be in a table that has the same attributes as the training set; however, the table need not contain the label attribute. For example, if a classifier for predicting iris_type is built, we can apply the classifier to records containing only the descriptive attributes, and a new column is added with the predicted iris type.
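The step of applying a classifier to unlabeled records can be sketched as follows. The helper apply_classifier, the rule inside toy_classifier, and the record values are hypothetical and stand in for whatever classifier an inducer has actually produced.

```python
def apply_classifier(classifier, records, label):
    """Return copies of the records with a new column holding the predicted label."""
    return [{**record, label: classifier(record)} for record in records]

def toy_classifier(record):
    # Hypothetical stand-in for an induced classifier: a single threshold rule.
    return "iris-setosa" if record["petal_length"] < 2.5 else "iris-versicolor"

# Unlabeled records: same descriptive attributes as the training set, no label column.
unlabeled = [
    {"petal_length": 1.3, "petal_width": 0.2},
    {"petal_length": 4.5, "petal_width": 1.5},
]

labeled = apply_classifier(toy_classifier, unlabeled, "iris_type")
```

The original records are left unchanged, and each returned record carries the added iris_type column.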
In a marketing campaign, for example, a training set can be generated by running the campaign in one city and generating label values according to the responses in that city. A classifier can then be induced, and campaign mail can be sent only to those people in a larger population, such as the entire U.S., whom the classifier labels as likely to respond. Such targeted mailing can yield substantial cost savings.
A well known type of classifier is the Decision-Tree classifier. The Decision-Tree classifier assigns each record to a class. The Decision-Tree classifier is induced (generated) automatically from data. The data, which is made up of records and a label associated with each record, is called the training set.
Decision-Trees are commonly built by recursive partitioning. A univariate (single attribute) split is chosen for the root of the tree using some criterion (e.g., mutual information, gain-ratio, gini index). The data is then divided according to the test, and the process repeats recursively for each child. After a full tree is built, a pruning step is executed which reduces the tree size.
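Recursive partitioning of this kind can be sketched in a few lines. The example below assumes numeric attributes, binary threshold tests, and the gini index as the split criterion; it omits the pruning step entirely. All function names and the sample records are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini index of a list of labels (0.0 when all labels agree)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Return the (attribute, threshold) that most reduces weighted gini, or None."""
    best, best_score = None, gini(labels)
    for attr in rows[0]:
        for threshold in sorted({row[attr] for row in rows}):
            left = [l for row, l in zip(rows, labels) if row[attr] <= threshold]
            right = [l for row, l in zip(rows, labels) if row[attr] > threshold]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score:
                best, best_score = (attr, threshold), score
    return best

def build_tree(rows, labels):
    """Choose a univariate split, divide the data, and recurse on each child."""
    split = best_split(rows, labels)
    if split is None:  # pure node, or no split improves the criterion
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    attr, threshold = split
    left = [i for i, row in enumerate(rows) if row[attr] <= threshold]
    right = [i for i, row in enumerate(rows) if row[attr] > threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left]),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right]),
    }

def classify(tree, record):
    """Follow the tests from the root down to a leaf."""
    while "leaf" not in tree:
        branch = "left" if record[tree["attr"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]

# Made-up training records (illustrative values only).
rows = [
    {"petal_length": 1.4, "petal_width": 0.2},
    {"petal_length": 1.5, "petal_width": 0.3},
    {"petal_length": 4.5, "petal_width": 1.5},
    {"petal_length": 4.7, "petal_width": 1.4},
    {"petal_length": 6.0, "petal_width": 2.5},
    {"petal_length": 5.8, "petal_width": 2.2},
]
labels = ["iris-setosa"] * 2 + ["iris-versicolor"] * 2 + ["iris-virginica"] * 2

tree = build_tree(rows, labels)
```

Because a split is accepted only when it strictly lowers the weighted gini index, recursion stops at pure nodes and at nodes where no single-attribute test helps; such nodes become leaves labeled with the majority class.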
A major problem associated with decision-tree classifiers is that they are difficult to visualize using available visualizers. This is especially true of large decision-trees with thousands of nodes. Current two-dimensional displays are very limited. One product by AT&T called Dotty provides a two-dimensional visualization, but interaction is limited to simple scrolling of the canvas. FIG. 4A shows an example of a small decision-tree generated by Dotty. FIG. 4B shows an example of a large decision-tree generated by Dotty.
Another conventional technique generates a decision-tree as a simple two-dimensional ASCII display. FIG. 4C shows a simple ASCII display generated by a conventional visualizer. It is difficult to comprehend from the two-dimensional ASCII display how much data was used to build parts of the decision-tree. Also, it is difficult to analyze the data from the ASCII display. A large decision-tree is a complex structure that typically has thousands of nodes. Such a large decision-tree typically does not fit on a display screen. Although graphical displays used to visualize decision-tree classifiers can scroll, they do not provide a good solution for large trees with thousands of nodes. This is because it is extremely difficult to follow a path in a tree having hundreds of nodes. Also, it is difficult to see the big picture when a tree has thousands of nodes. Other products that contain two-dimensional visualizations are SAS, SPLUS, Angoss' Knowledge Seeker, and IBM's Intelligent Data Miner.