1. Field of the Invention
The present invention relates generally to data mining and, more specifically, to a system and a method for selecting important attributes in a database.
2. Related Art
In many applications, it is desirable to identify the important factors that contribute to a phenomenon, whether they do so directly or in conjunction with other factors.
Specifically, in visual data mining, humans view the data in different graphical ways (e.g., scatterplots). When there are many attributes or features (attributes and features are used interchangeably hereinafter) in a record, it is not clear which axes to plot. A system that can decide for users the important attributes to assign to the axes for a given goal is very valuable.
In data mining applications, a problem often faced by an induction method (also called an induction algorithm) is how to focus its attention on relevant attributes when there is a predetermined label attribute. This problem is also known as attribute selection or feature selection. It is well known that the prediction accuracy of induction methods degrades when extraneous variables are present. In other words, when an induction method is faced with many attributes that are not necessary for predicting the desired label attribute, prediction accuracy decreases.
Practical machine learning algorithms, including top-down induction of decision tree algorithms such as ID3 ("Induction of Decision Trees," J. R. Quinlan, Machine Learning, vol. 1, 1986), C4.5 (C4.5: Programs for Machine Learning, J. R. Quinlan, 1993), CART (Classification and Regression Trees, L. Breiman, J. H. Friedman, R. A. Olshen and C. Stone, 1984), and instance-based algorithms such as IBL ("Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques," B. V. Dasarathy, IEEE Computer Society Press, 1990; "Instance-Based Learning Algorithms," D. W. Aha, D. Kibler and M. K. Albert, Machine Learning, vol. 6, 1991), are known to degrade in performance (e.g., prediction accuracy) when faced with many attributes that are not necessary for predicting the desired output. This problem was also discussed in "Learning Boolean Concepts in the Presence of Many Irrelevant Features," H. Almuallim and T. G. Dietterich, Artificial Intelligence, vol. 69, 1994.
For example, running C4.5 in default mode on the Monk 1 problem (Thrun et al. 1991), which has three irrelevant attributes, generates a tree with 15 interior nodes, five of which test irrelevant attributes. The generated tree has an error rate of 24.3%, which is reduced to 11.1% if only the three relevant attributes are given. Aha (1991) noted that "IB3's storage requirement increases exponentially with the number of irrelevant attributes" (IB3 is a nearest-neighbor algorithm that attempts to save only important prototypes). Likewise, IB3's performance degrades rapidly with irrelevant features.
Simply stated, the problem of attribute selection is that of finding a subset of the original attributes such that an induction algorithm run on training data containing only those attributes generates a classifier with the highest possible accuracy. Note that attribute selection chooses a subset of the existing attributes and does not construct new attributes. Thus, there is no attribute extraction or construction (Kittler 1986, Rendell & Seshu 1990).
Consider a database of cars where each record has the following attributes:
mpg
cylinders
horsepower
weight
time zero_to_sixty
year
brand
origin
Suppose "origin" is selected as the chosen label attribute. We would like to know the importance of the remaining attributes with respect to the chosen label attribute (the remaining attributes are dependent variables of the chosen label attribute). In other words, we would like to know how well the remaining attributes discriminate the chosen label attribute. For example, the attribute "year" does not discriminate the label attribute "origin" well, i.e., the year of make of an automobile does not predict its country of origin. Thus, "year" is not an important attribute for the label attribute "origin." On the other hand, the attribute "brand" discriminates the label attribute "origin" quite well, i.e., the brand of an automobile correlates well with the country of origin. Thus, "brand" is an important attribute.
Furthermore, it would be desirable to identify the best three (or any other number) attributes for discriminating the label attribute. For example, given the label attribute "origin", it would be desirable to identify the best three attributes from the remaining attributes that discriminate the label attribute. Finally, it would be desirable to rank the remaining attributes based on how well they discriminate the label attribute.
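One way to make such a ranking concrete is to score each attribute by its empirical mutual information with the label attribute. The following Python sketch is illustrative only: the toy car records, the function names mutual_information and rank_attributes, and the choice of mutual information as the score are assumptions made for this example. It handles discrete attribute values only.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed (x, y) pairs of p(x,y) * log2( p(x,y) / (p(x) p(y)) ).
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def rank_attributes(records, label):
    """Return (attribute, score) pairs sorted from most to least informative."""
    ys = [r[label] for r in records]
    names = [a for a in records[0] if a != label]
    scores = [(a, mutual_information([r[a] for r in records], ys)) for a in names]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy records; the values are invented for illustration.
cars = [
    {"brand": "ford",   "year": 70, "origin": "US"},
    {"brand": "ford",   "year": 71, "origin": "US"},
    {"brand": "toyota", "year": 70, "origin": "Japan"},
    {"brand": "toyota", "year": 71, "origin": "Japan"},
]

ranking = rank_attributes(cars, label="origin")
for name, score in ranking:
    print(f"{name}: {score:.3f} bits")
```

In this toy data, "brand" determines "origin" completely and therefore ranks first, while "year" is distributed identically across both origins and carries no information about the label.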
In a Naive-Bayes classifier, the importance of each attribute is computed independently. However, if the importance of attributes is computed independently, correlations among attributes are not properly captured. This is undesirable in databases that have several correlated attributes. For example, suppose that, in a database of computers, each record has several attributes, such as price, performance, etc. Suppose further that the price of a computer is an important indicator of its performance. In other words, if the performance of a computer is the chosen label attribute, the price of a computer is an important attribute that discriminates the chosen label attribute well. Also suppose that, among the several attributes, three are indicative of price and are therefore individually important and highly correlated: the price in U.S. Dollars, in Israeli Shekels and in French Francs. While Dollar might be a good attribute individually, it is not as important together with Shekel and Franc because the three are highly correlated. Thus, the best set of three attributes is not necessarily composed of the attributes that rank highest individually. If two attributes give the price in Dollar and in Shekel, they rank equally when considered alone; however, once one of them is chosen, the other adds no discriminatory power to the set of best attributes. In Naive-Bayes, these three attributes will be selected first because they are equally important individually. However, since they are essentially equivalent, the second and third provide no additional information. Thus, Naive-Bayes does not perform satisfactorily when there are correlated attributes.
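The price example above can be reproduced on a small scale. The Python sketch below compares two selection strategies: ranking attributes individually, and greedily adding the attribute that most increases the mutual information between the selected set and the label. It is a minimal illustration under stated assumptions: the eight computer records, the currency figures, the "memory" attribute, and the function names are all hypothetical, and the greedy procedure is a generic technique chosen for this example rather than a method described in this document.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y), in bits, of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def top_k_individually(columns, label, k):
    """Score every attribute on its own and keep the k highest."""
    return sorted(columns,
                  key=lambda a: mutual_information(columns[a], label),
                  reverse=True)[:k]


def greedy_select(columns, label, k):
    """Repeatedly add the attribute that yields the most informative set."""
    selected = []
    while len(selected) < k:
        def joint_score(name):
            # Treat the tuple of values of the candidate set as one variable.
            joint = list(zip(*(columns[a] for a in selected + [name])))
            return mutual_information(joint, label)
        candidates = [a for a in columns if a not in selected]
        selected.append(max(candidates, key=joint_score))
    return selected


# Hypothetical data: the three price columns are the same quantity in
# different currencies (invented figures), so they are perfectly correlated.
columns = {
    "price_usd":    [1000, 1000, 1000, 1000, 2000, 2000, 2000, 2000],
    "price_shekel": [3500, 3500, 3500, 3500, 7000, 7000, 7000, 7000],
    "price_franc":  [6000, 6000, 6000, 6000, 12000, 12000, 12000, 12000],
    "memory":       ["small", "large", "small", "large",
                     "small", "small", "large", "large"],
}
performance = ["low", "low", "low", "low", "mid", "mid", "high", "high"]

top_individual = top_k_individually(columns, performance, k=2)
greedy = greedy_select(columns, performance, k=2)
print("individual ranking:", top_individual)
print("greedy selection:  ", greedy)
```

Individually, each price column is more informative than "memory", so the individual ranking fills both slots with price columns. The greedy selection takes one price column and then "memory", because a second price column adds nothing once the first has been chosen.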
Although many statistical measures for attribute selection exist, most are limited to either numeric or discrete attributes but not both, while others are based on strong distribution assumptions. Many real databases contain categorical attributes (e.g., state, eye-color, hair-color), and distribution assumptions may not be easily specified.
A feature selection method was described by the inventors of this application in "Feature Subset Selection Using the Wrapper Method: Overfitting and Dynamic Search Space Topology," R. Kohavi and D. Sommerfield, First International Conference on Knowledge Discovery and Data Mining (1995). The wrapper method, however, does not allow user interaction, so users cannot influence the choice of attributes. Moreover, the wrapper method is very slow. As discussed in the paper, it required 29 hours of execution time to find the most important attributes in a DNA dataset. Furthermore, the wrapper method does not use mutual information to identify important features.
As a result, there is a need for a system and a method for determining how well various attributes in a record discriminate different values of a chosen label. There is also a need for a system and a method to enable data mining systems to focus on relevant attributes while ignoring the irrelevant attributes.