Decision learning is an important aspect of systems incorporating artificial intelligence. Many statistical, symbolic, connectionist and case-based learning algorithms have been used in a variety of applications with reasonable success. While most of the stable decision learning methods perform well on domains with relevant information, their performance can degrade in the presence of irrelevant or redundant information.
Selective or focused learning presents a solution to this problem. A key component of selective learning is selective attention through feature selection. The aim of feature selection is to find a subset of features for learning that yields better performance than learning with all features. Note that feature subset selection chooses a set of features from the existing features and does not construct new ones; that is, there is no feature extraction or construction.
From a purely decision theoretical standpoint, the question of which features to use is not of much interest. A Bayes rule, or a Bayes classifier, is a rule that predicts the most probable outcome for a given instance, based on the full distribution assumed to be known. The accuracy of the Bayes rule is the highest possible accuracy, and it is mostly of theoretical interest. The optimal Bayes rule is monotonic, i.e., adding features cannot decrease accuracy, and hence restricting a Bayes rule to a subset of features is never advised.
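The Bayes rule and its monotonicity can be stated compactly. The notation below is introduced here for illustration only: X_S denotes an instance restricted to a feature subset S, Y denotes the outcome, and Acc(S) denotes the accuracy of the Bayes rule that sees only the features in S.

```latex
\[
h^{*}_{S}(x_S) = \arg\max_{y} P(Y = y \mid X_S = x_S),
\qquad
\mathrm{Acc}(S) = \mathbb{E}_{X_S}\!\left[\, \max_{y} P(Y = y \mid X_S) \,\right]
\]
\[
S \subseteq T \;\Longrightarrow\; \mathrm{Acc}(S) \le \mathrm{Acc}(T)
\]
```

The inequality holds because conditioning on additional features can only raise the expected maximum posterior probability. It assumes the true distribution is known, which is the assumption that fails in practice.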
In practical learning scenarios, however, we face two problems: the learning algorithms are not given access to the underlying distribution, and most practical algorithms attempt to find a hypothesis by approximating NP-hard optimization problems. The first problem is closely related to the bias-variance tradeoff: one must trade off estimating more parameters (bias reduction) against estimating these parameters accurately (variance reduction). This problem is independent of the computational power available to the learner. The second problem, that of finding a “best” hypothesis, is usually intractable and thus poses an added computational burden. For example, decision tree induction algorithms usually attempt to find a small tree that fits the data well, yet finding the optimal binary decision tree is NP-hard. For artificial neural networks, the problem is even harder; the problem of loading a three-node neural network with a training set is NP-hard even if the nodes simply compute linear threshold functions.
Because useful features are highly application specific, there is no general theory for designing an effective feature set. There are a number of prior art approaches to feature subset selection. A filter approach attempts to assess the merits of features from the data alone, ignoring the learning algorithm; it selects features in a preprocessing step. In contrast, a wrapper approach includes the learning algorithm as a part of its evaluation function.
One filter approach, the FOCUS algorithm (Almuallim H. and Dietterich T. G., Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2):279-306, 1994.), exhaustively examines all subsets of features to select the minimal subset of features that is sufficient to determine the label. It has severe implications when applied blindly without regard for the resulting induced concept. For example, in a medical diagnosis task, a set of features describing a patient might include the patient's social security number (SSN). When FOCUS searches for the minimum set of features, it could pick the SSN as the only feature needed to uniquely determine the label. Given only the SSN, any learning algorithm is expected to generalize poorly.
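The SSN pitfall can be reproduced with a minimal sketch of a FOCUS-style search. This is a simplified illustration and not the published implementation: the function names, the toy patient records, and the SSN values are hypothetical, and a subset is treated as sufficient when no two training instances agree on it while carrying different labels.

```python
from itertools import combinations

def is_consistent(rows, labels, subset):
    """True if no two rows agree on `subset` yet carry different labels."""
    seen = {}
    for row, label in zip(rows, labels):
        key = tuple(row[i] for i in subset)
        if seen.setdefault(key, label) != label:
            return False
    return True

def focus(rows, labels):
    """Exhaustively search subsets by increasing size; return the first consistent one."""
    n_features = len(rows[0])
    for size in range(n_features + 1):
        for subset in combinations(range(n_features), size):
            if is_consistent(rows, labels, subset):
                return subset
    # Reached only if identical rows carry different labels (inconsistent data).
    return tuple(range(n_features))

# Hypothetical records: (ssn, fever, cough); the label depends on fever and cough together.
patients = [(1001, 0, 0), (1002, 0, 1), (1003, 1, 0), (1004, 1, 1)]
diagnosis = [0, 1, 1, 0]

with_ssn = focus(patients, diagnosis)                            # (0,): the SSN alone
without_ssn = focus([p[1:] for p in patients], diagnosis)        # (0, 1): fever and cough
```

With the SSN column present, the search stops at the SSN because every value is unique; once it is removed, both clinical features are required. The exhaustive search is exponential in the number of features, so the sketch is suitable only for very small feature sets.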
Another filter approach, the Relief algorithm (I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In L. De Raedt and F. Bergadano, editors, Proc. European Conf. on Machine Learning, pages 171-182, Catania, Italy, 1994. Springer-Verlag), assigns a “relevance” weight to each feature. The Relief algorithm attempts to find all weakly relevant features but does not help with redundant features. In real applications, many features are highly correlated with the decision outcome; thus many are (weakly) relevant and will not be removed by Relief.
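The inability of Relief to discard redundant features can be seen in a minimal sketch of the basic weight update. The sketch makes several simplifying assumptions that are not taken from the cited work: features are numeric, every training instance is visited once instead of being randomly sampled, a single nearest hit and nearest miss are used, and the function name and toy data are hypothetical.

```python
def relief(rows, labels):
    """Basic Relief-style weights: reward features that differ on the nearest miss
    and penalize features that differ on the nearest hit."""
    n, d = len(rows), len(rows[0])
    # Feature ranges scale each per-feature difference into [0, 1].
    spans = [max(r[f] for r in rows) - min(r[f] for r in rows) or 1.0 for f in range(d)]

    def diff(f, a, b):
        return abs(a[f] - b[f]) / spans[f]

    def dist(a, b):
        return sum(diff(f, a, b) for f in range(d))

    weights = [0.0] * d
    for i, (x, y) in enumerate(zip(rows, labels)):
        hits = [r for j, (r, l) in enumerate(zip(rows, labels)) if j != i and l == y]
        misses = [r for r, l in zip(rows, labels) if l != y]
        if not hits or not misses:
            continue  # cannot update without both a same-class and other-class neighbor
        hit = min(hits, key=lambda r: dist(x, r))
        miss = min(misses, key=lambda r: dist(x, r))
        for f in range(d):
            weights[f] += (diff(f, x, miss) - diff(f, x, hit)) / n
    return weights

# Hypothetical data: feature 1 is an exact copy of feature 0; feature 2 is noise.
samples = [(0, 0, 0.1), (0, 0, 0.9), (1, 1, 0.2), (1, 1, 0.8)]
classes = [0, 0, 1, 1]
relief_weights = relief(samples, classes)
```

In this toy data the duplicated feature receives exactly the same positive weight as the original, while the noise feature receives a negative weight. A threshold on the weights would therefore keep both copies, which is the redundancy problem described above.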
The main disadvantage of the filter approach is that it totally ignores the effects of the selected feature subset on the performance of the learning algorithm. It is desirable to select an optimal feature subset with respect to a particular learning algorithm, taking into account its heuristics, biases, and tradeoffs.
A wrapper approach (R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2), 1997) conducts a search in the space of feature subsets and includes the learning algorithm as a part of its evaluation function. The wrapper schemes perform some form of state space search and select or remove the features that maximize an objective function. The subset of features selected is then evaluated using the target learner. The process is repeated until no further improvement is made or until adding or deleting features reduces the accuracy of the target learner. Wrappers might provide better learning accuracy but are computationally more expensive than the filter methods.
It has been shown that neither the filter nor the wrapper approach is inherently better and that any practical feature selection algorithm needs to at least consider the learning algorithm used for classification and the metric used for evaluating the learning algorithm's performance (Tsamardinos, I. and C. F. Aliferis. Towards Principled Feature Selection: Relevancy, Filters, and Wrappers. in Ninth International Workshop on Artificial Intelligence and Statistics. 2003. Key West, Fla., USA.).
Unfortunately, no computationally feasible method is available that takes into account the learning algorithm used for classification, and there is no adequate metric for evaluating the learning algorithm's performance.
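The wrapper loop can be sketched as a greedy forward search in which every candidate subset is scored by running the target learner. The choices below are assumptions made for illustration and are not prescribed by the cited work: the target learner is a 1-nearest-neighbor classifier, the objective is leave-one-out accuracy, the starting score is the majority-class frequency, and the function names and toy data are hypothetical.

```python
def loo_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbor learner restricted to `subset`."""
    correct = 0
    for i, x in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((x[f] - rows[j][f]) ** 2 for f in subset),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def forward_select(rows, labels):
    """Greedily add the feature that most improves the learner; stop when none does."""
    n_features = len(rows[0])
    selected = []
    # Baseline to beat: always predicting the majority class.
    best = max(labels.count(c) for c in set(labels)) / len(labels)
    while True:
        candidates = [f for f in range(n_features) if f not in selected]
        scored = [(loo_accuracy(rows, labels, selected + [f]), f) for f in candidates]
        if not scored:
            break
        # Highest score wins; ties go to the lowest feature index.
        score, feature = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break
        selected.append(feature)
        best = score
    return selected, best

# Hypothetical data: feature 0 separates the classes, feature 1 is misleading noise.
points = [(0.0, 5.0), (0.1, 1.0), (0.2, 4.0), (1.0, 1.2), (1.1, 4.1), (1.2, 5.1)]
targets = [0, 0, 0, 1, 1, 1]
chosen, accuracy = forward_select(points, targets)   # ([0], 1.0)
```

Each step retrains and re-evaluates the learner once per remaining feature, so the cost grows with both the number of features and the cost of the learner. This is the computational expense attributed to wrappers above.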
The decision tree is a popular and powerful non-parametric decision learning and data mining method. Since a decision tree typically contains only a subset of the available features, decision trees can be used for feature selection. A prior art method introduced a semiflexible prediction method for feature selection using decision trees. Cardie (C. Cardie. “Using Decision Trees to Improve Case-Based Learning.” In P. Utgoff, editor, Proceedings of the Tenth International Conference on Machine Learning, pages 25-32, University of Massachusetts, Amherst, Mass. Morgan Kaufmann, 1993) used a decision tree algorithm to select a subset of features for a nearest neighbor algorithm. The decision tree thus serves as the filter for the nearest neighbor algorithm. However, when faced with many irrelevant features, the hierarchical induction of decision tree algorithms is known to degrade in performance (decision accuracy).
The decision tree learning methods have been increasingly used in pattern recognition and data mining applications. They automatically select a subset of the features through the learning algorithms. Therefore, most prior art decision tree learning methods do not incorporate a feature selection step prior to the decision tree learning. The feature used at each node of a decision tree learning algorithm is selected from all possible features based on the data pertaining to the node. The decision tree learning methods use a divide and conquer method and therefore suffer from data fragmentation and small sample sizes at deeper nodes. Thus, the features and decisions in the deep nodes can be unstable and are highly susceptible to noise. Furthermore, the features selected in the deeper nodes are highly dependent on the initial division of the tree, which could have bias or context switching effects. A different division in the first few nodes could yield completely different feature choices and tree structures. Furthermore, learning from the complete feature set without prior selection could dramatically slow down the decision tree learning process, since all features have to be considered at each node.
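The idea of using a tree as a filter can be sketched by growing a small tree and keeping only the features it tests. The sketch below is an ID3-style illustration for categorical features and is not necessarily the algorithm used in the cited work; the function names, the gain threshold, and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def tree_features(rows, labels, available=None):
    """Grow an ID3-style tree; return the set of feature indices tested anywhere in it."""
    if available is None:
        available = set(range(len(rows[0])))
    if len(set(labels)) <= 1 or not available:
        return set()  # pure node, or nothing left to split on: this is a leaf

    def gain(f):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[f], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    best = max(sorted(available), key=gain)
    if gain(best) <= 1e-12:
        return set()  # no feature reduces uncertainty at this node
    used = {best}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        used |= tree_features(
            [rows[i] for i in idx], [labels[i] for i in idx], available - {best}
        )
    return used

# Hypothetical data: the label is (feature 0 AND feature 1); feature 2 is irrelevant.
cases = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1),
         (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
outcomes = [0, 0, 0, 0, 0, 0, 1, 1]
kept = tree_features(cases, outcomes)   # {0, 1}
```

A nearest neighbor learner would then compute distances over the kept columns only. Because each split is chosen greedily on a shrinking portion of the data, the set of kept features inherits whatever instability the tree induction has.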
It is highly desirable to have a robust feature selection method that is optimized for hierarchical decision learning methods and can overcome the above difficulties.
Information integration methods for decision regulation in a hierarchic decision system are disclosed in U.S. patent application Ser. No. 09/972,057 entitled “Regulation of Hierarchic Decisions in Intelligent Systems” and U.S. patent application Ser. No. 10/081,441 entitled “Information Integration Method for Decision Regulation in Hierarchic Decision Systems”. The decision regulation method separates noise from application domain consistent characteristics and integrates multiple sources of information to create a robust ranking of decision features and results that works well for the application despite application dynamics and errors in the database (U.S. patent application Ser. No. 10/609,490 entitled “Dynamic Learning and Knowledge Representation for Data Mining”). However, these methods do not include feature selection.