The following relates to the informational, classification, clustering, data storage, and related arts.
Categorization is a useful operation in the informational arts. In a supervised learning approach, a training set of objects, such as documents, images, or so forth, is provided with pre-determined class labels. Features of the objects are extracted, and a classifier is trained to identify class members based on characteristic features identified from the training set. In some approaches, the class labels may not be provided a priori but rather extracted by grouping together objects of the training set with similar sets of features. This is sometimes referred to as unsupervised learning or clustering.
The computational complexity of categorization increases rapidly with increasing numbers of objects in the training set, with increasing number of features, and with increasing number of classes. For multi-class problems, a substantially sized training set and a substantial number of features are typically employed to provide sufficient information from which to differentiate amongst the multiple classes. Thus, multi-class problems are by nature generally computationally intensive.
One way to reduce this complexity is to reduce the number of features under consideration. By reducing the number of features, advantages such as faster learning and prediction, easier interpretation, and improved generalization are typically obtained. However, the removal of features should be done in a way that does not adversely impact the classification accuracy. Accordingly, one would generally like to filter out irrelevant or redundant features.
Irrelevant features are those which provide negligible distinguishing information. For example, if the objects are all dogs, cats, or squirrels, and it is desired to classify each new animal into one of these three classes, the feature of color may be irrelevant if dogs, cats, and squirrels each have about the same distribution of brown, black, and tan fur colors. In such a case, knowing that an input animal is brown provides negligible distinguishing information for classifying the animal as a cat, dog, or squirrel. Features which are irrelevant for a given classification problem are not useful, and accordingly a feature that is irrelevant can be filtered out.
Redundant features are those which provide distinguishing information, but are cumulative to another feature or group of features that provide substantially the same distinguishing information. Using the previous example, consider illustrative “diet” and “domestication” features. Dogs and cats both have similar carnivorous diets, while squirrels consume nuts and so forth. Thus, the “diet” feature can efficiently distinguish squirrels from dogs and cats, although it provides little information to distinguish between dogs and cats. Dogs and cats are also both typically domesticated animals, while squirrels are wild animals. Thus, the “domestication” feature provides substantially the same information as the “diet” feature, namely distinguishing squirrels from dogs and cats but not distinguishing between dogs and cats. Thus, the “diet” and “domestication” features are cumulative, and one can identify one of these features as redundant so as to be filtered out. However, unlike irrelevant features, care should be taken with redundant features to ensure that one retains enough of the redundant features to provide the relevant distinguishing information. In the foregoing example, one may wish to filter out either the “diet” feature or the “domestication” feature, but if one removes both the “diet” feature and the “domestication” feature then useful distinguishing information is lost.
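The distinction between irrelevant and redundant features in the foregoing example can be illustrated numerically. The following is a minimal sketch only, assuming a hypothetical nine-animal data set and assuming that mutual information (in bits) is used to quantify distinguishing information; neither the data values nor the helper names `entropy`, `mutual_information`, and `conditional_mutual_information` come from the text above.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))


# Hypothetical toy data: three dogs, three cats, three squirrels.
labels = ["dog"] * 3 + ["cat"] * 3 + ["squirrel"] * 3
color = ["brown", "black", "tan"] * 3            # same distribution in every class
diet = ["meat"] * 6 + ["nuts"] * 3               # separates squirrels only
domestication = ["domestic"] * 6 + ["wild"] * 3  # separates squirrels only

# Irrelevant: color carries essentially no information about the class.
mi_color = mutual_information(color, labels)

# Relevant: diet and domestication each carry the same amount of information.
mi_diet = mutual_information(diet, labels)
mi_domestication = mutual_information(domestication, labels)

# Redundant: once diet is known, domestication adds nothing further.
extra_from_domestication = conditional_mutual_information(
    domestication, labels, diet)

print(f"I(color; class)                = {mi_color:.4f}")
print(f"I(diet; class)                 = {mi_diet:.4f}")
print(f"I(domestication; class)        = {mi_domestication:.4f}")
print(f"I(domestication; class | diet) = {extra_from_domestication:.4f}")
```

In this toy data, either of "diet" or "domestication" can be dropped without loss, but dropping both leaves only "color", which provides no distinguishing information.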
Existing feature filtering techniques are generally effective at identifying and filtering out irrelevant features. Identifying irrelevant features is relatively straightforward because each feature can be considered in isolation. On the other hand, identifying redundant features has been found to be more difficult, since the analysis entails comparing distinguishing information provided by different features and selecting a sub-set of the redundant features for removal.
Another issue with feature filtering is scalability. Some filtering techniques are effective for a relatively small number of features, but perform less well as the feature set size increases. This is disadvantageous for multi-class problems where the feature set size is usually selected to be relatively large in order to provide sufficient information to provide effective differentiation amongst the multiple classes.
One feature filtering technique is known as the fast correlation-based filtering (FCBF) technique. In the FCBF approach, irrelevant features are first identified and filtered out, and the remaining features are ranked by relevance. Redundant features are then identified using an approximate Markov blanket configured to identify, for a given candidate feature, whether any other feature is both (i) more correlated with the set of classes than the candidate feature is and (ii) more correlated with the candidate feature than the candidate feature is with the set of classes. If both conditions (i) and (ii) are satisfied, then the candidate feature is identified as a redundant feature and is filtered out.
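The two-stage FCBF procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than a definitive implementation: symmetric uncertainty is assumed as the correlation measure, the relevance threshold value is arbitrary, condition (ii) is read as comparing the correlation between the two features against the candidate feature's correlation with the classes, and the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(xs, ys):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    mutual_info = hx + hy - entropy(list(zip(xs, ys)))
    return 2.0 * mutual_info / (hx + hy)


def fcbf_sketch(features, labels, threshold=0.01):
    """Return the names of the features kept by an FCBF-style filter.

    features:  dict mapping feature name -> list of discrete values
    labels:    list of class labels, same length as each feature column
    threshold: assumed relevance cutoff below which a feature is irrelevant
    """
    # Stage 1: drop irrelevant features, rank the rest by relevance.
    relevance = {name: symmetric_uncertainty(col, labels)
                 for name, col in features.items()}
    ranked = sorted((n for n in relevance if relevance[n] > threshold),
                    key=lambda n: relevance[n], reverse=True)

    # Stage 2: approximate Markov blanket test against kept features.
    # Condition (i) holds by construction: every kept feature was ranked
    # at least as high as the candidate. Only condition (ii) is checked.
    selected = []
    for candidate in ranked:
        redundant = any(
            symmetric_uncertainty(features[kept], features[candidate])
            >= relevance[candidate]
            for kept in selected)
        if not redundant:
            selected.append(candidate)
    return selected


# Hypothetical toy data: three dogs, three cats, three squirrels.
labels = ["dog"] * 3 + ["cat"] * 3 + ["squirrel"] * 3
features = {
    "color": ["brown", "black", "tan"] * 3,
    "diet": ["meat"] * 6 + ["nuts"] * 3,
    "domestication": ["domestic"] * 6 + ["wild"] * 3,
}

kept = fcbf_sketch(features, labels)
print(kept)
```

On this toy data the sketch discards "color" as irrelevant in the first stage and discards "domestication" as redundant with "diet" in the second stage, keeping only "diet".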
The FCBF technique has been found to be generally effective for binary classification problems, and is scalable to large feature set sizes. However, the FCBF filtering technique has been found to be less effective for multi-class problems in that it sometimes filters out too many features, leading to a loss of valuable information.