1. Field of the Invention
The invention relates generally to data classification. More specifically, the invention relates to estimating and evaluating classifier performance.
2. Description of Related Art
A classification algorithm, or classifier, is a method of determining the class to which a sample belongs based on a set of one or more features. A “class” is simply an attribute of the sample, which in some cases is well-defined. For example, the classes “tumor” and “normal” could be used in the design of a classifier for diagnosing whether a particular tissue sample is cancerous. Useful features for detecting cancer might include gene and/or protein expression levels measured from the tissue samples. Although the applications of classification algorithms are numerous, this diagnosis example underscores the need for accurate methods of evaluating classifier performance in certain domains.
A “sample” is any type of data. A “feature” is an aspect or characteristic of the data. For example, if the sample is a tissue, a feature of the tissue may be its protein expression level. If the sample is a credit card application, a feature of the application might be the age or income of the applicant.
For applications similar to the diagnosis example above, supervised learning techniques have been employed for the design of the classifier. Supervised learning is a kind of machine learning in which the learning algorithm is provided with a set of inputs along with the corresponding correct outputs. Learning involves the algorithm comparing its actual output with the correct, or target, output to determine its error, and then adjusting itself accordingly. Such techniques are used to learn the relationship between independent features of a sample and a designated dependent attribute (i.e., the class of the sample). Most induction algorithms fall into the supervised learning category. By contrast, unsupervised learning signifies a mode of machine learning in which the system is not told the “right answer”; for example, it is not trained on pairs consisting of an input and the desired output. Instead, the system is given the input patterns and is left to find interesting patterns, regularities, or clusterings among them. Clustering algorithms are usually unsupervised.
Supervised learning techniques utilize a training dataset in which the classes of the samples are known. Under ideal circumstances, a classifier designed to correctly classify the data in the training dataset will also perform well on test data not contained in the training dataset. Such classifiers are said to generalize well. In practice, numerous complications can impact the generalization performance of a classifier. Thus, evaluating a classifier based on training data performance alone is ill-advised.
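By way of illustration only, the gap between training data performance and generalization performance can be sketched in a few lines of Python. The sketch below is not drawn from any particular prior art system. The single synthetic feature, the class means, the sample counts, and the function names (`make_samples`, `nearest_neighbor_classify`, `accuracy`) are all assumptions chosen for the example. A nearest-neighbor classifier is used because it memorizes its training dataset, which makes the effect easy to see.

```python
import random

def make_samples(n, rng):
    """Draw n labeled samples with one synthetic feature.

    The class means (0.0 and 1.0) and unit standard deviation are
    hypothetical values chosen so that the two classes overlap.
    """
    samples = []
    for _ in range(n):
        label = rng.choice(["normal", "tumor"])
        mean = 0.0 if label == "normal" else 1.0
        samples.append((rng.gauss(mean, 1.0), label))
    return samples

def nearest_neighbor_classify(train, x):
    """Return the class of the training sample whose feature is closest to x."""
    return min(train, key=lambda s: abs(s[0] - x))[1]

def accuracy(train, data):
    """Fraction of samples in data that the classifier labels correctly."""
    correct = sum(1 for x, y in data if nearest_neighbor_classify(train, x) == y)
    return correct / len(data)

rng = random.Random(0)
train = make_samples(30, rng)    # small training dataset with known classes
test = make_samples(1000, rng)   # data not contained in the training dataset

train_accuracy = accuracy(train, train)  # each sample is its own nearest neighbor
test_accuracy = accuracy(train, test)

print(f"training accuracy: {train_accuracy:.3f}")
print(f"test accuracy:     {test_accuracy:.3f}")
```

Because every training sample is its own nearest neighbor, the training accuracy is perfect, while the accuracy on the held-out samples is limited by the overlap between the two classes. A performance figure computed from the training dataset alone would therefore be overly optimistic in this example.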
Many techniques have been developed in an effort to produce a more robust estimate of expected classifier performance. Most popular among these are cross-validation methods. In these methods, the training dataset is partitioned into a design dataset and a test dataset. A classifier is computed from the design dataset, and classifier performance is evaluated on the ability of the classifier to correctly classify the test dataset samples. The process is repeated for several distinct partitions of the training dataset, and an overall performance estimate is calculated from the collective test dataset results. The underlying principle at work is that measures of performance derived from data not included in the design process are apt to be more robust. However, it has been observed that these cross-validation measures are not accurate in some circumstances. For example, an underlying assumption in the design of classification algorithms is that the training data accurately represents the population as a whole, i.e., that the class probability densities can be estimated from the training data. In many applications, the limited amount of data available for training renders this assumption invalid. As a result, classifiers generated based on such training data can be ineffective.
“Class probability density” refers to the distribution of feature values within a particular class. For example, a single feature of a protein may have a density defined by a Gaussian with a mean of 1.232 and a standard deviation of 0.287. In more than one dimension, correlations can exist between features. These correlations are also captured by the class probability density.
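The cross-validation procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation. One class draws its single feature from the Gaussian mentioned above (mean 1.232, standard deviation 0.287). The second class, its mean of 1.8, the nearest-class-mean classifier, the choice of five partitions, and all function names are hypothetical and introduced only for the example.

```python
import random
from statistics import mean

def make_folds(samples, k, rng):
    """Shuffle the training dataset and split it into k disjoint folds."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def train_centroids(design):
    """Compute the mean feature value of each class in the design dataset."""
    by_class = {}
    for x, label in design:
        by_class.setdefault(label, []).append(x)
    return {label: mean(values) for label, values in by_class.items()}

def classify(centroids, x):
    """Assign x to the class whose mean feature value is closest."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))

def cross_validate(samples, k, rng):
    """Estimate accuracy by testing on each fold with a classifier
    computed from the remaining folds (the design dataset)."""
    folds = make_folds(samples, k, rng)
    correct = 0
    for i, test_fold in enumerate(folds):
        design = [s for j, fold in enumerate(folds) if j != i for s in fold]
        centroids = train_centroids(design)
        correct += sum(1 for x, y in test_fold if classify(centroids, x) == y)
    # Overall estimate from the collective test dataset results.
    return correct / len(samples)

rng = random.Random(0)
# Class "A" uses the Gaussian from the text; class "B" is a hypothetical
# second class added so that there is something to discriminate.
samples = [(rng.gauss(1.232, 0.287), "A") for _ in range(20)]
samples += [(rng.gauss(1.8, 0.287), "B") for _ in range(20)]

estimate = cross_validate(samples, k=5, rng=rng)
print(f"cross-validated accuracy estimate: {estimate:.3f}")
```

With only twenty samples per class, the class means computed from each design dataset are themselves noisy estimates of the underlying class probability densities, so the resulting performance estimate can vary noticeably from one draw of the training data to another.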