A classifier is a process or device that categorizes inputs. For example, a quality-control classifier may take measurements of an item as inputs and return a result of "defective" or "not defective." Another example could be an iceberg-detecting classifier that takes as input a satellite image of an ocean area and produces as output either "iceberg" or "no iceberg."
As another example, consider an email router system designed to be deployed in a crisis relief situation. The system users can be divided into groups according to interests, including "medical", "food", and "transport." For each group, a classifier could be designed to determine whether an input email is of interest to the group. When an email is received by the system, each classifier can be applied, and the classifier results can determine which groups are sent the email. For example, an email detailing when blood supplies will arrive at an airport should be routed to the members of the "medical" and "transport" groups.
A "nearest neighbor" classifier is based on a set of inputs for which the categories are known. These input-output pairs are called the "in-sample" examples. The nearest neighbor classifier also includes a means of determining some measure of distance between inputs. When an "out-of-sample" input is encountered, the nearest neighbor classifier determines which in-sample example has the closest input and returns the corresponding output.
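For illustration, the following is a minimal sketch of a nearest neighbor classifier in Python; the function and variable names are illustrative and are not part of the original description.

```python
def nearest_neighbor_classify(in_sample, distance, x):
    """Return the category of the in-sample example whose input is
    closest to the out-of-sample input x.

    in_sample -- list of (input, category) pairs
    distance  -- function measuring distance between two inputs
    """
    nearest_input, nearest_category = min(
        in_sample, key=lambda example: distance(example[0], x)
    )
    return nearest_category
```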
For the email router example, a nearest neighbor classifier for the medical group could consist of email messages from previous crises, with each message categorized as either "medically interesting" or "not medically interesting." These are the in-sample examples. The distance measure between pairs of emails could be based on the number of words shared by the emails. To determine whether an out-of-sample email should be routed to medical personnel, the classifier would find the closest in-sample email. If that in-sample email is categorized as "medically interesting", then the classifier would categorize the out-of-sample email as "medically interesting."
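One illustrative form such a distance measure could take (a sketch under the assumption that more shared words means more similar emails, not a prescribed implementation) is the negated count of shared words:

```python
def email_distance(email_a, email_b):
    """Distance based on shared words: the more words two emails have
    in common, the smaller the distance. The negated-overlap form used
    here is one illustrative choice among many."""
    words_a = set(email_a.lower().split())
    words_b = set(email_b.lower().split())
    return -len(words_a & words_b)
```

For example, `email_distance("blood supplies arrive tomorrow", "supplies arrive by truck")` returns -2, a smaller distance than that between emails sharing only one word.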
Accuracy is a major concern when developing a classifier, selecting among classifiers, and deciding whether to deploy a classifier. The probability that a classifier will incorrectly classify an input drawn at random is known as the expected error rate. (The expected error rate is also called the error probability.) Methods to estimate the expected error rate are called validation techniques. Effective validation techniques are needed so that classifiers can be selected for low expected error rates and so that classifier results can be used with appropriate levels of trust.
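In symbols (notation introduced here for illustration, not taken from the original text): for a classifier g and an input-output pair (X, Y) drawn at random from the underlying distribution D, the expected error rate is

```latex
% Expected error rate (error probability) of classifier g,
% with (X, Y) drawn at random from the underlying distribution D.
% The symbols g, X, Y, and D are illustrative notation.
\Pr_{(X, Y) \sim D}\bigl[\, g(X) \neq Y \,\bigr]
```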
We now describe the prior art. One method to validate a nearest neighbor classifier is to remove some examples from the in-sample data to form a validation data set, use the remaining examples to form the classifier, and then use the error rate of the classifier over the validation data as an estimate of the expected error rate. The drawback of this method is that the resulting classifier does not include the validation examples. On average, having fewer examples in a classifier results in a higher expected error rate. Hence, the primary goal of forming an accurate classifier suffers in order to accomplish the secondary goal of assessing its accuracy.
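A sketch of this prior validation method in Python (illustrative; it reuses the hypothetical `nearest_neighbor_classify` helper sketched above):

```python
import random

def withheld_validation_estimate(in_sample, distance, num_validation):
    """Withhold num_validation examples for validation; estimate the
    expected error rate of the classifier formed from the remainder.
    Note the drawback described above: the resulting classifier omits
    the withheld validation examples."""
    examples = list(in_sample)
    random.shuffle(examples)
    validation = examples[:num_validation]
    remaining = examples[num_validation:]
    errors = sum(
        1 for x, category in validation
        if nearest_neighbor_classify(remaining, distance, x) != category
    )
    return errors / len(validation)
```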
There are two prior validation techniques to obtain probabilistic bounds on the expected error rates for classifiers formed using all in-sample examples. Both methods use in-sample examples to validate classifiers based on subsets of the remaining in-sample examples, then bound the difference in error rates between the classifiers consisting of subsets of the in-sample examples and classifiers consisting of all in-sample examples. One method is based on the "leave-one-out" estimate, which is the error rate obtained by classifying each in-sample example with the classifier consisting of the remaining in-sample examples. This method is described in "Distribution-Free Inequalities for the Deleted and Holdout Estimates" by L. Devroye and T. Wagner, published in IEEE Transactions on Information Theory, Volume 25, 1979, pages 202 to 207, which is incorporated by reference in its entirety. The other method is based on the "holdout" estimate, which is the error rate obtained by applying the classifier consisting of a subset of the in-sample examples to the remaining in-sample examples. This method is also described in the paper "Distribution-Free Inequalities for the Deleted and Holdout Estimates."
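For concreteness, a sketch of the leave-one-out estimate in Python (illustrative; it again assumes the hypothetical `nearest_neighbor_classify` helper):

```python
def leave_one_out_estimate(in_sample, distance):
    """Classify each in-sample example using the classifier formed from
    the remaining in-sample examples; return the fraction misclassified."""
    errors = 0
    for i, (x, category) in enumerate(in_sample):
        remaining = in_sample[:i] + in_sample[i + 1:]
        if nearest_neighbor_classify(remaining, distance, x) != category:
            errors += 1
    return errors / len(in_sample)
```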
The invention is a computer-based system to obtain probabilistic bounds on the expected error rate for a nearest neighbor classifier. The estimate used in the bound is an extension of the holdout estimate. The difference in error rates between the holdout classifier and the classifier consisting of all in-sample examples is estimated using truncated inclusion and exclusion.
The bounds are computed as follows. Partition the in-sample examples into validation examples and remaining examples. Form a reduced classifier from the remaining examples. Use the error rate of the reduced classifier over the validation examples to estimate the expected error rate of the reduced classifier. Then estimate the difference in expected error rates between the reduced classifier and the full classifier as follows. Partition the validation examples into two "holdout" subsets. Then use each holdout subset to estimate the difference in expected error rate due to including the other holdout subset in the full classifier.
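A high-level sketch of these estimation steps in Python (illustrative only; the exact form of the difference estimate via truncated inclusion and exclusion, and of the resulting probabilistic bound, depends on details not given in this summary):

```python
def error_rate(classifier_examples, distance, test_examples):
    """Fraction of test_examples misclassified by the nearest neighbor
    classifier formed from classifier_examples."""
    errors = sum(
        1 for x, category in test_examples
        if nearest_neighbor_classify(classifier_examples, distance, x)
        != category
    )
    return errors / len(test_examples)

def bound_estimates(in_sample, distance, num_validation):
    """Illustrative sketch of the estimates described above; the final
    probabilistic bound computed from them is not shown."""
    # Partition in-sample examples into validation and remaining examples.
    validation = in_sample[:num_validation]
    remaining = in_sample[num_validation:]
    # Error of the reduced classifier over the validation examples.
    reduced_error = error_rate(remaining, distance, validation)
    # Partition the validation examples into two holdout subsets.
    half = len(validation) // 2
    holdout_a, holdout_b = validation[:half], validation[half:]
    # Use each holdout subset to estimate the change in error rate due to
    # including the other holdout subset in the classifier.
    diff_a = (error_rate(remaining + holdout_b, distance, holdout_a)
              - error_rate(remaining, distance, holdout_a))
    diff_b = (error_rate(remaining + holdout_a, distance, holdout_b)
              - error_rate(remaining, distance, holdout_b))
    return reduced_error, diff_a, diff_b
```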
The bound is based only on the examples that comprise the classifier. Thus, all available examples can be used in the classifier; no examples need to be withheld to compute error bounds. Also, the probability that the prior "leave-one-out" and "holdout" error bounds are weaker than those produced by the process described here is shown to be large in "Validation of Nearest Neighbor Classifiers" by Eric Bax, published in IEEE Transactions on Information Theory, Volume 46, Number 7, 2000, pages 2746 to 2752, which is incorporated by reference in its entirety.