This disclosure is related to the field of methods and systems which perform classification of an object, such as a data set associated with a test sample. Here and in the following, the term “classification” is used in the sense of supervised classification, i.e. classification based on a training set of previously labeled objects. More particularly, the disclosure is directed to a method for determining the probability that a test object is a member of a particular class, given a training set of previously labeled objects. The methods have many possible applications, including medical-related fields. For example, the classification methods can be used for predicting whether a patient will derive benefit or adverse effects from the administration of a particular drug.
The present disclosure discusses one possible application of the invention in which a test object to be classified is in the form of a mass spectrum containing a peak, or a group of peaks, and is classified with respect to a training set comprising a set of mass spectra that are members of two or more classes. However, the methods can be used with other types of data. Hence, in the following disclosure, the term “test instance” is occasionally used to represent the object to be classified, which may take the form of a mass spectrum containing a peak, or a group of peaks, or another form of data, for example data from a different type of analytical instrument such as a gas chromatograph or spectrometer. The term “instance” is used synonymously with “object”.
Of the various classification methods known in the art, the k-Nearest Neighbor (kNN) method is a powerful method of nonparametric discrimination, or supervised learning. Background literature related to the kNN method includes E. Fix and J. L. Hodges, “Discriminatory analysis. Nonparametric discrimination: consistency properties.” Report Number 4, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field, Tex. (February 1951). Reprinted in International Statistical Review, 57 (1989) 238-247; E. Fix and J. L. Hodges, “Discriminatory analysis. Nonparametric discrimination: small sample performance.” Report Number 11, Project Number 21-49-004, USAF School of Aviation Medicine, Randolph Field, Tex. (August 1952); T. M. Cover and P. E. Hart, “Nearest Neighbor Pattern Classification”, IEEE Transactions on Information Theory, IT-13 (1967) 21-27; and B. W. Silverman and M. C. Jones, “E. Fix and J. L. Hodges (1951): An important contribution to nonparametric discriminant analysis and density estimation”, International Statistical Review, 57 (1989) 233-238.
Each object, or instance, to be classified is characterized by d values x_i, i = 1, ..., d, and is thus represented by a point in a d-dimensional space. In the example of mass spectrometry (MS) data, each value x_i represents an intensity of an individual feature, or intensity of an individual peak, in the mass spectrum. The distance between any two instances can be defined in different ways, the simplest of which is the usual Euclidean metric $\sqrt{\sum_{i=1}^{d} (x_i - x'_i)^2}$, but any other distance measure can also be used. Given a training set (a set of instances with known class assignments/labels) and a positive integer k, classification of the test object is performed as follows.
    1. Find the k nearest neighbor instances from the training set instances to the test object.
    2. Determine which of the labels of the k nearest neighbor training set instances is in the majority.
    3. Assign the label determined as being in the majority in step (2) to the test object.
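The three steps above can be sketched in a few lines of Python. This is a minimal illustration only, not the implementation of this disclosure: the function names `euclidean` and `knn_classify`, the tuple-based layout of the training set, and the sample values are all assumptions made for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two d-dimensional points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training_set, test_instance, k):
    """Classify test_instance by majority vote among its k nearest neighbors.

    training_set is a list of (feature_vector, label) pairs (assumed layout).
    """
    # Step 1: find the k training instances nearest to the test instance.
    neighbors = sorted(
        training_set, key=lambda item: euclidean(item[0], test_instance)
    )[:k]
    # Step 2: count the labels among those neighbors.
    votes = Counter(label for _, label in neighbors)
    # Step 3: assign the majority label (ties fall to the label seen first).
    return votes.most_common(1)[0][0]


# Hypothetical two-feature training set with two classes.
training_set = [
    ((1.0, 1.0), "class 1"),
    ((1.2, 0.8), "class 1"),
    ((0.9, 1.1), "class 1"),
    ((3.0, 3.0), "class 2"),
    ((3.2, 2.9), "class 2"),
    ((2.8, 3.1), "class 2"),
]

print(knn_classify(training_set, (1.1, 1.0), k=3))  # prints "class 1"
```

Choosing an odd k avoids tied votes in the two-class case; with more classes a tie-breaking rule would have to be chosen explicitly.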
This simple algorithm has two noticeable drawbacks. First, it does not properly take into account the number of instances of each class in the training set. Simply adding more instances of a given class to the training set would bias classification results in favor of this class. Thus, the algorithm in the above simple form is only applicable when each class in the training set is represented by an equal number of instances. In practice, this is rarely the case.
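The bias can be seen in a small simulation. The sketch below is illustrative only and rests on assumptions that are not part of this disclosure: both classes are drawn from the same one-dimensional Gaussian distribution, so no test point has a genuine reason to favor either class, and the sample sizes, seed, and function names are arbitrary.

```python
import random
from collections import Counter


def knn_label(train, x, k):
    """Majority label among the k training points nearest to x (1-D data)."""
    nearest = sorted(train, key=lambda item: abs(item[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def fraction_assigned_to_class_2(n1, n2, k=15, n_tests=200, seed=0):
    """Fraction of test points labeled class 2 by simple majority-vote kNN."""
    rng = random.Random(seed)
    # Both classes share one distribution; only their sample sizes differ.
    train = [(rng.gauss(0.0, 1.0), 1) for _ in range(n1)]
    train += [(rng.gauss(0.0, 1.0), 2) for _ in range(n2)]
    tests = [rng.gauss(0.0, 1.0) for _ in range(n_tests)]
    hits = sum(knn_label(train, x, k) == 2 for x in tests)
    return hits / n_tests


balanced = fraction_assigned_to_class_2(100, 100)
imbalanced = fraction_assigned_to_class_2(20, 200)
print(balanced, imbalanced)
```

With equal class sizes the assignments are split between the two classes, whereas with ten times as many class 2 instances nearly every test point is assigned to class 2, although the underlying distributions are identical.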
Second, the algorithm provides no information on the confidence of class assignment for individual instances. Consider, for example, the case of k=15 and two classes. It is intuitively clear that the confidence of class assignment in the situation where all 15 of the nearest neighbors belong to the same class is much higher than in the situation where 8 belong to one class and 7 belong to another class. In many applications, such as those related to clinical diagnostics, it may be very important to be able to characterize the confidence of each individual class assignment.
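The k=15 example can be made concrete with the short sketch below. The helper name `majority_label` and the label lists are hypothetical; the code only shows that the majority rule returns the same label for both vote splits and discards the information in the split itself. The raw vote counts printed here are not the probability estimate of this disclosure.

```python
from collections import Counter


def majority_label(neighbor_labels):
    """Label held by the majority of the nearest neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]


unanimous = [1] * 15        # all 15 neighbors belong to class 1
narrow = [1] * 8 + [2] * 7  # 8 neighbors in class 1, 7 in class 2

for labels in (unanimous, narrow):
    counts = Counter(labels)
    # Same assigned label in both cases, very different vote splits.
    print(majority_label(labels), counts[1], counts[2])
# prints:
# 1 15 0
# 1 8 7
```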
In this document, we address these problems by providing a probability estimate of the test instance belonging to each of the classes in the training set, based on the class labels of each of the k nearest neighbors from the training set. An example is described below in which there are two classes of objects in the training set; however, the methods can be extended to the situation where there are three or more classes. We provide two derivations of the probability estimates, one within the kernel density estimation framework (a fixed vicinity of the test instance determines the number of neighbors), the other within the kNN framework (a fixed number of neighbors determines the size of the vicinity). Both lead to the same result for the probability estimate of the test instance belonging to each of the classes.
Unlike estimates of the overall error rate of kNN classification, which depend on the probability distributions associated with the classes, the probability estimates of this disclosure provide a measure of the reliability of class assignment for each individual test instance, depending only on the (known) training set data and their labels. They also properly account for complications arising when the numbers of training instances in the two classes are different, i.e. N1≠N2. Here, for the two-class classification problem, N1 and N2 are the numbers of instances in the training set that belong to class 1 and to class 2, respectively. Extensions to more than two classes are analogous.
The problem of statistical confidence of kNN classification has also been addressed in several other references, including J. Wang, P. Neskovic and L. N. Cooper, “Partitioning a feature space using a locally defined confidence measure”, ICANN/ICONIP (2003) 200-203; J. Wang, P. Neskovic and L. N. Cooper, “An adaptive nearest neighbor algorithm for classification”, Proceedings of the 4th International Conference on Machine Learning and Cybernetics, Guangzhou (2005) 3069-3074; J. Wang, P. Neskovic and L. N. Cooper, “Neighborhood size selection in the k-nearest-neighbor rule using statistical confidence”, Pattern Recognition 39 (2006) 417-423; and X.-J. Ma, R. Patel, X. Wang, et al, “Molecular classification of human cancers using a 92-gene real-time quantitative polymerase chain reaction assay”, Arch. Pathol. Lab. Med. 130 (2006) 465-473. However, the “confidence level” proposed in the J. Wang et al. papers has a completely different statistical meaning and cannot be used to estimate the reliability of class assignment for each individual test instance. The same is true for the P-values discussed in the Ma et al. paper at p. 466.
Additional prior art of interest includes the paper of Robert P. W. Duin, David M. J. Tax, Classifier Conditional Posterior Probabilities, published in: A. Amin, D. Dori, P. Pudil, and H. Freeman (eds.), Advances in Pattern Recognition, Lecture Notes in Computer Science, Volume 1451, p. 611-619, Springer, Berlin (1998), ISBN 978-3-540-64858-1. Other prior art of interest includes U.S. Pat. Nos. 7,016,884, 7,359,805, 7,228,239, and 6,003,027.
The probabilistic classification methods and system of this disclosure provide a facility for determining the reliability of class assignment for each individual test instance. The methods depend only on the (known) training set data, and are not dependent on knowledge of the probability density functions of the training set data, i.e., they are non-parametric. They also avoid the potential bias in a classification system when the numbers of instances in the two classes in the training set are different.