Statistical classification has two widely recognized meanings. First, given a set of observations or data, statistical classification may seek to establish the existence of classes or clusters in the data. This type of statistical classification is referred to as unsupervised learning (or clustering). Second, the existence of classes may be known beforehand. In this case, statistical classification seeks to establish a rule or rules whereby a new observation is classified into one of the known classes. This type of statistical classification is known as supervised learning.
Supervised learning is widely applicable in industrial and technical settings. For example, supervised learning may be used to establish a rule or rules for machine vision recognition. Machine vision recognition based upon the established rule(s) may then be used to guide or control an automated fabrication process.
In supervised learning, a set of measurements is selected that is believed to be indicative of the defined classification(s). Training data is created based upon the selected measurements. Each element in the training data is labeled according to the defined classifications. On the basis of the labeled training data, various methodologies may be used to classify subsequently observed data elements.
The “nearest neighbor” classification methodology measures the distance (e.g., calculated using a suitable weighted metric) from an observed data element to each data element in the training data. The N closest data elements from the training data are selected. The most frequently occurring class among those N closest data elements is used to classify the observed data element.
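The nearest neighbor methodology described above can be illustrated with a minimal Python sketch. The weighted Euclidean metric, the function names, the tie-breaking behavior, and the sample data are assumptions chosen for illustration; the text only requires "a suitable weighted metric."

```python
import math
from collections import Counter


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def classify_nearest_neighbors(observed, training_data, n, weights):
    """Classify `observed` by majority vote among its n closest training elements.

    training_data is a list of (measurements, label) pairs.
    """
    # Rank every training element by its distance to the observed element.
    ranked = sorted(
        training_data,
        key=lambda item: weighted_distance(observed, item[0], weights),
    )
    # Keep only the labels of the n closest elements.
    nearest_labels = [label for _, label in ranked[:n]]
    # Pick the most frequent label; on a tie, Counter returns the label
    # that was encountered first, i.e. the one from the closer element.
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled training data: two measurements per element.
training_data = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"),
    ((4.1, 3.9), "B"),
    ((3.8, 4.0), "B"),
]
weights = (1.0, 1.0)

result = classify_nearest_neighbors((1.1, 1.0), training_data, n=3, weights=weights)
print(result)  # prints "A"
```

Sorting the full training set keeps the sketch short; for large training sets, a partial selection or a spatial index would typically replace the sort.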
The classification methodology assumes that the classifications of the training data elements are correct. However, these classifications can contain errors for a variety of reasons. The amount of misclassification in the training data affects the accuracy of the classification methodology. Specifically, a greater amount of misclassification in the training data leads to reduced classification accuracy. Thus, the integrity of the labeled training data is an important consideration in supervised learning applications.
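The effect of mislabeled training data can be seen in a small, contrived Python sketch. It compares a nearest neighbor classifier trained on correctly labeled data with one trained on the same measurements where two labels have been corrupted. The data values, the unweighted Euclidean metric, and the helper names are assumptions made for illustration only.

```python
import math
from collections import Counter


def classify(observed, training_data, n):
    """Majority vote among the n closest training elements (Euclidean distance)."""
    # math.dist requires Python 3.8 or later.
    ranked = sorted(training_data, key=lambda item: math.dist(observed, item[0]))
    votes = Counter(label for _, label in ranked[:n])
    return votes.most_common(1)[0][0]


def accuracy(training_data, test_data, n):
    """Fraction of test elements whose predicted class matches the true class."""
    correct = sum(classify(x, training_data, n) == label for x, label in test_data)
    return correct / len(test_data)


# Hypothetical training data with correct labels.
clean = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]
# Same measurements, but two class-A elements are mislabeled as "B".
noisy = [
    ((1.0, 1.0), "B"), ((1.2, 0.8), "B"), ((0.9, 1.1), "A"),
    ((4.0, 4.2), "B"), ((4.1, 3.9), "B"), ((3.8, 4.0), "B"),
]
# Hypothetical observed elements with their true classes.
test_data = [
    ((1.1, 0.9), "A"), ((0.8, 1.0), "A"),
    ((4.0, 4.0), "B"), ((3.9, 4.1), "B"),
]

clean_accuracy = accuracy(clean, test_data, n=3)
noisy_accuracy = accuracy(noisy, test_data, n=3)
print(clean_accuracy, noisy_accuracy)  # prints "1.0 0.5"
```

In this example the two corrupted labels outvote the one remaining correct label near the class-A observations, so both class-A test elements are misclassified and accuracy falls from 1.0 to 0.5.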