Classification is an important field of study in many disciplines. Human beings classify naturally all the time: medical doctors use their skills to identify diseases, inspection line workers use them to identify defective manufactured parts, police use them to find wanted criminals, and so on. Since machines do not tire, reliably provide the same output for the same set of inputs and can readily be reproduced, it would be extremely desirable to develop one that could learn to classify as well as or better than the best human classifiers.
The literature is filled with examples where machines receive data and attempt to classify it in some way. Typically, this data can be represented in the form of a multidimensional vector defined within some hyperspace, wherein each dimension of the vector can be viewed as a feature of the data. Therefore, by using machines, one can view many problems as processing an unknown data vector in some way that produces an output which correctly associates it with a known class. Exemplar based machine learning techniques tackle these problems by learning to distinguish among classes through representative training examples. These techniques can be supervised or unsupervised, classifications can be generalized or specific instances can be retained for comparison purposes, and they can be trained incrementally or in a batch process. Several popular algorithms employing these techniques in various ways have been developed and published in the literature. In general, they result in an algorithm that attempts to classify an unknown data vector after learning from training examples.
Human brains continually classify the detections made by their sensors. Seeing and recognizing someone's face, listening to and understanding a spoken word, smelling a lemon, feeling the presence of a liquid and tasting an apple are simple examples of how humans classify the detections made by their sensors. In essence, a person receives data from one or more sensors and then classifies the object sensed based on these detections. Of course, one of the possible classifications is “don't know”.
The accuracy with which classifications are made largely depends on the level of experience of the person making the classification. The more experience a person has sensing various objects, the more likely they are to accurately classify an unknown object into one of their categories. Their known categories are generally formed based on what they have initially learned from a teacher (supervised learning) or have independently established based on objects they have discovered (unsupervised learning).
The teacher's function is to provide the student with accurate, concrete examples of a particular class. The more examples the student is provided, the better the student understands a given class. Hence, when presented with an unknown object that belongs to a previously learned class, one expects that the greater the number of training examples used in the learning process, the more likely the student will be able to accurately classify the unknown object. This is because classification is greatly aided by familiarity, which increases one's ability to identify similarity.
People utilize their natural and artificial sensors and tools during the classification process. Given an unknown object, the information output from the sensing and measuring sources is processed and a classification is made. This complex process, which runs from raw data to a classification on which some action may then be based, can generally be understood to be pattern recognition. Human beings do this all the time; depending on the unknown to be classified, some are better at it than others.
Since machines do not tire, reliably provide the same output for the same set of inputs and can readily be reproduced, developing one that could learn how to recognize patterns would be extremely useful. As mentioned, there are several examples within the open literature where machines receive data and attempt to classify it in some way. Typically, this data can be represented in the form of a multidimensional vector defined within some hyperspace. (Each dimension of the vector can be viewed as a feature of the data.) Therefore, the pattern recognition problem can be viewed as processing this data vector in some way that produces an output which places it into some class. In general, the approach taken for the design of the classification processor depends on the data one has available. Several approaches are usually attempted with the available data, and the one that produces the best results is typically accepted.
Exemplar based machine learning techniques tackle these pattern recognition problems by learning from representative training data. These techniques can be supervised or unsupervised, classes can be generalized or specific instances can be retained for comparison purposes, and they can be trained incrementally or in a batch process. Several popular algorithms employing these techniques in various ways have been developed and published in the literature.
In supervised learning each example in the training set is associated with a known class. In unsupervised learning the training examples are not divided into classes and one attempts to form clusters or find “natural groupings” among them. The Hierarchical Clustering and Self Organizing Map algorithms have been developed in an attempt to find natural groupings among the totality of training examples and ascertain classes based on those natural groupings. It should be noted that clustering the data into natural groupings provides an opportunity for extracting useful information regarding the structure of a given class within its feature hyperspace.
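As a rough illustration of how “natural groupings” can be formed without class labels, the following minimal sketch performs agglomerative clustering with single linkage. It is only one simple variant of hierarchical clustering, not the specific algorithm of any cited work; the function name, the choice of single linkage and the sample points are assumptions made for illustration.

```python
import math

def single_linkage_clusters(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (illustrative only)."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster

    def cluster_distance(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        _, i, j = min(
            (cluster_distance(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical unlabeled training examples forming two visible groups.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9)]
groups = single_linkage_clusters(points, n_clusters=2)
print(groups)  # lists of point indices, one list per cluster
```

The returned index lists can then be inspected against any known class labels to see how the natural groupings relate to the classes.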
Specific instance algorithms utilize training examples without any generalization; these types of algorithms are also referred to as lazy learners. The k-Nearest Neighbor algorithm is considered specific instance because unknown data is compared directly with individual training examples. The classes of the k “most similar” training examples to the unknown data are tallied, and the class that represents the majority of the examples selected is assigned to the unknown data. Instance Based Learning algorithms also compare unknown data to individual training examples. They function similarly to the nearest neighbor algorithm; however, the training examples retained for comparison purposes vary with the instance based algorithm that is chosen. Specific instance type algorithms can become computationally expensive if one needs to compare unknown data with every training example saved. These algorithms may also have very large storage requirements.
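The k-Nearest Neighbor tally described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance as the similarity measure and uses made-up training examples; the function name and data are hypothetical.

```python
import math
from collections import Counter

def knn_classify(unknown, examples, k=3):
    """examples: list of (feature_vector, class_label) pairs, stored without generalization."""
    # Rank every stored training example by its distance to the unknown.
    ranked = sorted(examples, key=lambda ex: math.dist(unknown, ex[0]))
    # Tally the classes of the k most similar examples.
    votes = Counter(label for _, label in ranked[:k])
    # Assign the class that holds the most votes.
    return votes.most_common(1)[0][0]

# Hypothetical two-feature training set with classes "A" and "B".
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.0), "B"), ((4.2, 3.9), "B"),
]
label = knn_classify((1.1, 1.0), training, k=3)
print(label)
```

Every call sorts the full training set, which makes the computational and storage costs noted above visible: both grow with the number of examples retained.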
Generalization algorithms combine training examples and typically compare unknown data to generalized representations of the training data; these types of algorithms are also referred to as eager learners. Hyperrectangle algorithms group training examples as rectangular shapes within the feature hyperspace, with sides that are parallel to the feature axes of the data. Unknown data classifications are made based on their proximity to these hyperrectangles. Feature partitioning algorithms attempt to partition each feature range into a series of line segments that are identified with classes based on the training examples. Preliminary classifications are made based on the proximity of each of the unknown's features to these line segments. As a result, for a given unknown, each of its features is independently used to associate the unknown with a class. Each of these independent class associations is assigned a weight. The unknown data is then declared to be the class that received the most weight. The neural network algorithms reviewed generally form hyperplanes or more complex decision surfaces within the feature space or, as can be the case with support vector machines, within some transformed version of the feature space. The data is generalized because whether or not an unknown vector is considered to belong to a particular class depends on which side of the decision surface it lies. The binary decision tree algorithms generalize unknown data as belonging to one branch of classes or another and work their way along various branches until they reach an ultimate class leaf. The hierarchical clustering and self-organizing map algorithms formed “natural clusters” from training examples. As the natural clusters were being formed, they were compared to the known classes of the training examples.
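To make the hyperrectangle idea concrete, the following sketch generalizes each class as a single axis-parallel bounding box and classifies an unknown by its distance to the nearest box. Published hyperrectangle algorithms are more elaborate (for example, several boxes per class); the one-box-per-class simplification, the function names and the data are assumptions made for illustration.

```python
import math

def distance_to_box(x, lower, upper):
    """Euclidean distance from x to an axis-parallel box (0.0 if x lies inside it)."""
    gaps = [max(lo - xi, 0.0, xi - hi) for xi, lo, hi in zip(x, lower, upper)]
    return math.sqrt(sum(g * g for g in gaps))

def fit_boxes(examples):
    """Build one bounding box per class from (feature_vector, class_label) pairs."""
    boxes = {}
    for features, label in examples:
        if label not in boxes:
            boxes[label] = (list(features), list(features))
        else:
            lower, upper = boxes[label]
            for d, value in enumerate(features):
                lower[d] = min(lower[d], value)  # grow the box to cover the example
                upper[d] = max(upper[d], value)
    return boxes

def classify_by_box(x, boxes):
    # The unknown takes the class of the closest hyperrectangle.
    return min(boxes, key=lambda label: distance_to_box(x, *boxes[label]))

# Hypothetical training examples.
training = [
    ((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
    ((4.0, 4.0), "B"), ((4.2, 3.9), "B"),
]
boxes = fit_boxes(training)
print(classify_by_box((2.0, 1.0), boxes))
```

Once the boxes are built, the individual training examples are no longer needed, which is the practical difference from the specific instance approach.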
Incremental vs. Batch Learning
Incremental learning algorithms continually allow the class decision space to be modified every time a new training sample is presented. These recursive techniques are typically sensitive to the order in which the training examples are presented.
Batch learning algorithms utilize all of the training examples in their totality. These algorithms may attempt to utilize every training example as a specific instance, as in the case of k-Nearest Neighbor. They can also generalize their training examples and solve for their parameters either iteratively or all at once. By utilizing all of the training data at once, batch learning is more likely than incremental learning to find optimal parameters for a given algorithm based on the training data available. Since batch learning algorithms utilize all of the training data at once, they can be more computationally intensive than incremental learning algorithms. Whenever new training data is available, batch learning algorithms often need to reprocess the entire training set in order to obtain their parameters, whereas incremental learning algorithms simply update their parameters based on the new training data. The exemplar-based machine learning method developed in this manuscript utilized batch learning to obtain the algorithm parameters.
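The difference in how parameters are obtained can be illustrated with the simplest possible parameter, a mean vector. The sketch below is only an illustration with hypothetical names and data. Note that the mean is a special case in which the incremental result does not depend on presentation order (apart from rounding); most incremental learners do not have this property.

```python
def batch_mean(samples):
    """Batch: compute the parameter from every training sample at once."""
    n = len(samples)
    dims = len(samples[0])
    return [sum(s[d] for s in samples) / n for d in range(dims)]

class IncrementalMean:
    """Incremental: update the parameter each time one new sample is presented."""

    def __init__(self, dims):
        self.count = 0
        self.mean = [0.0] * dims

    def update(self, sample):
        self.count += 1
        # Move the running mean toward the new sample by a step of 1/count.
        self.mean = [m + (x - m) / self.count for m, x in zip(self.mean, sample)]

# Hypothetical training samples.
samples = [(1.0, 2.0), (3.0, 4.0), (5.0, 9.0)]

running = IncrementalMean(dims=2)
for s in samples:
    running.update(s)  # no need to revisit earlier samples

print(batch_mean(samples), running.mean)
```

When a fourth sample arrives, the incremental version needs one call to `update`, while the batch version must be rerun over all four samples.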
Principal component analysis (PCA) has been used successfully in many applications as a feature reduction technique that precedes classification. This has been done on a wide range of data sets, from hyperspectral images to vegetable oil data. A nonlinear kernel based technique, KPCA, has also been developed to further utilize the feature reduction capabilities of PCA. These kernel based techniques have been used successfully for feature reduction of gene expression data. This nonlinear kernel approach is sometimes called the “kernel trick” in the literature. These typical applications of PCA treat all of the training instances as one giant cluster. Researchers have also applied PCA after breaking the training data into a series of clusters. However, the incomplete information that could be extracted when a cluster's covariance matrix is singular is typically not utilized.
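The typical use of PCA described above, in which all training instances are treated as one cluster and projected onto a few principal directions, can be sketched as follows. This is a generic illustration that assumes NumPy is available and uses synthetic data; it is not the technique developed in this work, and the function name is hypothetical.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components of X."""
    mean = X.mean(axis=0)
    centered = X - mean
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, singular_values, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = singular_values[:n_components] ** 2 / (len(X) - 1)
    return centered @ components.T, components, explained_variance

rng = np.random.default_rng(0)
# Hypothetical data: three features, with almost all variance along one direction.
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, 0.5]]) + 0.01 * rng.normal(size=(200, 3))

reduced, components, variance = pca_reduce(X, n_components=1)
print(reduced.shape, components[0])
```

Here three features are reduced to one, and the recovered direction is close to the one used to generate the data (up to sign).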
The University of California at Irvine data repository was specifically established so that researchers could compare their machine learning techniques with other methods on common data sets. The methodology developed in this study was applied to twenty-one of those data sets. However, none of the highest performing classification approaches reported by other researchers for the data sets assessed utilized any form of PCA. The reason for this may be that common applications of PCA oversimplify the class generalizations for these more complex data sets. Typically, data within a given class is the result of a complicated interaction of its features; therefore, its mean vector and covariance matrix can be poor representations of the entire class. However, this work will show that if one applies clustering and exclusion techniques within a given class, natural groupings can be identified and adequately generalized using principal components even when a cluster's covariance matrix is singular. Therefore, instead of using the typical implementations of PCA as a feature reduction technique, which often require a nonsingular covariance matrix, the novelty of the technique developed in this manuscript can be viewed as combining clustering, a logical exclusion process and a form of PCA capable of handling singular covariance matrices into a local feature reduction technique that operates on subclasses.
The ability to classify unknown data after learning from a training set can be extremely useful for a wide range of disciplines. The purpose of this study is to explore and develop a classification technique that is capable of segmenting arbitrarily complex multidimensional classes in a way that reveals a final set of clusters, each of which can be adequately generalized with its mean vector, relevant subspace and null space, so that the tools of PCA can be harnessed (even when the covariance matrix is singular) and an exemplar based machine learning algorithm can be developed. The rationale for this approach lies in the understanding that one can adequately represent a hyper ellipsoidal shaped cloud with its mean vector and covariance matrix. Given enough hyper ellipsoidal shaped clouds of various sizes and orientations, one can generalize virtually any arbitrarily complex shape within a multidimensional space. Therefore, the more accurately an arbitrarily complex class shape can be represented, the better the result one anticipates from a similarity assessment with that shape and the more accurately one expects to be able to predict the class of unknown data.
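One way to picture generalizing a cluster by its mean vector, relevant subspace and null space is through an eigendecomposition of the cluster's covariance matrix. The sketch below assumes NumPy and assumes that the relevant subspace is spanned by eigenvectors with non-negligible eigenvalues while the null space is spanned by the rest; the tolerance, the function name and the data are illustrative choices, not details taken from this study.

```python
import numpy as np

def summarize_cluster(X, tol=1e-10):
    """Return the mean, relevant-subspace basis and variances, and null-space basis."""
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)  # features are columns of X
    # eigh is intended for symmetric matrices; eigenvalues come back in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # Assumed rule: directions with negligible variance form the null space.
    keep = eigenvalues > tol * eigenvalues.max()
    return {
        "mean": mean,
        "relevant_basis": eigenvectors[:, keep],
        "relevant_variances": eigenvalues[keep],
        "null_basis": eigenvectors[:, ~keep],
    }

# Hypothetical cluster in three dimensions whose points all satisfy z = x + y,
# so its 3x3 covariance matrix is singular (rank 2).
X = np.array([[0, 0, 0], [1, 0, 1], [0, 1, 1], [1, 1, 2], [2, 1, 3]], dtype=float)
summary = summarize_cluster(X)
print(summary["relevant_basis"].shape, summary["null_basis"].shape)
```

For this cluster the covariance matrix cannot be inverted, yet the decomposition still yields a two-dimensional relevant subspace and a one-dimensional null space along which the training instances show no variation.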
Exemplar based learning methods implicitly assume that data belonging to the same class can potentially be clustered together in some arbitrarily complex shape or shapes within the hyperspace of the data features. It is based on this reasoning that attempts are made to identify the similarity between unknown data and the specific instances or generalized concepts learned from the training examples. Hence, if one can adequately approximate how the multidimensional data is clustered and distributed within its feature hyperspace, it seems reasonable to expect that this knowledge could be applied to develop improved generalized concepts for estimating classes.
Instinctively, one expects each class's arbitrarily complex multidimensional shape to be revealed through the distribution of an appropriate set of training instances within the pattern recognition problem's hyperspace. This work generalizes this complex shape by breaking each class down into clusters that are easy to describe mathematically. Each cluster is described by its first and second order statistics. A mental image would be that each cluster is generalized as a mean point surrounded by a hyper ellipsoidal shaped cloud whose size and orientation are representative of the training instances from which it is composed. Unknown data vectors are compared to these clusters by calculating normalized Euclidean distances to them. The smaller the normalized Euclidean distance, the greater the association made with the cluster. The class of the cluster with the greatest association becomes the estimated classification for the unknown. This approach has obtained very positive results when tested on two-dimensional synthetically generated data and on real data taken from the University of California at Irvine data repository.
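The comparison of an unknown vector with a set of clusters can be sketched as follows. The text does not spell out the normalization here, so this sketch assumes that the distance is measured along each cluster's principal axes and divided by the standard deviation along each axis; directions with negligible variance are simply ignored, which is a simplification. NumPy, the function names and the synthetic clusters are all assumptions made for illustration.

```python
import numpy as np

def fit_cluster(X, label, tol=1e-10):
    """Describe a cluster by its mean and by the axes and spreads of its covariance."""
    mean = X.mean(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X, rowvar=False))
    keep = eigenvalues > tol * eigenvalues.max()  # drop zero-variance directions
    return {"label": label, "mean": mean,
            "axes": eigenvectors[:, keep], "std": np.sqrt(eigenvalues[keep])}

def normalized_distance(x, cluster):
    # Coordinates of x along the cluster's principal axes, in units of standard deviation.
    coords = (x - cluster["mean"]) @ cluster["axes"]
    return float(np.linalg.norm(coords / cluster["std"]))

def classify(x, clusters):
    # The smallest normalized distance gives the greatest association.
    return min(clusters, key=lambda c: normalized_distance(x, c))["label"]

rng = np.random.default_rng(1)
# Hypothetical clusters: "A" is wide along the first feature, "B" is small and round.
cluster_a = fit_cluster(rng.normal(size=(500, 2)) * [3.0, 0.3] + [0.0, 0.0], "A")
cluster_b = fit_cluster(rng.normal(size=(500, 2)) * [0.5, 0.5] + [4.0, 2.0], "B")

x = np.array([5.0, 0.0])
print(classify(x, [cluster_a, cluster_b]))
```

The point (5, 0) is closer to the mean of cluster "B" in plain Euclidean terms, but it lies well within the elongated cloud of cluster "A", so the normalized distance associates it with "A". This is the benefit of describing each cluster by its size and orientation rather than by its mean alone.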