"Pattern recognition" refers to classifying an object or event into one of several pre-specified categories. This process can be broken down into two steps. The first step is "feature extraction," which includes selecting parameters that will allow the object or event to be distinguished from another and detecting values for these parameters. The second step is "classification," which uses the values (i.e., data) obtained from the feature extraction to make a classification decision.
As an example of pattern recognition, consider the task of automatically determining whether a person is a male or a female. First, a feature extractor will extract specific data about pre-selected parameters characterizing the person, e.g., height and weight. This data is typically combined in what is known as a feature vector. Here, the feature vector is two dimensional, where one dimension is for height and the other for weight. The space containing such feature vectors is known as feature space. The feature vector for height and weight will then be used by a classifier to decide whether the person is a male or a female.
Conventional classifiers apply statistical analysis techniques to the data to make decisions. Typically, for each class, a mean and a variance are estimated for scalar data, or a mean vector and a covariance matrix are estimated for multidimensional data. Bayes decision theory (see, for example, "Pattern Recognition and Scene Analysis," R. O. Duda and P. E. Hart, Wiley, 1973) can then be used to locate a decision boundary, known as a "discriminant," which divides the data according to the classes. Data falling on one side of the boundary will belong to one class whereas data falling on the boundary or on the other side of the boundary will belong to another class.
As an example, consider the male/female classification previously mentioned. Considering only the "weight" feature, if the average weight for females is estimated as 110 pounds and the average weight for a male is estimated as 170 pounds, and both these measurements have equal variance, then a discriminant boundary would be located at 140 pounds. Thus, a person weighing over 140 pounds would be classified as male, and otherwise would be classified as female.
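The weight example above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: it assumes the two classes are equally likely in addition to having equal variance, in which case the boundary falls midway between the two means. The function names are hypothetical.

```python
def equal_variance_threshold(mean_a, mean_b):
    """Boundary between two classes, assuming equal variance and equal priors."""
    return (mean_a + mean_b) / 2.0


def classify_by_weight(weight_lb, threshold):
    # Over the threshold -> male; on or under the threshold -> female.
    return "male" if weight_lb > threshold else "female"


# Means taken from the example in the text: 110 lb (female), 170 lb (male).
threshold = equal_variance_threshold(110.0, 170.0)
print(threshold)                             # midpoint of the two means
print(classify_by_weight(150.0, threshold))
print(classify_by_weight(125.0, threshold))
```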
Classifiers based on statistical decision theory, such as that described above, rely upon accurate estimates for the parameters of probability density functions. A probability density function (PDF) refers to a model for the statistical properties of a random signal. Common statistical properties for a PDF used in pattern recognition are the previously mentioned mean and variance parameters. However, these properties have shortcomings for problems where it is difficult to estimate the PDF parameters, whether due to insufficient data, high dimensional feature vectors, or other causes. It is well-known in pattern recognition that the higher the dimension of a vector, the more difficult it is to estimate its statistical properties.
Recently, to overcome some of these problems neural networks have been suggested as an alternative to the above-described classifiers. Neural networks attempt to model the pattern recognition capabilities of a human brain by interconnecting simple computational elements. These individual computational elements are known as "neurons."
Each neuron computes a weighted sum of an input vector. The sum essentially defines a boundary line for classifying data. For a two dimensional feature vector, a boundary line is formed in accordance with the equation W.sub.1 X.sub.1 +W.sub.2 X.sub.2 =D, where X.sub.1 and X.sub.2 are measured features for which the corresponding feature vector is [X.sub.1 X.sub.2 ]; the W's (usually called the "weights") comprise the "weight vector," which is perpendicular to the boundary line that separates the classes; and D is the distance of the line from the origin (when the weight vector is normalized to unit length).
The left side of the above equation represents a weighted sum of the input features. If this weighted sum is greater than D, then the input is said to belong to one class. If the weighted sum is not greater than D, the input is said to belong to the other class. Typically, D is absorbed into the weight vector by augmenting the feature vector with a "one," i.e., [X.sub.1 X.sub.2 1], so that the boundary W.sub.1 X.sub.1 +W.sub.2 X.sub.2 -D=0 becomes W.sub.1 X.sub.1 +W.sub.2 X.sub.2 +W.sub.3 1=0, where W.sub.3 =-D. The weight W.sub.3, corresponding to the offset D, is commonly known as a "bias."
The classification decision can be made by evaluating the dot product of the weight vector [W.sub.1 W.sub.2 W.sub.3 ] with the feature vector [X.sub.1 X.sub.2 1]. If the value of the dot product is greater than zero, the data belongs to one of the two classes (i.e., the data is on one "side" of the boundary). Likewise, if the value is negative, the data does not belong to that class. This is numerically equivalent to comparing W.sub.1 X.sub.1 +W.sub.2 X.sub.2 with D, but is notationally more convenient.
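The equivalence between comparing the weighted sum with D and testing the sign of the augmented dot product can be checked with a short sketch. The weights, the offset, and the function names below are hypothetical values chosen only for illustration.

```python
def weighted_sum(weights, features):
    """Dot product of a weight vector with a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def classify_with_offset(weights, features, d):
    # Form 1: compare the weighted sum against the offset D.
    return weighted_sum(weights, features) > d


def classify_augmented(augmented_weights, features):
    # Form 2: append a constant 1 to the features; the last weight is -D.
    return weighted_sum(augmented_weights, list(features) + [1.0]) > 0.0


# Hypothetical boundary X1 + X2 = 3, i.e. W = [1, 1], D = 3, W3 = -3.
w, d = [1.0, 1.0], 3.0
w_aug = w + [-d]

for point in [(2.0, 2.0), (1.0, 1.0), (0.5, 4.0)]:
    print(point, classify_with_offset(w, point, d), classify_augmented(w_aug, point))
```

Both forms return the same decision for every point; the augmented form simply lets the offset be handled as one more weight.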
The weighted sum can be applied to a hard limiter, which makes any value above 0 equal to +1 and any value below 0 equal to -1. The output of the hard limiter is the neuron output.
A commonly-used alternative is to apply the weighted sum to a soft limiter. This is typically accomplished by applying the value to a "sigmoid" function. For a value y, the sigmoid of y is defined as f(y)=1/(1+e.sup.-.alpha.y), where .alpha. is some constant, e.g., .alpha.=1. The sigmoid is an S-shaped function that asymptotically approaches +1 or 0 as y approaches positive or negative infinity, respectively. The sigmoid can be easily modified to approach +1 and -1, as does the hard limiter. As contrasted to the hard limiter (which can provide only +1 or -1 values), the sigmoid provides a more gradual transition between these two levels.
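The two limiters can be compared directly. The sketch below is illustrative only: mapping an input of exactly 0 to -1 in the hard limiter is an assumption (the text leaves that case open), and the rescaling 2f(y)-1 is one common way, assumed here, to make the sigmoid range over -1 to +1.

```python
import math


def hard_limiter(y):
    # +1 above zero, -1 otherwise (the y == 0 case is an assumption).
    return 1 if y > 0 else -1


def sigmoid(y, alpha=1.0):
    """Soft limiter f(y) = 1 / (1 + e^(-alpha*y)); output lies between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-alpha * y))


def bipolar_sigmoid(y, alpha=1.0):
    """Rescaled sigmoid whose output lies between -1 and +1."""
    return 2.0 * sigmoid(y, alpha) - 1.0


for y in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(y, hard_limiter(y), round(sigmoid(y), 4), round(bipolar_sigmoid(y), 4))
```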
For two dimensional features, the discriminant is a line in feature space. For features of a dimension greater than two, the discriminant will be a hyperplane. That is, in "n" dimensional space (i.e., beyond just two or three dimensions, such as twenty dimensions), the line is replaced with a "hyperplane" in "hyperspace." The test, then, is whether the data is on one side of a hyperplane or the other.
To make a classification decision, it may be necessary to use many of these neurons--an array of these neurons, each of which constructs lines or hyperplanes in feature space and outputs a +1 or a -1. So, if the task at hand is to recognize ten people (i.e., each person represents a class) from some features from digitized photographs of their faces (e.g., their eyes, the length of their noses, the distances of their lips, and so on), the process makes linear separations in feature space. The number of neurons must be sufficient to partition the feature space into distinct regions corresponding to each person within this set of ten people.
Arrays of neurons, such as those described above, are only capable of solving problems that are linearly separable (i.e., the two classes can be separated by a single line). For problems that are not linearly separable, many layers of neurons can be required to do a piece-wise linear approximation to partition feature space into distinct regions. Mathematically, if this process is followed enough times, one can approximate any "decision surface," i.e., any kind of curve or arbitrary function that is needed to separate one class from another.
After data is processed by the first layer of neurons, the outputs of these neurons can be used as inputs to a second layer of neurons, known as a "hidden layer." Thus, if there are five neurons in an input layer, each node in the next (or "hidden") layer will have five inputs. The hidden layer processes the outputs from the first layer with additional weights in the same fashion as for the input layer. The purpose of the hidden layers in this "feed-forward" architecture is to solve non-linearly separable problems (i.e., when two classes cannot be separated by a single line). The outputs of the hidden layer can be used as inputs to another hidden layer, which operates identically to the first hidden layer, or can be used as inputs to an "output layer." The output layer has one or more neurons, whose output(s) represent the final decision of the classifier. In neural networks of this architecture, soft limiters are used between the neurons of each layer and one (or more) hard limiter(s) is used after the output layer to help define the final decision by producing an electrical output of +1 or -1.
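A feed-forward pass of the kind described above can be sketched as follows. This is a simplified illustration with one layer of sigmoid neurons feeding a single output neuron followed by a hard limiter. The weights are hypothetical, hand-picked so that the network solves the exclusive-OR problem, which a single line cannot separate.

```python
import math


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def layer_sums(inputs, weight_rows):
    """Weighted sum for each neuron; the last weight in each row is the bias."""
    augmented = list(inputs) + [1.0]
    return [sum(w * x for w, x in zip(row, augmented)) for row in weight_rows]


def mlp_forward(features, hidden_weights, output_weights):
    # Soft limiter between layers, hard limiter after the output layer.
    hidden = [sigmoid(s) for s in layer_sums(features, hidden_weights)]
    return [1 if s > 0 else -1 for s in layer_sums(hidden, output_weights)]


# Hypothetical weights: the first neuron acts like OR, the second like NAND,
# and the output neuron fires only when both are active.
hidden_weights = [[20.0, 20.0, -10.0], [-20.0, -20.0, 30.0]]
output_weights = [[1.0, 1.0, -1.5]]

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, mlp_forward(x, hidden_weights, output_weights))
```

Each sigmoid neuron contributes one line in feature space; the output neuron combines the two half-planes into a region that no single line could carve out.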
The neural network structure used to implement this process is known as a multi-layer perceptron (MLP), which is shown in FIG. 1. The MLP is a powerful computing tool, but it presents many technical challenges in implementation. These challenges include the challenge of efficiently training the MLP to solve a problem, and the challenge of selecting the proper architecture, i.e., number of neurons, number of layers, etc., to solve the problem.
A. MLP Training
In the 1980's, a process for training an MLP was discovered. The process uses what is known as a "gradient descent" process. The feature vector is input to all the neurons in the front (input) layer. The outputs of the input layer are used as inputs to the hidden layer, and so on, until the output layer is reached.
At the output layer, the desired output value is compared to the actual output of the MLP. The "error" is defined as the difference between the actual output and the desired output. The error is fed backwards (i.e., from the output to the input) into the MLP to adjust the weights in the network, making the output closer to the desired output. This technique is known as "back-propagation." To continue the training process, the next feature vector is input and the above-described process is repeated. Again, the error is propagated back into the system and the weights are adjusted to reduce the error. By processing all of the feature vectors many times with this approach, the MLP will usually reduce the overall error and essentially learn the relation between the feature vectors and their corresponding classes. Thus, a trained MLP should be capable of retrieving the correct class when given a feature vector for that class. (For further information about the back-propagation process, see D. E. Rumelhart and J. L. McClelland, Parallel Distributed Processing, MIT Press, Cambridge, 1986.)
In the first phase of MLP training, all of the weights are initially set with random numbers. Therefore, essentially all hyperplanes are randomly positioned in feature space. Then the back-propagation process is used to adjust the hyperplanes to organize and discriminate among the data.
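The training loop described above (random initial weights, forward pass, error fed backwards, weight adjustment, repeated over all feature vectors) can be sketched for a small network. This is a minimal sketch under stated assumptions, not the procedure of any particular reference: it assumes sigmoid neurons throughout, a squared-error measure, a fixed learning rate, and the exclusive-OR data as a toy problem. The layer sizes, rate, epoch count, and seed are arbitrary illustrative choices.

```python
import math
import random


def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))


def forward(x, hidden_w, output_w):
    """Return the hidden-layer outputs and the single network output."""
    xa = list(x) + [1.0]
    hidden = [sigmoid(sum(w * v for w, v in zip(row, xa))) for row in hidden_w]
    out = sigmoid(sum(w * v for w, v in zip(output_w, hidden + [1.0])))
    return hidden, out


def total_error(data, hidden_w, output_w):
    return sum((forward(x, hidden_w, output_w)[1] - t) ** 2 for x, t in data)


def train(data, n_hidden=4, rate=0.5, epochs=10000, seed=0):
    rng = random.Random(seed)
    n_in = len(data[0][0])
    # All weights start as random numbers, so the hyperplanes start at random.
    hidden_w = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    output_w = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    initial = total_error(data, hidden_w, output_w)
    for _ in range(epochs):
        for x, target in data:
            hidden, out = forward(x, hidden_w, output_w)
            # Error at the output, scaled by the slope of the sigmoid.
            delta_out = (out - target) * out * (1.0 - out)
            # Feed the error backwards through the output weights.
            delta_hidden = [delta_out * output_w[j] * h * (1.0 - h)
                            for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden + [1.0]):
                output_w[j] -= rate * delta_out * h
            xa = list(x) + [1.0]
            for j in range(n_hidden):
                for i, v in enumerate(xa):
                    hidden_w[j][i] -= rate * delta_hidden[j] * v
    return hidden_w, output_w, initial, total_error(data, hidden_w, output_w)


xor_data = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]
hidden_w, output_w, initial_error, final_error = train(xor_data)
print("error before training:", initial_error)
print("error after training: ", final_error)
```

The sketch only reports the error before and after training; how far the error falls depends on the random starting weights.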
Training an MLP usually takes a long time because the MLP is learning everything at once; i.e., the proper location of each hyperplane. Over the years, people have tried many approaches to speed up training, but have yet to find a simple solution to this problem.
Another problem with this training process is that an MLP frequently gets "trapped"; it trains for a long time and cannot solve the problem. The error function whose minimization is desired can be viewed as a curve with many peaks and valleys. The best solution will correspond to the point on this curve corresponding to the global minimum, i.e., the lowest "valley." The MLP is prone to getting stuck in a local minimum and not converging to the global minimum. To find a better solution, the MLP must exit the valley, go over a peak, look for another minimum, and continue hunting until the global minimum is found. When an MLP gets stuck in a local minimum, it is necessary to dump out all the data and randomly restart somewhere else to see if the global minimum can be found. Not only is this approach extremely time consuming, it is also heuristic and thus, there is no guarantee that a solution will ever be found.
An MLP can also be very expensive with respect to hardware implementation. Companies, such as Intel, are selling neurons on a chip, but they do not seem to be in use extensively and effectively. Instead, Digital Signal Processor (DSP) chips are used. A DSP chip is a fast computer comprising an analog/digital (A/D) converter coupled to hardware multipliers. A DSP chip can often cost more than a dollar, and at a million neurons (chips), for example, the MLP can be very expensive.
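The local-minimum problem and the random-restart remedy can be seen on a one-dimensional stand-in for the error function. The curve below, f(x) = x^4 - 3x^2 + x, is a hypothetical example chosen because it has one shallow valley and one deeper valley; the step size, step count, and number of restarts are likewise arbitrary.

```python
import random


def f(x):
    # Hypothetical error curve with a local and a global minimum.
    return x ** 4 - 3.0 * x ** 2 + x


def f_prime(x):
    return 4.0 * x ** 3 - 6.0 * x + 1.0


def descend(x, rate=0.01, steps=1000):
    """Plain gradient descent: repeatedly step downhill from the start point."""
    for _ in range(steps):
        x -= rate * f_prime(x)
    return x


def descend_with_restarts(n_restarts=10, seed=0):
    # Heuristic: try several random starting points and keep the lowest valley.
    rng = random.Random(seed)
    starts = [rng.uniform(-2.0, 2.0) for _ in range(n_restarts)]
    return min((descend(s) for s in starts), key=f)


stuck = descend(2.0)    # starts in the basin of the shallow valley
best = descend(-2.0)    # starts in the basin of the deep valley
print(stuck, f(stuck))
print(best, f(best))
print(descend_with_restarts())
```

Descent from x = 2 settles in the shallow valley and never sees the deeper one; only a different starting point reaches it, and nothing guarantees that a given set of random restarts will include such a point.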
B. MLP Testing
Another challenge presented by an MLP involves testing to determine how well the neural net generalizes. The ability of a neural net to generalize from output data is analogous to the situation in which a child learns to generalize from seeing a few chairs: the child does not need to have every chair in the world pointed out--the child generalizes from what is known about a few chairs. So too can MLP's generalize, say, to recognize a known face though it is contorted with a grimace or a smile.
This "generalizing" process illustrates a difference between an MLP and a device which stores data about faces in a look-up table. The interconnections between the neurons store data in a highly encoded and compact way, similar to the way neurons in a brain function. This permits generalizing instead of merely retrieving data from a look-up table.
C. Decision Trees
Decision trees (see FIG. 2) are another approach to the pattern classification problem. A decision tree is a tree structure comprised of nodes (i.e., the basic decision-making units of the tree). The bottom-most nodes of the tree are known as "leaf nodes," and represent the different classes of the problem. A feature vector to be classified is first applied to the top-most node of the tree, known as a "root node" or simply the "root." After being processed by the root node, the feature vector will be sent to the appropriate node at the next level in a pattern resembling a branch in the decision tree, etc., until the feature vector reaches a leaf node, where the feature vector is classified in the category belonging to that leaf. In feed-forward neural networks, the classification decision is made in the last layer, i.e., the leaf layer, and not in the intermediate layers.
As an example of a decision tree, consider the problem of discriminating between males and females as previously described. The first node may decide that if the weight is greater than 140 pounds, the person is a male. If the weight is less than or equal to 140 pounds, the next node checks if the height is less than 5'8". If it is, then the leaf node classifies the person as a female; otherwise, the person is classified as a male.
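The two-node tree in this example can be written out directly. The sketch below is illustrative only; the function name is hypothetical, and height is expressed in inches (5'8" = 68 inches) for convenience.

```python
def classify_person(weight_lb, height_in):
    # Root node: a single test on the weight feature.
    if weight_lb > 140:
        return "male"
    # Second node: a single test on the height feature (5'8" = 68 inches).
    if height_in < 68:
        return "female"
    return "male"


print(classify_person(150, 66))
print(classify_person(130, 65))
print(classify_person(130, 70))
```

Each node examines only one feature, so each test corresponds to a boundary perpendicular to one axis of feature space.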
Decision trees are advantageous in that they (1) train fast and (2) have a self-organizing architecture that does not have to be specified beforehand.
But decision trees are not without their limitations. Most decision tree processes only consider one feature at a time for the classification decision, which creates discriminant boundaries perpendicular to the feature axes. For example, in the above case, the decision for whether the weight is above or below 140 pounds is implemented by a discriminant that is perpendicular to the weight axis at the point 140. This is a severe limitation for linearly separable problems that might require a diagonal discriminant, as illustrated in FIG. 3a versus FIG. 3b. Note that to partition such classes, a staircase-like discriminant is necessary, requiring a node for each line. On the other hand, a single diagonal discriminant, which can be implemented by a single layer perceptron (also known as a neuron) in an MLP, can separate the two classes. Several decision tree processes, such as CART and ID3, are described by Breiman, Friedman, et al. in Classification and Regression Trees, Wadsworth International Group, Belmont, Calif. 1984.