When statistical classifiers are used for pattern recognition, it is generally desirable to employ an adaptive type of classifier, rather than a programmed classifier, in those situations where input samples which belong to the same class can have an unknown variance between them. One example of such a situation is the field of handwriting recognition. While alphanumeric characters each have well defined shapes, the manner in which individuals write those characters can vary widely among different persons. Even the same person can write characters differently at various times. A classifier that is programmed to recognize particular patterns, such as the letters of an alphabet, may not be able to accommodate the nuances introduced by the handwriting of different individuals. Conversely, a classifier which has trainable properties, for example a neural network, can be taught to recognize that different variations of the same character belong in the same class. For this reason, adaptive statistical classifiers are used for applications such as speech recognition, handwriting recognition and optical character recognition.
The present invention is directed to the manner in which adaptive statistical classifiers, such as neural networks, are trained to recognize input patterns such as those representative of handwriting or speech. Generally, in the training of a classifier, a number of training samples are individually provided as input data to the classifier, and a target output is designated for each input sample. Each time a training sample is input, the classifier produces an output vector. This output vector is compared with the target output, and the differences between the two represent an error. This error is then employed to train the classifier. For example, in a neural network, the value of the error can be back-propagated through each of the layers of the network, and used to adjust weights assigned to paths through which the data propagates in the operation of the classifier. By repeating this process a number of times, the weights are adjusted in a manner which causes the outputs of the classifier to converge toward the target value for a given input sample.
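By way of illustration only, the general training cycle described above (input a sample, compare the output vector with the target output, and use the error to adjust weights) can be sketched as follows. The sketch assumes a single layer of sigmoid output units trained with a delta-rule update, which is a deliberate simplification of multi-layer back-propagation; the function names, the learning rate, the number of passes and the toy data are all hypothetical and do not appear in the foregoing description.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(weights, biases, sample):
    """Return the output vector (one value per class) for one input sample."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, sample)) + b)
        for row, b in zip(weights, biases)
    ]


def train(samples, targets, n_classes, epochs=2000, rate=0.5, seed=0):
    """Repeatedly present every sample and adjust the weights from the error."""
    rng = random.Random(seed)
    n_inputs = len(samples[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_inputs)]
               for _ in range(n_classes)]
    biases = [0.0] * n_classes
    for _ in range(epochs):
        for sample, target in zip(samples, targets):
            output = forward(weights, biases, sample)
            for k in range(n_classes):
                # Difference between target output and actual output.
                error = target[k] - output[k]
                # Squared-error gradient passed through the sigmoid unit.
                delta = error * output[k] * (1.0 - output[k])
                for i in range(n_inputs):
                    weights[k][i] += rate * delta * sample[i]
                biases[k] += rate * delta
    return weights, biases


# Hypothetical toy data: two input patterns, two output classes.
samples = [[1.0, 0.0], [0.0, 1.0]]
targets = [[1.0, 0.0], [0.0, 1.0]]
weights, biases = train(samples, targets, n_classes=2)
outputs = [forward(weights, biases, s) for s in samples]
```

After repeated passes, each output vector in `outputs` lies close to its designated target, which is the convergence behavior referred to above.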
It can be appreciated that the training procedure has a significant impact on the resulting ability of the classifier to successfully classify applied input patterns. Typically, a classifier is trained to classify each input sample into one or a few output classes which have the highest probabilities of being the correct class. While a conventional classifier may provide good results for output classes having a probability near 0.5 or greater, it typically does not provide a good estimate when there is a low probability that a given input sample belongs to a particular class. Rather, it tends to estimate the probability of such classes as zero. This type of behavior may not be desirable in all cases. For example, in handwriting recognition, a vertical stroke might typically represent the lowercase letter "l", the capital letter "I", or the numeral "1". There is also the possibility, however, that the vertical stroke represents an exaggerated punctuation mark, such as a comma or apostrophe. If outcomes which have a very low probability are forced towards zero, the classifier might not indicate that there is a possibility that a vertical stroke represents a comma or an apostrophe. As a result, this possibility is not taken into account during subsequent word recognition. Accordingly, it is one objective of the present invention to provide a technique for training a classifier in a manner which provides a more reliable estimate of outputs having low probabilities.
Typically, a statistical classifier attempts to place every input sample into one or more of the recognized output classes. In some instances, however, a particular input sample does not belong to any output class. For example, in a handwriting recognition environment, a particular stroke may constitute one portion of a character, but be meaningless by itself. If, however, the stroke is separated from the rest of the character, for example by a small space, the classifier may attempt to classify the stroke as a separate character, giving an erroneous output. Accordingly, it is another objective of the present invention to train a statistical classifier in a manner which facilitates recognition of input samples which do not belong to any of the output classes, and thereby avoid erroneous attempts by the classifier to place every possible input sample into a designated output class.
In a typical training session for a classifier, every input sample in a set of samples might be employed the same number of times to train the classifier. However, some input samples have greater value, in terms of their effect on training the classifier, than others. For example, if the classifier correctly recognizes a particular input sample, there is no need to repeatedly provide that sample to the classifier in further training iterations. Conversely, if an input sample is not correctly classified, it is desirable to repeat the training on that sample, to provide the classifier with a greater opportunity to learn its correct classification. It is a further objective of the present invention to provide an efficient training procedure, in which the probability that different input samples are provided to the classifier is adjusted in a manner which is commensurate with their value in the training of the classifier.
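The general idea of presenting samples with unequal probability can be illustrated with a minimal sketch. It assumes, purely for illustration, that a misclassified sample is always presented again while a correctly classified sample is presented only with a reduced probability; the names `presentation_probability`, `select_for_training` and the parameter `keep_if_correct` are hypothetical, and the sketch is not represented as the specific procedure of the invention.

```python
import random


def presentation_probability(correctly_classified, keep_if_correct=0.2):
    """Probability of presenting a sample in the next training pass.

    keep_if_correct is a hypothetical tuning parameter.
    """
    return keep_if_correct if correctly_classified else 1.0


def select_for_training(samples, correct_flags, rng, keep_if_correct=0.2):
    """Pick the samples to present in the next pass."""
    selected = []
    for sample, correct in zip(samples, correct_flags):
        # rng.random() is in [0, 1), so a probability of 1.0 always selects.
        if rng.random() < presentation_probability(correct, keep_if_correct):
            selected.append(sample)
    return selected


rng = random.Random(0)
samples = ["a", "b", "c", "d"]
correct_flags = [True, False, True, False]
chosen = select_for_training(samples, correct_flags, rng)
```

In this sketch the misclassified samples "b" and "d" are always retained, whereas "a" and "c" are retained only occasionally, so training effort is concentrated on the samples that still have something to teach the classifier.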
In a typical application, it is unlikely that all possible classes will occur with the same frequency during run-time operation of the classifier. For example, in the English language, the letter "Q" is used infrequently, relative to most other letters of the alphabet. If, during the training of the classifier, the letter "Q" appears significantly less often in training samples than other letters, the classifier may exhibit an improper bias against that letter, e.g., it may assign the letter "Q" a much lower probability than the letter "O" for a circular input pattern. In the past, efforts have been made to compensate for this bias by adjusting the output values produced by the classifier. See, for example, R. P. Lippmann, "Neural Networks, Bayesian A Posteriori Probabilities, and Pattern Classification," appearing in From Statistics to Neural Networks--Theory and Pattern Recognition Applications, V. Cherkassky, J. H. Friedman and H. Wechsler, eds., Springer-Verlag, Berlin, 1994, pages 83-104; and N. Morgan and H. Bourlard, "Continuous Speech Recognition--An Introduction to the Hybrid HMM/Connectionist Approach," IEEE Signal Processing, Vol. 13, No. 3, pages 24-42, May 1995. Rather than consuming run-time resources to compensate output values, as was done in the past, it is preferable to eliminate the bias against rare classes through appropriate training of the classifier. Accordingly, it is another objective of the present invention to train a classifier in a manner which accounts for unequal frequencies of classes among input samples in a training set, and thereby factor out any tendency for bias against low frequency classes.
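One common form of the run-time output compensation referred to above is to divide each classifier output by the relative frequency of its class in the training set and then renormalize. The following minimal sketch assumes that form; the function name `compensate_for_priors` and the numerical values are hypothetical and are chosen only to show the effect.

```python
def compensate_for_priors(outputs, class_priors):
    """Divide each output by its class's training-set frequency, then renormalize."""
    scaled = [p / prior for p, prior in zip(outputs, class_priors)]
    total = sum(scaled)
    return [s / total for s in scaled]


# Hypothetical two-class case: index 0 is "O", index 1 is "Q".
raw_outputs = [0.8, 0.2]      # classifier outputs for a circular input pattern
training_priors = [0.9, 0.1]  # relative class frequencies in the training set
adjusted = compensate_for_priors(raw_outputs, training_priors)
```

With these values the rare class "Q" receives the larger adjusted score even though its raw output was smaller. The division must be repeated for every classification at run time, which is the cost that training-time compensation seeks to avoid.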
In training a classifier, it is a common problem that the classifier will become "overfitted" to the training data, and will not exhibit good generalization to novel samples encountered in actual use. One prior art method of improving generalization is to augment the set of training samples with modified copies of the training samples, to help the classifier learn to generalize across a class of modifications. This approach, discussed by Chang and Lippmann, "Using Voice Transformations to Create Additional Training Talkers for Word Spotting", Advances in Neural Information Processing Systems 7, 1995, requires a large increase in the size of the training set actually handled by the training process, resulting in a significant slow-down and a large memory requirement. Accordingly, it is another objective of the present invention to train a classifier in a manner which avoids overfitting and helps generalization across a range of modifications or distortions, without a large increase in training cost in terms of compute time or memory.
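The cost of the prior art augmentation approach can be seen in a short sketch. It assumes, for illustration only, that each modified copy is produced by adding a small random perturbation to every input feature; the helper names `distort` and `augment`, the `jitter` parameter and the data are hypothetical, and the cited work uses its own voice transformations rather than this perturbation.

```python
import random


def distort(sample, rng, jitter=0.05):
    """Return a modified copy of one sample (hypothetical random perturbation)."""
    return [x + rng.uniform(-jitter, jitter) for x in sample]


def augment(samples, labels, copies_per_sample, seed=0):
    """Build an enlarged training set holding the originals plus modified copies."""
    rng = random.Random(seed)
    aug_samples, aug_labels = list(samples), list(labels)
    for sample, label in zip(samples, labels):
        for _ in range(copies_per_sample):
            aug_samples.append(distort(sample, rng))
            aug_labels.append(label)
    return aug_samples, aug_labels


samples = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5]]
labels = ["O", "Q", "C"]
aug_samples, aug_labels = augment(samples, labels, copies_per_sample=10)
growth_factor = len(aug_samples) / len(samples)
```

Because every modified copy is stored and then processed in each training pass, the size of the training set, and with it the memory and compute cost, grows by a factor of one plus the number of copies per sample.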
In a classifier that is based upon neural networks, the output values that are generated in response to a given input sample are determined by the weights of the paths along which the input data propagates, i.e., the interconnections between the nodes in successive layers of the network. Previous large-scale studies have concluded that the performance of a neural network in a recognition task degrades seriously when the weights have a resolution of less than about 15 bits, i.e., roughly two bytes. See Asanovic and Morgan, "Experimental Determination of Precision Requirements for Back Propagation Training of Artificial Neural Networks", Tech Report of the International Computer Science Institute, Berkeley, Calif., 1991. Depending upon the complexity of the network, there may be hundreds or even thousands of interconnections between the various nodes. Since a weight must be stored for each such interconnection, it can be appreciated that a significant amount of storage capacity is required for the weight information that defines a neural network's function. It is a further objective of the present invention to provide a neural network which can successfully operate with weights defined by smaller values, e.g., one byte each, and thereby reduce the memory requirements for the run-time operation of the network.
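The storage saving at stake can be made concrete with a brief worked example. The layer sizes below are hypothetical, and the linear rounding of each weight to a signed one-byte code is merely one simple way of obtaining one-byte weights, assumed here for illustration; it is not represented as the technique of the invention.

```python
def count_weights(layer_sizes):
    """Number of interconnections in a fully connected layered network."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def storage_bytes(n_weights, bytes_per_weight):
    return n_weights * bytes_per_weight


def quantize_to_byte(weights):
    """Map weights linearly onto signed one-byte codes in [-127, 127]."""
    max_abs = max(abs(w) for w in weights) or 1.0  # avoid a zero scale
    scale = max_abs / 127.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    return [c * scale for c in codes]


# Hypothetical network: 64 inputs, 50 hidden nodes, 26 output classes.
n_weights = count_weights([64, 50, 26])
two_byte_storage = storage_bytes(n_weights, 2)
one_byte_storage = storage_bytes(n_weights, 1)

example_weights = [0.75, -1.5, 0.02, 1.2]
codes, scale = quantize_to_byte(example_weights)
restored = dequantize(codes, scale)
```

Halving the bytes per weight halves the weight storage, while the rounding error on each restored weight is bounded by half of `scale`. Whether a network tolerates that error depends on how it was trained.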