The present invention is directed to adaptive classifiers of the type that are capable of being trained, such as neural networks, and more particularly to procedures for training such devices, especially for use in handwriting recognition systems and other recognition systems.
When statistical classifiers are used for pattern recognition, it is generally desirable to employ an adaptive type of classifier, rather than a programmed classifier, in those situations where input samples which belong to the same class can have an unknown variance between them. One example of such a situation is the field of handwriting recognition. While alphanumeric characters each have well defined shapes, the manner in which individuals write those characters can vary widely among different persons. Even the same person can write characters differently at various times. A classifier that is programmed to recognize particular patterns, such as the letters of an alphabet, may not be able to accommodate the nuances introduced by the handwriting of different individuals. Conversely, a classifier which has trainable properties, for example a neural network, can be taught to recognize that different variations of the same character belong in the same class. For this reason, adaptive statistical classifiers are used for applications such as speech recognition, handwriting recognition and optical character recognition.
The present invention is directed to the manner in which adaptive statistical classifiers, such as neural networks, are trained to recognize input patterns such as those representative of handwriting or speech. Generally, in the training of a classifier, a number of training samples are individually provided as input data to the classifier, and a target output is designated for each input sample. Each time a training sample is input, the classifier produces an output vector. This output vector is compared with the target output, and the differences between the two represent an error. This error is then employed to train the classifier. For example, in a neural network, the value of the error can be back-propagated through each of the layers of the network, and used to adjust weights assigned to paths through which the data propagates in the operation of the classifier. By repeating this process a number of times, the weights are adjusted in a manner which causes the outputs of the classifier to converge toward the target value for a given input sample.
It can be appreciated that the training procedure has a significant impact on the resulting ability of the classifier to successfully classify applied input patterns. Typically, a classifier is trained to classify each input sample into one or a few output classes which have the highest probabilities of being the correct class. While a conventional classifier may provide good results for output classes having a probability near 0.5 or greater, it typically does not provide a good estimate when there is a low probability that a given input sample belongs to a particular class. Rather, it tends to estimate the probability of such classes as zero. However, this type of behavior may not be desirable in all cases. For example, in handwriting recognition, a vertical stroke might typically represent the lowercase letter “l”, the capital letter “I”, or the numeral “1”. However, there is also the possibility that the vertical stroke might represent an exaggerated punctuation mark, such as a comma or apostrophe. If outcomes which have a very low probability are forced towards zero, the classifier might not indicate that there is a possibility that a vertical stroke represents a comma or an apostrophe. As a result, this possibility is not taken into account during subsequent word recognition. Accordingly, it is one objective of the present invention to provide a technique for training a classifier in a manner which provides a more reliable estimate of outputs having low probabilities.
Typically, a statistical classifier attempts to place every input sample into one or more of the recognized output classes. In some instances, however, a particular input sample does not belong to any output class. For example, in a handwriting recognition environment, a particular stroke may constitute one portion of a character, but be meaningless by itself. If, however, the stroke is separated from the rest of the character, for example by a small space, the classifier may attempt to classify the stroke as a separate character, giving an erroneous output. Accordingly, it is another objective of the present invention to train a statistical classifier in a manner which facilitates recognition of input samples which do not belong to any of the output classes, and thereby avoid erroneous attempts by the classifier to place every possible input sample into a designated output class.
In a typical training session for a classifier, every input sample in a set of samples might be employed the same number of times to train the classifier. However, some input samples have greater value, in terms of their effect on training the classifier, than others. For example, if the classifier correctly recognizes a particular input sample, there is no need to repeatedly provide that sample to the classifier in further training iterations. Conversely, if an input sample is not correctly classified, it is desirable to repeat the training on that sample, to provide the classifier with a greater opportunity to learn its correct classification. It is a further objective of the present invention to provide an efficient training procedure, in which the probability that different input samples are provided to the classifier is adjusted in a manner which is commensurate with their value in the training of the classifier.
In a typical application, it is unlikely that all possible classes will occur with the same frequency during run-time operation of the classifier. For example, in the English language, the letter “Q” is used infrequently, relative to most other letters of the alphabet. If, during the training of the classifier, the letter “Q” appears significantly less often in training samples than other letters, the classifier may exhibit an improper bias against that letter, e.g., it may have a much lower probability than the letter “O” for a circular input pattern. In the past, efforts have been made to compensate for this bias by adjusting the output values produced by the classifier. See, for example, R. P. Lippmann, “Neural Networks, Bayesian A Posteriori Probabilities, and Pattern Classification,” appearing in From Statistics to Neural Networks: Theory and Pattern Recognition Applications, V. Cherkassky, J. H. Friedman and H. Wechsler, Springer-Verlag, Berlin, 1994, pages 83-104; and N. Morgan and H. Bourlard, “Continuous Speech Recognition: An Introduction to the Hybrid HMM/Connectionist Approach,” IEEE Signal Processing, Vol. 13, No. 3, pages 24-42, May 1995. Rather than consuming run-time resources to compensate output values, as was done in the past, it is preferable to eliminate the bias against rare classes through appropriate training of the classifier. Accordingly, it is another objective of the present invention to train a classifier in a manner which accounts for unequal frequencies of classes among input samples in a training set, and thereby factor out any tendency for bias against low frequency classes.
In training a classifier, it is a common problem that the classifier will become “overfitted” to the training data, and will not exhibit good generalization to novel samples encountered in actual use. A prior art method to improve generalization is to augment the set of training samples with modified copies of the training samples, to help the classifier learn to generalize across a class of modifications. This approach, discussed by Chang and Lippmann, “Using Voice Transformations to Create Additional Training Talkers for Word Spotting”, Advances in Neural Information Processing Systems 7, 1995, requires a large increase in the size of the training set actually handled by the training process, resulting in a significant slow-down and a large memory requirement. Accordingly, it is another object of the present invention to train a classifier in a manner which avoids overfitting and helps generalization across a range of modifications or distortions, without a large increase in training cost in terms of compute time or memory.
In a classifier that is based upon neural networks, the output values that are generated in response to a given input sample are determined by the weights of the paths along which the input data propagates, i.e., the interconnections between the nodes in successive layers of the network. Previous large-scale studies have concluded that performance of a neural network in a recognition task degrades seriously for weights having a resolution smaller than about 15 bits or two bytes. See Asanovic and Morgan, “Experimental Determination of Precision Requirements for Back Propagation Training of Artificial Neural Networks”, Tech Report of the International Computer Science Institute, Berkeley, Calif., 1991. Depending upon the complexity of the network, there may be hundreds or even thousands of interconnections between the various nodes. Since a weight must be stored for each such interconnection, it can be appreciated that a significant amount of storage capacity is required for the weight information that defines a neural network's function. It is a further objective of the present invention to provide a neural network which can successfully operate with weights defined by smaller values, e.g., one byte each, and thereby reduce the memory requirements for the run-time operation of the network.
In accordance with the present invention, the first objective noted above is achieved by reducing back-propagated errors for classifier outputs that correspond to incorrect classes, relative to those for correct classes. The effect of such an approach is to raise the output values from the classifier, since the training procedure does not push the outputs for incorrect classifications towards zero as much as conventional training processes do. As a result, the classifier does a better job of estimating probabilities of classifications when they are low, in addition to probabilities in the range of 0.5 and greater.
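By way of illustration only, the reduction of back-propagated errors for incorrect classes might be sketched as follows. The function name `scaled_output_errors`, the scale factor of 0.1, and the use of a simple difference between target and output as the error term are assumptions made for this sketch; none of them is specified in the foregoing description.

```python
def scaled_output_errors(outputs, target_index, incorrect_scale=0.1):
    """Return one error term per classifier output.

    The error for the correct class is (1 - output). The error for every
    other class is (0 - output) multiplied by incorrect_scale, so that
    incorrect classes are pushed towards zero less strongly.
    incorrect_scale is a hypothetical parameter chosen for illustration.
    """
    errors = []
    for i, y in enumerate(outputs):
        target = 1.0 if i == target_index else 0.0
        err = target - y
        if i != target_index:
            err *= incorrect_scale  # reduced error for incorrect classes
        errors.append(err)
    return errors


# Example: three output classes, class 0 is the correct one.
errors = scaled_output_errors([0.6, 0.3, 0.1], target_index=0)
```

In this sketch the returned values would take the place of the unscaled output errors at the start of back-propagation; a scale factor of 1.0 would reproduce conventional training.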
The second objective identified above is accomplished through a negative training procedure. In this procedure, patterns which do not belong in any of the output classifications are employed as training inputs to the classifier, in addition to correct patterns. A target output value of zero for all classes is employed for these “incorrect” patterns. As a result, the overall ability of the classifier to recognize related groups of input patterns, for example words in a handwriting recognition environment, is significantly improved.
As a further feature of the invention, negative training samples have a different probability of being employed in a given training session than correct samples. With this approach, the training time is made more efficient, and the classifier has less tendency to suppress outputs for certain inputs that comprise legitimate members of a class by themselves as well as being portions of a larger pattern. In addition, the error value which is back-propagated through the classifier in response to a negative training sample can be scaled differently than for a positive sample, to trade off the need to push output values towards zero for negative input samples against the need to not overly suppress output values for characters that may resemble negative samples.
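As a non-limiting sketch, negative training with a separate probability of use and a separate error scale might look like the following. The function name `training_errors` and the default values of `negative_prob` and `negative_scale` are hypothetical; the description above states only that the probability and the scaling can differ from those for positive samples.

```python
import random


def training_errors(outputs, is_negative, target_index=None,
                    negative_prob=0.2, negative_scale=0.5, rng=random):
    """Return the output error vector for one sample, or None if skipped.

    Negative samples (patterns belonging to no class) are used only with
    probability negative_prob, have a target of zero for every class, and
    have their errors multiplied by negative_scale. Positive samples use
    a target of 1 for the correct class and 0 elsewhere. Both parameter
    defaults are assumptions chosen for illustration.
    """
    if is_negative:
        if rng.random() >= negative_prob:
            return None  # skip this negative sample in this iteration
        return [negative_scale * (0.0 - y) for y in outputs]
    return [(1.0 if i == target_index else 0.0) - y
            for i, y in enumerate(outputs)]


# Example: a positive sample of class 1, then a negative sample.
positive = training_errors([0.4, 0.2], is_negative=False, target_index=1)
negative = training_errors([0.4, 0.2], is_negative=True)
```

Lowering `negative_scale` reduces the pressure towards zero on outputs for characters that resemble negative samples, which is the trade-off noted above.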
As a further extension of this concept, another of the objectives identified above is accomplished by providing every training sample with a probability for its use in any given training iteration. For example, in a handwriting recognition application, when a character is correctly classified, there is little advantage to be gained by further training on that character. Rather, it may be preferable to skip the sample to save time, and to prevent overtraining or committing the classifier's power to non-discriminative features. Therefore, correctly classified samples are given a lower probability of use than those which are incorrectly classified. Further in this regard, the probability of use of an input sample can be varied over the training session. Specifically, at the beginning of a training session, correctly classified samples are given a low probability of being trained upon, so that the training resources are focused primarily towards training on errors. However, as training continues, the probability of utilizing a correctly classified sample is increased, to avoid biasing the classifier towards peculiarities in the input samples.
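One possible way to realize such a schedule is sketched below. The linear ramp, the starting probability of 0.1, the final probability of 1.0, and the names `use_probability` and `should_train` are all assumptions made for illustration; the description above requires only that the probability for correctly classified samples start low and increase as training continues.

```python
import random


def use_probability(correct, progress, start_prob=0.1, end_prob=1.0):
    """Probability of training on a sample in the current iteration.

    correct  -- True if the classifier currently classifies the sample
                correctly.
    progress -- fraction of the training session completed, clamped to
                the range 0.0 to 1.0.
    Misclassified samples are always used. Correctly classified samples
    ramp linearly from start_prob to end_prob (an assumed schedule).
    """
    if not correct:
        return 1.0
    progress = min(max(progress, 0.0), 1.0)
    return start_prob + (end_prob - start_prob) * progress


def should_train(correct, progress, rng=random):
    """Randomly decide whether to train on the sample or skip it."""
    return rng.random() < use_probability(correct, progress)
```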
As a further feature of the invention, another of the objectives identified above is accomplished through the balancing of the frequencies of training samples. For a given set of training samples, a repetition factor is calculated for each output class, which indicates the average number of times an input pattern should be repeated each time a pattern from the class is selected for training. Thereafter, input samples are randomly selected and can be either skipped or repeated, depending on their repetition factors. With this approach, samples which belong to more frequent classes are less likely to be used as training inputs than those which belong to rarer classes.
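The following sketch illustrates one way in which repetition factors could be computed and applied. The formula used here (mean class count divided by the class count, raised to an adjustable power) and the names `repetition_factors` and `repetitions` are assumptions made for this example; the description above does not state how the factor is calculated.

```python
import math
import random
from collections import Counter


def repetition_factors(labels, power=1.0):
    """Map each class to an assumed repetition factor.

    Classes rarer than average receive a factor above 1 (repeat), and
    classes more frequent than average receive a factor below 1 (skip
    some of the time). power = 0 would disable balancing entirely.
    """
    counts = Counter(labels)
    mean_count = sum(counts.values()) / len(counts)
    return {c: (mean_count / n) ** power for c, n in counts.items()}


def repetitions(factor, rng=random):
    """Turn a fractional factor into a whole number of presentations.

    The integer part is always used; one extra presentation is added
    with probability equal to the fractional part, so the expected
    number of presentations equals the factor. A result of 0 means the
    selected sample is skipped.
    """
    whole = math.floor(factor)
    frac = factor - whole
    return whole + (1 if rng.random() < frac else 0)


# Example: "O" is three times as common as "Q" in this toy training set.
factors = repetition_factors(["O"] * 6 + ["Q"] * 2)
```

In this toy set the factor for “Q” is 2 and the factor for “O” is 2/3, so a selected “O” sample is skipped about one time in three while a selected “Q” sample is presented twice.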
As another feature of the present invention, overfitting is prevented and generalization is improved by applying random distortions from a predetermined family of distortions to training samples as they are selected during the training process, so that repeated selection of samples over time does not result in training the classifier on identical data patterns. In particular, samples that are repeated for balancing, as mentioned above, are given different random distortions on each repetition.
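A minimal sketch of applying a fresh random distortion each time a sample is selected is given below. The particular family of distortions (independent scaling of each axis followed by a small rotation), the parameter ranges, and the representation of a sample as a list of (x, y) points are assumptions made for illustration only.

```python
import math
import random


def distort(points, rng=random, max_scale=0.1, max_rotation=0.1):
    """Return a randomly distorted copy of a list of (x, y) points.

    A new scale for each axis and a new rotation angle (in radians) are
    drawn on every call, so repeated selections of the same sample yield
    different inputs. The stored sample itself is not modified.
    """
    sx = 1.0 + rng.uniform(-max_scale, max_scale)
    sy = 1.0 + rng.uniform(-max_scale, max_scale)
    theta = rng.uniform(-max_rotation, max_rotation)
    c, s = math.cos(theta), math.sin(theta)
    # Scale each axis first, then rotate about the origin.
    return [(c * sx * x - s * sy * y, s * sx * x + c * sy * y)
            for x, y in points]


# Example: the same stored stroke is distorted differently on each use.
stroke = [(1.0, 0.0), (0.0, 1.0)]
first_use = distort(stroke)
second_use = distort(stroke)
```

Because the distortion is generated at selection time, only the original samples need to be stored, rather than an enlarged training set of pre-distorted copies.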
As another feature of the present invention, the memory requirements of a classifier can be reduced by limiting the range and resolution of the weights which define links between individual nodes of a neural network. In the forward propagation of an input sample through the network during training, low resolution weight values, e.g., one byte each, are employed. During error back-propagation, weight changes are accumulated in high resolution values that are formed by concatenating an extra one or two bytes with the low resolution weight values. If the weight change results in a carry across the boundary between the low resolution and higher resolution bytes, the low resolution weight is varied. After training is completed, only the low resolution weight values need to be stored for the actual operation of the classifier, thereby reducing memory requirements for final products.
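The weight representation described above can be illustrated with the following sketch, which assumes a signed one-byte weight extended by one extra byte of precision and held together as a 16-bit fixed-point integer. The class name `QuantizedWeight`, the choice of a single extra byte, and the saturation at the ends of the range are assumptions made for this example.

```python
class QuantizedWeight:
    """One-byte weight with an extra low-order byte used only in training."""

    EXTRA_BITS = 8  # one extra byte of resolution (assumed for this sketch)

    def __init__(self, weight_byte=0):
        # The high byte holds the low resolution weight; the low byte
        # starts at zero and collects small weight changes.
        self.acc = weight_byte << self.EXTRA_BITS

    @property
    def low_res(self):
        """Signed one-byte weight used for forward propagation."""
        return self.acc >> self.EXTRA_BITS

    def accumulate(self, delta):
        """Add an integer weight change to the high resolution value.

        The low resolution weight changes only when the addition carries
        (or borrows) across the byte boundary. The result saturates at
        the limits of a signed 16-bit value.
        """
        lo, hi = -(1 << 15), (1 << 15) - 1
        self.acc = max(lo, min(hi, self.acc + delta))


# Example: small updates accumulate until they carry into the weight byte.
w = QuantizedWeight(5)
w.accumulate(100)   # low_res is still 5
w.accumulate(200)   # carry across the byte boundary: low_res becomes 6
```

After training, only the value returned by `low_res` would need to be stored for each interconnection.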
These and other features of the invention, as well as the advantages provided thereby, are explained in detail hereinafter with reference to preferred embodiments of the invention illustrated in the accompanying figures.