Many pattern recognition machines have a built-in capability to adapt their useful output on the basis of "training data." One such system is the adaptive neural network, which is finding increasing use in character and speech recognition applications. An example of this type of system is described in U.S. Pat. No. 5,067,164, issued Nov. 19, 1991, to J. S. Denker et al., and assigned to Applicants' assignee. This patent is hereby incorporated by reference herein in its entirety.
Workers in the art of pattern recognition have recognized that in training the neural network it is useful to take into account a characteristic of patterns known as "invariance." The term "invariance" as used herein refers to the invariance of the nature of a pattern to a human observer, with respect to some transformation of that pattern. For instance, the nature of the image of a "3" pattern is invariant by translation, which is a linear displacement of the image. That is, translating the image does not change the meaning of the image to a human observer. On the other hand, the nature of the image of a "6" is not invariant by a rotation of 180 degrees: to a human observer it becomes a "9." To the same observer, however, a small rotation of the upright "6" image does not change the meaning of the image.
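The "6" and "9" example can be made concrete with a minimal sketch. The 5x3 glyphs `SIX` and `NINE` and the helper `rotate_180` are hypothetical names chosen for illustration only; they do not come from the cited references.

```python
def rotate_180(image):
    """Rotate a 2-D image (a list of rows) by 180 degrees."""
    # Reversing the row order and then each row is a 180-degree rotation.
    return [row[::-1] for row in reversed(image)]

# Toy 5x3 binary glyphs (illustrative assumptions, not real training data).
SIX = [
    [1, 1, 1],
    [1, 0, 0],
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
NINE = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
    [0, 0, 1],
    [1, 1, 1],
]

# The rotated "6" is pixel-for-pixel the "9" glyph, so a classifier that is
# invariant under this rotation could not tell the two digits apart.
six_rotated = rotate_180(SIX)
```

In this toy setting, a 180-degree rotation is exactly the kind of transformation with respect to which the output should not be invariant.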
A desirable property of a pattern recognizing machine is that its output be invariant with respect to some specific transformation of its input. In the case of alphanumeric patterns, the possible transformations include: translation, rotation, scaling, hyperbolic deformations, line thickness changes, grey-level changes, and others.
In many systems in which the processing machine adaptively "learns," it is useful to input not only raw training data but also some amount of high-level information about the invariances of the training data input patterns. In automated alphanumeric character recognition, for example, the answer generated by the classifier machine should be invariant with respect to small spatial distortions of the input images (translations, rotations, scale changes, etc.). In speech recognition systems, the system should be invariant to slight time distortions or pitch shifts.
A particular example of such a system is a neural network-based machine for making accurate classifications as to the identity of letters and numbers in the address block of envelopes being processed at a postal service sorting center. Here, it is necessary that the neural network be trained to recognize accurately the many shapes and sizes in which each letter or number is formed on the envelope by postal service users.
Given an unlimited amount of training data and training time, this type of system could learn the relevant invariances from the data alone, but this is often infeasible. On the other hand, having only a limited amount of input data for the learning process can also degrade the accuracy of recognition.
This latter limitation is addressed by the prior art by using artificial data that consists of various distortions (translations, rotations, scalings, etc.) of the original data. This procedure, called the "distortion model," allows the statistical inference process to learn to distinguish the noise from the signal. This model is described in an article "Document Image Defect Models" by Henry Baird, published in IAPR 1990 Workshop on Syntactic and Structural Pattern Recognition (1990). Unfortunately, if the distortions are small, the learning procedure makes little use of the additional information provided by the distorted pattern. The reason is that the information contained in the difference between the two patterns is masked by the information they have in common. Learning is therefore prohibitively slow. If the distortions are made larger, however, learning performance can also degrade, because the distribution of patterns in the database no longer reflects the distribution on which the system must perform.
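The general idea of training on distorted copies can be sketched as follows. This is a minimal illustration restricted to small translations, not the procedure of the cited article; the names `translate` and `augment`, the parameters `max_shift` and `copies`, and the 4x4 toy image are all assumptions introduced here.

```python
import random

def translate(image, dx, dy, fill=0):
    """Shift a 2-D image (list of rows) by dx columns and dy rows."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            ny, nx = y + dy, x + dx
            # Pixels shifted outside the frame are dropped.
            if 0 <= ny < height and 0 <= nx < width:
                shifted[ny][nx] = image[y][x]
    return shifted

def augment(samples, max_shift=1, copies=4, seed=0):
    """Return the original (image, label) pairs plus translated copies."""
    rng = random.Random(seed)
    augmented = list(samples)
    for image, label in samples:
        for _ in range(copies):
            dx = rng.randint(-max_shift, max_shift)
            dy = rng.randint(-max_shift, max_shift)
            # The distorted copy keeps the label of the original pattern.
            augmented.append((translate(image, dx, dy), label))
    return augmented

# A toy 4x4 blob standing in for a digit image (hypothetical data).
toy = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
training_set = augment([(toy, "3")], max_shift=1, copies=4)
```

Here `max_shift` plays the role of the distortion size discussed above: a small value yields copies that differ little from the original, while a large value yields patterns that may no longer resemble those the system must recognize.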
Another approach found in the prior art to overcome this limitation is to incorporate into the training procedure some general invariances without specifying the specific transformations (rotation, translation, etc.) of the input which will leave the output invariant. This procedure is exemplified by the "weight decay" model described in the article "Learning Internal Representations by Error Propagation," published in Parallel Distributed Processing, Volume 1 (1987), by D. E. Rumelhart, G. E. Hinton, and R. J. Williams. It attempts to decrease the network sensitivity to all variations of the network input. One problem with the results obtained with this model, however, is a lack of realism. While invariance with respect to a few specific transformations does not compromise correct output classification, invariance with respect to a transformation that, to the human observer, makes the transformed letter look like another letter will result in incorrect classification of one of the two letters. This lack of selectivity is a well-known limitation of the "weight decay" and like models.
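The lack of selectivity can be seen in a minimal sketch of one common formulation of weight decay, in which a penalty proportional to the sum of squared weights is added to the training loss. This formulation, the function name `sgd_step_with_weight_decay`, and the parameter values are assumptions made for illustration and are not taken from the cited article.

```python
def sgd_step_with_weight_decay(weights, gradients, learning_rate=0.1, decay=0.01):
    """One gradient step on: loss + (decay / 2) * sum(w ** 2 for w in weights)."""
    # The decay term pulls every weight toward zero by the same factor,
    # regardless of which input transformation that weight responds to.
    return [w - learning_rate * (g + decay * w)
            for w, g in zip(weights, gradients)]

weights = [1.0, -2.0, 0.5]
zero_gradients = [0.0, 0.0, 0.0]

# With no data gradient, each weight shrinks by the factor
# (1 - learning_rate * decay).
shrunk = sgd_step_with_weight_decay(weights, zero_gradients)
```

Because the shrinkage is uniform, the update cannot favor invariance to a small rotation of a "6" while preserving sensitivity to the 180-degree rotation that turns it into a "9."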
The factors of training time, correctness, and limitations on available data are therefore not yet satisfactorily addressed, and they remain an issue in the use of neural networks to recognize handwritten script in the address block of envelopes. Training obviously cannot be conducted using the total universe of letters and numbers on envelopes that flow through the Postal Service. Training instead on samples of these typically requires many thousands of samples to teach a network to distinguish useful information from noise. Further, a training session for a neural network can take days or weeks in order to make the best use of the training data available. As to results, even in the best of circumstances, prior art neural network classifiers seldom achieve, after training, better than approximately 95 percent correct classifications on uncleaned handwritten digit databases.
In short, having to convey useful information about the database to the learning procedure by enumerating thousands of sample patterns is a primary inefficiency of the prior art.
Accordingly, one object of the invention is to build recognition databases more efficiently.
Another object of the invention is to incorporate into a neural network training database new and useful instructions as to how the input data may be transformed.
A specific object of the invention is to automatically recognize handwritten alphanumeric script more rapidly and at the same time more accurately in real time.