a. Field of the Invention
The present invention concerns an apparatus, and an accompanying method, for locating positions, such as centers, and spans of all desired objects within an input field. The desired objects can then be preprocessed based on their locations and spans before subsequent classification to optimize such classification process.
b. Related Art
Optical character recognition (or "OCR") systems should be able to accurately automate the process of recognizing and translating first machine printed alphanumeric characters, and ultimately handwritten characters, into appropriate digital data. For various reasons not relevant here, starting several years ago and continuing to the present, neural networks are seen in the art as a preferred technique for providing accurate character recognition in an OCR system. Although neural networks are known to those skilled in the art, they will be briefly described below for the reader's convenience.
In contrast to traditional sequential "Von Neumann" digital processors that operate with mathematical precision, neural networks are generally analog, though digital implementations are increasingly common, and typically manifest massively parallel processing. These networks provide fast and often surprisingly good output approximations, but not precise results, by making weighted decisions on the basis of fuzzy, incomplete and/or frequently contradictory input data.
By way of background, a neural network is basically a configuration of identical processing elements, so-called neurons, arranged in a multi-layered hierarchical configuration. Each neuron can have one or more inputs, but only one output. Each input is weighted by a coefficient. The output of a neuron is typically calculated as a function of the sum of its weighted inputs and a bias value. This function, the so-called activation function, is typically a "sigmoid" function; i.e. it is S-shaped, monotonically increases and asymptotically approaches fixed values, typically +1 and either zero or -1, as its input approaches positive or negative infinity, respectively. The sigmoid function and the individual neural weight and bias values determine the response or "excitability" of the neuron to signals presented to its inputs.
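The neuron computation described above can be illustrated with a minimal sketch. It assumes the logistic form of the sigmoid (approaching zero and +1); the function names and the sample weights, inputs and bias are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    """Logistic sigmoid: S-shaped, monotonically increasing, bounded by 0 and +1."""
    # Two branches avoid overflow in math.exp for large-magnitude inputs.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def neuron_output(inputs, weights, bias):
    """Sum of the weighted inputs plus a bias, passed through the activation function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(total)


# Hypothetical example values (not taken from the text).
example = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, 0.8], bias=-0.2)
print(example)
```

Larger weights or a larger bias push the weighted sum further from zero and hence the output closer to one of its asymptotes, which is the sense in which these values set the "excitability" of the neuron.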
The output of a neuron in one layer may be distributed as input to neurons in a higher layer. A typical neural network contains at least three distinct layers: (i) an input layer situated at the bottom of the network; (ii) an output layer situated at the top of the network; and (iii) one or more hierarchical interconnected hidden layers located intermediately between the input and output layers.
For example, if a neural network were to be used for recognizing normalized alphanumeric characters situated within a 7×5 pixel array, then the output of a sensor for each pixel in that array, such as a cell of an appropriate charge coupled device (CCD), would be routed as input to a different neuron in the input layer. Thirty-five different neurons, one for each different pixel, would exist in this input layer. Each neuron in this input layer would have only one input. The outputs of all 35 neurons in the input layer would be distributed, in turn, as input to every neuron in, e.g., a single intermediate or so-called hidden layer.
The number of neurons in this single hidden layer, as well as the number of separate hidden layers that is used in the neural network, depends, inter alia, upon the complexity of the character bit-maps to be presented to the network for recognition; the desired information capacity of the network; the degree to which the network, once trained, can handle unfamiliar patterns; and the number of iterations that the network must undergo during training for all the network weight and bias values to properly converge.
If the network were to utilize several separate hidden layers, then the output from each neuron in the first (i.e. lowest) hidden layer would feed the inputs to the neurons in the second (i.e. next higher) hidden layer and so forth for the remaining hidden layers. The output of the neurons in the last (i.e. highest) hidden layer would feed the neural inputs in the output layer. The output of the network typically feeds a processor or other circuitry that converts the network output into appropriate multi-bit digital data, e.g. ASCII characters, for subsequent processing.
Generally, the output of each of the neurons in the last hidden layer is distributed as an input to every neuron in the output layer. The number of neurons in the output layer typically equals the number of different characters that the network is to recognize (or classify), with the output of each such neuron corresponding to a different one of these characters. The numerical outputs from all the output layer neurons form the output of the neural network. For example, one output neuron may be associated with the letter "A", another with the letter "B", a third with the letter "a", a fourth with the letter "b" and so on for each different alphanumeric character, including letters, numbers, punctuation marks and/or other desired symbols, if any, that is to be recognized by the network.
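The layered arrangement described above, with 35 pixel inputs feeding a single hidden layer that in turn feeds one output neuron per character, can be sketched as follows. This is only an illustration under stated assumptions: the hidden layer size, the four-character set, the weight bounds and all function names are hypothetical, and the input layer is treated as a simple pass-through of the pixel values.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, written to avoid overflow for large-magnitude inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def make_layer(n_inputs, n_neurons, rng, bound=0.5):
    """One (weights, bias) pair per neuron, drawn at random within [-bound, bound]."""
    return [
        ([rng.uniform(-bound, bound) for _ in range(n_inputs)],
         rng.uniform(-bound, bound))
        for _ in range(n_neurons)
    ]


def layer_forward(layer, inputs):
    """Every neuron in the layer receives the outputs of every neuron below it."""
    return [
        sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in layer
    ]


def network_forward(layers, pixels):
    """Propagate the pixel values upward through each layer in turn."""
    activations = pixels
    for layer in layers:
        activations = layer_forward(layer, activations)
    return activations


N_PIXELS = 7 * 5      # one input per pixel of the 7x5 array
N_HIDDEN = 12         # assumption: hidden layer size chosen arbitrarily
CHARSET = "ABab"      # assumption: a tiny character set for illustration

rng = random.Random(0)
layers = [make_layer(N_PIXELS, N_HIDDEN, rng),
          make_layer(N_HIDDEN, len(CHARSET), rng)]

pixels = [rng.choice([0.0, 1.0]) for _ in range(N_PIXELS)]  # stand-in bit-map
outputs = network_forward(layers, pixels)
print(dict(zip(CHARSET, outputs)))
```

Because the weights here are random and untrained, the printed values carry no meaning; the sketch shows only how one output value per character arises from the pixel inputs.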
The use of a neural network generally involves two distinct successive procedures: (i) initialization and training of the neural network using known pre-defined patterns having known outputs; and (ii) recognition (or classification) of actual unknown patterns by the trained neural network. Although those skilled in the art know how to initialize and train neural networks, such initialization and training is briefly discussed below for the reader's convenience.
To initialize the network, the weights and biases of all the neurons situated therein are set to random values typically within certain fixed bounds. Thereafter, the network is trained. Specifically, the neural network is successively presented with pre-defined input data patterns, i.e. so-called training patterns. The values of the neural weights and biases in the neural network are simultaneously adjusted such that the output of the neural network for each individual training pattern approximately matches a desired corresponding neural network output (target vector) for that pattern. Once training is complete, all the weights and biases are then fixed at their current values.
One technique commonly used in the art for adjusting the values of the weights and biases of all the neurons during training is back error propagation (hereinafter referred to simply as "back propagation"). Briefly, this technique involves presenting a pre-defined input training pattern (input vector) to the neural network and allowing that pattern to be propagated forward through the neural network to produce a corresponding output pattern (output vector, O) at the output neurons. The error associated with the output pattern is determined and then back propagated through the neural network to apportion this error to individual neurons in the network. Thereafter, the weights and bias for each neuron are adjusted in a direction and by an amount that minimizes the total network error for this input pattern.
Once all the network weights have been adjusted for one training pattern, the next training pattern is presented to the network and the error determination and weight adjusting process iteratively repeats, and so on for each successive training pattern. Typically, once the total network error for each of these patterns reaches a pre-defined limit, these iterations stop and training halts. At this point, all the network weight and bias values are fixed at their then current values. Thereafter, character recognition on unknown input data can occur at a relatively high speed.
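The initialization and back propagation procedure outlined above can be sketched for a network with one hidden layer. This is a minimal illustration, not the procedure of any particular system: it assumes a squared-error measure, a logistic sigmoid, a fixed learning rate and two hypothetical four-pixel training patterns, and every name and constant in it is an assumption.

```python
import math
import random


def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def make_layer(n_in, n_out, rng, bound=0.5):
    """Initialize weights and biases to random values within fixed bounds."""
    return {"w": [[rng.uniform(-bound, bound) for _ in range(n_in)]
                  for _ in range(n_out)],
            "b": [rng.uniform(-bound, bound) for _ in range(n_out)]}


def forward(layer, x):
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(layer["w"], layer["b"])]


def train_on_pattern(hidden, output, x, target, lr):
    """One forward pass, one backward pass and one weight update."""
    h = forward(hidden, x)
    o = forward(output, h)
    # Error terms at the output neurons (squared error, sigmoid derivative).
    d_out = [(t - y) * y * (1.0 - y) for t, y in zip(target, o)]
    # Apportion the error to the hidden neurons through the output weights.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * output["w"][k][j]
                                   for k in range(len(d_out)))
             for j, hj in enumerate(h)]
    for k, d in enumerate(d_out):
        for j, hj in enumerate(h):
            output["w"][k][j] += lr * d * hj
        output["b"][k] += lr * d
    for j, d in enumerate(d_hid):
        for i, xi in enumerate(x):
            hidden["w"][j][i] += lr * d * xi
        hidden["b"][j] += lr * d
    # Squared error for this pattern, measured before the update.
    return sum((t - y) ** 2 for t, y in zip(target, o))


def train(hidden, output, patterns, lr=0.5, error_limit=0.01, max_epochs=5000):
    """Present each pattern in turn; stop once every pattern's error is within the limit."""
    history = []
    for _ in range(max_epochs):
        errors = [train_on_pattern(hidden, output, x, t, lr) for x, t in patterns]
        history.append(sum(errors))
        if max(errors) <= error_limit:
            break
    return history


# Hypothetical training patterns: (input vector, target vector).
PATTERNS = [([1.0, 0.0, 0.0, 1.0], [1.0, 0.0]),
            ([0.0, 1.0, 1.0, 0.0], [0.0, 1.0])]

rng = random.Random(0)
hidden = make_layer(4, 3, rng)
output = make_layer(3, 2, rng)
history = train(hidden, output, PATTERNS)
print(len(history), history[0], history[-1])
```

After `train` returns, the weights and biases are simply left unchanged, which corresponds to fixing them at their then-current values before recognition begins.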
Once the neural network has been trained, it can be used to recognize unknown patterns. During pattern recognition, each unknown pattern is applied to the inputs of the neural network and resulting corresponding neural network responses are taken from the output nodes. Ideally speaking, once the neural network recognizes an unknown input pattern to be a given character on which the neural network was trained, then the signal produced by a neuron in the output layer and associated with that character should sharply increase relative to the signals produced by all the other neurons in the output layer.
During character recognition, a "winner take all" approach is generally used to identify the specific character that has been recognized by the network. Under this approach, once the neural network has fully reacted to an input data pattern, then the one output neuron that generates the highest output value relative to those produced by the other output neurons is selected, typically by a processing circuit connected to the neural network, as the network output. Having made this selection, the processor then determines, such as through a simple table look-up operation, the multi-bit digital representation of the specific character identified by the neural network.
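The "winner take all" selection and the subsequent table look-up can be sketched as below. The character set, the look-up table and the function name are hypothetical; the sketch assumes that the output neurons are ordered to match the table entries and that the multi-bit representation is an ASCII code.

```python
# Hypothetical look-up table: output neuron index -> (character, ASCII code).
CHARSET = "ABab"
CODE_TABLE = {index: (char, ord(char)) for index, char in enumerate(CHARSET)}


def winner_take_all(outputs):
    """Select the output neuron with the highest value and look up its character."""
    winner = max(range(len(outputs)), key=lambda i: outputs[i])
    return CODE_TABLE[winner]


# Example output vector in which the second neuron responds most strongly.
char, code = winner_take_all([0.08, 0.91, 0.12, 0.05])
print(char, code)
```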
Neural network based OCR systems have exhibited excellent performance characteristics with machine printed text, particularly "clean" text that exhibits a high degree of uniformity, in terms of line thickness and orientation, from one character to the next. Unfortunately, the task of recognizing characters, even through the use of a neural network, is complicated by the existence of touching or otherwise overlapping characters. While a very small number of machine printed characters actually touch, due to kerning and the like, touching and overlapping characters are particularly prevalent with handwritten text and numerals, effectively exhibiting, due to human variability, an infinite number of variations. As a practical matter, neural networks are not trained to recognize even a major portion, let alone all, of these variations; some preprocessing is needed to simplify the recognition task of a neural network.
In an effort to greatly simplify the task of recognizing human handwriting, the art teaches the use of determining those characters which touch and then segmenting or otherwise partitioning these characters and recognizing each character that results. In this regard, the art teaches two basic approaches: (i) performing segmentation before character recognition; or (ii) simultaneously performing both segmentation and recognition.
One example of the pre-recognition segmentation approach is the system disclosed in U.S. Pat. No. 5,299,269 (issued to R. S. Gaborski et al on Mar. 29, 1994 and assigned to the present assignee hereof, also referred to herein as "the '269 patent"). In the system disclosed in the '269 patent, a window is stepped across an image field, on a pixel-by-pixel basis, to capture a sub-image, i.e. a kernel of the image at each step. An associated memory or neural network is trained, through one training set, to recognize all non-character images that can exist within the sub-image, i.e. all possible intersections and combinations of known characters that correspond to window positions that straddle adjacent characters. This training includes window-captured sub-images that in the past were incorrectly perceived as being centered on a character when, in fact, they were not, i.e. the result of false character segmentation.
The same memory or network, or a second one, is trained, through a second training set, to recognize the individual "non-straddling" characters of a given character set. If one item of the training sets is recognized, then the entire sub-image is forwarded to a downstream portion of an OCR system for further character recognition.
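The stepping of a window across an image field, one pixel at a time, can be sketched as follows. This is a generic illustration and not the implementation of the '269 patent: the image is assumed to be a list of equal-length pixel rows, the window is assumed to span the full height of the field, and the function name and sample bit-map are hypothetical.

```python
def sliding_windows(image, window_width):
    """Yield (left column, sub-image) for each one-pixel step across the field."""
    field_width = len(image[0])
    for left in range(field_width - window_width + 1):
        # Capture the kernel of the image that lies under the window.
        yield left, [row[left:left + window_width] for row in image]


# Hypothetical 3-row by 6-column binary field.
field = [
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 1, 1, 0, 1, 0],
]

windows = list(sliding_windows(field, window_width=3))
for left, sub_image in windows:
    print(left, sub_image)
```

Each captured sub-image would then be presented to whatever recognizer decides whether the window is centered on a character or straddles adjacent characters.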
For the system disclosed in the '269 patent to properly function, the appropriate memory or network must be trained on all possible non-character images. Unfortunately, a very large, potentially infinite, number of such images can exist. Since a training sequence cannot encompass all such non-character images, the accuracy with which any one of these images will be recognized will be reduced as the number of non-character training images is decreased. Furthermore, as the number of different characters which a network (or memory) must recognize increases, the size and complexity of that network (or memory) increases at a considerably greater rate. Hence, the system disclosed in the '269 patent may be impractical in many applications.
Another example of the pre-recognition segmentation approach is described in A. Gupta et al, "An Integrated Architecture for Recognition of Totally Unconstrained Handwritten Numerals", International Journal of Pattern Recognition and Artificial Intelligence, Vol. 7, No. 4, pages 757-773 (1993) (hereinafter referred to as "the Gupta article"). In the system discussed in the Gupta article, once an image is scanned, typically to implement machine recognition of a handwritten zip code, a resulting digital binary bit-map of a source document, such as an envelope, is passed through a preprocessing stage which performs segmentation, thinning and rethickening (the latter two functions to impart uniform thickness to otherwise differing stroke thicknesses among different characters) as well as character size normalization and slant correction. Character recognition is then performed on a resulting preprocessed bit-map. By reducing differences among resulting characters, the complexity of the recognition stage, particularly the neural network used therein, would be considerably reduced. However, the pre-recognition segmentation approach has exhibited, on an empirical basis, quite some difficulty in accurately separating touching characters.
Consequently, as a result of this difficulty among other reasons, the art is turning to a combined (concurrent) segmentation-recognition approach. This latter approach typically involves moving a window, e.g. a so-called "sliding" window, of a certain width across a field and fabricating confidence measures for competing character classifications to determine if the window is positioned directly on top of a character and to try to recognize a character within the window. A combined segmentation/recognition approach is described, for example, in Martin et al, "Learning to See Where and What: Training a Net to Make Saccades and Recognize Handwritten Characters" (1993), appearing in S. J. Hanson et al (eds.), Advances in Neural Information Processing Systems, Volume 5, pages 441-447 (Morgan Kaufmann Publishers, San Mateo, Calif.) (hereinafter referred to as "the Martin article"). In the system discussed in the Martin article (which, for convenience, will henceforth be referred to herein as the "Saccade" system) a four-layer neural network (i.e. with two hidden layers) is trained, using back propagation, not only to locate and recognize characters, by class, in the center of a window (as well as whether a character exists in the window or not) but also to make corrective jumps, i.e. so-called "saccades", to the nearest character, and after its recognition, to the next character and so forth. Unfortunately, this system tends to miss (jump over) relatively narrow characters and occasionally duplicates relatively wide characters, thereby reducing overall recognition accuracy.
Another combined segmentation/recognition approach is described in Bengio et al, "Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks and Hidden Markov Models", Proceedings of 1993 Conference on Neural Information Processing Systems--Natural and Synthetic, Nov. 29-Dec. 2, 1993, Denver, Colo., pages 937-944 (hereinafter referred to as "the Bengio article"). The approach discussed in the Bengio article uses a multi-layer convolutional neural network with multiple, spatially replicated, sliding windows displaced by a shift of one or several pixels with respect to each other along the scanning direction. The outputs of corresponding neural classifiers serve as input to a post-processing module, specifically a hidden Markov model, to decide which one of the windows is centrally located over a character. This approach provides a neural output indicating whether the character is centered within a window or not. Unfortunately, this particular approach of replicating a neural classifier, when viewed with the need for post-processing, tends to be quite expensive computationally and relatively slow, and thus impractical.
Therefore, a general and still unsatisfied need exists in the art for an OCR system that can accurately and efficiently recognize handwritten characters that include touching and/or otherwise overlapping characters. Moreover, the system should be able to normalize characters within an input field. Furthermore, the system should also be able to operate in more generalized applications such that it can find and classify objects within an image.
In furtherance of meeting this general need, we believe that a relatively simple and fast, yet accurate apparatus (and an accompanying method), particularly suited for inclusion within an object location and classification (or recognition) system (e.g., an OCR system) should properly locate each object (or character) that is to be recognized from within an image field. To handle a wide variety of different objects, the apparatus will preferably utilize a neural network. By using such an apparatus (or method) in conjunction with appropriate object classification (or recognition) means, by centrally positioning a recognition window over the object and by determining a span of the object, fewer different patterns within the window would need to be classified, thereby simplifying and/or increasing the accuracy of the classification (or recognition) task. Hence, the resulting system would likely recognize objects (e.g., handwritten characters) more accurately and efficiently than has heretofore occurred with OCR systems known in the art.