Through the use of word processors and/or other data processing and computerized office equipment, the number of paper documents, particularly forms, of one kind or another that are currently in use has simply exploded over the past few decades. At some point, the information on most of these documents must be extracted therefrom and processed in some fashion.
For example, one document that is in wide use today is a paper bank check. A payor typically fills in, either by hand or through machine, a dollar amount on an appropriate line of the check and presents the check to its recipient. The recipient deposits the check in its bank. In order for this bank to process the check for payment, a human operator employed by the bank reads the amount on the check and instructs a printer to place appropriate digits on the bottom of the check. These digits and similar electronic routing codes situated on the bottom of the check are subsequently machine read to initiate an electronic funds transfer through a banking clearinghouse from the payor's account at its bank (i.e., the paying bank) to the recipient's account at its bank (the presenting bank) and to physically route the check back through the clearinghouse from the presenting bank to the paying bank for cancellation. Inasmuch as the number of checks has increased substantially over the past few years and continues to do so, the cost to banks of processing paper checks has been steadily increasing. In an effort to arrest these cost increases or at least temper their rise, banks continually attempt to bring increasing levels of machine automation to the task of processing checks. Specifically, various individuals in banking believe that if the check encoding process were automated by replacing human operators with appropriate optical character recognition (OCR) systems, then the throughput of encoded checks and encoding accuracy would both substantially increase while significant concomitant cost savings would occur. As envisioned, such systems would scan the writing or printing that appears on each check, accurately translate a scanned dollar amount into digital signals, such as appropriate ASCII words, and, inter alia, operate a printer to print appropriate numeric characters onto the bottom of each check in order to encode it.
With the ever expanding number of paper documents in use in present day society, of which paper checks represent only one illustrative example, the human resources needed to read these documents and convert their contents into machine readable form or directly into computer data are simply becoming either unavailable or too costly to use. As such, a substantial need exists, across many fields, to develop and use OCR systems to accurately automate the process of recognizing and translating first machine printed alphanumeric characters and ultimately handwritten characters into appropriate digital data.
For various reasons not relevant here, starting several years ago and continuing to the present, neural networks have been seen in the art as a preferred technique for providing accurate character recognition in an OCR system.
In contrast to traditional sequential "Von Neumann" digital processors that operate with mathematical precision, neural networks are generally analog, though digital implementations are increasingly common, and typically manifest massively parallel processing. These networks provide fast and often surprisingly good output approximations, but not precise results, by making weighted decisions on the basis of fuzzy, incomplete and/or frequently contradictory input data.
By way of background, a neural network is basically a configuration of identical processing elements, so-called neurons, that are arranged in a multi-layered hierarchical configuration. Each neuron can have one or more inputs, but only one output. Each input is weighted by a coefficient. The output of a neuron is typically calculated as a function of the sum of its weighted inputs and a bias value. This function, the so-called activation function, is typically a "sigmoid" function; i.e., it is S-shaped, monotonically increasing and asymptotically approaches fixed values, typically +1, and zero or -1, as its input respectively approaches positive or negative infinity. The sigmoid function and the individual neural weight and bias values determine the response or "excitability" of the neuron to signals presented to its inputs. The output of a neuron in one layer is distributed as input to neurons in a higher layer. A typical neural network contains at least three distinct layers: an input layer situated at the bottom of the network, an output layer situated at the top of the network and one or more hierarchical interconnected hidden layers located intermediate between the input and output layers. For example, if a neural network were to be used for recognizing normalized alphanumeric characters situated within a 7×5 pixel array, then the output of a sensor for each pixel in that array, such as a cell of an appropriate charge coupled device (CCD), is routed as input to a different neuron in the input layer. Thirty-five different neurons, one for each different pixel, would exist in this layer. Each neuron in this layer has only one input. The outputs of all 35 neurons in the input layer are distributed, in turn, as input to every neuron in, e.g., a single intermediate or so-called hidden layer. The output of each of the neurons in the hidden layer is distributed as an input to every neuron in the output layer.
The number of neurons in the output layer typically equals the number of different characters that the network is to recognize, i.e., classify, with the output of each such neuron corresponding to a different one of these characters. The numerical outputs from all the output layer neurons form the output of the network. For example, one output neuron may be associated with the letter "A", another with the letter "B", a third with the letter "a", a fourth with the letter "b" and so on for each different alphanumeric character, including letters, numbers, punctuation marks and/or other desired symbols, if any, that is to be recognized by the network. The number of neurons in this single hidden layer, as well as the number of separate hidden layers that is used in the network, depends, inter alia, upon the complexity of the character bit-maps to be presented to the network for recognition; the desired information capacity of the network; the degree to which the network, once trained, is able to handle unfamiliar patterns; and the number of iterations that the network must undergo during training in order for all the network weight and bias values to properly converge. If the network were to utilize several separate hidden layers, then the output from each neuron in the first (i.e. lowest) hidden layer would feed the inputs to the neurons in the second (i.e. next higher) hidden layer and so forth for the remaining hidden layers. The output of the neurons in the last (i.e. highest) hidden layer would feed the neural inputs in the output layer. The output of the network typically feeds a processor or other circuitry that converts the network output into appropriate multi-bit digital data, e.g., ASCII characters, for subsequent processing.
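By way of illustration only, the layered arrangement described above can be sketched in a few lines of Python. The sketch assumes a logistic sigmoid (approaching zero and +1), a single hidden layer of 12 neurons, 10 output classes and random weights bounded by ±0.5; none of these values is taken from the art discussed here, and the names `sigmoid`, `layer_forward` and `make_layer` are hypothetical.

```python
import math
import random

def sigmoid(x):
    """S-shaped activation; approaches 0 and +1 at the extremes."""
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(inputs, weights, biases):
    """Each neuron outputs sigmoid(sum of its weighted inputs + its bias)."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_w, inputs)) + b)
        for neuron_w, b in zip(weights, biases)
    ]

def make_layer(n_in, n_out, rng):
    """Random weights and biases within assumed bounds of [-0.5, 0.5]."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return weights, biases

rng = random.Random(0)
# 35 inputs for the 7x5 pixel array; hidden size and class count are assumptions.
N_PIXELS, N_HIDDEN, N_CLASSES = 35, 12, 10
hidden_w, hidden_b = make_layer(N_PIXELS, N_HIDDEN, rng)
output_w, output_b = make_layer(N_HIDDEN, N_CLASSES, rng)

# Stand-in bit-map: one value per pixel, fed to the input layer.
pixels = [float(rng.random() > 0.5) for _ in range(N_PIXELS)]
hidden = layer_forward(pixels, hidden_w, hidden_b)    # input layer -> hidden layer
outputs = layer_forward(hidden, output_w, output_b)   # hidden layer -> output layer
```

Here `outputs` holds one activation per character class; with untrained random weights these values carry no meaning until training has taken place.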
The use of a neural network generally involves two distinct successive procedures: initialization and training on known pre-defined patterns having known outputs, followed by recognition of actual unknown patterns.
First, to initialize the network, the weights and biases of all the neurons situated therein are set to random values typically within certain fixed bounds. Thereafter, the network is trained. Specifically, the network is successively presented with pre-defined input data patterns, i.e., so-called training patterns. The values of the neural weights and biases in the network are simultaneously adjusted such that the output of the network for each individual training pattern approximately matches a desired corresponding network output (target vector) for that pattern. Once training is complete, all the weights and biases are then fixed at their current values. Thereafter, the network can be used to recognize unknown patterns. During pattern recognition, each unknown pattern is applied to the inputs of the network and resulting corresponding network responses are taken from the output nodes. Ideally speaking, once the network recognizes an unknown input pattern to be a given character on which the network was trained, then the signal produced by a neuron in the output layer and associated with that character should sharply increase relative to the signals produced by all the other neurons in the output layer.
One technique commonly used in the art for adjusting the values of the weights and biases of all the neurons during training is back error propagation (hereinafter referred to simply as "back propagation"). Briefly, this technique involves presenting a pre-defined input training pattern (input vector) to the network and allowing that pattern to be propagated forward through the network in order to produce a corresponding output pattern (output vector, O) at the output neurons. The error associated therewith is determined and then back propagated through the network to apportion this error to individual weights in the network. Thereafter, the weights and bias for each neuron are adjusted in a direction and by an amount that minimizes the total network error for this input pattern.
Once all the network weights have been adjusted for one training pattern, the next training pattern is presented to the network and the error determination and weight adjusting process iteratively repeats, and so on for each successive training pattern. Typically, once the total network error for each of these patterns reaches a pre-defined limit, these iterations stop and training halts. At this point, all the network weight and bias values are fixed at their then current values. Thereafter, character recognition on unknown input data can occur at a relatively high speed.
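The training procedure just described can be sketched, under stated assumptions, as follows. The sketch assumes a single hidden layer, a logistic sigmoid, a squared-error measure, a fixed learning rate of 0.5 and two made-up four-pixel training patterns; the function names and all numeric values are hypothetical and serve only to make the steps concrete.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def make_net(n_in, n_hid, n_out, rng):
    """Initialize all weights and biases to random values within fixed bounds."""
    def r():
        return rng.uniform(-0.5, 0.5)
    return {"w1": [[r() for _ in range(n_in)] for _ in range(n_hid)],
            "b1": [r() for _ in range(n_hid)],
            "w2": [[r() for _ in range(n_hid)] for _ in range(n_out)],
            "b2": [r() for _ in range(n_out)]}

def forward(x, net):
    """Propagate an input vector forward; return hidden and output activations."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(ws, x)) + b)
              for ws, b in zip(net["w1"], net["b1"])]
    out = [sigmoid(sum(w * h for w, h in zip(ws, hidden)) + b)
           for ws, b in zip(net["w2"], net["b2"])]
    return hidden, out

def pattern_error(out, target):
    """Squared error between the output vector and the target vector."""
    return 0.5 * sum((t - o) ** 2 for o, t in zip(out, target))

def backprop_step(x, target, net, rate):
    """One forward pass, one backward pass and one weight update for one pattern."""
    hidden, out = forward(x, net)
    # Error term at each output neuron (derivative of error w.r.t. its summed input).
    delta_out = [(o - t) * o * (1.0 - o) for o, t in zip(out, target)]
    # Apportion the error to hidden neurons through the not-yet-updated output weights.
    delta_hid = [h * (1.0 - h) * sum(d * net["w2"][k][j] for k, d in enumerate(delta_out))
                 for j, h in enumerate(hidden)]
    for k, d in enumerate(delta_out):
        for j, h in enumerate(hidden):
            net["w2"][k][j] -= rate * d * h
        net["b2"][k] -= rate * d
    for j, d in enumerate(delta_hid):
        for i, xi in enumerate(x):
            net["w1"][j][i] -= rate * d * xi
        net["b1"][j] -= rate * d
    return pattern_error(out, target)  # error measured before this update

def train(patterns, net, rate=0.5, limit=0.01, max_epochs=5000):
    """Cycle through the patterns until every pattern's error is below the limit."""
    for epoch in range(max_epochs):
        errors = [backprop_step(x, t, net, rate) for x, t in patterns]
        if max(errors) < limit:
            return epoch + 1
    return max_epochs

# Two made-up training patterns with one-hot target vectors.
patterns = [([1.0, 0.0, 1.0, 0.0], [1.0, 0.0]),
            ([0.0, 1.0, 0.0, 1.0], [0.0, 1.0])]
net = make_net(4, 3, 2, random.Random(1))
initial_errors = [pattern_error(forward(x, net)[1], t) for x, t in patterns]
epochs_used = train(patterns, net)
final_errors = [pattern_error(forward(x, net)[1], t) for x, t in patterns]
```

After `train` returns, the weights and biases in `net` are simply left untouched, which corresponds to fixing them at their then current values before recognition begins.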
During character recognition, a "winner take all" approach is generally used to identify the specific character that has been recognized by the network. Under this approach, once the network has fully reacted to an input data pattern, then the one output neuron that generates the highest output value relative to those produced by the other output neurons is selected, typically by a processing circuit connected to the network, as the network output. Having made this selection, the processor then determines, such as through a simple table look-up operation, the multi-bit digital representation of the specific character identified by the network.
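A minimal sketch of this selection step, assuming a ten-digit character set and made-up activation values, might read as follows. The label table and the function name `winner_take_all` are hypothetical.

```python
CLASS_LABELS = "0123456789"  # assumed character set, one label per output neuron

def winner_take_all(activations, labels=CLASS_LABELS):
    """Pick the output neuron with the highest activation and look up its character."""
    winner = max(range(len(activations)), key=lambda k: activations[k])
    char = labels[winner]
    return char, ord(char)  # ord() yields the character's ASCII code

# Made-up output activations, one per output neuron.
activations = [0.02, 0.05, 0.01, 0.91, 0.03, 0.08, 0.02, 0.12, 0.04, 0.01]
char, code = winner_take_all(activations)  # char == "3", code == 51
```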
Neural network based OCR systems have exhibited excellent performance characteristics with machine printed text, particularly "clean" text that exhibits a high degree of uniformity, in terms of line thickness and orientation, from one character to the next. Unfortunately, the task of recognizing characters, even through the use of a neural network, is complicated by the existence of touching or otherwise overlapping characters. While a very small number of machine printed characters actually touch, due to kerning and the like, touching and overlapping characters are particularly prevalent with handwritten text and numerals, effectively exhibiting, due to human variability, an infinite number of variations. Clearly, for the sake of efficiency and implementation simplicity, a neural network cannot be trained to recognize even a major portion, let alone all, of these variations.
As noted above, a neural network classifier typically employs as many different output nodes as there are differing characters to recognize, with one node allocated to each different character. Once such a network is trained to recognize a given character, whenever a bit-mapped pattern for that character is applied, as input, to the network, the network will produce a relatively high level output activation at the output node associated with that character and relatively low output activations at all other output nodes. Accordingly, the "best guess" produced by the network is the character associated with the node that presents the highest output activation. Since neural networks produce so-called "soft" decisions, i.e., decisions having some degree of inherent uncertainty, to assure that only reliable decisions are used, a confidence measure is generally determined for each such decision. Those decisions that have relatively high confidence measures, typically equal to or in excess of a pre-defined threshold level, are accepted and subsequently used; while those with correspondingly low confidence measures, i.e., less than the threshold level, are rejected and discarded. Generally, the desired level of reliability determines the numeric value of the threshold level.
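The relationship between the threshold level and reliability can be illustrated with a short sketch. The confidence values and correctness flags below are invented purely for illustration, and the function name `apply_threshold` is hypothetical; the sketch only shows how raising the threshold trades a higher rejection rate for fewer accepted errors.

```python
def apply_threshold(decisions, threshold):
    """Accept decisions whose confidence is at or above the threshold.

    decisions: list of (confidence, is_correct) pairs.
    Returns (fraction rejected, error rate among accepted decisions).
    """
    accepted = [ok for conf, ok in decisions if conf >= threshold]
    reject_rate = 1.0 - len(accepted) / len(decisions)
    error_rate = accepted.count(False) / len(accepted) if accepted else 0.0
    return reject_rate, error_rate

# Invented (confidence, is_correct) pairs for eight neural decisions.
decisions = [(0.95, True), (0.90, True), (0.80, True), (0.70, False),
             (0.65, True), (0.55, False), (0.40, False), (0.30, True)]

loose = apply_threshold(decisions, 0.50)   # rejects little, accepts some errors
strict = apply_threshold(decisions, 0.75)  # rejects more, accepts no errors here
```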
In an effort to greatly simplify the task of recognizing human handwriting, the art teaches the use of determining those characters which touch and then segmenting or otherwise partitioning these characters apart and recognizing each character that results. In this regard, the art teaches two basic approaches: performing segmentation prior to character recognition, or simultaneously performing both segmentation and recognition.
One example of the former pre-recognition segmentation approach is the system disclosed in U.S. Pat. No. 5,299,269 (issued to R. S. Gaborski et al on Mar. 29, 1994 and assigned to the present assignee hereof, also referred to herein as the '269 patent). Here, a sliding window is stepped across an image field, on a pixel-by-pixel basis, to capture a sub-image, i.e., a kernel of the image. An associative memory or neural network is trained, through one training set, to recognize all non-character images that can exist within the sub-image, i.e., all possible intersections and combinations of known characters that correspond to window positions that straddle adjacent characters. This training includes window-captured sub-images that in the past were incorrectly perceived as being centered on a character when in fact they were not, i.e., the result of false character segmentation. The same memory or network, or a second one, is trained, through a second training set, to recognize the individual "non-straddling" characters of a given character set. If one item of the training sets is recognized, then the entire sub-image is forwarded to a downstream portion of an OCR system for further character recognition. For this particular system to properly function, the appropriate memory or network must be trained on all possible non-character images. Unfortunately, a very large, potentially infinite, number of such images can exist. Hence, if a training sequence is to encompass a substantial number of non-character images, even if considerably fewer than all such images, then the accuracy with which any one of these images will be recognized will be reduced. Furthermore, as the number of different characters which a network (or memory) must recognize increases, the size and complexity of that network (or memory) increases at a considerably greater rate. Hence, the system disclosed in the '269 patent is rather impractical.
Another example of the pre-recognition segmentation approach is described in A. Gupta et al, "An Integrated Architecture for Recognition of Totally Unconstrained Handwritten Numerals", International Journal of Pattern Recognition and Artificial Intelligence, 1993, Vol. 7, No. 4, pages 757-773. Here, once an image is scanned, typically to implement machine recognition of a handwritten zip code, a resulting digital binary bit-map of a source document, such as an envelope, is passed through a preprocessing stage which performs segmentation, thinning and rethickening (the latter two functions to impart uniform thickness to otherwise differing stroke thicknesses among different characters) as well as character size normalization and slant correction. Character recognition is then performed on a resulting preprocessed bit-map. By reducing differences among resulting characters, the complexity of the recognition stage, particularly the neural network used therein, would be considerably reduced. However, the pre-recognition segmentation approach has exhibited, on an empirical basis, quite some difficulty in accurately separating touching characters.
Consequently, as a result of this difficulty among other reasons, the art is turning to a combined segmentation-recognition approach. This latter approach typically involves moving a window, e.g., a so-called "sliding" window, of a certain width across a field and fabricating confidence measures for competing character classifications to determine if the window is positioned directly on top of a character as well as to undertake recognition of that character. A combined segmentation/recognition approach is described, for example, in Martin et al, "Learning to See Where and What: Training a Net to Make Saccades and Recognize Handwritten Characters" (1993) [hereinafter referred to as the "Martin et al" publication], which appears in S. J. Hanson et al (eds.), Advances in Neural Information Processing Systems, Volume 5, pages 441-447 (Morgan Kaufmann Publishers, San Mateo, Calif.). Here, a system (which, for convenience, will henceforth be referred to herein as the "Saccade" system) is described in which a four-layer neural network (i.e., with two hidden layers) is trained, using back propagation, not only to locate and recognize characters, by class, in the center of a window (as well as whether a character exists in the window or not) but also to make corrective jumps, i.e., so-called "saccades", to the nearest character, and after its recognition, to the next character and so forth. Unfortunately, this system tends to miss relatively narrow characters and occasionally duplicates relatively wide characters, thereby reducing overall recognition accuracy. Another combined segmentation/recognition approach is described in Bengio et al, "Globally Trained Handwritten Word Recognizer using Spatial Representation, Convolutional Neural Networks and Hidden Markov Models", Proceedings of 1993 Conference on Neural Information Processing Systems--Natural and Synthetic, Nov. 29-Dec. 2, 1993, Denver, Colo., pages 937-944.
This approach relies on using a multi-layer convolutional neural network with multiple, spatially replicated, sliding windows displaced by a shift of one or several pixels with respect to each other along the scanning direction. The outputs of corresponding neural classifiers serve as input to a post-processing module, specifically a hidden Markov model, to decide which one of the windows is centrally located over a character. This approach provides a neural output indicating whether the character is centered within a window or not. Unfortunately, this particular approach of replicating a neural classifier, when viewed together with the need for post-processing, tends to be quite expensive computationally, relatively slow and thus impractical.
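The window stepping that is common to the sliding-window approaches discussed above can be sketched as follows. This is only a generic illustration of capturing sub-images from a bit-mapped field; it does not reproduce any of the cited systems, and the function name, the window width and the tiny made-up bit-map are all assumptions.

```python
def sliding_windows(bitmap, width, step=1):
    """Yield (left_column, sub_image) for each window position across the field.

    bitmap: list of rows, each row a list of pixel values of equal length.
    """
    n_cols = len(bitmap[0])
    for left in range(0, n_cols - width + 1, step):
        yield left, [row[left:left + width] for row in bitmap]

# Made-up 3-row by 8-column binary field.
bitmap = [
    [0, 1, 1, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 0, 1, 0],
]

# Step a 5-pixel-wide window across the field one pixel at a time.
windows = list(sliding_windows(bitmap, width=5))
```

Each captured sub-image would then be presented to a classifier, with the resulting confidence measures used to judge whether the window is centered on a character.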
Now, apart from the problems associated with conventional approaches to segmentation, neural based classification, as conventionally taught, is also problematic.
In particular, confidence of a neural decision should increase with either a measured increase in the value of the highest output activation for that decision or a measured numeric increase in a gap, in output activation values, between the highest output activation, i.e., for that decision, and a second highest output activation.
The art teaches several different schemes to determine a confidence measure for neural output activations. Typically, these schemes involve using either one or some combination of both conventional measurements to assess confidence. For example, in R. Battiti et al, "Democracy in Neural Nets: Voting Schemes for Classification", Neural Networks, Vol. 7, No. 4, 1994, pages 691-707, a neural network classifier is described that relies on accepting those decisions which have a highest output activation value (referred to hereinafter as simply the "HA" scheme) that exceeds a pre-defined fixed threshold, thres_max, and a difference in activation (referred to hereinafter as simply the "DA" scheme) between the two highest output activations that exceeds a second pre-defined threshold, thres_diff. Similarly, another neural classifier, as described in the Martin et al publication, just relies on using thresholded activation differences (i.e., the DA scheme) to reject neural decisions. Use of the DA scheme in a neural classifier is also evident in J. Bromley et al, "Improving Rejection Performance on Handwritten Digits by Training with 'Rubbish'", Neural Computation, Vol. 5, 1993, pages 367-370. Alternatively, U.S. Pat. No. 5,052,043 (issued Sep. 24, 1991 to R. S. Gaborski and also assigned to the present assignee hereof) teaches a neural classifier in which a confidence measurement, for a neural decision, is formed as a ratio of the highest and second highest output activation values associated with that decision (referred to hereinafter as simply the "RA" scheme). Another scheme, as proposed in F. F. Soulie et al, "Multi-Modular Neural Network Architectures: Applications in Optical Character and Human Face Recognition", International Journal of Pattern Recognition and Artificial Intelligence, Vol. 7, No. 4, 1993, pages 721-755, specifically pages 728-729 thereof, teaches a neural classifier that utilizes a distance based rejection criterion (referred to as the "DI" scheme). Here, in addition to employing the HA and DA schemes, Euclidean distances between an activation vector and target activation vectors of all potential output classes are compared to a fixed threshold. If the smallest of these distances is less than the threshold, then the proposed classification is rejected.
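For concreteness, the four measures named above can be written out as a brief sketch. The function names, the one-hot target vectors, the threshold values and the example activations are assumptions made for illustration and are not drawn from the cited references; the DI function only computes the smallest distance and leaves the threshold comparison to the caller.

```python
import math

def top_two(activations):
    """Return the highest and second highest output activations."""
    ordered = sorted(activations, reverse=True)
    return ordered[0], ordered[1]

def ha(activations):
    """HA: value of the highest output activation."""
    return max(activations)

def da(activations):
    """DA: difference between the two highest output activations."""
    highest, second = top_two(activations)
    return highest - second

def ra(activations):
    """RA: ratio of the highest to the second highest output activation."""
    highest, second = top_two(activations)
    return highest / second if second > 0 else math.inf

def di(activations, targets):
    """DI: smallest Euclidean distance from the activation vector to any target vector."""
    return min(
        math.sqrt(sum((a - t) ** 2 for a, t in zip(activations, target)))
        for target in targets
    )

def accept_ha_da(activations, thres_max, thres_diff):
    """Accept only if both the HA and DA measures exceed their thresholds."""
    return ha(activations) > thres_max and da(activations) > thres_diff

# Made-up activations for a three-class network with assumed one-hot targets.
activations = [0.1, 0.8, 0.2]
targets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
measures = (ha(activations), da(activations), ra(activations), di(activations, targets))
accepted = accept_ha_da(activations, thres_max=0.7, thres_diff=0.5)  # assumed thresholds
```

All four measures depend only on the highest activation, the second highest activation, or the distance to the nearest target vector.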
Ambiguities usually exist between a correct output classification, i.e., the "best guess", and one other prominent alternative, i.e., a "next best guess". One would expect that the two highest output activations corresponding to these two alternatives should provide most of the necessary information upon which to assess confidence and hence base rejection of a neural decision. In this case, the HA, DA, RA or DI rejection schemes, can result in an acceptable choice between two competing output classifications. Still, various situations exist when a choice must be made among three or more classifications, including for digit recognition, where these simple rejection schemes prove to be inadequate.
Therefore, a general and still unsatisfied need exists in the art for an OCR system that is capable of accurately and efficiently recognizing handwritten characters.
In furtherance of meeting this general need, a relatively simple and fast, yet accurate apparatus (and an accompanying method), particularly suited for inclusion within an OCR system is required to properly locate each character (or other object), that is to be recognized, from within a field of such characters as well as, through an appropriate neural network, to reliably classify each such character. As to the latter, classification apparatus and an accompanying method are also required to provide character rejection decisions that are more reliable than those produced through any rejection scheme heretofore taught in the art. Hence, such a resulting OCR system would likely recognize handwritten characters more accurately and efficiently than has previously occurred with OCR systems known in the art.