There are many instances where it would be useful or desirable to provide a computer readable form of a document not available in a compatible computer readable form. Normally the document is not available in machine readable form because it was handwritten or typewritten and thus no computer readable form exists, or because the computer readable form is not available. In some instances there is a "foreign" document, i.e., a computer readable form exists but was produced on an incompatible computer system. In some instances, such as facsimile transmission, a simple optical scan of the document can produce the required form. In most instances the form most useful for later use and decision making is a separate indication of each character of the document.
The field of optical character recognition deals with the problem of separating and indicating printed or written characters. In optical character recognition, the document is scanned in some fashion to produce an electrical image of the marks of the document. This image of the marks is analyzed by computer to produce an indication of each character of the document. It is within the current state of the art to produce a reliable indication of many typewritten and printed documents. The best systems of the prior art are capable of properly distinguishing a number of differing type fonts.
On the other hand, unconstrained handwritten characters have not been successfully located and recognized by present optical systems. The problem of properly reading unconstrained handwritten characters is difficult because of the great variability of the characters. One person may not write the same character exactly the same way every time. The variability between different persons writing the same character is even greater than the variability of a single person. In addition to the variability of the characters themselves, handwritten text is often not cleanly executed. Thus characters may overlap horizontally. Loops and descenders may overlap vertically. Two characters may be connected together, or strokes of one character may be disconnected from other strokes of the same character. Further, the individual written lines may be on a slant or have an irregular profile. The different parts of the handwriting may also differ in size. Thus recognition of handwritten characters is a difficult task.
An example of a field where recognition of handwritten characters would be very valuable is mail sorting. Each piece of mail must be classified by destination address. Currently, a large volume of typewritten and printed mail is read and sorted using prior art optical character recognition techniques. Approximately 15% of current U.S. mail remains hand addressed. Present technology uses automated conveyor systems to present these pieces of mail, one at a time, to an operator who views the address and enters a code for the destination. This is the most labor intensive, slowest, and consequently most expensive part of the entire mail sorting operation.
Furthermore, it is expensive to misidentify a ZIP code and send the piece of mail to the wrong post office. Once the mail is forwarded to the receiving post office, the receiving post office recognizes that there is no matching address or addressee in that ZIP code. The mail must then be re-sorted and redirected to the proper post office. Because of the high expense associated with misdirected mail, it is more desirable to have an automated system reject a piece of mail if the system cannot determine the ZIP code with an extremely high degree of accuracy. The rejected pieces of mail can then be hand sorted at the sending station, or other measures can be taken to eliminate or reduce the cost of the misdelivery.
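The reject-rather-than-misdirect policy described above can be illustrated with a minimal sketch. It assumes a hypothetical recognizer that returns ten confidence scores (one per numeral, each between 0 and 1) for every digit position; the function name `read_zip_or_reject` and the threshold value are illustrative assumptions and are not taken from any system described here.

```python
from typing import Optional, Sequence


def read_zip_or_reject(
    digit_scores: Sequence[Sequence[float]],
    min_confidence: float = 0.99,  # assumed threshold, chosen only for illustration
) -> Optional[str]:
    """Return the ZIP code string, or None if any digit is too uncertain.

    digit_scores holds, for each digit position, ten hypothetical
    confidence scores indexed by numeral (0 through 9).
    """
    digits = []
    for scores in digit_scores:
        # Pick the numeral with the highest score for this position.
        best = max(range(10), key=lambda d: scores[d])
        if scores[best] < min_confidence:
            # A single doubtful digit rejects the whole piece of mail,
            # so that it can be hand sorted instead of misdirected.
            return None
        digits.append(str(best))
    return "".join(digits)
```

Raising `min_confidence` trades a higher reject rate (more hand sorting) for fewer misdirected pieces; the appropriate balance depends on the relative cost of the two outcomes.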
Once the ZIP code numerals are located, various systems have been devised to recognize handwritten numerals. However, many of these systems are overly complicated to compensate for numerals that deviate from certain normal model numerals. Certain deviations in handwritten numerals commonly occur. For example, the numeral 5 commonly has its top stroke separated from the remaining strokes. This problem occurs so often that the term "hatted 5" has been coined. Another common deviation relates to the numeral 8, in which the top loop is not completely closed but a gap is left between the two end points. This deviation has been termed an "open loop 8." Furthermore, 4's are commonly written in two ways: one with a closed triangular upper section and the other with an open U-shaped upper section. Previous recognition systems have used a plurality of models to compensate for these deviations from the norm.
Recognition of handwritten numerals is further complicated when the numerals are in groups. For example, when a hatted 5 precedes a 1, the top stroke of the hatted 5 can often make the subsequent 1 look like a 7.
It was previously thought that repairing these digits would complicate matters further by converting numerals into other numerals, for example curved 3's into 8's and 1's into 7's, so repair before segmentation of the numeral groups was shunned. However, most numeral recognition devices have a splitter or segmenter which must segment a plurality of digits into individual digits so that each individual digit can be recognized. Broken strokes, such as those of a hatted 5 or a broken 4, can cause the segmenter to incorrectly identify individual digits.
Furthermore, not all digital images are of the same quality. Many digital images have additive noise that needs to be removed before recognition of numerals proceeds in order to maintain the high reliability needed in recognizing ZIP code numerals. In addition, a certain percentage of images have poor quality and suffer from pixel dropouts of one form or another. These pixel dropouts can cause background inclusions within the character stroke or break the character stroke into segments with a gap therebetween, thereby lowering the reliability of the ZIP code recognizer if not repaired. Image repair of the digital image of an address block containing a ZIP code is therefore needed. The amount and type of repair need to be customized depending on the quality and classification of the digital image. Furthermore, once the ZIP code is located, further repair on the numeral group is desired before segmentation of the numeral group into individual digits.
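The two kinds of defect described above, additive noise and pixel dropouts, can be illustrated on a small binary image. The following is only a minimal sketch under stated assumptions: the image is a list of rows with 1 for ink and 0 for background, noise is taken to be an ink pixel with no ink among its eight neighbors, and a dropout is taken to be a single background pixel with ink directly on both sides horizontally or vertically. The function names are hypothetical, and these simple neighborhood rules are a stand-in for illustration rather than the repair method of any particular system.

```python
from typing import List

Image = List[List[int]]  # 1 = ink, 0 = background (assumed encoding)


def remove_isolated_pixels(img: Image) -> Image:
    """Clear ink pixels that have no ink among their eight neighbors."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            neighbors = sum(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if neighbors == 0:
                out[y][x] = 0  # treat as additive noise
    return out


def fill_single_pixel_gaps(img: Image) -> Image:
    """Set background pixels that sit between two ink pixels in a row or column."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 0:
                continue
            horizontal = 0 < x < w - 1 and img[y][x - 1] == 1 and img[y][x + 1] == 1
            vertical = 0 < y < h - 1 and img[y - 1][x] == 1 and img[y + 1][x] == 1
            if horizontal or vertical:
                out[y][x] = 1  # treat as a one-pixel dropout in a stroke
    return out


def repair(img: Image) -> Image:
    """Remove noise first so that stray specks are not bridged into strokes."""
    return fill_single_pixel_gaps(remove_isolated_pixels(img))
```

In this sketch noise removal runs before gap filling, since filling first could connect a stray speck to a nearby stroke. Wider gaps, such as the separated top stroke of a hatted 5, would need a rule that looks beyond a single pixel.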
What is needed is a method and apparatus which sequentially repairs an image and a group of characters in the image by eliminating image noise and connecting broken strokes, thereby assisting the segmenter in correctly segmenting individual numerals before the recognizer acts on the individual numerals.