Most commercial Optical Character Recognition (OCR) systems cannot successfully accommodate dot-matrix or ink-jet printed matter because of segmentation and recognition errors. Segmentation is difficult because dot-matrix and ink-jet printed characters are formed from unconnected dots. Typical prior art segmentation algorithms look for white spaces between characters. In dot-matrix and ink-jet printed characters, such algorithms cannot distinguish the gaps between the dots making up a character from the spaces between the characters. Recognition errors, on the other hand, can be attributed to both poor segmentation and atypical character structures. For example, an "A" produced by a non-dot-matrix printer (e.g., a laser or daisy wheel printer) and an "A" from a dot-matrix printer look quite different. Thus, a separate "classifier" is needed to accommodate the dot-matrix/ink-jet printed matter.
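The segmentation problem can be illustrated with a minimal sketch. It assumes a binary image stored as rows of 0/1 values (1 being ink) and a simple blank-column test as the white-space criterion; the function name `segment_by_blank_columns` and the tiny test images are hypothetical and are not taken from any particular prior art system.

```python
def segment_by_blank_columns(image):
    """Split a binary image (rows of 0/1, 1 = ink) at blank columns.

    Returns a list of (start, end) column ranges, end exclusive.
    This is an illustrative white-space segmenter, not a specific
    prior art algorithm.
    """
    if not image:
        return []
    width = len(image[0])
    # A column counts as white space when no row has ink in it.
    has_ink = [any(row[x] for row in image) for x in range(width)]
    segments, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a segment begins
        elif not ink and start is not None:
            segments.append((start, x))    # a blank column ends it
            start = None
    if start is not None:
        segments.append((start, width))
    return segments


# One character drawn with connected strokes.
solid = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
# The same shape drawn with separated dots (hypothetical dot-matrix output).
dotted = [
    [1, 0, 1, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 1, 0, 1],
]

print(segment_by_blank_columns(solid))   # [(0, 3)]
print(segment_by_blank_columns(dotted))  # [(0, 1), (2, 3), (4, 5)]
```

In this sketch the connected character yields a single segment, whereas the dotted character is split into three, because the gaps between dot columns look the same to the segmenter as inter-character spaces.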
In applications where many forms or documents are to be processed, OCR offers higher read rates and throughput than manual data entry. Unfortunately, OCR devices are only capable of processing a predetermined set of text characters. With the forms processing industry accepting forms from many different sources, OCR devices must be capable of handling a variety of documents printed by many different printer devices. Currently, OCR processing sites are forced to sort their documents into OCR-readable and non-OCR-readable text documents. Non-OCR-readable text documents include forms with, for example, handwritten text and text printed with a dot-matrix printer as well as ink-jet and bubble-jet printers. With the non-OCR-readable text documents, manual data entry is required because OCR systems read this type of form with very poor accuracy.
There are prior art techniques that examine binary image data and correct for discontinuities in the characters of the image. One such technique is disclosed in U.S. Pat. No. 4,791,679 (L. Barski et al.), issued on Dec. 13, 1988, which describes a method for improving broken or weak characters of the binary image. More particularly, a character stroke is strengthened by processing the binary image data with a predetermined m-by-m kernel and moving the kernel one pixel at a time around the image. In each pixel position, the kernel, which is divided into m square sections, is selectively filled with black pixels in proportion to the number of black pixels in each section in accordance with a special set of rules.
U.S. Pat. No. 4,953,114 (H. Sato), issued on Aug. 28, 1990, discloses an image signal processing apparatus. The apparatus comprises a line memory for storing lines of an image signal, an image content discriminator, a smoothing circuit, an edge emphasis circuit, and switching means. The image content discriminator comprises an amplitude detection circuit and a comparator connected in series. The amplitude detection circuit detects an amplitude of the image signal in a vicinity of a frequency at which a visual obstacle will be generated. The output signal from the amplitude detection circuit is compared with a predetermined threshold to assign each pixel to either a dot image area or a half-tone image area. The smoothing circuit and the edge emphasis circuit are arranged in parallel and each receive the image signal from the line memory. The output from the comparator selects a position of the switching means to provide an output signal from either the smoothing circuit or the edge emphasis circuit depending on the result of the comparison.
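The parallel-path arrangement described above can be sketched in software for a single image line. This is only an illustration under stated assumptions: the local peak-to-peak range stands in for the amplitude detection circuit, a neighborhood mean stands in for the smoothing circuit, an unsharp-style boost stands in for the edge emphasis circuit, and the choice to smooth when the amplitude exceeds the threshold is assumed rather than taken from the cited patent. The function name and the `threshold`, `window` and `gain` parameters are hypothetical.

```python
def process_line(signal, threshold, window=1, gain=1.0):
    """Select, per pixel, between a smoothed and an edge-emphasized value.

    Assumptions (not from the cited patent): local max-min range is used
    as the detected amplitude, and high amplitude selects the smoothed path.
    """
    out = []
    n = len(signal)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        neighborhood = signal[lo:hi]
        mean = sum(neighborhood) / len(neighborhood)

        # Stand-in for the amplitude detection circuit.
        amplitude = max(neighborhood) - min(neighborhood)

        # Both paths are computed in parallel from the same input.
        smoothed = mean
        emphasized = signal[i] + gain * (signal[i] - mean)

        # Stand-in for the comparator driving the switching means.
        out.append(smoothed if amplitude > threshold else emphasized)
    return out


flat = process_line([5, 5, 5, 5], threshold=1)
busy = process_line([0, 10, 0, 10, 0], threshold=5)
print(flat)
print(busy)
```

With these inputs the flat line passes through the edge emphasis path unchanged, while the rapidly alternating line is routed to the smoothing path and its swings are reduced.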
U.S. Pat. No. 5,048,097 (R. Gaborski et al.), issued on Sep. 10, 1991, discloses an optical character recognition (OCR) neural network system for machine-printed characters. More particularly, character images sent to a neural network, which is trained to recognize a predetermined set of symbols, are first processed by an OCR pre-processor which normalizes the character images. The output of the neural network is processed by an OCR post-processor. The post-processor corrects erroneous symbol identifications made by the neural network. For characters identified by the neural network with low scores, the post-processor attempts to find and separate adjacent characters which are kerned, and characters which are touching. The touching characters are separated in one of nine successively initiated processes depending on the geometric parameters of the image. When all else fails, the post-processor selects either the second or third highest scoring symbol identified by the neural network based upon the likelihood of the second or third highest scoring symbol being confused with the highest scoring symbol.
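The final fallback step of such a post-processor can be sketched as follows. The sketch covers only the last-resort selection and omits the character separation processes; the function name `fallback_symbol`, the `low_score` cutoff, and the layout of the `confusion` table (keyed by pairs of symbols) are assumptions made for illustration and are not taken from the cited patent.

```python
def fallback_symbol(scores, confusion, low_score=0.5):
    """Pick a symbol from classifier scores, with a confusion-based fallback.

    scores:    dict mapping symbol -> classifier score (hypothetical layout)
    confusion: dict mapping (top_symbol, other_symbol) -> likelihood that
               other_symbol is confused with top_symbol (hypothetical layout)
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = ranked[0]
    # Keep the top symbol when its score is not low, or when there are
    # not enough alternatives to choose between.
    if scores[top] >= low_score or len(ranked) < 3:
        return top
    second, third = ranked[1], ranked[2]
    # Choose the runner-up that is more likely to be confused with the top.
    if confusion.get((top, third), 0.0) > confusion.get((top, second), 0.0):
        return third
    return second


scores = {"O": 0.40, "0": 0.35, "Q": 0.30}
confusion = {("O", "0"): 0.6, ("O", "Q"): 0.2}
print(fallback_symbol(scores, confusion))  # 0
```

Here the top-scoring "O" falls below the assumed cutoff, so the sketch returns "0", the runner-up with the higher assumed likelihood of being confused with "O".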
It is desirable to provide an improved technique for pre-processing text printed with a dot-matrix print head or an ink-jet printer so as to improve optical character recognition of such printed text.