Optical Character Recognition (OCR) algorithms are currently used in a wide variety of applications for converting digitized image data of characters to their ASCII equivalents. This is especially useful in data entry applications where thousands of documents are processed daily. For example, in processing health claim forms, many insurance carriers currently rely on data entry personnel to enter the data into their database. By taking advantage of OCR, the data can be entered into a database more accurately and with a higher throughput, thereby reducing the associated costs.
OCR is especially applicable to typewritten fonts such as Gothic or Courier. OCR readability of these types of fonts is quite good. However, in the health insurance industry mentioned above, claim forms are received from many different sources. Some of these forms may be filled out using a typewriter, some may be filled out by hand, while others may be printed on a dot-matrix printer. Sorting these incoming documents allows the insurance carrier to use an OCR device to read the typewritten documents and manual data entry for the hand-printed documents. Dot-matrix documents, however, pose a problem: because they are machine generated, they are not as easy to sort out as typed versus hand-printed documents, yet their print quality yields lower OCR read rates (and therefore more manual data entry to fix the mistakes).
Recognition rates for typewritten text are quite good due to the consistent quality of the print. For example, a Gothic letter "S" is very similar to a Courier "S", yet either "S" is easily distinguishable from a number "5". Although dot-matrix characters from different printers also look similar, there is less information to distinguish a "5" from an "S". This is especially true for 9-pin draft-quality dot-matrix print, which is typical of less expensive printers. The individual dots forming the characters tend to confuse OCR algorithms that have not been specifically developed for dot-matrix printed text, thereby reducing accuracy.
OCR algorithms intended for use on continuous fonts, such as those obtained from a typewriter, recognize typewritten characters much more accurately than characters produced by dot-matrix printers. It has been shown that read rates for dot-matrix printed text can be increased by improving (or filtering) the image data as described in U.S. Pat. No. 5,182,778. Unfortunately, the use of such a filter requires the user to separate the dot-matrix printed documents from the typewritten documents, since the filter distorts typewritten text images beyond acceptable recognition by the OCR algorithm. Another disadvantage is that the video information to be "read" by the OCR algorithm for a given document must all be of the same type (i.e., either dot-matrix printed text or typewritten text, but not both). For these reasons, it is necessary to distinguish dot-matrix printed text from typewritten text.
The invention of William E. Weideman, U.S. Statutory Invention Registration No. H681, detects the presence of dot-matrix printed text. However, that invention assumes that grey-scale image data is available for use by its low-pass and high-pass filters. In systems where only binary image data is available, that algorithm would not work. The present invention, on the other hand, has been specifically developed to handle the case of binary image data.
The invention of Chao K. Chow, U.S. Pat. No. 3,634,822, detects the presence of dot-matrix text assuming that the individual character images have already been separated. It then computes the probability that a given character belongs to a given font style by comparing the unknown character representation to known character representations of three fonts using style determination functions. It does not examine regions of the document image data, but rather examines individual character image data after separation. The present invention relies neither on probability nor on individually separated character image data, but rather examines the entire image data set for certain characteristics to be described later.
Still another invention, by Robert Todd et al., U.S. Pat. No. 4,274,079, presents a method whereby a switch character is used in the actual print string to flag when a change of font will occur. The present invention does not require a switch character and will automatically indicate when a different font, specifically dot-matrix, is present.
The present invention uses certain characteristics of dot-matrix printed text. Based on these characteristics, a neighborhood of pixels is flagged as containing dot-matrix printed text. All neighborhoods of pixels within the image are evaluated in the same way, and a temporary map (referred to as a filter mask) is created indicating which neighborhoods of pixels contain dot-matrix text. Given the filter mask, a control system can then be used to decide whether a given neighborhood of pixels should be processed by the enhancement algorithm described in the previously cited reference, or by some similar method.
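The construction of a filter mask can be illustrated with a minimal sketch. The characteristics actually used by the invention are described later and are not reproduced here; the sketch instead assumes a purely hypothetical characteristic, namely the fraction of black pixels in a neighborhood that have no black 4-connected neighbor. The names isolated_pixel_ratio and build_filter_mask, the tile size, and the threshold are all illustrative assumptions.

```python
from typing import List

Image = List[List[int]]  # binary image: 1 = black pixel, 0 = white pixel


def isolated_pixel_ratio(tile: Image) -> float:
    """Hypothetical characteristic (an assumption, not the patented test):
    fraction of black pixels with no black 4-connected neighbor in the tile."""
    rows = len(tile)
    cols = len(tile[0]) if rows else 0
    black = 0
    isolated = 0
    for r in range(rows):
        for c in range(cols):
            if not tile[r][c]:
                continue
            black += 1
            neighbors = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            touching = any(
                0 <= nr < rows and 0 <= nc < cols and tile[nr][nc]
                for nr, nc in neighbors
            )
            if not touching:
                isolated += 1
    return isolated / black if black else 0.0


def build_filter_mask(image: Image, tile: int = 8,
                      threshold: float = 0.5) -> List[List[bool]]:
    """Evaluate every tile-by-tile neighborhood and record which ones are
    flagged as dot-matrix. Tile size and threshold are illustrative only."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    mask = []
    for top in range(0, rows, tile):
        mask_row = []
        for left in range(0, cols, tile):
            block = [row[left:left + tile] for row in image[top:top + tile]]
            mask_row.append(isolated_pixel_ratio(block) >= threshold)
        mask.append(mask_row)
    return mask


if __name__ == "__main__":
    # Left half: separated dots; right half: a solid stroke.
    sample = [
        [1, 0, 1, 0, 1, 1, 1, 1],
        [0, 0, 0, 0, 1, 1, 1, 1],
        [1, 0, 1, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0],
    ]
    print(build_filter_mask(sample, tile=4))
```

In this toy image the left neighborhood consists only of separated dots and is flagged, while the right neighborhood contains a solid stroke and is not. At practical scan resolutions a printed dot spans several pixels, so a usable characteristic would have to differ from this single-pixel test.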
As a result, only those areas within the given image that have been determined to contain dot-matrix printed text are enhanced for improved optical character recognition. Furthermore, the algorithm is capable of operating on said image at the full data rate (real-time, on-line processing) of the scanning device output when implemented in hardware.
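The selective enhancement step can be sketched as follows, under stated assumptions. The mask is taken as given (one boolean per tile-by-tile neighborhood), and fill_horizontal_gaps is only a hypothetical stand-in for an enhancement filter; it is not the filter of the cited reference. The sketch is a software illustration and says nothing about the hardware implementation or its data rate.

```python
from typing import Callable, List

Image = List[List[int]]  # binary image: 1 = black pixel, 0 = white pixel


def fill_horizontal_gaps(block: Image) -> Image:
    """Hypothetical stand-in enhancement: turn a white pixel black when both
    of its horizontal neighbors in the original block are black."""
    result = []
    for row in block:
        new_row = row[:]
        for c in range(1, len(row) - 1):
            if row[c] == 0 and row[c - 1] == 1 and row[c + 1] == 1:
                new_row[c] = 1
        result.append(new_row)
    return result


def enhance_selected(image: Image, mask: List[List[bool]], tile: int,
                     enhance: Callable[[Image], Image]) -> Image:
    """Apply `enhance` only to neighborhoods flagged in the filter mask;
    all other neighborhoods are copied through unchanged."""
    out = [row[:] for row in image]
    for mr, mask_row in enumerate(mask):
        for mc, flagged in enumerate(mask_row):
            if not flagged:
                continue
            top, left = mr * tile, mc * tile
            block = [row[left:left + tile] for row in image[top:top + tile]]
            for dr, new_row in enumerate(enhance(block)):
                out[top + dr][left:left + len(new_row)] = new_row
    return out


if __name__ == "__main__":
    image = [[1, 0, 1, 0, 1, 0, 1, 0]]
    mask = [[True, False]]  # only the left neighborhood is flagged
    print(enhance_selected(image, mask, tile=4, enhance=fill_horizontal_gaps))
```

Only the flagged left neighborhood is altered; the unflagged right neighborhood, although it contains the same pixel pattern, passes through untouched, which is the behavior needed to keep typewritten regions from being distorted.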