Optical Character Recognition (OCR) devices require that a document be scanned and digitized. Once the image has been digitized and processed to correct for any discrepancies, the image data is stored in a memory device. The OCR device then examines the image data to determine the identity of each character stored in the memory. In doing so, the text image data is converted to a string of numerical codes (such as ASCII), thus retaining the identity of each individual character for future reference. The numerical codes can then be entered into a database or filed for data manipulation.
In applications where many forms are to be processed, OCR offers higher read rates and greater throughput than manual data entry. Unfortunately, OCR devices are capable of processing only a predetermined set of text characters. Because the forms processing industry accepts forms from many different sources, OCR devices must be capable of handling a variety of documents printed by many different printer devices. Currently, OCR sites are forced to sort their documents into two classes: OCR-readable and non-OCR-readable text documents. In the latter case, manual data entry is required, since OCR read accuracy for this class of form is very poor.
The non-OCR-readable class of documents includes handwritten text and text printed with a dot-matrix printer, as well as poor-quality text (broken characters) resulting from worn ribbons used with daisy wheel or near-letter-quality printers. The present invention addresses dot-matrix printed text by correcting the image data so as to eliminate the discontinuities inherent in the way dot-matrix characters are printed, in effect creating continuous lines and curves from the dot patterns generated by a dot-matrix print head. Once the image data has been so processed, the OCR device is able to analyze it and determine the numerical codes for the bit-mapped images more accurately.
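The idea of turning dot patterns into continuous strokes can be illustrated with a minimal sketch. It is not the method claimed by the present invention; it is a simple run-length gap fill that assumes the image is a list of rows of 0 (white) and 1 (black) pixels. The function names `fill_row_gaps` and `fill_gaps` and the parameter `max_gap` are hypothetical and chosen only for illustration.

```python
def fill_row_gaps(row, max_gap=1):
    """Return a copy of a 0/1 pixel row in which every white run of at most
    max_gap pixels lying between two black pixels is set to black."""
    out = list(row)
    last_black = None  # index of the most recent black pixel seen
    for x, pixel in enumerate(row):
        if pixel:
            gap = x - last_black - 1 if last_black is not None else 0
            if 0 < gap <= max_gap:
                for g in range(last_black + 1, x):
                    out[g] = 1
            last_black = x
    return out


def fill_gaps(image, max_gap=1):
    """Apply the gap fill along rows, then along columns (assumed order)."""
    rows = [fill_row_gaps(r, max_gap) for r in image]
    # Transpose, fill each column as if it were a row, then transpose back.
    cols = [fill_row_gaps(c, max_gap) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


# Four dots at the corners of a 3x3 cell, as a dot-matrix head might print.
dots = [[1, 0, 1],
        [0, 0, 0],
        [1, 0, 1]]
bridged = fill_gaps(dots, max_gap=1)
```

In this sketch only horizontal and vertical gaps are bridged. Diagonal dot spacing, which a real dot-matrix font also produces, would need additional handling.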
In the past, there have been numerous inventions that examine the binary image data and correct for discontinuities in the characters. One such invention, disclosed in U.S. Pat. No. 3,609,685 by Edward Samuel Deutch, describes a method for correcting the image prior to recognition. The method is well suited to correcting image data containing discontinuities that are not inherent to the shape of the character. That invention examined the shape of a character that had been scanned, digitized, and stored in a memory device. The image was examined by tracing the character's shape to determine the individual branch components that make up the character. In order to accurately identify the simplest branch component of the character, the invention required that there be no discontinuities in the branch component, since any discontinuity would cause additional branch components to be created. Occasionally, undesired discontinuities did exist. To correct for them, adjacent branch components were compared to determine whether any could be connected together. In the case of dot-matrix generated text, however, such discontinuities are inherently present because of the spacing of the printing elements. These inherent discontinuities create many different branch components, making it difficult for such an apparatus to determine which branch components need to be connected.
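The effect of inherent discontinuities on a component-based approach can be shown with a small sketch. It does not reproduce the tracing method of U.S. Pat. No. 3,609,685; as a stand-in for branch components it merely counts 8-connected groups of black pixels, assuming an image stored as a list of rows of 0 and 1. The function name `count_components` is hypothetical.

```python
from collections import deque


def count_components(image):
    """Count 8-connected groups of black (1) pixels in a 0/1 image."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # A new, unvisited black pixel starts a new component.
            count += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


# A vertical stroke printed as one solid line versus as separated dots.
solid_stroke = [[1], [1], [1], [1], [1]]
dotted_stroke = [[1], [0], [1], [0], [1]]
solid_count = count_components(solid_stroke)
dotted_count = count_components(dotted_stroke)
```

Here the solid stroke forms a single component, while the same stroke printed as three separated dots forms three. A complete dot-matrix character multiplies this effect across every stroke.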
U.S. Pat. No. 4,791,679, by Lori L. Barski and Roger S. Gaborski, disclosed a method for improving broken or weak character image data. That invention was primarily concerned with evaluating neighboring regions to determine the percentage of black pixels within each region. The areas between the neighboring regions are filled in according to a particular threshold, so that the lines and curves of the character are smooth. The method does not consider the spacing between the dots, but rather the percentage of black pixels within specific regions. In addition, none of these earlier algorithms was implemented with real-time processing capability; they relied on an image stored in memory that could be manipulated by a software program.
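A density-based fill of the general kind described above can be sketched as follows. It is an illustration under stated assumptions, not the procedure of U.S. Pat. No. 4,791,679: the region is assumed to be a square window centered on each pixel, and the names `black_fraction` and `density_fill` and the parameters `radius` and `threshold` are hypothetical.

```python
def black_fraction(image, y, x, radius):
    """Fraction of black pixels in the square window centered on (y, x),
    clipped at the image borders."""
    height, width = len(image), len(image[0])
    total = black = 0
    for ny in range(max(0, y - radius), min(height, y + radius + 1)):
        for nx in range(max(0, x - radius), min(width, x + radius + 1)):
            total += 1
            black += image[ny][nx]
    return black / total


def density_fill(image, radius=1, threshold=0.5):
    """Set a white pixel to black when the black fraction of its window in
    the original image reaches the threshold; black pixels are kept."""
    return [
        [1 if pixel or black_fraction(image, y, x, radius) >= threshold else 0
         for x, pixel in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Two dense columns with a one-pixel gap: the gap is filled.
dense = [[1, 0, 1],
         [1, 0, 1],
         [1, 0, 1]]
dense_filled = density_fill(dense)

# Two isolated dots three pixels apart: density is too low, nothing changes.
sparse = [[1, 0, 0, 0, 1]]
sparse_filled = density_fill(sparse)
```

The second case shows the limitation noted above: because the decision depends only on local pixel density, widely spaced dots are left unconnected regardless of how regular their spacing is.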