1. Technical Field
The invention relates to pre-processing devices which remove non-text material from a text image for use with optical character recognition systems that are capable of processing only text images.
2. Background Art
Optical character recognition (OCR) systems of the type well-known in the art digitize the image of a text document and then process the digitized image so as to deduce the identity of each character in the image, the location and identity of each character being stored in a memory. The text image may be automatically reconstructed upon demand by simply retrieving the data from the memory and printing the characters identified therein at the locations specified in the memory. Such OCR systems are capable of processing images of a predetermined set of text characters and nothing else. For this reason, documents which are to be OCR-processed must be carefully prepared to be sure that the characters on the document are all contained within the predetermined character set and that there are no other images on the document. For example, documents containing both text and graphics images tend to confuse such OCR systems. The graphical images typically include relatively long lines or curves which are unrecognizable to the OCR system. Thus, a document containing a graphical image cannot be processed by such an OCR system unless the graphical image is first removed.
There are a number of well-known methods for removing graphical or non-text images from the digitized image of a document in order to allow it to be processed by an OCR system. One type of method uses run length analysis, in which the number of contiguous "on" pixels in the same row (or column) in the image is noted and used as the basis of decision-making. Such a technique is disclosed in Japanese Patent No. JP 61-193277 to Matsuura et al., in Rohrer U.S. Pat. No. 4,590,606, and in K. Kerchmar, "Amount Line Finding Logic", IBM Technical Disclosure Bulletin, Volume 15, No. 5, pages 1531 to 1532 (October 1972). A related technique, disclosed in Kataoka U.S. Pat. No. 4,559,644, is to low-pass filter the image data to detect long lines, which have a relatively low frequency content compared with text characters. A different technique is to decide whether a particular portion of the image is text or non-text graphical information based upon the number or density of black ("on") pixels in that region or line of pixels. This latter technique is disclosed in Japanese Patent No. JP 60-77278 to Isobe et al. and Japanese Patent No. JP 60-116076 to Iwase. Yet another technique is to segment the image data and decide whether each segment is text or non-text graphical information based upon the statistical properties of the segment, as disclosed in Yasuda et al., "Data Compression for Check Processing Machines", Proceedings of the IEEE, Volume 68, No. 7, pages 874 through 885 (July 1980).
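By way of illustration only, run length analysis of a single row may be sketched as follows. The sketch assumes a row is a list of 0/1 pixel values; the function names and the `max_text_run` threshold are hypothetical and are not taken from any of the cited references.

```python
def find_runs(row):
    """Return (start, length) for each maximal run of "on" (1) pixels in a row."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if pixel and start is None:
            start = x                        # a new run begins here
        elif not pixel and start is not None:
            runs.append((start, x - start))  # the run ended at x - 1
            start = None
    if start is not None:                    # a run reaching the end of the row
        runs.append((start, len(row) - start))
    return runs


def long_runs(row, max_text_run):
    """Return runs longer than max_text_run (an assumed threshold).

    Such runs are candidates for graphical lines, since strokes of text
    characters are normally much shorter.
    """
    return [(start, length) for start, length in find_runs(row)
            if length > max_text_run]
```

For the row `[0, 1, 1, 0, 1, 1, 1, 1, 0, 1]`, `find_runs` yields three runs of lengths 2, 4 and 1, and with `max_text_run` set to 3 only the run of length 4 is reported as a candidate line.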
Combining run length analysis with connected component analysis in a process for removing non-text graphical information from the text data of an image is disclosed in Nolan, "Line/Symbol Separation for Raster Image Processing", IBM Technical Disclosure Bulletin, Volume 15, No. 12 (May 1973), pages 3879 through 3883. This publication discloses a process for deciding whether a given run length of contiguous "on" pixels in the image should be classified as a graphical line to be discarded by determining whether it corresponds to a similar run length of "on" pixels in the preceding scan line which was previously identified as a graphical or non-text line.
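The scan-line-to-scan-line decision described above may be loosely sketched as follows. This is an illustration under stated assumptions, not the procedure of the cited publication: two runs are assumed to "correspond" when their column ranges overlap and their lengths differ by no more than `length_tolerance`, and a run is assumed to start a new graphical line when it reaches `seed_length`. Both parameters and all function names are hypothetical.

```python
def runs_correspond(run, previous_run, length_tolerance=2):
    """Assumed test: column ranges overlap and the lengths are similar.

    Each run is a (start, length) pair.
    """
    start, length = run
    prev_start, prev_length = previous_run
    overlap = start < prev_start + prev_length and prev_start < start + length
    return overlap and abs(length - prev_length) <= length_tolerance


def classify_scan_line(runs, previous_graphic_runs, seed_length=8):
    """Return the runs of the current scan line classified as graphical.

    A run is classified as graphical if it is long enough to seed a line
    (assumption), or if it corresponds to a run in the preceding scan line
    that was already classified as graphical.
    """
    graphic = []
    for run in runs:
        seeded = run[1] >= seed_length
        continued = any(runs_correspond(run, prev)
                        for prev in previous_graphic_runs)
        if seeded or continued:
            graphic.append(run)
    return graphic
```

Under these assumptions, a run of length 7 beginning at column 3 is classified as graphical when the preceding scan line contained a graphical run of length 8 beginning at column 2, even though the run would be too short to be classified on its own.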
Connected component analysis is a well-known technique used in connection with either image processing or text processing in which separately identified objects in an image are joined together as a single object whenever certain predetermined parameters are met. This technique is disclosed in Urushibata U.S. Pat. No. 4,624,013, Japanese Patent No. JP 60-3074 to Ozawa and Frank U.S. Pat. No. 4,189,711. Connected component analysis in which the pixels of different objects are labelled with different object numbers is disclosed in Japanese Patent No. JP 60-250480 to Ninomiya et al. and Japanese Patent No. JP 60-200379 to Ariga et al. Connected component analysis and processes like it are useful for framing individual text characters in an image, as disclosed in Kumpf U.S. Pat. No. 4,403,340 and Kadota U.S. Pat. No. 4,045,773. The patent to Kadota et al. teaches discarding as noise any object whose height and width are deemed to be too small. One way in which connected component analysis is applied to separate text from non-text matter is to determine whether a length of connected "on" pixels is statistically close to a predetermined text line length, as disclosed in Scherl U.S. Pat. No. 4,513,442.
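The framing of objects and the discarding of small objects as noise may be sketched as follows. The sketch is illustrative only: it assumes a binary image stored as a list of rows of 0/1 values, assumes 8-connectivity, and uses a simple breadth-first fill rather than the method of any cited patent. The function names and the `min_height` and `min_width` thresholds are hypothetical.

```python
from collections import deque


def component_boxes(image):
    """Return a bounding box (top, left, bottom, right) for each
    8-connected object of "on" (1) pixels, in raster-scan order."""
    height = len(image)
    width = len(image[0]) if image else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first fill of one object, growing its frame as we go.
            seen[y][x] = True
            queue = deque([(y, x)])
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


def drop_small(boxes, min_height, min_width):
    """Discard as noise any box whose height and width are both below
    the assumed thresholds."""
    kept = []
    for top, left, bottom, right in boxes:
        box_height = bottom - top + 1
        box_width = right - left + 1
        if box_height < min_height and box_width < min_width:
            continue
        kept.append((top, left, bottom, right))
    return kept
```

An image containing one 2-by-2 block and two isolated pixels produces three frames, of which only the frame of the 2-by-2 block survives `drop_small` with both thresholds set to 2.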
Connected component analysis is also applied in image processing of non-text or graphical images, as exemplified in the following references. Agrawala U.S. Pat. No. 4,183,013 discloses measuring the size (number of pixels) of each object and rejecting as noise those objects which are deemed to be too small. Other examples are Frank U.S. Pat. No. 4,107,648, Grosskopf U.S. Pat. No. 3,967,053 and Scott U.S. Pat. No. 3,408,485. The patent to Scott et al. teaches the technique of connected component analysis in which each object is individually numbered and may be renumbered if subsequent scanning reveals that some objects are in fact connected with one another.
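The numbering and renumbering of objects, together with the rejection of objects having too few pixels, may be sketched with a conventional two-pass labelling procedure. The sketch is not drawn from the cited patents: it assumes 4-connectivity, uses a union-find table to record that two provisional object numbers belong to the same object, and takes a hypothetical `min_pixels` threshold.

```python
def label_objects(image):
    """Two-pass labelling of 4-connected objects in a 0/1 image.

    Returns (labels, sizes): labels holds an object number per pixel
    (0 for background) and sizes maps object number to pixel count.
    """
    height = len(image)
    width = len(image[0]) if image else 0
    labels = [[0] * width for _ in range(height)]
    parent = [0]  # parent[n] links object number n toward its representative

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    # First pass: assign provisional numbers and record equivalences.
    for y in range(height):
        for x in range(width):
            if not image[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if not up and not left:
                parent.append(len(parent))       # a new object number
                labels[y][x] = len(parent) - 1
            elif up and left:
                root_up, root_left = find(up), find(left)
                root = min(root_up, root_left)
                parent[max(root_up, root_left)] = root  # objects are connected
                labels[y][x] = root
            else:
                labels[y][x] = find(up or left)

    # Second pass: renumber each pixel to its representative and count pixels.
    sizes = {}
    for y in range(height):
        for x in range(width):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = root
                sizes[root] = sizes.get(root, 0) + 1
    return labels, sizes


def reject_small_objects(labels, sizes, min_pixels):
    """Return a 0/1 image keeping only objects of at least min_pixels pixels."""
    return [[1 if label and sizes[label] >= min_pixels else 0 for label in row]
            for row in labels]
```

In a U-shaped object, the two vertical strokes first receive different object numbers and are renumbered as one object once the scan reaches the row that joins them.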
The publication by Nolan in the IBM Technical Disclosure Bulletin cited above, while teaching the combination of run length analysis and a process like connected component analysis to remove non-text information from an image, has two shortcomings. First, in performing run length analysis, neither it nor any of the other foregoing references recognizes that a run of "on" pixels is more likely to be non-text information, regardless of its length, the higher the density of "on" pixels in its row. Instead, only the run length is measured. Second, it provides no way to save a true text character which is actually joined to a non-text line or graphical curve when the non-text information is removed from the image. Such characters are simply lost, which is a significant problem.
Accordingly, it is an object of the invention to provide a process for removing non-text information from an image which takes into account not only the length of a run of "on" pixels but also the density of "on" pixels in the row (or column) of the image in which the run resides.
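A decision that weighs run length against row density might take a form such as the following. This section does not state the actual decision rule, so the linear rule below, the `base_length` parameter and the function names are purely hypothetical and serve only to show how the two measurements could interact.

```python
def row_density(row):
    """Fraction of "on" (1) pixels in a row of 0/1 values."""
    return sum(1 for pixel in row if pixel) / len(row) if row else 0.0


def is_non_text_run(run_length, density, base_length=40):
    """Hypothetical rule: the run length required to classify a run as
    non-text shrinks linearly as the density of its row grows.

    At zero density a run must exceed base_length; at full density any
    run is classified as non-text.
    """
    required_length = base_length * (1.0 - density)
    return run_length > required_length
```

Under this assumed rule, a run of 25 pixels is left alone in a sparsely populated row but is classified as non-text in a row in which half of the pixels are "on".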
It is a further object of the invention to provide a process for removing non-text information from an image which restores characters joined to a graphic or non-text line which has been removed from the image.