1. Technical Field
The invention relates to the use of machine intelligence to detect text on a composite document page that may also contain graphics and images. In particular, the invention relates to computer programs and systems for identifying text and non-text areas in documents.
2. Description of the Prior Art
Electronic image files of printed pages are relatively easy to obtain with a computer and a scanner. A typical image processing system is described by Hisao Shirasawa et al. in U.S. Pat. No. 5,696,842, issued Dec. 9, 1997. Scanned color documents typically include image, graphics, and text components. A separator is used to divide M×N picture elements according to the type of image in each. Picture elements with black and white values are differentiated from those that are not all-black or all-white. The purpose of such an image processing system is to allow for high degrees of image compression, because the black and white image areas can be encoded with far fewer bits per pixel than a color graphic requires.
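The block-wise separation described above can be illustrated with a minimal sketch. It is not the method of the cited patent; it only shows the general idea of labeling M×N blocks as black-and-white or otherwise. The function name `classify_blocks`, the representation of the page as a list of rows of 8-bit gray values, the reading of M as block height and N as block width, and the two thresholds are all assumptions made for illustration.

```python
def classify_blocks(page, m, n, black_max=32, white_min=223):
    """Label each m-row by n-column block of a grayscale page.

    A block is labeled "bilevel" when every pixel in it is near-black
    (<= black_max) or near-white (>= white_min), and "other" otherwise.
    The thresholds are illustrative assumptions, not values from the source.
    """
    labels = []
    for top in range(0, len(page), m):
        row_labels = []
        for left in range(0, len(page[0]), n):
            # Gather the pixels of this block; edge blocks may be smaller.
            block = [px for row in page[top:top + m] for px in row[left:left + n]]
            bilevel = all(px <= black_max or px >= white_min for px in block)
            row_labels.append("bilevel" if bilevel else "other")
        labels.append(row_labels)
    return labels
```

Blocks labeled "bilevel" in such a scheme could be stored at one bit per pixel, while the remaining blocks would keep their full depth, which is the source of the compression gain noted above.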
Unfortunately, such prior art techniques are concerned with issues such as compression and decompression, not with specifically identifying the textual elements of the image. While optical character recognition (OCR) systems are known, these systems are concerned less with the fast and accurate reproduction of text in a printed page that also contains graphics than with character identification, typically for an all-text source.
It would be advantageous to provide an improved text detection technique in which image processing is performed based upon prior knowledge of the nature of the source image components, e.g., text or image, before such processing commences.