The phrase “text extraction” means the identification of text characters and graphics from an image without prior knowledge of the underlying alphabet, text layout, font size, and/or orientation. Each pixel may be classified as text or non-text using a single bit. This may be viewed as a form of binarization.
The need for text extraction arises in many image-processing applications. Automatic Optical Character Recognition (OCR) software has become widely available to average consumers. Combined with an electronic scanner, OCR software provides a convenient way to convert paper documents into an electronic form that can be more easily stored and processed. Text extraction is generally the first step in the OCR process, although it is also possible to extract features directly from grayscale images. In compound document compression, once the text pixels have been identified and separated from the others, compression methods suited to text can be applied to them so that readability is preserved at a high compression ratio.
Because text extraction may be viewed as a signal detection problem, it requires a good characterization of the signal. For example, a global thresholding algorithm typically assumes that the image luminance distribution fits a bimodal Gaussian model. Such a model may characterize the black text characters on a white background produced by earlier binary printing techniques. If a document image can be characterized as black text on a slowly varying background, various adaptive thresholding algorithms may be more appropriate. However, as printing techniques have advanced, text lying on top of a complicated background has become common. For images of this type, the background can no longer be characterized as uniform or slowly varying. As a result, neither global thresholding nor adaptive thresholding may be able to extract the text pixels satisfactorily. More recently, many researchers have proposed techniques based on text properties such as color uniformity and stroke width. An earlier method used stroke width to distinguish characters from the background by detecting pixels near edges with a second derivative operator and searching for a match within a stroke width distance. The sensitivity to noise caused by the second derivative was later addressed by proposals to use window-based local averages.
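The difference between the two thresholding families can be illustrated with a minimal sketch. It is not taken from any of the methods referred to above: the function names, the Otsu-style between-class variance criterion used for the global threshold, and the `radius` and `offset` parameters of the local-mean rule are all assumptions chosen for illustration. Images are plain lists of rows of 8-bit luminance values, and the output uses one bit per pixel (1 for text, 0 for non-text).

```python
from typing import List

Image = List[List[int]]  # rows of 8-bit luminance values (0 = black, 255 = white)


def global_threshold(image: Image) -> int:
    """Pick a single threshold for the whole image.

    Assumption: an Otsu-style criterion (maximize between-class variance),
    which suits a roughly bimodal luminance histogram.
    """
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    total_sum = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    dark_count, dark_sum = 0, 0
    for t in range(256):
        dark_count += hist[t]
        dark_sum += t * hist[t]
        light_count = total - dark_count
        if dark_count == 0 or light_count == 0:
            continue  # one class is empty, so this split is meaningless
        dark_mean = dark_sum / dark_count
        light_mean = (total_sum - dark_sum) / light_count
        variance = dark_count * light_count * (dark_mean - light_mean) ** 2
        if variance > best_var:
            best_var, best_t = variance, t
    return best_t


def binarize_global(image: Image) -> Image:
    """Mark a pixel as text (1) when it is at or below the global threshold."""
    t = global_threshold(image)
    return [[1 if value <= t else 0 for value in row] for row in image]


def binarize_adaptive(image: Image, radius: int = 1, offset: int = 10) -> Image:
    """Mark a pixel as text (1) when it is darker than its local window mean
    by more than `offset`. Both parameters are illustrative assumptions."""
    height, width = len(image), len(image[0])
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[j][i]
                for j in range(max(0, y - radius), min(height, y + radius + 1))
                for i in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            local_mean = sum(window) / len(window)
            mask[y][x] = 1 if image[y][x] < local_mean - offset else 0
    return mask
```

On a single-row test image such as `[[100, 40, 140, 160, 180, 120, 220]]`, where two dark pixels sit on a brightening ramp, no single threshold can separate them from the background, because the background value 100 is darker than the text value 120. The local-mean rule with `radius=1` and `offset=10` marks exactly those two pixels. Neither rule is expected to cope with a background that varies as quickly as the text itself.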
Techniques based on stroke width are typically aimed at extracting handwritten characters from complex backgrounds. A typical example is handwritten checks. In this case, it is reasonable to assume that the stroke width lies within a small and known range. However, many types of document images with printed characters, such as magazine pages, normally have a wide and unknown range of font sizes, which makes techniques that rely on stroke width ineffective. Moreover, linear spatial averaging may reduce the boundary accuracy of the identified characters.
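The dependence on a known stroke width can be seen in a simplified one-dimensional sketch. It is only loosely modeled on the edge-matching idea described above and is not the cited method: the function names, the `[1, -2, 1]` second-difference kernel, and the `strength` parameter are assumptions. It handles dark strokes at least two pixels wide on a lighter background along a single scanline, and it does not verify that the pixels between two matched edges are actually dark.

```python
from typing import List


def second_derivative(row: List[int]) -> List[int]:
    """Discrete second difference along a scanline; endpoints are left at 0."""
    d2 = [0] * len(row)
    for i in range(1, len(row) - 1):
        d2[i] = row[i - 1] - 2 * row[i] + row[i + 1]
    return d2


def stroke_mask(row: List[int], max_width: int, strength: int = 100) -> List[int]:
    """Mark runs bounded by two dark-side edge pixels at most `max_width` apart.

    A strongly positive second difference means the pixel is darker than the
    average of its neighbors, i.e. it lies just inside a dark stroke.
    """
    d2 = second_derivative(row)
    dark_edge = [value >= strength for value in d2]
    mask = [0] * len(row)
    for i in range(len(row)):
        if not dark_edge[i]:
            continue
        # Search for the opposite edge of the stroke within max_width pixels.
        for j in range(i + 1, min(len(row), i + max_width)):
            if dark_edge[j]:
                for k in range(i, j + 1):
                    mask[k] = 1
                break
    return mask
```

With `max_width=3`, the row `[200, 200, 50, 50, 50, 200, 200]` yields the mask `[0, 0, 1, 1, 1, 0, 0]`, whereas a dark run six pixels wide finds no matching edge and is left unmarked. That is the failure mode to expect when font sizes, and therefore stroke widths, fall outside the assumed range.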
Other problems associated with identifying text-like pixels in a digital image include the following: the text may lie on top of a pictorial patch, a text block may not be rectangular, and/or the luminance of the text may be darker or lighter than its surroundings. These problems make it difficult to reliably identify text-like pixels in a digital image.
Text extraction has many applications. For example, text extraction may be used in software running on a computer where, given an image file, the software compresses the file and saves it in PDF format. Another example is a scanner that, while a document is being scanned, compresses the document and saves it in PDF format. A third example is software running on a computer where, given an image file, the software extracts the text pixels into an image and feeds that image to a separate OCR software program.