1. Field of the Invention
The present invention relates generally to methods for image processing, and more particularly to methods for binarization of gray-level images for text processing.
2. Description of the Related Art
The traditional method of transmitting and storing information is the paper document. Improvements in computer technology are fast replacing the paper document with digital document representations. The digital representation of document data allows for efficient indexing and retrieval, massive amounts of storage, immediate transmission, and storage for unlimited periods of time without degradation.
The transformation of paper documents into digital documents should be done in a way that preserves the information content, including text, graphics, and formatting. The process of transforming from paper document to a digital image is called optical scanning. Optical scanning is accomplished in a number of ways, typically by using electronic cameras or flatbed scanners. These technologies create a digital image of the document page.
A digital image consists of a two-dimensional array of values for which each individual value represents the light intensity reflected from a corresponding spatial location of the scanned document. The individual values in this two-dimensional array are called picture elements or pixels. Each pixel is represented by a digital value, with binary images using one binary digit (called a bit) per pixel and gray scale images using more than one binary digit per pixel.
Electronic storage and transmission of digital images uses the least memory or bandwidth if these pixels can each be represented as a single binary digit. While a document with black text on white paper might appear to have this binary characteristic already, real world documents have variations in the intensity of the image, pixels overlap areas of textual characters and areas with no textual characters, text and paper have variations in color and intensity, and many images contain pictures. For example, a page of magazine text may have only two levels of information, black text and the white background. However, a gray-scale image of the same page will have many more intensity values due to factors such as non-uniform printing of characters contained in the text and shadows caused by lighting effects. Other types of documents, such as journal covers, generally include multiple levels of information, e.g., multiple colors, which are used in both the text and background of the document page.
If the different shades of gray on a page are to be represented by the digital document, the individual pixel elements must be capable of representing more than two distinct intensity values. Because of these many variations, pixels are usually sampled using a range of possible values, using more than one binary digit (bit) per pixel for representation. Typical gray scale images use eight bits per pixel, allowing 2^8 (256) possible gray values. Color images may use three color values for red, green, and blue values, of eight bits each, totaling 24 bits per pixel. To conserve resources or allow automated processing such as Optical Character Recognition (OCR), the intensity level for each pixel needs to be converted to a single binary digit for each pixel, a process defined as image binarization.
Several techniques have been used to perform image binarization. These techniques fall into two categories: those intended to render image graphics for human viewing or analysis, and those intended for automated document processing. Techniques of the first category, such as the many dithering techniques, are generally not applicable to automated document processing, which is the subject of the present invention.
A binarized image should result in an image that, if viewed electronically, would still be consistent with the original paper document. The purpose of binarization is to yield an image suitable for automated processing, such as OCR. This requires that the binarized image be of high quality for legibility and best recognition by automated processes.
Image binarization techniques for automated document processing can be viewed as a classification problem, one in which each input gray scale pixel value is classified as either foreground or background. The single bit value for each pixel of the output binary image is assigned one value for foreground and the other for background.
Image binarization techniques for automated document processing can be further divided into two classes of methods. Methods in the first class use spatial derivative information to classify output pixels as either foreground or background. The algorithms determine rising and falling edge pixels in the input image using the spatial derivatives and then classify all pixels between the falling and rising edges as foreground. These techniques work well when there is sufficient spatial resolution and image contrast but are not appropriate for low resolution, low contrast, or very noisy images.
The present invention is a method of the second class where a direct transformation of the input gray scale pixel values to the output binary pixel value is accomplished. These methods typically calculate statistics of the image in the form of counts of the number of times each gray scale pixel value occurs in an image, called a histogram. This histogram information is used as a model of the Probability Distribution Function (PDF) for gray scale pixel intensity values. The histogram may be calculated either globally across the entire page, or within local regions of the image.
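By way of illustration only, the histogram described above might be computed as in the following minimal sketch. The function names gray_histogram and histogram_pdf, the list-of-rows image layout, and the default of 256 gray levels are assumptions made for this example and are not taken from the methods discussed.

```python
def gray_histogram(image, levels=256):
    """Count how often each gray value occurs in a 2-D list of integer pixels."""
    hist = [0] * levels
    for row in image:
        for value in row:
            hist[value] += 1
    return hist


def histogram_pdf(hist):
    """Normalize the counts so they sum to 1, giving an empirical model of the PDF."""
    total = sum(hist)
    return [count / total for count in hist]


# Hypothetical 2x3 page fragment: three light pixels and three dark pixels.
page = [
    [250, 250, 30],
    [250, 30, 30],
]
hist = gray_histogram(page)
pdf = histogram_pdf(hist)
```

The same functions may be applied either to the whole page (a global histogram) or to a sub-array of rows and columns (a local histogram).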
Thresholding is a common image processing operation, applied to gray-scale document images to obtain binary classification, which sets a bit to “true” for pixels equal to or above the threshold and to “false” for pixel values below the threshold. This binary decision defines a single bit value used to transform gray scale images into binary images. Generally speaking, this technique takes a gray scale image, in which each pixel has a corresponding multi-bit gray-level value, compares the gray-level to a threshold, and converts it into a binary value.
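The thresholding operation just described may be sketched as follows. The function name binarize and the use of 1 and 0 for "true" and "false" are assumptions made for illustration.

```python
def binarize(image, threshold):
    """Map each gray-level pixel to one bit: 1 if >= threshold, otherwise 0."""
    return [[1 if value >= threshold else 0 for value in row] for row in image]


# Hypothetical single-row image with a dark, a mid-gray, and a light pixel.
bits = binarize([[10, 128, 200]], threshold=128)
```

Note that a pixel exactly equal to the threshold is assigned the "true" value, matching the convention stated above.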
As gray scale documents may differ greatly in contrast, intensity, noise levels, and uniformity, different methods are defined to select a threshold that is appropriate for binarization of an input gray scale image. Many techniques examine the histogram to determine a suitable threshold. For example, a threshold may be set between the two largest peaks in a histogram.
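One possible reading of the peak-based example is sketched below. The text does not say how the peaks are located or where between them the threshold falls; this sketch assumes that the second peak must lie at least min_separation gray levels from the first and that the threshold is placed at the least-populated level between the two peaks. Both choices, and the function name, are assumptions.

```python
def threshold_between_peaks(hist, min_separation=32):
    """Pick a threshold in the valley between the two largest histogram peaks."""
    levels = range(len(hist))
    # Largest peak: the most frequent gray level.
    first = max(levels, key=lambda g: hist[g])
    # Second peak: the most frequent level far enough away from the first.
    candidates = [g for g in levels if abs(g - first) >= min_separation]
    second = max(candidates, key=lambda g: hist[g])
    lo, hi = sorted((first, second))
    # Valley: the least frequent level strictly between the two peaks.
    return min(range(lo + 1, hi), key=lambda g: hist[g])


# Hypothetical histogram with a dark peak near 30 and a light peak at 200.
example_hist = [0] * 256
example_hist[30] = 100
example_hist[31] = 80
example_hist[100] = 5
example_hist[200] = 90
peak_threshold = threshold_between_peaks(example_hist)
```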
The fastest and simplest thresholding technique is to determine a single global threshold for the entire image. An example of this technique is presented by Otsu, which defines a threshold that minimizes the within-class variance for a specific input image (“A Threshold Selection Method from Gray-Level Histograms,” IEEE Trans. Systems, Man, and Cybernetics, Vol. 9, No. 1 (1979)). However, this and other global thresholding methods frequently result in loss or confusion of the information contained in the gray scale image, because the background intensity varies across the image. The information in the image is embodied mainly in edges, and depends not so much on the absolute brightness of the pixels as on their brightness relative to their neighbors. Thus, depending on the choice of threshold, a meaningful edge in the gray-level image will disappear in the binary image if the pixels on both sides of the edge are binarized to the same value. On the other hand, artifacts in the binary image with the appearance of edges may occur in an area of continuous transition in the gray-level image, when pixels with very similar gray-level values fall on opposite sides of the chosen threshold.
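A global threshold of the kind attributed to Otsu may be sketched as follows. The sketch searches every candidate threshold and keeps the one with the largest between-class variance, which is equivalent to the smallest within-class variance because the two sum to the fixed total variance. The function name and the convention that pixels below the threshold form one class are assumptions made for illustration.

```python
def otsu_threshold(hist):
    """Return the threshold t that best separates levels < t from levels >= t."""
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_between = 0, -1.0
    w0 = 0      # number of pixels with gray level below t
    sum0 = 0    # sum of gray levels of those pixels
    for t in range(1, len(hist)):
        w0 += hist[t - 1]
        sum0 += (t - 1) * hist[t - 1]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so there is no split to score
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to the constant factor 1 / total**2.
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_between:
            best_between, best_t = between, t
    return best_t


# Hypothetical histogram: dark levels at 10 and 20, light levels at 200 and 210.
otsu_hist = [0] * 256
for level in (10, 20, 200, 210):
    otsu_hist[level] = 10
global_t = otsu_threshold(otsu_hist)
```

Because the single value is applied to every pixel, a page whose background darkens toward one edge would be handled poorly by this sketch, which is the limitation discussed above.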
An adaptation to this technique is to allow the threshold to vary as the image changes. A new threshold is computed for differing sub-regions of the image. In a method described by Bernsen (“Dynamic Thresholding of Grey-level Images,” Proc. Eighth Int'l Conf. Pattern Recognition (1986)) the maximum pixel value, IH, and minimum pixel value, IL, within a subregion of the image are found. A threshold value is computed as follows: Tval = (IH + IL)/2 if (IH − IL) > I; otherwise Tval = IL, where I is a tolerance on the variation in pixel values; variation greater than I indicates the presence of foreground. Otherwise, the threshold is set to the minimum to assign all input pixels the value for the background.
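The local rule attributed to Bernsen may be sketched as follows, assuming the mid-range form of the threshold, (IH + IL)/2, and treating the tolerance as a parameter named contrast_limit. The function name, the parameter name, and the example windows are assumptions made for illustration.

```python
def bernsen_threshold(window, contrast_limit):
    """Threshold for one subregion, given as a 2-D list of gray-level pixels."""
    values = [value for row in window for value in row]
    i_high, i_low = max(values), min(values)
    if i_high - i_low > contrast_limit:
        # Enough local contrast: foreground is present, so split at the mid-range.
        return (i_high + i_low) / 2
    # Nearly uniform subregion: every pixel is >= i_low, so all become background.
    return i_low


# Hypothetical subregions: one crossing a character edge, one of blank paper.
edge_t = bernsen_threshold([[10, 200], [50, 120]], contrast_limit=15)
flat_t = bernsen_threshold([[100, 104], [102, 101]], contrast_limit=15)
```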
Another method described by Niblack (An Introduction to Digital Image Processing (1986)) calculates the mean, μ, and standard deviation, σ, of pixel values within a subregion of the image. A threshold value is computed as follows: Tval = μ + kσ.
Values of −0.2 for k and a subregion size of 15×15 are suggested.
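The rule attributed to Niblack may be sketched as follows. The sketch assumes the population form of the standard deviation (dividing by the number of pixels); the text does not say which form is intended. The function name and the small example window are also assumptions; in practice the window would be the suggested 15×15 neighborhood.

```python
import math


def niblack_threshold(window, k=-0.2):
    """Return mean + k * standard deviation for one subregion of the image."""
    values = [value for row in window for value in row]
    n = len(values)
    mean = sum(values) / n
    variance = sum((value - mean) ** 2 for value in values) / n
    return mean + k * math.sqrt(variance)


# Hypothetical 2x2 window: mean 150, standard deviation 50.
local_t = niblack_threshold([[100, 100], [200, 200]])
```

With a negative k, the threshold sits below the local mean, so pixels only slightly darker than their surroundings remain classified with the lighter class.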
Another pixel histogram method is that of Chow and Kaneko (“Automatic Boundary Detection of the Left Ventricle from Cineangiograms,” Computers and Biomedical Research, Vol. 5 (1972)). This method tests the histogram from non-overlapping input image subregions for bi-modality (the presence of two dominant peaks expected for white and black) and models the histogram with the sum of two Gaussian distributions. A threshold is computed for all regions that are determined to be bi-modal. For regions that are not bi-modal, a threshold is interpolated from the thresholds of surrounding bi-modal regions. The individual thresholds are smoothed to eliminate outliers.
These techniques simply use statistical measures to determine a local or global threshold to be used for a two-class classification method. However, images with complicated background or images with a different relative proportion of background and foreground than expected will present challenges for these techniques. An alternative approach is to use models that adapt to differing image histograms to improve the classification.
An attempt to perform modeling of the histogram is introduced by Taxt (“Segmentation of Document Images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 11, No. 12 (1989)). This technique uses a method similar to Chow and Kaneko's method in an attempt to approximate the histogram of non-overlapping image subregions with the sum of two Gaussian distributions. However, Taxt's method uses an iterative algorithm to converge an initial guess of the Gaussian model parameters to the estimated solution and solves for the output binary pixel values using the most likely model for each input pixel value.
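The general idea of iteratively fitting two Gaussian models and then assigning each gray level to the more likely model may be sketched as follows. The iteration shown is a generic expectation-maximization update, used here as a stand-in; it is not asserted to be the algorithm of Taxt. The function names, the initial guess (means at the darkest and lightest occupied levels), and the variance floor min_var are all assumptions.

```python
import math


def gaussian(x, mean, var):
    """Normal density with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def responsibilities(g, weights, means, variances):
    """Probability that gray level g belongs to model 0 and to model 1."""
    p = [weights[j] * gaussian(g, means[j], variances[j]) for j in (0, 1)]
    norm = p[0] + p[1]
    if norm == 0.0:
        # Both densities underflowed: fall back to the nearer mean.
        nearer = 0 if abs(g - means[0]) <= abs(g - means[1]) else 1
        return [1.0 - nearer, float(nearer)]
    return [p[0] / norm, p[1] / norm]


def fit_two_gaussians(hist, iterations=50, min_var=1.0):
    """Refine an initial guess of two Gaussian models from a gray-level histogram."""
    occupied = [g for g, count in enumerate(hist) if count > 0]
    lo, hi = occupied[0], occupied[-1]
    total = float(sum(hist))
    weights = [0.5, 0.5]
    means = [float(lo), float(hi)]
    variances = [max(((hi - lo) / 4.0) ** 2, min_var)] * 2
    for _ in range(iterations):
        mass, first, second = [0.0, 0.0], [0.0, 0.0], [0.0, 0.0]
        for g in occupied:
            resp = responsibilities(g, weights, means, variances)
            for j in (0, 1):
                share = hist[g] * resp[j]
                mass[j] += share
                first[j] += share * g
                second[j] += share * g * g
        for j in (0, 1):
            if mass[j] == 0.0:
                continue  # keep the previous parameters for an empty model
            weights[j] = mass[j] / total
            means[j] = first[j] / mass[j]
            variances[j] = max(second[j] / mass[j] - means[j] ** 2, min_var)
    return weights, means, variances


def classify(g, weights, means, variances):
    """Return 0 if the darker model is more likely for gray level g, else 1."""
    resp = responsibilities(g, weights, means, variances)
    return 0 if resp[0] >= resp[1] else 1


# Hypothetical bimodal histogram: dark cluster near 30, light cluster near 200.
mix_hist = [0] * 256
for level, count in ((28, 5), (30, 10), (32, 5), (198, 5), (200, 10), (202, 5)):
    mix_hist[level] = count
w, m, v = fit_two_gaussians(mix_hist)
```

In this sketch, model 0 starts at the dark end of the histogram and model 1 at the light end, so the index returned by classify can be used directly as the output bit under the assumption that dark corresponds to foreground.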
This intuitive approach of modeling with two models, one for background and one for foreground, works well for clearly bimodal histograms, but does not work well with more complicated distributions of gray scale intensities. Particularly in images of low spatial resolution, individual gray scale pixel values do not always simply correspond to areas of background or foreground. Pixels at the borders of characters will correspond to regions of both foreground and background.
A limitation of the above identified statistical techniques is that their classification of input gray scale pixel values into one of two narrowly defined classes is inadequate for an accurate description of the underlying process of gray scale image formation. Clearly, the differing gray scale values correspond to differing contributions of background and foreground to a single pixel value. Classifying a gray scale value as strictly one or the other represents a coarser quantization of the physical process, limiting performance of these approaches.
In addition to performing the binarization of image data, the current method integrates a spatial resolution enhancement process as well. If a higher resolution binary image is desired, a common approach is to first expand the spatial resolution to a new higher resolution gray scale image. There are several techniques available for expanding the spatial resolution, including replication, linear interpolation, or cubic spline interpolation. The high resolution image is then binarized using one of the existing techniques defined above. This combination of techniques is adequate, but does not accurately reflect the formation of low resolution images. The differing gray levels represent different classes of pixels that should be classified differently, independently of their neighboring pixel values, rather than estimating gray levels using neighboring pixel values and classifying them into one of two classes. Thus the current method more accurately models the formation of the low resolution input image and should therefore provide better binarization performance.
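The conventional two-step approach described above (expand the resolution, then binarize) may be sketched as follows, using pixel replication as the simplest of the expansion techniques named. This illustrates the prior approach only, not the current method. The function name replicate, the expansion factor, and the fixed threshold of 128 are assumptions.

```python
def replicate(image, factor):
    """Expand resolution by repeating every pixel factor times in each direction."""
    expanded = []
    for row in image:
        wide_row = [value for value in row for _ in range(factor)]
        expanded.extend(list(wide_row) for _ in range(factor))
    return expanded


# Step 1: expand a hypothetical 1x2 gray scale image to 2x4.
high_res = replicate([[10, 200]], factor=2)

# Step 2: binarize the expanded image with a fixed threshold.
high_res_bits = [[1 if value >= 128 else 0 for value in row] for row in high_res]
```

Replication copies each low resolution gray level unchanged, so every block of output pixels receives the same bit, and no detail finer than the original pixel grid is recovered.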