With digital image processing and digital communication becoming increasingly prevalent, growing numbers of printed or other textual documents are being scanned for subsequent computerized processing of one form or another and/or for digital transmission. This processing may involve, for example, optical character recognition, for converting printed characters, whether machine printed or handwritten, from scanned bit-mapped form into an appropriate character set, such as ASCII, the latter being more suitable for use with word processing and similar computerized document-processing tasks.
Scanning a gray-scale document typically yields a multi-bit, typically eight-bit, value for each pixel in the scanned document. The value represents the luminance, in terms of a 256-level gray scale, of a pixel at a corresponding point in the document. These pixels are generated, depending upon the resolution of the scanner, frequently at resolutions of 200-400 pixels/inch (approximately 80-160 pixels/cm), though with highly detailed images at 1200 or more pixels/inch (approximately 470 pixels/cm). Consequently, a scanned 8 1/2 by 11 inch (approximately 22 by 28 cm) image will contain a considerable amount of gray-scale data. Inasmuch as scanned text generally presents written or printed characters of some sort against a contrasting colored background, typically dark or black print against a white or light colored background, or vice versa, the exact luminance value at any one pixel in the text is not as important as whether that pixel is part of a character or part of the background. Therefore, scanned textual images, or scanned textual portions of larger images containing both text and graphics, can be efficiently represented by single-bit pixels in which each pixel in a scanned image is simply set to, e.g., a "one" if that pixel in the original image is part of a character or part of foreground information, or to, e.g., a "zero" if that pixel in the original image is part of the image background. To easily distinguish the different types of scanned images, a gray-level image is defined as one having multi-bit (hence multi-value) pixels, whereas a binary (or bi-level) image is formed of single-bit pixels. Furthermore, since a binary image requires considerably less data for a given textual image, e.g., one-eighth as much as an eight-bit gray-scale rendering of the same image, binary images are more efficient than corresponding gray-scale images and are thus preferred for storage and communication of textual images.
Binary images are also preferred because of their easy compressibility using standard compression techniques, e.g., CCITT Groups 3 or 4 compression standards.
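By way of illustration only, the one-eighth relationship noted above can be checked with simple arithmetic. The following minimal sketch computes the uncompressed size of a scanned page; the function name `scan_size_bytes` and the choice of 300 pixels/inch (a value within the 200-400 pixels/inch range mentioned above) are assumptions made for this example, and no compression is modeled.

```python
def scan_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed size, in bytes, of a scanned page (hypothetical helper)."""
    width_px = round(width_in * dpi)    # pixels across the page
    height_px = round(height_in * dpi)  # pixels down the page
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8        # round up to whole bytes


# An 8 1/2 by 11 inch page at an assumed 300 pixels/inch.
gray_bytes = scan_size_bytes(8.5, 11, 300, 8)    # eight-bit gray-scale
binary_bytes = scan_size_bytes(8.5, 11, 300, 1)  # single-bit binary
print(gray_bytes, binary_bytes, gray_bytes / binary_bytes)
```

At these assumed settings the page holds 2550 by 3300 pixels, so the gray-scale rendering occupies 8,415,000 bytes and the binary rendering 1,051,875 bytes, a ratio of exactly eight.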
Gray-scale images are converted to binary images through a so-called thresholding process. In essence, each multi-bit pixel value in a gray-scale scanned image is compared to a pre-defined threshold value, which may be fixed, variable, or even adaptively variable, to yield a single corresponding output bit. If the multi-bit pixel value equals or exceeds the threshold value for that particular pixel, the resultant single-bit output pixel is set to a "one"; otherwise, if the threshold is greater than the multi-bit pixel value, the resultant single-bit output pixel remains at "zero". In this manner, thresholding extracts those pixels, such as those which form characters or other desired objects, from the background in a scanned gray-scale image, with the pixels that form each character, or object, being one value, typically that for black, and the pixels for the background all being another value, typically that for white. For ease of reference, we will hereinafter collectively refer to each character or other desired object in the image as simply an "object".
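The comparison described above can be sketched, for the simplest case of a single fixed threshold, as follows. This is a minimal illustration rather than any particular implementation; the function name `threshold_fixed` and the representation of an image as a list of rows of integer pixel values are assumptions made for this example. Whether a "one" denotes an object pixel or a background pixel is a matter of convention and of the polarity of the pixel values.

```python
def threshold_fixed(gray, threshold):
    """Map each multi-bit pixel to a single bit using one fixed threshold.

    gray: list of rows, each a list of integer pixel values (e.g., 0-255).
    A pixel that equals or exceeds the threshold yields 1; otherwise 0.
    """
    return [[1 if pixel >= threshold else 0 for pixel in row] for row in gray]


# Usage: a 2 x 2 image thresholded at mid-scale (128).
image = [[12, 200],
         [128, 127]]
print(threshold_fixed(image, 128))
```

For the image shown, the pixel values 200 and 128 equal or exceed the threshold and map to "one", while 12 and 127 map to "zero".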
Ideally, the best thresholding process is one which accurately selects all the object pixels, but nothing more, in the scanned image and maps those pixels to a common single-bit value, such as, e.g., "one" for black. In practice, noise, background shading, lighting non-uniformities in a scanning process, and other such phenomena preclude the use of a single fixed threshold for an entire image. In that regard, if the threshold is too low, the resulting image may contain an excessive amount of noise in certain, if not all, regions; or, if too high, insufficient image detail, again in certain, if not all, regions--thereby complicating the subsequent processing of this image. Given this, the art recognizes that a preferred approach would be to select a different threshold value that is appropriate to each and every pixel in the scanned image. In doing so, the proper threshold value is determined based upon local properties of the image, i.e., certain image characteristics that occur in a localized image region for that pixel. Hence, the threshold would vary across the image and possibly even adapt to changing localized image conditions.
In general, a common methodology for variable thresholding relies on measuring localized image characteristics, such as local intensity contrast (or gradient), local averaged intensity and/or local variance, in a local window centered about a pixel of interest and then using these measures to classify each image pixel as either an object pixel (black) or a background pixel (white). Here, too, reality diverges from the ideal inasmuch as this methodology is complicated, and often frustrated, by a need to extract various objects in a wide range of documents but with minimal user intervention, such as for purposes of initialization and object identification, and while still yielding a clean background in the thresholded image. In reality, these objects may include, e.g., dim or broken objects, and objects that present a relatively low contrast, such as white objects in a gray background or gray objects embedded in a black background.
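One generic instance of the window-based methodology just described, using only the local averaged intensity, can be sketched as follows. This is a simple local-mean technique offered for illustration and is not the method of any reference discussed herein; the function name `local_mean_threshold`, the parameters `radius` and `offset`, and the assumption that objects are darker than their surroundings are all choices made for this example.

```python
def local_mean_threshold(gray, radius=1, offset=10):
    """Classify each pixel against the mean of a window centered on it.

    gray:   list of rows of integer pixel values (dark objects assumed).
    radius: half-width of the square window (assumed parameter).
    offset: how far below the local mean a pixel must fall to be
            classified as an object pixel (assumed parameter).
    Returns a binary image with 1 for object pixels and 0 for background.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Clip the window at the image borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            window = [gray[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(window) / len(window)
            # The effective threshold (local_mean - offset) varies per pixel.
            out[y][x] = 1 if gray[y][x] < local_mean - offset else 0
    return out


# Usage: a dim object (60) on a dark background (100). A fixed mid-scale
# threshold of 128 could not separate the two, since both lie below it.
dim = [[100, 100, 100],
       [100, 60, 100],
       [100, 100, 100]]
print(local_mean_threshold(dim))
```

Because the threshold follows the local mean, only the center pixel is extracted in this example. A sketch of this kind remains sensitive to noise in flat regions whenever `offset` is small, which is one form of the difficulty noted above in obtaining a clean background.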
Nevertheless, given the overwhelming inability of fixed thresholding to provide adequate performance with actual images, the art has persisted by teaching several variable thresholding approaches that attempt to provide satisfactory performance. However, all these approaches suffer from one or more drawbacks that, in practice, tend to limit their utility.
Various approaches, based upon measurements of different localized image properties, are taught in M. Kamel et al., "Extraction of Binary Character/Graphics Images from Grayscale Document Images", CVGIP: Graphical Models and Image Processing, Vol. 55, No. 3, May 1993, pages 203-217. Here, a "logical level" approach is based on comparing a gray level of a given pixel or its smoothed gray level (if the image is noisy) with four local averages in neighborhoods centered about four pixels orthogonally surrounding the given pixel. If the gray level of the given pixel is sufficiently below all four local averages, then the given pixel is extracted. Another approach, so-called "mask-based subtraction", relies on considering every pixel in an image as a sum of a background image and a character/graphics image. First, most of the background pixels are detected using a logical filter, the filter ostensibly functioning to remove "particle" noise. The filter is applied to four pixel sequences that correspond to four straight lines passing through each given pixel with slopes of 0, π/4, π/2 and 3π/4. The resulting "filtered" binary image contains character/graphics pixels which are black and background pixels which are white. Thereafter, this filtered, or "mask", image is modified by detecting additional background pixels using a predetermined stroke width, and then, for every possible character/graphics pixel, the gray level of its background image is estimated by linearly interpolating four background pixels. Lastly, a gray-scale character/graphics image is obtained by subtracting the estimated background image from the original scanned image, with the resulting differences then being globally thresholded to yield a binary character/graphics image. Though the global threshold value itself is fixed, basing the result on a difference between this threshold and a varying background value essentially implements a variable threshold.
Though, at first blush, these two approaches would appear to be somewhat immune to noise, in actuality, each of these approaches is highly sensitive to noise and often results in a noisy background when extracting dim and/or broken objects in a thresholded image.
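The "logical level" comparison, as summarized above, can be sketched roughly as follows. This sketch follows only the summary given here and should not be taken as the procedure of the cited article; the function names, the distance `step` to the four orthogonal neighbors, the window half-width `radius`, the `margin` standing in for "sufficiently below", and the clamping of neighbor positions at the image borders are all assumptions made for this example.

```python
def local_average(gray, cy, cx, radius):
    """Mean gray level of a square window centered at (cy, cx), clipped to the image."""
    height, width = len(gray), len(gray[0])
    values = [gray[j][i]
              for j in range(max(0, cy - radius), min(height, cy + radius + 1))
              for i in range(max(0, cx - radius), min(width, cx + radius + 1))]
    return sum(values) / len(values)


def logical_level(gray, step=2, radius=1, margin=20):
    """Extract pixels lying sufficiently below four surrounding local averages.

    step, radius and margin are illustrative parameters, not values
    taken from the cited article.
    """
    height = len(gray)
    width = len(gray[0]) if height else 0
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Four pixels orthogonally surrounding (y, x): above, below, left, right.
            centers = [(y - step, x), (y + step, x), (y, x - step), (y, x + step)]
            averages = [
                # Clamp each neighbor position to the image (an assumption).
                local_average(gray,
                              min(max(cy, 0), height - 1),
                              min(max(cx, 0), width - 1),
                              radius)
                for cy, cx in centers
            ]
            # Extract only if the pixel is below all four averages by the margin.
            if all(gray[y][x] < avg - margin for avg in averages):
                out[y][x] = 1
    return out


# Usage: one dark pixel (50) on a uniform light background (200).
page = [[200] * 5 for _ in range(5)]
page[2][2] = 50
print(logical_level(page))
```

In this example only the center pixel is extracted. With a small `margin`, isolated noise pixels would satisfy the same test, consistent with the noise sensitivity noted above.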
Another approach, as described in U.S. Pat. No. 4,868,670 (issued to R. R. A. Morton et al. on Sep. 19, 1989 and owned by the present assignee hereof), relies on tracking a background value in an image, with a threshold value being a sum of a tracked background value, a noise value and a feedback signal. Here, whenever a transition occurs in the image, such as an edge, the feedback signal is momentarily varied in a pre-defined pattern to momentarily modify the threshold value such that ostensibly an output filtered thresholded pixel value has a reduced noise content. Unfortunately, in practice, this technique often exhibits boundary artifacts at intensity transitions due to abrupt changes in background intensity. In addition, since background tracking tends to exhibit poor reliability, this approach has difficulties in detecting low contrast objects.
A further approach is described in U.S. Pat. No. 4,468,704 (issued to J. C. Stoffel et al. on Aug. 28, 1984). Here, adaptive thresholding is implemented by using an image offset potential, which is obtained on a pixel-by-pixel basis as a function of white peak and black valley potentials in the image. This offset potential is used in conjunction with nearest neighbor pixels to provide an updated threshold value that varies pixel-by-pixel. The peak and valley potentials are generated, for each image pixel, by comparing the image potential of that pixel with predetermined minimum white peak and maximum black valley potentials. Unfortunately, this technique also appears to exhibit difficulties in extracting low contrast objects in a thresholded image.
Therefore, a need exists in the art for a technique, specifically apparatus and an accompanying method, for accurately and reliably thresholding a gray-scale image to locate objects therein using a threshold that varies based on local image properties. Compared with conventional variable thresholding techniques, this technique should exhibit heightened noise immunity, reduced boundary artifacts, and increased accuracy in detecting low contrast objects.