Electronic capturing and processing of images, whether textual, graphic, monochrome, and/or color, has been widely used for a number of years. For example, personal computer systems with optical scanners attached thereto, operable under control of an image scanning and/or image processing application program, have become readily available for business and individual use. Paper documents are converted into electronic documents through the use of a scanning device such as a desktop scanner, digital camera, all-in-one device, etcetera. Scanning is often followed by text or image processing steps such as optical character recognition (OCR) for the conversion of bitmapped text into ASCII/Unicode text, or vectorization of graphics (e.g., to Bezier or scalable vector graphics (SVG) format).
However, the technology implemented in capturing and processing such images is not without disadvantage. Scanned images often include anomalies in the electronic image that are not present in an original scanned copy or that do not accurately represent the original scanned copy. For example, an electronic image captured using a typical personal computer-based scanner may include anomalous bits associated with the phenomenon of “bleedthrough,” resulting in an electronic image that does not accurately represent the original scanned copy. Bleedthrough results from images on the back side of a document that are visible, or partly visible, during scanning of the front side of the document. Accordingly, a mirrored “ghost” of the image from the back of the document may appear in the document's scanned front image as a result of bleedthrough. Typically, the ghost image is particularly prevalent in “background” areas of the scanned front image. However, such a ghost image typically does not directly correspond to the image from the back of the document due to such things as scattering (light dispersion associated with the document media), level of darkness of the back image, level of darkness of the front image, irregularities in the document media, and the like.
The presence and/or extent of bleedthrough present in any particular situation may be affected by a number of variables. For example, the quality and/or thickness of the paper comprising the document, the particular optics used in scanning the document, the intensity of the light illuminating the document for the scan, and the light angle may all affect the extent to which bleedthrough is present in any particular scanned image.
In the past, image scanning has been user interface-centric. That is, scanning has typically been accomplished through a user placing one sheet of a document to be scanned upon a scanner and the user selecting scanning parameters for the specific image to be scanned. For example, the user may select the region or regions to be scanned, the type of image to be scanned, the intensity level of the scan, etcetera. After an image has been acquired, the user may further manipulate the scanned image to provide a desired result, such as to crop unwanted image areas. In extreme situations, the user may elect to discard the scanned image and reattempt to acquire a scanned image, such as by adjusting one or more of the aforementioned scanning parameters.
Such user interface-centric scanning has been acceptable for many situations in the past. However, as image processing technology has matured over the past decade, commercial, high volume, and/or more automated scanning has become desirable. For example, it is not uncommon for document scanning to involve the need to scan large numbers of pages at a time, such as in commercial publishing. Accordingly, if scanning is to be accomplished in a reasonable amount of time and for a reasonable cost, it has become desirable to automate the process, thereby reducing the level of user input with respect to specific individual documents to be scanned. Moreover, the uses to which such scanned images are to be put often demand reliable, high-quality scans, such as to provide accurate optical character recognition and/or to avoid the need for substantial proofing/manipulation of a scanned image by an operator.
Attempts have been made to automatically detect bleedthrough present in scanned images. However, it is very difficult to determine what is bleedthrough and what is not. A determination of which portions of a scanned image are the result of bleedthrough may be based upon those areas of the image containing pixels of a certain range of intensity, i.e., gray scale values. Bleedthrough determinations based solely upon such criteria often will result in desired portions of the scanned image, i.e., the image from the front side of the document, that have similar intensity characteristics being identified as bleedthrough. For example, a bleedthrough detector based upon this technique may misidentify background features of the front image as bleedthrough. Similarly, pixels associated with a transition from one feature in an image to another feature in the image may be misidentified as bleedthrough, as there is typically not an instantaneous transition from one feature in an image to another feature in an image.
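By way of illustration only, the intensity-range determination described above may be sketched as follows. The function name flag_by_intensity, the range limits of 160 and 220, and the sample gray values are assumptions chosen for this example and do not come from any particular prior detector. The sketch shows a single scanline in which a pixel in the transition from a front-side character to the background falls in the same range as the ghost pixels and is therefore flagged along with them.

```python
def flag_by_intensity(image, low=160, high=220):
    """Return a mask that is True where a pixel's gray value lies in [low, high].

    The limits are illustrative assumptions, not values from the source text.
    """
    return [[low <= px <= high for px in row] for row in image]


# One scanline of 8-bit gray values (0 = black, 255 = white), values assumed:
# dark character stroke (30, 30), edge transition (110, 185),
# paper background (245, 245), ghost from the back side (190, 200), paper (245).
scanline = [[30, 30, 110, 185, 245, 245, 190, 200, 245]]

mask = flag_by_intensity(scanline)
flagged = [x for x, hit in enumerate(mask[0]) if hit]
print(flagged)  # [3, 6, 7]
```

Index 3 is part of the front-side character edge, yet it is indistinguishable from the ghost pixels at indices 6 and 7 when intensity is the only criterion.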
Utilization of the above-mentioned bleedthrough determination techniques in automated manipulation of the scanned image can produce undesirable results. For example, pixels surrounding text characters, wherein a character edge transitions to an image background, may be identified as bleedthrough and, therefore, removed and replaced with white or blank pixels. However, this has been found to result in the characters being surrounded by white, providing an effect where the characters appear to have been cut out and pasted in the image much like a ransom note.
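The cut-out effect may be illustrated with a minimal sketch, assuming an off-white paper background with a gray value of 235 and the same hypothetical intensity range of 160 to 220. The function name whiten_flagged and all pixel values are assumptions made for the example.

```python
WHITE = 255


def whiten_flagged(scanline, low=160, high=220):
    """Replace every pixel whose gray value lies in [low, high] with white."""
    return [WHITE if low <= px <= high else px for px in scanline]


# Assumed scanline crossing a character edge:
# dark stroke (25, 25), edge transition (100, 190), off-white paper (235, 235).
edge = [25, 25, 100, 190, 235, 235]

cleaned = whiten_flagged(edge)
print(cleaned)  # [25, 25, 100, 255, 235, 235]
```

The transition pixel adjacent to the character is now brighter than the surrounding paper, so the character is outlined by a ring of pure white instead of blending smoothly into the background.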
Other techniques for removal of bleedthrough also have less than desirable results. For example, where an image includes only text characters, it may be possible to binarize the image to remove bleedthrough. Specifically, an image may be binarized by making it all black and white with no gray, i.e., a bit depth of 1 instead of 8 or 24. When an image is binarized, unless the original document presents particularly poor bleedthrough characteristics, such as those of a magazine page, substantially all of the bleedthrough will be turned to white. However, the gray information, such as in the transition areas along the edges of the text characters, will be lost, often providing undesired results, e.g., binarization may result in thin character attributes being removed along with the bleedthrough. For example, if the letter “T” appeared in the scanned image, binarization may result in the loss of the cross bar and, thus, the character may be identified by an OCR application as the number “1” or a lowercase rendition of the letter “L”.
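The loss of thin character attributes under binarization may be sketched as follows. This is an illustration under stated assumptions: a fixed global threshold of 128, a thin cross bar that the scanner captured as mid-gray (150) rather than black, a darker stem (40), and ghost pixels at 200. The function name binarize and all values are hypothetical.

```python
def binarize(image, threshold=128):
    """Map each 8-bit gray pixel to 1 bit: 0 (black) if below threshold, else 1 (white)."""
    return [[0 if px < threshold else 1 for px in row] for row in image]


# Assumed 5x5 scan of the letter "T":
# thin cross bar captured as gray 150, stem as gray 40,
# paper as 250, and two ghost pixels (200) beside the stem.
letter_t = [
    [150, 150, 150, 150, 150],
    [250, 250,  40, 250, 250],
    [250, 200,  40, 200, 250],
    [250, 250,  40, 250, 250],
    [250, 250,  40, 250, 250],
]

bits = binarize(letter_t)
for row in bits:
    print("".join("#" if b == 0 else "." for b in row))
```

With these assumed values, the ghost pixels are turned to white as intended, but the cross bar is turned to white as well, leaving only a vertical stroke of the kind an OCR application could read as the number “1” or a lowercase “L”.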