Techniques are known for removing noise from digital representations of data images obtained by digitally scanning documents and the like. The scanned documents are processed to identify objects within the scanned images that are, in turn, used to mask out the noise. For example, U.S. Pat. No. 7,016,536 discloses a method and apparatus for removing noise by building objects from reduced resolution representations of the scanned image and including the identified objects in a mask that is logically ANDed with the de-skewed representation of the scanned document. Objects identified as picture objects are included in the mask, which is logically ANDed with the de-skewed representation to eliminate all other objects, while objects marked as data objects are added to the representation to provide a de-skewed, de-speckled representation of the scanned document.
Binarization of an image involves translating grayscale values, typically 0 to 255, into binary values, 0 or 1. A common way to accomplish this mapping is to pick a threshold whereby all values under the threshold are mapped to 0 and all values above the threshold are mapped to 1. Binarization of images is desirable prior to applying optical character recognition (OCR) techniques to a document for text recognition so that edges may be better detected. However, binarization is difficult in the case of noisy images, where a noisy background adjacent to the text may lead to improper OCR conversion. It is desired to clean up such a noisy background by determining the appropriate background color and assigning that color to the portion of the document adjacent to the text to be examined so that the OCR software may be more accurate. The present invention satisfies such needs in the art.
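By way of illustration only, the threshold mapping described above may be sketched as follows. The function name binarize, the default threshold of 128, and the choice to map a value exactly equal to the threshold to 1 are assumptions made for this sketch and are not taken from the text above.

```python
def binarize(gray, threshold=128):
    """Map grayscale values (typically 0 to 255) to binary values (0 or 1).

    `gray` is a list of rows, each row a list of integer pixel values.
    """
    # Values under the threshold are mapped to 0, values above it to 1.
    # Mapping a value equal to the threshold to 1 is an assumption.
    return [[1 if px >= threshold else 0 for px in row] for row in gray]


# Hypothetical 2x3 grayscale image used only for illustration.
image = [
    [12, 200, 255],
    [130, 90, 127],
]
binary = binarize(image, threshold=128)
```

With a single fixed threshold of this kind, background pixels near the text whose values fall close to the threshold may be mapped to either 0 or 1, which is one way a noisy background can interfere with subsequent OCR.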