Image binarization refers to the process of converting an image whose pixel values may assume multiple levels into an image whose pixel values can each be one of only two values, e.g., a first value corresponding to foreground and a second value corresponding to background. Image binarization can be used to convert a grayscale or color image to a black and white image. Frequently, binarization is used as a pre-processing step in document image processing. For example, optical character recognition (OCR) algorithms may involve image binarization as a first step.
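By way of illustration only, the conversion described above can be sketched with a fixed global threshold. The function name `binarize`, the threshold of 128, and the convention that dark pixels are foreground are assumptions made for this sketch; practical systems may use adaptive or other thresholding methods.

```python
def binarize(gray, threshold=128):
    """Map each grayscale pixel (0-255) to 1 (foreground) or 0 (background)."""
    # Assumption: dark ink on a light page, so values below the threshold
    # are treated as foreground.
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# A tiny 3x3 grayscale image with a dark diagonal stroke on a light background.
gray_image = [
    [250, 240, 30],
    [245, 20, 235],
    [10, 230, 255],
]

binary_image = binarize(gray_image)
print(binary_image)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0]]
```

Each output pixel takes one of two values, so any gradation present in the original pixel values is discarded by this step.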
In order to obtain and use information included in binarized images, e.g., documents or pages of a document, it is often necessary to determine whether information of interest is present and, if so, where it is located, so that the information of interest can be extracted from the image.
For example, it may be desirable to identify the presence and location of a customer's name on a binarized form so that the name can be extracted and used to populate a database, complete a customer record, or for some other purpose such as providing a service to the customer.
While identification of a label associated with information of interest may facilitate detection of the information of interest, multiple different terms or phrases may be used for the same information, even on the same form. Determining the location of information of interest is complicated not only by the possible use of different terms or phrases to identify the information of interest, but also by the possibility that, as part of a scanning and binarization process, some of the information included in an original hard copy may have been lost or corrupted in generating the digital binarized image representing the document. It is this binarized image which is to be searched for particular information of interest and then processed to extract the information of interest when its presence is determined.
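The difficulty can be illustrated with a minimal sketch of a naive exact-match label search applied to text recognized from a page. The label variants, the sample page text, and the function name `find_label_exact` are hypothetical and chosen only for illustration; the sketch shows the problem rather than any particular solution.

```python
# Hypothetical label variants for a customer-name field (illustrative only).
NAME_LABELS = ["customer name", "name of customer", "client name"]


def find_label_exact(lines, labels):
    """Return (line_index, label) for the first case-insensitive exact hit, else None."""
    for i, line in enumerate(lines):
        text = line.lower()
        for label in labels:
            if label in text:
                return i, label
    return None


clean_page = ["Account No: 1234", "Client Name: Jane Doe"]
# Same page with characters corrupted during scanning and binarization.
damaged_page = ["Account No: 1234", "Cl1ent Narne: Jane Doe"]

print(find_label_exact(clean_page, NAME_LABELS))    # (1, 'client name')
print(find_label_exact(damaged_page, NAME_LABELS))  # None
```

Listing several label variants handles the use of different terms for the same information, but a single corrupted character in the label is enough for the exact match to miss a field that is in fact present.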
In view of the above discussion, it should be appreciated that there is a need to address the technical issue of how to determine the presence of information of interest in a document where at least some of the information used to identify the information of interest may be missing or corrupted or where multiple different terms or phrases may be used to identify the same type of information.