Increasingly, businesses require the ability to capture information that is contained in conventional, tangible documents and to transfer this information into their own computer systems. This process typically involves capturing images of the pre-existing documents, identifying areas of the document that contain the desired information (which may include text or other types of information within the document), and interpreting these areas of the document image (e.g., through optical character recognition, image extraction, etc.) to obtain the information.
Document images often have “noisy” backgrounds that make it more difficult to distinguish text in an image from other elements of the image. These other background elements may include, for example, photographic images, security elements (e.g., fine lines), watermarks, and the like. The document image may also have lighting variations, skewing, or other characteristics which increase the difficulty of distinguishing text in the image from the various background elements. The background elements and other document image characteristics may confuse optical character recognition (OCR) algorithms, causing them to misidentify textual and non-textual pixels. For instance, a line which is a security element may be identified as a text element, or a thin or lightened part of a text element may be identified as background or noise.
In some cases, conventional empirical techniques such as thin-line removal are used in an effort to compensate for non-textual features, effectively attempting to remove the background. Once the background has been removed, traditional OCR may be more effective at correctly extracting the text from the document image. Line removal is not a complete solution, however, as it is sensitive to such factors as the size of the text and the thickness of the lines. Furthermore, line removal is ineffective with respect to other types of backgrounds, such as photographic images.
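The sensitivity of line removal to line thickness and text size can be illustrated with a minimal sketch. The function below is a simplified, hypothetical stand-in for thin-line removal, not the method of any particular OCR product: it works on a binary image (1 = ink, 0 = paper) and clears every vertical run of ink that is no taller than an assumed `max_thickness` parameter. The sample image, the function name, and the parameter are all assumptions made for illustration.

```python
def remove_thin_horizontal_lines(image, max_thickness):
    """Clear vertical ink runs no taller than max_thickness (1 = ink, 0 = paper)."""
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]  # work on a copy
    for x in range(width):
        y = 0
        while y < height:
            if image[y][x] == 1:
                start = y
                # Walk to the end of this vertical run of ink.
                while y < height and image[y][x] == 1:
                    y += 1
                # A short run is assumed to belong to a thin horizontal line.
                if y - start <= max_thickness:
                    for yy in range(start, y):
                        result[yy][x] = 0
            else:
                y += 1
    return result


# Hypothetical 7x7 image: a one-pixel security line (row 1) above a letter "T".
image = [
    [0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 1],  # security line, one pixel thick
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],  # crossbar of the "T", also one pixel thick
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
]

cleaned = remove_thin_horizontal_lines(image, max_thickness=1)
for row in cleaned:
    print("".join("#" if pixel else "." for pixel in row))
```

In this toy case the security line is removed, but so are the arms of the crossbar, because a text stroke of the same thickness is indistinguishable from the line under this rule. Only the vertical stem of the "T" survives, so the character could plausibly be read as a different letter.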
Another problem with conventional techniques for improving the effectiveness of OCR is that they commonly focus on removal of the background. When a background region is identified, it is typically “removed” by making the pixels white. This may create a sharp change in pixel intensities at the boundaries of the background region: the original pixel intensities outside the region may be relatively light, but may still contrast with the white (maximum intensity) pixels in the background region. This sudden change in intensity at the edge of the background region may be misinterpreted by OCR algorithms as an identifiable feature in the image, which can “confuse” the algorithms and reduce their effectiveness.
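The boundary effect described above can be shown with a small numeric sketch. The pixel values, the region bounds, and the helper names `whiten_region` and `max_step` are assumptions chosen for illustration; the example considers a single row of 8-bit grayscale intensities, where 255 is white.

```python
def whiten_region(row, start, end, white=255):
    """Return a copy of row with the pixels in [start, end) set to white."""
    return row[:start] + [white] * (end - start) + row[end:]


def max_step(row):
    """Largest absolute intensity difference between neighbouring pixels."""
    return max(abs(b - a) for a, b in zip(row, row[1:]))


# Hypothetical row: light paper on both sides, a slightly darker
# background pattern occupying indices 3 through 6.
row = [228, 226, 224, 216, 208, 204, 210, 218, 225, 227]

whitened = whiten_region(row, 3, 7)

print(max_step(row))       # 8: intensities change gradually
print(max_step(whitened))  # 37: a sharp step where white meets the paper
```

Although the surrounding paper is already light, it is not pure white, so whitening the region replaces a gradual transition with an abrupt step at each boundary. An edge-sensitive recognizer may treat such a step as a stroke or other feature.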
It would therefore be desirable to have systems and methods for improving the effectiveness of identifying the portions of a document image that correspond to text and the portions that correspond to non-textual background elements, so that OCR algorithms can more effectively recognize and extract text from the document image.