1. Field of the Invention
The present invention relates generally to document image encoding and decoding, and more particularly, to a method and apparatus for improving accuracy of optical character recognition (OCR).
2. Description of Related Art
Input scanners have been developed for uploading hardcopy documents into electronic document processing systems. These scanners typically convert the appearance of a hardcopy document into a raster formatted, digital data stream, thereby providing a bitmapped representation of the hardcopy document appearance. OCR systems such as Textbridge produced by ScanSoft, Inc. convert bitmapped document appearances into corresponding symbolic encodings. Unfortunately, OCR systems are not immune to making errors when inferring a correlation between a particular bitmap pattern and a corresponding document encoding (e.g., ASCII).
This problem has been addressed by designing special fonts such as OCR-B fonts, where characters that are likely to be confused (e.g., 1, l, and I) are given distinctly different typographic features. This allows an OCR system to more accurately infer the correlation between a bitmap pattern and its corresponding document encoding. In addition, Plumb et al. disclose in “Tools for Publishing Source Code via OCR,” 1997, printing the primary channel of a hardcopy document by replacing spaces and tabs with printable characters. Also, U.S. Pat. No. 4,105,997 discloses a method for using checksums of text in a document to locate errors during OCR.
This problem has also been addressed in U.S. Pat. No. 5,486,686, which discloses a document processing system in which human readable hardcopy renderings of a document are integrated with complete or partial electronic representations of the document and/or its content. The electronic representation provides an “assist channel” that encodes information about the document or computed from the document. The assist channel is defined using printable machine-readable codes. In one illustrated example, the assist channel can be defined using compact glyph codes at the bottom of a document.
More specifically, an “assist channel” of a hardcopy document is a machine readable encoding of side information that aids an OCR application in decoding the contents of a primary channel. The “primary channel” of a hardcopy document includes the human readable information of the document. The primary channel, which cannot be modified and is slightly prone to errors during OCR processing, carries most of the information content of the document. One use of the assist channel is to encode information that assists in identifying failures of an OCR application in decoding the contents of a primary channel, as disclosed, for example, in U.S. Pat. Nos. 5,625,721; 5,748,807; and 6,047,093.
Even with these advances that improve OCR processing using an assist channel, it continues to be desirable to provide an assist channel encoding that improves the tradeoff between the amount of information encoded in the assist channel and the resulting gain in accuracy of the OCR system. At one extreme, the assist channel can contain as much information as the primary channel (i.e., fully redundant information). At the other extreme, the assist channel can simply contain a single checksum of the contents of a document. It is therefore desirable to provide an assist channel encoding that compensates for failures in decoding the primary channel during OCR processing yet is compact relative to the primary channel.
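By way of illustration only, the following minimal sketch shows one intermediate point between the two extremes described above: an assist channel carrying one checksum byte per line of the primary channel. It is not the encoding of any of the cited patents. The function names (line_checksum, build_assist_channel, suspect_lines) and the choice of an 8-bit byte-sum checksum are assumptions made for this example. A per-line checksum of this kind can indicate which lines were decoded incorrectly, whereas a single document-wide checksum can only indicate that an error occurred somewhere.

```python
from typing import List


def line_checksum(line: str) -> int:
    """Hypothetical 8-bit checksum: sum of the line's UTF-8 bytes modulo 256."""
    return sum(line.encode("utf-8")) % 256


def build_assist_channel(lines: List[str]) -> bytes:
    """Encode one checksum byte per primary-channel line (assumed layout)."""
    return bytes(line_checksum(line) for line in lines)


def suspect_lines(ocr_lines: List[str], assist: bytes) -> List[int]:
    """Return indices of OCR-decoded lines whose checksum disagrees
    with the corresponding byte of the assist channel."""
    return [
        index
        for index, (line, expected) in enumerate(zip(ocr_lines, assist))
        if line_checksum(line) != expected
    ]


# Primary channel as authored, and a simulated OCR result in which
# the digit "1" was misread as the letter "l" on the second line.
original = ["Bill 1 of 10", "Total: 110"]
ocr_output = ["Bill 1 of 10", "Total: ll0"]

assist = build_assist_channel(original)
flagged = suspect_lines(ocr_output, assist)
print("assist channel size (bytes):", len(assist))
print("lines flagged for review:", flagged)
```

In this sketch the assist channel occupies one byte per line, which is small relative to the primary channel, but it only locates suspect lines and does not by itself correct them. A checksum this short can also miss errors whose byte differences cancel modulo 256, so it illustrates the size versus accuracy tradeoff rather than resolving it.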