1. Field of the Invention
The present invention relates generally to document image encoding and decoding, and more particularly, to a method and apparatus for improving accuracy of optical character recognition (OCR).
2. Description of Related Art
Input scanners have been developed for uploading hardcopy documents into electronic document processing systems. These scanners typically convert the appearance of a hardcopy document into a raster formatted, digital data stream, thereby providing a bitmapped representation of the hardcopy document appearance. OCR systems such as Textbridge produced by ScanSoft, Inc. convert bitmapped document appearances into corresponding symbolic encodings. Unfortunately, OCR systems are not immune to making errors when inferring a correlation between a particular bitmap pattern and a corresponding document encoding (e.g., ASCII).
This problem has been addressed by designing special fonts such as OCR-B fonts, in which characters that are likely to be confused (e.g., 1, l, and I) are given distinctly different typographic features. This allows an OCR system to more accurately infer the correlation between a bitmap pattern and its corresponding document encoding. In addition, Plumb et al. disclose in “Tools for Publishing Source Code via OCR,” 1997, printing the primary channel of a hardcopy document by replacing spaces and tabs with printable characters. Also, U.S. Pat. No. 4,105,997 discloses a method for using checksums of text in a document to locate errors during OCR.
This problem has also been addressed in U.S. Pat. No. 5,486,686, which discloses a document processing system in which human readable hardcopy renderings of a document are integrated with complete or partial electronic representations of the document and/or its content. The electronic representation provides an “assist channel” that encodes information about the document or computed from the document. The assist channel is defined using printable machine-readable codes. In one illustrated example, the assist channel can be defined using compact glyph codes at the bottom of a document.
More specifically, an “assist channel” of a hardcopy document is a machine readable encoding of side information that aids an OCR application in decoding the contents of a primary channel. The “primary channel” of a hardcopy document includes the human readable information of the document. The primary channel, which cannot be modified and is somewhat error prone under OCR processing, carries most of the information content of the document. One use of the assist channel is to encode information that assists in identifying failures of an OCR application in decoding the contents of a primary channel, as disclosed for example in U.S. Pat. Nos. 5,625,721; 5,748,807; and 6,047,093.
Even with these advances that improve OCR processing using an assist channel, it continues to be desirable to provide an assist channel encoding that improves the tradeoff between the amount of information encoded in the assist channel and the resulting gain in accuracy of the OCR system. At one extreme, the assist channel can contain as much information as the primary channel (i.e., redundant information). At the other extreme, the assist channel can simply contain a single checksum of the contents of a document. It is therefore desirable to provide an assist channel encoding that compensates for failures of the primary channel during OCR processing yet is compact relative to the primary channel.
In accordance with the invention, there is provided a method, and apparatus therefor, for generating image data for rendering on a hardcopy document. A primary set of symbol data is identified that provides a first channel of human readable information to be rendered on the hardcopy document. A secondary set of encoding data is computed from the primary set of symbol data. The secondary set of encoding data provides an assist channel of machine readable information that is rendered on the hardcopy document.
In accordance with one aspect of the invention, the assist channel is encoded by dividing the primary set of symbol data into a plurality of vertical blocks. Each vertical block captures one or more symbols from each of a plurality of lines of the hardcopy document. At least two sets of guard digits are computed for each of the vertical blocks to define the secondary set of encoding data.
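By way of illustration only, the vertical-block partitioning described above can be sketched as follows. The summary does not specify how the guard digits are computed, so this sketch assumes two simple modular checksums (a plain sum and a position-weighted sum) as stand-ins; the function names, block dimensions, and moduli are hypothetical and are not taken from the disclosure.

```python
def vertical_blocks(lines, lines_per_block=4, cols_per_block=1):
    """Yield (first_row, first_col, symbols) for each vertical block.

    A block spans `lines_per_block` consecutive lines and
    `cols_per_block` consecutive character columns (assumed sizes).
    """
    width = max((len(line) for line in lines), default=0)
    # Pad short lines with spaces so every line has the same width.
    padded = [line.ljust(width) for line in lines]
    for r in range(0, len(padded), lines_per_block):
        group = padded[r:r + lines_per_block]
        for c in range(0, width, cols_per_block):
            symbols = "".join(row[c:c + cols_per_block] for row in group)
            yield r, c, symbols


def guard_digits(symbols):
    """Return two guard values for a block (assumed checksum scheme)."""
    codes = [ord(ch) for ch in symbols]
    g1 = sum(codes) % 256                                    # plain sum
    g2 = sum((i + 1) * v for i, v in enumerate(codes)) % 251  # weighted sum
    return g1, g2


def encode_assist_channel(lines, lines_per_block=4, cols_per_block=1):
    """Return one (row, col, g1, g2) tuple per vertical block."""
    return [
        (r, c, *guard_digits(s))
        for r, c, s in vertical_blocks(lines, lines_per_block, cols_per_block)
    ]


if __name__ == "__main__":
    page = ["ab", "cd"]
    for entry in encode_assist_channel(page, lines_per_block=2):
        print(entry)
```

In this sketch, a single misrecognized character changes the guard values of only the block containing it, which suggests how per-block guard digits could localize an OCR error to a few symbols rather than to a whole line or page.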
In accordance with another aspect of the invention, the assist channel is encoded by recoding character codes. Characters are first separated into equivalence classes such that characters that are most likely to be confused during OCR processing are placed in different classes. Each character in an equivalence class is assigned a character code. When character codes in the primary channel are recoded, each recoded character code has a portion that identifies the equivalence class and a portion that identifies the character within that equivalence class. When encoding the assist channel, a greater number of guard digits is assigned to protect the portion of the recoded character code that identifies the equivalence class, thereby applying greater error correction to bits that are more susceptible to errors during OCR processing.
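By way of illustration only, the recoding described above can be sketched as follows. The class assignments, the three-class alphabet, and the checksum moduli are all hypothetical; the summary does not specify them. The sketch assumes that confusable characters (here 1, l, I and O, 0, Q) are placed in different classes, and that "guard digits" are position-weighted modular checksums, with three computed over the class portion and one over the within-class portion.

```python
# Hypothetical equivalence classes: characters that OCR tends to confuse
# with one another (1/l/I and O/0/Q) are placed in different classes.
CLASSES = ["1Oab", "l0cd", "IQef"]

# Map each character to (class id, code within that class).
RECODE = {
    ch: (class_id, code)
    for class_id, members in enumerate(CLASSES)
    for code, ch in enumerate(members)
}


def recode(text):
    """Recode each character as a (class id, within-class code) pair."""
    return [RECODE[ch] for ch in text]


def weighted_checks(values, moduli):
    """One position-weighted checksum per modulus (assumed scheme)."""
    total = sum((i + 1) * v for i, v in enumerate(values))
    return tuple(total % m for m in moduli)


def protect(text):
    """Compute more guard digits for the class portion than the code portion."""
    pairs = recode(text)
    class_ids = [class_id for class_id, _ in pairs]
    codes = [code for _, code in pairs]
    return {
        "class_guards": weighted_checks(class_ids, (7, 11, 13)),  # 3 digits
        "code_guards": weighted_checks(codes, (7,)),              # 1 digit
    }


if __name__ == "__main__":
    print(protect("1lI"))  # as printed
    print(protect("11I"))  # 'l' misread as '1' by a hypothetical OCR pass
```

In this sketch, misreading "l" as "1" alters only the class portion of the recoded character, so the mismatch appears in the class guard digits while the within-class guard digit is unchanged, which is why the class portion receives the heavier protection.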