1. Field of the Invention
The present invention relates generally to document image encoding and decoding, and more particularly, to a method and apparatus for improving accuracy of optical character recognition (OCR).
2. Description of Related Art
Input scanners have been developed for uploading hardcopy documents into electronic document processing systems. These scanners typically convert the appearance of a hardcopy document into a raster formatted, digital data stream, thereby providing a bitmapped representation of the hardcopy document appearance. OCR systems such as Textbridge produced by ScanSoft, Inc. convert bitmapped document appearances into corresponding symbolic encodings. Unfortunately, OCR systems are not immune to making errors when inferring a correlation between a particular bitmap pattern and a corresponding document encoding (e.g., ASCII).
This problem has been addressed by designing special fonts such as OCR-B fonts, where characters that are likely to be confused (e.g., 1, l, and I) are given distinctly different typographic features. This allows an OCR system to more accurately infer the correlation between a bitmap pattern and its corresponding document encoding. In addition, Plumb et al. disclose in "Tools for Publishing Source Code via OCR," 1997, printing the primary channel of a hardcopy document by replacing spaces and tabs with printable characters. Also, U.S. Pat. No. 4,105,997 discloses a method for using checksums of text in a document to locate errors during OCR.
This problem has also been addressed in U.S. Pat. No. 5,486,686, which discloses a document processing system in which human readable hardcopy renderings of a document are integrated with complete or partial electronic representations of the document and/or its content. The electronic representation provides an "assist channel" that encodes information about the document or computed from the document. The assist channel is defined using printable machine-readable codes. In one illustrated example, the assist channel can be defined using compact glyph codes at the bottom of a document.
More specifically, an "assist channel" of a hardcopy document is a machine readable encoding of side information that aids an OCR application in decoding the contents of a primary channel. The "primary channel" of a hardcopy document includes the human readable information of the document. The primary channel, which cannot be modified and is slightly error prone to OCR processing, carries most of the information content of the document. One use of the assist channel is to encode information that assists in the identification of failures of an OCR application in decoding the contents of a primary channel, as disclosed for example in U.S. Pat. Nos. 5,625,721; 5,748,807; and 6,047,093.
Even with these advances that improve OCR processing using an assist channel, it continues to be desirable to provide an assist channel encoding that improves the tradeoff between the amount of information encoded in the assist channel and the resulting gain in accuracy of the OCR system. At one extreme, the assist channel can contain as much information as the primary channel (i.e., redundant information). At the other extreme, the assist channel can simply contain a single checksum of the contents of a document. It is therefore desirable to provide an assist channel encoding that compensates for the failure of the primary channel during OCR processing yet is compact relative to the primary channel.
In accordance with the invention, there is provided a method, and apparatus therefor, for generating image data for rendering on a hardcopy document. A primary set of symbol data is identified that provides a first channel of human readable information to be rendered on the hardcopy document. A secondary set of encoding data is computed from the primary set of symbol data. The secondary set of encoding data provides an assist channel of machine readable information that is rendered on the hardcopy document.
In accordance with one aspect of the invention, the assist channel is encoded for a selected line of the primary set of symbol data, having an ordered set of c1, c2, c3, . . . ci−1, ci, ci+1, . . . cn symbols, by: sequentially computing a hash of each of the symbols of the selected line with a state change function H, where the state change function H produces a hash hi that is at least a function of the current symbol in the selected line ci and the preceding computed hash hi−1; and computing a set of guard values for each of the symbols of the selected line with a guard extractor function G, where the guard extractor function G produces a guard value gi that is at least a function of the computed hash hi; the computed set of guard values defining the secondary set of encoding data for the selected line.
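The per-line encoding described above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: the particular forms of H and G, the constants HASH_BITS and GUARD_BITS, the initial hash h0, and the helper first_mismatch are all assumptions chosen only to make the chained structure (hi depends on ci and hi−1; gi depends on hi) concrete.

```python
from typing import List

HASH_BITS = 16   # assumed width of the running hash state
GUARD_BITS = 2   # assumed number of guard bits kept per symbol


def H(prev_hash: int, symbol: str) -> int:
    """Hypothetical state change function: h_i = H(h_{i-1}, c_i)."""
    return (prev_hash * 31 + ord(symbol)) & ((1 << HASH_BITS) - 1)


def G(hash_value: int) -> int:
    """Hypothetical guard extractor: g_i = G(h_i), here the low bits of h_i."""
    return hash_value & ((1 << GUARD_BITS) - 1)


def encode_line(line: str, h0: int = 0) -> List[int]:
    """Compute one guard value per symbol of the selected line."""
    guards = []
    h = h0  # assumed initial hash state
    for c in line:
        h = H(h, c)          # hash chains over all preceding symbols
        guards.append(G(h))  # only a few bits per symbol are kept
    return guards


def first_mismatch(ocr_line: str, guards: List[int], h0: int = 0) -> int:
    """Illustrative check: index of the first symbol whose recomputed guard
    disagrees with the assist channel, or -1 if none disagrees."""
    h = h0
    for i, (c, g) in enumerate(zip(ocr_line, guards)):
        h = H(h, c)
        if G(h) != g:
            return i
    return -1


if __name__ == "__main__":
    guards = encode_line("hello")
    print(guards)
    print(first_mismatch("he1lo", guards))  # OCR confused "l" with "1"
```

With only a few guard bits per symbol, an individual substitution can go undetected at the position where it occurs; because each hash depends on the preceding hash, such an error still affects the guard values of the symbols that follow it.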