Optical Character Recognition (OCR) is the electronic translation of images of text into machine-editable code. For every letter or character on the page, an OCR program attempts to deduce its ASCII value. The most common method of doing this is “feature extraction” which, according to Susan Haigh, Optical Character Recognition (OCR) as a Digitization Technology, Network Notes #37, Information Technology Services, National Library of Canada (Nov. 15, 1996), “identifies a character by analyzing its shape and comparing its features against a set of rules that distinguishes each character/font.” Thus, OCR may include mapping various features or feature combinations to an ASCII value. Other OCR methods include matching a glyph to a stored bitmap associated with an ASCII value.
JBIG2 is a compression format for binary documents, designed to take advantage of the similarity between distinct glyphs in the same document. It generally uses digital geometry pattern matching techniques, such as the weighted XOR or rank Hausdorff distance between glyphs, to determine whether they are in the same font character class. As the format is oblivious to the textual information on the page, it makes no attempt to identify the ASCII value of the glyphs or to fit them into any pre-computed font character categories.
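By way of illustration only, the weighted XOR comparison mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not the matching procedure of any particular JBIG2 encoder: the function names `weighted_xor` and `same_class`, the representation of a glyph as a list of rows of 0/1 pixels, the 3x3 neighborhood weighting, and the threshold are all hypothetical choices made for the example. The sketch assumes both glyphs have already been aligned and padded to the same dimensions.

```python
def weighted_xor(a, b):
    """Weighted XOR distance between two equal-sized binary bitmaps.

    Each bitmap is a list of rows, each row a list of 0/1 pixels.
    Every mismatching pixel is weighted by the number of mismatching
    pixels in its 3x3 neighborhood (itself included), so clustered
    errors cost more than isolated, noise-like errors.
    """
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("bitmaps must have the same dimensions")
    height = len(a)
    width = len(a[0]) if height else 0

    # Error map: 1 wherever the two glyphs disagree.
    err = [[a[y][x] ^ b[y][x] for x in range(width)] for y in range(height)]

    total = 0
    for y in range(height):
        for x in range(width):
            if not err[y][x]:
                continue
            # Sum the error pixels in the 3x3 window centered on (y, x).
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        total += err[ny][nx]
    return total


def same_class(a, b, threshold):
    """Hypothetical decision rule: treat two glyphs as one class if close enough."""
    return weighted_xor(a, b) <= threshold


blank = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
clustered = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]  # two adjacent mismatches
scattered = [[1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # two isolated mismatches

print(weighted_xor(blank, clustered))  # 4
print(weighted_xor(blank, scattered))  # 2
```

Both test glyphs differ from the blank bitmap in exactly two pixels, yet the clustered pair scores higher. That is the point of the weighting: mismatches that sit together are more likely to indicate a genuinely different character shape than scattered scanning noise. Note that the comparison never consults an ASCII value.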
In short, OCR and JBIG2 provide two very different solutions to two very different problems. Junqing Shang et al., JBIG2 Text Image Compression Based On OCR, The International Society for Optical Engineering, Tsinghua University (Jan. 16, 2006) (“Shang”) discusses an attempt to directly use the techniques developed for OCR to help with JBIG2. Many implementations of JBIG2 have a high error rate. Shang refers to using an OCR engine to determine when to match glyphs together, in order to improve the compression ratio while decreasing the error rate. Specifically, Shang states the following: “The JBIG2 (joint bi-level image group) standard for bi-level image coding is drafted to allow encoder designs by individuals. In JBIG2, text images are compressed by pattern matching techniques. In this paper, we propose a lossy text image compression method based on OCR (optical character recognition) which compresses bi-level images into the JBIG2 format. By processing text images with OCR, we can obtain recognition results of characters and the confidence of these results. A representative symbol image could be generated for similar character image blocks by OCR results, sizes of blocks and mismatches between blocks. This symbol image could replace all the similar image blocks and thus a high compression ratio could be achieved. Experiment results show that our algorithm achieves improvements of 75.86% over lossless SPM and 14.05% over lossy PM and S in Latin Character images, and 37.9% over lossless SPM and 4.97% over lossy PM and S in Chinese character images. Our algorithm leads to much fewer substitution errors than previous lossy PM and S and thus preserves acceptable decoded image quality.”
Junqing Shang et al., OCR Result Optimization Based on Pattern Matching, Proceedings Vol. 6500, Document Recognition and Retrieval XIV, 65009 (Jan. 29, 2007) (“Shang 2”) discusses using the standard JBIG2 pattern matching to improve OCR results. Frequently, the same “feature extraction” rules that can confidently identify some glyphs on a page may not be able to confidently identify other glyphs on the same page. Shang 2 therefore proposes a post-OCR processing step that uses standard JBIG2-type pattern matching techniques (such as weighted XOR). If a glyph whose OCR confidence level is below a certain threshold matches a glyph whose OCR confidence level is above the threshold, the less confident glyph is given the ASCII value of the more confident glyph that it matches. As a consequence, OCR results could be improved on certain data sets. Specifically, the abstract of Shang 2 states the following: “Post-processing of OCR is a bottleneck of the document image processing system. Proof reading is necessary since the current recognition rate is not enough for publishing. The OCR system provides every recognition result with a confident or unconfident label. People only need to check unconfident characters while the error rate of confident characters is low enough for publishing. However, the current algorithm marks too many unconfident characters, so optimization of OCR results is required. In this paper we propose an algorithm based on pattern matching to decrease the number of unconfident results. If an unconfident character matches a confident character well, its label could be changed into a confident one. Pattern matching makes use of original character images, so it could reduce the problem caused by image normalization and scanned noises. We introduce WXOR, WAN, and four-corner based pattern matching to improve the effect of matching, and introduce confidence analysis to reduce the errors of similar characters. Experimental results show that our algorithm achieves improvements of 54.18% in the first image set that contains 102,417 Chinese characters, and 49.85% in the second image set that contains 53,778 Chinese characters.”
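The post-OCR step described for Shang 2 can be pictured with the following sketch, which is offered only as an illustration and is not taken from Shang 2. All names here (`propagate_labels`, `pixel_mismatches`, the `bitmap`/`label`/`confidence` keys, and both thresholds) are hypothetical. A plain count of mismatching pixels stands in for the weighted XOR measure, and the glyph bitmaps are assumed to be aligned and of equal size.

```python
def pixel_mismatches(a, b):
    """Stand-in distance: number of differing pixels between two equal-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def propagate_labels(glyphs, conf_threshold, match_threshold, distance=pixel_mismatches):
    """Let low-confidence glyphs adopt the value of a matching confident glyph.

    Each glyph is a dict with keys "bitmap", "label" and "confidence"
    (a hypothetical layout chosen for this sketch). A new list is
    returned; the input glyphs are not modified.
    """
    confident = [g for g in glyphs if g["confidence"] >= conf_threshold]
    result = []
    for g in glyphs:
        updated = dict(g)
        if g["confidence"] < conf_threshold and confident:
            # Find the closest confident glyph by pattern matching alone.
            best = min(confident, key=lambda c: distance(g["bitmap"], c["bitmap"]))
            if distance(g["bitmap"], best["bitmap"]) <= match_threshold:
                # The less confident glyph takes the value of its match.
                updated["label"] = best["label"]
                updated["confidence"] = best["confidence"]
        result.append(updated)
    return result


glyphs = [
    {"bitmap": [[1, 1], [1, 0]], "label": "e", "confidence": 0.95},
    {"bitmap": [[1, 1], [1, 1]], "label": "c", "confidence": 0.40},
    {"bitmap": [[0, 0], [0, 1]], "label": "?", "confidence": 0.30},
]

for g in propagate_labels(glyphs, conf_threshold=0.9, match_threshold=1):
    print(g["label"], g["confidence"])
```

In this toy input, the second glyph differs from the confident "e" in one pixel and so adopts its value, while the third glyph matches nothing closely enough and keeps its uncertain result for manual review.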
One aspect that has hampered the introduction of JBIG2 matching techniques into the OCR process is the high error rate of most JBIG2 implementations. In addition, the slow processing speed of current OCR engines makes them unsuitable as a required step of a JBIG2 compression technique.