A fundamental problem in the art of automatic document image processing relates to image defects, that is, imperfections in the image as compared to the original ideal artwork used to create the image. The sources of image defects are numerous and well known. For example, the original printed document (e.g., paper document) which was the source of the image may be defective (e.g., the paper has spots of dirt or folds, or was printed from a faulty printing device). Further, when the paper document was scanned, the paper may have been skewed while being placed in the scanner, resulting in a distortion of the image. In addition, the optics of the scanning process itself can produce defects due to, for example, vibration, pixel sensor sensitivity or noise.
The above-mentioned image defects result in poor display quality of the image and are a particular problem in document image processing because of the character recognition accuracy required in the automatic processing of documents. For example, optical character recognition ("OCR") is often an integral part of an image processing system. OCR is the process of transforming a graphical bit image of a page of textual information into a text file which can later be edited, for example, using word processing software. As is well known in the art, image classifiers are key components of most OCR systems used for analyzing a digital representation of an image. The accuracy of such classifiers decreases significantly when the quality of the image source is degraded even slightly.
Training classifiers to recognize images having a wide range of shape variations and/or image degradations is a well-known challenge in OCR. One technique, the so-called adaptive OCR strategy, trains the classifier only for the fonts and degradation conditions which are present in a given image, e.g., a printed text page. Thus, this adaptive OCR strategy requires some knowledge of the dominant font and defects in the given image. Some previously known adaptive OCR techniques represent such knowledge implicitly through character prototypes extracted directly from the image. For example, G. Nagy et al., "Automatic Prototype Extraction for Adaptive OCR", Proceedings of the Fourth International Conference on Document Analysis and Recognition, Ulm, Germany, Aug. 18-20, 1997, pp. 278-282 (hereinafter "Nagy"), and A. L. Spitz, "An OCR Based on Character Shape Codes and Lexical Information", Proceedings of the 3rd International Conference of Document Analysis and Recognition, Montreal, Canada, Aug. 14-18, 1995, pp. 723-728, describe two such character prototyping techniques. Nagy's character prototype technique takes as input truth labels, the so-called "ground truth", which are derived from a small segment of the actual image to be recognized. In accordance with Nagy's technique, the ground truth selected from the image, e.g., text, is keyed in to the system by a user. Using the ground truth, pairs of words from the image and the ground truth are matched to determine matching characters and to estimate the position of each character within each word (see, e.g., Nagy, supra, p. 278).
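By way of illustration only, the general idea of using keyed-in ground truth to locate characters within word images can be sketched as follows. This is a deliberately simplified sketch and not Nagy's published algorithm: it assumes each word image has already been paired with its ground-truth word, and it assumes, purely for illustration, that every character occupies an equal share of the word's width. The function names `estimate_char_spans` and `extract_prototypes` are hypothetical.

```python
from collections import defaultdict


def estimate_char_spans(word_width, truth_word):
    """Estimate the column span of each character in a word image.

    ASSUMPTION (illustration only): characters are equally wide, so the
    word's pixel width is divided evenly among the ground-truth characters.
    """
    n = len(truth_word)
    bounds = [round(i * word_width / n) for i in range(n + 1)]
    return [(ch, bounds[i], bounds[i + 1]) for i, ch in enumerate(truth_word)]


def extract_prototypes(word_images, truth_words):
    """Collect candidate glyph samples per character label.

    Each word image is a list of pixel rows (lists of 0/1 values) and is
    assumed to be already paired with its keyed-in ground-truth word.
    """
    samples = defaultdict(list)
    for image, truth in zip(word_images, truth_words):
        width = len(image[0])
        for ch, start, end in estimate_char_spans(width, truth):
            # Cut the estimated column span out of every pixel row.
            samples[ch].append([row[start:end] for row in image])
    return samples
```

For a word image 30 pixels wide labeled "cat", `estimate_char_spans(30, "cat")` assigns columns 0-10, 10-20 and 20-30 to "c", "a" and "t" respectively. A practical system would refine such estimates rather than rely on equal widths, which fail for proportional fonts.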
While the above-described adaptive OCR techniques are useful in character recognition, their reliance on ground truth, and the derivation of such ground truth directly from the image to be recognized, presents certain disadvantages. In particular, prior to any classification of the image, the ground truth must be selected, processed and input into the OCR system for each image to be recognized. Thus, certain preprocessing overhead is inherently associated with these types of ground truth based adaptive OCR techniques.
Therefore, a need exists for an adaptive OCR technique for character recognition that does not rely on ground truth derived from the image itself and provided as input to the OCR system prior to classification and recognition.