Information in the form of language symbols (i.e., characters) or other symbolic notation that is visually represented to a human in an image on a marking medium, such as in a printed text document, is capable of manipulation for its semantic content by a computer system when the information is accessible to the processor of the system in an encoded form, such as when each of the language symbols is available to the processor as a respective character code selected from a predetermined set of character codes (e.g., ASCII codes) that represent the symbols to applications that use them. Traditionally, manual transcription by a human typist has been used to enter into computers the character codes corresponding to the language symbols visually represented in the image of a text document. In recent years, a mostly automatic software operation variously called "recognition," "character recognition," or "optical character recognition" (OCR) has been performed on text document images to produce the character codes needed for manipulation by the computer system and to automate the otherwise tedious task of manual transcription. However, both manual text entry output and the output of character recognition operations, referred to herein as transcriptions, are inherently error-prone; some kind of proofreading or correction process is usually needed when an accurate transcription is desired.
Given a text document image and a transcription of the document image, there are a number of approaches that might be taken to correct errors in that transcription. Before the widespread use of computers, manual proofreading by a human operator was essentially the only method available, and it remains the method of choice when the greatest possible accuracy is needed. For example, the University of Washington provides an extensive document image database for use in various aspects of document processing research; these document images have associated transcriptions that reportedly were each carefully proofread by three separate people, requiring one hour per proofreader per page. Even with this close attention, the residual error rate remained considerably higher than the goal of one error per million characters. Automatic proofreading and correction is clearly desirable in order to reduce the manual labor required to produce a final, accurate transcription.
1. Error correction systems
Various types of automatic error correction techniques are commonly used to improve transcription accuracy. These existing techniques are seldom totally automatic because of inherent limitations in their methodologies that prevent them from making what might be called final decisions as to correctness; consequently, these methods almost always involve the manual intervention of a human user who must be the final arbiter as to the corrections to be made. One type of error correction methodology involves performing a character recognition operation on the original document image and comparing the output transcription produced to the original transcription to be corrected, in order to highlight the differences between the two transcriptions and to identify likely locations for at least some of the errors in the original transcription. When the original transcription has been generated by a first OCR operation, a second, different OCR operation is used for the correction or proofreading operation. Two types of problems occur using this approach. The first is that many of the recognition errors in the second transcription are likely to be the same as those appearing in the original transcription, and thus cannot be detected by comparing the two transcriptions. This is because the vast majority of current commercial OCR technology is designed to be "omnifont," that is, able to handle a wide range of text fonts and typographic conventions so as to be generally useful with a wide variety of document images. This generality, however, is the very characteristic that leads to errors: subtle cues that are useful or necessary for accurate recognition within a particular character font are typically not represented in the feature-based character templates that are used in omnifont recognizers.
For example, there is often only a very slight difference between the glyphs representing the letter "l" and the numeral "1" in a given font, typically less difference than between the glyphs for "l" in different fonts, so l/1 confusion errors are common in omnifont OCR systems. (The term "glyph" refers to an image that represents a realized instance of a character.) A second limitation of this correction approach is that even when a disagreement between the first, original transcription and the second transcription produced by the omnifont recognizer is found, it typically cannot be directly and automatically determined which transcription is in error and which, if either, is correct. Therefore, the best this type of approach can accomplish is to flag potential errors for manual intervention by a human operator.
Another category of error correction methodology employs some type of language modeling. A spelling corrector that ensures that each word in the transcription is a correctly spelled word from some dictionary is a simple form of such language modeling. Contextual postprocessing error correction techniques make use of language structure extracted from dictionary words and represented as n-grams, or n-character subsets of words. More advanced forms of language modeling include examining the parts of speech, sentence syntax, etc., to ensure that the transcription correctly follows the grammar of the language the document is written in. There are, however, several limitations to the language modeling approach. First, an extensive dictionary and/or grammar is needed for the particular language represented by the character strings that appear in the document; obviously, a dictionary and grammar must be available for each language represented in the document transcriptions to be corrected. It is also very likely that a given document will contain character strings that do not occur in even a very large dictionary or that are "ungrammatical," e.g., names, numbers, etc.; a special mechanism must be available for handling these portions of the transcription if they are to be evaluated for correction.
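The dictionary and n-gram checks just described can be sketched in a few lines. The following is a minimal, hypothetical illustration (the word list, the use of bigrams, and the function names are invented for this sketch, not taken from any particular system):

```python
def build_bigrams(dictionary):
    """Collect the set of character bigrams (2-grams) occurring in any
    dictionary word, a simple form of extracted language structure."""
    bigrams = set()
    for word in dictionary:
        for i in range(len(word) - 1):
            bigrams.add(word[i:i + 2])
    return bigrams

def flag_suspect_words(transcription, dictionary):
    """Flag words that fail the dictionary lookup or contain a bigram
    never seen in any dictionary word; these are only *potential*
    errors, which is why a human arbiter is still needed."""
    bigrams = build_bigrams(dictionary)
    suspects = []
    for word in transcription.split():
        w = word.lower()
        in_dict = w in dictionary
        legal = all(w[i:i + 2] in bigrams for i in range(len(w) - 1))
        if not (in_dict and legal):
            suspects.append(word)
    return suspects
```

Note that such a checker can only flag suspects relative to its dictionary; names, numbers, and other out-of-vocabulary strings are flagged whether or not they are correctly transcribed, which is precisely the limitation discussed above.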
A significant limitation of the language modeling approach to error correction is the fact that language modeling involves using the content of a transcription as a guide to determining what errors exist, and ignores the content of the original document image. If the original document contains spelling or grammatical errors, those will be detected as "errors" by the language model, even though they may be correctly transcribed. Conversely, if such misspellings or grammatical errors in the document are mis-transcribed into correctly spelled words or into grammatically correct strings, a transcription error has occurred that cannot be detected by language modeling. Some error correction systems that use a language modeling approach compensate for this by merely flagging potential errors, leaving the final determination and correction to a human operator, and consequently requiring some level of human intervention in the correction process.
Language modeling may also be of limited value for post-recognition correction of a transcription that has been generated by a computer-implemented OCR operation because most commercial OCR systems already include some sort of language modeling as part of the recognition process. Therefore, transcription errors that still occur after recognition have presumably not been corrected by the language model component of the recognizer, and therefore are likely to be the type of errors that are not readily detected by language modeling. For example, U.S. Pat. No. 4,979,227 issued to Mittelbach et al. discloses a character recognition system that includes a context lexicon that is used to correct word-based strings of recognized characters. Current recognized strings are compared to the strings of the context lexicon and that string in the lexicon which is optimum with respect to similarity and frequency is selected for further evaluation. A correction is only executed when the substitution transposition is probable based on the classifier characteristic for the characters under consideration. U.S. Pat. No. 4,654,875 issued to Srihari et al. discloses a character recognizer that includes lexical information in the form of acceptable words represented as a graph structure.
U.S. Pat. No. 5,257,328 issued to Shimizu discloses a document recognition device capable of correcting results obtained from recognizing a character image using a post-recognition correction operation that includes a correction data base in which is registered correction information on misrecognized characters that are specified as targets to be corrected by an operator. The post-recognition correction operation also includes an automatic correction process that corrects results recognized by the character recognizer using the correction data base, an operator's correction process that allows an operator to correct erroneous results of the automatic correction process, and a correction data base update operation that updates the correction data base with correction information made by the operator. Automatic corrections to characters are made on the basis of statistics collected in the correction data base as a result of the recognition operation, indicating the number of times a character occurs in an image and the number of times it has been corrected to other characters.
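A correction data base of the general kind Shimizu describes might be sketched as follows; the class name, the majority-vote rule, and the 50% cutoff are illustrative assumptions for this sketch, not details taken from the patent:

```python
from collections import defaultdict

class CorrectionDatabase:
    """Sketch of a correction data base: for each recognized character,
    count how often it occurred and how often the operator corrected it
    to each other character."""
    def __init__(self):
        self.occurrences = defaultdict(int)
        self.corrections = defaultdict(lambda: defaultdict(int))

    def record(self, recognized, corrected):
        """Update statistics from one operator-reviewed character."""
        self.occurrences[recognized] += 1
        if corrected != recognized:
            self.corrections[recognized][corrected] += 1

    def auto_correct(self, recognized, threshold=0.5):
        """Automatically replace a character when it has historically
        been corrected to some other character more than `threshold`
        of the time; otherwise leave it unchanged."""
        n = self.occurrences[recognized]
        if n == 0:
            return recognized
        best, count = recognized, 0
        for target, c in self.corrections[recognized].items():
            if c > count:
                best, count = target, c
        return best if count / n > threshold else recognized
```

The operator's corrections feed back into the statistics, so the automatic process improves as more pages are reviewed, while still leaving final arbitration to the human operator.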
J. J. Hull and S. N. Srihari discuss the use of contextual constraints in text recognition and error correction, in "Experiments in Text Recognition with Binary n-Gram and Viterbi Algorithms," in IEEE Transactions on Pattern Analysis and Machine Intelligence, September, 1982, pp. 520-530. Such contextual constraints may take a variety of forms including vocabulary, probabilities of co-occurring letters, syntax represented by a grammar, and models of semantics. An example of a structural representation of contextual knowledge, known as the binary n-gram algorithm, utilizes contextual knowledge in the form of sets of binary arrays that represent legal letter combinations in the language used in the document being recognized. The binary n-gram algorithm utilizes an abstraction of a dictionary that is assumed to contain all allowable input words. The method attempts to detect, as well as correct, words with errors. J. J. Hull and S. N. Srihari disclose a binary n-gram procedure for correcting single substitution errors.
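The binary n-gram idea can be illustrated with a small sketch. Representing the binary arrays as a set of (position, letter-pair) entries and limiting correction to single substitution errors are simplifying assumptions for this sketch, not the authors' exact formulation:

```python
import string

def build_positional_bigrams(dictionary):
    """Abstract a dictionary into entries recording which letter pairs
    occur at which positions; each entry stands in for a '1' bit in a
    binary n-gram array."""
    legal = set()
    for word in dictionary:
        for i in range(len(word) - 1):
            legal.add((i, word[i:i + 2]))
    return legal

def correct_single_substitution(word, legal):
    """Accept a word only if every positional bigram is legal; if not,
    return the corrected word when exactly one single-character
    substitution legalizes all bigrams, else return the word unchanged."""
    def ok(w):
        return all((i, w[i:i + 2]) in legal for i in range(len(w) - 1))
    if ok(word):
        return word
    candidates = []
    for i in range(len(word)):
        for c in string.ascii_lowercase:
            trial = word[:i] + c + word[i + 1:]
            if trial != word and ok(trial):
                candidates.append(trial)
    return candidates[0] if len(candidates) == 1 else word
```

Because the arrays are an abstraction of the dictionary rather than the dictionary itself, the method can accept letter combinations that occur in no actual word, trading some precision for compactness.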
Two major recognition problems contributing to transcription inaccuracy that may not be entirely addressed by the types of error correction methodologies just described are image noise and glyph segmentation errors. Glyphs occurring in bitmapped images produced from well-known sources such as scanning and faxing processes are subject to being degraded by image noise and distortion which contribute to uncertainty in the actual appearance of the glyph's bitmap and reduce recognition accuracy. A degraded bitmap appearance may be caused by an original document of poor quality, by scanning error, by image skewing, or by similar factors affecting the digitized representation of the image. Particular problems in this regard are the tendencies of characters in text to blur or merge, or to break apart. Such a degraded image is referred to herein as a "noisy" image. Many OCR errors occur from recognizers attempting to recognize individual glyphs in images that have been degraded by such noise and distortion.
Image noise often interferes with accurate glyph segmentation. Many commercial omnifont recognizers depend upon the accuracy of a pre-recognition segmentation process to isolate glyphs for recognition: individual glyphs must occur within the image in nonoverlapping bounding boxes, or, if the glyph samples are not so restricted, the recognition process must provide a way to assign pixels in the image to a specific glyph, so that the glyphs may be isolated, recognition may be performed, and character labels may be assigned to the glyphs. This requirement of the input image is hereafter described as requiring that the input glyph samples be "segmentable" during the recognition process, either by determining the bounding box around each glyph or by some other process that assigns pixels to glyph samples. Some images may contain glyphs representing characters in fonts or in character sets that do not lend themselves easily to such segmentation, or image noise may prevent accurate glyph image segmentation.
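Bounding-box segmentation is commonly implemented with connected component analysis. The following minimal sketch (a generic flood-fill labeling, not any particular recognizer's algorithm) illustrates why merged or broken glyphs defeat it: touching characters come back as a single box, and a broken character comes back as two.

```python
def connected_component_boxes(image):
    """image: list of rows of 0/1 pixels. Returns the bounding boxes
    (top, left, bottom, right) of 4-connected foreground components,
    in discovery order."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] and not seen[r][c]:
                # Flood-fill this component, tracking its extent.
                stack = [(r, c)]
                seen[r][c] = True
                t, l, b, rt = r, c, r, c
                while stack:
                    y, x = stack.pop()
                    t, l = min(t, y), min(l, x)
                    b, rt = max(b, y), max(rt, x)
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < rows and 0 <= nx < cols \
                                and image[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            stack.append((ny, nx))
                boxes.append((t, l, b, rt))
    return boxes
```

A single noise pixel bridging two glyphs merges their components into one box, so every downstream recognition decision on that box starts from an incorrect glyph image.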
Some post-recognition error correction methods have been specifically developed to overcome the recognition problems introduced by noisy images and poor segmentation. For example, U.S. Pat. No. 5,048,113 issued to Yamagata et al. is concerned inter alia with the correction of errors that result from misrecognition of characters in multiple fonts or multiple type sizes that occur in one document image. Yamagata et al. disclose a feature-based character recognizer in which the recognition results include certain information about the recognition reliability of each character in the transcription. A character string, typically a word, is selected from the transcription and a candidate reference character in the character string is identified and selected on the basis of certain factors of the recognition results. The candidate reference character is then located in the original document image, and certain image processing analysis is performed on the character image to develop reference image attributes, such as height and baseline position, by which to judge the correctness of the remaining characters in the character string, and to correct them if necessary on the basis of the reference image attributes.
Another technique that makes use of post-recognition image analysis for recognition error correction is disclosed in U.S. Pat. No. 5,048,097 issued to Gaborski et al. A neural network character recognizer makes use of a post-recognition processor that attempts to find and separate, in the original document image being recognized, adjacent characters which are kerned and characters which are touching that have been identified by the neural network as having low recognition scores. Characters successfully segmented in the post-recognition processor are fed back to the neural network for re-recognition. Dekerning and character segmentation are accomplished using various image processing techniques including connected component analysis.
U.S. Pat. No. 3,969,700 issued to Bollinger et al. discloses a system for selecting the correct form of a garbled input word that is misread by an optical character reader so as to change the number of characters in the word by character splitting or concatenation. The error correction technique uses what is referred to as a regional context maximum likelihood procedure performed using a conditional probabilistic analysis that evaluates the likelihood that each member of a predetermined class of reference words, stored as a dictionary of words, being considered could have been mapped into the garbled character string by means of OCR segmentation error propensities. The segmentation error propensity data are represented as a table of independent conditional probabilities for various types of substitution and segmentation errors for the stored dictionary words. When a garbled OCR word is input to the system, it is compared with each stored dictionary word by loading the two words in a pair of associated shift registers and aligning their letters on one end. The method then calculates the total conditional probability that the OCR word in the first shift register was misread given that the dictionary word was actually scanned by the OCR. The OCR and dictionary words are realigned as needed during this computation if the total probability computed indicates that a segmentation or concatenation error has occurred.
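The regional context maximum likelihood idea resembles a weighted edit-distance computation. The sketch below uses a Viterbi-style dynamic program with invented, illustrative probability tables for substitution, splitting, and concatenation errors; the patent's shift-register mechanics and realignment steps are abstracted away:

```python
import math

P_SUB = {("l", "1"): 0.2}      # illustrative substitution propensities
P_SPLIT = {("m", "rn"): 0.3}   # one dictionary char read as two chars
P_MERGE = {("cl", "d"): 0.2}   # two dictionary chars read as one char

def p_sub(d, o):
    """Probability that dictionary character d was read as OCR char o."""
    if d == o:
        return 0.9
    return P_SUB.get((d, o), 0.01)

def likelihood(dict_word, ocr_word):
    """Max probability that dict_word was scanned and garbled into
    ocr_word, allowing substitution, split, and concatenation errors
    (log-domain dynamic program over character alignments)."""
    n, m = len(dict_word), len(ocr_word)
    NEG = float("-inf")
    dp = [[NEG] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if dp[i][j] == NEG:
                continue
            if i < n and j < m:       # correct read or substitution
                p = p_sub(dict_word[i], ocr_word[j])
                dp[i + 1][j + 1] = max(dp[i + 1][j + 1],
                                       dp[i][j] + math.log(p))
            if i < n and j + 2 <= m:  # splitting error
                p = P_SPLIT.get((dict_word[i], ocr_word[j:j + 2]), 0.0)
                if p > 0:
                    dp[i + 1][j + 2] = max(dp[i + 1][j + 2],
                                           dp[i][j] + math.log(p))
            if i + 2 <= n and j < m:  # concatenation error
                p = P_MERGE.get((dict_word[i:i + 2], ocr_word[j]), 0.0)
                if p > 0:
                    dp[i + 2][j + 1] = max(dp[i + 2][j + 1],
                                           dp[i][j] + math.log(p))
    return math.exp(dp[n][m]) if dp[n][m] > NEG else 0.0
```

Scoring each dictionary word this way against the garbled OCR word and selecting the maximum corresponds to the patent's selection of the reference word most likely to have produced the observed string.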
This brief discussion of the wide variety of, and distinctly different approaches to, post-recognition error correction methodologies developed to improve the accuracy of the output transcription points to at least two major causes of transcription errors. The first of these is inadequate character template models that are unable to sufficiently discriminate between individual characters in a single font, thereby typically causing substitution errors, the correction of which is typically handled by some sort of language model. The second major cause of errors is a degraded image that results in segmentation and concatenation errors when isolating character images for recognition, typically causing insertion, deletion and substitution errors as a result of presenting an incorrect glyph for recognition. Correction of these types of errors may involve the use of several types of error correction solutions. The variety of solutions that have been developed appears to suggest that the causes of transcription errors are too varied to be susceptible to correction by a single uniform approach, and that several correction operations are required in combination in order to achieve a significant reduction in errors in most transcriptions.
It is of significance to note that, while some of the techniques (e.g., the language model correction techniques) heretofore described make use of a priori knowledge about the language in which the text of a document appears in order to effect error correction, none of them makes use of explicitly defined a priori knowledge about the text document image itself--in the form of an explicitly defined image model--to improve the accuracy of a transcription. Explicitly defined formal image models have been used in recognition operations; a brief description of formal image models and their use is provided here as relevant background in understanding the present invention.
2. Image Models
An image model is a characterization or description of the set of possible input images for which a recognition system is designed, provided in a form that can be used to determine which one of the possible images best matches a given input image. An image model represents a priori information about this set of input images and is distinguishable from data structures that define a particular input image or contain the results of performing analysis and recognition processing on a particular image.
For example, an image model for individual character images defines the set of possible characters that are expected to be presented for recognition, and indicates the value of each pixel in each character image. A typical form for a character image model is a set of binary or feature templates. An isolated character image model provides a recognition system with the a priori information necessary to determine which character is most likely to correspond to a given input image of an isolated character. Similarly, an image model for isolated text lines might describe the set of possible text line images by specifying the set of possible character sequences within the line and the positioning of the individual character images relative to each other and the text baseline. When used in recognition, a text line image model typically provides the a priori information necessary to determine an output text string that is most likely to correspond to a given observed image of an isolated, but otherwise unsegmented, text line. An image model for a whole page of text might describe the set of possible text line images that can occur and their possible positions relative to each other and to the boundary of the page. When used in recognition, a page image model provides the a priori information required to determine an output sequence of text strings that is most likely to correspond to a given observed input image of a text page. An image model frequently describes conventional text images, but an image model may be constructed to describe any one of a number of classes of input images, including, for example, images of printed music, images of equations, and images with fixed or known structural features such as business letters, preprinted forms and telephone yellow pages.
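As a concrete illustration of the simplest case, an isolated-character image model reduces to a set of binary templates plus a pixel-agreement score. The tiny 3x3 templates below are stand-ins for real font templates, and the scoring rule is the most elementary one possible:

```python
TEMPLATES = {  # toy 3x3 binary templates standing in for a real font
    "I": [[0, 1, 0],
          [0, 1, 0],
          [0, 1, 0]],
    "L": [[1, 0, 0],
          [1, 0, 0],
          [1, 1, 1]],
}

def classify_glyph(glyph, templates=TEMPLATES):
    """Return the character whose binary template agrees with the glyph
    bitmap on the largest number of pixels. The templates constitute
    the a priori image model; the glyph is the observed input."""
    def score(t):
        return sum(tp == gp
                   for trow, grow in zip(t, glyph)
                   for tp, gp in zip(trow, grow))
    return max(templates, key=lambda c: score(templates[c]))
```

Note the separation the text emphasizes: the template set is a data structure describing the set of possible inputs, distinct from both the observed glyph and the code that performs the match.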
a. Formal and informal image models.
For purposes of the discussion herein, image models may also be classified as "informal" and "formal." A formal image model describes a set of images using a formal description language, such as a formal grammar or a finite state transition network. A formal grammar is a set of rules that define the allowable formats (syntax) that statements in a specific language are allowed to take. Grammars may be characterized by type as unrestricted, context sensitive, context free and regular, and a particular type of grammar may be more or less suited to a specific image model. In a computer implemented system, a formal image model is typically represented as an explicit data structure that defines the possible constituents of the image and their possible positions in the image. As noted above, the image model represents a priori information and is to be distinguished from data structures constructed to represent a particular input image to a recognition system or the results of recognition processing of a particular input image.
For purposes of this background discussion, and for discussing the present invention, an informal image model includes all approaches to describing a set of possible images other than by use of a formal explicit description system. The design of every text recognition system is based on either an explicit or implicit image model. The distinction to be drawn is whether the image model is explicitly and formally stated in a manner that is independent of the processing algorithms that use the model, or whether the model is only represented implicitly, as a body of code that performs image analysis operations. A formal image model, in this regard, is analogous to a formal grammar in a grammar-based character string parsing system which exists as an explicit data structure independent of the code of the parser that uses it.
b. Zero-, one-, and two-dimensional image models.
Formal image models may take "zero-dimensional" (0D), one-dimensional (1D) or two-dimensional (2D) forms. A 0D image model, as that term is used herein, describes images of isolated characters. The most common types of 0D image models are binary and feature-based character templates. A 1D image model, as that term is used herein, defines the structure and appearance of a sequence of character images, such as a word or an entire text line, including the appearance and positioning of the individual characters in the line. A primary application of explicit 1D image models is in text line recognition systems that do not attempt to segment the text line into individual character images prior to recognition. The character and text line models used in such systems typically resemble the kinds of models used in speech recognition systems based on hidden Markov models, or simple extensions to such models.
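A 1D text line model can be illustrated by a toy decoder that jointly segments and labels a line, in the spirit of the HMM-based line recognizers just mentioned. The one-dimensional "ink count per column" templates are a deliberate simplification invented for this sketch:

```python
CHAR_COLUMNS = {  # toy 1D templates: ink count per pixel column
    "i": [2],
    "n": [3, 1, 3],
}

def decode_line(columns):
    """columns: observed ink count per pixel column of a text line.
    A Viterbi-style dynamic program chooses, at each horizontal
    position, which character template to place next, so segmentation
    and labeling happen jointly rather than in a prior step."""
    n = len(columns)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = min cost to explain columns[:i]
    back = [None] * (n + 1)
    best[0] = 0.0
    for i in range(n):
        if best[i] == INF:
            continue
        for ch, tmpl in CHAR_COLUMNS.items():
            j = i + len(tmpl)
            if j <= n:
                cost = best[i] + sum(abs(a - b)
                                     for a, b in zip(tmpl, columns[i:j]))
                if cost < best[j]:
                    best[j] = cost
                    back[j] = (i, ch)
    out, i = [], n            # trace back the best labeling
    while i > 0 and back[i]:
        i, ch = back[i]
        out.append(ch)
    return "".join(reversed(out))
```

Because the decoder never commits to character boundaries before scoring, there is no separate pre-recognition segmentation step to fail on a noisy line.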
A 2D image model, as that term is used herein, is distinguishable from a 1D image model in that the 2D image model typically defines the recognition process for an entire 2D image by describing how 2D subregions in the image are related to each other, without isolating 1D lines of text or individually segmenting character or word instances in the image in a distinct process prior to recognition. The use of a 2D image model for recognition provides the opportunity to eliminate the pre-recognition step of character, word or text line isolation or segmentation.
Formal 1D image models are used to represent words in S. Kuo and O. E. Agazzi, "Keyword spotting in poorly printed documents using pseudo 2D hidden Markov models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 8, August, 1994, pp. 842-848 (hereafter, Kuo et al.), which discloses an algorithm for robust machine recognition of keywords embedded in a poorly printed document. For each keyword model, two statistical models, named "pseudo 2D hidden Markov models," or "PHMMs," are created to represent the actual keyword and all other extraneous words, respectively. C. Bose and S. Kuo, "Connected and degraded text recognition using hidden Markov model," in Proceedings of the International Conference on Pattern Recognition, The Netherlands, September 1992, pp. 116-119, disclose a recognition method for recognizing isolated word or line images; the recognizer is based on a formal 1D model expressed as a hidden Markov model.
U.S. Pat. Nos. 5,020,112 and 5,321,773 disclose recognition systems based on formal 2D image models. U.S. Pat. No. 5,020,112, issued to P. A. Chou, entitled "Image Recognition Using Two-Dimensional Stochastic Grammars," discloses a method of identifying bitmapped image objects using a 2D image model represented as a stochastic 2D context free grammar having production rules that define spatial relationships between objects in the image according to a rectangular image model; the grammar is used to parse the list of objects to determine the one of the possible parse trees that has the largest probability of occurrence. The term "stochastic" when used in this context refers to the use of probabilities associated with the possible parsing of a statement to deal with real world situations characterized by noise, distortion and uncertainty. U.S. Pat. No. 5,321,773, issued to G. Kopec and P. A. Chou, entitled "Image Recognition Method Using Finite State Networks" discloses a formal 2D image model represented as a stochastic finite state transition network that defines image generation in terms of a regular grammar, in contrast to the context free grammar used in U.S. Pat. No. 5,020,112. The template model described by the 2D image model defines the sidebearing model of letterform shape description and positioning, where character positioning does not depend on determining rectangular bounding boxes for character images; pairs of adjacent character images are positioned with respect to their image origin positions to permit overlapping rectangular bounding boxes as long as the foreground (e.g., black) pixels of one character are not shared with, or common with, the foreground pixels of the adjacent character. The 2D image model and the template model are also discussed in G. Kopec and P. Chou, "Document Image Decoding Using Markov Source Models," in IEEE Transactions on Pattern Analysis and Machine Intelligence, June, 1994, pp. 602-617 (hereafter, "Kopec and Chou, `Document Image Decoding`").
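In the spirit of the finite state transition networks just described, the sketch below represents a toy image model as explicit transition data (character label, displacement of the imaging position, log probability), kept separate from the decoder that searches it. The network, its states, and all numbers are invented for illustration and are far simpler than the models in the cited patents:

```python
from dataclasses import dataclass

@dataclass
class Transition:
    src: str
    dst: str
    label: str         # character placed by this transition ("" = none)
    displacement: int  # horizontal advance of the image origin
    log_prob: float

# Toy model: a text line is any mix of "a" glyphs and spaces, then an end.
NETWORK = [
    Transition("text", "text", "a", 4, -1.0),
    Transition("text", "text", " ", 2, -2.0),
    Transition("text", "end", "", 0, -0.5),
]

def best_path(network, start, final, width):
    """Exhaustive search for the most probable transition sequence that
    starts at `start`, reaches `final`, and advances exactly `width`
    pixels; returns (log probability, transcription). Real decoders
    use dynamic programming, but the model/decoder separation is the
    point being illustrated here."""
    def search(state, x):
        if x > width:
            return float("-inf"), ""
        if state == final and x == width:
            best = (0.0, "")
        else:
            best = (float("-inf"), "")
        for t in network:
            if t.src == state:
                lp, s = search(t.dst, x + t.displacement)
                if t.log_prob + lp > best[0]:
                    best = (t.log_prob + lp, t.label + s)
        return best
    return search(start, 0)
```

The network is an explicit data structure describing the set of possible line images, exactly analogous to a formal grammar existing independently of its parser; the decoder consults it but does not embody it.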