Information in the form of language symbols (i.e., characters) or other symbolic notation that is visually represented to a human in an image on a marking medium, such as a computer display screen or paper, is capable of manipulation for its semantic content by a processor included in a computer system when the information is accessible to the processor in an encoded form, such as when each of the language symbols is available to the processor as a respective character code selected from a predetermined set of character codes (e.g., ASCII code) that represent the symbols to the processor. When manipulation of the semantic content of the characters in the image by a processor is desirable, a process variously called "recognition," "character recognition," or "optical character recognition" must be performed on the image in order to produce, from the images of characters, a sequence of appropriate character codes.
An image is typically represented in a computer system as a two-dimensional array of image data, with each item of data in the array providing a value indicating the color (typically black or white) of a respective location of the image. An image represented in this manner is frequently referred to as a bitmapped image, or a binary image. Each location in a binary image is conventionally referred to as a picture element, or pixel. Sources of binary images include images produced by scanning a paper form of a document using an optical scanner, or by receiving image data via facsimile transmission of a paper document.
Character recognition systems typically include a process in which the appearance of an isolated, input character image, or "glyph," is analyzed and, in a decision making process, classified as a distinct character in a predetermined set of characters. The term "glyph" refers to a character in its exemplary image form; a glyph is an image that represents a realized instance of a character. The classification analysis typically includes comparing characteristics of the isolated input glyph (e.g., its pixel content or other characteristics) to units of reference information about characters in the character set, each of which defines characteristics of the "ideal" visual representation of a character in its particular size, font and style, as it would appear in an image if there were no noise or distortion introduced by the image creation process. The unit of reference information for each character, typically called a "character template," "template" or "prototype," includes identification information, referred to as a "character label," that uniquely identifies the character as one of the characters in the character set. The character label may also include such information as the character's font, point size and style. A character label is output as the identification of the input glyph when the classification analysis determines that a sufficient match between the glyph and the reference information indicating the character label has been made.
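The classification step described above can be illustrated with a minimal sketch. It assumes that the glyph and every template are binary grids of the same size and that a "sufficient match" means a pixel-agreement score at or above a threshold; the function names, the tiny 3x5 templates and the 0.9 threshold are hypothetical and are chosen only for illustration.

```python
# Illustrative sketch only. Glyphs and templates are equal-sized binary grids
# (1 = foreground/black, 0 = background/white). All names and the 0.9
# threshold are assumptions made for this example.

def parse(rows):
    """Convert rows of '#' and '.' into a binary grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def match_score(glyph, template):
    """Fraction of pixel locations at which glyph and template agree."""
    total = len(glyph) * len(glyph[0])
    agree = sum(
        1
        for g_row, t_row in zip(glyph, template)
        for g, t in zip(g_row, t_row)
        if g == t
    )
    return agree / total

def classify(glyph, templates, min_score=0.9):
    """Return the character label of the best-matching template, or None
    when no template matches sufficiently."""
    best_label, best_score = None, 0.0
    for label, template in templates.items():
        score = match_score(glyph, template)
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= min_score else None

# Two hypothetical character templates, keyed by character label.
TEMPLATES = {
    "I": parse([".#.", ".#.", ".#.", ".#.", ".#."]),
    "L": parse(["#..", "#..", "#..", "#..", "###"]),
}

# An input glyph of "L" with one extraneous foreground pixel.
noisy_l = parse(["#..", "#..", "#..", "#.#", "###"])
label = classify(noisy_l, TEMPLATES)
```

In this sketch the noisy glyph agrees with the "L" template at 14 of 15 pixel locations, so the label "L" is output; a glyph that matches no template sufficiently yields no label.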
The representation of the reference information that comprises a character template may be referred to as its model. Character template models are broadly identifiable as being either binary images of characters, or lists of high level "features" of bitmapped character images. A binary image of a black character template on a white background includes black foreground pixels that collectively make up the template's "support." "Features" are measurements of a character image that are derived from the binary image and are typically much fewer in number than the number of pixels in a binary template. Examples of features include a character's height and width, and the number of closed loops in the character.
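As a hedged illustration of the two kinds of model, the following sketch derives the support of a binary template and three of the features named above (height, width and number of closed loops). The loop count is computed here by counting background regions not connected to the outside of the image, which is one possible approach and is an assumption of this example; the helper names are likewise hypothetical.

```python
# Illustrative sketch only. A binary template is a grid of 0/1 values with
# 1 = foreground. The template is assumed to have at least one foreground
# pixel. Helper names are hypothetical.

def parse(rows):
    """Convert rows of '#' and '.' into a binary grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def support(grid):
    """Set of (row, col) positions of all foreground pixels."""
    return {(r, c) for r, row in enumerate(grid) for c, v in enumerate(row) if v}

def height_and_width(grid):
    """Height and width of the smallest rectangle enclosing the support."""
    pts = support(grid)
    rows = [r for r, _ in pts]
    cols = [c for _, c in pts]
    return max(rows) - min(rows) + 1, max(cols) - min(cols) + 1

def count_closed_loops(grid):
    """Count enclosed background regions (holes).

    The grid is padded with a background border so that all outside
    background forms one 4-connected region; every other background
    region is a hole.
    """
    h, w = len(grid), len(grid[0])
    padded = [[0] * (w + 2)] + [[0] + list(row) + [0] for row in grid] + [[0] * (w + 2)]
    seen = set()
    regions = 0
    for r in range(h + 2):
        for c in range(w + 2):
            if padded[r][c] == 0 and (r, c) not in seen:
                regions += 1
                stack = [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                        if (0 <= ny < h + 2 and 0 <= nx < w + 2
                                and padded[ny][nx] == 0 and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            stack.append((ny, nx))
    return regions - 1  # subtract the outside background region

eight = parse(["###", "#.#", "###", "#.#", "###"])
features = {
    "size": height_and_width(eight),
    "loops": count_closed_loops(eight),
    "support_pixels": len(support(eight)),
}
```

The 15-pixel binary image is reduced here to a handful of measurements, which reflects the observation that features are typically much fewer in number than the pixels of a binary template.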
Within the category of bitmapped, or binary, character template models, at least two different types of models have been defined. One common model for binary character templates may be called the "segmentation-based" model, and describes a character template as fitting entirely within a minimally-sized rectangular region, referred to as a "bounding box." A bounding box is defined to be the smallest rectangular box that can be drawn around a glyph or character template and include all of the foreground pixels within the rectangular box. The segmentation-based character template model describes the combining of adjacent character templates as being "disjoint"--that is, requiring substantially nonoverlapping bounding boxes. U.S. Pat. No. 5,321,773 discloses another binary character template model that is based on the sidebearing model of letterform shape description and positioning used in the field of digital typography. The sidebearing model, described in more detail below in the discussion accompanying FIG. 21, describes the combining of templates to permit overlapping rectangular bounding boxes as long as the foreground pixels of one template are not shared with, or common with, the foreground pixels of an adjacent template; this is described as requiring the templates to have substantially "disjoint supports."
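The difference between the two constraints can be sketched as follows. The example assumes that each positioned template is given as a set of (row, column) foreground pixel positions in image coordinates, and it tests for strict rather than "substantial" non-overlap, which is a simplification; the two small shapes and all function names are hypothetical.

```python
# Illustrative sketch only. Each positioned template is a set of (row, col)
# foreground pixel positions in image coordinates. Names are hypothetical,
# and strict (not "substantial") disjointness is tested for simplicity.

def bounding_box(pixels):
    """Smallest rectangle (top, left, bottom, right), inclusive, that
    contains every foreground pixel."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return min(rows), min(cols), max(rows), max(cols)

def boxes_overlap(box_a, box_b):
    """True when two inclusive rectangles share at least one location."""
    top_a, left_a, bottom_a, right_a = box_a
    top_b, left_b, bottom_b, right_b = box_b
    return not (right_a < left_b or right_b < left_a
                or bottom_a < top_b or bottom_b < top_a)

def supports_disjoint(pixels_a, pixels_b):
    """True when the two templates share no foreground pixel."""
    return pixels_a.isdisjoint(pixels_b)

# A shape with an overhanging top bar, and a neighbor tucked beneath the bar.
shape_a = {(0, 0), (0, 1), (0, 2), (1, 0), (2, 0)}
shape_b = {(1, 2), (2, 2), (2, 3)}

box_a = bounding_box(shape_a)   # (0, 0, 2, 2)
box_b = bounding_box(shape_b)   # (1, 2, 2, 3)

allowed_by_segmentation_model = not boxes_overlap(box_a, box_b)
allowed_by_sidebearing_model = supports_disjoint(shape_a, shape_b)
```

In this arrangement the bounding boxes overlap, so the pair would not be permitted under the segmentation-based model, while the supports are disjoint, so the pair would be permitted under the sidebearing model.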
Training character templates is the process of using training data to create, produce or update the templates for use typically, but not exclusively, in a recognition operation. Training data can be broadly defined as a collection of isolated, individual character image (glyph) samples, each with an assigned character label identifying the character in the character set that it represents, that provide the information necessary to produce templates according to the character template model defining the templates. Existing methods for the estimation (i.e., construction) of binary character templates are relatively straightforward variations of foreground (e.g., black) pixel counting algorithms imposed on collections of isolated glyph samples that are aligned at their bounding boxes. Threshold values are typically used to evaluate the pixel counts to determine whether a pixel in the final template is to be designated a foreground pixel. Binary character templates may also be represented as arrays of probability values, where each pixel location indicates, instead of an ON/OFF value, a probability that reflects the statistical occurrence of an ON or an OFF pixel in the training data for that pixel location. Character templates represented as probability arrays may provide for improved character classification during recognition. For purposes of this background discussion, the term "binary character template" will include both bitmapped character templates and templates of probability values.
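A minimal sketch of such a pixel counting approach is given below. It assumes that the glyph samples for one character have already been isolated and aligned so that they are binary grids of identical size, and it uses a threshold of 0.5 on the fraction of samples in which a pixel is ON; the function name, the sample data and the threshold are assumptions made for illustration and are not drawn from any particular existing method.

```python
# Illustrative sketch only. Samples are pre-segmented, aligned binary grids
# of identical size (1 = ON/foreground). Names and the 0.5 threshold are
# assumptions made for this example.

def parse(rows):
    """Convert rows of '#' and '.' into a binary grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def estimate_template(samples, threshold=0.5):
    """Return (binary_template, probability_template) for one character.

    Each probability is the fraction of samples in which the pixel is ON;
    the binary template marks a pixel as foreground when that fraction
    reaches the threshold.
    """
    n = len(samples)
    h, w = len(samples[0]), len(samples[0][0])
    counts = [[sum(s[r][c] for s in samples) for c in range(w)] for r in range(h)]
    probabilities = [[counts[r][c] / n for c in range(w)] for r in range(h)]
    binary = [[1 if p >= threshold else 0 for p in row] for row in probabilities]
    return binary, probabilities

# Three samples of a "T"-like glyph; two are impaired by noise.
samples = [
    parse(["###", ".#.", ".#."]),
    parse(["###", ".#.", "##."]),   # one extraneous foreground pixel
    parse(["#.#", ".#.", ".#."]),   # one missing foreground pixel
]
binary_template, probability_template = estimate_template(samples)
```

Here the extraneous pixel appears in only one of three samples and is excluded from the binary template, while the pixel missing from one sample is still ON in two of three samples and is retained; the probability array preserves both facts as the values 1/3 and 2/3.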
The success of training high quality binary character templates using conventional methods ultimately depends on the quality of the glyph samples provided for training. Glyph samples are typically derived from a two-dimensional image of a page that includes text, hereafter also called a text document image. Good quality glyph samples required by conventional training techniques are those (1) that are substantially unimpaired by missing or extraneous foreground pixels when they are input to the training process, and (2) for which all or substantially all foreground pixels have been properly identified for inclusion in the sample prior to training. The first requirement pertains directly to the issue of noise in the input sample, and the second is relevant to the issue of pre-training sample segmentation. These requirements, it will be shown, substantially limit the usefulness and flexibility of existing training processes.
Glyph samples derived from binary images produced from well-known sources such as scanning and faxing processes are subject to being degraded by image noise which contributes to uncertainty in the actual appearance of the bitmap. A degraded bitmap appearance may be caused by an original document of poor quality, by scanning error, by image skewing, or by similar factors affecting the digitized representation of the image. Particular problems in this regard are the tendencies of characters in text to blur or merge, or to break apart. Such a degraded image will be referred to herein as a "noisy" image. The requirement of good quality glyph samples as an input to existing training processes has generally imposed the limitation that the input image used as the source of glyph samples be relatively non-noisy, or, if noisy images are permitted to be used, that there be some process for removing or otherwise compensating for the noise in the samples.
More importantly, existing template construction techniques require that good quality glyph samples be isolated, or segmented, from adjacent glyphs, at least to the extent that, in a text document image in which several adjacent glyph samples occur, a decision has been made with respect to each foreground pixel prior to template training as to which glyph sample the pixel is to be included in. Glyph sample segmentation may be accomplished using known methods for automatically finding character bounding boxes. However, such techniques are highly error-prone, especially when the text document image containing the samples is noisy. For fonts that permit overlapping bounding boxes, such as those following the sidebearing model, many script fonts, and fonts for languages that have inherently touching symbols, glyph sample segmentation may be impossible to accomplish successfully using automatic methods. For these reasons, when high quality templates are desired, glyph sample segmentation for training purposes is more typically accomplished by requiring the user of the training system to manually isolate each glyph sample from a displayed image of glyph samples.
Recognition systems typically provide distinct training subsystems for the purpose of training the character templates. Training systems may be "supervised" or "unsupervised." A template training system that requires some aspect of the training data to be specially prepared by a user of the system is considered to be a "supervised" training or learning system. Typically, a supervised training system requires samples that have been labeled in advance of training. In contrast, an "unsupervised" training system produces training data automatically, typically as a result of performing a recognition operation on an input text document image, on a series of text lines extracted from an input text document image, or on pre-segmented, isolated character images. In unsupervised training, labels for glyph samples are not known prior to the start of the training process; the source of glyph samples is the image being recognized, and the character labels are assigned to the glyph samples as a result of performing the recognition operation. When the input source of glyph samples is not a collection of pre-segmented, isolated character images, the recognition operation in unsupervised training typically includes a character segmentation process during which the glyph samples themselves are identified. The training data then can be used in a training system without user involvement in its preparation. Unsupervised training is characterized by the fact that the character templates that provide the character labels for the glyph samples in the training data are the same character templates that are being trained. This provides the opportunity to train existing templates using the same image that is used for recognition, toward improving overall recognition accuracy for similar documents in the same font or character set, while eliminating or reducing the direct involvement of a user in the preparation of the training data.
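The defining characteristic of unsupervised training, namely that the templates being trained are the same templates that supply the character labels, can be sketched as a single training pass. The sketch assumes pre-segmented glyphs of the same size as the templates, recognition by smallest Hamming distance, and re-estimation by majority vote; these choices, and every name in the code, are hypothetical simplifications rather than a description of any existing system.

```python
# Illustrative sketch only. Glyphs are pre-segmented binary grids of the
# same size as the templates. Recognition by Hamming distance and
# re-estimation by majority vote are assumptions made for this example.

def parse(rows):
    """Convert rows of '#' and '.' into a binary grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def hamming(a, b):
    """Number of pixel locations at which two grids differ."""
    return sum(x != y for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def recognize(glyph, templates):
    """Label of the template closest to the glyph."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))

def majority_template(samples):
    """Binary template whose pixels are ON in more than half the samples."""
    n = len(samples)
    h, w = len(samples[0]), len(samples[0][0])
    return [[1 if 2 * sum(s[r][c] for s in samples) > n else 0
             for c in range(w)] for r in range(h)]

def unsupervised_training_pass(glyphs, templates):
    """Label glyphs with the current templates, then re-estimate each
    template from the glyphs that received its label."""
    labeled = {}
    for glyph in glyphs:
        labeled.setdefault(recognize(glyph, templates), []).append(glyph)
    updated = dict(templates)
    for label, samples in labeled.items():
        updated[label] = majority_template(samples)
    return updated

# Existing templates; the "L" template has a shorter foot than the font
# actually used in the document being recognized.
templates = {
    "I": parse([".#.", ".#.", ".#."]),
    "L": parse(["#..", "#..", "##."]),
}
glyphs = [
    parse(["#..", "#..", "###"]),
    parse(["#..", "#..", "###"]),
    parse(["#..", "#.#", "###"]),   # noisy sample
    parse([".#.", ".#.", ".#."]),
]
updated = unsupervised_training_pass(glyphs, templates)
```

After the pass, the "L" template reflects the longer foot found in the document's own glyphs, and the single noise pixel is voted out. The sketch also makes the dependency plain: if recognition mislabels or missegments the glyphs, the errors flow directly into the updated templates.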
The quality of the training data produced by unsupervised training systems is subject to the same concerns of image noise and segmentation as previously described. In some existing unsupervised training systems, the input image is some type of text document image produced by a well-known source process such as scanning or facsimile transmission, in contrast to an image specially prepared for training purposes in a supervised training system. The quality of the glyph samples identified during unsupervised training is directly dependent on the quality, i.e., the degree of non-noisiness, of the input image source, or, if noisy images are permitted to be used, on the ability of the recognition operation to remove or otherwise compensate for the noise in the glyph samples.
Similarly, with respect to the proper identification of glyph samples, when the character template model of the character templates being trained is the bitmapped, segmentation-based model, each template is required to fit within a bounding box. This typically imposes the same requirement on the glyph samples, which in turn may impose a constraint on the type of input image that may be used in the unsupervised training process. If the glyph samples are to be derived from an image of an existing text document, or from an image of a line of text or word in such a document, either the glyph samples must occur within the image, line or word in substantially nonoverlapping bounding boxes, or, if the glyph samples are not so restricted, the recognition operation must provide a way to assign pixels in the image to a specific glyph sample, so that the samples may be isolated, recognition may be performed and character labels may be assigned to the samples. This requirement of the input image will be hereafter described as requiring that the input glyph samples be "segmentable" during the recognition process, either by determining the bounding box around each glyph, or by some other process that assigns foreground pixels to glyph samples. Requiring segmentable glyph samples generally imposes a limitation on the type of existing text document input image that may be used in an unsupervised training process, since some images may contain glyphs representing characters in fonts or in character sets that do not lend themselves easily to such segmentation. Moreover, even when the samples are segmentable, the effectiveness of the unsupervised training process depends on the ability of the recognition process to correctly segment them, a process that may be adversely affected by factors such as an excessive amount of noise in the input image.
U.S. Pat. No. 5,321,773, issued to G. Kopec and P. A. Chou and entitled "Image Recognition Method Using Finite State Networks," discloses a recognition system that uses binary character templates modeled after the sidebearing model. The recognition system and the template model are also discussed in G. Kopec and P. Chou, "Document Image Decoding Using Markov Source Models," in IEEE Transactions on Pattern Analysis and Machine Intelligence, June, 1994, pp. 602-617 (hereafter, "Kopec and Chou, 'Document Image Decoding'"). Training of the character templates used in U.S. Pat. No. 5,321,773 involved both the actual construction of the binary character templates and estimating or computing specific typographic characteristics, or parameters, that are required for proper template positioning; these are known as character sidebearings, set widths and baseline depths, collectively called font metrics. The training process disclosed is illustrative of the problems involved in preparing high quality training data for the training of binary character templates using conventional training techniques.
U.S. Pat. No. 5,321,773 discloses the training of the character templates at cols. 11-17, and the training process is further described in G. Kopec, "Least-Squares Font Metric Estimation from Images," in IEEE Transactions on Image Processing, October, 1993, pp. 510-519 (hereafter, "Kopec, 'Font Metric Estimation'"). The training technique disclosed is a supervised technique that used a specially prepared input image, shown in FIG. 14 of the patent and in FIG. 3 of Kopec, "Font Metric Estimation," in which the glyph samples were segmentable. The samples were subjected to a pre-training segmentation step, described at p. 516 of Kopec, "Font Metric Estimation," in which the text lines and individual characters within each line of a font sample page were extracted using simple connected-component based analysis procedures of a text image editor. The text image editor required the input image of samples to be a single column of Roman text laid out in distinct horizontal lines separated by white space. In order to minimize segmentation problems, the space between each glyph sample in the input image was increased when the samples were created. This increased horizontal white space between the glyphs is observable in FIG. 14 of the patent. Each glyph sample isolated by the text image editor was labeled using a prepared text transcription of the sample page that included ordered character labels identifying the samples, paired on a one-for-one basis with the glyph samples in the input image.
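The general idea of connected-component segmentation followed by one-for-one labeling from a transcription can be sketched as follows. This is not the procedure of the text image editor referred to above, whose details are not given here; it is a simplified illustration that assumes a single text line, 8-connected foreground components, exactly one component per character (which does not hold for characters such as "i"), and hypothetical function names.

```python
# Illustrative sketch only; not the text image editor's actual procedure.
# Assumes one text line, 8-connected components, and exactly one component
# per character. All names are hypothetical.

def parse(rows):
    """Convert rows of '#' and '.' into a binary grid."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]

def connected_components(grid):
    """Return foreground components as sets of (row, col), ordered left to
    right by their leftmost column."""
    h, w = len(grid), len(grid[0])
    seen = set()
    components = []
    for r in range(h):
        for c in range(w):
            if grid[r][c] and (r, c) not in seen:
                component = set()
                stack = [(r, c)]
                seen.add((r, c))
                while stack:
                    y, x = stack.pop()
                    component.add((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and grid[ny][nx] and (ny, nx) not in seen):
                                seen.add((ny, nx))
                                stack.append((ny, nx))
                components.append(component)
    components.sort(key=lambda comp: min(x for _, x in comp))
    return components

def label_samples(line_image, transcription):
    """Pair each isolated glyph sample with its character label, in order."""
    components = connected_components(line_image)
    if len(components) != len(transcription):
        raise ValueError("glyph samples are not segmentable one-for-one")
    return list(zip(transcription, components))

# A line with widened spacing between an "L" and an "I" segments cleanly.
spaced_line = parse(["#...#", "#...#", "##..#"])
labeled = label_samples(spaced_line, "LI")

# The same characters set tightly, with the foot of the "L" touching the "I".
touching_line = parse(["#.#", "#.#", "###"])
```

With the widened spacing, two components are found and paired with the labels "L" and "I". When the same two characters touch, only one component is found, and the one-for-one pairing with the transcription fails, which illustrates why the spacing of the specially prepared sample page was increased.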
The essentially manual, supervised training technique disclosed in U.S. Pat. No. 5,321,773 and in Kopec, "Font Metric Estimation," requires that glyph samples be segmentable in the image source in which they occur, while the template model of the templates being trained requires only that pairs of the character images of the templates have substantially disjoint supports. This is because existing template construction techniques that use pixel averaging and thresholding are only capable of producing binary character templates from samples that include foreground pixels capable of being isolated within a bounding box that does not overlap with the bounding boxes of adjacent samples. Requiring specially prepared, segmentable glyph samples for training purposes imposes the burden of preparing the training data on the user, and eliminates the possibility of doing unsupervised training of the character templates. In addition, some text document images having character images positioned according to the sidebearing model could not themselves be used as sources of glyph samples for training, since these images might include some adjacent character pairs that would not be segmentable.