The present invention relates generally to the field of computer-implemented methods of, and systems for, text image modeling, recognition and layout analysis, and more particularly to a method and system for training layout parameters specified in a two-dimensional (2D) image grammar that models text images. The image grammar is used in various document processing operations, including document image recognition and layout analysis operations.
1. Document Image Layout Analysis and Image Layout Models.
Document image layout analysis is a type of document image recognition operation implemented in a processor-controlled machine that automatically makes determinations about the geometric, spatial and functional relationships of the physical and logical structures of a text (or document) image. These physical and logical structures are referred to herein as "image constituents." An image constituent as used herein is a portion of an image that is perceived by an observer to form a coherent unit in the image. Image constituents are typically represented in an image-based data structure as collections of pixels, but are generally described in terms of the perceived unit, rather than in terms of the pixels themselves. Examples of image constituents include the conventional text units of individual character or symbol images (referred to as "glyphs"), words, and text lines. Image constituents can contain other image constituents and so may also include groupings of these conventional text units into the logical, or semantic, notions of paragraphs, columns, sections, titles, footnotes, citations, headers, page numbers, and any one of a number of other logical structures to which the observer of a document may assign meaning. A glyph is typically considered to be the smallest image constituent; a group of image constituents is often called a "block," and includes, by way of example, a word, a text line, or a group of text lines. The layout analysis process typically produces the image locations of blocks with functional labels assigned to them according to their physical features, their physical image location, or their logical meaning, as imposed by the functional requirements of a particular type of document. The location of an image constituent is typically expressed in terms of image coordinates that define a minimally-sized bounding box that includes the image constituent.
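By way of illustration only, the notion of a minimally-sized bounding box can be sketched in a few lines of Python. The function names `bounding_box` and `block_box`, and the (left, top, right, bottom) coordinate convention, are assumptions made for this sketch and do not appear in any of the referenced systems.

```python
def bounding_box(pixels):
    """Smallest axis-aligned box (left, top, right, bottom) containing all pixels.

    `pixels` is an iterable of (x, y) image coordinates belonging to one
    image constituent, e.g. the foreground pixels of a glyph.
    """
    pixels = list(pixels)
    if not pixels:
        raise ValueError("an image constituent must contain at least one pixel")
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


def block_box(boxes):
    """Bounding box of a block, given the boxes of the constituents it contains."""
    boxes = list(boxes)
    if not boxes:
        raise ValueError("a block must contain at least one constituent")
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )
```

Because a block such as a word or text line is itself a group of constituents, its box may be computed from the boxes of its members rather than from individual pixels.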
To enhance the ability to perform functional layout analysis, some document image layout systems use a priori information about the physical structure of a specific class of documents in order to accurately and efficiently identify constituents in documents that include specific types of higher-level image constituents. This a priori information is commonly referred to as a document "class layout specification," or may be referred to as a document image model. A document class layout specification describes or models the structure of the class of documents to be analyzed and supplies information about the types, locations and other geometrical attributes of the constituents of a given class of document images.
A class layout specification may be supplied in one of two ways: (1) as an explicit data structure input to the layout analysis system, which typically allows for different types of documents to be processed according to the structural information provided by the data structure input; or (2) in the form of document description information that is implicitly built into the processing functionality of the system, on the assumption that all documents to be processed by the system are restricted to having the same structural layout specification. A class layout specification in effect "tunes" the layout analysis system to particular document structures and restricts the type of document image for which layout analysis is to be performed.
Examples of document image layout systems that make use of an explicit class layout specification are disclosed in U.S. Pat. No. 5,574,802, entitled "Method and Apparatus for Document Element Classification by Analysis of Major White Region Geometry"; in G. Story, et al., "The RightPages image-based electronic library for alerting and browsing", IEEE Computer, September 1992, pp. 17-26 (hereafter, "the Story reference"); in G. Nagy, et al., "A prototype document image analysis system for technical journals", IEEE Computer, July 1992, pp. 10-22 (hereafter, "the Nagy reference"); in A. Dengel, "ANASTASIL: a system for low-level and high-level geometric analysis of printed documents", in H. Baird, H. Bunke and K. Yamamoto, Structured Document Image Analysis, Berlin: Springer-Verlag, 1992; in J. Higashino, H. Fujisawa, Y. Nakano, and M. Ejiri, "A knowledge-based segmentation method for document understanding", Proceedings of the 8th International Conference on Pattern Recognition (ICPR), Paris, France, 1986, pp. 745-748; and in L. Spitz, "Style directed document recognition", First Intl. Conf. on Doc. Analysis and Recognition (ICDAR), Saint Malo, France, September 1991, pp. 611-619.
U.S. Pat. No. 5,574,802 discloses a system for logically identifying document elements in a document image using structural models; the system includes a geometric relationship comparator for comparing geometric relationships in a document to the geometric relationships in a structural model to determine which one of a set of structural models of document images matches a given input document image. A logical tag assigning system then assigns logical tags to the document elements in the image based on the matching structural model. If the document elements are treated as nodes and the spatial relationships between the document elements are treated as links between the nodes, the document elements and relationships of a structural model form a graph data structure. Structural models are preferably predefined and prestored, and may be created by an end user, using a specific structural model definition support system, based on observation of model documents which best represent the type of document to be represented by a particular structural model. U.S. Pat. No. 5,574,802 discloses further that during creation of the structural model, the end-user may be prompted to designate rectangles for the document elements contained in sample document images, and the structural model definition support system then measures the distances between the designated rectangles for each of the major geometric relationships (i.e., either an "above-below" or "right-left" relationship) and stores these measurements.
The Story reference discloses the use of explicitly-defined "partial order grammars" ("pogs") to guide labeling of rectangular blocks that are extracted from journal table of contents page images. During pogs parsing of a page image in the RightPages system, each rectangular block identified and extracted is considered a terminal symbol and two relationships between blocks are defined: a left-right relationship and an above-below relationship. The grammar groups the rectangles into the image constituents.
The Nagy reference discloses a document image analysis system, called the "Gobbledoc" system, that uses an explicit document class layout specification in the form of a collection of publication-specific document grammars that are formal descriptions of all legal page formats that articles in a given technical journal can assume. The document grammar guides a segmentation and labeling process that subdivides the page image into a collection of nested rectangular blocks. Applying the entire document grammar to an input page image results in a subdivision of the page image into nested rectangular blocks. The subdivision is represented in a data structure called the X-Y tree. The rectangular regions are labeled with logical categories such as abstract, title-block, byline-block, reference-entry and figure-caption.
Many existing systems rely on a two-part process to perform document image layout analysis. A first phase that performs feature analysis or extraction, commonly referred to as page segmentation, finds the physical dimensions of blocks on the page, and the second phase applies the class layout specification to these blocks. When a class layout specification is defined as a document grammar in any of these above-referenced examples of layout analysis systems, the grammar is typically used in this second phase, in the form of some type of parsing operation, to identify the logical structure of the physical blocks identified in a first phase of processing.
2. Grammars Used as Image Layout Models.
An example of a document image layout system that makes use of an image grammar as an explicit class layout specification is disclosed in commonly-assigned application Ser. No. 08/491,420, entitled "Document Image Layout Analysis Using an Image Grammar and a Transcription" (hereafter, the '420 application). The '420 application discloses a document image layout analysis method and system for identifying and labeling text image constituents in an input two-dimensional (2D) text image using a formal image model and a page layout transcription as explicit inputs to the layout analysis process. The formal image model models the spatial image structure of a class of document images as an image grammar, while the page layout transcription identifies the specific image constituents that occur in the input text image and constrains the document layout analysis process to look for and identify those specific image constituents, thereby enhancing accuracy of the layout specification output.
The '420 application applies the use of an image grammar as a formal model of image structure to the domain of document image layout analysis. The use of various forms of grammars in text recognition is well-known and is discussed, for example, in the Background portion of application Ser. No. 08/491,420, which is hereby incorporated by reference herein. Examples of the use of grammars in text operations such as recognition and layout analysis are disclosed in A. Conway, "Page grammars and page parsing: A syntactic approach to document layout recognition," in Second Intl. Conf. on Doc. Analysis and Recognition (ICDAR), Japan, October 1992, pp. 761-764; in S. Kuo and O. E. Agazzi, "Keyword spotting in poorly printed documents using pseudo 2D hidden Markov models," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 16, No. 8, August 1994, pp. 842-848 (hereafter, "the Kuo and Agazzi keyword spotting reference"); in U.S. Pat. No. 5,020,112, entitled "Image Recognition Using Two-Dimensional Stochastic Grammars," issued to P. A. Chou, one of the inventors herein; in U.S. Pat. No. 5,321,773, issued to G. E. Kopec and P. A. Chou, also inventors herein, and entitled "Image Recognition Method Using Finite State Networks;" and in G. E. Kopec and P. A. Chou, "Document image decoding using Markov source models," IEEE Trans. Pattern Analysis and Machine Intelligence 16(6):602-617, June 1994 (hereafter, "Kopec and Chou, 'Document Image Decoding'").
The Conway reference ("Page grammars and page parsing . . . ", 1992) discloses a syntactic approach to deducing the logical structure of printed documents from their physical layout. Page layout is described by a 2-dimensional grammar, similar to a context-free string grammar, and a chart parser is used to parse segmented page images according to the grammar. The layout conventions of a class of documents are described to the system by a page layout grammar similar to a context-free string grammar. String grammars describe structure in terms of concatenation of sub-strings, and the sub-strings here consist of nodes linked together into a graph or tree structure; the grammar rules specify how these substructures are embedded in larger structures. The nodes in the parse structure are page blocks, which are groups of neighboring segments. The layout relationships between page blocks are expressed by a set of page relations that are defined on the bounding rectangles of the blocks and that define a set of neighbors for each block on a page. Constraints can also be attached to grammar rules, to allow use of information such as font size and style, alignment and indentation.
U.S. Pat. No. 5,020,112 discloses a method of identifying bitmapped image objects using a two-dimensional (2D) image model based on a 2D stochastic, context-free grammar. This recognizer is also discussed in an article by P. A. Chou, entitled "Recognition of Equations Using a Two-Dimensional Stochastic Context-Free Grammar," in Visual Communications and Image Processing IV, SPIE, Vol. 1199, 1989, pp. 852-863. The 2D image model is represented as a stochastic 2D grammar having production rules that define spatial relationships based on nonoverlapping rectangles between objects in the image; the grammar is used to parse the list of objects to determine the one of the possible parse trees that has the largest probability of occurrence. The objects are stored in an object template library and are each defined as an n by m bitmapped template of an image object having an associated probability of occurrence in the image to be recognized. The term "stochastic" when used in this context refers to the use of probabilities associated with the possible parsing of a statement to deal with real world situations characterized by noise, distortion and uncertainty.
U.S. Pat. No. 5,321,773 discloses a 2D image model represented as a stochastic finite state transition network that defines image production in terms of a regular grammar. The 2D image grammar disclosed therein models a class of document images using a bitmapped character template model based on the sidebearing model of letterform shape description and positioning that is used in digital typography. In the sidebearing character model, pairs of adjacent character images are positioned with respect to their image origin positions to permit overlapping rectangular bounding boxes as long as the foreground (e.g., black) pixels of one character are not shared with, or common with, the foreground pixels of the adjacent character.
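The positioning constraint of the sidebearing model, as summarized above, can be illustrated with a small Python sketch. The representation of a template as a set of foreground pixel offsets from the character origin, and the names `place` and `valid_adjacent`, are assumptions made here for illustration; they are not taken from U.S. Pat. No. 5,321,773.

```python
def place(template, origin):
    """Foreground pixels of a template positioned at an image origin.

    `template` is a set of (dx, dy) foreground offsets relative to the
    character origin; `origin` is the (x, y) image position of that origin.
    """
    ox, oy = origin
    return {(ox + dx, oy + dy) for dx, dy in template}


def valid_adjacent(template_a, origin_a, template_b, origin_b):
    """True if the two placed characters share no foreground pixel.

    Bounding boxes are deliberately not compared: under the sidebearing
    model they may overlap as long as the foreground pixels are disjoint.
    """
    return place(template_a, origin_a).isdisjoint(place(template_b, origin_b))


# An L-shaped glyph and a small square glyph, both hypothetical.
L_SHAPE = {(0, 0), (1, 0), (2, 0), (0, 1), (0, 2)}
SQUARE = {(0, 0), (1, 0), (0, 1), (1, 1)}
```

Placing `SQUARE` at origin (1, 1) puts it inside the bounding box of `L_SHAPE` at (0, 0) without sharing any foreground pixel, which the model permits; placing it at (0, 0) shares a pixel and is rejected.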
A general image production model, of which the image models disclosed in U.S. Pat. Nos. 5,020,112 and 5,321,773 are special cases, is discussed in P. A. Chou and G. E. Kopec, "A stochastic attribute grammar model of document production and its use in document image decoding," in Document Recognition II, Luc M. Vincent, Henry S. Baird, Eds, Proc. SPIE 2422, 1995, pp. 66-73 (hereafter, "Chou and Kopec, 'A stochastic attribute grammar model'"), which is hereby incorporated by reference herein for all that it teaches as if set out in full.
3. The Training of Image Models.
Image models may include components and make use of various types of parameters that require or permit training in order to improve the operational performance of the model. In certain types of stochastic image models where, for example, a Hidden Markov Model or variation thereof is used to represent a character or a word prototype to be used for recognition, training the character or word model involves training probability parameters that represent the probabilities of features or sequences of letters occurring in a character or word. An example of this type of model training may be found in the Kuo and Agazzi Keyword Spotting reference cited above, which discloses, at pg. 844, that parameters for keyword models are trained from a training set that contains, for each model, the same keyword at different levels of degradation, represented as a feature vector called an observation sequence; the features are extracted from segments of word image samples. Feature-based template training of a stochastic character model is also disclosed in C. Bose and S. Kuo, "Connected and degraded text recognition using hidden Markov model," in Proceedings of the International Conference on Pattern Recognition, Netherlands, September 1992, pp. 116-119. The image models being trained are character or line (word) models, and not 2D models that represent a multiple-line 2D (e.g., page) image.
Training of a 2D image model is disclosed in U.S. Pat. No. 5,020,112, which is discussed briefly above and which describes a stochastic context-free grammar as the 2D image model therein. U.S. Pat. No. 5,020,112 discloses the unsupervised training of the probability parameters associated with the object templates that are the terminal symbols of that grammar model.
Image models of the type disclosed in U.S. Pat. No. 5,321,773 that make use of character template models may benefit from training the character templates in a particular font that occurs in a class of documents to be recognized, in effect training the image model to be used as a font-specific recognizer. An example of this type of training is disclosed in commonly-assigned application Ser. No. 08/431,223, "Automatic Training of Character Templates Using a Transcription and a Two-Dimensional Image Source Model," and U.S. Pat. No. 5,594,809, entitled "Automatic Training of Character Templates Using a Text Line Image Source, a Text Line Transcription and an Image Source Model." In one implementation of the invention disclosed in these references, the training of the character templates includes the training of the template's character set width. Character set width is defined as the distance from the origin of a character template to the origin of the next adjacent character template positioned in the image.
Many types of document processing tasks incorporate layout analysis operations as part of their functionality. Text recognition, for example, uses layout information to locate the image constituents (typically glyphs) being recognized. With a sufficiently precise explicit or implicit image model, an image layout analysis function is able to accurately locate and label many types of image constituents on a document page image. However, the accuracy and completeness of the layout analysis result is typically dependent on the characteristics of the model and on the quality of the page image. For many of the image models described above, it is necessary for a user to manually specify the precise spatial relationships among image constituents for each potential type of document for which an operation that uses layout analysis is to be performed, which can be a tedious and time-consuming process. Some models, such as those described in the Story and Nagy references, may not even provide the ability to describe many image structures in sufficient detail to produce a desired level of precision in the layout analysis. Yet the ability to produce a precise and comprehensive description of the spatial relationship of image constituents has many advantages in the recognition of text images and in the generation of documents that have a specific spatial image structure.
For example, a desirable feature of a commercial optical character recognition (OCR) application is to provide the recognized text of a scanned input text document in the form of a data structure suitable for use by a word processing (WP) application; the WP data structure enables the WP application to render the recognized text in a document image that substantially resembles the original scanned document in layout format and appearance and permits the recognized text to be edited. In order to place the recognized text in the WP-compatible data structure, the OCR application must perform document layout analysis at a sufficient level to make decisions about the organizational structure of the text and where text is located on the page. Conventional OCR applications that provide this type of functionality typically perform very rudimentary document layout analysis and image processing operations on the original scanned document image to identify only the most basic document structures. Moreover, these document structures are likely to be identified without regard to their relationship to other document structures, which could present editing difficulties when a user attempts to edit the document using the WP-compatible data structure.
In another example, an additional desirable feature of an OCR application is to be able to recognize text in documents with complex layout and formatting. Recognition of mathematical equations and of text formatted in complex tabular structures presents particularly difficult challenges because of the use of characters in different type fonts and sizes that are positioned in the text image above or below a normal baseline.
Existing image models are unable to describe the spatial structure of a document image and its image constituents with both sufficient precision and flexibility to accurately accommodate a wide variety of document layouts without the necessity of manual layout specification.
The present invention makes use of a 2D image grammar that models the geometric spatial relationships among image constituents that occur in a class of document images as explicit parameters, referred to as layout parameters, in the grammar. Depending upon the details of the particular implementation of the 2D image grammar, the explicit parameters in the mathematical model either directly represent, or form the basis for deriving, one or more text image layout parameters that represent an actual geometric spatial relationship between, or an actual physical measurement of, image constituents in an image in the class of images being modeled. Functionally, the 2D image grammar must be capable of representing at least the layout structure (e.g., the physical locations) of the image constituents of a class of document images, and must specify the layout structure in a manner that makes a measurable spatial relationship between two image constituents explicit in the model.
Thus, the present invention is further premised on the discovery that the ability to capture and represent the spatial relationship between two image constituents in terms of an explicit parameter in the 2D image grammar means that the image grammar is capable of automatically learning the value of the parameter from examples of actual physical layout structures of text document images in the class of images modeled by the grammar. That is, the user of the 2D image model need not manually specify the actual spatial relationship between two image constituents in a particular document before the 2D image model may be used to operate on that document. Instead, by specifying a spatial relationship between image constituents in the model by way of one or more parameterized relationships, one or more of the parameters in these relationships may be given an arbitrary or estimated initial value and then automatically trained (without manual intervention) to a specific value that accurately represents the actual spatial relationship between the image constituents that the parameter represents.
Training involves, for each 2D text image in a set of input training images, producing a data structure representation of the input training image that indicates the layout structure of the training image; in this image representation, image constituent labels, acquired from a transcription associated with the input training image, are aligned with their respective image constituents in the training image using the layout structure specified by the 2D image grammar, and physical locations of labeled image constituents in the training image are then measured or determined. After all input training images have been aligned and estimated locations determined, actual values of the parameters are computed from all of the estimated physical location data, and the 2D image model is updated to reflect these computed parameter values. This sequence of alignment, computing parameters and updating the model may be repeated until a stopping condition is met. The resulting trained 2D image model includes parameters that accurately specify (within acceptable tolerances) the layout structure of the class of modeled images.
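The sequence of alignment, measurement, parameter computation and model update described above can be sketched as a loop. The following Python is a toy illustration only and is not the training method of the invention: it assumes a single hypothetical layout parameter, `line_spacing`, represents each training image as a list of candidate baseline positions (some spurious), reduces the transcription to a line count, and uses a simple mean as the parameter estimate. All function names are assumptions introduced for the sketch.

```python
from statistics import mean


def align(baselines, n_lines, params):
    """Toy alignment: choose n_lines baselines, each the candidate nearest
    to the previous chosen baseline plus the current spacing estimate."""
    chosen = [baselines[0]]
    for _ in range(n_lines - 1):
        later = [y for y in baselines if y > chosen[-1]]
        if not later:
            break
        target = chosen[-1] + params["line_spacing"]
        chosen.append(min(later, key=lambda y: abs(y - target)))
    return chosen


def measure(chosen):
    """Document-specific measurements: gaps between successive baselines."""
    return [b - a for a, b in zip(chosen, chosen[1:])]


def train(training_set, params, max_iters=20, tol=1e-6):
    """Repeat align -> measure -> re-estimate until the parameter settles."""
    for _ in range(max_iters):
        gaps = []
        for baselines, n_lines in training_set:
            gaps.extend(measure(align(baselines, n_lines, params)))
        if not gaps:
            break  # nothing measurable; keep the current estimate
        updated = {"line_spacing": mean(gaps)}
        change = abs(updated["line_spacing"] - params["line_spacing"])
        params = updated  # update the model with the new estimate
        if change < tol:  # stopping condition
            break
    return params
```

For example, `train([([100, 112, 130, 160, 190], 4), ([50, 80, 95, 110, 140], 4)], {"line_spacing": 25})` starts from a rough initial estimate of 25 and settles on a spacing of 30, since the spurious candidates at 112 and 95 are never the nearest match during alignment.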
Training may be accomplished in either an unsupervised or a supervised mode. Supervised training uses as an explicit input to the training process a transcription of each input training image. However, supervised training requires no manual involvement by a user, since such a transcription may be obtained from a conventional character recognition operation, and the training operation itself aligns the elements of the transcription with the image constituents in the input training image associated with the transcription, according to rules specified by the image grammar. Unsupervised training simply makes use of a recognition operation to produce a transcription of a training image before proceeding with training the layout parameters in the model.
The implications of this discovery are significant for enhancing the efficiency and productivity of document processing operations. For text recognition operations that make use of an explicit 2D image model as input to the recognition process, a 2D image grammar that describes the logical structure of a class of text documents that have complex formatting features can be trained automatically to learn the physical layout structure of the class. When this trained 2D image model is then used in a recognition operation on documents in the class, recognition accuracy improves significantly over that of commercial text recognition systems that make use of less sophisticated layout analysis functionality.
The text layout parameter information that is learned about a class of documents as a result of training according to the present invention may also be useful independently of its use as part of the 2D image model for other document processing operations on that class of documents. For example, the text image layout parameters that are trained may include information about the font metrics of the type font used in the class of documents that is sufficient to produce a font program to recreate any document in that same font. In addition, the text image layout parameters that are trained may include information sufficient to produce, after text recognition, an editable data structure representation of a document in the modeled class that is readable by a word processing program.
Moreover, defining a model of the class of document images being trained as an explicit input to the training process allows for flexibility in defining the class of documents for which training is needed. By defining a new 2D image model for each class of documents, the same training procedure can be used without modification to train text image layout parameters for, for example, business letters on a particular corporate letterhead and journal papers published in a journal. In addition, the invention allows for either the automatic training of all parameters identified in the 2D image model, or for selecting and specifying for training one or more individual parameters in the 2D image model.
One implementation for making layout parameters explicit in the 2D image grammar is for a grammar production rule to specify the spatial relationship between first and second image constituents as a parameterized mathematical function, with the coefficients of the function representing the layout parameters. A value for the function is measured for each occurrence in a training image of an image constituent produced according to the production rule. The measured values for the function for a given rule, as measured from all of the training data, represent the observed values from which a value for a layout parameter can be computed. In effect, finding the value of the layout parameter from the measured values is an optimization problem where the measured values represent values for some overall function of the layout parameter for which an optimal value can be computed.
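As a hedged illustration of this optimization view, suppose a production rule expresses the horizontal offset of the second constituent as a linear function of the width of the first, offset = a * width + b, with coefficients a and b as the layout parameters. Both the linear form and the choice of ordinary least squares as the optimization criterion are assumptions made for this sketch; the text above does not prescribe a particular function or criterion.

```python
def fit_linear(samples):
    """Least-squares estimate of (a, b) in  offset = a * width + b.

    `samples` is a list of (width, offset) pairs, one per occurrence in the
    training images of a constituent produced by the rule.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two measurements")
    sx = sum(x for x, _ in samples)
    sy = sum(y for _, y in samples)
    sxx = sum(x * x for x, _ in samples)
    sxy = sum(x * y for x, y in samples)
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("all widths are identical; the slope is undetermined")
    a = (n * sxy - sx * sy) / denom
    b = (sy - a * sx) / n
    return a, b
```

With measurements `[(10, 12), (20, 22), (30, 32)]` pooled from all training images, the fit recovers a = 1 and b = 2, that is, the second constituent begins two units beyond the right edge of the first.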
An illustrated implementation of the invention makes use of a stochastic context-free attribute grammar in which synthesized and inherited attributes and synthesis and inheritance functions are associated with production rules in the grammar. The attributes express physical spatial locations of image constituents in the image, and the parameterized functions express physical relationships among image constituents in the image. An annotated parse tree produced by the grammar explicitly represents both the logical and layout structures of a text image in the class of images being modeled. During training, an annotated parse tree of an input training image is produced, and the measured values of the parameterized functions are taken from data in the parse tree. Implementation examples of the use of an attribute grammar as a 2D image model described herein illustrate the training of text image layout parameters for classes of documents that include journal papers and equations.
Therefore, in accordance with one aspect of the present invention, there is provided a method for operating a processor-controlled machine to determine an unknown value of a text image layout parameter used with a two-dimensional (2D) image model. The machine operated by the invention includes a signal source for receiving data; memory for storing data; and a processor connected for accessing instruction data which is stored in the memory for operating the machine. The processor is further connected for receiving data from the signal source; and connected for storing data in the memory. The method comprises operations performed by the processor in carrying out instructions to operate the machine according to the invention. The processor, in executing the instructions, obtains a data structure indicating a 2D image model modeling as an image grammar an image layout structure common to a class of 2D text images. The 2D image model includes a production rule that indicates that first and second image constituents occurring in the 2D text image produce a third image constituent occurring therein. The production rule includes a text image layout parameter that indicates the spatial relationship between the first and second image constituents. At the start of the training operation, a value of the text image layout parameter is unknown and computing a value is the object of the training operation. The processor then receives a plurality of input two-dimensional (2D) text image data structures from the signal source. Each input 2D text image has the image layout structure common to the class of 2D text images and includes at least one occurrence of first and second image constituents. For each respective input 2D text image, the processor produces a data structure, using the 2D image model, indicating first and second image positions in the input 2D text image identifying respective locations of the first and second image constituents therein. 
The processor then obtains document-specific measurement data from the data structure. The document-specific measurement data indicates the spatial relationship between the first and second image constituents identified therein. When all of the input 2D text images have been processed, the processor computes a value for the text image layout parameter using the document-specific measurement data obtained from the data structures for the respective input 2D text images. The value computed for the text image layout parameter represents a class-specific value for all text images in the class of 2D input text images being modeled by the 2D image model.
In another aspect of the present invention, the production rule specifies the spatial relationship between the first and second image constituents as a mathematical function of a characteristic of at least one of the first and second image constituents. The mathematical function includes the text image layout parameter as a parameter therein. The document-specific measurement data obtained from the data structure include values for the mathematical function measured from the data structure. The values of the function indicate the spatial relationship between the first and second image constituents in each respective input 2D text image. Computing a value for the text image layout parameter then includes using the values for the mathematical function measured from each respective input training image.
In still another aspect of the present invention, the 2D image model is represented as a stochastic context-free attribute grammar.