The present invention relates to the field of document image processing, and more particularly to processing document images that have been symbolically compressed.
Storage and transmission of electronic document images have become increasingly prevalent, spurring deployment and standardization of new and more efficient document compression techniques. Symbolic compression of document images, for example, is becoming increasingly common with the emergence of the JBIG2 standard and related commercial products. Symbolic compression techniques improve compression efficiency by 50% to 100% in comparison to the commonly used Group 4 compression standard (CCITT Specification T.6). A lossy version of symbolic compression can achieve 4 to 10 times better compression efficiency than Group 4.
In symbolic compression, document images are coded with respect to a library of pattern templates. Templates in the library are typically derived by grouping (clustering) together connected components (e.g., alphabetic characters) in the document that have similar shapes. One template is chosen or generated to represent each cluster of similarly shaped connected components. The connected components in the image are then represented by a sequence of template identifiers, each paired with its spatial offset from the preceding component. In this way, an approximation of the original document is obtained without duplicating storage for similarly shaped connected components. Minor differences between individual components and their representative templates, as well as all other components that are not encoded in this manner, are optionally coded as residuals.
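The coding scheme described above can be illustrated with a minimal Python sketch. It is not the JBIG2 encoder: the names `encode` and `hamming`, the pixel-difference threshold `max_diff`, the greedy first-match clustering, and the choice to measure the first component's offset from the page origin are all assumptions made for illustration, and residual coding is omitted.

```python
from typing import List, Tuple

Bitmap = Tuple[str, ...]             # rows of '0'/'1' characters
Component = Tuple[Bitmap, int, int]  # (bitmap, x, y) in reading order


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def same_size(a: Bitmap, b: Bitmap) -> bool:
    return len(a) == len(b) and len(a[0]) == len(b[0])


def encode(components: List[Component], max_diff: int = 1):
    """Greedy sketch: reuse the first template within max_diff pixels,
    otherwise add the component's bitmap as a new template."""
    templates: List[Bitmap] = []
    sequence: List[Tuple[int, int, int]] = []  # (template id, dx, dy)
    prev_x, prev_y = 0, 0  # assumption: first offset is from the origin
    for bitmap, x, y in components:
        for tid, template in enumerate(templates):
            if same_size(template, bitmap) and hamming(template, bitmap) <= max_diff:
                break  # tid now names the matching template
        else:
            templates.append(bitmap)
            tid = len(templates) - 1
        # Store the offset from the preceding component, not the position.
        sequence.append((tid, x - prev_x, y - prev_y))
        prev_x, prev_y = x, y
    return templates, sequence
```

In this sketch, two components whose bitmaps differ by at most `max_diff` pixels share one template identifier, so only the identifier and offset are stored for the second occurrence.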
Many document management activities, such as document classification, duplicate detection and language identification, are based on the semantic content of document images. Consequently, in traditional document management systems, compressed document images are first decompressed and then subjected to optical character recognition (OCR) to recover the semantic information needed for classification, language identification and duplicate detection. In the context of a database of symbolically compressed document images, the need to decompress and perform OCR consumes considerable processing resources. Also, because OCR engines are usually limited in the number and variety of typefaces they recognize, recovery of semantic information through conventional OCR techniques may not be possible for some symbolically compressed documents.
A method and apparatus for extracting information from symbolically compressed document images are disclosed. An input document image is represented by a sequence of template identifiers to reduce storage consumed by the input document image. The template identifiers are replaced with alphabet characters according to language statistics to generate a text string representative of text in the input document image. In one embodiment, the template identifiers are replaced with alphabet characters according to a hidden Markov model. Also, a conditional n-gram technique may be used to obtain indexing terms for document matching and other applications.
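As a rough illustration of replacing template identifiers with alphabet characters according to language statistics, the following Python sketch matches identifiers to letters by frequency rank. This is a deliberately simpler stand-in, not the hidden Markov model of the embodiment mentioned above. The function name `decode_by_frequency`, the approximate English letter ordering, and the `?` placeholder for unmapped identifiers are assumptions made for illustration.

```python
from collections import Counter
from typing import List

# Assumption: an approximate ordering of English letters from most to
# least frequent; real language statistics would be estimated from text.
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"


def decode_by_frequency(template_ids: List[int],
                        alphabet_by_frequency: str = ENGLISH_BY_FREQUENCY) -> str:
    """Map the k-th most frequent template identifier to the k-th most
    frequent letter and return the resulting text string."""
    counts = Counter(template_ids)
    # Rank identifiers by descending count; break ties by identifier value.
    ranked = sorted(counts, key=lambda tid: (-counts[tid], tid))
    mapping = {
        tid: alphabet_by_frequency[rank] if rank < len(alphabet_by_frequency) else "?"
        for rank, tid in enumerate(ranked)
    }
    return "".join(mapping[tid] for tid in template_ids)
```

Unigram frequency matching of this kind is unreliable on short inputs. A model that also uses character-to-character transition statistics, as a hidden Markov model does, can resolve ambiguities that single-letter frequencies cannot.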
These and other features and advantages of the invention will be apparent from the accompanying drawings and from the detailed description that follows.