1. Field of the Invention
The present invention relates generally to image compression, and more particularly, to a method and apparatus for compressing a corpus of document images using structured tokenized representations that are resolution-dependent.
2. Description of Related Art
Structured document representations provide digital representations for documents that are organized at a higher, more abstract level than merely an array of pixels. Known structured document representation techniques pose a tradeoff between the speed with which a document can be rendered (i.e., converted to a displayable or printable output) and the expressiveness with which it can be represented. One high-level resolution-independent structured document representation is a page description language (PDL), such as PostScript®. PDLs tend to be high-level structured document representations because they contain expressions which include a great deal of information about document structure. In contrast, a purely textual representation of a document encoded in ASCII (American Standard Code for Information Interchange) has no formatting information. Because of its simplicity, an ASCII encoded document generally requires less time to render than a PDL document with formatting information.
In contrast to resolution-independent document representations, documents represented in a DigiPaper file format are resolution-dependent. The DigiPaper file format is a token-based structured document representation that is both highly expressive and fast to render. The DigiPaper structured document format is described in detail in U.S. patent application Ser. Nos. 08/652,864 and 08/752,497. In the DigiPaper format, pages of a document are represented using a "dictionary" of tokens or symbols that appear in the document. In addition to the dictionary of tokens, each page includes position information specifying where tokens appear on the page. Each token in the dictionary of tokens is a portion of a document image, such as a bitmap of a character.
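By way of illustration only, the token dictionary and per-page position information described above can be sketched as follows. This is a minimal sketch and not the DigiPaper file format itself: the class name `TokenizedDocument`, its methods, and the representation of a bitmap as a tuple of pixel rows are assumptions introduced for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TokenizedDocument:
    """Hypothetical token-based representation: a shared dictionary plus positions."""
    dictionary: list = field(default_factory=list)  # each distinct bitmap stored once
    pages: list = field(default_factory=list)       # per page: list of (token_id, x, y)
    _index: dict = field(default_factory=dict)      # bitmap -> token_id lookup

    def token_id(self, bitmap):
        # Add the bitmap to the dictionary only the first time it is seen.
        if bitmap not in self._index:
            self._index[bitmap] = len(self.dictionary)
            self.dictionary.append(bitmap)
        return self._index[bitmap]

    def add_page(self, placed):
        # placed: iterable of (bitmap, x, y); a bitmap is a tuple of rows of 0/1 pixels.
        self.pages.append([(self.token_id(b), x, y) for b, x, y in placed])

    def render(self, page_no, width, height):
        # Rendering copies each referenced token bitmap onto the page at its position.
        canvas = [[0] * width for _ in range(height)]
        for tid, x, y in self.pages[page_no]:
            for dy, row in enumerate(self.dictionary[tid]):
                for dx, pixel in enumerate(row):
                    if pixel:
                        canvas[y + dy][x + dx] = 1
        return canvas
```

In this sketch, a page on which the same character bitmap appears many times stores that bitmap once and refers to it by index at each position, and rendering reduces to copying bitmaps onto the page.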
In addition to being resolution-dependent, the DigiPaper file format achieves some degree of lossless data compression. Unlike prior symbol-based token matching techniques, which have been used only for lossy image compression, a DigiPaper representation of a document image can be used to achieve lossless compression of original document images produced from structured document representations. The DigiPaper file format achieves high compression ratios because each symbol is stored just once per document in the dictionary of tokens, rather than once for each occurrence in the document. Further compression is achieved by encoding the sequence of positions of tokens in the dictionary using, for example, Huffman coding.
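For illustration, Huffman coding of a sequence of integer symbols can be sketched as follows. This is a generic textbook construction under stated assumptions, not the coder of the referenced applications: the function names are hypothetical, and the choice of what the integers represent (for example, dictionary indices or offsets between successive tokens) is left open.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    freq = Counter(symbols)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, tiebreak, {symbol: partial code}); the tiebreak
    # keeps tuple comparison from ever reaching the dict.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        # Merge the two lightest subtrees, prefixing one bit to each code.
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(symbols, code):
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    # The code is prefix-free, so the first match is always the right one.
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return out
```

Frequently occurring values receive short codes. Text set in a single font tends to repeat a small number of offsets, which is the kind of skewed distribution on which such a code saves space.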
In general, the DigiPaper file format described in U.S. patent application Ser. Nos. 08/652,864 and 08/752,497 can be used in any environment where quick, high-quality document rendering is required. For example, in production printing, the compression achieved using the DigiPaper file format enables documents to be rendered in one location and printed in another location. In addition to being compact, the DigiPaper file format is easy to decode, thereby enabling other applications such as prepress viewing, desktop publishing, document management systems, and distributed printing applications, as well as fax communications. This ease of decoding guarantees document fidelity during prepress viewing, without requiring the development of special prepress viewers.
A large number of documents are represented on the World Wide Web using HTML (HyperText Markup Language). Generally, HTML allows markup of the structure of a document, but not markup of the layout of a document. For example, a block of text can be specified as a "first-level" heading without any font or justification being specified. Consequently, the manner in which an HTML document is rendered depends on a user's particular browser or computer. In contrast, documents represented in the DigiPaper file format can be rendered with fidelity comparable to print media, because of the tokenized file format. In particular, with the emergence of standard programmable viewers (i.e., Java-enabled internet browsers), the DigiPaper file format can be used to define self-rendering documents. That is, a Java applet can be used to perform the rendering of a document in a DigiPaper file format independent of the particular internet browser or computer. In addition, documents encoded in the DigiPaper file format can be rendered at speeds of under one second per page for text and graphics. This means fewer unwanted delays for users downloading documents from remote servers on the internet.
Because of the ease with which documents can be accessed using an internet browser such as Netscape's Navigator or Microsoft's Explorer, more and more documents are being stored on the internet and on intranets. These documents may in some instances form a part of a large corpus of heterogeneous documents. Users browsing a large corpus of documents on the internet and on intranets tend to browse or retrieve more than one document from the corpus during a single session. For example, a user searching a corpus of documents tends to examine several documents before identifying one or more of interest to be printed or retrieved. In the event the documents in the corpus are encoded in the DigiPaper file format, it would be desirable to have a compression technique that more efficiently compresses a corpus of documents where each document in the corpus is individually encoded in the DigiPaper file format. More generally, it would be desirable to have a compression technique which maximizes compression for a collection of heterogeneous document images.