The present invention relates to structured document representations and, more particularly, relates to structured document representations suitable for rendering into printable or displayable document raster images, such as bit-mapped binary images or other binary pixel or raster images. The invention further relates to data compression techniques suitable for document image rendering and transmission.
Structured Document Representations
Structured document representations provide digital representations for documents that are organized at a higher, more abstract level than merely an array of pixels. As a simple example, if this page of text is represented in the memory of a computer or in a persistent storage medium such as a hard disk, CD-ROM, or the like as a bitmap, that is, as an array of 1s and 0s indicating black and white pixels, such a representation is considered to be an unstructured representation of the page. In contrast, if the page of text is represented by an ordered set of numeric codes, each code representing one character of text, such a representation is considered to have a modest degree of structure. If the page of text is represented by a set of expressions expressed in a page description language, so as to include information about the appropriate font for the text characters, the positions of the characters on the page, the sizes of the page margins, and so forth, such a representation is a structured representation with a great deal of structure.
Known structured document representation techniques pose a tradeoff between the speed with which a document can be rendered and the expressiveness or subtlety with which it can be represented. This is shown schematically in FIG. 1 (PRIOR ART). As one looks from left to right along the continuum 1 illustrated in FIG. 1, the expressiveness of the representations increases, but the rendering speed decreases. Thus, ASCII (American Standard Code for Information Interchange), a purely textual representation without formatting information, renders quickly but lacks formatting information or other information about document structure, and is shown to the left of FIG. 1. Page description languages (PDLs), such as PostScript® (Adobe Systems, Inc., Mountain View, Calif.; Internet: http://www.adobe.com) and Interpress (Xerox Corporation, Stamford, Conn.; Internet: http://www.xerox.com), include a great deal of information about document structure, but require significantly more time to render than purely textual representations, and are shown to the right of continuum 1.
Continuum 1 can be seen as one of document representations having increasing degrees of document structure:
At the left end of continuum 1 are purely textual representations, such as ASCII. These convey only the characters of a textual document, with no information as to font, layout, or other page description information, much less any graphical, pictorial (e.g., photographic) or other information beyond text.
Also near the left end of continuum 1 is HTML (HyperText Markup Language), which is used to represent documents for the Internet's World Wide Web. HTML provides somewhat more flexibility than ASCII, in that it supports embedded graphics, images, audio and video recordings, and hypertext linking capabilities. However, HTML, too, lacks font and layout (i.e., actual document appearance) information. That is, an HTML document can be rendered (converted to a displayable or printable output) in different yet equally “correct” ways by different Web client (“browser”) programs or different computers, or even by the same Web client program running on the same computer at different times. For example, in many Web client programs, the line width of the rendered HTML document varies with the dimensions of the display window that the user has selected. Increase the window size, and line width increases accordingly. The HTML document does not, and cannot, specify the line width. HTML, then, does allow markup of the structure of the document, but not markup of the layout of the document. One can specify, for example, that a block of text is to be a first-level heading, but one cannot specify exactly the font, justification, or other attributes with which that first-level heading will be rendered. (Information on HTML is available on the Internet from the World Wide Web Consortium at http://www.w3.org/pub/WWW/MarkUp/.)
At the right end of continuum 1 are page description languages, such as PostScript and Interpress. These PDLs are full-featured programming languages that permit arbitrarily complex constructs for page layout, graphics, and other document attributes to be expressed in symbolic form.
In the middle of continuum 1 are printer control languages, such as PCL5 (Hewlett-Packard, Palo Alto, Calif.; Internet: http://www.hp.com/), which includes primitives for curve and character drawing.
Also in the middle of continuum 1, but somewhat closer to the PDLs, are cross-platform document exchange formats. These include Portable Document Format (Adobe Systems, Inc.) and Common Ground (Common Ground Software, Belmont, Calif.; Internet: http://www.commonground.com/). Portable Document Format, or PDF, can be used in conjunction with a software program called Adobe Acrobat™. PDF includes a rich set of drawing and rendering operations invocable by any given primitive (available primitives include “draw,” “fill,” “clip,” “text,” etc.), but does not include programming language constructs that would, for example, allow the specification of compositions of primitives.
Known structured document representation techniques assume that the rendering engine (e.g., display driver software, printer PDL decomposition software, or other software or hardware for generating a pixel image from the structured document representation) has access to a set of character fonts. Thus a document represented in a PDL can, for example, have text that is to be printed in 12-point Times New Roman font with 18-point Arial Bold headers and footnotes in 10-point Courier. The rendering engine is presumed to have the requisite fonts already stored and available for use. That is, the document itself typically does not supply the font information. Therefore, if the rendering engine is called upon to render a document for which it does not have the necessary font or fonts available, the rendering engine will be unable to produce an authentic rendering of the document. For example, the rendering engine may substitute alternate fonts in lieu of those specified in the structured document representation, or, worse yet, may fail to render anything at all for those passages of the document for which fonts are unavailable.
The fundamental importance of fonts to PDLs is illustrated, for example, by the extensive discussion of fonts in the Adobe Systems, Inc. PostScript Language Reference Manual (2d ed. 1990) (hereinafter PostScript Manual). At page 266, the PostScript Manual says that a required entry in all base fonts, encoding, is an “[a]rray of names that maps character codes (integers) to character names--the values in the array.” Later, in Appendix E (pages 591-606), the PostScript Manual gives several examples of fonts and encoding vectors.
A notion basic to a font is that of labeling, or the semantic significance given to a particular character or symbol. Each character or symbol of a font has a unique associated semantic label. Labeling makes font substitution possible: Characters from different fonts having the same semantic label can be substituted for one another. For example, each of the characters 21, 22, 23, 24, 25, 26 in FIG. 2 (PRIOR ART) has the same semantic significance: Each represents the upper-case form of “E,” the fifth letter of the alphabet commonly used in English. However, each appears in a different font. It is apparent from the example of FIG. 2 that font substitution, even if performed for only a single character, can dramatically alter the appearance of the rendered image of a document.
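The relationship between an encoding vector, semantic labels, and font substitution can be illustrated with a minimal sketch. The font dictionaries, the glyph strings, and the helper glyph_for below are hypothetical and stand in for real glyph outlines; only the idea of an array that maps character codes to character names is taken from the discussion above.

```python
# Encoding vector: index = character code, value = character name (semantic label).
encoding = [".notdef"] * 256
encoding[ord("E")] = "E"
encoding[ord(" ")] = "space"

# Hypothetical fonts: character name -> glyph (a descriptive string stands in
# for the actual outline or bitmap).
display_font = {"space": "display space"}  # deliberately has no "E"
sans_font = {"E": "sans-serif E outline", "space": "sans-serif space"}


def glyph_for(code, font, fallback):
    """Look up a glyph by label; substitute from `fallback` if the font lacks it."""
    name = encoding[code]
    if name in font:
        return font[name]
    # Substitution is possible only because both fonts share the same label.
    return fallback.get(name)


requested = glyph_for(ord("E"), display_font, sans_font)
```

Here the request for "E" in the display font falls through to the sans-serif font, so the rendered shape differs from what the document's author specified, which is the loss of authenticity described above.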
A known printer that accepts as input a PDL document description is shown schematically in FIG. 3 (PRIOR ART). Printer 30 accepts a PDL description 35 that is interpreted, or decomposed, by a rendering unit 31 to produce raster images 32 of pages of the document. Raster images 32 are then given to an image output terminal (IOT) 33, which converts the images 32 to visible marks on paper sheets that are output as printed output 36 for use by a human user. Unfortunately, the speed at which the rendering unit 31 can decompose the input PDL description cannot, in general, match the speed at which the IOT 33 can mark sheets of paper and dispense them as output 36. This is in part because the result of decomposing the PDL description is indeterminate. As noted above, a PDL description such as PDL description 35 does not correspond to a particular image or set of images, but is susceptible of differing interpretations and can be rendered in different ways. Thus rendering unit 31 becomes a bottleneck that limits the overall throughput of printer 30.
Accordingly, a better structured document representation technology is needed. In particular, what is needed is a way to eliminate the tradeoff between expressiveness and rendering speed and, moreover, a way to escape the tyranny of font dependence. The structured document representation should also be easily searchable for content.
Data Compression for Document Images
Data compression techniques convert large data sets, such as arrays of data for pixel images of documents, into more compact representations from which the original large data sets can be either perfectly or imperfectly recovered. When the recovery is perfect, the compression technique is called lossless; when the recovery is imperfect, the compression technique is called lossy. That is, lossless compression means that no information about the original document image is irretrievably lost in the compression/decompression cycle. With lossy compression, information is irretrievably lost during compression.
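The distinction can be illustrated with a small sketch on a toy binary image. It assumes Python's standard zlib module as a stand-in lossless coder and a naive downsample-and-replicate scheme as a stand-in lossy coder; neither is one of the document image codecs discussed in this section, and the helper names are hypothetical.

```python
import zlib

# A toy 8x8 binary image, one string of '0'/'1' pixels per row.
rows = [
    "00000000",
    "01111110",
    "01000000",
    "01111100",
    "01000000",
    "01000000",
    "01111110",
    "00000000",
]
original = "\n".join(rows).encode("ascii")

# Lossless: the compress/decompress cycle returns exactly the original data.
restored = zlib.decompress(zlib.compress(original))
lossless_ok = restored == original


def downsample(image):
    """Toy lossy step: keep every second pixel in each direction."""
    return [row[::2] for row in image[::2]]


def upsample(small):
    """Expand back to full size by replicating each kept pixel 2x2."""
    out = []
    for row in small:
        wide = "".join(pixel * 2 for pixel in row)
        out.extend([wide, wide])
    return out


# Lossy: the image has the right size again, but discarded pixels are gone.
approx = upsample(downsample(rows))
lossy_ok = approx == rows
```

After running the sketch, lossless_ok is true while lossy_ok is false: the lossy round trip yields an image of the original dimensions whose content no longer matches.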
Preferably, a data compression technique affords fast, inexpensive decompression and provides faithful rendering together with a high compression ratio, so that compressed data can be stored in a small amount of memory or storage and can be transmitted in a reasonable amount of time even when transmission bandwidth is limited.
Lossless compression techniques are often to be preferred when compressing digital images that originate as structured document representations produced by computer programs. Examples include the printed or displayed outputs of word processing programs, page layout programs, drawing and painting programs, slide presentation programs, spreadsheet programs, Web client programs, and any number of other kinds of commonly used computer software programs. Such outputs can be, for example, document images rendered from PDL (e.g., PostScript) or document exchange format (e.g., PDF or Common Ground) representations. In short, these outputs are images that are generated in the first instance from symbolic representations, rather than originating as optically scanned versions of physical documents.
Lossy compression techniques can be appropriate for images that do originate as optically scanned versions of physical documents. Such images are inherently imperfect reproductions of the original documents they represent. This is because of the limitations of the scanning process (e.g., noise, finite resolution, misalignment, skew, distortion, etc.). Inasmuch as the images themselves are of limited fidelity to the original, an additional loss of fidelity through a lossy compression scheme can be acceptable in many circumstances.
Known encoding techniques that are suitable for lossless image compression include, for example, CCITT Group-4 encoding, which is widely used for facsimile (fax) transmissions, and JBIG encoding, a binary image compression standard promulgated jointly by the CCITT and the ISO. (CCITT is a French acronym for Comité Consultatif International Téléphonique et Télégraphique. ISO is the International Organization for Standardization. JBIG stands for Joint Bi-level Image Experts Group.) Known encoding techniques that are suitable for lossy image compression include, for example, JPEG (Joint Photographic Experts Group) encoding, which is widely used for compressing gray-scale and color photographic images, and symbol-based compression techniques, such as that disclosed in U.S. Pat. No. 5,303,313, “METHOD AND APPARATUS FOR COMPRESSION OF IMAGES” (issued to Mark et al. and originally assigned to Cartesian Products, Inc. (Swampscott, Mass.)), which can be used for images of documents containing text characters and other symbols.
As compared with lossy techniques, lossless compression techniques of course provide greater fidelity, but also have certain disadvantages. In particular, they provide lower compression ratios, slower decompression speed, and other performance characteristics that can be inadequate for certain applications, as for example when the amount of uncompressed data is great and the transmission bandwidth from the server or other data source to the end user is low. It would be desirable to have a compression technique with the speed and compression ratio advantages of lossy compression, yet with the fidelity and authenticity that is afforded only by lossless compression.
The present invention provides a structured document representation that is at once highly expressive and fast and inexpensive to render. According to the invention, symbol-based token matching, a compression scheme that has hitherto been used only for lossy image compression, is used to achieve lossless compression of original document images produced from PDL representations or other structured document representations. A document containing text and graphics is compiled from its original structured representation into a token-based representation (which is itself a structured document representation), and the token-based representation, in turn, is used to produce a rendered pixel image. The token-based representation can achieve high compression ratios, and can be quickly and faithfully rendered. The token-based representation includes a semantic label set which allows for quick and efficient searches by content.
In one aspect of the invention, a processor is provided with a first set of digital information including a first structured representation of a document. A plurality of image collections (such as page images) are obtainable from the first representation. Each such obtainable image collection includes at least one image. Each image in each such collection is an image of at least a portion of the document. With the processor, a second set of digital information, including a second structured representation of the document, is produced from the first set of digital information. The second structured representation is a lossless representation of a particular image collection that is one of the plurality of image collections obtainable from the first structured representation. The second structured representation includes a plurality of tokens and a plurality of positions. At least one token in the plurality of tokens has an associated semantic label. The second set of digital information is produced by extracting the plurality of tokens from the first structured representation, each token comprising a set of pixel data representing a subimage of the particular image collection, and determining the plurality of positions from the first structured representation, each position being a position of a token subimage in the particular image collection. At least one token subimage has a plurality of pixels and occurs at more than one position in the image collection. The second set of digital information thus produced is then made available for further use.
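The shape of such a second structured representation, a set of tokens plus a list of positions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the Token class, the (token index, x, y) position tuples, and the render function are hypothetical names, and the step of extracting tokens from a PDL or other first representation is not shown.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Token:
    bitmap: tuple                # rows of '0'/'1' strings: the token's pixel data
    label: Optional[str] = None  # optional semantic label, e.g. a character


def render(tokens, positions, width, height):
    """Paint each token bitmap at its (token_index, x, y) position."""
    page = [["0"] * width for _ in range(height)]
    for index, x, y in positions:
        for dy, row in enumerate(tokens[index].bitmap):
            for dx, pixel in enumerate(row):
                if pixel == "1":
                    page[y + dy][x + dx] = "1"
    return ["".join(row) for row in page]


# Two tokens, three positions: token 0 is stored once but drawn twice.
tokens = [
    Token(bitmap=("111", "100", "111"), label="C"),
    Token(bitmap=("101", "111", "101"), label="H"),
]
positions = [(0, 0, 0), (1, 4, 0), (0, 8, 0)]
page = render(tokens, positions, width=11, height=3)
```

Because each repeated shape is stored once and merely referenced at every position where it occurs, the representation is compact, and because the page is rebuilt pixel for pixel from the stored bitmaps, the rendering reproduces the image exactly.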
According to another aspect of the present invention, the first structured representation includes a page description language representation, a document exchange format representation, a print control language representation, or a mark-up language representation.
According to another aspect of the present invention, the associated semantic label includes a numeric code representing a character. The numeric code may be an ASCII code. The semantic label may also be stored in a residual block of the second structured representation of the document.
According to still another aspect of the present invention, the providing step further comprises providing a font-specific optical character recognizer software program for obtaining the associated semantic label.
According to another aspect of the present invention, the method further comprises the step of searching the second structured representation of the document using the associated semantic label.
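A content search over such a representation can be sketched as a match on label sequences rather than on pixels. The sketch assumes that positions are stored in reading order and that every token carries a one-character label; both are simplifying assumptions, and find_label_sequence is a hypothetical helper name.

```python
def find_label_sequence(labels, positions, query):
    """Return indices into `positions` where consecutive tokens spell `query`."""
    page_labels = [labels[token_index] for token_index, _x, _y in positions]
    wanted = list(query)
    hits = []
    for start in range(len(page_labels) - len(wanted) + 1):
        if page_labels[start:start + len(wanted)] == wanted:
            hits.append(start)
    return hits


# Token index -> semantic label (hypothetical five-token dictionary).
labels = ["t", "o", "k", "e", "n"]

# (token_index, x, y) in reading order; the page reads "token to".
positions = [
    (0, 0, 0), (1, 8, 0), (2, 16, 0), (3, 24, 0), (4, 32, 0),
    (0, 48, 0), (1, 56, 0),
]

hits_to = find_label_sequence(labels, positions, "to")
hits_ken = find_label_sequence(labels, positions, "ken")
```

Each hit indexes into the position list, so the matching text can be located on the page without rendering the image or running character recognition at search time.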
According to still another aspect of the present invention, an article of manufacture is provided comprising an information storage medium wherein is stored information comprising a computer program for facilitating production by a processor of a second set of digital information from a first set of digital information. The first set of digital information comprises a first structured representation of a document, from which a plurality of image collections are obtainable. Each such obtainable image collection comprises at least one image, and each image in each such collection is an image of at least a portion of the document. The second set of digital information comprises a second structured representation of the document. The second structured representation is a lossless representation of a particular image collection, the particular image collection being one of the plurality of image collections obtainable from the first structured representation. The second structured representation includes a plurality of tokens and a plurality of positions, wherein at least one token of the plurality of tokens has an associated semantic label. Each token comprises a set of pixel data representing a subimage of the particular image collection. Each position is a position of a token subimage in the particular image collection, a token subimage being one of the subimages from one of the tokens. At least one token subimage has a plurality of pixels and occurs at more than one position in the particular image collection.
According to another aspect of the present invention, an apparatus comprising a processor, an instruction store, and a data store is provided. The instruction store comprises an article of manufacture as described above. The data store includes the first and second sets of digital information.
According to still a further aspect of the present invention, a method for providing a low resolution representation and a high resolution representation of a document is provided. A processor is provided with a first set of digital information comprising a first structured representation (hereinafter, “the starting representation”) of a document. The starting representation is a resolution-independent representation. A plurality of image collections are obtainable from the starting representation, and each such obtainable image collection comprises at least one image. Each image in each such collection is an image of at least a portion of the document and has a characteristic resolution.
A second set of digital information comprising a second structured representation (hereinafter, “the low-resolution representation”) of the document is produced from the first set of digital information. The low-resolution representation is a lossless representation of a particular image collection (hereinafter, “the low-resolution image collection”), the low-resolution image collection being one of the plurality of image collections obtainable from the starting representation. Each image in the low-resolution image collection has a first characteristic resolution (hereinafter, “the low resolution”). The low-resolution representation includes a plurality of tokens (hereinafter, “the low-resolution tokens”) and a plurality of positions. The second set of digital information is produced by extracting the low-resolution tokens from the starting representation, each low-resolution token comprising a set of pixel data representing a subimage of the low-resolution image collection. The plurality of positions of the low-resolution representation is determined from the starting representation. Each position of the low-resolution representation is a position of a subimage (hereinafter, “the low-resolution subimage”) in the low-resolution image collection, a low-resolution subimage being one of the subimages from one of the low-resolution tokens. At least one low-resolution subimage has a plurality of pixels and occurs at more than one position in the low-resolution image collection.
A third set of digital information comprising a third structured representation (hereinafter, “the high resolution representation”) of the document is produced from the first set of digital information. The high resolution representation is a lossless representation of a particular image collection (hereinafter, “the high resolution image collection”), the high resolution image collection being one of the plurality of image collections obtainable from the starting representation. Each image in the high resolution image collection has a second characteristic resolution (hereinafter, “the high resolution”) that is greater than the low resolution. The high resolution representation includes a plurality of tokens (hereinafter, “the high resolution tokens”) and a plurality of positions, wherein at least one high resolution token of the plurality of tokens has an associated semantic label. The third set of digital information is produced by extracting the high resolution tokens from the starting representation, each high resolution token comprising a set of pixel data representing a subimage of the high resolution image collection. The plurality of positions of the high resolution representation is determined from the starting representation. Each position of the high resolution representation is a position of a subimage (hereinafter, “the high resolution subimage”) in the high resolution image collection, a high resolution subimage being one of the subimages from one of the high resolution tokens. At least one high resolution subimage has a plurality of pixels and occurs at more than one position in the high resolution image collection. The second and third sets of digital information are then made available for further use.
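How one resolution-independent starting representation can yield both a low-resolution and a high resolution set of positions may be sketched as below. The sketch makes several assumptions that are not taken from the text: the starting layout is given as (label, x, y) entries measured in points (1/72 inch), the two resolutions are 100 and 300 pixels per inch, and to_pixels and place are hypothetical helpers. Only positions are computed; the token bitmaps for each resolution would be rasterized separately.

```python
def to_pixels(points, dpi):
    """Convert a length in points (1/72 inch) to whole pixels at `dpi`."""
    return round(points * dpi / 72)


def place(layout, dpi):
    """Build a label -> token index map and pixel positions for one resolution."""
    token_index = {}
    positions = []
    for label, x, y in layout:
        # Reuse the existing token for a repeated label, else assign the next index.
        index = token_index.setdefault(label, len(token_index))
        positions.append((index, to_pixels(x, dpi), to_pixels(y, dpi)))
    return token_index, positions


# Hypothetical resolution-independent layout: the same shape occurs twice.
layout = [("E", 72.0, 36.0), ("E", 144.0, 36.0)]

low_tokens, low_positions = place(layout, 100)    # assumed low resolution
high_tokens, high_positions = place(layout, 300)  # assumed high resolution
```

Both results refer to a single token used at two positions; only the pixel coordinates, and the size of the bitmap that would be stored for the token, change with the resolution.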
The invention will be better understood with reference to the drawings and detailed description below. In the drawings, like reference numerals indicate like components.
Other aspects and advantages of the present invention can be seen upon review of the figures, the detailed description, and the claims which follow.