1. Field of the Invention
This invention generally relates to the field of data compression, and more particularly relates to a method and system for compressing data that contains a frequent occurrence of data elements that have the same value.
2. Description of Related Art
Data sets that contain various types of data, including data that defines images, executable programs and other data, are often very large. An example of a large data set is the raw data set describing an image, which is sometimes referred to as a raster data set. The size of a raster data set depends on the image size and resolution, as well as the colorspace used to represent the data and number of bits per color plane. The image size and resolution govern the number of pixels in the image, while the colorspace and the number of bits per color plane define the amount of data needed to describe each pixel.
A simple example is a bilevel image (i.e., where the image simply comprises a number of pixels which are each either on or off) which is a letter size image at a common print resolution of 600 dots per inch. Such an image contains over 33 million pixels that will require approximately 4 MBytes for uncompressed storage, given that each byte can describe eight pixels. Color images are commonly described using 8 or more bits per color. Techniques known as CIEL*a*b* and RGB colorspaces are used, respectively, with three color planes each, for device independent archival storage or for on screen image presentation. The CMYK (Cyan, Magenta, Yellow, Black) colorspace, which specifies four colorplanes, is often used for printing. Color images require between 24 and 32 bits per pixel, depending upon the inclusion of independent black color data, if the intensity of each color is represented with 8 bits per pixel. A letter size image with a resolution of 600 dots per inch requires approximately 96 MB of storage for a three-component colorspace and 128 MB for a four-component colorspace.
The size of a dataset impacts not only the storage of the data but also electronic communication of the dataset. The transmission of a dataset defining a highly detailed image that is to be transmitted to a high speed printer is a particularly difficult problem. Printers, which are capable of printing in excess of several hundred pages per minute, require that data for these images reach the printer with comparable speed. Data communication links are typically inadequate for the communication of the raw image data set to such high-speed printers. In order to communicate the image data to the high-speed printer, data compression is typically employed.
Compression algorithms that are used for image compression are able to be broadly classified into two categories, lossless compression and lossy compression. In a lossless algorithm, the decompressed image is an identical copy of the original image. As the name indicates, lossy algorithms introduce some data loss and the decompressed image is slightly different than the original image. The examples of commonly used lossless algorithms are ITU-TSS T6 Group 4 (for bilevel images) and Lempel-Ziv and Welch (LZW) for arbitrary data. The best-known lossy image compression algorithm is part of the Joint Photographic Experts Group (JPEG) standard.
The different types of compression algorithms are appropriate for different image types. Images may be classified into linework and continuous tone images (which may also be referred to as xe2x80x98contonexe2x80x99 images for the purposes of this specification). Linework images contain sharp edges and areas of high color contrast. Examples of linework images are rasterized text, pie charts and line drawings. Continuous tone images are distinguished from linework images by constantly varying color and a general lack (or relative unimportance) of sharp edges. Photographs are primary examples of continuous tone images.
Compression of a linework image via a lossy algorithm unacceptably degrades the decompressed image. Examples of unacceptable degradation of a linework image compressed with a lossy algorithm (e.g., the JPEG algorithm) are artifacts and blurring in areas which neighbor sharp edges within the linework image. On the other hand, compressing a continuous tone image via a lossless compression algorithm results in very little data compression (e.g., common reduction in data size is 10% for compression of a continuous tone image via the LZW algorithm). The use of a lossless compression algorithm on a continuous tone image may actually cause the data set to expand (i.e., the compressed data set is larger than the uncompressed data set). Conversely, the lossless data compression algorithms preserve the quality of a linework image and also tend to have good data compression performance. The quality degradation of a decompressed continuous tone image that was compressed via a lossy algorithm is often imperceptible. Image data compression becomes more efficient if the image data is distinguished between continuous tone image data and linework image data and the data is compressed via an algorithm suitable for the type of image. This phenomenon is used in many image data rasterizers that are used in color printing, which process linework and continuous tone differently. While multiple color planes can combine in various ways, the images being carried in the CMYK colorspace (which refers to the four color planes used to encode color data: Cyan, Magenta, Yellow and Black) for printing are almost invariably carried in the planar format, where each color plane of the image is compressed separately.
Linework image data is often encoded using a run length algorithm or a variant of the LZW algorithm. The run length algorithm encodes each scan line in the image separately, by recording the number of pixels that have the same color intensity value. Alternatively, a run end algorithm can be used where the position of the last pixel in a xe2x80x9crunxe2x80x9d of pixels with the same color is recorded instead of the length of each run. A number of run lengths and run end compression forms are currently used, such as the MRLE runlength format used to communicate linework image data to the Xeikon high speed color printheads. Since the runlength/run end compression format record the changes in color on each scan line, the resulting datasets tend to be quite efficient for linework data.
The runend/runlength format serves as the intermediate format for the MMR family of algorithms, such as ITU-TSS T6 Group4, and is therefore heavily used in processing of bilevel data.
The LZW algorithm compresses an arbitrary stream of data (i.e., its use is not necessarily restricted to image data). The LZW algorithm operates by building a dictionary of code words that each represents a sequence of bytes. The dictionary is implicit, which is to say that the dictionary is never explicitly embedded into the compressed data stream. The dictionary of the LZW algorithm is dynamically constructed by the compressed data stream decoder as the encoded data is processed. The code words of typical LZW encoding techniques are 9-12 bits long. When the dictionary is full (i.e., all of the code words have been used), a special code, i.e., the CLEARCODE, is encoded. Upon receipt of the CLEARCODE by an LZW decoder, the code word dictionary is erased and the algorithm restarts.
In general terms, the encoding process of the LZW algorithm operates by building data strings and maintaining a dictionary of code words to represent data strings that contain previously observed data patterns. These code words are used to replace subsequent occurrences of those data strings. The algorithm maintains a currently active string. The currently active string always has a corresponding code word in the dictionary. When the next character is processed from the un-encoded data input, a new string is considered, which comprises the currently active string with the current input character added to the end. If the dictionary already contains a code word describing the new string, that code word becomes the current string and the algorithm advances to the next character. Otherwise, the code word for the current string is output, a new code word is assigned to the xe2x80x9cnewxe2x80x9d string, which comprises the xe2x80x9ccurrent stringxe2x80x9d and the xe2x80x9ccurrent input character,xe2x80x9d and the xe2x80x9ccurrent stringxe2x80x9d is reinitialized to the current input character. The processing of the algorithm then proceeds to the next character. The LZW algorithm uses the character itself as a code word to encode a single-character string. The main computational cost in the LZW compression is checking whether the xe2x80x9cnewxe2x80x9d string has an existing code word. Modern LZW implementations use a hash table lookup to locate code words corresponding to input data strings. Each lookup entry is composed of the code (xe2x80x9cprefixxe2x80x9d) that denotes the current string and the additional character.
Typical implementations of LZW encoders store these two entries (which fit into fewer than 20 bits) into a single integer value. The LZW algorithm is incorporated into a number of variations. All variations use the above described processing, but differ in their implementation details. One LZW encoding format often used for linework image compression is the TIFF LZW format. A similar format often used for image compression is the GIF LZW format, but that format restarts the compressor more often.
The PostScript LZWEncodeFilter uses a different variant of the LZW algorithm. The LZW algorithm is also used to compress arbitrary (non-image) data. For example, the xe2x80x9ccompressxe2x80x9d utility shipped with Unix systems uses a variant of the TIFF LZW algorithm that results in the same compressed data length as the TIFF LZW algorithm, but which use different code words. The generality and simplicity of the LZW algorithm are both strengths and weaknesses in the algorithm. The strengths are obvious from the widespread use in manifold contexts. The weakness of the LZW algorithm is that for a particular application (say, compressing image data), the algorithm does not use all the available information to maximize the performance.
In most instances, an entire page to be printed does not entirely comprise only continuous tone or linework image data. It is inefficient to compress the whole output page for both data types, since that would incur both extra compression processing costs and, more importantly, extra decompression processing costs in the receiver. The common solution to this problem is to divide the output page into a set of rectangular tiles, wherein each tile contains different image data types. An example of such tiled image format is IOCA (Image Object Content Architecture) Function Set 45. The IOCA tiling scheme has been disclosed in U.S. patent application Ser. No. 09/305,258, entitled xe2x80x9cEfficient Presentation of Imagesxe2x80x9d, filed on May 4, 1999 and U.S. patent application Ser. No. 09/571,333, entitled xe2x80x9cMethod and System for Efficient Transmittal and Presentation of Complex Imagesxe2x80x9d, filed on May 15, 2000 and both of which are incorporated hereinto in their entirety. Proper generation of a IOCA FS45 datastream allows the receivers (in particular high speed color printers) to efficiently process the datastream at high speeds. This proper generation of the datastream, on the other hand, is a complex process and imposes high processing costs.
These prior art techniques integrate data compression processing with the image data generation functions to identify the nature of the image tile, i.e. continuous tone or linework, in order to efficiently compress the data for transmission. These techniques are ineffective when used in conjunction with raw image data that has already been generated.
Therefore a need exists for a set of process and system that ensure both reasonable processing throughput of a datastream compression process and efficient generation of a compressed datastream when processing a pre-existing dataset that comprises repetitive data patterns.
According to a preferred embodiment of the present invention, a system and method for efficient data compression is a method for compressing data that identifies a data pattern within a dataset that ends with at least one pre-defined data value and that then stores a code word that represents the identified data pattern. The code words are stored in at least one alternate data structure. The method then identifies a subsequent occurrence of the previously identified data pattern and retrieves the previously stored code word from the at least one alternate data structure.
In another aspect of the present invention, according to a preferred embodiment of the present invention, a system and method for efficient data compression is a system for compressing data that has a data sequence analyzer that identifies a data pattern within a dataset, were the pattern ends with one of an at least one pre-defined data value. The system also has a codeword data storage that stores, in at least one alternate data structure, a codeword that corresponds to the data pattern. The system then utilizes a codeword data producer that determines the codeword that represents the identified data pattern by first determining if the codeword is stored within the codeword data storage, and then retrieving the codeword if the codeword is stored within the codeword data storage or calculating the codeword if the codeword is not stored in the codeword data storage.