The present invention relates generally to data compression, and more particularly, to systems and methods of implementing dictionary-based compression.
A wide variety of digital data signals such as data files, documents, photographic images and the like are often compressed to save storage costs or to reduce transmission time through a transmission channel. By decreasing the required memory for data storage and/or the required time for data transmission, compression can yield improved system performance and a reduced cost.
A well-known and widely used type of lossless compression, generally referred to as substitutional or dictionary-based compression, exploits the tendency of many data types to contain repeating sequences of characters. Good examples of such data are text files (a sequence of alphanumeric characters) and raster images (a sequence of pixels). Dictionary-based compression methods exploit this tendency by replacing substrings in a data stream with a code word that identifies that substring in a dictionary. The dictionary can be static if the statistics of the input stream are known in advance, or it can be adaptive. Adaptive dictionary schemes are better at handling data streams where the statistics are not known or vary.
Adaptive dictionary-based compression techniques can be divided into two related groups. Methods of the first group determine if a character sequence currently being compressed has already occurred earlier in the input data and, if so, rather than repeating it they output a pointer to the earlier occurrence. With this type, the dictionary is represented by the strings of characters occurring in the previously processed data. Methods of the second group build the dictionary entries using character strings encountered in the data stream as it is processed. With both groups, the dictionary is all or a portion of the input stream that has been processed previously. Previous strings from the input stream often make a good choice for the dictionary, as substrings that have occurred will likely reoccur. The other advantage of these types of dictionary-based compression is that the dictionary is transmitted at essentially no cost, because the decoder can generate the dictionary from the previously coded input stream.
Both groups of dictionary coders can be represented by two related techniques developed by Lempel and Ziv. Methods of the first group are based on an algorithm often referred to as LZ77 and the methods of the second group are based on an algorithm often referred to as LZ78. The many variations of dictionary-based compression algorithms differ primarily in how pointers are represented and to what the pointers are allowed to refer.
Briefly, LZ77 type coding operates on an input stream comprising the sequence of characters to be compressed. Encoders of this type are relatively easy to implement and generally perform a pattern matching technique followed by a variable bit-length encoding scheme such as Huffman encoding. These encoders search a sliding window to locate the longest match with the character sequence beginning with the character at the current coding position. If a match is found, a pointer is provided that identifies the location in the window at which the matching string begins and the length of the string. Searching can be accelerated by indexing prior substrings with a trie, hash table, or binary search tree.
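The sliding-window search described above can be illustrated with a minimal sketch. It is not taken from any particular encoder: the function names, the token layout, the `window_size` and `min_match` parameters, and the brute-force search are all assumptions chosen for readability, and the variable bit-length (e.g. Huffman) stage is omitted.

```python
def lz77_encode(data, window_size=16, min_match=2):
    """Greedy LZ77-style tokenizer using a brute-force window search.

    Emits ("literal", char) or ("match", offset, length) tokens.
    The token layout and parameters are illustrative assumptions.
    """
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_off = 0, 0
        start = max(0, pos - window_size)
        # Try every window position as a candidate match start.
        for cand in range(start, pos):
            length = 0
            # A match may run past pos (overlap with the string being coded).
            while (pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, pos - cand
        if best_len >= min_match:
            tokens.append(("match", best_off, best_len))
            pos += best_len
        else:
            tokens.append(("literal", data[pos]))
            pos += 1
    return tokens


def lz77_decode(tokens):
    """Rebuild the string by copying from the already decoded output."""
    out = []
    for tok in tokens:
        if tok[0] == "literal":
            out.append(tok[1])
        else:
            _, off, length = tok
            for _ in range(length):
                out.append(out[-off])
    return "".join(out)
```

For the input `"abcabcabcabc"` this sketch emits three literals followed by a single pointer with offset 3 and length 9, and `lz77_decode` restores the original string. A practical encoder would replace the inner loop with one of the indexing structures mentioned above.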
In contrast to LZ77, where pointers can refer to any substring in the window of prior data, the LZ78 method places restrictions on which substrings can be referenced. However, LZ78 does not have a window to limit how far back substrings can be referenced. LZ78 type encoders build the dictionary by matching the current substring from the input stream to a dictionary of previously encountered strings. This stored dictionary is adaptively generated based on the contents of the input stream. The encoding process analyzes a string comprising a prefix and a current character in the data stream, beginning with an empty prefix. If the corresponding string (prefix + the current character) is present in the dictionary, the prefix is extended with the current character and a new string comprising the extended prefix and next character is analyzed. This extending is repeated until a string which is not present in the dictionary is encountered. At that point, the encoder outputs (a) a code word that represents the current prefix and (b) the current character. The encoder also creates a new dictionary entry comprising the current prefix and current character string. The encoder then begins building a new string with an empty prefix and the next character in the data stream. Further information on dictionary-based compression can be found in U.S. Pat. No. 4,558,302 entitled "High Speed Data Compression and Decompression Apparatus and Method", incorporated herein by reference.
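The prefix-extension loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the use of index 0 for the empty prefix, and the final flush of a leftover prefix with an empty character are choices made for this sketch and are not drawn from the referenced patent.

```python
def lz78_encode(data):
    """LZ78-style encoder emitting (prefix_index, char) pairs.

    Index 0 stands for the empty prefix (an assumption of this sketch).
    """
    dictionary = {}  # string -> index, indices start at 1
    output = []
    prefix = ""
    for ch in data:
        candidate = prefix + ch
        if candidate in dictionary:
            # String is known: extend the prefix and keep reading.
            prefix = candidate
        else:
            # Emit code word for the prefix plus the current character,
            # then add the new string to the dictionary.
            output.append((dictionary.get(prefix, 0), ch))
            dictionary[candidate] = len(dictionary) + 1
            prefix = ""
    if prefix:
        # Input ended mid-match: flush the prefix with an empty character.
        output.append((dictionary[prefix], ""))
    return output


def lz78_decode(pairs):
    """Rebuild the dictionary from the pairs while decoding."""
    entries = [""]  # entries[0] is the empty prefix
    out = []
    for index, ch in pairs:
        s = entries[index] + ch
        entries.append(s)
        out.append(s)
    return "".join(out)
```

For the input `"abababa"` the encoder emits `(0, 'a')`, `(0, 'b')`, `(1, 'b')`, `(3, 'a')`. The decoder rebuilds the same dictionary from these pairs, which illustrates why the dictionary is transmitted at essentially no cost.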
Dictionary-based lossless compression methods adapt well to a variety of input raster data types and thus are well suited for use in digital printing systems. However, with raster data it has been seen that better matches are often found at scan line intervals in the history buffer. This requires implementations of dictionary-based lossless compression systems to have a large history buffer that contains several scan lines of data. In both software and hardware implementations, increasing the size of this buffer raises implementation cost or reduces performance. In hardware implementations in particular, this memory is often a specialized memory, such as a content addressable memory, which requires more circuits to implement than standard memory that is not content addressable. Another disadvantage of dictionary-based encoders is that the implementation is inherently serial and does not make use of the parallelism available in many processor architectures, resulting in lost (or unused) instruction slots and decreased performance.
In accordance with one aspect of the teachings herein, there is provided an improved dictionary-based compression method in which a sliding window of data is searched to locate a longest string within the sliding window that matches a string beginning at a current coding position. The improved method limits the data within the sliding window searched to data strings occurring at each discrete match location within a plurality of predefined discrete match locations, the plurality of predefined discrete match locations comprising a set of non-continuous data positions within the window of data.
There is further provided a method of compressing data that includes receiving an input stream of data, the input stream including a sequence of pixels to be compressed; identifying a coding position; comparing strings of pixels occurring at each match location within a plurality of predefined match locations to identify a match with a compress string, the compress string including a string of pixels occurring at the coding position, the plurality of predefined match locations defining a set of discrete, non-continuous pixels from the input stream; and providing a pointer, the pointer identifying a predefined match location which matches the compress string and the length of the compress string.
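A search limited to predefined match locations can be sketched as below. The sketch is hypothetical: the function name, the `max_len` cap, and the representation of match locations as backward offsets from the coding position (for example, the previous pixel and the pixel one scan line back) are assumptions made for illustration, not details stated in the text.

```python
def find_match_at_locations(data, pos, offsets, max_len=255):
    """Return (offset, length) of the longest match found at one of the
    predefined backward offsets, or None if no location matches.

    `offsets` must hold positive integers; the values to use are an
    assumption of the caller, not something this sketch prescribes.
    """
    best = None
    for off in offsets:
        start = pos - off
        if start < 0:
            continue  # location lies before the start of the stream
        length = 0
        while (length < max_len
               and pos + length < len(data)
               and data[start + length] == data[pos + length]):
            length += 1
        if length > 0 and (best is None or length > best[1]):
            best = (off, length)
    return best


# Two scan lines of 8 pixels each; the second line repeats the
# first four pixels of the first line.
pixels = [1, 2, 3, 4, 5, 6, 7, 8,
          1, 2, 3, 4, 9, 9, 9, 9]
pointer = find_match_at_locations(pixels, pos=8, offsets=(1, 8))
```

Here only two positions are examined instead of every position in the window, and `pointer` identifies the location one scan line back (offset 8) with a match length of 4.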
The teachings herein further provide a method of compressing data that exploits the property that, for some types of data, it is possible to identify certain match locations within the data that are more likely to contain a matching pattern than the average location. To exploit this property, one aspect of the present teachings is a compression method which limits the search for matching strings within the window of data to those character strings occurring at such match locations. By identifying areas of a data stream that are more likely to contain matching data and limiting the search for compression strings to those areas, the compression process can operate in parallel to simultaneously compare data at the match locations. This parallel operation can result in a reduction in the processing time necessary to compress a file as compared to conventional methods. One such embodiment of a method of compressing data includes receiving an input stream of data, the input stream including a sequence of data elements to be compressed; selecting a compress string within the input stream, the compress string including at least one data element occurring at a coding position; identifying a plurality of match locations associated with the coding position; setting a status for each match location within the plurality of match locations, the status identifying whether the corresponding match location is active or inactive; simultaneously comparing the compress string with data elements at match locations having an active status to determine if a match exists at the respective match location, and updating the status of the match location based on the comparison; increasing the length of the compress string by adding at least one data element to the compress string; repeating the steps of simultaneously comparing and increasing the length of the compress string until all match locations within the plurality of match locations have an inactive status; and providing a pointer, the pointer identifying a match location which matches the compress string and the length of the compress string.
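The status-driven comparison loop of this embodiment can be sketched as follows. The sketch makes several assumptions that are not stated in the text: match locations are expressed as positive backward offsets, a location starts as inactive if it lies before the start of the stream, ties are resolved in favor of the first listed offset, and the loop also stops at the end of the input. The per-location comparisons are written as an ordinary loop here; since they are independent of one another, a parallel implementation could perform them in a single step.

```python
def match_with_status(data, pos, offsets):
    """Grow the compress string one element at a time while tracking an
    active/inactive status per match location.

    Returns (offset, length) for a surviving location, or None.
    """
    # Initial status: active only if the location is inside the stream.
    active = {off: pos - off >= 0 for off in offsets}
    best_off, best_len = None, 0
    length = 0
    while any(active.values()) and pos + length < len(data):
        target = data[pos + length]  # newest element of the compress string
        # Compare every active location against the same element.
        for off in offsets:
            if active[off] and data[pos - off + length] != target:
                active[off] = False  # mismatch: location becomes inactive
        survivors = [off for off in offsets if active[off]]
        if survivors:
            # At least one location still matches: extend the string.
            length += 1
            best_off, best_len = survivors[0], length
    return (best_off, best_len) if best_len else None


# Two scan lines of 8 data elements each (illustrative values).
elements = [1, 2, 3, 4, 5, 6, 7, 8,
            1, 2, 3, 4, 9, 9, 9, 9]
result = match_with_status(elements, pos=8, offsets=(1, 8))
```

In this example the location at offset 1 becomes inactive on the first comparison, the location at offset 8 stays active for four elements, and `result` is the pointer `(8, 4)`.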