The present invention is directed toward the field of data compression. In particular, a block-wise adaptive statistical data compressor is disclosed that adapts its data model on a block-by-block basis. The data model generated by the adaptive statistical data compressor consists of a plurality of super-character codewords that correspond to a plurality of super-character groups, wherein each super-character group contains data regarding the frequency of occurrence of one or more individual characters in an applicable character data set. The use of these super-character codewords and groups to model the data in a particular block minimizes the amount of model data that must be included with the compressed data block to enable decompression.
Also disclosed in this application is a preferred multi-stage data compressor that includes the block-wise adaptive statistical data compressor, as one stage, and also includes a clustering stage and a reordering stage, which, together, reformat the data in the data block so that the frequency distribution of characters in the data block has an expected skew. This skew can then be exploited by selecting certain super-character groupings that optimize the compression ratio achievable by the block-wise adaptive statistical stage. In an alternative embodiment, additional stages are added to the clustering, reordering and adaptive statistical stages to improve data compression efficiency.
The present invention finds particular use in data communication devices in which it is desirable to reduce the quantity of data transmitted while maintaining the integrity of the data stream. Although the disclosed data compressor (in its various embodiments) can be used for general data compression on a personal computer or workstation to compress, for example, data files for easier transport or electronic transmission, the preferred application of the data compressor is for use with mobile data communication devices that transmit packets (or blocks) of data, such as E-mail messages, via a wireless packet network. The data compressor is preferably implemented as a sequence of computer program instructions that are programmed into the mobile data communication device, but could alternatively be implemented in hardware or as a sequence of instructions that are stored on a disk as an article of manufacture.
Data compression (or compression) refers to the process of transforming a data file or stream of data characters so that the number of bits needed to represent the transformed data is smaller than the number of bits needed to represent the original data. Data files can be compressed because they contain redundancy. The more redundant a particular file is, the more effectively it can be compressed.
There are two general types of compression schemes, lossless and lossy. Lossless compression refers to a process in which the original data can be recovered (decompressed) exactly from the compressed data. Lossy compression refers to schemes in which the decompressed data is not exactly the same as the original data. Lossless schemes are generally used for data files or messages where the content of the file must be accurately maintained, such as an E-mail message, word processing document, or other type of text file. Lossy schemes are generally used for data files that already include a certain degree of noise, such as photographs, music, or other analog signals that have been put into a digital format, for which the addition of a bit more noise is acceptable.
The present invention is a lossless data compression scheme. In the field of lossless data compression there are two general types of compressors: (1) dictionary based (or sliding-window); and (2) statistical coders. Dictionary based compressors examine the input data stream and look for groups of symbols or characters that appear in a dictionary that is built using data that has already been compressed. If a match is found, the compressor outputs a single pointer or index into the dictionary instead of the group of characters. In this way, a group of characters can be replaced by a smaller index value. The main difference between the numerous dictionary based schemes is how the dictionary is built and maintained, and how matches are found. Well-known dictionary based schemes include LZ77 (where the dictionary is a fixed-length sliding window that corresponds to the previous N-bytes of data that have been compressed); LZ78 (where the dictionary is an unlimited-sized tree of phrases that are built as the data is being compressed); and various improvements on LZ77 and LZ78, including LZSS, LZW, and numerous other schemes that employ a "hash" function to find the position of a particular token in the dictionary.
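The sliding-window idea behind LZ77 can be illustrated with a minimal sketch. This is not the compressor of the present invention, and it is not a faithful reproduction of any particular LZ77 implementation; the function names, the triple format (offset, length, next byte), and the `window` and `max_len` defaults are assumptions chosen for illustration only.

```python
def lz77_compress(data: bytes, window: int = 255, max_len: int = 15):
    """Greedy sliding-window coder emitting (offset, length, next_byte) triples.

    The 'dictionary' is simply the previous `window` bytes of input.
    Window size and maximum match length are illustrative assumptions.
    """
    out = []
    i = 0
    while i < len(data):
        best_off, best_len = 0, 0
        start = max(0, i - window)
        for j in range(start, i):
            length = 0
            # Stop one byte short of the end so a literal byte always follows.
            while (length < max_len
                   and i + length < len(data) - 1
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        out.append((best_off, best_len, data[i + best_len]))
        i += best_len + 1
    return out


def lz77_decompress(triples) -> bytes:
    """Rebuild the data by copying from the already-decoded output."""
    buf = bytearray()
    for off, length, nxt in triples:
        for _ in range(length):
            buf.append(buf[-off])  # byte-wise copy also handles overlapping matches
        buf.append(nxt)
    return bytes(buf)
```

The brute-force search over the window in this sketch also hints at why match finding can be computationally expensive; practical schemes replace it with hashing or tree structures.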
Statistical coders are typically either Huffman coders or arithmetic coders. Statistical coders build a model of the data stream or block and then replace individual characters in the data block with a variable-length code that corresponds to the frequency of occurrence of the particular character in the data block. Huffman coding assigns variable-length codes to characters based on their frequency of occurrence. For example, in the English language the letters "E", "T", "A", "I", etc., appear much more frequently than the letters "X", "Q", "Z", etc., and Huffman coding takes advantage of this fact by assigning (for a fixed Huffman coder) a lower number of bits to letters that occur more frequently and a higher number of bits to characters that occur less frequently.
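As a rough illustration of how Huffman coding gives shorter codes to more frequent characters, the following sketch computes only the code length of each symbol from its frequency in a block. The function name and the tie-breaking rule are assumptions made for this example; a complete coder would also assign the actual bit patterns.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict:
    """Return {symbol: code length in bits} for a Huffman code built on `data`."""
    freq = Counter(data)
    if len(freq) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {sym: 1 for sym in freq}
    lengths = {sym: 0 for sym in freq}
    # Heap entries: (weight, smallest symbol in subtree as tie-breaker, symbols).
    heap = [(count, sym, [sym]) for sym, count in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, t1, syms1 = heapq.heappop(heap)
        w2, t2, syms2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, min(t1, t2), syms1 + syms2))
    return lengths
```

For the 15-byte block `b"EEEEEEEETTTTAAX"`, this sketch assigns 1 bit to "E", 2 bits to "T", and 3 bits each to "A" and "X", for a total of 25 bits instead of the 120 bits needed at 8 bits per character (ignoring the cost of transmitting the model).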
There are two basic types of Huffman coders, a fixed Huffman coder and a purely adaptive Huffman coder. The fixed coder builds a tree of codes based on statistics concerning all the symbols (or characters) actually contained in the data file. This "Huffman tree" must be passed to the decompression device in order to properly decompress the file, which adds to the overhead and thus reduces the effective compression ratio. For example, a fixed Huffman coder for 7-bit characters would normally require 128 bytes of data to model the character set, while for 8-bit characters, 256 bytes are normally required. Thus for small data blocks, on the order of several KB, the overhead of the fixed Huffman coder is undesirable.
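The effect of this model overhead on small blocks can be made concrete with a short calculation. The 256-byte model size is taken from the figures above; the assumption that the coded payload is half the size of the original block is purely hypothetical and is used only to show the trend.

```python
def effective_ratio(block_bytes: int, coded_bytes: int, model_bytes: int) -> float:
    """Compressed size (payload plus model) divided by original size.

    Values below 1.0 mean the block shrank; 1.0 means no saving at all.
    """
    return (coded_bytes + model_bytes) / block_bytes


# Hypothetical 2:1 payload compression with a 256-byte model for 8-bit characters.
for block in (512, 2048, 65536):
    ratio = effective_ratio(block, block // 2, 256)
    print(f"{block:6d}-byte block -> effective ratio {ratio:.3f}")
```

Under these assumptions, a 512-byte block gains nothing (ratio 1.000), a 2048-byte block reaches 0.625, and only a much larger 65536-byte block approaches the nominal 0.5.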
The adaptive Huffman coder assumes an initial distribution of characters in the block, and then changes the coding of individual symbols in the tree based on the actual content of the symbols in the data file as they are being processed. The advantage of the adaptive coder is that the tree is not passed to the decompression device, but the decompression device must assume the same initial distribution of symbols in the tree. The main problem with the adaptive coder is that it takes a certain amount of data to be processed before the model becomes efficient, and therefore it is also undesirable for small blocks of data.
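The slow start of a purely adaptive model can be illustrated numerically. The sketch below is not an adaptive Huffman coder; as a simplifying assumption it uses an idealized adaptive frequency model that starts with every one of 256 symbols equally likely and charges each symbol its ideal cost of -log2(p) bits. The function name and the uniform starting counts are assumptions for illustration.

```python
import math


def adaptive_cost_bits(data: bytes, alphabet_size: int = 256) -> float:
    """Ideal cost in bits of coding `data` with a model that learns as it goes.

    Every symbol starts with a count of 1 (the assumed initial distribution);
    counts are updated after each symbol, as a decoder could also do.
    """
    counts = [1] * alphabet_size
    total = alphabet_size
    bits = 0.0
    for byte in data:
        bits += -math.log2(counts[byte] / total)
        counts[byte] += 1
        total += 1
    return bits


short_block = b"ab" * 8      # 16 bytes
long_block = b"ab" * 2048    # 4096 bytes
print(adaptive_cost_bits(short_block) / len(short_block))  # bits per byte, small block
print(adaptive_cost_bits(long_block) / len(long_block))    # bits per byte, larger block
```

With this two-symbol input the ideal cost is 1 bit per byte, yet the 16-byte block averages roughly 6 bits per byte because the model has barely begun to adapt, while the 4096-byte block comes much closer to the ideal.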
Presently known dictionary based and statistical compressors suffer from several disadvantages that are addressed by the present invention. First, neither of these types of compressors is optimized for relatively small data blocks. In fact, some of these schemes exhibit poor performance for small blocks of data, such as those commonly transmitted over wireless packet data networks.
The presently known dictionary based schemes can provide good compression ratios, but generally require a large amount of memory to operate in order to store the dictionary. This is particularly true of the LZ78 variants where the dictionary is not limited to any particular size. This is a problem for small mobile computers that have a limited memory capacity. In addition, these schemes require a search and replace function that can be computationally intensive and time consuming depending on the size of the dictionary, the data structure used to store the data in the dictionary, and the method employed to find a matching string. This is an additional problem for mobile computers that generally have limited processing power.
The presently known statistical compressors suffer from several disadvantages: (1) they generally do not provide enough compression; (2) the fixed type of coder requires the additional overhead of passing a code for each character in the alphabet to the decompressing device, thus reducing the overall compression ratio, particularly where a relatively small block of data is being compressed; (3) the purely adaptive type of coder requires a great deal of processing power on the compression side to constantly update the statistical model of the data stream, and therefore is not well suited for small, mobile data communication devices that have limited processing power; and (4) also with respect to the purely adaptive type of coder, this type of coder only becomes efficient after a particular amount of data has been compressed, and therefore it is very inefficient for small data blocks, where the compressor may require more data to become efficient than is contained in the block.
Therefore, there remains a general need in the art of data compression for a data compressor that is optimized to compress relatively small blocks of data.
There remains a more particular need for a data compressor that is optimized for use with mobile data communication devices that have limited memory and processing capabilities.
There remains still a more general need for a data compressor that adapts its data model to each data block that is being compressed, but at the same time minimizes the amount of model data that must be transmitted to the decompression device to decompress the block.
There remains a more particular need for such a data compressor that, while adapting to each block of data, minimizes the processing power required of the device operating the compressor and is efficient for relatively small data blocks.
There remains yet another need for a multi-stage data compressor that includes, as one stage, a block-wise adaptive statistical data compressor that satisfies the above-noted needs.
There remains an additional need for such a multi-stage data compressor that includes a clustering stage and a reordering stage for transforming each data block such that there tends to be an expected skew in the frequency distribution of characters in the data block.
There remains an additional need for such a multi-stage data compressor that includes additional compression stages, such as dictionary based or statistical coder stages, in order to increase the overall compression efficiency of the data compressor.
There remains an additional need for such a multi-stage data compressor in which the clustering stage utilizes the Burrows-Wheeler Transform ("BWT") and the reordering stage utilizes a move-to-the-front ("MTF") scheme.
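The way a BWT clustering step followed by an MTF reordering step skews the character distribution can be sketched as follows. This is a naive illustration under stated assumptions, not the implementation of the present invention: the function names are hypothetical, the BWT sorts all rotations directly (which is simple but slow for large blocks), and the MTF table is assumed to start in ascending byte order.

```python
def bwt(data: bytes):
    """Naive Burrows-Wheeler Transform: returns (last column, primary index)."""
    n = len(data)
    if n == 0:
        return b"", 0
    # Sort rotation start positions by the rotation they produce.
    order = sorted(range(n), key=lambda i: data[i:] + data[:i])
    last_column = bytes(data[(i - 1) % n] for i in order)
    return last_column, order.index(0)


def mtf(data: bytes):
    """Move-to-front: replace each byte by its current position in a table."""
    table = list(range(256))
    out = []
    for byte in data:
        idx = table.index(byte)
        out.append(idx)
        table.insert(0, table.pop(idx))  # recently seen bytes move to index 0
    return out


clustered, primary = bwt(b"banana")
print(clustered, primary)  # b'nnbaaa' 3
print(mtf(clustered))      # [110, 0, 99, 99, 0, 0]
```

The BWT groups identical characters into runs, and the MTF step turns those runs into small values (here, zeros), which is the kind of skewed distribution a subsequent statistical stage can exploit.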