1. Field of the Invention
The present invention relates in general to data compression and more particularly to a method and system for compressing data using clustering techniques.
2. Background Art
Data compression is vitally important as demand increases for more rapid processing, access, storage and transmission of data. Data may include text, sounds, images and video. Improvements in data transmission and storage systems have come more slowly than the dramatic historical increase in the processing power of computer systems. As a result, there has been a steadily increasing incentive to use additional processor resources to encode and decode data in a format that reduces the relative demand upon the storage and communications resources of computer systems and computer networks.
Data compression schemes that enable an exact reproduction of the initial input data are referred to as lossless data compression schemes. In lossless data compression, the primary goal is to change the representation of the data so that the more likely sequences of data from a source are represented using fewer bits of information. Since likely sequences of data are encoded with fewer bits, other sequences must necessarily be encoded with longer bit sequences. The best performance on average is achieved when the number of bits used to represent a sequence is proportional to the logarithm of the inverse of the probability of the sequence. The expected length of the bit sequence that gives this optimal compression can be calculated directly from the probability distribution and is typically referred to as the entropy of the distribution. If both the encoder and decoder are given information allowing them to calculate the probability of any sequence from a source, then there are existing techniques that can compress the sequence to within 1 bit of this entropy. For this reason, the problem of minimum-size lossless data compression is equivalent to finding an accurate probabilistic model for a data source.
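By way of illustration only, the relationship between probability, ideal code length and entropy described above may be sketched as follows. The function names and the three-symbol distribution are hypothetical and chosen for the example; the logarithm is taken base 2 so that lengths are expressed in bits.

```python
import math

def entropy_bits(probs):
    """Entropy of a distribution in bits: sum of p * log2(1/p)."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

def ideal_code_length(sequence, prob):
    """Ideal total length in bits: log2(1/p) summed over the symbols."""
    return sum(math.log2(1.0 / prob[s]) for s in sequence)

# Hypothetical source: "a" is twice as likely as "b" or "c".
prob = {"a": 0.5, "b": 0.25, "c": 0.25}

print(entropy_bits(prob.values()))      # 1.5 bits per symbol on average
print(ideal_code_length("aabc", prob))  # 1 + 1 + 2 + 2 = 6.0 bits
```

In this sketch the likely symbol "a" costs 1 bit and the less likely symbols cost 2 bits each, which is the trade-off the preceding paragraph describes.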
One approach to probabilistic modeling is to estimate the probability of each symbol in a sequence based upon the frequency of the same symbol given similar patterns of prior characters. These sets of "similar" patterns of prior characters will be referred to as contexts. One simple modeling scheme uses the prior character as a context and maintains a probability distribution for each possible prior character. Character-based probabilistic models can generally be described by two choices: the method for choosing, and possibly adapting, the contexts, and the method for choosing, and possibly adapting, the estimated probability distribution for each context.
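The simple scheme described above, in which the prior character serves as the context, may be sketched as follows. This is a minimal illustration, assuming plain relative-frequency estimates with no smoothing; the function names and the sample string are hypothetical.

```python
from collections import Counter, defaultdict

def build_order1_model(text):
    """For each prior character, count the characters that follow it."""
    counts = defaultdict(Counter)
    for prev, cur in zip(text, text[1:]):
        counts[prev][cur] += 1
    return counts

def probability(counts, prev, cur):
    """Relative-frequency estimate of P(cur | prev); 0.0 if prev was never seen."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[cur] / total if total else 0.0

model = build_order1_model("abracadabra")
print(probability(model, "a", "b"))  # "a" is followed by b, c, d, b -> 0.5
print(probability(model, "b", "r"))  # "b" is always followed by "r" -> 1.0
```

A practical model would also need to assign non-zero probability to characters not yet seen in a context, which this sketch deliberately omits.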
Compression systems frequently have a preliminary encoding followed by a secondary encoding. For example, the preliminary encoder of a Lempel-Ziv 1977 (LZ-77) compression scheme takes a sequence of characters and replaces recurring groups of characters with a pair of tokens called the offset and the length. The offset represents the number of input characters since the previous occurrence of the identical grouping of characters. The length indicates the number of characters in the grouping of characters. A secondary encoding transforms the offset and length pairs into a variable-length sequence of bits such that the more frequently occurring offset and length pairs are represented by shorter bit sequences. Another example of a preliminary encoder is a Move-to-Front (MTF) encoder. An MTF encoder maintains an ordered list of all possible characters. For each input character, the MTF encoder replaces the character with its position on the list. Further, after the encoder uses each character, the character is moved to the front of the list. Therefore, more frequently occurring characters will tend to have lower position numbers on the list than less frequent characters.
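The two preliminary encoders described above may be sketched as follows. This is an illustrative sketch only: the function names, the `min_len` threshold, the use of bare characters as literal tokens, and the exhaustive (slow) match search are all assumptions made for the example, and no secondary encoding is shown.

```python
def lz77_encode(data, min_len=3):
    """Replace recurring groups with (offset, length); other characters stay literal."""
    tokens, i = [], 0
    while i < len(data):
        best_len, best_off = 0, 0
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > 0 and k >= best_len:  # on ties, prefer the nearer match
                best_len, best_off = k, i - j
        if best_len >= min_len:
            tokens.append((best_off, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens

def lz77_decode(tokens):
    """Rebuild the text by copying `length` characters from `offset` back."""
    out = []
    for t in tokens:
        if isinstance(t, tuple):
            off, length = t
            for _ in range(length):
                out.append(out[-off])
        else:
            out.append(t)
    return "".join(out)

def mtf_encode(data, alphabet):
    """Replace each character with its list position, then move it to the front."""
    table, out = list(alphabet), []
    for ch in data:
        idx = table.index(ch)
        out.append(idx)
        table.insert(0, table.pop(idx))
    return out

def mtf_decode(indices, alphabet):
    """Invert mtf_encode by replaying the same list updates."""
    table, out = list(alphabet), []
    for idx in indices:
        ch = table.pop(idx)
        out.append(ch)
        table.insert(0, ch)
    return "".join(out)

print(lz77_encode("abcabcabc"))   # ['a', 'b', 'c', (3, 6)]
print(mtf_encode("aaabbb", "abc"))  # [0, 0, 0, 1, 0, 0]
```

In the MTF output, the repeated characters map to position 0, so the token stream is dominated by small numbers that a secondary encoder can represent with short bit sequences.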
Typically, the tokens from the preliminary encoder are then converted to variable-length bit sequences appropriate for the probability distribution of tokens. Both the encoder and decoder must maintain identical estimates for the distribution of tokens. Adaptive encoders achieve this consistency by updating the probability estimate of tokens as each token is encoded or decoded. Non-adaptive encoders typically use a fixed distribution which is stored in a header which precedes the encoded tokens. When random access to small compressed records is required, the efficiency of the adaptive approach is reduced because a short sequence cannot be used to accurately estimate the probability distribution of tokens. Similarly, the efficiency of the non-adaptive approach using a header for each record is reduced because the overhead for representing the header must be amortized over fewer tokens in a shorter record. For this reason, efficient access to small records requires a system which effectively shares the modeling information used to decompress the multiple records.
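The penalty that short records impose on an adaptive encoder may be illustrated with the following sketch, which compares ideal code lengths only and performs no actual bit-level coding. The add-one (Laplace) estimator, the 256-symbol alphabet, the sample record and the shared distribution are all assumptions made for the example and are not taken from any particular system.

```python
import math

def adaptive_cost_bits(tokens, alphabet_size):
    """Ideal length in bits when the model starts uniform and adapts per token."""
    counts, seen, bits = {}, 0, 0.0
    for t in tokens:
        # Add-one estimate, computed before the token updates the counts,
        # so that a decoder could form the same estimate.
        p = (counts.get(t, 0) + 1) / (seen + alphabet_size)
        bits += math.log2(1.0 / p)
        counts[t] = counts.get(t, 0) + 1
        seen += 1
    return bits

def shared_cost_bits(tokens, prob):
    """Ideal length in bits under a fixed distribution known to both sides."""
    return sum(math.log2(1.0 / prob[t]) for t in tokens)

record = "aaaa"                    # a hypothetical short record
shared = {"a": 0.5, "b": 0.5}      # hypothetical model shared across records

print(adaptive_cost_bits(record, 256))  # each token still costs about 6 to 8 bits
print(shared_cost_bits(record, shared))  # 4.0 bits in total
```

With only four tokens, the adaptive estimate has barely moved away from the uniform starting point, whereas a model shared across many records is already accurate for the first token of each record.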