A compression algorithm that uses a grammar transform to construct a sequence of irreducible context-free grammars, from which a data sequence is compressed, is described in “Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform—Part One: Without Context Models”, E.-h. Yang and J. C. Kieffer, IEEE Transactions on Information Theory, VOL. 46, NO. 3, May 2000, pp. 755-777, and in “Grammar based codes: A new class of universal lossless source codes,” J. C. Kieffer and E.-h. Yang, IEEE Transactions on Information Theory, VOL. 46, pp. 737-754, May 2000. The entire contents of both are hereby incorporated by reference. This algorithm is known in the art as the YK compression algorithm, and is referred to as such herein. The YK compression algorithm applies a set of reduction rules to produce an irreducible grammar that encodes an original data sequence. This grammar can then be used to recover the original data sequence.
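By way of illustration only, the recovery of a data sequence from a context-free grammar can be sketched as follows. This is a minimal sketch and not the YK algorithm itself: the toy grammar, the name `expand_grammar`, and the convention that any symbol without a production rule is a terminal are assumptions made for the example.

```python
def expand_grammar(grammar, symbol="S"):
    """Recursively replace variables by their right-hand sides.

    Assumption for this sketch: a symbol that has no production rule
    in `grammar` is a terminal and is emitted unchanged.
    """
    if symbol not in grammar:
        return symbol
    return "".join(expand_grammar(grammar, s) for s in grammar[symbol])


# Hypothetical toy grammar: the variable A stands for the repeated pattern "ab".
TOY_GRAMMAR = {
    "S": ["A", "c", "A", "d"],
    "A": ["a", "b"],
}

recovered = expand_grammar(TOY_GRAMMAR)
print(recovered)
```

In this toy grammar the variable `A` appears twice on the right-hand side of the start rule, so the repeated pattern is stored once and referenced twice.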
In many instances, such as compression of web pages, Java applets, or text files, there is often some a priori knowledge about the data sequences being compressed. This knowledge can often take the form of so-called “context models.” For example, context based compression techniques are particularly efficient for encoding web pages whose content changes often while the underlying structure of the web page remains approximately constant. The relative consistency of the underlying structure provides the predictable context for the data as it is compressed.
U.S. Pat. No. 6,801,141, issued on Oct. 5, 2004 to En-Hui Yang and Da-Ke He, and “Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform—Part Two: With Context Models”, En-Hui Yang and Da-Ke He, IEEE Transactions on Information Theory, VOL. 49, NO. 11, November 2003, pp. 2874-2894, both describe an improvement to the YK compression algorithm that uses contexts. Both are hereby incorporated by reference in their entirety, as are the references cited therein. The methods and techniques described therein are referred to herein as context based YK compression (CBYK).
One aspect of the CBYK described therein relates to a method of sequentially transforming an original data sequence associated with a known context model into an irreducible context-dependent grammar, and recovering the original data sequence from the grammar. The method includes the steps of parsing a substring from the sequence, generating an admissible context-dependent grammar based on the parsed substring, applying a set of reduction rules to the admissible context-dependent grammar to generate a new irreducible context-dependent grammar, and repeating these steps until the entire sequence is encoded. In addition, a set of reduction rules based on pairs of variables and contexts represents the irreducible context-dependent grammar such that the pairs represent non-overlapping repeated patterns and contexts of the data sequence.
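By way of illustration only, the role of (context, variable) pairs in recovering a sequence can be sketched as follows. This is a toy sketch, not the CBYK transform: the rule table, the name `expand_context_dependent`, the convention that uppercase symbols are variables, and the choice of the last emitted terminal as the context are all assumptions made for the example.

```python
def expand_context_dependent(rules, variable="S", context=""):
    """Expand `variable` under `context`; return (text, context afterwards).

    Assumptions for this sketch: uppercase symbols are variables, all
    other symbols are terminals, and the context is simply the last
    terminal emitted so far (the empty string before any output).
    """
    out = []
    for sym in rules[(context, variable)]:
        if sym.isupper():
            # The rule used for a variable depends on the current context.
            text, context = expand_context_dependent(rules, sym, context)
            out.append(text)
        else:
            out.append(sym)
            context = sym
    return "".join(out), context


# Hypothetical rule table keyed by (context, variable) pairs.
TOY_RULES = {
    ("", "S"): ["a", "A", "a", "A", "b", "A"],
    ("a", "A"): ["b", "c"],  # A following the context "a"
    ("b", "A"): ["d"],       # the same variable under a different context
}

text, final_context = expand_context_dependent(TOY_RULES)
print(text)
```

The same variable `A` expands to different strings depending on the context in which it occurs, which is why the grammar is indexed by pairs rather than by variables alone.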
CBYK compression can provide significant compression gains over the context-free YK compression algorithm, especially when it is combined with interactive compression. In brief, CBYK compression uses the context as a form of predictor of the next parsed symbol or phrase, together with the corresponding estimated conditional probability for coding, in order to achieve good compression. In theory, the better the context model used by CBYK, the better the compression rate that can be expected.
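By way of illustration only, the benefit of conditioning on a context can be sketched by comparing ideal code lengths computed from empirical probabilities with and without a context. This is a simplified sketch under stated assumptions: it works on raw characters rather than on parsed grammar symbols or phrases, it uses static counts rather than adaptive estimates, it ignores the cost of describing the model, and the names `order0_bits` and `order1_bits` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def order0_bits(data):
    """Ideal code length in bits using context-free symbol frequencies."""
    counts = Counter(data)
    n = len(data)
    return -sum(c * math.log2(c / n) for c in counts.values())


def order1_bits(data):
    """Ideal code length in bits when each symbol is conditioned on the
    previous symbol (assumed context model for this sketch)."""
    by_context = defaultdict(Counter)
    prev = ""  # empty context before the first symbol
    for sym in data:
        by_context[prev][sym] += 1
        prev = sym
    total = 0.0
    for counts in by_context.values():
        n = sum(counts.values())
        total -= sum(c * math.log2(c / n) for c in counts.values())
    return total


sample = "ab" * 8
print(order0_bits(sample), order1_bits(sample))
```

For the strictly alternating sample, the previous symbol predicts the next one exactly, so the conditional code length collapses while the context-free code length stays at one bit per symbol.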
In general, for CBYK compression, a good context model acts as a good predictor of the next parsed symbol or phrase. In this regard, improvements to the context model can increase the effectiveness of the compression. However, if improving the context model also increases its size, practical limits need to be considered.
It has been found that the memory required to run CBYK increases significantly with the size of the context model. For example, if a context model is not chosen properly, the number of grammar variables can be significantly higher than the number in context-free YK, resulting in higher memory usage. If the memory usage of CBYK exceeds the applicable constraints or the available capacity, then the use of CBYK is not desirable, regardless of how significant the compression gain is. Depending on the application and the devices running CBYK, even a simple context model, such as using the last byte of the previously parsed phrase as the context, can exceed memory constraints. Moreover, the number of possible contexts grows exponentially with the context length. On the other hand, since CBYK in general uses more resources than context-free YK, it is not preferable unless it provides a benefit in compression gain.
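By way of illustration only, the exponential growth in the number of possible contexts can be sketched as follows. The sketch assumes byte-valued symbols (an alphabet of 256) and a context formed from the preceding `context_len` bytes; the names `max_contexts` and `observed_contexts` are hypothetical.

```python
def max_contexts(context_len, alphabet_size=256):
    """Upper bound on the number of distinct contexts of a given length."""
    return alphabet_size ** context_len


def observed_contexts(data, context_len):
    """Number of distinct contexts that actually precede a position in `data`."""
    return len({data[i - context_len:i] for i in range(context_len, len(data))})


for k in (1, 2, 3):
    print(k, max_contexts(k))

sample_bytes = b"abcabcabc"
print(observed_contexts(sample_bytes, 1), observed_contexts(sample_bytes, 2))
```

The upper bound multiplies by 256 for each additional byte of context, even though a given data sequence typically exercises far fewer contexts. Whether memory scales with the bound or with the observed count depends on how the per-context state is stored, which this sketch does not address.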
Therefore, it is desirable to provide a method for creating a context model, for example, for use with CBYK, that provides a suitable trade-off between memory requirements and compression gain. In particular, it is desirable to provide a context model that uses less memory, but still retains acceptable compression gains, compared to larger context models.