The invention relates generally to the field of data compression and more particularly to the use of a sequential grammar transform and encoding methods to compress data with a known context.
Universal data compression methods can be divided into two subdivisions: universal lossless data compression and universal lossy data compression. Conventional universal lossless data compression methods typically employ arithmetic coding algorithms, Lempel-Ziv algorithms, and their variants. Arithmetic coding algorithms are based on statistical models of the data to be compressed. To encode a data sequence using an arithmetic coding algorithm, a statistical model is either built dynamically during the encoding process, or assumed to exist in advance. Several approaches exist to build a statistical model dynamically, including the prediction by partial match algorithm, dynamic Markov modeling, the context gathering algorithm, and context-tree weighting. Each of these methods predicts the next symbol in the data sequence using the proper context and encodes each symbol using its corresponding estimated conditional probability.
Most arithmetic coding algorithms and their variants are universal only with respect to the class of Markov sources with a Markov order less than some designed parameter value. Furthermore, arithmetic coding algorithms encode the data sequence letter by letter.
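By way of illustration only, the dynamic construction of a statistical model described above may be sketched as follows. The sketch assumes a fixed-order context (the preceding `order` symbols) and add-one smoothing of the counts. It reports the ideal code length that an arithmetic coder driven by the estimated conditional probabilities would approach, rather than producing an actual bit stream. The function name `adaptive_code_length` and its parameters are hypothetical and do not correspond to any of the named algorithms.

```python
import math
from collections import defaultdict


def adaptive_code_length(data, alphabet, order=1):
    """Ideal code length in bits under an adaptive order-`order` model.

    Assumption: probabilities are estimated with add-one smoothing from
    the counts seen so far, so a decoder could rebuild the same model.
    """
    counts = defaultdict(lambda: defaultdict(int))
    total_bits = 0.0
    for i, symbol in enumerate(data):
        context = data[max(0, i - order):i]   # preceding symbols
        seen = counts[context]
        # Estimated conditional probability of `symbol` in this context.
        p = (seen[symbol] + 1) / (sum(seen.values()) + len(alphabet))
        total_bits += -math.log2(p)
        seen[symbol] += 1                     # update the model after coding
    return total_bits


if __name__ == "__main__":
    text = "ab" * 8
    print(adaptive_code_length(text, "ab", order=0))
    print(adaptive_code_length(text, "ab", order=1))
```

On a strictly alternating sequence, the order-1 model yields a shorter ideal code length than the order-0 model, because the preceding symbol fully determines the next one.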
In contrast, Lempel-Ziv algorithms and their variants do not use statistical models. Lempel-Ziv algorithms parse the original data sequence into non-overlapping, variable length phrases according to a string matching mechanism, and then encode them phrase by phrase. In addition, Lempel-Ziv algorithms are universal with respect to a broader class of sources than the class of Markov sources of bounded order, that being the class of stationary, ergodic sources.
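The phrase-by-phrase parsing described above may be illustrated with a minimal sketch of LZ78-style incremental parsing, which is one member of the Lempel-Ziv family and is chosen here only as an example. Each phrase is the longest previously seen phrase extended by one new symbol, and is represented as a pair of a dictionary index and that symbol. The function names `lz78_parse` and `lz78_decode` are hypothetical.

```python
def lz78_parse(data):
    """Parse `data` into non-overlapping, variable-length phrases.

    Returns a list of (index, symbol) pairs, where index 0 denotes the
    empty phrase and index k denotes the k-th phrase added so far.
    """
    dictionary = {"": 0}
    phrases = []
    current = ""
    for symbol in data:
        if current + symbol in dictionary:
            current += symbol                 # keep extending the match
        else:
            phrases.append((dictionary[current], symbol))
            dictionary[current + symbol] = len(dictionary)
            current = ""
    if current:
        # Trailing phrase that was already in the dictionary.
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases


def lz78_decode(phrases):
    """Rebuild the original sequence from the (index, symbol) pairs."""
    entries = [""]
    output = []
    for index, symbol in phrases:
        phrase = entries[index] + symbol
        entries.append(phrase)
        output.append(phrase)
    return "".join(output)


if __name__ == "__main__":
    pairs = lz78_parse("abababab")
    print(pairs)
    print(lz78_decode(pairs))
```

No statistical model is involved at any point: the parse is driven entirely by string matching against earlier phrases.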
Other conventional universal compression methods include the dynamic Huffman algorithm, the move-to-front coding scheme, and some two-stage compression algorithms with codebook transmission. These conventional methods are either inferior to arithmetic coding and Lempel-Ziv algorithms or too complicated to implement. More recently, a new class of lossless data compression algorithms based on substitution tables was proposed that includes a new coding framework, but no explicit data compression algorithms were introduced. Moreover, this method has a disadvantage in that the greedy sequential transformation used in the encoding process is difficult to implement and does not facilitate efficient coding because the initial symbol s0 is involved in the parsing process.
Furthermore, these algorithms do not assume any prior knowledge about the data sequences being compressed. While this makes them suitable for general-purpose data compression needs, they are not particularly efficient for specific applications of data compression. In many instances, such as compression of web pages, Java applets, or text files, there is often some a priori knowledge about the data sequences being compressed. This knowledge can often take the form of so-called “context models.”
What is needed is a method of universal lossless data compression that overcomes the above-described disadvantages of existing compression methods while taking advantage of the a priori knowledge of the context of the data sequence being compressed.
In accordance with the present invention, a universal lossless data compression method is provided. This method employs source coding to construct on-line, tuneable, context-dependent compression rules. Unlike alternative methods previously used that compress the individual objects, this method compresses the numerical set of rules. To achieve on-line compression of web-based data for example, the method receives the data, constructs the rules dynamically, and then encodes the rules. Because the set of rules is dynamic, when the structure of web text data changes, the corresponding rules are updated; similarly, when the content of web objects is updated, the corresponding rules are updated accordingly. This approach is particularly efficient for encoding web pages, because the content of a web page can change often, while the underlying structure of a web page remains approximately constant. The relative consistency of the underlying structure provides the predictable context for the data as it is compressed.
One aspect of the invention relates to a method of sequentially transforming an original data sequence associated with a known context into an irreducible context-dependent grammar, and recovering the original data sequence from the grammar. The method includes the steps of parsing a substring from the sequence, generating an admissible context-dependent grammar based on the parsed substring, applying a set of reduction rules to the admissible context-dependent grammar to generate a new irreducible context-dependent grammar, and repeating these steps until the entire sequence is encoded. In addition, a set of reduction rules based on pairs of variables and contexts represents the irreducible context-dependent grammar such that the pairs represent non-overlapping repeated patterns and contexts of the data sequence.
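The recovery of a data sequence from a grammar whose production rules are indexed by pairs of contexts and variables may be illustrated with the following minimal sketch. It rests on several assumptions not stated above: the context is taken to be the most recently emitted symbol, the initial context is the empty string, and the toy grammar is chosen for readability and is not claimed to be irreducible. The function `expand` and the rule layout are hypothetical, and the sketch covers only the expansion step, not the parsing or reduction steps.

```python
def expand(rules, start, context=""):
    """Expand `start` under `context` into the sequence it represents.

    `rules` maps (context, variable) pairs to lists of symbols, each of
    which is either a terminal or another variable.
    Assumption: the context after emitting a terminal is that terminal.
    """
    variables = {variable for (_, variable) in rules}
    output = []

    def walk(symbol, ctx):
        if symbol in variables:
            if (ctx, symbol) not in rules:
                raise KeyError(f"no rule for {symbol!r} in context {ctx!r}")
            for child in rules[(ctx, symbol)]:
                ctx = walk(child, ctx)        # context evolves left to right
            return ctx
        output.append(symbol)                 # terminal symbol
        return symbol                         # becomes the next context

    walk(start, context)
    return "".join(output)


if __name__ == "__main__":
    # The same variable "A" stands for different strings in different contexts.
    toy_rules = {
        ("", "S"): ["a", "A", "c", "A", "c"],
        ("a", "A"): ["b"],
        ("c", "A"): ["a", "b"],
    }
    print(expand(toy_rules, "S"))
```

In the toy grammar, the variable `A` expands to `b` when it follows `a` and to `ab` when it follows `c`, which is the sense in which a rule is associated with a pair of a variable and a context rather than with a variable alone.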
In another aspect of the invention, the method relates to the use of adaptive context-dependent arithmetic coding to encode an irreducible context-dependent grammar associated with a known context model from a countable context model set. Furthermore, a set of reduction rules is applied to represent the irreducible context-dependent grammar based on pairs of variables and contexts such that the pairs represent non-overlapping repeated patterns and contexts of the data sequence.
In yet another aspect of the invention, a method is provided to encode a data sequence with a known context model by transforming the data sequence into an irreducible context-dependent grammar; converting the irreducible context-dependent grammar into its sorted form; constructing a generated sequence from the sorted irreducible context-dependent grammar; and encoding the generated sequence using an adaptive context-dependent arithmetic code.
The invention also relates to a method of sequentially transforming an original data sequence associated with a known context model into a sequence of irreducible context-dependent grammars; and further encoding the data sequence based on each of the irreducible context-dependent grammars by using adaptive context-dependent arithmetic coding. The method comprises the steps of parsing a substring from the sequence, encoding the substring by utilizing the structure of the previous irreducible context-dependent grammar and by using adaptive context-dependent arithmetic coding, generating an admissible context-dependent grammar based on the substring, the current context, and the previous irreducible context-dependent grammar, applying a set of reduction rules to the admissible context-dependent grammar to generate a new irreducible context-dependent grammar, and repeating these steps until all of the symbols of the sequence are parsed and coded.