1. Field of the Invention
The present invention relates to an improved system and method for performing lossless data compression and decompression. More particularly, the present invention relates to a system and method for performing lossless data compression and decompression which employs a trie-type data structure to efficiently parse the data string being compressed, while also taking into account any pre-defined grammar and pre-defined source statistics relating to the data in the data string, as well as error handling at the decoder and memory constraints for both the encoder and decoder.
2. Description of the Related Art
Lossless data compression algorithms can be broadly classified into two types, namely, dictionary coding and statistical coding. The most widely used dictionary coding algorithms are the Lempel-Ziv algorithms and their variants. Dictionary coding techniques achieve compression by exploiting redundancy in data through some kind of string matching mechanism. In contrast, statistical coding methods such as arithmetic coding exploit the redundancy in data through a statistical model. Very recently, a new type of lossless source code called a grammar-based code was proposed in a publication by J. C. Kieffer and E.-H. Yang entitled "Grammar Based Codes: A New Class of Universal Lossless Source Codes," IEEE Transactions on Information Theory, the entire contents of which are incorporated herein by reference.
The class of grammar-based codes is broad enough to include Lempel-Ziv types of codes as special cases. To compress a data sequence, a grammar-based code first transforms the data sequence into a context-free grammar, from which the original data sequence can be fully reconstructed by performing parallel substitutions. This context-free grammar, which is a compact representation of original data, is compressed using arithmetic encoding. It has been proved in the Kieffer publication referenced above that if a grammar-based code transforms each data sequence into an irreducible context-free grammar, then the grammar-based code is universal for the class of stationary, ergodic sources. Grammar-based codes offer a design framework in which arithmetic coding and string matching capability can be combined in an elegant manner.
Within the framework of grammar-based codes, an efficient greedy grammar transform was developed as described in a publication by E.-H. Yang and J. C. Kieffer, entitled "Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform - Part One: Without Context Models", IEEE Transactions on Information Theory, the entire contents of which are incorporated by reference herein. This greedy grammar transform sequentially constructs a sequence of irreducible context-free grammars from which the original data sequence can be recovered incrementally. The greedy grammar transform sequentially parses the original data into non-overlapping, variable-length phrases, and updates the grammar accordingly.
Based on this greedy grammar transform, three universal lossless data compression algorithms are proposed in the Yang publication cited above, namely, a hierarchical algorithm, a sequential algorithm, and an improved sequential algorithm. These algorithms jointly optimize, in some sense, string matching and arithmetic coding capability. Although the three algorithms are based on the same grammar transform, they differ in their encoding strategies.
The hierarchical algorithm encodes the final irreducible grammar using a zero-order arithmetic code with a dynamic alphabet, while the two sequential algorithms use arithmetic coding to encode the sequence of parsed phrases, instead of the final grammar. The improved sequential algorithm is an improved version of the sequential algorithm that better exploits the structure of the grammar at each step. The improved sequential algorithm, henceforth referred to as the YK algorithm, is of particular interest, since experimental results using this algorithm have yielded superior compression performance compared to the other two algorithms. In addition, unlike the hierarchical algorithm, this algorithm is sequential and does not require the whole data sequence to be present before starting the encoding operation. This algorithm has been proved to be universal in the sense that it can asymptotically achieve the entropy rate of any stationary, ergodic source. In addition, the algorithm has been shown to have time complexity that is linear in the data size.
Experimental results using the YK algorithm have shown that the YK algorithm effectively compresses files of sizes ranging from small files, such as internet datagrams, to big files, such as those occurring in archiving applications. The YK algorithm significantly outperforms Lempel-Ziv type algorithms such as Gzip for both small and large file sizes. In addition, the YK algorithm is more effective than the Burrows-Wheeler transform based algorithm BZIP2, particularly for small files.
The basic data structure used by the YK algorithm is a context-free grammar, whose components are a source alphabet, a set of variables, and production rules that map each variable into a finite string composed of source symbols and variables. A grammar-based compression algorithm transforms the original data into such a grammar, with the additional requirement that the grammar be irreducible. A grammar is irreducible if it satisfies a certain definition of compactness. There are several ways to construct an irreducible grammar that represents the original data. The YK algorithm is based on one such grammar transform that sequentially constructs a sequence of irreducible grammars in a greedy manner.
At an intermediate stage of the YK algorithm, there exists an irreducible grammar, defined by the source alphabet, the current set of variables, and the corresponding production rules. Also, there exists a frequency distribution on the source alphabet and the variables. The basic implementation of the YK encoding algorithm consists of a sequentially iterative application of three fundamental steps, namely, parsing, updating and encoding. The parsing operation determines the longest prefix of the remaining part of the original data string that is represented by one of the current variables. The updating operation subsequently updates the grammar after adding the new parsed substring, and modifies the frequency distribution of the source alphabet and/or the variables. The encoding operation uses the frequency distribution on the symbols and arithmetic encoding to code the parsed substring. The decoder sequentially uses arithmetic decoding to determine the parsed substring, followed by updating, to produce an identical sequence of grammars as in the encoder, from which the original data is recovered incrementally.
The following is a theoretical description of the YK algorithm. In defining the variables used in the following description, let A be the source alphabet with cardinality |A| greater than or equal to 2, let A+ denote the set of all finite strings drawn from A, let x = x_1 x_2 ... x_n be a finite string drawn from the alphabet A that is to be compressed, and let S = {s_0, s_1, s_2, ...} be a countable set, disjoint from A. For j ≥ 1, let S(j) = {s_0, s_1, ..., s_{j-1}} and S(j)+ = {s_1, s_2, ..., s_{j-1}}. A context-free grammar G is a mapping from S(j) to (S(j) ∪ A)+ for some j ≥ 1. The mapping G is explicitly represented by writing each relationship (s_i, G(s_i)) as s_i → G(s_i), for i < j. G(s_i) is called the production rule for s_i. The symbol s_0 is special in the sense that the A-string represented by s_0 is the original data string. s_0 is referred to as the start symbol, while s_i for i ≥ 1 are called variables. An example of a grammar is shown below:
s_0 → s_3 s_1 s_1 b b s_2 b s_4 s_4 c
s_1 → s_2 a
s_2 → a c
s_3 → a b
s_4 → s_3 s_2 c
This is an example of an irreducible grammar representing the data sequence abacaacabbacbabaccabaccc. This is a simplified example where A={a, b, c}. In a conventional data compression system, A can contain up to 256 ASCII characters.
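As a minimal sketch of how the data sequence is recovered from the example grammar, the following Python fragment substitutes each variable by its production rule until only source symbols remain. The dictionary layout, the names GRAMMAR and expand, and the use of recursion in place of explicit parallel substitution are illustrative assumptions and are not taken from the cited publications.

```python
# Hypothetical representation of the example grammar: each production
# rule is a list of symbols; keys of the dictionary are the variables,
# and any symbol that is not a key is a source symbol from A = {a, b, c}.
GRAMMAR = {
    "s0": ["s3", "s1", "s1", "b", "b", "s2", "b", "s4", "s4", "c"],
    "s1": ["s2", "a"],
    "s2": ["a", "c"],
    "s3": ["a", "b"],
    "s4": ["s3", "s2", "c"],
}


def expand(symbol, grammar):
    """Return the A-string represented by a symbol.

    Variables are replaced recursively by their production rules; for an
    admissible grammar this terminates and gives the same result as
    repeated parallel substitution.
    """
    if symbol not in grammar:
        return symbol  # a source symbol represents itself
    return "".join(expand(s, grammar) for s in grammar[symbol])


if __name__ == "__main__":
    for name in GRAMMAR:
        print(name, "->", expand(name, GRAMMAR))
```

Expanding the start symbol s0 in this sketch reproduces the 24-symbol data sequence given above, and expanding any other variable gives the substring that the variable stands for.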
This grammar is an example of an admissible grammar because one obtains a sequence from A after finitely many parallel replacements of variables in G(s_i), and every variable s_i (i < j) is replaced at least once by G(s_i). The resulting sequence from A is called the A-string represented by the respective variable. As stated above, the symbol s_0 is special in the sense that the A-string represented by s_0 is the original data string. In addition, this grammar is an irreducible grammar because none of Reduction Rules 1 to 5 in the Yang publication cited above can be applied to it to get a new admissible grammar. Details of the concepts of admissible grammar and irreducible grammar can be found in the Yang publication.
The Yang publication referenced above, as well as a publication by E. H. Yang entitled "Efficient Universal Lossless Data Compression Algorithms Based on a Greedy Sequential Grammar Transform: Implementation and Experimental Results", Technical Report released to the Hughes Network Systems, Mar. 31, 1999, the entire content of which is incorporated by reference herein, describe the working of the YK algorithm and its basic implementation in detail. The main steps of the algorithm are briefly summarized below, with x = x_1 x_2 ... x_n being a sequence from A that is to be compressed. The YK algorithm proceeds through the following steps:
Parsing: The parsing operation parses the sequence x sequentially into non-overlapping substrings {x_1, x_2 ... x_{n_2}, ..., x_{n_{t-1}+1} ... x_{n_t}}, where the phrases are indexed by i with 1 ≤ i ≤ t, n_1 = 1, and n_t = n. The first substring is x_1. At the ith step, suppose that the first i phrases x_1, x_2 ... x_{n_2}, ..., x_{n_{i-1}+1} ... x_{n_i} have been parsed off. Suppose the variable set of the current grammar G_i is S(j_i) = {s_0, s_1, ..., s_{j_i - 1}}, with j_1 = 1. The next substring x_{n_i+1} ... x_{n_{i+1}} is the longest prefix of x_{n_i+1} ... x_n that can be represented by a variable s_j in S(j_i), if such a prefix exists. If such a prefix exists, the next parsed phrase is s_j; otherwise, the next parsed phrase is the symbol x_{n_i+1}.
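A single parsing step can be sketched as follows. This is a simplified illustration, assuming that the A-string represented by each current variable is already available in a dictionary (here called expansions, a hypothetical name) and that the grammar is held fixed for the duration of the step. It checks every variable in turn, which is the straightforward linear scan rather than an optimized search.

```python
def next_phrase(remaining, expansions):
    """Return (phrase, length) for one greedy parsing step.

    remaining  -- the not-yet-parsed part of the data string (non-empty)
    expansions -- assumed mapping from variable name to the A-string
                  that the variable represents in the current grammar

    The phrase is the variable whose A-string is the longest prefix of
    `remaining`; if no variable matches, it is the next source symbol.
    """
    if not remaining:
        raise ValueError("nothing left to parse")
    best, best_len = None, 0
    for variable, text in expansions.items():
        if len(text) > best_len and remaining.startswith(text):
            best, best_len = variable, len(text)
    if best is None:
        return remaining[0], 1
    return best, best_len


if __name__ == "__main__":
    # A-strings of the variables s1..s4 from the example grammar above.
    expansions = {"s1": "aca", "s2": "ac", "s3": "ab", "s4": "abacc"}
    print(next_phrase("abaccab", expansions))
    print(next_phrase("cab", expansions))
```

In the full algorithm the grammar, and therefore the set of candidate A-strings, changes after every step; the sketch omits that update.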
Grammar Update: Let α denote the last symbol on the right end of G_i(s_0). The parsed phrase (denoted by β) is appended to the right end of G_i(s_0) to give the appended grammar G'_i. The appended grammar is reduced, if possible, using Reduction Rules (the Yang publication) to yield an irreducible grammar G_{i+1}. Define the indicator sequence I: {1, 2, ..., t} → {0, 1} as follows: I(1) = 0, and for any i greater than 1, I(i) is equal to 0 if G_i equals the appended grammar G'_{i-1} (i.e., reduction was not possible), and 1 otherwise (i.e., reduction was possible). It is proved in the first Yang publication referenced above that the appended grammar G'_i is reducible if and only if αβ is the only non-overlapping repeated pattern of length ≥ 2 in G'_i. To determine if the appended grammar is reducible, a list L1(γ) is maintained for all symbols γ in A ∪ S(j_i), where L1(γ) consists of triples (η, m, n), where η ∈ A ∪ S(j_i), and m and n are the row and column locations in the grammar G_i of the pattern γη, which can potentially be repeated in future appended grammars. If η is the first component of a triple in L1(γ), then η is said to be in the list L1(γ). Depending on the values of I(i+1) and I(i), there are three possible distinct cases:
Case 0: I(i+1)=0
Case 10: I(i+1)=1, I(i)=0
Case 11: I(i+1)=1, I(i)=1
Under Case 0, G_{i+1} is equal to the appended grammar G'_i. Under Case 10, G_{i+1} is obtained by adding a new row representing the repeated phrase αβ. Under Case 11, G_{i+1} is obtained by adding β to the end of the row corresponding to the last new variable. The lists L1(γ) are appropriately modified in each of the three cases. The details of the grammar and list update operations can be found in the first Yang publication referenced above.
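The selection among the three cases depends only on two consecutive indicator values, which can be sketched as below. The function names update_case and classify_sequence are hypothetical, and the sketch covers only the case selection, not the grammar or list updates themselves.

```python
def update_case(i_next, i_curr):
    """Map the indicator pair (I(i+1), I(i)) to the case label used above."""
    if i_next not in (0, 1) or i_curr not in (0, 1):
        raise ValueError("indicator values must be 0 or 1")
    if i_next == 0:
        return "0"  # no reduction was possible
    return "10" if i_curr == 0 else "11"


def classify_sequence(indicators):
    """Return the case label for each step after the first.

    `indicators` is the list [I(1), I(2), ..., I(t)], with I(1) = 0.
    """
    return [update_case(nxt, cur)
            for cur, nxt in zip(indicators, indicators[1:])]


if __name__ == "__main__":
    print(classify_sequence([0, 0, 1, 1, 0]))
```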
Arithmetic Encoding: An order 1 arithmetic code is used to encode the sequence {I(i)}, i = 1, ..., t, and the parsed phrases β are encoded using an order 0 arithmetic encoder in an interleaved manner. The alphabet used for the order 1 arithmetic encoder is {0, 1}, and the counters c(0,0), c(0,1), c(1,0), and c(1,1) are used for the order 1 frequency distribution on the alphabet. Also, for each γ ∈ S ∪ A, two sets of counters c(γ) and ĉ(γ) are used to model the frequency distribution for coding the parsed phrases β. Depending on the three cases, the coding of β uses the following alphabets and counters:
Case 0: The alphabet used for coding β is (A ∪ S(j_i) ∪ {φ}) − L2(α), where L2(α) is a list related to the list L1(α). L2(α) is defined as the union of L1(α) and a second set that includes all those η such that there exists a variable whose production rule is αη, and φ is the end-of-sequence symbol disjoint from A and S(j_i). (See the first Yang publication for the precise definitions of L1(·) and L2(·).) The counters c(γ) are used for the frequency distribution.
Case 10: The alphabet used for coding β is the list L1(α). The counters ĉ(γ) are used for the frequency distribution.
Case 11: The parsed phrase β is the only element in the list L1(α), and hence is not coded.
When the YK encoder reaches the end of the file being compressed, it encodes the indicator 0 (corresponding to Case 0), and subsequently encodes the end-of-sequence symbol φ. The counters are initialized at the start of the YK algorithm, and are updated suitably in each of the three cases. At initialization, the counters c(r, s) equal 1 for r, s = 0, 1, while the counters c(γ) and ĉ(γ) are given the value 1 for γ ∈ A, and 0 otherwise. The details of this encoding scheme can be found in the first Yang publication referenced above. The details of arithmetic coding in general can be found in a publication by I. H. Witten, R. Neal, and J. G. Cleary entitled "Arithmetic Coding for Data Compression," Communications of the ACM, Vol. 30, pp. 520-540, 1987, the entire content of which is incorporated by reference herein. Also, to overcome the problem of dealing with a potentially large alphabet in Case 0, the arithmetic coding in this case can be efficiently performed using a recently proposed technique called multilevel arithmetic coding, as described in a publication by E. H. Yang and Y. Jia, "Universal Lossless Coding of Sources with Large or Unbounded Alphabets," Dept. of Electrical and Computer Engineering, University of Waterloo, 1999, the entire content of which is incorporated by reference herein.
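The counter initialization described above can be sketched as follows. The initial values follow the text; the probability estimate, which normalizes c(r, s) over the two possible next indicator values, is a common adaptive-model convention assumed here for illustration and is not quoted from the cited publications. The names init_counters and indicator_probability are hypothetical.

```python
from fractions import Fraction


def init_counters(alphabet):
    """Create the three counter tables with their initial values.

    indicator[(r, s)] starts at 1 for r, s in {0, 1}.
    c and c_hat start at 1 for every source symbol; symbols that are not
    present (such as variables created later) are treated as count 0.
    """
    indicator = {(r, s): 1 for r in (0, 1) for s in (0, 1)}
    c = {symbol: 1 for symbol in alphabet}
    c_hat = {symbol: 1 for symbol in alphabet}
    return indicator, c, c_hat


def indicator_probability(indicator, prev, nxt):
    """Estimate P(I(i+1) = nxt | I(i) = prev) from the order 1 counters.

    Assumption: the estimate is the relative frequency
    c(prev, nxt) / (c(prev, 0) + c(prev, 1)).
    """
    total = indicator[(prev, 0)] + indicator[(prev, 1)]
    return Fraction(indicator[(prev, nxt)], total)


if __name__ == "__main__":
    indicator, c, c_hat = init_counters("abc")
    print(indicator_probability(indicator, 0, 1))  # flat start
    indicator[(0, 1)] += 1  # observe a 0 -> 1 transition
    print(indicator_probability(indicator, 0, 1))
```

With all four indicator counters equal to 1, both transitions out of a given state start out equally likely, which is the flat starting distribution discussed later in this section.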
The YK decoder essentially decodes the sequence {I(i)}, i = 1, ..., t, using an order 1 arithmetic decoder, and based on one of the three cases discussed above, decodes the parsed phrase β. Then it updates the grammar, the lists, and the counters in exactly the same manner as in the YK encoder. The decoder stops when it decodes the indicator 0, followed by the end-of-sequence symbol φ. However, the YK decoder does not perform the parsing operation.
As discussed above, parsing is one of the three main operations in the YK encoder. However, if the parsing operation is not performed efficiently, it would consume a significant portion of the overall time taken by the encoder, and make the YK algorithm slow. Also, if the parsing operation is inefficient, there would be a significant time gap between the operations of the encoder and the decoder, since there is no parsing operation in the decoder. This is crucial in real time applications, since it is appropriate for the encoder and decoder to work at substantially similar speeds in order to avoid substantial overflow or underflow of their buffers. The parsing operation in the second Yang publication may be slow because the search for the next parsed phrase is performed by individually checking a group of many variables.
In addition, the compressed bit-stream from the YK encoder can be subjected to bit-errors during transmission. A naive YK decoder can get into an inconsistent mode of operation when there are bit errors in the compressed bit-stream, which are inevitable in certain applications. In unreliable systems, the decoder has to operate on a possibly erroneous bit-stream. As a result, the decoder can get into an ambiguous state and cause the software or hardware devices to crash. For instance, the decoder can get into a non-terminating loop of operation, producing a potentially unbounded amount of decompressed data. Another example would be for the decoder to decode an erroneous symbol that would lead to a conflicting state while updating its existing grammar.
As can also be appreciated from the above description of the YK algorithm, the memory requirement of the YK algorithm grows with data size. For certain applications, the memory requirement can go beyond what the system can afford. It is imperative in such situations to have a compression algorithm that can continue to work without exceeding a pre-defined limit on the memory requirement. The two extreme ways known as freezing and restarting are possible solutions, but they often lead to poor compression efficiency.
In addition, the basic YK algorithm starts with a flat frequency distribution on the symbols, and adaptively updates the frequency counts as the algorithm proceeds. However, starting with a flat frequency distribution is inefficient, especially when a file having a small size is being compressed and decompressed. Furthermore, the basic YK algorithm starts with a null grammar, and builds the whole grammar from scratch. However, starting with a null grammar is inefficient, especially when a small file is being compressed and decompressed, as the benefits of having a grammar are never really realized when compressing and decompressing small files.
A need therefore exists for an improved data compression and decompression system and method which eliminates the deficiencies discussed above.
It is therefore an object of the present invention to provide a system and method employing an improved lossless compression algorithm that overcomes the above deficiencies.
It is also an object of the present invention to provide a system and method employing a YK compression algorithm that uses a trie-type data structure which can easily be implemented on a computer and which provides faster parsing in data compression.
It is yet another object of the present invention to provide a system and method employing a decoder that can handle bit-errors entering the decoder without causing the system to crash or enter an infinite loop.
It is a further object of the present invention to provide a system and method which constrains the sizes of grammar, lists, and parsing tries used when compressing a large file so that the constraint mechanisms do not lead to poor compression efficiencies.
It is another object of the present invention to provide a system and method which starts a compression of a file with a pre-defined frequency distribution at both the encoder and the decoder in order to improve compression efficiency.
It is yet another object of the present invention to provide a system and method that starts a YK compression of a file with a pre-defined grammar built using a typical training data set at both the encoder and the decoder.
It is yet another object of the present invention to provide a system and method that performs YK compression of incoming data packets and produces outgoing compressed data packets in such a way that each outgoing data packet has sufficient information to recreate the corresponding input data packet without knowledge of future compressed data packets.
These and other objects of the present invention are substantially achieved by providing a system and method employing an improved data compression and decompression technique for use in a communication system. Specifically, the system and method employ an improved YK algorithm which uses an appropriate form of the trie data structure for the parsing operation. With this data structure, the parsing complexity is essentially proportional to the data size, and hence parsing is very fast. The improved YK algorithm also is capable of handling errors in data without getting into an ambiguous state or causing the hardware or software system to crash. The improved YK algorithm also sequentially updates the grammar, while keeping the number of variables below a pre-defined limit. This changes the grammar gradually, and can potentially increase the compression efficiency.
The system and method are capable of parsing an input string into an irreducible grammar by using a trie-type data structure that represents the variables of the irreducible grammar, updating the grammar based on the last character to be parsed in the input string, and arithmetically encoding the irreducible grammar into a stream of bits to be sent to a decoder. The trie-type data structure comprises a root and a plurality of nodes, and traversal of the trie-type data structure from the root to one of said plurality of the nodes achieves a representation of variables of the grammar.
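To illustrate how a trie can support longest-prefix parsing, the following sketch builds a trie in which each path from the root spells the A-string of a variable, and then walks the trie along the input to find the longest matching variable. Keying the trie on individual source symbols, and the names TrieNode, build_trie, and longest_match, are assumptions made for illustration; the trie actually used by the system may be organized differently.

```python
class TrieNode:
    """One trie node; `variable` is set when a path ends at a variable."""

    def __init__(self):
        self.children = {}    # source symbol -> child TrieNode
        self.variable = None  # name of the variable represented here


def build_trie(expansions):
    """Build a trie from an assumed mapping variable -> A-string."""
    root = TrieNode()
    for variable, text in expansions.items():
        node = root
        for symbol in text:
            node = node.children.setdefault(symbol, TrieNode())
        node.variable = variable
    return root


def longest_match(root, data, start=0):
    """Return (variable, length) of the longest variable matching
    data[start:], or (None, 0) if no variable matches."""
    node = root
    best = (None, 0)
    for offset in range(start, len(data)):
        node = node.children.get(data[offset])
        if node is None:
            break
        if node.variable is not None:
            best = (node.variable, offset - start + 1)
    return best


if __name__ == "__main__":
    trie = build_trie({"s1": "aca", "s2": "ac", "s3": "ab", "s4": "abacc"})
    print(longest_match(trie, "abaccab"))
```

Each lookup in this sketch visits at most as many nodes as the length of the longest matching path, instead of comparing the input against every variable separately.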
The system and method further are capable of decompressing an input bit stream at a decoding system. The system and method arithmetically decode the input bit stream based on an irreducible grammar, and update the grammar based on the last character in the input bit stream. The decoding and updating steps are performed in such a way as to substantially prevent bit-errors from affecting operation of the decoding system.
The system and method also operate to parse an input string into an irreducible grammar, update the grammar based on the last character to be parsed in the input string, and arithmetically encode the irreducible grammar into a string of bits to be sent to a decoder. The system and method repeat the parsing, updating and arithmetically encoding steps until the number of variables for said grammar reaches a predetermined limit, and then reuse a previous variable by deleting the previous variable from the grammar and creating a new variable that is defined by more input data than the previous variable, to prevent said grammar from exceeding a predetermined limit of memory. The system and method further begin with a predefined frequency count distribution of a source alphabet, or a predefined grammar, and then operate to parse the input string based on the predefined frequency count distribution or predefined grammar, while also updating the frequency count distribution or predefined grammar based on the last character to be parsed in the input bit stream.
Furthermore, the system and method can operate to compress or decompress a data bit stream configured in the form of data packets. Specifically, during compression, the system and method parse the bits in each of the data packets into an irreducible grammar based on a trie-type data structure that represents the variables of the irreducible grammar. The system and method perform the parsing of data in each respective data packet without using knowledge of the data bits in the subsequent data packets, and also update the grammar based on the parsed characters in the respective data packets without using knowledge of the data bits in the subsequent data packets. During decompression, the system and method decompress the data packets based on an irreducible grammar in a trie-type data structure that represents the variables of the irreducible grammar. The system and method perform the decompressing of data in each respective data packet without using knowledge of the data bits in the subsequently received data packets, and also update the grammar based on the data in the respective data packets without using knowledge of the data bits in the subsequently received data packets.