1. Field of the Invention
The present invention relates to an improved system and method for implementing YK lossless data compression and decompression. More particularly, the present invention relates to a system and method for performing lossless data compression and decompression using an irreducible grammar whose elements are represented as linked-list data structures. The present invention proposes a three-module computational architecture for implementing the YK lossless compression and decompression algorithm, with minimal communication across modules.
2. Description of the Related Art
Data compression algorithms have been developed to reduce the size of a data string so that the data string can, for example, be more efficiently stored and transmitted. An example of a grammar-based framework for lossless data compression is described in a publication by J. C. Kieffer and E.-H. Yang entitled "Grammar based codes: A new class of universal lossless source codes," IEEE Transactions on Information Theory, vol. 46, no. 3, pp. 737-754, May 2000, the entire content of which is incorporated herein by reference.
As described in the Kieffer and Yang publication, to compress a data sequence, a grammar-based code first transforms the data sequence into a context-free grammar, from which the original data sequence can be fully reconstructed by performing parallel substitutions. This context-free grammar, which is a compact representation of original data, is compressed using arithmetic coding. If a grammar-based code transforms each data sequence into an irreducible context-free grammar, then the grammar-based code is universal for the class of stationary, ergodic sources.
Grammar-based codes offer a design framework in which arithmetic coding and string matching capabilities are combined in an elegant manner. Within the framework of grammar-based codes, an efficient greedy grammar transform, known as the Yang-Kieffer grammar transform, is described in a publication by E. H. Yang and J. C. Kieffer entitled "Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform - Part one: Without context models," IEEE Transactions on Information Theory, vol. 46, pp. 755-777, May 2000, the entire content of which is incorporated herein by reference. This grammar transform sequentially constructs a sequence of irreducible context-free grammars from which the original data sequence can be recovered incrementally. The transform sequentially parses the original data into non-overlapping, variable-length phrases, and updates the grammar accordingly. Based on this grammar transform, a sequential lossless data compression algorithm, known as the YK algorithm, was developed.
The YK algorithm efficiently encodes the grammar by exploiting the structure of the grammar and using arithmetic coding to encode the sequence of parsed phrases at each step. The YK algorithm jointly optimizes, in some sense, string matching and arithmetic coding capabilities to achieve excellent compression performance on a large class of data. The following is a brief description of the major operations in the YK algorithm.
In defining the variables used in the following description, let $A$ be the source alphabet with cardinality $|A| \geq 2$, let $A^+$ denote the set of all finite strings drawn from $A$, let $x = x_1 x_2 \cdots x_n$ be a finite string drawn from the alphabet $A$ that is to be compressed, and let $S = \{s_0, s_1, s_2, \ldots\}$ be a countable set, disjoint from $A$. The symbol $s_0$ is called the start symbol, while elements of $S^+ = \{s_1, s_2, \ldots\}$ are called variables. For $j \geq 1$, let $S(j) = \{s_0, s_1, \ldots, s_{j-1}\}$, and let $S^+(j) = \{s_1, \ldots, s_{j-1}\}$. A context-free grammar $G$ is a mapping from $S(j)$ to $(S(j) \cup A)^+$ for some $j \geq 1$. The mapping $G$ is explicitly represented by writing each relationship $(s_i, G(s_i))$ as $s_i \to G(s_i)$, for $i < j$. $G(s)$ is called the production rule for $s$.
This grammar is an example of an admissible grammar because one obtains a sequence from $A$ after finitely many parallel replacements of variables in $G(s_i)$, and every variable $s_i$ ($i < j$) is replaced at least once by $G(s_i)$. The resulting sequence from $A$ is called the A-string represented by the respective variable. These A-strings are intended to represent repeated phrases in the input data string. The start symbol is special because the A-string represented by it is the complete input data string.
In addition, this grammar is an irreducible grammar because none of Reduction Rules 1 to 5 described in the second Yang and Kieffer publication referenced above can be applied to it to get a new admissible grammar. Further details of the concepts of admissible grammar and irreducible grammar can be found in that publication. An example of an irreducible grammar representing the data string abacaacabbacbabaccabaccc is shown as follows:
$s_0 \to s_3\, s_1\, s_1\, b\, b\, s_2\, b\, s_4\, s_4\, c$
$s_1 \to s_2\, a$
$s_2 \to a\, c$
$s_3 \to a\, b$
$s_4 \to s_3\, s_2\, c$
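As a quick check of the example grammar above, the following minimal Python sketch expands the start symbol back into the data string. It uses plain recursive substitution rather than the parallel replacement described in the text, and the names `GRAMMAR` and `expand` are illustrative only; they do not come from the referenced publications.

```python
# Example grammar from the text, one list of tokens per production rule.
GRAMMAR = {
    "s0": ["s3", "s1", "s1", "b", "b", "s2", "b", "s4", "s4", "c"],
    "s1": ["s2", "a"],
    "s2": ["a", "c"],
    "s3": ["a", "b"],
    "s4": ["s3", "s2", "c"],
}


def expand(grammar, symbol="s0"):
    """Return the A-string represented by `symbol`."""
    pieces = []
    for token in grammar[symbol]:
        if token in grammar:
            # Token is a variable: substitute its production rule.
            pieces.append(expand(grammar, token))
        else:
            # Token is a terminal symbol from the alphabet A.
            pieces.append(token)
    return "".join(pieces)


print(expand(GRAMMAR))        # the complete input data string
print(expand(GRAMMAR, "s4"))  # the A-string represented by s4
```

Expanding `s0` reproduces abacaacabbacbabaccabaccc, and each variable expands to a phrase that occurs more than once in that string.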
A grammar-based framework was proposed in a publication by C. Nevill-Manning and I. Witten entitled "Compression and explanation using hierarchical grammar," Computer Journal, vol. 40, pp. 103-116, 1997, the entire content of which is incorporated herein by reference. However, the grammar described in that publication need not satisfy the irreducibility property, and hence can be suboptimal.
The main steps of the YK algorithm are briefly summarized here. Let $x = x_1 x_2 \cdots x_n$ be a sequence from $A$ that is to be compressed. The YK algorithm proceeds through an iterative application of the following three main operations:
1. Parsing: The parsing operation parses the sequence $x$ sequentially into non-overlapping substrings $\{x_1,\ x_2 \cdots x_{n_2},\ \ldots,\ x_{n_{t-1}+1} \cdots x_{n_t}\}$ and sequentially builds an irreducible grammar $G_i$ for each $x_1 \cdots x_{n_i}$, where $1 \leq i \leq t$, $n_1 = 1$, and $n_t = n$. The first substring is $x_1$, and the corresponding irreducible grammar $G_1$ consists of only one production rule, $s_0 \to x_1$. At the $i$th step, suppose that the first $i$ phrases $x_1$, $x_2 \cdots x_{n_2}$, ..., $x_{n_{i-1}+1} \cdots x_{n_i}$ have been parsed off. Suppose the variable set of the current grammar $G_i$ is $S^+(j_i) = \{s_1, \ldots, s_{j_i-1}\}$, with $j_1 = 1$. The next substring $x_{n_i+1} \cdots x_{n_{i+1}}$ is the longest prefix of $x_{n_i+1} \cdots x_n$ that can be represented by some variable $s_j$ in $S^+(j_i)$, if such a prefix exists. If such a prefix exists, the next parsed phrase is $s_j$; otherwise, the next parsed phrase is the symbol $x_{n_i+1}$.
2. Grammar and Search-List Update: Let $\alpha$ denote the last symbol on the right end of $G_i(s_0)$. The parsed phrase (denoted by $\beta$) is appended to the right end of $G_i(s_0)$ to give the appended grammar $G'_i$. The appended grammar is reduced, if possible, using the Reduction Rules described, for example, in the second Yang and Kieffer publication referenced above, to yield an irreducible grammar $G_{i+1}$. Define the indicator sequence $I : \{1, 2, \ldots, t\} \to \{0, 1\}$ as follows: $I(1) = 0$, and for any $i > 1$, $I(i)$ is equal to 0 if $G_i$ equals $G'_{i-1}$ (i.e., reduction was not possible), and 1 otherwise (i.e., reduction was possible). It is proved in ref. 1 that the grammar $G'_i$ is reducible if and only if $\alpha\beta$ is the only non-overlapping repeated pattern of length $\geq 2$ in $G'_i$. To determine if the appended grammar is reducible, a search-list $L_1(\gamma)$ is maintained for all symbols $\gamma$ in $A \cup S(j_i)$, where $L_1(\gamma)$ consists of all the elements $\eta$ such that the appended grammar $G'_i$ would have the pattern $\gamma\eta$ as a potential match if the next parsed phrase is $\eta$. Depending on the values of $I(i+1)$ and $I(i)$, there are three possible distinct cases:
Case 0: I(i+1)=0
Case 10: I(i+1)=1,I(i)=0
Case 11: I(i+1)=1,I(i)=1
Under Case 0, $G_{i+1}$ is equal to $G'_i$. Under Case 10, $G_{i+1}$ is obtained by adding a new row to $G_i$ representing the repeated phrase $\alpha\beta$. Under Case 11, $G_{i+1}$ is obtained by adding $\beta$ to the end of the row corresponding to the last new variable. The search-lists $L_1(\gamma)$ are appropriately modified in each of the three cases. The details of the grammar and search-list update operations can be found in the publication by J. C. Kieffer and E.-H. Yang, referenced above.
3. Arithmetic Encoding: An order 1 arithmetic code is used to encode the sequence $\{I(i)\}_{i=1}^{t}$, and the parsed phrases $\beta$ are encoded using an order 0 arithmetic encoder in an interleaved manner. The alphabet used for the order 1 arithmetic encoder is $\{0, 1\}$, and the counters $c(0,0)$, $c(0,1)$, $c(1,0)$, and $c(1,1)$ are used for the order 1 frequency distribution on the alphabet. Also, for each $\gamma \in S^+ \cup A$, two sets of counters $c(\gamma)$ and $\hat{c}(\gamma)$ are used to model the frequency distribution for coding the parsed phrases $\beta$. Depending on the three cases, the coding of $\beta$ uses the following alphabets and counters:
Case 0: The alphabet used for coding $\beta$ is $A \cup S^+(j_i) \cup \{\phi\} - L_2(\alpha)$, where $L_2(\alpha)$ is another search-list given by $L_1(\alpha) \cup \Sigma$, where $\Sigma$ contains all those symbols $\tau$ such that $\alpha\tau$ is the right side of the production rule of one of the variables in $S^+(j_i)$, and $\phi$ is the end-of-sequence symbol disjoint from $A$ and $S(j_i)$. The counters $c(\gamma)$ are used for the frequency distribution.
Case 10: The alphabet used for coding $\beta$ is the search-list $L_1(\alpha)$. The counters $\hat{c}(\gamma)$ are used for the frequency distribution.
Case 11: The parsed phrase $\beta$ is the only element in the search-list $L_1(\alpha)$, and hence is not coded.
When the YK encoder reaches the end of the file being compressed, it encodes the indicator 0 (corresponding to Case 0), and subsequently encodes the end-of-sequence symbol $\phi$. The counters are initialized at the start of the YK algorithm, and are updated suitably in each of the three cases. At initialization, the counters $c(r,s)$ equal 1 for $r, s = 0, 1$, while the counters $c(\gamma)$ and $\hat{c}(\gamma)$ are given the value 1 for $\gamma \in A$, and 0 otherwise.
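The counter initialization just described can be sketched as follows. This is a minimal illustration, not the referenced implementation: the function names are hypothetical, the reading of $c(r,s)$ as "bit $s$ observed after previous bit $r$" is an assumption, and the ratio used as a probability estimate is the usual adaptive-model convention rather than something the text specifies.

```python
def init_counters(alphabet_size):
    """Build the three counter tables with their initial values."""
    # c(r, s) for the order 1 indicator model: every entry starts at 1.
    indicator = {(r, s): 1 for r in (0, 1) for s in (0, 1)}
    # c(gamma) and c_hat(gamma): 1 for source symbols 0..|A|-1.
    # A missing key stands for the initial value 0 of every variable.
    c = {gamma: 1 for gamma in range(alphabet_size)}
    c_hat = {gamma: 1 for gamma in range(alphabet_size)}
    return indicator, c, c_hat


def indicator_probability(indicator, prev_bit, bit):
    """Estimate P(bit | prev_bit) from the order 1 counters (assumed rule)."""
    total = indicator[(prev_bit, 0)] + indicator[(prev_bit, 1)]
    return indicator[(prev_bit, bit)] / total


indicator, c, c_hat = init_counters(alphabet_size=3)
print(indicator_probability(indicator, 0, 1))  # both outcomes equally likely
indicator[(0, 1)] += 1                         # one observed 0 -> 1 transition
print(indicator_probability(indicator, 0, 1))
```

Starting every indicator counter at 1 keeps both bit values codable from the first step, while the zero counts for variables mean a variable only becomes codable once an update has incremented its counter.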
The YK decoder essentially decodes the sequence $\{I(i)\}_{i=1}^{t}$ using an order 1 arithmetic decoder, and based on one of the three cases discussed above, decodes the parsed phrase $\beta$. Then it updates the grammar, the search-lists, and the counters in exactly the same manner as in the YK encoder. The decoder stops when it decodes the indicator 0, followed by the end-of-sequence symbol $\phi$. Note that the YK decoder does not perform the parsing operation. At each iteration step, the decoder translates the parsed phrase $\beta$ into the corresponding A-string.
The following provides a basic implementation technique for the three major operations of the YK algorithm, based on the scheme proposed in a publication by E. H. Yang entitled "Efficient universal lossless data compression algorithms based on a greedy sequential grammar transform: Implementation and experimental results", Technical Report, Hughes Network Systems, March 1999, the entire content of which is incorporated by reference herein. First, the main data structures used by the implementation are introduced. The data structure used to represent the grammar consists of two dynamic two-dimensional arrays: a symbol array $D_1$ and a string array $D_2$, with $D_1(s_j)$ and $D_2(s_j)$, for $1 \leq j \leq j_i - 1$, denoting the respective rows of $D_1$ and $D_2$ corresponding to variable $s_j$ at the end of the $i$th step.
Rows of $D_1$ store the production rules of each variable, while rows of $D_2$ store the A-string represented by each variable. Elements of the source alphabet $A$ are represented by the non-negative integers $\{0, 1, \ldots, |A|-1\}$, and the variable $s_j$ is represented by the integer $|A| + j - 1$ for $j \geq 1$. Rows $D_1(s_j)$ are formed of symbols from $\{-1\} \cup A \cup \{s_0, s_1, \ldots\}$. The special symbol $-1$ can be considered a dummy placeholder that represents the deleted entries in grammar rows that result when the grammar undergoes a reduction process under Case 10 and Case 11. The placeholder avoids the computational cost of shifting big chunks of data whenever a gap is created during the reduction process of the grammar. Note that apart from the $-1$ integers, all other integers in the grammar data structure are non-negative. Rows $D_2(s_j)$ are formed of symbols from $A$.
Elements in search-lists $L_2(\gamma)$ and their subsets $L_1(\gamma)$, for $\gamma \in \{0, 1, \ldots, |A|-1, \ldots, j_t + |A| - 1\}$, are represented in the form of a common quadruplet $(\eta, m, n, \rho)$, where $m$ and $n$ represent the row and column location of the pattern $\gamma\eta$ in $G_i$, and $\rho$ is a search-list-indicator such that $\rho$ is 1 if the element $\eta$ belongs to $L_1(\gamma)$, and 0 otherwise. The first element $\eta$ is referred to as the "key" of the quadruplet. Each search-list is represented by an array of such quadruplet search-list element structures, where elements of the array are arranged in increasing order of their keys.
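The array-based structures described above can be illustrated with a short Python sketch. It assumes a grammar row is a plain list of integers with -1 as the dummy placeholder, and a search-list is a list of `(eta, m, n, rho)` tuples kept sorted by the key `eta`; the helper names are hypothetical and only show how skipping placeholders and ordered insertion might look.

```python
import bisect

DUMMY = -1  # placeholder left in a row of D1 by a grammar reduction


def prev_non_negative(row, col):
    """Column of the nearest non-dummy entry strictly left of `col`, or None."""
    col -= 1
    while col >= 0 and row[col] == DUMMY:
        col -= 1
    return col if col >= 0 else None


def search_list_find(search_list, eta):
    """Binary-search a key-sorted list of (eta, m, n, rho) quadruplets."""
    # The 1-tuple (eta,) sorts before every quadruplet that starts with eta.
    pos = bisect.bisect_left(search_list, (eta,))
    if pos < len(search_list) and search_list[pos][0] == eta:
        return pos
    return None


def search_list_add(search_list, eta, m, n, rho):
    """Insert a quadruplet in increasing key order unless the key exists."""
    if search_list_find(search_list, eta) is None:
        bisect.insort(search_list, (eta, m, n, rho))


row = [3, DUMMY, DUMMY, 5, 2]
print(prev_non_negative(row, 3))   # skips the two dummy entries

entries = []
search_list_add(entries, 7, 0, 2, 1)
search_list_add(entries, 2, 1, 0, 0)
print(entries)                     # kept in increasing key order
```

Note that `bisect.insort` on a Python list still shifts every later element, which mirrors the cost of insertion into a sorted array.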
1. Parsing: A simple grouping technique for implementing the parsing operation is proposed in the E. H. Yang publication referenced above, where the variables in the current grammar are grouped in terms of the first two letters of the A-string represented by each variable. Therefore, the search for the next parsed phrase $\beta$ can be done within the group corresponding to the first two letters of the remaining input data string.
2. Grammar and Search-List Update: The indicator $I(i+1)$ is determined by searching for the parsed phrase $\beta$ in the search-list $L_1(\alpha)$, where $\alpha$ is the rightmost element in the top row of the grammar $G_i$. Since the elements of search-lists are arranged in increasing order, a binary search can be used for efficiency. The grammar update procedure for each of the three cases is as follows:
Case 0: The integer $\beta$ is simply appended to the end of $D_1(|A|)$, which represents the top row of the grammar.
Case 10: Suppose $(\beta, m, n, 1)$ is the element in the search-list $L_1(\alpha)$ that identifies the matching pattern $\alpha\beta$ at the location given by row $m$ and column $n$. The integer $\beta$ is then searched for after this location, skipping all the contiguous $-1$ entries. This integer $\beta$ is then replaced with a $-1$, and the integer $\alpha$ at the location given by row $m$ and column $n$ is replaced with the integer $s_{j_i} = |A| + j_i - 1$. A new row $D_1(s_{j_i})$ is created consisting of the pattern $\alpha\beta$. A new row $D_2(s_{j_i})$ is created to store the A-string corresponding to $D_1(s_{j_i})$.
Case 11: In this case $\alpha$ is equal to $s_{j_i} = |A| + j_i - 1$. As in Case 10, the integer $\beta$ is searched for after the location identified by the only element $(\beta, m, n, 1)$ in the search-list $L_1(\alpha)$. This integer $\beta$ is then replaced with the dummy $-1$. The integer $\beta$ is appended to the end of the row $D_1(\alpha)$.
The search-list update procedure for each of the three cases is as follows. It is assumed that the grammar update procedure is complete before the search-list update operations are performed. When an integer i is searched for in the search-list of j, it is understood that the integer i is searched for amongst the first entries of the elements of the search-list of j, and this search is not performed if either i or j is null. When an integer i is added to the search-list of an integer j, it is understood that the integer i is added in the appropriate position (in increasing order) of the search-list of the integer j, provided that neither i nor j is null, and the integer i is not already present as the first entry of an element of the search-list of j. When an integer i is deleted from the search-list of an integer j, it is understood that the integer i is deleted from the search-list of the integer j, provided that neither i nor j is null.
Case 0: Search for the first non-negative integer, call it $\sigma$, before the last element $\alpha$ of the top row of grammar $G_i$. The integer $\alpha$ is then added to the search-list of the integer $\sigma$. The corresponding row and column locations are added as the next two entries, and the search-list-indicator is set to 1. Note that since the search-list entries are maintained in increasing order, the appropriate location in $L_1(\sigma)$ needs to be searched for in order to insert the integer $\alpha$.
Case 10: Suppose $(\beta, m, n, 1)$ is the element in the search-list $L_1(\alpha)$ that identifies the location of the matching pattern $\alpha\beta$ in terms of row $m$ and column $n$. Search for the first two non-negative integers, call them $\eta$ and $\xi$ respectively (i.e., $\eta$ is to the right of $\xi$), before the integer $\alpha$ at the location identified by this search-list element, if possible. This may not always be possible, because the beginning of the row might be reached before either of these integers is found; if such an integer cannot be found, the corresponding symbol is set to null. The non-negative integer $\beta$ that followed $\alpha$ in grammar $G_i$ was already replaced by a $-1$, and $\alpha$ was replaced with the integer $s_{j_i}$, as described above. Search for the first two non-negative integers, call them $\gamma$ and $\delta$ respectively (i.e., $\gamma$ is to the left of $\delta$), after this integer $s_{j_i}$, if possible. At most, four search-lists need to be updated. These correspond to the respective search-lists of the integers $\alpha$, $\beta$, $\eta$, and $s_{j_i}$. The search-list update operations on these four search-lists are described below.
Search-list of $\alpha$: The element $(\beta, m, n, 1)$ in the search-list of $\alpha$ is replaced with the element $(\beta, m, n, 0)$. This captures the fact that the integer $\beta$ still belongs to $L_2(\alpha)$, but not to $L_1(\alpha)$.
Search-list of $\beta$: The search-list of $\beta$ is searched for the integer $\gamma$; if $\beta = \gamma = \delta$, then the column location field of the searched search-list element is updated to correspond to the location of the phrase $\gamma\delta$ (instead of the original location of the phrase $\beta\gamma$), else this search-list element is removed.
Search-list of $\eta$: The integer $s_{j_i}$ is added to the appropriate position in the search-list of $\eta$, along with the corresponding row and column locations, and the search-list-indicator is set to 1 if the number of non-negative integers in the corresponding row is greater than 2, and 0 otherwise. The search-list of $\eta$ is searched for the element corresponding to the integer $\alpha$; if $\alpha = \eta = \xi$, the column location field of the searched search-list element is updated to correspond to the location of $\xi$, else this search-list element is removed.
Search-list of $s_{j_i}$: Unless the integer $\gamma$ is located at the rightmost position of the top row of the updated grammar, the integer $\gamma$ is added to the hitherto empty search-list of $s_{j_i}$, along with the corresponding row and column locations, and the search-list-indicator is set to 1 if the corresponding row contains more than 2 non-negative integers, and 0 otherwise.
Case 11: In this case, $\alpha$ equals $s_{j_i}$. Let the integers $\gamma$ and $\delta$ mean the same as in Case 10. Find the two rightmost non-negative integers in the row $D_1(s_{j_i})$, call these $\theta$ and $\Theta$ respectively (i.e., $\theta$ is to the left of $\Theta$). At most, four search-lists need to be modified in this case. These correspond to the search-lists of $s_{j_i}$, $\beta$, $\Theta$, and $\theta$, as described below:
Search-list of $s_{j_i}$: Unless the integer $\gamma$ is located at the rightmost position of the top row of the updated grammar, the integer $\gamma$ is added to the hitherto empty search-list of $s_{j_i}$, along with the corresponding row and column locations, and the search-list-indicator is set to 1 if the corresponding row contains more than 2 non-negative integers, and 0 otherwise.
Search-list of $\beta$: The search-list of $\beta$ is searched for the integer $\gamma$; if $\beta = \gamma = \delta$, then the column location field of the searched search-list element is updated to correspond to the location of $\gamma$, else this search-list element is removed.
Search-list of $\Theta$: The integer $\beta$ is added to the search-list of $\Theta$, along with the corresponding row and column location, and the search-list-indicator is set to 1.
Search-list of $\theta$: Note that the integer $\Theta$ is already present in the search-list of $\theta$. Search for this search-list element. If the number of elements in the row $D_1(s_{j_i})$ equals 2, this search-list-indicator field (which was previously 0) is set to 1.
3. Arithmetic Encoding: The parsed integer $\beta$ is arithmetic encoded to produce the compressed bit-stream. A fixed-point implementation of arithmetic coding is described in a publication by I. H. Witten, R. Neal, and J. G. Cleary entitled "Arithmetic coding for data compression," Communications of the ACM, Vol. 30, pp. 520-540, June 1987, the entire content of which is incorporated herein by reference. To overcome the problem of dealing with a potentially large alphabet in Case 0, the arithmetic coding in this case can be efficiently performed using a recently proposed technique called multilevel arithmetic coding, as described in a publication by E.-H. Yang and Y. Jia entitled "Universal lossless coding of sources with large or unbounded alphabets," Numbers, Information and Complexity (Ingo Althofer, et al, eds.), Kluwer Academic Publishers, pp. 421-442, February 2000, the entire content of which is incorporated herein by reference.
The basic idea behind multilevel arithmetic coding is to represent the alphabet using an unbalanced binary tree, where the leaves represent small subsets of the source alphabet, and $\beta$ is encoded by first coding the path needed to reach the leaf corresponding to the sub-alphabet containing $\beta$, followed by coding the index of $\beta$ in this sub-alphabet. For coding the index of $\beta$ in this sub-alphabet, the cumulative frequency array for this sub-alphabet needs to be computed. Subsequently, the frequency counters are updated accordingly.
In each iteration step, the YK decoder obtains $\beta$ by decoding the compressed bit-stream to obtain the path in the multilevel alphabet tree and subsequently decoding the index in the sub-alphabet corresponding to the leaf. The frequency, grammar and search-list update operations are exactly as in the YK encoder. A parallel replacement procedure is performed on the decoded integer $\beta$ to obtain the original data substring that was parsed by the encoder. Note that since no parsing is performed at the decoder, the two-dimensional array $D_2$ is not needed in the decoder.
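The split of a symbol into a leaf and an index within that leaf, as used by multilevel arithmetic coding above, can be illustrated with a deliberately simplified sketch. It is not the Yang and Jia construction: it assumes equal-size leaves and a flat leaf number in place of a path through an unbalanced binary tree, and the function names are hypothetical. It only shows why the cumulative frequency array is needed for one small sub-alphabet rather than the whole alphabet.

```python
def locate(symbol, leaf_size):
    """Map an integer symbol to (leaf number, index inside that leaf)."""
    return divmod(symbol, leaf_size)


def cumulative_frequencies(counts):
    """Return cum where cum[k] is the sum of counts[:k]; cum[-1] is the total."""
    cum = [0]
    for count in counts:
        cum.append(cum[-1] + count)
    return cum


# Assumed toy setup: 12 symbols split into leaves of 4 symbols each.
counts = [1, 1, 1, 1, 5, 2, 1, 1, 1, 3, 1, 1]
leaf, index = locate(9, leaf_size=4)
leaf_counts = counts[leaf * 4:(leaf + 1) * 4]
cum = cumulative_frequencies(leaf_counts)

# An arithmetic coder would code `leaf` first, then `index` using the
# interval [cum[index], cum[index + 1]) out of the total cum[-1].
print(leaf, index, cum)
```

Only the four counters of the selected leaf are summed here, which is the saving that motivates the multilevel approach for a large Case 0 alphabet.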
The basic implementation described above is suitable for implementation of the YK algorithm in software. However, in hardware, the basic implementation suffers from certain drawbacks:
The parsing operation using the grouping method entails searching among potentially multiple rows of the two-dimensional array D2, and comparing each of these rows with the remaining portion of the input data stream. This tends to make the parsing operation slow.
The rows of the two-dimensional array D1 and D2 typically show a great deal of variation in size. While many rows tend to be of size 2, several other rows can be much longer. Hence, the memory requirements for such two-dimensional arrays need to be dynamically managed. Whenever the need arises for increasing the size of a row beyond what has been allocated, the whole row needs to be moved to another memory location where a sufficiently long array capable of representing the longer row exists. This is a time consuming process on hardware platforms that are not equipped with a DMA (direct memory access) engine. Moreover, if longer sizes are pre-allocated in advance for the rows of the grammar, this could lead to a lot of memory wastage.
Another problem with the two-dimensional array based representation of the grammar is that a large number of dummy $-1$ integers tend to get created, and they never get removed. This leads to further memory wastage, and also causes additional memory accesses for skipping over the dummy entries to find a non-negative grammar element.
The search-list update operation requires possibly multiple searches through different search-lists in order to find an element that needs to be deleted, and also to find the appropriate position where a new search-list element needs to be inserted (to maintain the increasing order of the first entries of the elements of the search-list). Moreover, every insertion or deletion from a search-list involves forward or backward shifting of all the search-list elements that lie after the position where the insertion or deletion is made respectively.
Furthermore, the alphabets used under both Case 0 and Case 10 depend on certain search-lists, as seen in the description of the YK algorithm above. This implies that the arithmetic encoder and decoder need to access these search-list data structures in order to compute the cumulative frequency array. This makes it difficult to realize a clean separation of the two logically distinct components of the YK algorithm: (a) the grammar-related part and (b) the arithmetic coding-related part.
The parallel replacement procedure that is used by the decoder to translate the decoded integer $\beta$ into the corresponding A-string can potentially require a large number of recursive operations.
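To make the first drawback in the list above concrete, the grouping method for parsing can be sketched as follows. This is a minimal illustration under stated assumptions: A-strings are held as Python strings keyed by hypothetical variable names, and `build_groups` and `next_phrase` are invented for the example. Each candidate in the selected group is compared against the remaining input, which is the repeated row comparison the text identifies as slow.

```python
from collections import defaultdict


def build_groups(a_strings):
    """Group variables by the first two letters of their A-string."""
    groups = defaultdict(list)
    for variable, text in a_strings.items():
        groups[text[:2]].append(variable)
    return groups


def next_phrase(data, pos, a_strings, groups):
    """Return (phrase, length) for the next parsed phrase starting at `pos`."""
    best, best_len = None, 0
    # Only variables whose A-string shares the next two letters can match.
    for variable in groups.get(data[pos:pos + 2], []):
        text = a_strings[variable]
        if len(text) > best_len and data.startswith(text, pos):
            best, best_len = variable, len(text)
    if best is None:
        return data[pos], 1   # no variable matches: parse a single symbol
    return best, best_len


# A-strings of the variables in the example grammar given earlier.
a_strings = {"s1": "aca", "s2": "ac", "s3": "ab", "s4": "abacc"}
groups = build_groups(a_strings)
print(next_phrase("abaccab", 0, a_strings, groups))
print(next_phrase("abaccab", 5, a_strings, groups))
```

At position 0 both `s3` and `s4` are compared and the longer match `s4` wins; at position 5 only `s3` still matches.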
An implementation scheme was proposed in the E. H. Yang publication referenced above that is suitable for efficient software implementation of the YK algorithm. However, Yang's scheme involves certain algorithmic steps that may be somewhat difficult to implement in hardware.
Accordingly, a need exists for an improved data compression and decompression system and method.
An object of the present invention is to provide an improved system and method for implementing the YK lossless data compression of a data string.
Another object of the present invention is to provide an improved system and method for performing lossless data compression of a data string by parsing the data string and representing the parsed characters of the data string as an irreducible grammar that is efficiently updatable in a software or a hardware implementation.
These and other objects of the present invention are substantially achieved by providing a system and method for performing data compression of a data string. The system and method are each capable of parsing the data string into variables of an irreducible grammar, such that each variable represents a respective plurality of data characters of the data string, and formatting each element of the irreducible grammar as a linked list data structure having forward and backward pointers pointing to linked list data structures representing other grammar elements. The system and method are each further capable of updating the irreducible grammar based on at least one character to be parsed in the input string by changing at least one pointer of at least one of the linked list data structures to point to a linked list data structure representing a different element of the grammar than that to which the at least one pointer pointed prior to updating. Associated with the irreducible grammar, the system and method maintain a set of search-lists. The system and method further represent each element of these search-lists as a (different) linked-list data structure, having forward and backward pointers, possibly pointing to linked-list data structures representing other search-list elements. The two types of linked-lists are further cross-coupled in the following sense: a linked-list data structure representing a grammar element also includes a pointer, which can point to a linked-list data structure representing a search-list element; moreover, a linked-list data structure representing a search-list element also includes a pointer, which can point to a linked-list data structure representing a grammar element. The system and method are further capable of encoding the irreducible grammar into a string of bits. The system and method are further capable of decoding such a string of bits into the original data string.
Furthermore, the system and method can employ a separate parse module, grammar transform module, and arithmetic coder module to perform the parsing, linked list formatting, and encoding operations, respectively.
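The cross-coupled linked-list representation summarized above might be pictured along the following lines. This is only a sketch under assumptions: the class names, field names and helper functions are hypothetical, and the text specifies no more than forward and backward pointers plus one cross pointer in each direction. It shows how a grammar element can be removed by changing pointers alone, with no dummy placeholder left behind.

```python
class GrammarNode:
    """One grammar element in a doubly linked row (field names assumed)."""

    def __init__(self, symbol):
        self.symbol = symbol
        self.prev = None          # backward pointer within the grammar row
        self.next = None          # forward pointer within the grammar row
        self.search_entry = None  # cross pointer to a search-list element


class SearchNode:
    """One search-list element, also doubly linked (field names assumed)."""

    def __init__(self, key, grammar_node):
        self.key = key
        self.prev = None
        self.next = None
        self.grammar_node = grammar_node  # cross pointer into the grammar


def build_row(symbols):
    """Link the given symbols into a row and return its first node."""
    head = prev = None
    for symbol in symbols:
        node = GrammarNode(symbol)
        node.prev = prev
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
    return head


def unlink(node):
    """Remove a node from its row by updating its neighbours' pointers."""
    if node.prev is not None:
        node.prev.next = node.next
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None


def row_symbols(head):
    """Collect the symbols of a row by following forward pointers."""
    out = []
    while head is not None:
        out.append(head.symbol)
        head = head.next
    return out


head = build_row(["a", "b", "c"])
middle = head.next
middle.search_entry = SearchNode("b", middle)  # couple the two structures
unlink(middle)
print(row_symbols(head))
```

Because the search-list element holds a direct pointer to its grammar element, the location of a matching pattern can be reached without row and column indices or a scan over placeholder entries.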