1. Field of the Invention
The invention relates to the field of data compression and more particularly to the field of incremental and continuous data compression.
2. Description of Background Art
An important goal of conventional communication systems is to improve the bandwidth and throughput of data. Instead of sending every bit of data, conventional systems use compression algorithms to reduce the amount of data that needs to be transmitted from a source to a destination. Two classes of compression algorithms are: loss-less compression algorithms and lossy compression algorithms. Loss-less compression algorithms convert data into a form in which none of the information contained in the data is lost. In contrast, lossy compression algorithms generate a representation in which some details of the data may be excluded.
Compression algorithms can also be divided loosely into two categories: targeted and general purpose. Targeted compression and decompression is applied to data for which a priori knowledge of the data characteristics is available. For instance, video data may be known to consist of individual frames, each of which differs from its previous or subsequent frame by a small amount. In such a case, a targeted compression scheme can take advantage of this a priori knowledge to design a more specific and hence potentially more efficient compression and decompression algorithm. General purpose compression algorithms (also known as universal compression algorithms) do not assume any a priori knowledge of the data characteristics or of the source that is generating the data. General purpose compression is therefore often less efficient, in that it frequently achieves a smaller degree of compression than a targeted compression algorithm for specific types of data. However, general purpose compression algorithms are more flexible because they can be effectively applied to many different types of data and can be applied when information about the data is not known beforehand.
One class of general purpose compression algorithms is based on the identification and elimination of repetitions in the data. These methods are referred to as dictionary based compression techniques since they attempt to discover a dictionary of repeated terms or phrases. The learned dictionary terms are then used to eliminate repetitions of these terms in a set of target data.
Two other types of compression algorithms are incremental compression algorithms and continuous compression algorithms. These two types of algorithms are not exclusive, i.e., a compression algorithm can be both incremental and continuous. An incremental compression algorithm is an algorithm that does not require processing of either the entire input stream or entire blocks of the input stream in order to generate its output. Instead, an incremental compression algorithm processes the input on a symbol-by-symbol basis (i.e., incrementally) and generates its output while it is still processing its input—rather than after it has processed all of the input. For example, an algorithm that computes the total number of vowels in a piece of text is fundamentally a non-incremental algorithm since it has to process the entire text input to compute the total number of vowels. On the other hand, an algorithm that converts lower case text to upper case can be incremental since it can process each input character independently and can generate its output as it processes each input character.
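The distinction drawn above between the vowel-counting and case-conversion examples can be sketched as follows. This is only an illustrative sketch; the function names count_vowels and to_upper are hypothetical and do not appear in the cited art.

```python
from typing import Iterable, Iterator


def count_vowels(text: Iterable[str]) -> int:
    """Non-incremental: the single result exists only after all input is consumed."""
    return sum(1 for ch in text if ch.lower() in "aeiou")


def to_upper(text: Iterable[str]) -> Iterator[str]:
    """Incremental: each output character is produced as soon as its input arrives."""
    for ch in text:
        yield ch.upper()


if __name__ == "__main__":
    stream = iter("data compression")
    converter = to_upper(stream)
    # The first output is available after reading only the first input symbol.
    print(next(converter))
    print(count_vowels("data compression"))
```

Because to_upper is written as a generator, a caller can consume output while input is still arriving, whereas count_vowels cannot return anything until its input is exhausted.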
A continuous algorithm is one that can run indefinitely on an infinite stream of input data without running out of system resources such as memory, disk space, etc. Continuous algorithms are also often referred to as streaming algorithms. Note that a non-incremental algorithm that generates output only after processing all of the input is by definition non-continuous (since the input is infinitely long in the case of continuous algorithms).
The cost of storage and transmission of data is directly correlated with the size of the data object. Hence, removing redundancy from the data is a highly effective means of improving the efficiency of storage and transmission of the data. Most general purpose loss-less data compression algorithms attempt to remove redundancy from data by two principal means: (1) identification and elimination of repeated terms or phrases; and (2) encoding of the data in a more efficient form.
Identification of repeated terms or phrases can be performed by various techniques. The general principle involved can be illustrated by an example. Consider the sequence of characters in equation (1).
S=aabcaabdaabeaabf  Equation (1)
A dictionary based compression algorithm could identify that the phrase “aab” is repeated 4 times in this sequence. The sequence could then be more efficiently stored or transmitted if the algorithm replaced all instances of “aab” with a new symbol, e.g., A. The compressed sequence would then look like the sequence in equation (2).
S=AcAdAeAf  Equation (2)
In addition to the above compressed sequence, the algorithm would also have to store or transmit an additional instruction to indicate that all instances of A should be replaced by “aab” during decompression. Therefore, the instruction A=aab is the dictionary term upon which the compression is based. The dictionary as well as the compressed string must be stored or transmitted to enable decompression. Though in this case the dictionary was easily determined, it has been shown that for a given input sequence, the problem of finding the dictionary that would yield the highest degree of compression is NP-complete, as described in J. A. Storer, Data compression via textual substitution, Journal of the Association for Computing Machinery, 29(4): 928-951 (1982), which is incorporated by reference herein in its entirety.
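The substitution of equations (1) and (2) can be sketched as below. This is a minimal illustration that assumes the dictionary term is already known and that the replacement symbol does not occur in the input; the helper names substitute and expand are hypothetical.

```python
def substitute(sequence: str, phrase: str, symbol: str) -> str:
    """Replace every instance of a known dictionary phrase with its symbol."""
    return sequence.replace(phrase, symbol)


def expand(compressed: str, dictionary: dict) -> str:
    """Decompress by replacing each symbol with its dictionary phrase."""
    for symbol, phrase in dictionary.items():
        compressed = compressed.replace(symbol, phrase)
    return compressed


if __name__ == "__main__":
    s = "aabcaabdaabeaabf"          # equation (1)
    dictionary = {"A": "aab"}       # the instruction A=aab
    compressed = substitute(s, "aab", "A")
    print(compressed)               # AcAdAeAf, as in equation (2)
    print(expand(compressed, dictionary) == s)
```

The sketch does not attempt to discover the dictionary, which is the hard part of the problem.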
The encoding of data involves modifying the representation of the data on a per-character basis such that frequently occurring characters can be represented more efficiently (e.g., with fewer bits). Consider the sequence of 14 characters in equation (3).
S=abacadaeafagah  Equation (3)
In this case the character “a” occurs 7 times while each of the characters “b-h” occurs only once. If the entire alphabet consisted of only the 8 characters “a-h”, they could be represented in binary form using 3 bits per character as illustrated in table 1.
TABLE 1
Character   Code
a           000
b           001
c           010
d           011
e           100
f           101
g           110
h           111
This would result in the sequence (S) requiring a total of 14×3=42 bits. On the other hand, since we can see that the character “a” occurs more frequently in the data, it may be more efficient to represent “a” with fewer bits at the cost of increasing the number of bits for the remaining characters in the alphabet. For instance, the 8 characters could instead be represented as illustrated in table 2.
TABLE 2
Character   Code
a           0
b           1000
c           1001
d           1010
e           1011
f           1100
g           1101
h           1110
In this case, the string S would require 1 bit to represent each of the 7 “a” characters and 4 bits to represent each of the remaining characters. Hence the total space required for S would be 7×1+7×4=35 bits. This represents a savings of over 16 percent.
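The bit counts for tables 1 and 2 can be checked with a short sketch. The names FIXED, VARIABLE, encode, and decode are hypothetical and chosen only for illustration. Decoding works here because no code word in table 2 is a prefix of another.

```python
# Fixed-length code of table 1: 3 bits per character.
FIXED = {ch: format(i, "03b") for i, ch in enumerate("abcdefgh")}

# Variable-length code of table 2: 1 bit for "a", 4 bits for the rest.
VARIABLE = {
    "a": "0", "b": "1000", "c": "1001", "d": "1010",
    "e": "1011", "f": "1100", "g": "1101", "h": "1110",
}


def encode(text: str, code: dict) -> str:
    """Concatenate the code word of each character."""
    return "".join(code[ch] for ch in text)


def decode(bits: str, code: dict) -> str:
    """Decode a prefix-free code by accumulating bits until a code word matches."""
    inverse = {word: ch for ch, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return "".join(out)


if __name__ == "__main__":
    s = "abacadaeafagah"                 # equation (3)
    print(len(encode(s, FIXED)))         # 42
    print(len(encode(s, VARIABLE)))      # 35
    print(decode(encode(s, VARIABLE), VARIABLE) == s)
```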
There are various examples of such statistical coding methodologies, such as Huffman coding and arithmetic coding. A more detailed description of such methodologies is in: D. A. Huffman, A method for the construction of minimum-redundancy codes, Proceedings IRE, 40:1098-1101 (1952) and in Witten, Neal, and Cleary, Arithmetic coding for data compression, Communications of the Association for Computing Machinery, 30(6):520-540 (1987) which are incorporated by reference herein in their entirety.
A general principle that applies to many such statistical coding techniques was proposed by Shannon in 1948 in Shannon, A Mathematical Theory of Communication, Bell System Technical Journal, 27:389-403 (1948) that is incorporated by reference herein in its entirety. Shannon showed that the number of bits required to encode a character or string which occurs with probability P is −log2(P). Hence, if the eight characters a-h each occurred with equal probability, P=1/8, each character could be encoded in −log2(1/8)=3 bits. But in our example above, we know that “a” occurs with probability 7/14 while each of the remaining characters occurs with probability 1/14. Hence “a” can be encoded in −log2(7/14)=1 bit, while each of the remaining characters can be encoded in −log2(1/14)≈3.8 bits.
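The per-character figures above can be reproduced with a brief sketch that estimates each probability from the character counts in the sequence of equation (3). The function name information_bits is hypothetical.

```python
import math
from collections import Counter


def information_bits(text: str) -> dict:
    """Return -log2(P) for each character, with P estimated from its frequency."""
    counts = Counter(text)
    total = len(text)
    return {ch: -math.log2(n / total) for ch, n in counts.items()}


if __name__ == "__main__":
    s = "abacadaeafagah"
    bits = information_bits(s)
    print(bits["a"])                      # 1.0
    print(round(bits["b"], 1))            # 3.8
    # Sum over the whole sequence: a lower bound under this simple model.
    print(sum(bits[ch] for ch in s))
```

Under this model the whole sequence needs between 33 and 34 bits, slightly fewer than the 35 bits of the whole-bit code in table 2.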
Another coding methodology is run-length encoding. In this case sequences of the same character are replaced by a single instance of the character followed by a number which indicates the number of times the character is repeated. One example is given in equation (4).
S=aaaaabbbbb  Equation (4)
This string of characters can be encoded using run-length encoding as shown in equation (5).
S=a5b5  Equation (5)
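Run-length encoding as described above can be sketched in a few lines. The sketch assumes the input contains no digit characters, since the run counts are written as decimal digits. The function names are hypothetical.

```python
import re
from itertools import groupby


def rle_encode(text: str) -> str:
    """Replace each run of a repeated character with the character and its count."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))


def rle_decode(encoded: str) -> str:
    """Expand each (character, count) pair back into a run."""
    return "".join(ch * int(n) for ch, n in re.findall(r"(\D)(\d+)", encoded))


if __name__ == "__main__":
    print(rle_encode("aaaaabbbbb"))   # a5b5, as in equation (5)
    print(rle_decode("a5b5"))         # aaaaabbbbb, as in equation (4)
```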
Many conventional dictionary based, general purpose, loss-less compression algorithms are based on a combination of the two approaches described above, e.g., first a dictionary based compression of repeated phrases followed by statistical encoding of the resulting compressed stream. Some of these conventional compression techniques are now described.
One conventional compression technique was described in Ziv and Lempel, A Universal Algorithm for Sequential Data Compression, IEEE Transactions on Information Theory, IT-23(3):337-343 (1977) which is incorporated by reference herein in its entirety. This widely used dictionary based general purpose compression technique is known as LZ77 and has formed the basis of several other compression algorithms. For instance, the “gzip” compression program, which is widely distributed with UNIX based operating systems, uses a variant of the LZ77 method. LZ77 is based on the use of pointers to previous instances of a phrase within a window of fixed size. Repeated phrases in the data are found by sliding a window across the input sequence and searching for any duplicated strings within the window. For example, consider the input sequence in equation (6).
S=abcdefbcdgh  Equation (6)
The LZ77 methodology determines that the phrase “bcd” occurs twice and uses this information to compress the sequence. The first instance of “bcd” is unmodified. The second instance is replaced by a pointer consisting of the distance from the beginning of S to the first instance of “bcd” as well as the length of the repeat. Hence the sequence S would be represented by LZ77 as per equation (7).
S=abcdef(1,3)gh  Equation (7)
The pointer (1,3) indicates that the phrase starting at distance 1 from the start of the window and extending to the right by 3 characters has been repeated at the current position of the pointer. A variation of this scheme uses the distance back from the current position as the first element of the pointer (instead of the distance forward from the start of the window). In this case S would be represented as per equation (8).
S=abcdef(5,3)gh  Equation (8)
Here the pointer (5,3) indicates that the phrase starting at distance 5 back from the current position and extending to the right by 3 characters has been repeated.
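The back-distance pointer form can be sketched with a naive encoder and decoder. This is only an illustration under stated assumptions, not the method of any particular LZ77 program: the window parameter is taken to mean the maximum back distance of a pointer, matches shorter than min_len are emitted as literals, and the search is a brute-force scan. The names lz77_encode and lz77_decode are hypothetical.

```python
def lz77_encode(data: str, window: int = 32, min_len: int = 2) -> list:
    """Greedy encoder emitting literals or (distance_back, length) pointers."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force search of every start position inside the window.
        for start in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[start + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - start
        if best_len >= min_len:
            tokens.append((best_dist, best_len))
            i += best_len
        else:
            tokens.append(data[i])
            i += 1
    return tokens


def lz77_decode(tokens: list) -> str:
    """Rebuild the text; pointers copy one character at a time from the output."""
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)


if __name__ == "__main__":
    tokens = lz77_encode("abcdefbcdgh")
    print(tokens)   # ['a', 'b', 'c', 'd', 'e', 'f', (5, 3), 'g', 'h']
    print(lz77_decode(tokens))
```

The brute-force search costs time proportional to the window size for every input position, which is why practical encoders bound the window.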
Conventional LZ77 based compression programs use the above described pointer based methods to convert variable length repetitions into fixed length pointers. The resulting sequence of symbols and pointers is then compressed by applying a statistical coding technique. These programs can use different methods for discovering repeated phrases and encoding the final data stream.
One problem with the LZ77 method is that it is able to detect repetitions only within a window of fixed size. The limited window size prevents detection of repeated data that are separated by a distance larger than the window size. For instance, in the above example if the window size is reduced to 5 characters, the repetition of “bcd” would not be detected since the total distance from the beginning of the first instance of “bcd” to the end of the second instance is greater than 5. The size of the window is limited in LZ77 methods in order to limit the time required to search for repetitions. The complexity and execution time of the search algorithms used with the LZ77 method are typically a function of the size of the input string which is being searched. Conventional LZ77 compression techniques therefore usually limit the size of the window to a few thousand characters. For instance, the “gzip” program uses a window of 32 Kbytes. Increasing the window size would result in a very significant increase in the execution time of the LZ77 algorithm.
Another problem with the LZ77 compression method is that it requires a second stage of statistical coding to provide adequate compression rates. The statistical encoding techniques employed by LZ77 methods are non-incremental and hence non-continuous (e.g., gzip uses Huffman coding which is non-incremental). Non-incremental coding techniques must completely process a block of data before outputting a coding tree for that block of data. The block sizes used by non-incremental techniques must also be sufficiently large to ensure that the coding scheme generates an efficient coding tree. LZ77 techniques are therefore not amenable to real-time or on-line compression where there is a continuous stream of data that must be processed incrementally.
Yet another problem with LZ77 techniques is that the number of possible pointers is very large since they can point to any position in the window.
Ziv and Lempel addressed some of the problems with the LZ77 technique in 1978 by proposing a new compression scheme known as LZ78. This is described in Ziv and Lempel, Compression of Individual Sequences Via Variable Rate Coding, IEEE Transactions on Information Theory, IT-24(5):530-536 (1978) that is incorporated by reference herein in its entirety. Instead of using pointers to a position in the window, LZ78 methods use an explicit representation of a dictionary of all phrases that are encountered in the input stream. The dictionary is constructed incrementally by building upon previous dictionary terms. Every time a new phrase is seen it is added to the dictionary under the assumption that it may be used in the future. Consider the input sequence in equation (9).
S=cbaabacaccacccacccc  Equation (9)
LZ78 generates the phrase (0,c) where 0 is the null string and c is the first character. The next two characters will also result in two new phrases (0,b) and (0,a). The final sequence of phrases is illustrated in table 3.
TABLE 3
Input    Phrase #   Output phrase
c        1          (0, c)
b        2          (0, b)
a        3          (0, a)
ab       4          (3, b)
ac       5          (3, c)
acc      6          (5, c)
accc     7          (6, c)
acccc    8          (7, c)
The final encoding of the sequence S will therefore be the column of output phrases shown in table 3. As can be seen in this example, the dictionary entries 1, 2, and 4 are never used in this encoding and are therefore wasted entries in the dictionary. For instance, while the dictionary entry for “ac” is re-used to incrementally generate “acc”, “accc”, and “acccc”, the dictionary entry for “ab” is never used again and is hence wasted.
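The phrase-by-phrase construction shown in table 3 can be reproduced with a small sketch. It is a simplified illustration with no limit on dictionary size and no statistical coding stage; the names lz78_encode and lz78_decode are hypothetical.

```python
def lz78_encode(data: str) -> list:
    """Emit (prefix_phrase_number, next_character) pairs; 0 is the null string."""
    dictionary = {"": 0}
    output = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending a known phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)   # new phrase number
            phrase = ""
    if phrase:                                # input ended inside a known phrase
        output.append((dictionary[phrase], ""))
    return output


def lz78_decode(pairs: list) -> str:
    """Rebuild the phrases in order and concatenate them."""
    phrases = [""]
    for index, ch in pairs:
        phrases.append(phrases[index] + ch)
    return "".join(phrases[1:])


if __name__ == "__main__":
    pairs = lz78_encode("cbaabacaccacccacccc")
    for number, pair in enumerate(pairs, start=1):
        print(number, pair)
    used = {index for index, _ in pairs}
    # Phrase numbers that no later phrase builds upon.
    print([n for n in range(1, len(pairs) + 1) if n not in used])
```

For the sequence of equation (9) the last line lists entries 1, 2, and 4, and also entry 8, which is unreferenced only because the input ends there.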
One problem with the LZ78 technique is that it uses a very aggressive and speculative dictionary construction scheme, which often results in the construction of terms that are not productively used. Hence, the dictionary can become very large and result in an inefficient use of system resources and a decrease in the compression efficiency. In addition, the rate of convergence of the LZ78 scheme is slow because the dictionary grows at a slow rate. LZ78 based compression programs also often use non-incremental statistical coding techniques to improve compression efficiency and program speed and hence cannot be used with on-line or continuous data. Furthermore, there is no provision for forgetting (deleting) phrases or dictionary terms that are no longer used. For a continuous and potentially infinite stream of data, it is essential not only to generate new dictionary terms dynamically but also to forget terms that are used infrequently, so that system resources can be reused. The LZ78 algorithm does not do this. Hence the LZ78 algorithm is not a continuous compression algorithm.
A third type of compression algorithm is the Sequitur algorithm that is described in Nevill-Manning and Witten, Compression and Explanation Using Hierarchical Grammars, Computer Journal, 40(2): 103-116 (1997) that is incorporated by reference herein in its entirety. The Sequitur algorithm infers a context free grammar from a sequence of discrete symbols. The grammar hierarchically represents the structure of the sequence and can be used to produce useful visual explanations of the structure of the sequence and to infer morphological units in the sequence. Since the grammar fully represents the entire input sequence, Sequitur can also be used for data compression.
Sequitur works by enforcing two constraints on the input sequence. The first constraint is that no pair of adjacent symbols should appear more than once. The second constraint is that every rule generated by the algorithm should be used more than once. Sequitur applies these constraints by examining the input sequence incrementally and ensuring that both constraints are satisfied at each point in the sequence. For instance, in the input sequence illustrated in equation (10),
S=abcdbcabcd  Equation (10)
Sequitur would generate the grammar in equation (11).
S=BAB
A=bc
B=aAd  Equation (11)
where A and B are rules in the grammar, which are similar to dictionary terms. The above grammar satisfies the first constraint since no pair of adjacent symbols appears more than once. The second constraint is also satisfied since both A and B are used at least twice in the grammar.
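The grammar of equation (11) can be checked against the two constraints with a short sketch. It does not implement Sequitur's incremental construction; it only expands a given grammar and counts symbol pairs and rule uses. The names GRAMMAR, expand, digram_counts, and rule_uses are hypothetical, and overlapping pairs such as those in "aaa" are not given the special handling a full implementation would need.

```python
from collections import Counter

# Grammar of equation (11): S -> BAB, A -> bc, B -> aAd.
GRAMMAR = {"S": "BAB", "A": "bc", "B": "aAd"}


def expand(symbol: str, grammar: dict) -> str:
    """Recursively replace rule names with their bodies."""
    if symbol not in grammar:
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])


def digram_counts(grammar: dict) -> Counter:
    """Count every pair of adjacent symbols across all rule bodies."""
    counts = Counter()
    for body in grammar.values():
        for i in range(len(body) - 1):
            counts[body[i:i + 2]] += 1
    return counts


def rule_uses(grammar: dict, start: str = "S") -> dict:
    """Count how often each non-start rule is referenced in any rule body."""
    return {r: sum(body.count(r) for body in grammar.values())
            for r in grammar if r != start}


if __name__ == "__main__":
    print(expand("S", GRAMMAR))                  # abcdbcabcd, as in equation (10)
    print(max(digram_counts(GRAMMAR).values()))  # 1: no pair appears twice
    print(rule_uses(GRAMMAR))                    # {'A': 2, 'B': 2}
```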
Since the entire input sequence is represented by the grammar, Sequitur uses this algorithm for compression by applying arithmetic coding to encode the complete grammar. The rules of the grammar (i.e., the dictionary terms) are transmitted by pointers to previous instances of a repeat, which is similar to the technique described above with reference to LZ77. When a rule is encountered for the first time in the grammar, its contents are transmitted. The second instance of the rule is transmitted as a pointer to the region of the sequence (e.g., the contents of the first instance of the rule) that was used to construct the rule. All further instances of this rule are transmitted as a rule number under the assumption that the decoder and encoder can keep track of each other's rule numbers.
One problem with the Sequitur compression technique is that it is not implicitly incremental. In order to ensure that the grammar is transmitted with the fewest number of symbols, Sequitur requires that the grammar be fully constructed before it is transmitted. Sequitur can be made to appear to be incremental by selecting transmission points along the sequence S at which the probability of transmitting extra symbols is low. The algorithm for detecting whether a certain point in the compressed sequence is a safe point to transmit the sequence requires examining all previous instances of the symbol just before this point. Since Sequitur needs to select these points dynamically throughout the compression of the input sequence, the algorithm incurs a significant amount of extra processing to continuously search for these transmission points. This additional processing (which is necessary to make Sequitur incremental) makes the overall compression algorithm non-linear and hence significantly less efficient.
Another problem with the Sequitur algorithm is that it is not continuous. The algorithm does not provide any means for incrementally transmitting the compressed output while simultaneously deleting rules and symbols that are infrequently accessed (in order to re-use system resources). Hence Sequitur cannot be applied to an infinite or very large stream of input data to generate a continuous stream of compressed output in linear time.
The Sequitur algorithm is also inefficient in its use of system resources since it requires complex data structures to enable the frequent creation and deletion of rules of variable length. In addition, the algorithm is computationally inefficient at detecting long repetitions since each pair of symbols in the repeated phrase requires the creation and deletion of a rule. Hence, each additional instance of the repetition will incur the computational overhead of multiple rule creations and deletions. Sequitur's technique for transmitting the second instance of a rule as a pointer also requires additional processing and memory overheads.
A fourth compression algorithm is the Recursive Pairing (Re-Pair) algorithm that is described in Larsson and Moffat, Offline Dictionary-Based Compression, Proceedings Data Compression Conference, 296-305 (1999) that is incorporated by reference herein in its entirety. The Re-Pair algorithm attempts to compute an optimal dictionary for compression by recursively examining the entire input sequence to identify the most frequently occurring pairs of symbols. At each stage of the algorithm the most frequently occurring pair of symbols is replaced by a new symbol representing a new addition to the dictionary. The entire modified sequence is then examined again to find the current most frequently occurring pair. This process is iterated until there is no pair that appears more than once. The resulting compressed sequence and dictionary are then encoded to generate the final compressed output.
The primary disadvantage of this algorithm is that it is fundamentally non-incremental and non-continuous. The entire input sequence must be processed by Re-Pair before any output can be generated. The authors themselves describe the algorithm as being an “offline” technique.
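The iterative pair replacement described above can be sketched naively as follows. This is a simplified illustration, not the data structures of the cited paper: it rescans the whole sequence on every iteration, counts overlapping pairs without special treatment, and omits the final encoding step. The names repair, expand, and the rule labels R0, R1, and so on are hypothetical.

```python
from collections import Counter


def repair(text: str):
    """Repeatedly replace the most frequent adjacent pair with a new symbol."""
    seq = list(text)
    rules = {}
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break                      # no pair appears more than once
        symbol = f"R{len(rules)}"      # multi-character label, distinct from input characters
        rules[symbol] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(symbol)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(symbol: str, rules: dict) -> str:
    """Recursively expand a symbol back into the original characters."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)


if __name__ == "__main__":
    seq, rules = repair("aabcaabdaabeaabf")
    print(seq)
    print(rules)
    print("".join(expand(s, rules) for s in seq))
```

Note that the whole input must be available before the first pair count can be taken, which is the non-incremental property discussed above.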
What is needed is a data compression system and method that (1) is a general purpose compression algorithm; (2) is a loss-less compression algorithm; (3) does not require a non-linear increase in execution time for a linear increase in data; (4) does not require a limited data window size; (5) is an incremental compression algorithm; and (6) is a continuous compression algorithm.