The present invention relates to an apparatus and method for processing digital data sequences wherein the data is coded and subsequently restored, and further relates to data compression wherein data that is compressed and then subsequently decompressed is identical to the original.
Any algorithm for lossless data compression must, by definition, allow the original data to be wholly reconstructed from the compressed data. No algorithm in this class, however, can guarantee compression for all possible input data sets, because there are fewer distinct shorter outputs than there are possible inputs of a given length. In other words, for any lossless data compression algorithm there will be an input data set that does not get smaller when processed by the algorithm. Thus any lossless compression algorithm that makes some files shorter will make some files longer as well. Good compression algorithms are those that achieve shorter output on the input distributions that occur in real-world data. While, in principle, any general purpose lossless compression algorithm can be used on any type of data, many are unable to achieve significant compression on data that is not of the form for which they were designed.
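The counting argument behind this limitation can be checked directly. The following is a minimal sketch; the function names are illustrative assumptions and do not come from any cited method. It counts the bit strings of length n and the bit strings strictly shorter than n, and confirms that the shorter ones are always too few to give every input its own shorter output.

```python
def count_strings_of_length(n: int) -> int:
    """Number of distinct bit strings of exactly n bits."""
    return 2 ** n


def count_strings_shorter_than(n: int) -> int:
    """Number of distinct bit strings of 0 to n-1 bits (equals 2**n - 1)."""
    return sum(2 ** k for k in range(n))


# For every n there is one fewer shorter string than there are inputs,
# so a reversible (one-to-one) coder cannot shorten all inputs of length n.
for n in range(1, 17):
    assert count_strings_shorter_than(n) == count_strings_of_length(n) - 1
```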
The most well-known methods for lossless data compression may be classified as follows: (1) run-length coding methods; (2) dictionary-based coding methods, such as Lempel-Ziv algorithms LZ77, LZ78, LZW, and LZRW1; (3) statistics-based coding methods, such as Shannon-Fano coding, Huffman coding (modified Huffman code), and arithmetic coding (binary arithmetic coding, and QM-coder); and (4) coding methods based on data transform, such as Burrows-Wheeler and predictive coding. If the parameters of the algorithms are modified in response to one or more characteristics of the input data, they are referred to as “adaptive;” otherwise, they are considered not adaptive and their parameters are fixed for the whole process of data coding.
Many different run-length coding methods have been developed. Run-length encoding algorithms are based on the observation that certain types of data files frequently contain the same character or digit repeated many times in a row. Digitized signals, for example, often have runs of the same value, indicating that the signal is not changing. In particular, such a data sequence often has frequent runs of zeros, and a simple run-length encoder can target them as follows. Each time a zero is encountered in the input data, the algorithm writes two values to the output file. The first of these values is a zero, a flag to indicate that run-length compression is beginning. The second value is the number of zeros in the run. If the average run-length is longer than two, compression will take place. On the other hand, many single zeros in the data sequence can make the encoded file larger than the original. Run-length encoding can be used on only one of the characters (as with the zero above), several of the characters, or all of the characters. Binary (black-and-white) images, such as standard facsimile transmissions, usually consist of runs of 0's or 1's, so every value can be run-length coded. For example, where the original binary data requires 65 bits for storage, its compact representation requires only 32 bits under the assumption that 4 bits represent each run length. The early facsimile compression standard algorithms were developed based on this principle.
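The zero-run scheme described above may be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the cap of 255 zeros per run is an assumption made so that each count fits in one byte; it is not stated in the description above.

```python
def rle_zeros_encode(data: list[int]) -> list[int]:
    """Replace each run of zeros with the pair (0, run_length)."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 0
            # Assumed limit: at most 255 zeros per pair so the count fits a byte.
            while i < len(data) and data[i] == 0 and run < 255:
                run += 1
                i += 1
            out.extend([0, run])  # flag value 0, then the number of zeros
        else:
            out.append(data[i])   # non-zero values are copied unchanged
            i += 1
    return out


def rle_zeros_decode(encoded: list[int]) -> list[int]:
    """Expand each (0, run_length) pair back into a run of zeros."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == 0:
            out.extend([0] * encoded[i + 1])
            i += 2
        else:
            out.append(encoded[i])
            i += 1
    return out
```

With this sketch, the input 5, 0, 0, 0, 0, 7, 0, 9 (eight values) encodes to 5, 0, 4, 7, 0, 1, 9 (seven values), whereas an input with only isolated zeros, such as 1, 0, 2, 0, 3, grows from five values to seven.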
Dictionary-based coding techniques are also often used for data compression. Most of the popular text compression algorithms use the dictionary-based coding approach. In dictionary coding, groups of consecutive input symbols (phrases) can be replaced by an index into some dictionary. Ziv and Lempel described dynamic dictionary encoders, popularly known as LZ77 and LZ78, which replace phrases with a pointer to where they have occurred earlier in the text. The LZW method achieves compression by using codes 256 through 4095 to represent sequences of bytes. The longer the sequence assigned to a single code, and the more often the sequence is repeated, the higher the compression achieved. Although this is a simple approach, there are two major obstacles that need to be overcome: (1) how to determine which sequences should be in the code table, and (2) how to provide the decompression program the same code table used by the compression program. The LZW algorithm elegantly solves both these problems. When the LZW program starts to encode a file, the code table contains only the first 256 entries, with the remainder of the table being blank. This means that the first codes going into the compressed file are simply the single bytes from the input file, each converted to a 12-bit code. As the encoding continues, the LZW algorithm identifies repeated sequences in the data, and adds them to the code table. Compression starts the second time a sequence is encountered. The key point is that a sequence from the input file is not added to the code table until it has already been placed in the compressed file as individual characters (codes 0 to 255). This is important because it allows the decompression program to reconstruct the code table directly from the compressed data, without having to transmit the code table separately.
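The way the LZW encoder and decoder build the same code table independently can be illustrated with the following sketch. It is a simplified illustration under stated assumptions: codes are returned as a list of integers rather than packed into 12-bit fields, the table simply stops growing once 4096 codes exist, and the function names are hypothetical.

```python
def lzw_encode(data: bytes, max_codes: int = 4096) -> list[int]:
    """Encode bytes as a list of LZW codes (0-255 literals, 256+ sequences)."""
    table = {bytes([i]): i for i in range(256)}  # initial single-byte entries
    next_code = 256
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate            # keep extending the match
        else:
            out.append(table[current])     # emit code for the longest match
            if next_code < max_codes:      # assumed policy: freeze when full
                table[candidate] = next_code
                next_code += 1
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out


def lzw_decode(codes: list[int], max_codes: int = 4096) -> bytes:
    """Rebuild the code table from the code stream and recover the bytes."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    next_code = 256
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == next_code:
            # The code refers to the entry that is about to be created.
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        if next_code < max_codes:
            table[next_code] = previous + entry[:1]
            next_code += 1
        previous = entry
    return b"".join(out)
```

For the input b"ABABABA" this sketch emits the codes 65, 66, 256, 258: two literal bytes followed by two table entries that the decoder reconstructs without ever receiving the table.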
LZ77, another dictionary-based coding approach, was the first form of Ziv-Lempel coding proposed by Ziv and Lempel in 1977. In this approach, a fixed-size buffer containing a previously encoded character sequence that precedes the current coding position can be considered as a dictionary. The encoder matches the input sequence through a sliding window. The window is divided into two parts: a search window that consists of an already encoded character sequence and a look-ahead buffer that contains the character sequence to be encoded. To encode the sequence in the look-ahead buffer, the search window is searched to find the longest match with a prefix of the look-ahead buffer. The match can overlap with the look-ahead buffer, but cannot be the buffer itself. Once the longest match is found, it is coded into a triple <offset, length, C(char)>, where offset is the distance of the first character of the longest match in the search window from the look-ahead buffer, length is the length of the match, and C(char) is the codeword of the symbol that follows the match in the look-ahead buffer.
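The search for the longest match and the resulting triples may be sketched as follows. This is a brute-force illustration, not an efficient implementation: the window sizes, the function names, and the use of the raw character in place of the codeword C(char) are all assumptions made for the example.

```python
def lz77_encode(data: str, search_size: int = 16, lookahead_size: int = 8):
    """Return a list of (offset, length, next_char) triples."""
    triples = []
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        start = max(0, pos - search_size)
        # Reserve one symbol of the look-ahead buffer for the literal
        # that follows the match.
        max_length = min(lookahead_size, len(data) - pos) - 1
        for candidate in range(start, pos):
            length = 0
            # The match may run past pos, i.e. overlap the look-ahead buffer.
            while (length < max_length
                   and data[candidate + length] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = pos - candidate, length
        triples.append((best_offset, best_length, data[pos + best_length]))
        pos += best_length + 1
    return triples


def lz77_decode(triples) -> str:
    """Rebuild the text by copying each match and appending its literal."""
    out = []
    for offset, length, next_char in triples:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])  # symbol-by-symbol copy allows overlap
        out.append(next_char)
    return "".join(out)
```

For the input "aaaaaaa" this sketch produces the triples (0, 0, 'a') and (1, 5, 'a'); the second triple is an example of a match that overlaps the look-ahead buffer.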
LZ78 is the other key algorithm in the Lempel-Ziv family, proposed by Ziv and Lempel in 1978. Instead of using the previously encoded sequence of symbols (or string) in the sliding window as the implicit dictionary, the LZ78 algorithm explicitly builds a dictionary of patterns dynamically at both the encoder and the decoder.
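The explicit dictionary that both sides build may be illustrated with the following sketch. The output format, a list of (dictionary index, next character) pairs with index 0 standing for the empty phrase, is a common textbook presentation and is assumed here; the function names are hypothetical.

```python
def lz78_encode(data: str):
    """Return (index, char) pairs; index 0 denotes the empty phrase."""
    dictionary = {"": 0}
    pairs = []
    phrase = ""
    for ch in data:
        if phrase + ch in dictionary:
            phrase += ch                       # keep extending a known phrase
        else:
            pairs.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)  # learn the new phrase
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit it as prefix + last char.
        pairs.append((dictionary[phrase[:-1]], phrase[-1]))
    return pairs


def lz78_decode(pairs) -> str:
    """Rebuild the same phrase list from the pairs and concatenate it."""
    phrases = [""]
    out = []
    for index, ch in pairs:
        phrase = phrases[index] + ch
        out.append(phrase)
        phrases.append(phrase)
    return "".join(out)
```

The decoder never receives the dictionary; it appends each decoded phrase to its own list in the same order in which the encoder added it.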
Turning to statistics-based coding methods, the Shannon-Fano algorithm is well-known for its simplicity. The algorithm makes use of the original messages m(i) and the corresponding probabilities for their appearance P(m(i)). The messages are listed in order of decreasing probability, and the list is divided into two groups with approximately equal total probability. Every message from the first group has “0” as the first code digit; every message from the second group has “1” as the first code digit. Each group is divided into two parts in a similar way and the second digit is added to the code. The process goes on until groups containing one message only are obtained. As a result, every message will have a corresponding code x with a length of approximately −log2(P(x)) bits. It may be seen that while the Shannon-Fano algorithm is indeed simple, it does not guarantee optimum coding.
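The recursive splitting may be sketched as follows. This is a minimal illustration; the function name and the rule of choosing the split point that minimizes the difference between the two group totals are assumptions, since the description above only requires the groups to be approximately equal.

```python
def shannon_fano(probabilities: dict[str, float]) -> dict[str, str]:
    """Assign a binary code to each message by recursive splitting."""
    # Messages in order of decreasing probability.
    symbols = sorted(probabilities, key=probabilities.get, reverse=True)
    codes = {s: "" for s in symbols}

    def split(group):
        if len(group) < 2:
            return
        total = sum(probabilities[s] for s in group)
        running = 0.0
        best_cut, best_diff = 1, float("inf")
        for cut in range(1, len(group)):
            running += probabilities[group[cut - 1]]
            diff = abs(total - 2 * running)   # imbalance between the groups
            if diff < best_diff:
                best_cut, best_diff = cut, diff
        first, second = group[:best_cut], group[best_cut:]
        for s in first:
            codes[s] += "0"
        for s in second:
            codes[s] += "1"
        split(first)
        split(second)

    split(symbols)
    return codes
```

For probabilities 0.5, 0.25, 0.125 and 0.125 assigned to a, b, c and d, this sketch yields the codes 0, 10, 110 and 111, whose lengths equal −log2 of the respective probabilities. A single-message input receives the empty code in this sketch.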
Another statistics-based coding technique is the Huffman algorithm. To describe this algorithm, consider a group of messages m(1), . . . , m(N) that have probabilities P(m(1)), . . . , P(m(N)), and let them be arranged such that P(m(1)) ≥ P(m(2)) ≥ . . . ≥ P(m(N)). Then, let x1, . . . , xN be a set of binary codes with lengths l1, l2, . . . , lN. The task of the algorithm is to define the correspondence between m(i) and xj. It can be proven that for every set of messages there exists an optimal binary code in which the two codes with lowest probability, xN and xN−1, have the same length and differ only in their last symbol: xN has a last bit of “1”, and xN−1 has a last bit of “0”. The two least probable messages are therefore merged into a single message whose probability is the sum of their probabilities. The reduced set will have its two codes with lowest probability grouped together as well, and the procedure continues in the same way until there remain only two messages.
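The repeated merging of the two least probable messages may be sketched with a priority queue. The following is an illustrative sketch; the function name, the tie-breaking counter, and the handling of a single-message input are assumptions. The least probable of each merged pair receives the bit "1" and the other the bit "0", following the convention described above.

```python
import heapq
from itertools import count


def huffman_codes(probabilities: dict[str, float]) -> dict[str, str]:
    """Build a Huffman code by repeatedly merging the two rarest groups."""
    tiebreak = count()  # unique counter so equal probabilities never compare dicts
    heap = [(p, next(tiebreak), {s: ""}) for s, p in probabilities.items()]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Assumed convention: a lone message still gets a one-bit code.
        return {s: "0" for s in heap[0][2]}
    while len(heap) > 1:
        p_low, _, low = heapq.heappop(heap)      # least probable group
        p_next, _, nxt = heapq.heappop(heap)     # second least probable group
        # Prepending means the first merge decides the LAST bit of each code.
        merged = {s: "1" + c for s, c in low.items()}
        merged.update({s: "0" + c for s, c in nxt.items()})
        heapq.heappush(heap, (p_low + p_next, next(tiebreak), merged))
    return heap[0][2]
```

For probabilities 0.5, 0.25, 0.125 and 0.125 assigned to a, b, c and d, this sketch returns the codes 1, 01, 001 and 000.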
Although Huffman coding is a very efficient entropy coding technique, it has several limitations. The Huffman code is optimal only if the exact probability distribution of the source symbols is known. In addition, each symbol is encoded with an integer number of bits. It is known from Shannon's theory that the optimal length of a binary codeword for a source symbol s from a discrete memoryless source is −log2 p(s), where p(s) is the probability of appearance of symbol s. This condition is exactly satisfied only when the probabilities of the source symbols are negative integer powers of two (e.g., 2^−1, 2^−2, 2^−3, 2^−4, etc.). If the probabilities of the symbols significantly deviate from this ideal condition, encoding of these symbols can result in poor coding efficiency. The average code length less the entropy defines the redundancy of the code. It can be shown that the redundancy of Huffman codes can be bounded by p+0.086, where p is the probability of the most likely symbol [a]. As a result, the redundancy can be very high if the probability of occurrence of one symbol is significantly greater than that of the others. Huffman coding also does not adapt efficiently to changing source statistics. Another limitation of Huffman coding is that the code of the least probable symbol could be too long to store in a single word or basic storage unit of a computing system. In the worst-case scenario, if the probability distribution of the symbols generates a Huffman tree that is a skewed binary tree, the length of the longest two codes will be n−1 if there are n source symbols. For example, for a source of four symbols a, b, c and d whose Huffman tree is a skewed binary tree, the Huffman codes of a, b, c and d can be 1, 01, 001 and 000, respectively. Usually the Huffman codes are stored in a table called the Huffman table. In its simplest form of implementation, each entry in the table usually contains a Huffman code.
Since the Huffman code is a variable-length code, the length of the longest code usually determines the storage required for each entry in the code table. For an arbitrarily long code this is a limitation.
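The redundancy discussed above, namely the average code length less the entropy, can be computed for concrete distributions. The following is a sketch only; the function names and the two example distributions are assumptions chosen for illustration. It derives Huffman code lengths with a priority queue and subtracts the entropy from the resulting average length.

```python
import heapq
from math import log2


def huffman_lengths(probs: list[float]) -> list[int]:
    """Return the Huffman code length of each symbol (by list position)."""
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)  # unique tie-breaker for merged groups
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for i in group1 + group2:
            lengths[i] += 1  # every merge adds one bit to each member's code
        heapq.heappush(heap, (p1 + p2, next_id, group1 + group2))
        next_id += 1
    return lengths


def redundancy(probs: list[float]) -> float:
    """Average Huffman code length minus the source entropy, in bits."""
    lengths = huffman_lengths(probs)
    average = sum(p * n for p, n in zip(probs, lengths))
    entropy = -sum(p * log2(p) for p in probs if p > 0)
    return average - entropy


dyadic = [0.5, 0.25, 0.125, 0.125]   # negative integer powers of two
skewed = [0.9, 0.05, 0.03, 0.02]     # one symbol dominates
print(redundancy(dyadic), redundancy(skewed))
```

For the dyadic distribution the redundancy is zero, while for the skewed one it is roughly half a bit per symbol, which remains below the bound p+0.086 with p = 0.9. Both distributions give code lengths of 1, 2, 3 and 3 bits, so the longest two codes have length n−1 = 3.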
Turning now to arithmetic coding, the basic idea is to consider symbols as digits of a numeration system, and texts as the fractional parts of numbers between 0 and 1. The length of the interval attributed to a digit (it is 0.1 for digits in the usual base 10 system) is made proportional to the frequency of the digit in the text. The encoding is thus comparable to a change in the base of a numeration system. To cope with precision problems, the number corresponding to a text is handled via a lower bound and an upper bound, which amounts to associating a subinterval of [0,1] with each text. The compression results from the fact that large intervals require less precision to separate their bounds.
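The narrowing of the interval may be illustrated with exact rational arithmetic. The following sketch is a teaching illustration rather than a practical coder: it uses Python's Fraction type to sidestep the precision problem entirely, assumes a fixed frequency table shared by both sides, and assumes the decoder is told the message length. All function names are hypothetical.

```python
from fractions import Fraction


def build_intervals(frequencies: dict[str, int]):
    """Give each symbol a sub-interval of [0, 1) proportional to its frequency."""
    total = sum(frequencies.values())
    intervals, low = {}, Fraction(0)
    for symbol in sorted(frequencies):
        width = Fraction(frequencies[symbol], total)
        intervals[symbol] = (low, low + width)
        low += width
    return intervals


def arithmetic_encode(message: str, intervals):
    """Return the final (lower bound, upper bound) for the message."""
    low, high = Fraction(0), Fraction(1)
    for symbol in message:
        span = high - low
        sym_low, sym_high = intervals[symbol]
        low, high = low + span * sym_low, low + span * sym_high
    return low, high


def arithmetic_decode(value: Fraction, length: int, intervals) -> str:
    """Recover `length` symbols from any value inside the final interval."""
    out = []
    for _ in range(length):
        for symbol, (sym_low, sym_high) in intervals.items():
            if sym_low <= value < sym_high:
                out.append(symbol)
                # Rescale so the next symbol is again read from [0, 1).
                value = (value - sym_low) / (sym_high - sym_low)
                break
    return "".join(out)
```

With frequencies of 3 for "a" and 1 for "b", the message "aab" maps to the interval from 27/64 to 9/16, and any value in that interval decodes back to "aab".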
The algorithms for arithmetic coding suffer from a number of limitations. First, the encoded value is not unique because any value within the final range can be considered as the encoded message. It is desirable to have a unique binary code for the encoded message. Second, the encoding algorithm does not transmit anything until encoding of the entire message has been completed. As a result, the decoding algorithm cannot start until it has received the complete encoded data. It may be noted that these first two limitations may be overcome by using binary arithmetic coding. A third limitation is that the precision required to represent the intervals grows with the length of the message. A fixed-point arithmetic implementation is desirable, which can again be achieved using binary arithmetic coding by restricting the intervals using a scaling approach. Fourth, the use of multiplications in the encoding and decoding process, in order to compute the ranges in every step, may be computationally prohibitive for many high-speed real-time applications. Finally, the algorithm is very sensitive to transmission errors; a minor change in the encoded data could represent a completely different message after decoding.
Turning finally to coding methods based on data transform, the Burrows-Wheeler Transform (BWT) algorithm works with blocks of data and ensures efficient lossless data processing. The data block resulting from the transform has the same length as the original block, but a different arrangement of the participating symbols. The algorithm is more efficient when the processed data block is longer. The operation of the algorithm may be explained for a limited input data volume (a row S of length N). The row S is treated as the first of a sequence of N rows: S is cyclically shifted, one symbol at a time, to obtain the N−1 further rows. In fact the rows are not physically duplicated; only a set of pointers into a circular buffer holding the initial row S is created. These pointers are then sorted into the lexicographic order of the rows they designate. The result of applying the BWT algorithm is the row L, formed from the last symbol of each sorted row, together with an initial index giving the position of the element of L in which the first symbol of the original row S is stored.
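The pointer-sorting formulation may be sketched as follows. This is a simple illustration and not an efficient implementation. Note one assumption in particular: the index returned here follows the widely used convention of recording the position of the unshifted row S among the sorted rows, which differs slightly from the index convention described above. The function names are hypothetical, and the inverse uses the straightforward (slow) repeated-sorting method.

```python
def bwt_forward(s: str):
    """Return (L, index): the last column of the sorted rotations of s,
    and the sorted position of s itself (assumed index convention)."""
    n = len(s)
    if n == 0:
        return "", 0
    # Pointers into the circular buffer, ordered by the rotation each begins.
    pointers = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last_column = "".join(s[(i - 1) % n] for i in pointers)
    return last_column, pointers.index(0)


def bwt_inverse(last_column: str, index: int) -> str:
    """Rebuild the sorted rotation table column by column, then pick row `index`."""
    n = len(last_column)
    if n == 0:
        return ""
    table = [""] * n
    for _ in range(n):
        # Prepend L to every row and re-sort; after n rounds the rows are
        # the complete sorted rotations.
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[index]
```

For the input "banana" this sketch returns the row "nnbaaa" and the index 3. The output has the same length as the input, and identical symbols tend to be grouped together.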
Predictive Transform (or DPCM) coding, another data transform technique, is based on the idea of coding each symbol in a memoryless fashion. The symbol is predicted on the basis of information that the decoder also possesses, then a prediction residual is formed, and it is coded. The decoder adds the decoded residual to its version of prediction.
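The prediction and residual steps may be sketched as follows. This is a minimal lossless illustration under stated assumptions: the predictor is simply the previous sample, the initial prediction is zero, and the residuals are returned as integers without any subsequent entropy coding. The function names are hypothetical.

```python
def dpcm_encode(samples: list[int]) -> list[int]:
    """Replace each sample with its difference from the prediction."""
    residuals = []
    prediction = 0  # assumed initial prediction, also known to the decoder
    for sample in samples:
        residuals.append(sample - prediction)
        prediction = sample  # assumed predictor: the previous sample
    return residuals


def dpcm_decode(residuals: list[int]) -> list[int]:
    """Add each residual to the decoder's own copy of the prediction."""
    samples = []
    prediction = 0
    for residual in residuals:
        sample = prediction + residual
        samples.append(sample)
        prediction = sample
    return samples
```

For the slowly varying input 100, 101, 103, 103, 102 the residuals are 100, 1, 2, 0, −1. After the first value the residuals are small and cluster around zero, which is what makes them cheaper to code than the original samples.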
It may be seen from this discussion that each of the prior art approaches for data compression has disadvantages. In particular, prior art lossless data compression techniques may, depending upon the data set, actually increase the volume of the data after compression. What is desired then is an improved lossless data compression method and apparatus that decreases or, at worst, does not significantly increase the data volume after compression for any conceivable data set.