Data compression is becoming increasingly important as the size of computer data (such as text, audio, video, image and program files) continues to grow. Data compression is a way of encoding digital data into an encoded representation that uses fewer bits than the original data. Representing the data in fewer bits means that the data occupies less storage space and requires less transmission bandwidth.
In general, data compression compresses data by predicting the most frequently occurring data and storing it in less space. Specifically, data compression involves at least two different tasks: (1) defining a data model to predict the probabilities of the input data; and (2) using a coder to generate codes from those probabilities. In addition, some data compression techniques mathematically transform and quantize the data to achieve even greater compression.
A compression technique may be lossless or lossy. A lossless compression technique is reversible such that the original data before encoding and the decompressed data after decoding are bit-for-bit identical. Lossy compression uses the fact that there is much repetition in data that can be thrown away without much loss in quality. Lossy compression accepts the loss of some of the original data in order to achieve a higher compression.
Lossless compression typically is used to compress text or binary data, while lossy compression typically is used for audio, image and video data. However, even lossy compression techniques can sometimes use a lossless compression technique. For example, two commonly-used kinds of compression (or coding) techniques are transform coding and predictive coding. For such kinds of compression systems, the original data is transformed and then quantized (rounded to nearest integers), or predicted based on (fixed or adaptive) signal models, and the prediction errors (differences between the original and predicted data) are then quantized. In both cases, the data after quantization are in integer form. Once these integers are obtained, a lossless compression technique is used to encode the quantized values, in order to reduce the number of bits needed to represent the data.
The set of these integer values usually has an associated probability distribution function (PDF). These PDFs have a distribution such that if the data properties are well modeled by the predictor, in predictive coding, then the prediction error should be close to zero most of the time. Similarly, in transform coding, most of the quantized transform coefficients are zero. FIG. 1 illustrates a typical probability distribution for these integer values; zero is the most likely value, and the probabilities of nonzero values decrease nearly exponentially fast as the magnitude increases. The data has the probability distribution shown in FIG. 1 because the data being encoded using the lossless compression technique is not the original data. Instead, FIG. 1 describes the integer data resulting from quantizing transform coefficients or prediction errors.
Mathematically, the problem is to find an efficient solution to encoding a vector x containing N integers. Each of the elements x(n), n=0, 1, . . . , N−1, has a value according to a probability distribution similar to that in FIG. 1, so that the most probable value is zero, and values farther away from zero have fast decreasing probabilities.
A simple mathematical model for probability distributions like the one in FIG. 1 is the Laplacian, or two-sided geometric (TSG) distribution, characterized by a parameter θ:
$$P(x,\theta) = \frac{1-\theta}{1+\theta}\,\theta^{|x|} \tag{1}$$
Note that the parameter θ controls the rate of decay in probability as |x| grows. The smaller the value of θ, the faster the decay. The parameter θ can be directly related to the probability that x=0, that is P(0, θ)=(1−θ)/(1+θ). Also, the expected magnitude of the source symbol is:
$$E[|x|] = \frac{2\theta}{1-\theta^{2}} \tag{2}$$
The entropy of the source is given in bits/symbol by
$$H(x) = \log_2\!\left(\frac{1+\theta}{1-\theta}\right) - \frac{2\theta}{1-\theta^{2}}\,\log_2(\theta) \tag{3}$$
Thus, a good encoder should map a vector of N values of x into a bitstream containing not much more than N·H(x) bits, the theoretical minimum.
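As a numerical sanity check, Equations 1 through 3 can be evaluated directly. The following is a minimal sketch in Python; the function names and the truncation limit are illustrative assumptions and do not appear in the source. It compares the closed forms of Equations 2 and 3 against sums taken directly over the distribution of Equation 1.

```python
import math

def tsg_pmf(x, theta):
    """Equation 1: P(x, theta) for integer x, with 0 < theta < 1."""
    return (1 - theta) / (1 + theta) * theta ** abs(x)

def tsg_mean_magnitude(theta):
    """Equation 2: expected magnitude E[|x|]."""
    return 2 * theta / (1 - theta ** 2)

def tsg_entropy(theta):
    """Equation 3: entropy H(x) in bits/symbol."""
    return (math.log2((1 + theta) / (1 - theta))
            - 2 * theta / (1 - theta ** 2) * math.log2(theta))

def check(theta, limit=2000):
    """Direct sums over x = -limit..limit (assumed wide enough that the
    truncated tail is negligible for the theta being checked)."""
    xs = range(-limit, limit + 1)
    probs = [tsg_pmf(x, theta) for x in xs]
    total = sum(probs)
    mean_mag = sum(abs(x) * p for x, p in zip(xs, probs))
    # Skip terms that underflowed to zero; they contribute nothing.
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return total, mean_mag, entropy

total, mean_mag, entropy = check(0.5)
print(total, mean_mag, tsg_mean_magnitude(0.5))
print(entropy, tsg_entropy(0.5))
```

For a small value such as θ = 0.1, `tsg_entropy` returns a value below 1 bit/symbol, which is the low-entropy regime in which a code spending at least one bit per symbol cannot reach the entropy.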
The Laplacian distribution is a common model in media compression systems, for either prediction errors in predictive coders (like most lossless audio and image coders) or for quantized transform coefficients (like most lossy audio, image, and video coders).
There have been many proposed encoders for sources with a Laplacian/TSG distribution. A simple but efficient encoder is the Golomb-Rice encoder. First, the TSG source values x are mapped to nonnegative values u by the simple invertible mapping:
$$u = Q(x) = \begin{cases} 2x, & x \ge 0 \\ -2x-1, & x < 0 \end{cases} \tag{4}$$
that is equivalent to seeing u as the index to the reordered alphabet {0, −1, +1, −2, +2, . . . }. The new source u has a probability distribution that approximates that of a geometric source, for which Golomb codes are optimal, because they are Huffman codes for geometric sources, as long as the Golomb parameter is chosen appropriately.
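The mapping of Equation 4 and its inverse can be sketched in a few lines of Python. The function names `to_unsigned` and `to_signed` are illustrative assumptions, not names used in the source.

```python
def to_unsigned(x):
    """Equation 4: map a signed integer x to a nonnegative index u."""
    return 2 * x if x >= 0 else -2 * x - 1

def to_signed(u):
    """Inverse of Equation 4: even u gives x >= 0, odd u gives x < 0."""
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

# The reordered alphabet {0, -1, +1, -2, +2, ...} is recovered by index.
print([to_signed(u) for u in range(5)])
```

Because the mapping is invertible, the decoder can recover x exactly from u, so no information is lost before entropy coding.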
An example of Golomb-Rice (G/R) codes is shown in Table 1 for several values of the parameter m. It should be noted that when m equals a power of two, a parameter k is used, which is related to m by m=2^k. The main advantage of G/R codes over Huffman codes is that the binary codeword can be computed by a simple rule, for any input value. Thus, no tables need to be stored. This is particularly useful for modern processors, for which reading from a memory location that stores a table entry can take longer than executing several instructions. It is easy to see that the parameter m determines how many consecutive codewords have the same number of bits. That also indicates that computing the codeword involves computing u/m, where u is the input value. For most processors, an integer division takes many cycles, so the G/R code for general m is not attractive. When m=2^k is chosen, which corresponds to a Rice code, then the division u/m can be replaced by a shift, because u/m=u>>k (where >> denotes a right shift operator). Thus, computing the G/R code for any input u is easy; simply compute p=u>>k and v=u−(p<<k). The code is then formed by concatenating a string of p 1's, a terminating 0, and the k-bit binary representation of v.
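TABLE 1

| Input value | m = 1 (k = 0) | m = 2 (k = 1) | m = 3 | m = 4 (k = 2) | m = 5 | ... | m = 8 (k = 3) |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 00 | 00 | 000 | 000 | ... | 0000 |
| 1 | 10 | 01 | 010 | 001 | 001 | ... | 0001 |
| 2 | 110 | 100 | 011 | 010 | 010 | ... | 0010 |
| 3 | 1110 | 101 | 100 | 011 | 0110 | ... | 0011 |
| 4 | 11110 | 1100 | 1010 | 1000 | 0111 | ... | 0100 |
| 5 | 111110 | 1101 | 1011 | 1001 | 1000 | ... | 0101 |
| 6 | 1111110 | 11100 | 1100 | 1010 | 1001 | ... | 0110 |
| 7 | 11111110 | 11101 | 11010 | 1011 | 1010 | ... | 0111 |
| 8 | 111111110 | 111100 | 11011 | 11000 | 10110 | ... | 10000 |
| 9 | 1111111110 | 111101 | 11100 | 11001 | 10111 | ... | 10001 |
| 10 | 11111111110 | 1111100 | 111010 | 11010 | 11000 | ... | 10010 |
| 11 | 111111111110 | 1111101 | 111011 | 11011 | 11001 | ... | 10011 |
| 12 | 1111111111110 | 11111100 | 111100 | 111000 | 11010 | ... | 10100 |
| 13 | 11111111111110 | 11111101 | 1111010 | 111001 | 110110 | ... | 10101 |
| ... | ... | ... | ... | ... | ... | ... | ... |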
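The shift-based rule for the Rice case (m a power of two) can be sketched as follows. This is an illustrative Python sketch that represents bits as text strings for readability; the function names are assumptions, and a practical encoder would pack bits into bytes. A single 0 terminates the run of 1's, as the codewords in Table 1 show.

```python
def rice_encode(u, k):
    """Return the Rice codeword for nonnegative u with parameter k."""
    p = u >> k                 # quotient u / 2**k, computed by a shift
    v = u - (p << k)           # remainder, 0 <= v < 2**k
    suffix = format(v, "b").zfill(k) if k > 0 else ""
    return "1" * p + "0" + suffix

def rice_decode(bits, k, pos=0):
    """Decode one codeword starting at bits[pos].
    Returns (u, position of the next codeword)."""
    p = 0
    while bits[pos] == "1":    # count the run of 1's
        p += 1
        pos += 1
    pos += 1                   # skip the terminating 0
    v = int(bits[pos:pos + k], 2) if k > 0 else 0
    return (p << k) + v, pos + k

# Codewords for u = 5 with k = 0, 1, 2, 3 (compare with row 5 of Table 1).
print([rice_encode(5, k) for k in range(4)])
```

Because each codeword is self-delimiting, codewords can be concatenated without separators and decoded one after another by passing the returned position back into `rice_decode`.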
It is clear from Table 1 that the choice of the G/R parameter k must depend on the statistics of the source. The slower the decay of probability as u increases, the larger k should be chosen. Otherwise, the codeword lengths grow too quickly. A simple rule for choosing k is that the codeword length for a given input value u should approximate the negative of the logarithm base 2 of the probability of occurrence of that value.
Although G/R codes are optimal for geometrically-distributed sources, they are not optimal for encoding symbols from a Laplacian/TSG source via the mapping in Equation 4. This is because for an input variable x with a TSG distribution, the variable u from Equation 4 has a probability distribution that is close to but not exactly geometric. In practice, the performance is close enough to optimal (e.g. with a rate that is typically less than 5% above the entropy), so G/R codes are quite popular. The optimal codes for TSG sources involve a set of four code variants, which are more complex to implement and improve compression by 5% or less in most cases. Therefore, in most cases G/R coders provide the best tradeoff between performance and simplicity.
In FIG. 1, the probability distribution is represented by a single parameter, which is the rate of decay of the exponential. The faster the rate of decay, the more likely is the value of zero. This means that in many cases zero is so likely that runs of zeros become very likely. In other words, if the probability distribution rate of decay is fast enough, then encoding runs is a good idea. Encoding runs of zeros means that just a few bits are used to take care of many entries in the input data.
Encoding runs of data is efficiently performed using Run-Length encoding. Run-Length encoding is a simple form of data compression in which sequences of the same value repeated consecutively (or “runs”) are stored as a single data value and the length of the run, rather than as the original run.
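A minimal sketch of this idea in Python follows. It stores each run as a (value, length) pair; the function names and the pair representation are illustrative assumptions, and no particular bitstream format is implied.

```python
def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [(v, n) for v, n in runs]

def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

data = [0, 0, 0, 0, 3, 0, 0, -1]
print(run_length_encode(data))
```

For data dominated by long runs of zeros, the list of pairs is much shorter than the input; for data with few repeats it can be longer, which is why run-length coding pays off only when the decay is fast.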
Prediction errors are much more likely to be zero if the data matches the model used by the predictor in predictive coding, for example. It is possible, however, even with a good model, to have a large value every once in a while. This can occur when a boundary is reached, such as when a pixel value goes from a background value to a foreground value. Every now and then big numbers can occur. When this happens, one type of encoding technique that is more useful than Run-Length encoding is known as a “Run-Length Golomb/Rice (RLGR)” encoding technique. One such RLGR encoding technique is disclosed in U.S. Pat. No. 6,771,828 to Malvar entitled “System and Method for Progressively Transform Coding Digital Data” and U.S. Pat. No. 6,477,280 to Malvar entitled “Lossless Adaptive Encoding of Finite Alphabet Data”.
In reality, with the source of data varying, the probabilities will not stay constant and will vary over time. This is true with, for example, images and audio. Typically, these probability variations in the input data are handled in a variety of different ways. In JPEG, for example there is an entropy coder (a Huffman coder) whereby codewords of different lengths are used for different values to be encoded. The Huffman table is usually pre-designed, that is, typically a number of images are obtained, their probabilities are measured, and an average model is constructed that is used for all images. One problem with this approach is that with every portion of an image there is a loss in encoding efficiency, because the probability model being used by the entropy coder is good on average but not necessarily good for that portion of the image.
From Table 1 it can be seen that there are two main issues with Golomb/Rice codes: (1) the probability decay parameter θ, or equivalently the probability P(x=0), must be known, so the appropriate value of k can be determined; and (2) if the decay parameter is too small, the entropy H(x) is less than 1, and thus the Golomb/Rice code is suboptimal, since its average codeword length cannot be less than 1 bit/symbol.
In practice, the first issue (estimation of the optimal Golomb/Rice parameter) is usually addressed by dividing the input vector into blocks of a predetermined length. For each block, the encoder makes two passes over the data. In the first pass, the average magnitude of input values is computed. From that, the parameter θ can be estimated from Equation 2, and the corresponding optimal k can be determined. In a second pass, the encoder generates the bitstream for the block by first outputting the value of k in binary form, followed by the concatenated strings of Golomb/Rice codes for the data values within the block. This is the approach used in essentially all lossless compression systems that use Golomb/Rice codes, such as JPEG-LS for lossless image compression, SHORTEN for lossless audio compression, and others. This is called a “blockwise adaptation” or “forward adaptation” model. The forward adaptation model is forward in the sense that the encoder looks at the data first before encoding, measures a statistical parameter (usually the average magnitude), and then encodes based on that parameter and puts the value of the parameter used to encode the data in a header, for use by the decoder. Instead of trying to code the data all at once, the data is broken up into small portions, or blocks. For each block, the statistics of that block are measured, a statistical parameter matching that portion of the data (the contents of the buffer) is computed, and the entropy coder is adjusted to that parameter. In the encoded file a header is inserted that indicates the value of the parameter being used to encode that block of data.
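The two-pass, forward-adaptive procedure can be sketched as follows in Python. Several details are assumptions made only for illustration and are not taken from the source: θ is estimated by solving Equation 2 for θ, which gives θ = (√(1 + E²) − 1)/E for mean magnitude E; k is chosen by minimizing the expected Rice codeword length under Equation 1; the header is a fixed 4-bit field; and bits are represented as text strings.

```python
import math

def estimate_theta(mean_magnitude):
    """Solve Equation 2, E = 2*theta / (1 - theta**2), for theta."""
    if mean_magnitude == 0:
        return 0.0
    return (math.sqrt(1 + mean_magnitude ** 2) - 1) / mean_magnitude

def expected_rice_length(theta, k, limit=2048):
    """Expected bits/symbol of a Rice code with parameter k under Equation 1.
    The sum is truncated at |x| <= limit (an assumption of this sketch)."""
    total = 0.0
    for x in range(-limit, limit + 1):
        prob = (1 - theta) / (1 + theta) * theta ** abs(x)
        u = 2 * x if x >= 0 else -2 * x - 1        # Equation 4
        total += prob * ((u >> k) + 1 + k)          # run of 1's, 0, k bits
    return total

def choose_k(theta, k_max=15):
    """Hypothetical selection rule: the k with the smallest expected length."""
    return min(range(k_max + 1), key=lambda k: expected_rice_length(theta, k))

def encode_blocks(values, block_size=256, k_bits=4):
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        # Pass 1: measure the block statistic and pick k.
        mean_mag = sum(abs(x) for x in block) / len(block)
        k = choose_k(estimate_theta(mean_mag))
        # Pass 2: emit k as a header, then the Rice codes for the block.
        out.append(format(k, "b").zfill(k_bits))
        for x in block:
            u = 2 * x if x >= 0 else -2 * x - 1
            p = u >> k
            suffix = format(u - (p << k), "b").zfill(k) if k else ""
            out.append("1" * p + "0" + suffix)
    return "".join(out)

def decode_blocks(bits, count, block_size=256, k_bits=4):
    """Decode `count` values; the count is assumed known to the decoder."""
    values, pos = [], 0
    while len(values) < count:
        k = int(bits[pos:pos + k_bits], 2)          # read the block header
        pos += k_bits
        for _ in range(min(block_size, count - len(values))):
            p = 0
            while bits[pos] == "1":
                p += 1
                pos += 1
            pos += 1                                # terminating 0
            v = int(bits[pos:pos + k], 2) if k else 0
            pos += k
            u = (p << k) + v
            values.append(u // 2 if u % 2 == 0 else -(u + 1) // 2)
    return values
```

The sketch makes the cost of forward adaptation visible: each block is read twice, and every block carries `k_bits` of header overhead regardless of how few samples it holds.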
The second issue in practice, namely, encoding sources with very low entropy, is usually addressed using a blockwise adaptation or forward adaptation model, and if the average magnitude value of the input symbols in the block is small enough that the estimated entropy H(x) is less than 1, then the encoder uses Run-Length coding, instead of Golomb/Rice coding.
Although these approaches work well in practice, they have two main disadvantages. One disadvantage is that the encoder needs to read each input block twice, such that two passes are performed on the data: a first time to compute the average magnitude to determine the Golomb/Rice parameter, and a second time to perform the actual encoding. This requires the encoder to perform additional work and adds complexity. In some applications encoding time is not an issue, but for digital cameras, for example, it can slow down the encoding process or increase the cost of random-access memory. In particular, the forward adaptation model must first look at the data and measure the statistics, find model parameters, and then encode. This is not an issue if the encoder runs on a personal computer having a great deal of processing power. However, if pictures are taken with a cell phone, they are encoded by the cell phone itself, where processing power is much more limited.
The second and most important disadvantage involves the difficulty in choosing the block size. If the block size is too large, the statistics could change dramatically within the block. On the other hand, if the block size is too small, then the overhead of having to tell the decoder which parameter was used to encode that block of data becomes burdensome. For every block, the encoder must store which parameter values are being used to encode that block. At some point the overhead required to encode the small block is not worth the compression achieved. This creates a trade-off. On the one hand, if a small block is used, the statistics of the block can be matched; however, measuring the statistics is difficult because there are few numbers, and the overhead of encoding is great. On the other hand, if a large block is used, the problem is that the statistics can vary greatly within the block. In practice, it is hard to find a compromise between those two conflicting factors, so that the block size is usually chosen to be between 128 and 2,048 samples, depending on the type of data to be encoded.
One solution is to use a backward-adaptive technique in the encoder. With backward adaptation, encoding starts with the decoder and encoder agreeing on initial states for each block. In other words, each parameter is initialized to a predetermined value, and then the encoding begins. Every time the encoder produces an output symbol, that symbol can be sent to the decoder immediately, because the decoder knows the parameter values used to encode it. After the encoder outputs a symbol, it then computes new values for the encoding parameters, depending on the symbol that was output, according to a predetermined adaptation rule. The decoder knows the parameter adaptation rule, and therefore it can also compute the new values for the encoding parameters. Thus, the encoding parameters are adjusted after every encoded symbol, and the encoder and decoder are always in sync, that is, the decoder tracks the changes in the encoding parameters. This means that the encoder does not need to send the decoder any overhead information in terms of what parameter values were used to encode the data.
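The synchronization argument can be illustrated with a small Python sketch. The adaptation rule used here (decrease k when the run of 1's is empty, increase k when it is longer than one) is a hypothetical rule chosen only for illustration; it is not the rule of any system cited in this document. The initial value `k0`, the bound `k_max`, and the function names are likewise assumptions.

```python
def adapt(k, p, k_max=15):
    """Hypothetical adaptation rule shared by encoder and decoder.
    p is the length of the run of 1's in the codeword just processed."""
    if p == 0:
        return max(k - 1, 0)       # value was small: shrink k
    if p > 1:
        return min(k + 1, k_max)   # value was large: grow k
    return k

def encode(values, k0=2):
    k, out = k0, []
    for x in values:
        u = 2 * x if x >= 0 else -2 * x - 1        # Equation 4
        p = u >> k
        suffix = format(u - (p << k), "b").zfill(k) if k else ""
        out.append("1" * p + "0" + suffix)
        k = adapt(k, p)            # update only after emitting the symbol
    return "".join(out)

def decode(bits, count, k0=2):
    """Decode `count` values; the count is assumed known to the decoder."""
    k, pos, values = k0, 0, []
    for _ in range(count):
        p = 0
        while bits[pos] == "1":
            p += 1
            pos += 1
        pos += 1                   # terminating 0
        v = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        u = (p << k) + v
        values.append(u // 2 if u % 2 == 0 else -(u + 1) // 2)
        k = adapt(k, p)            # same rule, same input, same new k
    return values

data = [0, 0, 0, 1, -1, 40, -35, 2, 0, 0, 0, 0, 100, -3]
print(decode(encode(data), len(data)) == data)
```

The bitstream carries no parameter headers: the decoder recovers every value of k by applying the same rule to the same decoded quantities, and the data is read only once.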