1. Field of the Invention
This invention relates in general to the field of video processing. In particular, this invention relates to bit allocation during digital image compression.
2. Description of the Related Art
A transition to digital technology has pervaded the communications, information, and entertainment industries in recent years. Digital images and digital video are proliferating and filling an important role in this technological conversion. Digital cameras, DVD movies, and streaming video over the World Wide Web provide a few examples of the ubiquity of digital images. But because digital images, and digital video in particular, contain large quantities of information, they can be costly in terms of storage capacity needed and bandwidth required for transmission. Thus, compression of the data required for storage or transmission is necessary, and consequently the problem of achieving more efficient compression has drawn significant attention.
A variety of techniques for image and video compression have been explored. For example, the H.261, H.263, JPEG, MPEG-1, MPEG-2, and MPEG-4 industry standards have employed various strategies with improving results over time. The feasibility of image compression derives from the fact that images often contain areas that are consistent, or correlated, in color (or in intensity, in the case of gray-scale images). Compression schemes seek to exploit these correlations by summarizing the data by regions rather than point by point. For instance, a typical digital image is stored as an array of pixels with a color value associated to each pixel. The array might be 720 by 480, and the color for each pixel might be represented by three (red, green, blue) component values ranging from 0 to 255. To digitally record three eight-bit numbers for each pixel in a 720 by 480 array requires at least 3×8×720×480=8,294,400 bits, or approximately 1 megabyte. But by noticing that in certain regions neighboring pixels have substantially the same color, one may avoid recording three eight-bit numbers for every pixel and instead record color data for each of several pixel regions, or use a smaller number of bits for some pixels.
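By way of illustration only, the storage arithmetic above and the idea of recording data by region rather than by pixel can be sketched in Python. Run-length coding is used here merely as a simple stand-in for region-based summarization; it is not asserted to be the technique of any of the named standards, and the function names are hypothetical.

```python
def raw_bits(width, height, channels=3, bits_per_channel=8):
    """Bits needed to store every pixel component without compression."""
    return width * height * channels * bits_per_channel


def run_length_encode(row):
    """Collapse a row of pixel values into [value, run_length] pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # same color as the previous pixel
        else:
            runs.append([value, 1])   # a new region begins
    return runs


bits = raw_bits(720, 480)
print(bits, "bits =", bits // 8, "bytes")

# A hypothetical gray-scale row with two uniform regions.
row = [200] * 6 + [35] * 4
print(run_length_encode(row))
```

In this sketch a row of ten pixels is summarized by two pairs, which is the kind of saving that becomes available when neighboring pixels have substantially the same color.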
In the case of a digital video sequence, one may also exploit the similarities between images that are close together in the sequence to further compress the data. For instance, rather than recording new data for a subsequent frame, one may retain the data for the previous frame and record only the differences between the subsequent frame and the previous frame. If the compression scheme successfully captures the similarities between frames in this way, these difference values for the subsequent frame will be substantially smaller than the original values for that frame's pixels. The crucial observation that makes compression possible is that these smaller values potentially require fewer bits to store. But in order to realize this advantage fully, one must take the additional step of quantizing the data, or replacing each numerical value with an approximation chosen from a discrete set of values. For example, rounding decimal numbers up or down to the nearest integer value is a standard case of quantization. Quantization is necessary because small decimals, for example 0.245, may require as many bits to encode as do large integer values, like 255. But when small numbers are rounded, for instance to integer values such as 0, 1 or −1, they require very few bits to encode. After quantization, the difference values mentioned above will often be small integers and will ideally include many zeros. A number of existing coding strategies dramatically conserve bits on zero values and small integers, so these difference values require far fewer bits to encode than the original values (which in the prototypical example are integers ranging up to 255).
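The frame-differencing and quantization steps described above may be sketched as follows. This is a minimal illustration under stated assumptions: frames are short one-dimensional lists of gray-scale values, the quantizer is a uniform round-to-nearest quantizer with a hypothetical step size of 4, and all function names are invented for the example.

```python
import math


def frame_difference(previous, current):
    """Per-pixel difference between a frame and the frame before it."""
    return [c - p for p, c in zip(previous, current)]


def quantize(values, step):
    """Uniform quantizer: replace each value by the nearest multiple of step."""
    return [step * math.floor(v / step + 0.5) for v in values]


previous = [100, 102, 105, 110]
current = [101, 102, 104, 117]

diff = frame_difference(previous, current)
quantized = quantize(diff, step=4)
# The decoder adds the quantized differences back to the previous frame.
reconstructed = [p + q for p, q in zip(previous, quantized)]

print(diff)
print(quantized)
print(reconstructed)
```

Most of the quantized differences in this example collapse to zero, which is what allows zero-favoring coding strategies to conserve bits, while the reconstructed frame differs slightly from the current frame.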
Unfortunately, compression involving quantization comes with some cost in terms of image quality. As pixel values, or even difference values or other specifications for image reproduction generated by an encoder, are quantized, they no longer match exactly the data of the original digital image. Thus, the image reproduced by a decoder will not be exactly the same as the original image. For example, FIG. 1 illustrates the dramatic effect quantization can have. The original image 100 consists of a gradual smooth transition from black to gray to white. The new image 104 shows the result of quantizing (102) all gray-scale values between 0 and 255 into two values: black and white. Values between 0 and 127 are quantized as black, and values between 128 and 255 are quantized as white in this example of coarse quantization. While such a coarse approximation of an original image may be adequate for some purposes, clearly a large amount of information is lost resulting in a noticeable visual disparity between images 100 and 104.
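The coarse quantization of FIG. 1 can be mimicked with a few lines of Python. The sketch assumes that black is stored as 0 and white as 255, which the text does not specify, and it uses a six-sample ramp as a hypothetical stand-in for image 100.

```python
def quantize_two_level(value):
    """Map 0..127 to black (assumed 0) and 128..255 to white (assumed 255)."""
    return 0 if value <= 127 else 255


# Hypothetical six-sample ramp from black through gray to white.
ramp = list(range(0, 256, 51))
coarse = [quantize_two_level(v) for v in ramp]
mean_abs_error = sum(abs(a - b) for a, b in zip(ramp, coarse)) / len(ramp)

print(ramp)
print(coarse)
print(mean_abs_error)
```

Every intermediate gray level is lost, and the mean absolute error gives one simple numerical measure of the visual disparity described above.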
The preceding example indicates an important trade-off in the field of image compression. Generally speaking, the smaller the number of bits used to encode a compressed image, the worse the distortion of that image will be. In fact, in “Coding theorems for a discrete source with a fidelity criterion,” IRE Nat. Conv. Rec. 4, 142–163, 1959, C. E. Shannon showed that a theoretical limit for image quality versus compression can be found under certain assumptions on the class of images. A rate-distortion curve indicates the smallest distortion that is possible for a given bit rate allocation in image compression. A common situation in which this trade-off becomes important involves transferring image or video data under a constraint on the available bandwidth. If the bandwidth constraint is small enough to prevent expedient transfer of the full image or video, then a compression scheme becomes necessary. The desire arises to obtain the least possible distortion (i.e. the best possible image quality) while maintaining the bit rate allocation below the limit imposed by the constraint. The degree to which a compression strategy successfully allocates bits to preserve image quality while meeting its bit-rate constraint largely determines its viability in limited bandwidth applications.
Pyramidal Decomposition
The method of pyramidal image decomposition, which can be applied either to still images or to video, was introduced by Peter J. Burt and Edward H. Adelson in “The Laplacian Pyramid as a Compact Image Code,” IEEE Transactions on Communications, Vol. Com-31, No. 4, April 1983. This method uses the correlation of color values within an image to take differences, resulting in small difference values that can be coded very efficiently. A variety of pyramidal decomposition schemes have been proposed, but the general approach is as follows. A base image I0 is filtered (with some filtering function) and downsampled to a lower resolution, resulting in I1. Then, I1 is filtered and downsampled to a still lower resolution, yielding I2, and the process continues sequentially until a hierarchy (or pyramid) I0, I1, . . . , Im of bands of decreasing resolution is generated. Burt and Adelson refer to this hierarchy of filtered and downsampled image layers (called “bands”) as a Gaussian pyramid, due to the approximately Gaussian shape of the filtering function they employ. In this specification, a hierarchy of image bands formed by the above process with any filtering function and any method of downsampling is referred to as a Filtered pyramid. FIG. 2a is a diagram showing the process for creating Ij+1 given Ij, comprising the steps of applying a filtering function (202) to the initial image 200 and then sampling pixels at a coarser scale (206) to create a new smaller-scale image 208. Next, a prediction function is applied to Im to create an approximation Jm−1 for band Im−1, and this process continues sequentially until a hierarchy Jm−1, . . . , J1, J0 of predicted bands is generated. The pyramid of predicted bands, Im, Jm−1, . . . , J1, J0, is herein referred to as a Predictive pyramid. For band i where i=0, 1, . . . , m−1, let Ei=Ii−Ji be the difference between the original band image and the predicted image. FIG. 2b illustrates the process of predicting Ji from Ii+1 and computing the error terms Ei. If appropriate filtering and predicting functions are applied, then the Ii and the Ji should be very similar, resulting in small error terms Ei. In theory, the original image I0 can be reconstituted using only Im and Em−1, . . . , E1, E0, since Jm−1 may be computed from Im by applying the prediction function, since Im−1=Em−1+Jm−1, and since one may recursively apply the prediction function and add the error term until one arrives at I0. Thus, a decoder can generate the original image I0 if it is given only data for Im and E0, E1, . . . , Em−1. Burt and Adelson refer to their “difference” pyramid composed of Im at the top level and Em−1, . . . , E1, E0 at the lower levels as a Laplacian pyramid. The more general term Difference pyramid is used herein to refer to this collection of Im and Em−1, . . . , E1, E0.
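The construction of the Filtered pyramid and the Difference pyramid, and the exact reconstruction of I0 in the absence of quantization, may be sketched as follows. The sketch works on a one-dimensional signal and assumes a two-tap averaging filter with downsampling by two and a sample-repeating prediction function; these choices, and all function names, are hypothetical simplifications and are not the filters of Burt and Adelson.

```python
def reduce_band(band):
    """Filter (average adjacent pairs) and downsample by two: I_j -> I_{j+1}."""
    return [(band[k] + band[k + 1]) / 2 for k in range(0, len(band), 2)]


def predict_band(coarse):
    """Predict the next finer band by repeating each coarse sample: I_{i+1} -> J_i."""
    return [v for v in coarse for _ in range(2)]


def filtered_pyramid(image, levels):
    """Return the bands [I_0, I_1, ..., I_m] with m = levels."""
    bands = [image]
    for _ in range(levels):
        bands.append(reduce_band(bands[-1]))
    return bands


def difference_pyramid(bands):
    """Return (I_m, [E_0, ..., E_{m-1}]) with E_i = I_i - J_i."""
    errors = []
    for i in range(len(bands) - 1):
        predicted = predict_band(bands[i + 1])
        errors.append([a - b for a, b in zip(bands[i], predicted)])
    return bands[-1], errors


def reconstruct(top, errors):
    """Recover I_0 from I_m and the error terms, working from the top down."""
    band = top
    for error in reversed(errors):
        band = [p + e for p, e in zip(predict_band(band), error)]
    return band


image = [10, 12, 14, 16, 40, 42, 44, 46]
bands = filtered_pyramid(image, levels=2)
top, errors = difference_pyramid(bands)
print(top, errors)
print(reconstruct(top, errors) == image)
```

Because no quantization is applied, the reconstruction in this sketch is exact, and the error terms are small relative to the original sample values.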
FIG. 3 illustrates a typical Difference pyramid. In this example, the bands of the pyramid (300) have been generated by the process described above, so that the top band is a coarse approximation of an original image and the lower bands provide error terms that may be used to reconstruct the original image in a downward sequential fashion. Each band consists of an array of pixel values (302). For example, for a color image these values may have three components for three colors (red, green, blue), and each component value may be an integer between 0 and 255. In FIG. 3, each higher band is a smaller array than the band below. Note however that in general the sizes of the bands need not differ (e.g. the subsampling step could sample fully at the same resolution).
Since Im is of very low resolution (i.e. contains few pixels), it requires few bits to encode. If the small error terms Ei are quantized, then they will also require very few bits. As a consequence, encoding Im and all of the Ei (after quantization) can result in substantial savings over encoding the original image I0 directly.
As stated before, the quantization step is crucial to realize this potential compression. But when the Ei are quantized yielding Q(Ei), in general the equations Q(Ei)=Ii−Ji will no longer be exactly true. Even worse, the quantization errors introduced at higher levels of the pyramid will propagate throughout the calculations (this is called quantizer feedback), causing inaccurate predictions Ĵi and causing the errors at lower levels to be magnified. As a result, the decoder will not be able to reconstruct an exact copy of the original image I0. A successful pyramidal coding scheme seeks to quantize carefully in order to keep the number of bits required below some bound while minimizing the amount of distortion.
When one considers the application of pyramid decomposition to image or video compression, the importance of the dependent nature of quantization choices becomes clear. By “dependent” it is meant that at least one later stage is affected by the quantization scale selected for an earlier stage of the compression scheme. In such applications, an encoder seeks to compress the data for an image into a smaller number of bits, and a decoder reconstructs an approximate image from this compressed data. A pyramid is useful for compression because the decoder can apply the prediction function to predict a lower band once it knows the next higher band, reducing the amount of data that must be transferred. Two primary coding strategies, open-loop coding and closed-loop coding, have divergent effects on the accumulation of quantization error during the reconstruction process.
In open-loop coding, the encoder sequentially quantizes and transmits the bands of the Difference pyramid to the decoder. At each stage, the decoder uses the previous reconstructed band to predict the next lower band, then it adds the quantized error term (from the Difference pyramid) sent by the encoder to this prediction in order to “correct” it, or bring it closer to the corresponding band in the Filtered pyramid. Using the above notation, the decoder reconstructs Îi=Ĵi+Q(Ei), where the hat symbol denotes that these values differ somewhat from the Ii and Ji in the Filtered and Predictive pyramids, respectively. At each stage of this process, the decoder's prediction differs from the corresponding predicted band in the Predictive pyramid (i.e. Ĵi differs from Ji) because the decoder makes its prediction from quantized rather than raw data. But the error term Q(Ei) sent to correct this prediction at each stage is designed to correct the corresponding band in the Predictive pyramid, not the decoder's flawed prediction. So even before the error term is quantized, it is clear that it will not correct the discrepancy due to quantization at the previous step. Since the error term is also quantized before reaching the decoder, even more of a discrepancy is introduced. In this way, quantization error mounts at each stage of the open-loop coding process.
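The open-loop reconstruction Îi=Ĵi+Q(Ei) may be sketched as follows. The sketch assumes a one-dimensional signal, a two-tap averaging filter with downsampling by two, a sample-repeating prediction function, and a uniform quantizer with a step size of 4; all of these choices and all function names are hypothetical.

```python
import math


def reduce_band(band):
    """Assumed filter: average adjacent pairs and downsample by two."""
    return [(band[k] + band[k + 1]) / 2 for k in range(0, len(band), 2)]


def predict_band(coarse):
    """Assumed prediction function: repeat each coarse sample twice."""
    return [v for v in coarse for _ in range(2)]


def quantize(values, step):
    """Uniform quantizer: nearest multiple of step."""
    return [step * math.floor(v / step + 0.5) for v in values]


def open_loop_decode(image, levels, step):
    # Encoder: Filtered pyramid I_0..I_m and error terms E_i = I_i - J_i,
    # where each J_i is predicted from the raw (unquantized) band I_{i+1}.
    bands = [image]
    for _ in range(levels):
        bands.append(reduce_band(bands[-1]))
    errors = [[a - b for a, b in zip(bands[i], predict_band(bands[i + 1]))]
              for i in range(levels)]
    # Decoder: start from Q(I_m), then add Q(E_i) to its own prediction,
    # which is made from already-quantized data.
    recon = quantize(bands[-1], step)
    for i in reversed(range(levels)):
        predicted = predict_band(recon)
        recon = [p + q for p, q in zip(predicted, quantize(errors[i], step))]
    return recon


image = [10, 12, 14, 16, 40, 42, 44, 46]
decoded = open_loop_decode(image, levels=2, step=4)
worst = max(abs(a - b) for a, b in zip(image, decoded))
print(decoded, worst)
```

For this input the worst-case reconstruction error is 4, which exceeds the half-step error of 2 that a single quantization with this step size could introduce, illustrating how error mounts from band to band.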
In contrast, closed-loop coding seeks to correct the error due to quantization at each new step in the process. Whenever the encoder sends data for a band, some error is again introduced by quantization. Thus, at band i<m the decoder will create a prediction Ĵi that differs from Ji in the Predictive pyramid as before. However, rather than merely encoding the error term contained in the ith band of the Difference pyramid, as in open-loop coding, the encoder performs the same prediction Ĵi from quantized data that the decoder performs and subtracts it from Ii in the Filtered pyramid to create a more accurate error-correction term, Êi=Ii−Ĵi. Note that this error term cannot be found until after the previous band has already been reconstructed, in contrast to open-loop coding. This term is more accurate because it forces the equation Ii=Ĵi+Êi to hold. Of course, this error term is quantized before reaching the decoder, so the decoder reconstructs Îi=Ĵi+Q(Êi), which still differs from the original Ii. But the discrepancy at this stage is only due to the most recent quantization error.
FIG. 4 illustrates the stepwise closed-loop process from the perspective of both the encoder and the decoder. In step 1, the encoder forms m bands by sequentially filtering and subsampling the previous band, resulting in a Filtered pyramid. In step 2, the encoder quantizes the top band m and sends the quantized data to the decoder, while the decoder receives said quantized data for band m. In step 3, the encoder and the decoder use an identical process to predict band m−1 using the quantized data for band m. In step 4, the encoder computes a band m−1 difference term by subtracting the predicted band m−1 of step 3 from the raw band m−1 of the Filtered pyramid. In step 5, the encoder quantizes the band m−1 difference term and sends the quantized data to the decoder, while the decoder receives said quantized data for band m−1. In step 6, the encoder and the decoder both add the quantized band m−1 difference term to the predicted band m−1, resulting in a corrected band m−1. In step 7, the encoder and the decoder use an identical process to predict the next lower band m−2 using the corrected band m−1. In step 8, as in step 4, the encoder computes a band m−2 difference term by subtracting the predicted band m−2 of step 7 from the raw band m−2 of the Filtered pyramid. The last box indicates that analogues of steps 5 through 8 should be repeated until no lower bands remain. At this stage, the corrected band 0 will be the decoder's reconstruction of the original image.
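The closed-loop steps of FIG. 4 may be sketched as follows. Because the encoder and the decoder perform identical prediction and correction, a single function models both. The sketch assumes a one-dimensional signal, a two-tap averaging filter with downsampling by two, a sample-repeating prediction function, and a uniform quantizer with a step size of 4; these choices and all function names are hypothetical.

```python
import math


def reduce_band(band):
    """Assumed filter: average adjacent pairs and downsample by two."""
    return [(band[k] + band[k + 1]) / 2 for k in range(0, len(band), 2)]


def predict_band(coarse):
    """Assumed prediction function: repeat each coarse sample twice."""
    return [v for v in coarse for _ in range(2)]


def quantize(values, step):
    """Uniform quantizer: nearest multiple of step."""
    return [step * math.floor(v / step + 0.5) for v in values]


def closed_loop_decode(image, levels, step):
    # Step 1: form the Filtered pyramid I_0..I_m.
    bands = [image]
    for _ in range(levels):
        bands.append(reduce_band(bands[-1]))
    # Step 2: quantize and send the top band.
    recon = quantize(bands[-1], step)
    for i in reversed(range(levels)):
        # Steps 3 and 7: predict band i from the corrected band i+1.
        predicted = predict_band(recon)
        # Steps 4 and 8: difference between raw band i and that prediction.
        diff = [a - p for a, p in zip(bands[i], predicted)]
        # Steps 5 and 6: quantize the difference and add it to the prediction.
        recon = [p + q for p, q in zip(predicted, quantize(diff, step))]
    return recon


image = [10, 12, 14, 16, 40, 42, 44, 46]
step = 4
decoded = closed_loop_decode(image, levels=2, step=step)
worst = max(abs(a - b) for a, b in zip(image, decoded))
print(decoded, worst)
```

In this sketch the final error never exceeds half the quantizer step, since the only discrepancy remaining at band 0 is the quantization error of the last difference term.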
The filtering and predicting functions employed by Burt and Adelson and in many other pyramidal decomposition schemes are linear functions. If a function Φ is applied to an image I (an image is here conceived as a map assigning a numerical color value to each pixel in an array of pixels), we refer to the resulting image as Φ(I). The function Φ is linear if two conditions are satisfied: 1) for any two images A and B, Φ(A+B)=Φ(A)+Φ(B), and 2) for any image A and any constant number c, Φ(cA)=cΦ(A). A function that fails to satisfy either of these two conditions is called nonlinear. A number of proposed pyramidal decomposition schemes employing nonlinear filtering and predicting functions promise more efficient compression than linear pyramidal schemes. For example, in “A study of pyramidal techniques for image representation and compression,” Journal of Visual Communication and Image Representation, 5 (1994), pp. 190–203, Xuan Kong and John Goutsias compare several linear and nonlinear pyramid decomposition schemes and draw experimental conclusions regarding which filtering and predicting functions produce the best results. However, the use of nonlinear filters can complicate the propagation of quantization errors throughout the process of reconstructing a compressed image, so special attention must be paid to the allocation of bits among the various bands. In particular, because of the effects of nonlinear functions on quantization error, the closed-loop coding process described above is often employed. But determining how to allocate bits for such a closed-loop process is a nontrivial problem.
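The additivity condition Φ(A+B)=Φ(A)+Φ(B) can be checked numerically on sample inputs. The sketch below uses a three-tap weighted-average filter as an example of a linear function and a three-tap median filter as an example of a nonlinear one; both filters and all names are hypothetical choices made for illustration. A single passing check does not prove linearity, but a single failing check does establish nonlinearity.

```python
from statistics import median


def smooth3(signal):
    """Three-tap weighted average with weights (1, 2, 1) / 4."""
    return [(signal[k - 1] + 2 * signal[k] + signal[k + 1]) / 4
            for k in range(1, len(signal) - 1)]


def median3(signal):
    """Three-tap median filter."""
    return [median(signal[k - 1:k + 2]) for k in range(1, len(signal) - 1)]


def additive_on(phi, a, b):
    """Check phi(a + b) == phi(a) + phi(b) for one pair of inputs."""
    summed = [x + y for x, y in zip(a, b)]
    return phi(summed) == [x + y for x, y in zip(phi(a), phi(b))]


a = [0, 4, 0, 0]
b = [0, 0, 4, 0]
print(additive_on(smooth3, a, b))
print(additive_on(median3, a, b))
```

Here the median filter discards each isolated peak when the signals are filtered separately but preserves the combined plateau when they are summed first, so the additivity condition fails.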
K. Metin Uz, Jerome M. Shapiro, and Martin Czigler propose a method for allocating bits in a closed-loop pyramidal coding scheme in their paper “Optimal bit allocation in the presence of quantizer feedback,” Proc. of ICASSP 1993, Vol. 5, pp. 385–388. In this method, the rate-distortion curves for each band are assumed to be of a simplified parametric form, and modeling is required to estimate the values of the parameters for these curves. Based on these assumptions, the authors use the method of Lagrange multipliers to calculate a closed form optimal solution that depends on the chosen parameters and a Lagrange multiplier (which can be approximated using iterative techniques).
In the unpublished manuscript “Alternative formulations for bit allocation with dependent quantization,” authors Pankaj Batra, Alexandros Eliftheriadis, and Jay Sethuraman provide an integer-programming formulation of the dependent quantization bit allocation problem. They use linear and Lagrange relaxations with randomized rounding to find approximate solutions. Since their formulation only takes account of first-order dependencies (e.g. between adjacent bands) and because it is aimed at temporal rather than spatial dependencies, it does not provide an accurate solution for a multiple-band pyramidal coding problem.
In “Bit allocation methods for closed-loop coding of oversampled pyramid decompositions,” Proceedings of the IEEE International Conference On Image Processing, 26–29 Oct. 1997, Santa Barbara, Calif., authors Uwe Horn, Thomas Wiegand, and Bernd Girod derive an optimal bit allocation model for closed-loop pyramidal coding under certain assumptions. They assume a Gaussian source to derive equations relating distortion to bit rate and use the method of Lagrange multipliers to find an analytical solution to the bit allocation problem. They note that this optimal solution actually results in large distortions at lower-resolution bands of the pyramid, so they propose using the optimal bit allocations for the simpler open-loop coding within the closed-loop scheme to achieve less distortion across all bands.
In “Bit allocation for dependent quantization with applications to multiresolution and MPEG video coders,” IEEE Transactions on Image Processing, Vol. 3, No. 5, September 1994, authors Kannan Ramchandran, Antonio Ortega, and Martin Vetterli discuss an operational approach to allocating bits within compression schemes with quantizer feedback using a discrete set of quantizer choices. This approach requires no assumptions about the input data or quantizer characteristics, unlike the Horn/Wiegand/Girod calculation. The optimal bit allocation/distortion balance is found by minimizing a Lagrangian cost dependent on a Lagrange multiplier parameter, then iterating by adjusting that multiplier until the solution found satisfies the original bit constraint. This iterative approach approximates the actual optimal allocation for closed-loop pyramidal coding, but its computational complexity increases exponentially with the number of bands in the pyramid.
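The general shape of such an operational Lagrangian search may be sketched as follows. This is not the algorithm of the cited paper: it is a simplified illustration that enumerates every combination of quantizer steps exhaustively, uses a hypothetical bit-cost estimate in place of a real entropy coder, and adjusts the multiplier by bisection. The filter, the prediction function, and all names are assumptions. If even the minimum-rate combination exceeds the budget, the sketch returns that over-budget combination.

```python
import math
from itertools import product


def reduce_band(band):
    """Assumed filter: average adjacent pairs and downsample by two."""
    return [(band[k] + band[k + 1]) / 2 for k in range(0, len(band), 2)]


def predict_band(coarse):
    """Assumed prediction function: repeat each coarse sample twice."""
    return [v for v in coarse for _ in range(2)]


def rate_proxy(indices):
    """Hypothetical bit cost: one bit for zero, more for larger magnitudes."""
    return sum(1 + 2 * abs(q).bit_length() for q in indices)


def closed_loop_cost(image, levels, steps):
    """(rate, distortion) for one step size per band, top band first."""
    bands = [image]
    for _ in range(levels):
        bands.append(reduce_band(bands[-1]))
    rate, recon = 0, None
    for band, step in zip(reversed(bands), steps):
        predicted = predict_band(recon) if recon is not None else [0] * len(band)
        idx = [math.floor((b - p) / step + 0.5) for b, p in zip(band, predicted)]
        rate += rate_proxy(idx)
        recon = [p + step * q for p, q in zip(predicted, idx)]
    distortion = sum((a - b) ** 2 for a, b in zip(image, recon))
    return rate, distortion


def best_for_lambda(image, levels, choices, lam):
    """Minimize D + lam * R over all step combinations (exponential in bands)."""
    best = None
    for steps in product(choices, repeat=levels + 1):
        rate, dist = closed_loop_cost(image, levels, steps)
        cost = dist + lam * rate
        if best is None or cost < best[0]:
            best = (cost, steps, rate, dist)
    return best[1:]


def allocate(image, levels, choices, budget, iterations=30):
    """Bisect on the multiplier, keeping the best solution within the budget."""
    lo, hi = 0.0, 1e6
    result = best_for_lambda(image, levels, choices, hi)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        candidate = best_for_lambda(image, levels, choices, mid)
        if candidate[1] <= budget:
            result, hi = candidate, mid   # feasible: try a smaller multiplier
        else:
            lo = mid                      # over budget: penalize rate more
    return result


image = [10, 12, 14, 16, 40, 42, 44, 46]
steps, rate, dist = allocate(image, levels=2, choices=(1, 4, 16), budget=40)
print(steps, rate, dist)
```

The call to `product` visits every combination of step sizes, so the number of closed-loop evaluations grows as the number of quantizer choices raised to the number of bands, which reflects the exponential complexity noted above.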