This invention relates to a method and apparatus for compressing a video image for transmission to a receiver and/or decompressing the image at the receiver. More particularly, the invention is directed to an apparatus and method for performing data compression on a video image using weighted wavelet hierarchical vector quantization (WWHVQ). WWHVQ advantageously utilizes certain aspects of hierarchical vector quantization (HVQ) and discrete wavelet transform (DWT), a subband transform.
A vector quantizer (VQ) is a quantizer that maps k-dimensional input vectors into one of a finite set of k-dimensional reproduction vectors, or codewords. An analog-to-digital converter, or scalar quantizer, is a special case in which the quantizer maps each real number to one of a finite set of output levels. Since the logarithm (base 2) of the number of codewords is the number of bits needed to specify the codeword, the logarithm of the number of codewords, divided by the vector dimension, is the rate of the quantizer in bits per symbol.
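The rate definition above can be checked with a short calculation. This is a minimal sketch; the function name `vq_rate_bits_per_symbol` is hypothetical and chosen only for illustration.

```python
import math


def vq_rate_bits_per_symbol(num_codewords: int, dimension: int) -> float:
    """Rate of a VQ: log2(number of codewords) divided by the vector dimension."""
    return math.log2(num_codewords) / dimension


# M = 256 codewords over k = 8 dimensional vectors: 8 bits per vector.
rate_k8 = vq_rate_bits_per_symbol(256, 8)

# The same codebook size over k = 2 dimensional vectors.
rate_k2 = vq_rate_bits_per_symbol(256, 2)
```

With the same number of codewords, a larger vector dimension gives a lower rate per symbol.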
A VQ can be divided into two parts: an encoder and a decoder. The encoder maps the input vector into a binary code representing the index of the selected reproduction vector, and the decoder maps the binary code into the selected reproduction vector.
A major advantage of ordinary VQ over other types of quantizers (e.g., transform coders) is that the decoding can be done by a simple table lookup. A major disadvantage of ordinary VQ with respect to other types of quantizers is that the encoding is computationally very complex. An optimal encoder performs a full search through the entire set of reproduction vectors looking for the reproduction vector that is closest (with respect to a given distortion measure) to each input vector.
For example, if the distortion measure is squared error, then the encoder computes the quantity $\|x-y\|^2$ for each input vector $x$ and reproduction vector $y$. This results in essentially M multiply/add operations per input symbol, where M is the number of codewords. A number of suboptimal, but computationally simpler, vector quantizer encoders have been studied in the literature. For a survey, see the book by Gersho and Gray, Vector Quantization and Signal Compression, Kluwer, 1992.
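The full-search encoder and table-lookup decoder described above might be sketched as follows. The four-entry codebook and the function names are assumptions made for illustration; they do not come from the source.

```python
def squared_error(x, y):
    """Squared-error distortion between two vectors of equal length."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def full_search_encode(x, codebook):
    """Return the index of the codeword closest to x (exhaustive search)."""
    return min(range(len(codebook)), key=lambda j: squared_error(x, codebook[j]))


def decode(index, codebook):
    """Decoding is a single table lookup."""
    return codebook[index]


# Hypothetical 2-dimensional codebook with M = 4 codewords.
codebook = [(0, 0), (0, 255), (255, 0), (255, 255)]
index = full_search_encode((10, 240), codebook)
reproduction = decode(index, codebook)
```

The encoder visits every codeword for every input vector, which is the cost that the hierarchical approach discussed next is meant to avoid.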
Hierarchical vector quantization (HVQ) is VQ that can encode using essentially one table lookup per input symbol. (Decoding is also done by table lookup.) To the knowledge of the inventors, HVQ has heretofore not appeared in the literature outside of Chapter 3 of the Ph.D. thesis of P. Chang, Predictive, Hierarchical, and Transform Vector Quantization for Speech Coding, Stanford University, May 1986, where it was used for speech. Other methods named "hierarchical vector quantization" have appeared in the literature, but they are unrelated to the HVQ considered in connection with the present invention.
The basic idea behind HVQ is the following. The input symbols are finely quantized to p bits of precision. For image data, p=8 is typical. In principle it is possible to encode a k-dimensional vector using a single lookup into a table with a kp-bit address, but such a table would have $2^{kp}$ entries, which is clearly infeasible if k and p are even moderately large. HVQ performs the table lookups hierarchically. For example, to encode a k=8 dimensional vector (whose components are each finely quantized to p=8 bits of precision) to 8 bits representing one of M=256 possible reproductions, the hierarchical structure shown in FIG. 1a can be used, in which Tables 1, 2, and 3 each have 16-bit inputs and 8-bit outputs (i.e., they are each 64 KByte tables).
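The table sizes quoted above follow from simple arithmetic, sketched here with hypothetical helper names.

```python
def direct_table_entries(k: int, p: int) -> int:
    """Entries in a single table addressed by all k*p input bits."""
    return 2 ** (k * p)


def hvq_table_bytes(input_bits: int = 16, output_bits: int = 8) -> int:
    """Bytes in one HVQ table with the given address and output widths."""
    return (2 ** input_bits) * output_bits // 8


direct = direct_table_entries(8, 8)   # 2**64 entries for k = 8, p = 8
per_table = hvq_table_bytes()         # 65536 bytes, i.e. 64 KByte
total = 3 * per_table                 # Tables 1, 2, and 3 together
```

Three 64 KByte tables replace a single table with $2^{64}$ entries.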
A signal flow diagram for such an encoder is shown in FIG. 1b. In the HVQ of FIG. 1b, the tables T at each stage of the encoder along with the delays Z are illustrated. Each level in the hierarchy doubles the vector dimension of the quantizer, and therefore reduces the bit rate by a factor of 2. By similar reasoning, the ith level in the hierarchy performs one lookup per $2^i$ samples, and therefore the total number of lookups per sample is at most $1/2 + 1/4 + 1/8 + \cdots = 1$, regardless of the number of levels. Of course, it is possible to vary these calculations by adjusting the dimensions of the various tables.
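The lookup count and the rate halving per level can be verified with exact fractions. This is a small sketch; the function names are hypothetical, and an 8-bit index at every level is assumed, as in the example of FIG. 1a.

```python
from fractions import Fraction


def lookups_per_sample(levels: int) -> Fraction:
    """Level i performs one lookup per 2**i samples; sum over all levels."""
    return sum(Fraction(1, 2 ** i) for i in range(1, levels + 1))


def bits_per_sample_after(level: int, index_bits: int = 8) -> Fraction:
    """After level i, one index of index_bits bits represents 2**i samples."""
    return Fraction(index_bits, 2 ** level)


three_levels = lookups_per_sample(3)      # 1/2 + 1/4 + 1/8
rate_after_3 = bits_per_sample_after(3)   # 8 bits for 8 samples
```

For any finite number of levels the sum stays below 1, which is why the encoder needs at most one lookup per input sample.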
The contents of the HVQ tables can be determined in a variety of ways. A straightforward way is the following. With reference to FIG. 1a, Table 1 is simply a table-lookup version of an optimal 2-dimensional VQ. That is, an optimal 2-dimensional full search VQ with M=256 codewords is designed by standard means (e.g., the generalized Lloyd algorithm discussed by Gersho and Gray), and Table 1 is filled so that it assigns to each of its $2^{16}$ possible 2-dimensional input vectors the 8-bit index of the nearest codeword.
Table 2 is just slightly more complicated. First, an optimal 4-dimensional full search VQ with M=256 codewords is designed by standard means. Then Table 2 is filled so that it assigns to each of its $2^{16}$ possible 4-dimensional input vectors (i.e., the cross product of all possible 2-dimensional output vectors from the first stage) the 8-bit index of its nearest codeword. The tables for stages 3 and up are designed similarly. Note that the distortion measure is completely arbitrary.
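The table-filling procedure for the first two stages can be sketched on a toy scale. The sketch assumes p = 2 bits per sample and M = 4 codewords per stage so that each table has only 16 entries. The codebooks `CB2` and `CB4` are hand-picked for illustration rather than designed by the generalized Lloyd algorithm, and all names are hypothetical.

```python
from itertools import product


def nearest(x, codebook):
    """Index of the codeword closest to x under squared error."""
    return min(range(len(codebook)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(x, codebook[j])))


def build_table1(codebook2d, levels):
    """Map every pair of input samples to the index of its nearest 2-D codeword."""
    return {(a, b): nearest((a, b), codebook2d)
            for a, b in product(range(levels), repeat=2)}


def build_table2(codebook2d, codebook4d):
    """Map every pair of stage-1 indices to the index of the nearest 4-D codeword."""
    n = len(codebook2d)
    return {(i, j): nearest(codebook2d[i] + codebook2d[j], codebook4d)
            for i, j in product(range(n), repeat=2)}


def hvq_encode4(samples, table1, table2):
    """Encode four samples with two stage-1 lookups and one stage-2 lookup."""
    i = table1[(samples[0], samples[1])]
    j = table1[(samples[2], samples[3])]
    return table2[(i, j)]


# Toy codebooks (assumed, not optimized).
CB2 = [(0, 0), (0, 3), (3, 0), (3, 3)]
CB4 = [(0, 0, 0, 0), (3, 3, 3, 3), (0, 0, 3, 3), (3, 3, 0, 0)]

T1 = build_table1(CB2, levels=4)   # 2-bit samples take values 0..3
T2 = build_table2(CB2, CB4)
code = hvq_encode4((0, 1, 3, 2), T1, T2)
reproduction = CB4[code]           # decoding is one lookup
```

All searching happens when the tables are built; encoding itself consists only of dictionary lookups.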
A discrete wavelet transformation (DWT), or more generally, a tree-structured subband decomposition, is a method for hierarchical signal transformation. Little or no information is lost in such a transformation. Each stage of a DWT involves filtering a signal into a low-pass component and a high-pass component, each of which is critically sampled (i.e., down sampled by a factor of two). A more general tree-structured subband decomposition may filter a signal into more than two bands per stage, and may or may not be critically sampled. Here we consider only the DWT, but those skilled in the art can easily extend the relevant notions to the more general case.
With reference to FIG. 2a, let $X = (x(0), x(1), \ldots, x(N-1))$ be a 1-dimensional input signal with finite length $N$. As shown by the tree structure A, the first stage of a DWT decomposes the input signal $X_{L0} = X$ into the low-pass and high-pass signals $X_{L1} = (x_{L1}(0), x_{L1}(1), \ldots, x_{L1}(N/2-1))$ and $X_{H1} = (x_{H1}(0), x_{H1}(1), \ldots, x_{H1}(N/2-1))$, each of length $N/2$. The second stage decomposes only the low-pass signal $X_{L1}$ from the first stage into the low-pass and high-pass signals $X_{L2} = (x_{L2}(0), x_{L2}(1), \ldots, x_{L2}(N/4-1))$ and $X_{H2} = (x_{H2}(0), x_{H2}(1), \ldots, x_{H2}(N/4-1))$, each of length $N/4$. Similarly, the third stage decomposes only the low-pass signal $X_{L2}$ from the second stage into low-pass and high-pass signals $X_{L3}$ and $X_{H3}$ of lengths $N/8$, and so on. It is also possible for successive stages to decompose some of the high-pass signals in addition to the low-pass signals. The set of signals at the leaves of the resulting complete or partial tree is precisely the transform of the input signal at the root. Thus a DWT can be regarded as a hierarchically nested set of transforms.
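The tree of signals described above can be traced with a small sketch. It assumes a length-2 averaging/differencing (Haar-type) filter pair, which the text does not prescribe, and decomposes only the low-pass branch at each stage. The function names are hypothetical.

```python
def haar_stage(x):
    """Split x into low-pass (average) and high-pass (difference) halves."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def dwt(x, stages):
    """Return the leaves of the tree: [X_H1, X_H2, ..., X_Hm, X_Lm]."""
    leaves = []
    low = list(x)
    for _ in range(stages):
        low, high = haar_stage(low)
        leaves.append(high)
    leaves.append(low)
    return leaves


x = [4, 2, 6, 8, 1, 3, 5, 7]          # N = 8
leaves = dwt(x, stages=3)
lengths = [len(band) for band in leaves]
```

The leaf lengths N/2, N/4, N/8, N/8 add up to N, so the transform has as many coefficients as the input has samples.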
To specify the transform precisely, it is necessary to specify the filters used at each stage. We consider only finite impulse response (FIR) filters, i.e., wavelets with finite support. $L$ is the length of the filters (i.e., number of taps), and the low-pass filter (the scaling function) and the high-pass filter (the difference function, or wavelet) are designated by their impulse responses, $l(0), l(1), \ldots, l(L-1)$ and $h(0), h(1), \ldots, h(L-1)$, respectively. Then at the output of the $m$th stage,
$$x_{L,m}(i) = l(0)\,x_{L,m-1}(2i) + l(1)\,x_{L,m-1}(2i+1) + \cdots + l(L-1)\,x_{L,m-1}(2i+L-1)$$
$$x_{H,m}(i) = h(0)\,x_{L,m-1}(2i) + h(1)\,x_{L,m-1}(2i+1) + \cdots + h(L-1)\,x_{L,m-1}(2i+L-1)$$
for $i = 0, 1, \ldots, N/2^m$. Boundary effects are handled in some expedient way, such as setting signals to zero outside their windows of definition. The filters may be the same from node to node, or they may be different.
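One analysis stage, as given by the two equations above, might be implemented as follows. The sketch treats samples outside the signal as zero, produces N/2 outputs per band to match the stated signal lengths, and uses an unnormalized length-2 filter pair as an assumed example. The function name is hypothetical.

```python
def analysis_stage(x, l, h):
    """Filter x with taps l and h, keeping every second output sample."""
    taps = len(l)
    half = len(x) // 2

    def sample(n):
        # Zero outside the window of definition.
        return x[n] if 0 <= n < len(x) else 0.0

    low = [sum(l[k] * sample(2 * i + k) for k in range(taps)) for i in range(half)]
    high = [sum(h[k] * sample(2 * i + k) for k in range(taps)) for i in range(half)]
    return low, high


# Assumed example filters with L = 2 taps.
l = (0.5, 0.5)
h = (0.5, -0.5)
low, high = analysis_stage([4.0, 2.0, 6.0, 8.0], l, h)
```

Longer filters reach past the end of the signal for the last few outputs, which is where the zero-extension rule takes effect.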
The inverse transform is performed by different low-pass and high-pass filters, called reconstruction filters, applied in reverse order. Let $l'(0), l'(1), \ldots, l'(L-1)$ and $h'(0), h'(1), \ldots, h'(L-1)$ be the impulse responses of the inverse filters. Then $X_{L,m-1}$ can be reconstructed from $X_{L,m}$ and $X_{H,m}$ as:
$$x_{L,m-1}(2i) = l'(0)\,x_{L,m}(i) + l'(2)\,x_{L,m}(i+1) + h'(0)\,x_{H,m}(i) + h'(2)\,x_{H,m}(i+1)$$
$$x_{L,m-1}(2i+1) = l'(1)\,x_{L,m}(i+1) + l'(3)\,x_{L,m}(i+2) + h'(1)\,x_{H,m}(i+1) + h'(3)\,x_{H,m}(i+2)$$
for $i = 0, 1, \ldots, N/2^m$. That is, the low-pass and high-pass bands are up sampled (interpolated) by a factor of two, filtered by their respective reconstruction filters, and added.
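The "up sample, filter, add" description can be sketched directly. The sketch assumes a causal convolution convention and a length-2 reconstruction pair that inverts an averaging/differencing analysis stage. These filters and the indexing convention are illustrative assumptions and are not the four-tap form written out above.

```python
def upsample(x):
    """Insert a zero after every sample (interpolation by a factor of two)."""
    out = []
    for v in x:
        out.extend([v, 0.0])
    return out


def fir(x, taps):
    """Causal FIR filter: y[n] = sum over k of taps[k] * x[n - k]."""
    return [sum(taps[k] * x[n - k] for k in range(len(taps)) if 0 <= n - k < len(x))
            for n in range(len(x))]


def synthesis_stage(low, high, l_rec, h_rec):
    """Up sample both bands, filter each with its reconstruction filter, and add."""
    a = fir(upsample(low), l_rec)
    b = fir(upsample(high), h_rec)
    return [p + q for p, q in zip(a, b)]


# Bands produced from [4, 2, 6, 8] by averages and differences of sample pairs.
low, high = [3.0, 7.0], [1.0, -1.0]
restored = synthesis_stage(low, high, l_rec=(1.0, 1.0), h_rec=(1.0, -1.0))
```

In this example the two bands recombine to the original four samples, illustrating that the analysis stage lost no information.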
Two-dimensional signals are handled similarly, but with two-dimensional filters. Indeed, if the filters are separable, then the filtering can be accomplished by first filtering in one dimension (say horizontally along rows), then filtering in the other dimension (vertically along columns). This results in the hierarchical decompositions illustrated in FIGS. 2B, showing tree structure B, and 2C, in which the odd stages operate on rows, while the even stages operate on columns. If the input signal $X_{L0}$ is an $N \times N$ image, then $X_{L1}$ and $X_{H1}$ are $N \times (N/2)$ images, $X_{LL2}$, $X_{LH2}$, $X_{HL2}$, and $X_{HH2}$ are $(N/2) \times (N/2)$ images, and so forth.
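The separable row-then-column scheme can be sketched for the first two stages. The averaging/differencing filter pair and the assignment of band names (first letter from the row stage, second from the column stage) are assumptions made for illustration, and the function names are hypothetical.

```python
def haar_stage(x):
    """Split a 1-D signal into low-pass and high-pass halves."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    high = [(x[2 * i] - x[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def transpose(m):
    return [list(r) for r in zip(*m)]


def filter_rows(image):
    """Odd stage: filter along each row, halving the number of columns."""
    lows, highs = [], []
    for row in image:
        lo, hi = haar_stage(row)
        lows.append(lo)
        highs.append(hi)
    return lows, highs


def filter_columns(image):
    """Even stage: filter along each column, halving the number of rows."""
    lo, hi = filter_rows(transpose(image))
    return transpose(lo), transpose(hi)


def two_stage_2d(image):
    l1, h1 = filter_rows(image)        # each N x (N/2)
    ll2, lh2 = filter_columns(l1)      # each (N/2) x (N/2)
    hl2, hh2 = filter_columns(h1)
    return ll2, lh2, hl2, hh2


image = [[8.0] * 4 for _ in range(4)]  # constant 4 x 4 test image
ll2, lh2, hl2, hh2 = two_stage_2d(image)
```

For a constant image all of the energy lands in the low-low band and the three other bands are zero.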
Notwithstanding what is known about HVQ and DWT, a wide variety of video image compression methods and apparatuses have been implemented. One existing method that addresses transcoding problems is the algorithm of J. Shapiro, "Embedded Image Coding using Zerotrees of Wavelet Coefficients," IEEE Transactions on Signal Processing, December 1993, in which transcoding can be done simply by stripping off prefixes of codes in the bit stream. However, this algorithm trades simple transcoding for computationally complex encoding and decoding.
Other known methods lack certain practical and convenient features. For example, these other known video compression methods do not allow a user to access the transmitted image at different quality levels or resolutions during an interactive multicast over multiple rate channels in a simplified system wherein encoding and decoding are accomplished solely by the performance of table lookups.
More particularly, using these other non-embedded encoding video compression algorithms, when a multicast (or simulcast, as applied in the television industry) of a video stream is accomplished over a network, either every receiver of the video stream is restricted to a certain quality (and hence bandwidth) level at the sender, or bandwidth (and CPU cycles or compression hardware) is unnecessarily used by multicasting a number of streams at different bit rates.
In video conferencing (multicast) over a heterogeneous network comprising, for example, ATM, the Internet, ISDN and wireless, some form of transcoding is typically accomplished at the "gateway" between sender and receiver when a basic rate mismatch exists between them. One solution to the problem is for the "gateway"/receiver to decompress the video stream and recompress and scale it according to internal capabilities. This solution, however, is not only expensive but also increases latency by a considerable amount. The transcoding is preferably done in an online fashion (with minimal latency/buffering) due to the interactive nature of the application and to reduce hardware/software costs.
From a user's perspective, the problem is as follows: (a) Sender(i) wants to send a video stream at K bits/sec to M receivers. (b) Receiver(j) wants to receive Sender(i)'s video stream at L bits/sec (L < K). (c) But, the image dimensions that Receiver(j) desires or is capable of processing are smaller than the default dimensions that Sender(i) encoded.
It is desirable that any system and/or method to address these problems in interactive video advantageously incorporate (1) inexpensive transcoding from a higher to a lower bit rate, preferably by only operating on a compressed stream, (2) simple bit rate control, (3) simple scalability of dimension at the destination, (4) symmetry (resulting in very inexpensive decode and encode), and (5) a prioritized compressed stream in addition to acceptable rate-distortion performance. None of the current standards (Motion JPEG, MPEG and H.261) possess all of these characteristics. In particular, no current standard has a facility to transcode from a higher to a lower bit rate efficiently. In addition, all are computationally expensive.
The present invention overcomes the aforenoted and other problems and incorporates the desired characteristics noted above. It is particularly directed to the art of video data compression, and will thus be described with specific reference thereto. It is appreciated, however, that the invention will have utility in other fields and applications.