The present invention relates to an apparatus capable of compressing motion picture and still images. It improves coding efficiency (compressibility), increases coding/decoding speed, and improves the noise resilience of the resulting compressed video streams. When properly implemented in a lossy wavelet-based video or still-image codec, it reduces the size of the bit-stream required for transmitting and/or storing video and still-image material at a given quality, or increases the quality of the reconstructed image at a given bit budget. The part of the invention designed for intra-frame compression is equally applicable to coding individual frames of a video sequence and to coding still images.

The proliferation of digital television and the transmission of multimedia content over the Internet have created the need for better video and still-image compression methods. Older video compression standards such as MPEG-1, MPEG-2, and H.261 are being replaced by newer standards such as H.263+, MPEG-4, and H.264 (also known as AVC, or MPEG-4 Part 10), primarily to provide better picture quality at a smaller bit budget. The arrival of high-definition television (HDTV) broadcast and the ever-increasing demand for higher resolution, greater bit-depth, and multi-spectral digital pictures are creating the need for compression methods whose performance improves with increasing picture size.
Typically, well-designed pyramid-structured codecs have better intra-frame coding efficiencies than block-based video codecs such as MPEG, H.261, and H.264 (or block-based still-image standards such as JPEG) for higher resolution material.
The present invention improves the intra-frame coding efficiency of a wavelet- or similar filter-based codec, and provides methods for improving inter-frame coding efficiency, speed, and resilience to noise.
In the present invention, which employs a zerotree in a transform pyramid, velocity information is embedded along the nodes of the zerotree structure, allowing for a sparser representation of motion fields and the description of affine motions (e.g. rotations, changes of scale, morphing, etc.) not permitted by block-motion algorithms. Consequently, the encoder encodes motion information only about changes in the movement of edges present in a multiresolution structure. Edges are one-dimensional, sparse structures, and describing the motion of edges is an efficient approach compared with describing the motion of each pixel. Because affine motions such as rotation, skewing, morphing, and zooming may be handled by edge-motion descriptions, motion can be more efficiently encoded in this manner than by block motions alone. This is important, since two-dimensional scene motion often contains zoom and rotation, and the 2-D projection of 3-D real-world motion often contains morphing components (e.g. a face turning away from the camera). The velocity information at finer scales of the zerotree refines the coarser-scale velocity information for the corresponding finer-scale tree components. If only a low-resolution version of a video sequence needs to be decoded, the unnecessary higher-resolution velocity refinement information is discarded, saving computational resources.
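The hierarchical refinement of velocity information can be sketched as follows. This is an illustrative Python sketch only, not the invention's actual data structures: the class and function names, and the tree values, are assumptions introduced here for clarity. Each zerotree node stores only a small delta relative to its parent's velocity, and a low-resolution decode simply stops descending before the finest scales.

```python
# Illustrative sketch only: names and values are assumptions, not the
# invention's actual structures. Each zerotree node carries a small
# velocity delta relative to its parent; a low-resolution decode stops
# descending before the finest scales, discarding refinement work.

class VelocityNode:
    def __init__(self, delta, children=None):
        self.delta = delta                # (dx, dy) refinement vs. parent
        self.children = children or []    # finer-scale nodes

def decode_velocities(node, parent=(0.0, 0.0), max_depth=None, depth=0):
    """Accumulate parent velocity plus this node's delta, optionally
    stopping at a coarser scale to save computation."""
    vx = parent[0] + node.delta[0]
    vy = parent[1] + node.delta[1]
    out = [(depth, (vx, vy))]
    if max_depth is None or depth < max_depth:
        for child in node.children:
            out.extend(decode_velocities(child, (vx, vy), max_depth, depth + 1))
    return out

# Coarse motion of 2.0 px rightward, refined by small deltas at finer scales.
tree = VelocityNode((2.0, 0.0), [
    VelocityNode((0.25, -0.125)),
    VelocityNode((-0.5, 0.0), [VelocityNode((0.125, 0.25))]),
])

full = decode_velocities(tree)                  # all scales
coarse = decode_velocities(tree, max_depth=0)   # low-resolution decode
```

Note that the low-resolution decode never visits the finer nodes at all, which is the computational saving described above.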
During transmission in a noisy environment (e.g. “over-the-air” terrestrial broadcast, satellite broadcast, etc.), noise often corrupts compressed video streams. Classical video compression methods encode certain frames as still images. These frames are decoded independently of any other frames and are called key-frames (or “I-frames”, due to their independence). In classical video compression, in order to provide high coding efficiency, the majority of frames in a video sequence are encoded as differences with respect to one or more reference frames. In MPEG parlance, these frames are referred to as “P” or “B” frames, for “Predicted” or “Bi-directionally predicted” frames. Without adequate error protection, if a short noise burst corrupts a key-frame, it also corrupts the subsequent P and B frames that depend on that key-frame as the reference from which they are predicted for reconstruction. In classical MPEG-type video transmission systems, a certain degree of noise resilience is achieved by combining the compressed video stream with forward error correction in a context-independent manner (i.e. each bit in the compressed video stream receives an equal amount of error protection regardless of the importance of the visual information carried by that bit). In this mode of error protection, a most-significant bit (MSB) in the DC term (i.e. representing the average luminance or color) of an entire DCT block carries no more protection than a least-significant bit (LSB) in the highest frequency component.
Because errors in the average luminance or color of entire blocks can result in a half-second or more of seriously damaged video, while errors in a refinement value for a small group of pixels may go unnoticed, forward error correction applied in such a context-independent manner cannot provide optimal visual quality at a given error correction bit budget: worst-case protection will apply too many error correction bits to the high-frequency LSBs, and best-case protection will permit large-scale, highly visible damage. According to the present invention, the still-image update information is spread among several frames. Thus, the key-frames are replaced by frames containing key-regions (with the exception of scene changes, which can be addressed differently), that is, regions in a frame that are encoded independently of other frames. Under this encoding scheme, a short-lived noise event will corrupt only a region of a scene, without propagating through its contemporary frame. Furthermore, if the forward error correction bit budget is distributed in a hierarchical manner, i.e. coarser scales of the transform pyramid receive more error correction bits than finer scales, uncorrected erroneous data will more likely occur in the finer scales, resulting in only a small blur in the reconstructed scene. This method results in more efficient use of the available error correction bit budget.
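The hierarchical distribution of the error correction budget can be sketched as follows. The weighting scheme, function name, and numbers are assumptions introduced for illustration only; any monotonically decreasing allocation from coarse to fine scales would serve the same purpose.

```python
# Illustrative sketch only: the geometric weighting and all numbers are
# assumptions introduced for clarity. A fixed parity-bit budget is
# divided across pyramid scales so that coarser scales (scale 0 being
# the coarsest) receive more error correction bits than finer ones.

def allocate_fec(total_parity_bits, num_scales, decay=0.5):
    """Give each finer scale `decay` times the weight of the scale
    above it, then normalize the weights to the total budget."""
    weights = [decay ** s for s in range(num_scales)]
    total = sum(weights)
    return [round(total_parity_bits * w / total) for w in weights]

allocation = allocate_fec(1000, 4)   # e.g. [533, 267, 133, 67]
```

An uncorrected error is then most likely to strike a lightly protected fine scale, where it appears as a small local blur rather than a corrupted coarse-scale average.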
Motion prediction is a crucial part of modern video compression algorithms. A classical encoder estimates motion vectors for the current frame using previous and future frames. The decoder cannot accomplish the same operation, since neither the current frame (the frame for which motion estimation is being performed) nor future frames are available at the decoder at that time.
According to the present invention, the encoder performs a different type of motion estimation: one in which it assumes that the current frame is not available. It then estimates motion vectors (displacement) based on the frames that are currently available to the decoder. The decoder performs the identical operation. Since the encoder and the decoder perform identical operations on the same data in a synchronous fashion, they arrive at the same motion vector estimates, and these estimates need not be transmitted to the decoder. This synchronous type of motion estimation improves compression by not sending the motion vector data, and can provide an extra measure of noise resilience, since in the absence of transmission, motion vectors cannot be corrupted by transmission noise.

Vector quantization (VQ) can provide a significant coding gain over scalar quantization by exploiting existing correlations among the elements of a data set (e.g. the wavelet transform of an image). Typically, in vector quantization, an image (or its transform) is divided into square (or rectangular) block vectors. These blocks then form the input vectors to the vector quantizer. The output vectors of the vector quantizer usually form a sparse codebook containing the centroids of the input data set found by the vector quantizer, or, alternatively, a pre-selected, generally optimal codebook. If a pre-selected codebook is used, it can be stored at the decoder in advance; otherwise, the codebook of centroids, calculated in real time at the encoder, is transmitted to the decoder instead of all of the original vectors. In the reconstructed image, the original vectors are replaced by their respective centroids.
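The synchronous motion estimation described earlier in this section can be illustrated with a minimal sketch. The function name, the sum-of-absolute-differences criterion, and the search range are assumptions introduced here; the point is only that both encoder and decoder run the identical routine on previously decoded frames that both sides already possess, so the resulting vector (which can then serve, e.g. by extrapolation, as the prediction for the current frame) is never transmitted.

```python
# Illustrative sketch only: function name, SAD criterion, and search
# range are assumptions. Both encoder and decoder run this identical
# routine on previously decoded frames that both already have, so the
# estimated vector never needs to be transmitted.

def estimate_displacement(prev, curr, search=2):
    """Exhaustively search for the integer (dx, dy) shift that best
    aligns `prev` to `curr` under the sum of absolute differences."""
    h, w = len(curr), len(curr[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            sad = 0
            for y in range(h):
                for x in range(w):
                    py, px = y - dy, x - dx
                    if 0 <= py < h and 0 <= px < w:
                        sad += abs(curr[y][x] - prev[py][px])
                    else:
                        sad += 255   # penalize out-of-frame references
            if sad < best_sad:
                best_sad, best = sad, (dx, dy)
    return best

# A vertical feature moves one pixel to the right between two frames
# already available at both the encoder and the decoder.
prev_frame = [[0, 255, 0, 0] for _ in range(4)]
curr_frame = [[0, 0, 255, 0] for _ in range(4)]
vector = estimate_displacement(prev_frame, curr_frame)   # (1, 0)
```

Because the inputs are identical on both sides and the search is deterministic, the two sides necessarily agree on the result without any side information.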
In the present invention the decoder may perform additional operations on the received centroids. The procedures required to be performed on the vectors may either be known to the decoder or supplied to it by the encoder. The instruction regarding the type(s) of procedures to be performed on the vectors by the decoder can be supplied as pointers to procedures, and possibly their arguments, embedded in a zero-tree structure. This reuse of centroids is most effective if the vectors are defined as line-vectors (i.e. vectors that have spatial dimensions of one pixel along one axis by one or more pixels along the other axis, as opposed to block vectors). These line-vectors are taken from the un-subsampled (wavelet) transform to avoid the shift-variance introduced by decimation. An example of the advantages offered by the present invention is the encoding of an oblique homogeneous edge in an image. As the edge crosses square or rectangular regions, vectors representing those regions will be dissimilar and would require several centroids to accurately approximate the edge crossings at different locations. These block vectors are over-constrained with respect to the data they must compactly approximate. According to the present invention, one line-vector may be used to represent the coefficients along a homogeneous edge crossing at a particular location in a subband, and the same vector with accurate shift information (and/or a minor adjustment in one or more coefficients) may represent several of the subsequent edge crossings in that subband efficiently.
In order to reduce the execution time required for the decoding of line-vectors, a method is provided here to avoid unnecessary recalculation (convolution) of line-vectors during the reconstruction of an image. Line-vectors in a high-pass band typically undergo transformations similar to those of the corresponding line-vectors representing the same location in the image in the corresponding low-pass band, and for highly similar linear image data on subsequent lines in a particular band and orientation, the resulting line-vectors differ from one another exclusively by a shift requiring sub-pixel representation accuracy (and that shift may be zero). If the encoder determines that this type of relationship exists, it inserts a representative symbol encoding “re-use this line-vector, applying specified shifts” into the bit-stream. Upon encountering this symbol, the decoder performs the convolution only once, and then repeats the result on the subsequent lines of the image. (Here, “lines” of the image refers to the filter orientation applied to the image data for generating a particular subband, and can be of either orientation.) By avoiding multiple convolution operations, execution time is significantly reduced.
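The re-use mechanism can be sketched as follows. The function name and the choice of linear interpolation as the shift operator are assumptions introduced for illustration, not the invention's exact filter. A line-vector reconstructed once by convolution is cached; subsequent lines flagged with the re-use symbol are produced by shifting the cached result, possibly by a sub-pixel amount, instead of repeating the convolution.

```python
import math

# Illustrative sketch only: the shift operator shown here (linear
# interpolation) is an assumption, not the invention's exact filter.
# A line-vector reconstructed once by convolution is cached; lines
# flagged "re-use with shift" are produced by shifting that cached
# result instead of convolving again.

def shift_line(line, shift):
    """Shift a 1-D line by a (possibly fractional) number of samples
    using linear interpolation; samples shifted in from outside are 0."""
    n = len(line)
    out = []
    for i in range(n):
        pos = i - shift
        lo = math.floor(pos)
        frac = pos - lo
        a = line[lo] if 0 <= lo < n else 0.0
        b = line[lo + 1] if 0 <= lo + 1 < n else 0.0
        out.append(a * (1.0 - frac) + b * frac)
    return out

cached = [0.0, 1.0, 0.0, 0.0]        # line-vector convolved only once
reused = shift_line(cached, 0.5)     # next line: half-sample shift
```

Each re-used line thus costs one interpolation pass over the cached vector rather than a full convolution, which is where the execution-time saving arises.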
If the data set is sufficiently de-correlated, vector quantization does not benefit from large vector sizes. In this case, vectors containing very few elements (and possibly just one element) are used. In the context of the wavelet transform of an image, this can occur at all scales of the transform if the input image already consists of de-correlated pixel content, or at higher (coarser) scales for most images due to the de-correlating properties of the wavelet transform. A common approach to quantizing this type of data is to compare coefficient values against a threshold. If the amplitude of a coefficient exceeds the threshold, it becomes significant. When representing edges, this technique often results in discontinuous representations of continuous edges, because a typical edge, which undergoes gradual changes in luminance and chrominance along its length, is represented in the wavelet domain by coefficients of varying amplitudes that reflect those changes. Larger amplitude wavelet coefficients along the edge are kept, while smaller coefficients are approximated by zeros (the dead-band, in scalar quantization terms). Upon reconstruction, a continuous edge becomes a discontinuous edge. In the present invention, the problem of preserving continuity in edges is solved by testing the wavelet coefficients' amplitudes within a subband against two different thresholds to determine their significance. If a coefficient's amplitude exceeds the larger threshold, it becomes significant. If a coefficient's absolute amplitude falls between the smaller and larger thresholds, a further test is performed to determine its significance.
If any adjacent coefficient has exceeded the larger threshold, or if any adjacent coefficient has itself been found significant through this neighbor test, then the current coefficient is tagged as significant.
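The two-threshold significance test can be sketched on a one-dimensional row of coefficients; the thresholds and coefficient values below are assumptions chosen for illustration. A coefficient above the larger threshold is significant outright; one lying between the thresholds becomes significant only if an adjacent coefficient is significant, and significance may propagate iteratively along the edge.

```python
# Illustrative sketch only: thresholds and coefficient values are
# invented for this example. Coefficients above the larger threshold
# are significant outright; those between the two thresholds become
# significant if an adjacent coefficient is significant, with
# significance propagating iteratively along the edge.

def significance_map(coeffs, t_low, t_high):
    n = len(coeffs)
    sig = [abs(c) >= t_high for c in coeffs]
    changed = True
    while changed:                        # propagate until stable
        changed = False
        for i in range(n):
            if not sig[i] and t_low <= abs(coeffs[i]) < t_high:
                if (i > 0 and sig[i - 1]) or (i + 1 < n and sig[i + 1]):
                    sig[i] = True
                    changed = True
    return sig

# A continuous edge whose mid-section dips below the larger threshold
# stays connected instead of being zeroed into a gap.
row = [9.0, 6.0, 4.0, 6.0, 9.0, 1.0]
flags = significance_map(row, t_low=3.0, t_high=8.0)
```

With a single threshold of 8.0, the three middle coefficients would fall into the dead-band and the reconstructed edge would break; the second threshold together with the neighbor test keeps the edge continuous, while the isolated small coefficient at the end is still discarded.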
Further objects and advantages will become apparent from a consideration of the ensuing description and drawings.