1. Field of the Invention
The present invention relates generally to apparatus and methods for encoding and decoding video information. More particularly, the present invention relates to an apparatus and method for motion estimation and motion prediction in the transform domain.
2. Background of the Related Art
Due to the limited bandwidth available on transmission channels, only a limited number of bits are available to encode audio and video information. Video encoding techniques attempt to encode video information with as few bits as possible, while still maintaining the image quality required for a given application. Thus, video compression techniques attempt to reduce the bandwidth required to transmit a video signal by removing redundant information and representing the remaining information with a minimum number of bits, from which an approximation to the original image can be reconstructed with a minimal loss of important features. In this manner, the compressed data can be stored or transmitted more efficiently than the original image data.
There are a number of video encoding techniques which improve coding efficiency by removing statistical redundancy from video signals. Many standard image compression schemes are based on block transforms of the input image, such as the Discrete Cosine Transform (DCT). The well-known MPEG video encoding technique, for example, developed by the Moving Picture Experts Group, achieves significant bit rate reductions by taking advantage of the correlation between pixels (pels) in the spatial domain (through the use of the DCT), and the correlation between image frames in the time domain (through the use of prediction and motion compensation).
In well-known orthogonal and bi-orthogonal (subband) transform based encoding systems (inclusive of lapped orthogonal transforms), an image is transformed without the necessity of first blocking the image. Transform encoders based on DCT block the image primarily for two reasons: 1) experience has shown that the DCT is a good approximation to the known optimal transform (Karhunen-Loève) on 8×8 regions of the image or a sequence of difference images; and 2) the processing cost of the DCT grows as O(N log N), and blocking the image limits the computational effort.
The end result is that DCT based approaches, unless otherwise enhanced, have basis functions which are compactly supported by (or zero outside of) an 8×8 region of an image. The orthogonal and bi-orthogonal transforms under consideration have basis members which are predominantly supported in a finite interval of the image, but share extent with neighboring spatial regions. Subband image encoding techniques, for example, divide an input image into a plurality of spatial frequency bands using a set of filters, and then quantize each band or channel. For a detailed discussion of subband image encoding techniques, see Subband Video Coding With Dynamic Bit Allocation and Geometric Vector Quantization, C. Podilchuk & A. Jacquin, SPIE Vol. 1666 Human Vision, Visual Processing, and Digital Display III, pp. 241-52 (February 1992). At each stage of the subband encoding process, the signal is split into a low pass approximation of the image, and a high pass term representing the detail lost by making the approximation.
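By way of illustration only, one stage of such a low pass/high pass split may be sketched as follows. The sketch assumes the simplest two-tap (Haar) filter pair applied to a single row of pixel values; the filter choice, the function names haar_split and haar_merge, and the sample row are illustrative assumptions and are not taken from the cited reference.

```python
def haar_split(signal):
    """Split an even-length sequence into a low pass and a high pass band.

    Assumption: the two-tap Haar pair is used purely for illustration.
    Each band has half the length of the input.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    # Low pass: pairwise average, a coarse approximation of the row.
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    # High pass: pairwise half-difference, the detail lost by averaging.
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def haar_merge(low, high):
    """Recombine the two bands into the original sequence."""
    out = []
    for approx, detail in zip(low, high):
        out.extend([approx + detail, approx - detail])
    return out


# Hypothetical row of pixel values.
row = [10, 12, 14, 14, 80, 82, 40, 40]
low, high = haar_split(row)
restored = haar_merge(low, high)
```

Applying haar_split again to the low band would yield the next stage of the decomposition. In this unquantized sketch the two bands together reconstruct the row exactly; in an actual encoder each band would be quantized before transmission.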
In addition, DCT based transform encoders are translation invariant in the sense that the base members have a support which extends over the entire 8×8 block. This prevents motion compensation from being done efficiently in the transform domain. Therefore, most of the motion compensation techniques in use utilize temporally adjacent image frames to form an error term which is then transform coded on an 8×8 block. As a consequence, these techniques require an inverse transform to be carried out to supply a reference frame from the frequency domain to the time domain. Examples of such systems are found in U.S. Pat. No. 5,481,553 to Suzuki et al. and U.S. Pat. No. 5,025,482 to Murakami et al.
FIG. 1 illustrates a simplified block diagram of a prior art standard video compression approach using DCT. In block 10, the changes in the image sequence are efficiently represented through motion detection techniques such as one technique used in MPEG when in predictive mode. In particular, a previous frame is used as a reference frame and a subsequent frame, in forward prediction, is compared against the previous frame to eliminate temporal redundancies and rank the differences between them according to degree. This step sets the stage for motion prediction of the subsequent frame and also reduces the data size of the subsequent frame. In block 12, a determination is made as to which parts of the image have moved. Continuing with the MPEG example, using the data set provided by block 10, interframe motion prediction is carried out by applying motion compensation techniques to the reference frame and subsequent frame. The resulting prediction is subtracted from the subsequent frame to generate a prediction error/frame. Thereafter, in block 14, the changes are converted to features. In MPEG, this is done by compressing the prediction error using a 2-dimensional 8×8 DCT.
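By way of illustration only, the three stages of FIG. 1 may be sketched as follows. The sketch assumes an exhaustive block-matching search with a sum-of-absolute-differences (SAD) cost and a direct (non-fast) DCT-II; the 2×2 block size, the ±2 search range, the frame contents, and all function names are illustrative assumptions chosen to keep the example short, and do not reproduce the MPEG procedure, which operates on larger blocks.

```python
import math


def sad(ref, cur, ry, rx, cy, cx, n):
    """Sum of absolute differences between an n x n block of the current
    frame at (cy, cx) and an n x n block of the reference frame at (ry, rx)."""
    return sum(abs(cur[cy + i][cx + j] - ref[ry + i][rx + j])
               for i in range(n) for j in range(n))


def best_motion_vector(ref, cur, cy, cx, n, search):
    """Exhaustive search for the displacement (dy, dx) with the lowest SAD."""
    h, w = len(ref), len(ref[0])
    best, best_cost = (0, 0), sad(ref, cur, cy, cx, cy, cx, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            # Skip candidates that fall outside the reference frame.
            if 0 <= ry <= h - n and 0 <= rx <= w - n:
                cost = sad(ref, cur, ry, rx, cy, cx, n)
                if cost < best_cost:
                    best, best_cost = (dy, dx), cost
    return best, best_cost


def prediction_error(ref, cur, cy, cx, n, mv):
    """Current block minus the motion-compensated reference block."""
    dy, dx = mv
    return [[cur[cy + i][cx + j] - ref[cy + dy + i][cx + dx + j]
             for j in range(n)] for i in range(n)]


def dct2(block):
    """Direct orthonormal 2-D DCT-II of a square block (no fast algorithm)."""
    n = len(block)

    def scale(k):
        return math.sqrt((1 if k == 0 else 2) / n)

    return [[scale(u) * scale(v) * sum(
                block[y][x]
                * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                * math.cos((2 * x + 1) * v * math.pi / (2 * n))
                for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]


# Hypothetical frames: a 2 x 2 patch moves one pixel to the right.
N = 2
patch = [[50, 60], [70, 80]]
reference = [[0] * 8 for _ in range(8)]
current = [[0] * 8 for _ in range(8)]
for i in range(N):
    for j in range(N):
        reference[2 + i][2 + j] = patch[i][j]
        current[2 + i][3 + j] = patch[i][j]

mv, cost = best_motion_vector(reference, current, 2, 3, N, search=2)
residual = prediction_error(reference, current, 2, 3, N, mv)
unaligned = prediction_error(reference, current, 2, 3, N, (0, 0))
coeffs = dct2(residual)
```

In this sketch the search recovers the displacement of the patch, so the motion-compensated residual is zero, whereas the residual formed without compensation (unaligned) is not. Note that the reference frame is accessed as pixel values, which is why an encoder of this kind must hold or reconstruct its reference frame outside the transform domain.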
Most video compression techniques based on DCT or subband encoders have focused on high precision techniques that attempt to encode video information without a loss of accuracy in the transform stage. Such high precision encoding techniques, however, rely on relatively expensive microprocessors, such as Intel Corporation's PENTIUM® processor, which have dedicated hardware to aid in the manipulation of floating point arithmetic and thereby reduce the penalty for maintaining a high degree of precision.
For many applications, however, such relatively expensive hardware is not practical or justified. Thus, a lower cost implementation, which also maintains acceptable image quality levels, is required. Known limited precision transforms that may be implemented on lower-cost hardware, however, tend to exhibit reduced accuracy as a result of the “lossy” nature of the encoding process. As used herein, a “lossy” system refers to a system that loses precision through the various stages of the encoder and thereby lacks the ability to substantially reconstruct the input from the transform coefficients when decoding. The inability to compensate for the reduced accuracy exhibited by these low precision transforms has been an impediment to the use of such transforms.
In view of the foregoing, there is a need for a video encoder that performs the motion compensation in the transform domain, thereby eliminating the requirement of an inverse transform in the encoder and enabling a simple control structure for software and hardware devices. There is also a need in the art for a video encoder having a class of transforms which are suitable for low precision implementation, including a control structure which enables low cost hardware and high speed software devices.