The present invention relates to video coding and, more particularly, to video coding system using interpolation filters as part of motion-compensated coding.
Video codecs typically code video frames using a discrete cosine transform (“DCT”) on blocks of pixels, called “pixel blocks” herein, much the same as used for the original JPEG coder for still images. An initial frame (called an “intra” frame) is coded and transmitted as an independent frame. Subsequent frames, which are modeled as changing slowly due to small motions of objects in the scene, are coded efficiently in the inter mode using a technique called motion compensation (“MC”) in which the displacement of pixel blocks from their position in previously-coded frames are transmitted as motion vectors together with a coded representation of a difference between a predicted pixel block and a pixel block from the source image.
A brief review of motion compensation is provided below. FIGS. 1 and 2 show a block diagram of a motion-compensated image coder/decoder system. The system combines transform coding (in the form of the DCT of pixel blocks of pixels) with predictive coding (in the form of differential pulse coded modulation (“PCM”)) in order to reduce storage and computation of the compressed image, and at the same time to give a high degree of compression and adaptability. Since motion compensation is difficult to perform in the transform domain, the first step in the interframe coder is to create a motion compensated prediction error. This computation requires one or more frame stores in both the encoder and decoder. The resulting error signal is transformed using a DCT, quantized by an adaptive quantizer, entropy encoded using a variable length coder (“VLC”) and buffered for transmission over a channel.
The way that the motion estimator works is illustrated in FIG. 3. In its simplest form the current frame is partitioned into motion compensation blocks, called “mcblocks” herein, of constant size, e.g., 16×16 or 8×8. However, variable size mcblocks are often used, especially in newer codecs such as H.264. ITU-T Recommendation H.264, Advanced Video Coding. Indeed nonrectangular mcblocks have also been studied and proposed. Mcblocks are generally larger than or equal to pixel blocks in size.
Again, in the simplest form of motion compensation, the previous decoded frame is used as the reference frame, as shown in FIG. 3. However, one of many possible reference frames may also be used, especially in newer codecs such as H.264. In fact, with appropriate signaling, a different reference frame may be used for each mcblock.
Each mcblock in the current frame is compared with a set of displaced mcblocks in the reference frame to determine which one best predicts the current mcblock. When the best matching mcblock is found, a motion vector is determined that specifies the displacement of the reference mcblock.
Exploiting Spatial Redundancy
Because video is a sequence of still images, it is possible to achieve some compression using techniques similar to JPEG. Such methods of compression are called intraframe coding techniques, where each frame of video is individually and independently compressed or encoded. Intraframe coding exploits the spatial redundancy that exists between adjacent pixels of a frame. Frames coded using only intraframe coding are called “I-frames”.
Exploiting Temporal Redundancy
In the unidirectional motion estimation described above, called “forward prediction”, a target mcblock in the frame to be encoded is matched with a set of mcblocks of the same size in a past frame called the “reference frame”. The mcblock in the reference frame that “best matches” the target mcblock is used as the reference mcblock. The prediction error is then computed as the difference between the target mcblock and the reference mcblock. Prediction mcblocks do not, in general, align with coded mcblock boundaries in the reference frame. The position of this best-matching reference mcblock is indicated by a motion vector that describes the displacement between it and the target mcblock. The motion vector information is also encoded and transmitted along with the prediction error. Frames coded using forward prediction are called “P-frames”.
The prediction error itself is transmitted using the DCT-based intraframe encoding technique summarized above.
Bidirectional Temporal Prediction
Bidirectional temporal prediction, also called “Motion-Compensated Interpolation”, is a key feature of modern video codecs. Frames coded with bidirectional prediction use two reference frames, typically one in the past and one in the future. However, two of many possible reference frames may also be used, especially in newer codecs such as H.264. In fact, with appropriate signaling, different reference frames may be used for each mcblock.
A target mcblock in bidirectionally-coded frames can be predicted by a mcblock from the past reference frame (forward prediction), or one from the future reference frame (backward prediction), or by an average of two mcblocks, one from each reference frame (interpolation). In every case, a prediction mcblock from a reference frame is associated with a motion vector, so that up to two motion vectors per mcblock may be used with bidirectional prediction. Motion-Compensated Interpolation for a mcblock in a bidirectionally-predicted frame is illustrated in FIG. 4. Frames coded using bidirectional prediction are called “B-frames”.
Bidirectional prediction provides a number of advantages. The primary one is that the compression obtained is typically higher than can be obtained from forward (unidirectional) prediction alone. To obtain the same picture quality, bidirectionally-predicted frames can be encoded with fewer bits than frames using only forward prediction.
However, bidirectional prediction does introduce extra delay in the encoding process, because frames must be encoded out of sequence. Further, it entails extra encoding complexity because mcblock matching (the most computationally intensive encoding procedure) has to be performed twice for each target mcblock, once with the past reference frame and once with the future reference frame.
Typical Encoder Architecture for Bidirectional Prediction
FIG. 5 shows a typical bidirectional video encoder. It is assumed that frame reordering takes place before coding, i.e., I- or P-frames used for B-frame prediction must be coded and transmitted before any of the corresponding B-frames. In this codec, B-frames are not used as reference frames. With a change of architecture, they could be as in H.264.
Input video is fed to a Motion Compensation Estimator/Predictor that feeds a prediction to the minus input of the subtractor. For each mcblock, the Inter/Intra Classifier then compares the input pixels with the prediction error output of the subtractor. Typically, if the mean square prediction error exceeds the mean square pixel value, an intra mcblock is decided. More complicated comparisons involving DCT of both the pixels and the prediction error yield somewhat better performance, but are not usually deemed worth the cost.
For intra mcblocks the prediction is set to zero. Otherwise, it comes from the Predictor, as described above. The prediction error is then passed through the DCT and quantizer before being coded, multiplexed and sent to the Buffer.
Quantized levels are converted to reconstructed DCT coefficients by the Inverse Quantizer and then the inverse is transformed by the inverse DCT unit (“IDCT”) to produce a coded prediction error. The Adder adds the prediction to the prediction error and clips the result, e.g., to the range 0 to 255, to produce coded pixel values.
For B-frames, the Motion Compensation Estimator/Predictor uses both the previous frame and the future frame kept in picture stores.
For I- and P-frames, the coded pixels output by the Adder are written to the Next Picture Store, while at the same time the old pixels are copied from the Next Picture store to the Previous Picture store. In practice, this is usually accomplished by a simple change of memory addresses.
Also, in practice the coded pixels may be filtered by an adaptive deblocking filter prior to entering the picture stores. This improves the motion compensation prediction, especially for low bit rates where coding artifacts may become visible.
The Coding Statistics Processor in conjunction with the Quantizer Adapter controls the output bit-rate and optimizes the picture quality as much as possible.
Typical Decoder Architecture for Bidirectional Prediction
FIG. 6 shows a typical bidirectional video decoder. It has a structure corresponding to the pixel reconstruction portion of the encoder using inverting processes. It is assumed that frame reordering takes place after decoding and video output. The interpolation filter might be placed at the output of the motion compensated predictor as in the encoder.
Fractional Motion Vector Displacements
FIG. 3 and FIG. 4 show reference mcblocks in reference frames as being displaced vertically and horizontally with respect to the position of the current mcblock being decoded in the current frame. The amount of the displacement is represented by a two-dimensional vector [dx, dy], called the motion vector. Motion vectors may be coded and transmitted, or they may be estimated from information already in the decoder, in which case they are not transmitted. For bidirectional prediction, each transmitted mcblock requires two motion vectors.
In its simplest form, dx and dy are signed integers representing the number of pixels horizontally and the number of lines vertically to displace the reference mcblock. In this case, reference mcblocks are obtained merely by reading the appropriate pixels from the reference stores.
However, in newer video codecs it has been found beneficial to allow fractional values for dx and dy. Typically, they allow displacement accuracy down to a quarter pixel, i.e., an integer+−0.25, 0.5 or 0.75.
Fractional motion vectors require more than simply reading pixels from reference stores. In order to obtain reference mcblock values for locations between the reference store pixels, it is necessary to interpolate between them.
Simple bilinear interpolation can work fairly well. However, in practice it has been found beneficial to use two-dimensional interpolation filters especially designed for this purpose. In fact, for reasons of performance and practicality, the filters are often not shift-invariant filters. Instead different values of fractional motion vectors may utilize different interpolation filters.
Motion Compensation Using Adaptive Interpolation Filters
The optimum motion compensation interpolation filter depends on a number of factors. For example, objects in a scene may not be moving in pure translation. There may be object rotation, both in two dimensions and three dimensions. Other factors include zooming, camera motion and lighting variations caused by shadows, or varying illumination.
Camera characteristics may vary due to special properties of their sensors. For example, many consumer cameras are intrinsically interlaced, and their output may be de-interlaced and filtered to provide pleasing-looking pictures free of interlacing artifacts. Low light conditions may cause an increased exposure time per frame, leading to motion dependent blur of moving objects. Pixels may be non-square.
Thus, in many cases improved performance can be had if the motion compensation interpolation filter can adapt to these and other outside factors. In such systems interpolation filters may be designed by minimizing the mean square error between the current mcblocks and their corresponding reference mcblocks over each frame. These are the so-called Wiener filters. The filter coefficients would then be quantized and transmitted at the beginning of each frame to be used in the actual motion compensated coding.