Digital video and digital video compression have become ubiquitous throughout the content generation, distribution, broadcast, editing, and storage markets. The dominant compression schemes are block-based (MPEG2, MPEG1, DV, MPEG4) and ‘lossy’.
The basic idea behind video compression is to remove information redundancy, both spatial redundancy within a video frame and temporal redundancy between video frames. As in JPEG, a standard for still image compression, DCT-based (Discrete Cosine Transform) compression is frequently used to reduce spatial redundancy. Because the images in a video stream usually do not change much within small time intervals, the content of one video frame can be predicted from others temporally close to it. This technique, motion compensation, is used in standards such as MPEG to achieve greater compression ratios.
A video stream is a sequence of video frames, each of which is a still image. A video player displays one frame after another, for example at a rate close to 30 frames per second in North America. In MPEG, digitized frames are divided into 16×16 pixel macroblocks, which are numbered in scan order (top left to bottom right) and are the units for motion-compensated compression. Block is a more general term that also covers 8×8 regions.
Frames can be encoded in three types: intra-frames (I-frames), forward predicted frames (P-frames), and bi-directional predicted frames (B-frames). The non-intra frames, i.e., P-frames and B-frames, are encoded using information from outside the current frame that has already been encoded. In non-intra frames, motion-compensated information can be used for a macroblock, which results in less data than directly (intra) encoding the macroblock. B- and P-pictures can nevertheless contain some or all intra blocks.
An I-frame is spatially encoded as a single image, with no reference to any past or future frames. The encoding scheme used is similar to JPEG compression. Each 8×8 block in the frame is encoded independently, with the exception of the DC coefficient, which is coded differentially from block to block. The block is first transformed from the spatial domain into a frequency domain using the DCT, which separates the signal into independent frequency bands. The frequency information to which human perception is most sensitive, the low frequencies, is in the upper left corner of the resulting 8×8 block. After this, the data is quantized. Quantization can be thought of as ignoring lower-order bits (though the process is slightly more complicated). Quantization is the only lossy part of the whole compression process other than subsampling.
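The transform and quantization steps described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular standard: it assumes an orthonormal 8×8 DCT-II and a single uniform quantizer step for every coefficient, whereas real encoders typically weight each frequency separately. The function names `dct_2d`, `quantize`, and `dequantize` are hypothetical.

```python
import math

N = 8  # block size used for the transform


def dct_2d(block):
    """Forward 2-D DCT-II of an N x N block (orthonormal scaling)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):          # vertical frequency index
        for v in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * N)))
            out[u][v] = scale(u) * scale(v) * total
    return out


def quantize(coeffs, step):
    """Uniform quantization (assumed here): divide by the step and round.

    This rounding is where information is discarded.
    """
    return [[int(round(value / step)) for value in row] for row in coeffs]


def dequantize(levels, step):
    """Decoder-side reconstruction of the coefficients from the levels."""
    return [[level * step for level in row] for row in levels]


# A perfectly flat block: all of its energy lands in the DC coefficient,
# which sits in the upper left corner of the coefficient array.
flat_block = [[100] * N for _ in range(N)]
levels = quantize(dct_2d(flat_block), step=16)
```

For the flat block, only `levels[0][0]` is non-zero, which is why smooth image regions compress so well: a single number describes the whole block.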
After the I-frame has been processed, the encoded frame will be reconstructed and stored to provide a reconstructed version of the image identical with the one that will be generated by the decoder. This reconstructed image is used as a reference frame to encode non-intra frames.
A P-frame is encoded relative to the past reference frame, which can be an I- or P-frame. In MPEG2, the past reference frame is the closest preceding reference frame. [Note: in H.264, aka MPEG4 part 10, additional previous reference pictures can be used.] As an illustration, an encoder might, before performing the DCT, compare each macroblock with the same block in the reference image. If the block is part of a static background, it will be very similar to the corresponding block in the reference image. The encoder will generate a difference signal from the predictor and the current macroblock, and encode the difference signal in the output bitstream. The decoder is instructed to decode the difference signal and add the corresponding block from the reference image. Similarly, if the block is part of motion in the scene, then the block may still be present in the reference image, just not in the same location. A motion vector is used to instruct the decoder where in the reference image to get the data, typically using a value having x and y components. The process of obtaining the motion vector is known as motion estimation, and using the motion vector to eliminate or reduce the effects of motion is known as motion compensation.
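Motion estimation and the resulting difference signal can be illustrated with an exhaustive block-matching search. This is only a sketch under stated assumptions: the sum of absolute differences (SAD) is assumed as the matching cost, the search is limited to whole-pixel offsets in a small window, and the names `sad`, `estimate_motion`, and `residual` are hypothetical. Practical encoders use faster search strategies and sub-pixel accuracy.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between a block of cur and a block of ref."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(size) for i in range(size))


def estimate_motion(cur, ref, cx, cy, size=16, search=4):
    """Find the offset (dx, dy) into ref that best predicts the block at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, size)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def residual(cur, ref, cx, cy, mv, size=16):
    """Difference signal: current block minus the motion-compensated predictor."""
    dx, dy = mv
    return [[cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i]
             for i in range(size)] for j in range(size)]


# Synthetic 12x12 reference frame, and a current frame whose content has
# moved 2 pixels right and 1 pixel down relative to the reference.
ref_frame = [[7 * x + 13 * y for x in range(12)] for y in range(12)]
cur_frame = [[ref_frame[y - 1][x - 2] if y >= 1 and x >= 2 else 0
              for x in range(12)] for y in range(12)]

mv, cost = estimate_motion(cur_frame, ref_frame, 4, 4, size=4, search=3)
diff = residual(cur_frame, ref_frame, 4, 4, mv, size=4)
```

Here the search returns the vector (-2, -1), pointing back to where the block came from, and the difference signal is all zeros, so almost nothing remains to be transformed and coded.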
For bi-directional encoding, a B-frame is encoded relative to a past reference frame, a future reference frame, or both frames. This way, a motion vector should be found for almost every macroblock in the B-frame. [See note on H.264] The encoding for B-frames is similar to P-frames, except that motion vectors may refer to areas in the future reference frames. For macroblocks that use both past and future reference frames, the two 16×16 areas are averaged.
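The choice between past, future, and averaged prediction can be sketched as below. The selection rule (smallest sum of absolute differences) and the rounding of the average (halves rounded up) are assumptions made for this illustration, and the names `average_blocks`, `block_sad`, and `choose_prediction` are hypothetical. The blocks are assumed to be already motion compensated; 2×2 blocks stand in for the 16×16 areas for brevity.

```python
def average_blocks(forward, backward):
    """Average two predictor blocks pixel by pixel (halves rounded up, assumed)."""
    return [[(a + b + 1) // 2 for a, b in zip(row_f, row_b)]
            for row_f, row_b in zip(forward, backward)]


def block_sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


def choose_prediction(current, forward, backward):
    """Pick the predictor (past, future, or their average) closest to current."""
    candidates = {
        "forward": forward,
        "backward": backward,
        "bidirectional": average_blocks(forward, backward),
    }
    mode = min(candidates, key=lambda m: block_sad(current, candidates[m]))
    return mode, candidates[mode]


# A block that brightens steadily over time: the B-frame lies halfway
# between its past and future references.
past_block = [[100, 100], [100, 100]]
future_block = [[120, 120], [120, 120]]
current_block = [[110, 110], [110, 110]]

mode, predictor = choose_prediction(current_block, past_block, future_block)
```

In this example the averaged predictor matches the current block exactly, so `mode` is `"bidirectional"`; neither reference alone would predict it as well.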
FIG. 1 is a diagram showing a conventional example of an encoding pattern and dependencies between I-, P-, and B-frames. The diagram shows a typical group of pictures (GOP) IPB sequence that starts with an I-frame. The I- and P-frames are sometimes called anchor frames because they are used in the coding of other frames using motion compensation. The arrows represent the inter-frame prediction dependencies. The first P-frame is coded using the previous I-frame as a reference. Each subsequent P-frame uses the previous P-frame as its reference. Thus, errors in P-frames can propagate because the P-frame becomes the reference for other frames. B-frames are coded using the previous I- or P-frame as a reference for forward prediction, and the following I- or P-frame for backward prediction.
Frames do not need to follow a static IPB pattern. Each individual frame can be of any type. Often, however, a regular IPB sequence, in which there is a fixed pattern of P- and B-frames between I-frames, is used throughout the entire video stream for simplicity. Regular GOPs are characterized by two parameters, M and N. N represents the distance between I-frames, and M is the distance between P-frames (or closest anchor frames). A value of M=1 means that there are no B-frames.
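A regular GOP can be generated from its two distances. The sketch below uses descriptive, hypothetical parameter names rather than the letters M and N: `i_distance` is the number of frames from one I-frame to the next, and `anchor_distance` is the spacing between consecutive anchor frames.

```python
def gop_pattern(i_distance, anchor_distance):
    """Frame types, in display order, for one regular GOP."""
    types = []
    for k in range(i_distance):
        if k == 0:
            types.append("I")            # every GOP starts with an I-frame
        elif k % anchor_distance == 0:
            types.append("P")            # anchors fall at regular intervals
        else:
            types.append("B")            # everything in between is a B-frame
    return "".join(types)


print(gop_pattern(12, 3))  # IBBPBBPBBPBB
print(gop_pattern(6, 1))   # IPPPPP (anchor spacing of 1: no B-frames)
```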
B-frames can usually be decoded only if both the preceding and following anchor frames have been sent to the decoder. The exception is that MPEG2 has “closed GOP B pictures” in which only the following reference picture is required for decoding. FIG. 1 shows the GOP in display order, but to enable decoding, the frames in the output sequence are reordered so that a decoder can decompress them with minimum frame buffering. For example, an input sequence of IBBPBBP will be arranged in the output sequence as IPBBPBB. If there are no B-frames, then reordering is unnecessary.
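The reordering in the example above can be sketched as follows. The function name `display_to_coded_order` and the frame labels (type letter plus display index) are hypothetical; the only rule assumed is that each anchor frame is emitted ahead of the B-frames that precede it in display order.

```python
def display_to_coded_order(frames):
    """Reorder frame labels from display order to transmission order."""
    coded = []
    pending_b = []                      # B-frames waiting for their next anchor
    for label in frames:
        if label[0] in ("I", "P"):
            coded.append(label)         # send the anchor first ...
            coded.extend(pending_b)     # ... then the B-frames that depend on it
            pending_b = []
        else:
            pending_b.append(label)
    coded.extend(pending_b)             # trailing B-frames with no later anchor
    return coded


display = ["I0", "B1", "B2", "P3", "B4", "B5", "P6"]
print(display_to_coded_order(display))
# ['I0', 'P3', 'B1', 'B2', 'P6', 'B4', 'B5']
```

With no B-frames in the input, `pending_b` stays empty and the output order equals the input order.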
Depending on the compression ratio and characteristics of the content, various compression artifact signatures can be introduced by this processing. For example, image motion such as rotations and zooms may not be predicted efficiently and may load the system. Brightness changes, shadows, and fades may also result in poor prediction. Blocking artifacts result from a coarsely quantized DCT system. If there are insufficient bits available, block structures may become visible, resulting in visually perceptible rectilinear boundaries. ‘Mosquito’ noise is a characteristic of quantized DCT systems and appears on sharp edges in the scene, such as titles. Additionally, impairments characteristic of analog video distribution and storage (random noise, scan line corruption) are present as well. As bit rates decline, as consumer digital video recording devices such as time-shift and DVD-record devices become prevalent, and as display sizes increase, these artifacts become more noticeable and a greater degree of suppression is required.
Typically, an MPEG, DV, or other decoder performs a scaling operation prior to display. Newer compression schemes such as MPEG4, H.26L, and H.263+, as well as older, low bit rate standards such as H.261, can also apply deblocking filters to reduce the effects of the DCT block boundary artifacts.
Deblocking filters are spatial pixel operations that improve subjective quality by removing blocking and mosquito artifacts. These are either normative (H.261, H.263+) or non-normative (e.g. MPEG4). Normative deblocking filters are referred to as being inside the coding loop because predicted pictures are computed based on filtered versions of the previous ones. Normative (‘Loop’) filters are, in general, more effective in that they run both in the encoder and decoder. Non-normative deblocking filters are run after decode only and outside the coding loop. Therefore, prediction is not based on the post filtered version of the picture.
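As a rough illustration of a spatial deblocking operation, the sketch below smooths small steps at block boundaries along one row of pixels. It is not the filter of any standard named above: the function name `deblock_row`, the threshold, and the quarter-step adjustment are all assumptions chosen for clarity. It corresponds to the non-normative, post-decode case, since it operates only on an already decoded row.

```python
def deblock_row(row, block=8, threshold=10):
    """Soften small steps at block boundaries in one row of pixels."""
    out = [float(p) for p in row]
    for b in range(block, len(row), block):
        left, right = row[b - 1], row[b]
        step = right - left
        # A small step exactly on a block boundary is assumed to be a coding
        # artifact; a large step is assumed to be a real edge and is kept.
        if 0 < abs(step) <= threshold:
            out[b - 1] = left + step / 4
            out[b] = right - step / 4
    return out


blocky = [100] * 8 + [108] * 8     # small step on the boundary: smoothed
edge = [100] * 8 + [200] * 8       # large step: treated as a true edge

smoothed = deblock_row(blocky)
untouched = deblock_row(edge)
```

For the `blocky` row, the two pixels on either side of the boundary become 102.0 and 106.0, halving the visible step; the `edge` row is returned unchanged.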
Temporal filtering is another approach for noise reduction. Temporal filters help remove the background graininess and noise that is often apparent in lower-quality input images, making the material easier to encode. There are two major temporal-domain approaches to image sequence filtering: (1) the motion-compensation approach, and (2) the motion-detection approach. In motion-compensated filtering, first a motion estimation algorithm is applied to the noisy image sequence to estimate the motion trajectories, i.e., locations of pixels (or subpixels) that correspond to each other at a predetermined number of nearby image frames. Then, the value of a particular pixel at a certain frame is estimated using the image sequence values that are on the motion trajectory passing through that pixel.
In contrast, methods based on motion detection do not attempt to estimate the interframe motion. Instead, direct differences of pixel values at identical spatial locations of two adjacent frames are computed to detect the presence of interframe motion.
In temporal filtering, the filtered pixel value at a certain location of the present frame is determined by applying a (typically nonlinear) finite impulse response (FIR) or infinite impulse response (IIR) filter structure to the unfiltered and estimated pixels. The filter coefficients are often functions of the difference between the pixel value of interest in the present frame and the pixel value at the same location of the previous frame.
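A motion-detection-based temporal filter of the kind described can be sketched as a first-order recursive (IIR) filter whose coefficient depends on the frame difference. The linear ramp used for the coefficient, the threshold value, and the name `temporal_filter` are assumptions for this illustration; the sketch also compares each pixel with the previous filtered output rather than the previous unfiltered frame.

```python
def temporal_filter(frames, threshold=12):
    """Motion-adaptive first-order IIR filter over a list of 2-D frames."""
    prev = [[float(p) for p in row] for row in frames[0]]
    output = [prev]
    for frame in frames[1:]:
        filtered = []
        for y, row in enumerate(frame):
            out_row = []
            for x, pixel in enumerate(row):
                diff = abs(pixel - prev[y][x])
                # Small difference: assume noise and lean on the previous output.
                # Large difference: assume motion and pass the new pixel through.
                alpha = min(1.0, diff / threshold)
                out_row.append(alpha * pixel + (1.0 - alpha) * prev[y][x])
            filtered.append(out_row)
        output.append(filtered)
        prev = filtered
    return output


# One-row frames: the left pixel is static with noise, the right pixel
# changes abruptly from 100 to 200 (motion or a scene change).
frames = [[[100, 100]], [[104, 200]], [[96, 200]]]
result = temporal_filter(frames)
```

In the second output frame the noisy left pixel is pulled back toward 100 (about 101.3 instead of 104), while the right pixel jumps straight to 200.0 because the large difference disables the filtering there.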
The artifacts most readily suppressed by temporal filtering include ‘mosquito’ noise around text overlays and sharp edges, intra-frame beating, stationary texture crawling, and visible block boundaries. However, temporal filters have not been applied in conventional compression loops, or within the loop in post-processing, for various reasons, including system complexity and IDCT mismatch control. MPEG, for instance, does not specify an exact transform, so IIR loop filters would allow encoder/decoder drift to accumulate. (Note: newer compression schemes like WMV and H.264 specify the transform exactly and do not suffer from this problem.)
Accordingly, what is needed is an improved method and system for reducing encoding artifacts in a video sequence of image frames. The present invention addresses such a need.