Digital video and digital video compression have become ubiquitous throughout the content generation, distribution, broadcast, editing, and storage markets. In general, image sequences may be compressed either spatially only (JPEG, DV, MPEG Intra pictures) or both temporally and spatially (MPEG, H.261, H.263, H.264). The dominant compression schemes are block-based (MPEG2, MPEG1, DV, MPEG4) and ‘lossy’.
A compressed video stream is an encoded sequence of video frames. Each frame is a still image. A video player displays one frame after another, usually at a rate close to 30 frames per second. Frames can be encoded as one of three types: intra-frames (I-frames), forward predicted frames (P-frames), and bi-directional predicted frames (B-frames). Non-intra frames, i.e., P-frames and B-frames, are encoded using information from outside the current frame that has already been encoded. In non-intra frames, motion compensated information is used for a macroblock, which results in less data than directly (intra) encoding the macroblock.
An I-frame is spatially encoded as a single image, with no reference to any past or future frames. After the I-frame has been processed, the encoded frame is transmitted to a decoder, where it is decoded as a reconstructed image and stored. The encoded I-frame is also decoded at the encoder to provide a reconstructed version of the image identical to the one that will be generated by the decoder. This reconstructed image is used as a reference frame to encode non-intra frames.
A P-frame is encoded relative to the past reference frame, which can be an I- or P-frame. The past reference frame is the closest preceding reference frame. Before the image is transformed from the spatial domain into a frequency domain using the DCT (Discrete Cosine Transform), each macroblock is compared with the co-located block in the reference image. If the block is part of a static background, it will be the same as the corresponding block in the reference image. Instead of encoding this block, the decoder is instructed to use the corresponding block from the reference image. If the block is part of motion in the scene, then the block may still be present in the reference image, just not in the same location. A motion vector is used to instruct the decoder where in the reference image to get the data, typically using a value having x and y components. The process of obtaining the motion vector is known as motion estimation, and using the motion vector to eliminate or reduce the amount of residual information to encode is known as motion compensation. The encoding for B-frames is similar to that for P-frames, except that motion vectors may refer to areas in both past and future reference pictures.
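By way of illustration only, the motion estimation and motion compensation described above may be sketched as an exhaustive block-matching search. The sketch assumes a sum-of-absolute-differences (SAD) cost, luma-only frames held as lists of rows, and a small search window; the names `sad`, `estimate_motion`, and `residual` are hypothetical, and practical encoders generally use faster search strategies than a full search.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in the
    current frame and the block displaced by (dx, dy) in the reference frame."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def estimate_motion(cur, ref, bx, by, n=4, search=2):
    """Full search over displacements of +/- `search` pixels (an assumed window).
    Returns (dx, dy, cost) for the lowest-cost candidate that stays in bounds."""
    h, w = len(ref), len(ref[0])
    # Start from the zero vector, i.e. the co-located block.
    best = (0, 0, sad(cur, ref, bx, by, 0, 0, n))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue  # candidate block falls outside the reference frame
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best


def residual(cur, ref, bx, by, mv, n=4):
    """Motion compensation: subtract the predicted block from the current block.
    Only this residual (plus the vector) would then need to be encoded."""
    dx, dy = mv
    return [[cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
             for x in range(n)] for y in range(n)]
```

For a block that has simply moved between frames, the search returns the displacement back to its position in the reference image and the residual is all zeros, which is why a motion compensated block typically costs less to encode than an intra block.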
Once the compressed video stream is received by a decoder (e.g., MPEG or DV), the decoder decompresses the video, performs a scaling operation, and perhaps overlays graphics (‘sub-picture’ in DVD-Video) prior to display. Newer compression schemes such as MPEG4, H.26L, and H.263+, as well as older, low bit rate standards such as H.261, also apply deblocking filters to reduce the effects of the DCT block boundary artifacts (note that not all block-based standards use DCT). Progressive output (also known as ‘line doubling’ or ‘de-interlacing’) delivers higher vertical resolution and is typically generated by a pixel-adaptive nonlinear filter applied between fields to generate interpolated pixels. Large screen televisions employ special purpose hardware to detect coherent pan sequences in NTSC film material, and generate interpolated frames to reduce the motion artifacts of 3:2 pulldown (‘judder’).
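The source of the judder mentioned above can be seen from the field cadence itself. The following is a minimal sketch, assuming one common 3:2 cadence in which film frames are held alternately for three and two video fields with alternating field parity; the function name `pulldown_3_2` is hypothetical and the exact cadence phase varies between sources.

```python
def pulldown_3_2(film_frames):
    """Map film frames to interlaced fields using an assumed 3,2,3,2,... cadence.
    Returns a list of (frame, parity) tuples."""
    fields = []
    parity = "top"
    for i, frame in enumerate(film_frames):
        repeat = 3 if i % 2 == 0 else 2  # alternate 3 fields, then 2 fields
        for _ in range(repeat):
            fields.append((frame, parity))
            parity = "bottom" if parity == "top" else "top"
    return fields


# Four film frames become ten fields, so 24 frames/s become 60 fields/s.
cadence = pulldown_3_2(["A", "B", "C", "D"])
print("".join(frame for frame, _ in cadence))
```

Because frames A and C are held for three field periods while B and D are held for two, uniform motion in the film is displayed with uneven timing, which is perceived as judder during pans.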
The decode process typically is performed in raster order left to right and top to bottom, although this need not be the case (example: DV). Additionally, some coding schemes may decode multiple pictures simultaneously. In the decode process, the reconstructed pixel data is created in an on-chip memory system and written back out to external memory (e.g., SDRAM). Previously decoded images may be accessed to provide reference pixel information used in the reconstruction of predicted images. The fetching and storage of pixel information consumes SDRAM ‘bandwidth’ and can be the bottleneck to system performance. Providing higher bandwidth to memory typically increases system cost, either in the form of a ‘wider’ memory interface (e.g., 32 bit instead of 16 bit wide memory) or in the form of faster memories (more power consumption, typically more expensive). In integrated system-on-chip video codec solutions, especially those with a Unified Memory Architecture (UMA), SDRAM bandwidth is one of the critical resources limiting system performance.
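A rough, back-of-the-envelope estimate helps show why SDRAM bandwidth becomes a limiting resource. The sketch below is illustrative only: it assumes 4:2:0 sampling at 8 bits per sample (1.5 bytes per pixel), one write of each reconstructed picture, one read for display, and a caller-supplied number of reference picture reads per decoded picture. The function names, the traffic model, and the example clock rate are assumptions, not figures from any particular device.

```python
def frame_bytes(width, height, bytes_per_pixel=1.5):
    """Size of one picture, assuming 4:2:0 sampling at 8 bits per sample."""
    return int(width * height * bytes_per_pixel)


def decode_traffic_bytes_per_second(width, height, fps, ref_reads_per_frame):
    """Assumed traffic model: one reconstruction write, one display read, and
    `ref_reads_per_frame` full-picture reference fetches per decoded picture."""
    per_frame = frame_bytes(width, height)
    return int(per_frame * fps * (1 + 1 + ref_reads_per_frame))


def peak_interface_bytes_per_second(bus_bits, clock_hz):
    """Theoretical peak of a memory interface moving one word per clock.
    Sustained bandwidth is lower in practice."""
    return bus_bits // 8 * clock_hz


# Example: 720x480 at 30 pictures/s with two reference reads (B-picture case).
traffic = decode_traffic_bytes_per_second(720, 480, 30, 2)
narrow = peak_interface_bytes_per_second(16, 100_000_000)  # assumed 100 MHz clock
wide = peak_interface_bytes_per_second(32, 100_000_000)
print(traffic, narrow, wide)
```

Under these assumptions the decoder alone accounts for roughly 62 MB/s before any scaling, graphics, or motion estimation traffic is added, which illustrates why doubling the interface width is an effective but costly remedy.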
Co-pending patent application Ser. No. 10/256,190 discloses a content adaptive video processor in which scene classification is used to supply control input to a temporal filter to modify the behavior of the temporal filter according to current characteristics of the content to more effectively reduce artifacts in the image sequence. In addition, the temporal filtering may be applied in the encoding system and/or the decoding system, and in or out of the encoding and decoding loops. In this system, a motion estimation unit is utilized by both a motion-compensated temporal filter (MCTF) and a motion-compensated de-interlacer (MCDI).
Motion estimation tasks, however, require memory space for reference and target images, SDRAM bandwidth to fetch the pixels from these images, and computational resources to perform pixel comparisons and compute candidate motion vectors. In hierarchical motion estimation schemes, decimated (reduced resolution) versions of the reference and target images must be generated as well prior to performing motion estimation.
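As one possible illustration of the decimation step, the sketch below halves the resolution of an image in each dimension by averaging 2x2 pixel groups and stacks the results into a pyramid. The averaging filter, the rounding, and the names `decimate_2x` and `build_pyramid` are assumptions chosen for brevity; other decimation filters are equally possible.

```python
def decimate_2x(img):
    """Halve width and height by averaging each 2x2 group (rounded to nearest).
    `img` is a list of rows of integer samples; odd trailing rows/columns are dropped."""
    h = len(img) // 2 * 2
    w = len(img[0]) // 2 * 2
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1] + 2) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]


def build_pyramid(img, levels):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries in total."""
    pyramid = [img]
    for _ in range(levels - 1):
        pyramid.append(decimate_2x(pyramid[-1]))
    return pyramid
```

Each additional level must be computed and stored for both the reference and the target image, so the decimated images add to the memory space and bandwidth requirements noted above.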
Typically, there is a pipelined architecture for performing both the MCTF and MCDI processes that includes the following three steps. The first step generates decimated images for the motion estimation engine; performs temporal analysis to detect repeated fields (for film material) and to detect scene cuts; measures spatial analysis metrics across the images to detect the location and severity of macroblocking, localized frequency content, edges, and other features; and measures image differences and picture-level metrics to identify scene cuts, classify scenes, and invoke specialized processing modes for the identified scene type (e.g., recognizing film content for MCDI).
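A picture-level difference metric of the kind used in this first step can be sketched as follows. This is a simplified example, assuming a mean absolute difference between two pictures of the same size and two hypothetical thresholds (`repeat_thresh` and `cut_thresh`) whose values are placeholders; a practical classifier would combine several metrics.

```python
def mean_abs_diff(a, b):
    """Mean absolute sample difference between two equally sized pictures."""
    total = 0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    return total / count


def classify_transition(prev, cur, repeat_thresh=1.0, cut_thresh=40.0):
    """Label the transition between two pictures (threshold values are assumed).
    A near-zero difference suggests a repeated field; a very large one a scene cut."""
    mad = mean_abs_diff(prev, cur)
    if mad < repeat_thresh:
        return "repeat"
    if mad > cut_thresh:
        return "cut"
    return "normal"
```

When MCTF and MCDI run in separate devices, each must fetch both pictures and compute such metrics itself, which is the duplication of image fetches discussed below.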
The second step is to perform motion estimation (ME). Typically, several stages of ME (e.g., hierarchical or telescopic motion estimation) may occur in sequence, each using the results from the previous stage. Additionally, multi-candidate ME may add stages, using different hierarchies or target blocks (e.g., field and frame).
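The staged search described above may be illustrated with a two-level hierarchical sketch: a coarse search on 2:1 decimated images, followed by a small refinement at full resolution around the scaled-up coarse vector. The 2x2 averaging decimation, the SAD cost, the search radii, and all function names are assumptions made for the example.

```python
def decimate_2x(img):
    """Halve resolution by averaging 2x2 groups (assumed decimation filter)."""
    h, w = len(img) // 2 * 2, len(img[0]) // 2 * 2
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1] + 2) // 4
             for x in range(0, w, 2)] for y in range(0, h, 2)]


def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences for the n x n block displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def search(cur, ref, bx, by, n, center, radius):
    """Test every displacement within `radius` of `center`; return (mv, cost)."""
    h, w = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(center[1] - radius, center[1] + radius + 1):
        for dx in range(center[0] - radius, center[0] + radius + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > w or y0 + n > h:
                continue  # skip candidates outside the reference picture
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    if best_mv is None:
        raise ValueError("no in-bounds candidate for this block")
    return best_mv, best_cost


def hierarchical_me(cur, ref, bx, by, n=8, coarse_radius=2, refine_radius=1):
    """Stage 1: coarse search on decimated images. Stage 2: full-resolution
    refinement centered on the coarse vector scaled by two."""
    cur_lo, ref_lo = decimate_2x(cur), decimate_2x(ref)
    mv_lo, _ = search(cur_lo, ref_lo, bx // 2, by // 2, n // 2, (0, 0), coarse_radius)
    center = (2 * mv_lo[0], 2 * mv_lo[1])
    return search(cur, ref, bx, by, n, center, refine_radius)
```

With the default radii, the two stages together cover full-resolution displacements of up to five pixels in each direction using 25 coarse candidates on 4x4 blocks and 9 fine candidates on 8x8 blocks, whereas a single-stage search over the same range would test 121 candidates on 8x8 blocks.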
The third step is to perform either MCTF or MCDI, respectively, in which the ME vector candidates are evaluated, a block mode is selected, and the temporal filtering and de-interlacing steps are performed using the information from the previous stages.
In a non-integrated system (such as might occur in a system with both a decoder chip and a de-interlacing chip), these three steps occur separately for MCTF and MCDI, and SDRAM space, SDRAM bandwidth, decimation, and estimation effort are duplicated. Additionally, image fetches and processing for spatial analysis and scene classification (e.g., to detect scene changes) are duplicated as well.
Accordingly, what is needed is an improved video decoder system. The present invention addresses such a need.