Video compression can be considered the process of representing digital video data in a form that uses fewer bits when stored or transmitted. Video encoding can achieve compression by exploiting redundancies in the video data, whether spatial, temporal, or color-space. Video compression processes typically segment the video data into portions, such as groups of frames and groups of pels, to identify areas of redundancy within the video that can be represented with fewer bits than required by the original video data. When these redundancies in the data are exploited, greater compression can be achieved. An encoder can be used to transform the video data into an encoded format, while a decoder can be used to transform encoded video back into a form comparable to the original video data. The implementation of the encoder/decoder is referred to as a codec.
Standard encoders divide a given video frame into non-overlapping coding units or macroblocks (rectangular regions of contiguous pels) for encoding. The macroblocks (herein referred to more generally as “input blocks” or “data blocks”) are typically processed in a traversal order of left to right and top to bottom in a video frame. Compression can be achieved when input blocks are predicted and encoded using previously-coded data. The process of encoding input blocks using spatially neighboring samples of previously-coded blocks within the same frame is referred to as intra-prediction. Intra-prediction attempts to exploit spatial redundancies in the data. The encoding of input blocks using similar regions from previously-coded frames, found using a motion estimation process, is referred to as inter-prediction. Inter-prediction attempts to exploit temporal redundancies in the data. The motion estimation process can generate a motion vector that specifies, for example, the location of a matching region in a reference frame relative to an input block that is being encoded. Most motion estimation processes consist of two main steps: initial motion estimation, which provides a first, rough estimate of the motion vector (and corresponding temporal prediction) for a given input block, and fine motion estimation, which performs a local search in the neighborhood of the initial estimate to determine a more precise estimate of the motion vector (and corresponding prediction) for that input block.
The encoder may measure the difference between the original input block and its prediction to generate a residual. The predictions, motion vectors (for inter-prediction), residuals, and related data can be combined with other processes such as a spatial transform, a quantizer, an entropy encoder, and a loop filter to create an efficient encoding of the video data. The residual that has been transformed and quantized can be inverse-quantized, inverse-transformed, and added back to the prediction, assembled into a decoded frame, and stored in a framestore. Details of such encoding techniques for video will be familiar to a person skilled in the art.
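The residual and reconstruction steps described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not a description of any particular codec: the spatial transform, entropy coder, and loop filter are omitted, and a uniform scalar quantizer with a hypothetical `step` parameter stands in for the transform-plus-quantizer stage. All function names are illustrative.

```python
def residual(block, prediction):
    """Element-wise difference between an input block and its prediction."""
    return [[b - p for b, p in zip(brow, prow)]
            for brow, prow in zip(block, prediction)]


def quantize(res, step):
    """Uniform scalar quantization (assumed stand-in for transform + quantizer)."""
    return [[int(round(r / step)) for r in row] for row in res]


def dequantize(levels, step):
    """Map quantization levels back to approximate residual values."""
    return [[q * step for q in row] for row in levels]


def reconstruct(prediction, decoded_res):
    """Add the decoded residual back to the prediction and clip to 8-bit range."""
    return [[max(0, min(255, p + r)) for p, r in zip(prow, rrow)]
            for prow, rrow in zip(prediction, decoded_res)]


if __name__ == "__main__":
    block = [[100, 102], [98, 101]]
    prediction = [[99, 100], [99, 100]]
    res = residual(block, prediction)
    decoded = dequantize(quantize(res, 4), 4)
    print(reconstruct(prediction, decoded))
```

With `step` equal to 1 the round trip is lossless in this sketch; larger steps trade reconstruction accuracy for fewer distinct levels to encode.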
MPEG-2 (H.262) and H.264 (MPEG-4 Part 10, Advanced Video Coding [AVC]), hereafter referred to as MPEG-2 and H.264, respectively, are two codec standards for video compression that achieve high quality video representation at relatively low bitrates. The basic coding units for MPEG-2 and H.264 are 16×16 macroblocks. H.264 is the most recent widely-accepted standard in video compression and is generally thought to be twice as efficient as MPEG-2 at compressing video data.
The basic MPEG standard defines three types of frames (or pictures), based on how the input blocks in the frame are encoded. An I-frame (intra-coded picture) is encoded using only data present in the frame itself and thus consists of only intra-predicted blocks. A P-frame (predicted picture) is encoded via forward prediction, using data from previously-decoded I-frames or P-frames, also known as reference frames. P-frames can contain either intra blocks or (forward-)predicted blocks. A B-frame (bi-predicted picture) is encoded via bi-directional prediction, using data from both previous and subsequent frames. B-frames can contain intra, (forward-)predicted, or bi-predicted blocks.
A particular set of reference frames is termed a Group of Pictures (GOP). The GOP contains only the decoded pels within each reference frame and does not include information as to how the input blocks or frames themselves were originally encoded (I-frame, B-frame, or P-frame). Older video compression standards such as MPEG-2 use one reference frame (in the past) to predict P-frames and two reference frames (one past, one future) to predict B-frames. By contrast, more recent compression standards such as H.264 and HEVC (High Efficiency Video Coding) allow the use of multiple reference frames for P-frame and B-frame prediction. While reference frames are typically temporally adjacent to the current frame, the standards also allow reference frames that are not temporally adjacent.
Conventional inter-prediction is based on block-based motion estimation and compensation (BBMEC). The BBMEC process searches for the best match between the target block (the current input block being encoded) and same-sized regions within previously-decoded reference frames. When such a match is found, the encoder may transmit a motion vector, which serves as a pointer to the best match's position in the reference frame. For computational reasons, the BBMEC search process is limited, both temporally in terms of reference frames searched and spatially in terms of neighboring regions searched. This means that “best possible” matches are not always found, especially with rapidly changing data.
The simplest form of the BBMEC process initializes the motion estimation using a (0, 0) motion vector, meaning that the initial estimate of a target block is the co-located block in the reference frame. Fine motion estimation is then performed by searching in a local neighborhood for the region that best matches (i.e., has lowest error in relation to) the target block. The local search may be performed by exhaustive query of the local neighborhood (termed here full block search) or by any one of several “fast search” methods, such as a diamond or hexagonal search.
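The full block search described above can be sketched as follows. This is a minimal illustration, assuming frames are 2-D lists of integer pel values, the sum of absolute differences (SAD) as the error measure, and integer-pel motion vectors only; the names `sad`, `full_block_search`, and `radius` are hypothetical and not taken from any standard.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size regions."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(frame_a[ay + dy][ax + dx] - frame_b[by + dy][bx + dx])
    return total


def full_block_search(cur, ref, x, y, size, radius):
    """Exhaustively test every offset within +/-radius of the co-located block.

    The search is initialized with the (0, 0) motion vector, then every
    in-bounds candidate in the local neighborhood is evaluated.
    """
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_err = sad(cur, x, y, ref, x, y, size)
    for my in range(-radius, radius + 1):
        for mx in range(-radius, radius + 1):
            rx, ry = x + mx, y + my
            # Skip candidate regions that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            err = sad(cur, x, y, ref, rx, ry, size)
            if err < best_err:
                best_mv, best_err = (mx, my), err
    return best_mv, best_err
```

The cost of this search grows with the square of `radius`, which is the motivation for the fast search patterns mentioned above.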
An improvement on the BBMEC process that has been present in standard codecs since later versions of MPEG-2 is the enhanced predictive zonal search (EPZS) method [Tourapis, A., 2002, “Enhanced predictive zonal search for single and multiple frame motion estimation,” Proc. SPIE 4671, Visual Communications and Image Processing, pp. 1069-1078]. The EPZS method considers a set of motion vector candidates for the initial estimate of a target block, based on the motion vectors of neighboring blocks that have already been encoded, as well as the motion vectors of the co-located block (and neighbors) in the previous reference frame. The EPZS method hypothesizes that the video's motion vector field has some spatial and temporal redundancy, so it is logical to initialize motion estimation for a target block with motion vectors of neighboring blocks, or with motion vectors from nearby blocks in already-encoded frames. Once the set of initial estimates has been gathered, the EPZS method narrows the set via approximate rate-distortion analysis, after which fine motion estimation is performed.
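The candidate-based initialization can be illustrated with a simplified sketch in the spirit of EPZS. It is not the published algorithm: the predictor set here is limited to the zero vector, caller-supplied spatial and temporal neighbor vectors, and their component-wise median, and the "approximate rate-distortion analysis" is assumed to be SAD plus a crude rate proxy, `lam * (|mx| + |my|)`. All names and the `lam` parameter are hypothetical.

```python
def sad(cur, ref, x, y, mv, size):
    """SAD between the target block at (x, y) and the region mv points to."""
    mx, my = mv
    return sum(abs(cur[y + j][x + i] - ref[y + my + j][x + mx + i])
               for j in range(size) for i in range(size))


def in_bounds(ref, x, y, mv, size):
    """True if the region addressed by mv lies entirely inside ref."""
    rx, ry = x + mv[0], y + mv[1]
    return 0 <= rx and 0 <= ry and rx + size <= len(ref[0]) and ry + size <= len(ref)


def median_mv(mvs):
    """Component-wise median of a non-empty list of motion vectors."""
    xs = sorted(m[0] for m in mvs)
    ys = sorted(m[1] for m in mvs)
    return (xs[len(xs) // 2], ys[len(ys) // 2])


def initial_estimate(cur, ref, x, y, size, spatial_mvs, temporal_mvs, lam=1.0):
    """Pick the lowest-cost candidate as the starting point for fine search."""
    candidates = [(0, 0)] + list(spatial_mvs) + list(temporal_mvs)
    if spatial_mvs:
        candidates.append(median_mv(spatial_mvs))
    best, best_cost = None, None
    for mv in dict.fromkeys(candidates):  # drop duplicates, keep order
        if not in_bounds(ref, x, y, mv, size):
            continue
        # Assumed cost: distortion plus a rough proxy for motion vector bits.
        cost = sad(cur, ref, x, y, mv, size) + lam * (abs(mv[0]) + abs(mv[1]))
        if best_cost is None or cost < best_cost:
            best, best_cost = mv, cost
    return best, best_cost
```

The returned vector would then seed a local refinement step, so only a handful of candidates are evaluated before fine motion estimation begins.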
For any given target block, the encoder may generate multiple inter-predictions to choose from. The predictions may result from multiple prediction processes (e.g., BBMEC, EPZS, or model-based schemes). The predictions may also differ based on the subpartitioning of the target block, where different motion vectors are associated with different subpartitions of the target block and the respective motion vectors each point to a subpartition-sized region in a reference frame. The predictions may also differ based on the reference frames to which the motion vectors point; as noted above, recent compression standards allow the use of multiple reference frames. Selection of the best prediction for a given target block is usually accomplished through rate-distortion optimization, where the best prediction is the one that minimizes the rate-distortion metric D+λR, where the distortion D measures the error between the target block and the prediction, while the rate R quantifies the cost (in bits) to encode the prediction and λ is a scalar weighting factor.
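The rate-distortion selection above reduces to picking the candidate with the smallest D+λR. The short sketch below illustrates this, assuming each candidate prediction has already been assigned a distortion and a rate in bits; the dictionary keys and the sample numbers are invented for illustration only.

```python
def select_prediction(candidates, lam):
    """Return the candidate that minimizes distortion + lam * rate."""
    return min(candidates, key=lambda c: c["distortion"] + lam * c["rate"])


if __name__ == "__main__":
    # Hypothetical candidates for one target block.
    candidates = [
        {"name": "inter 16x16", "distortion": 400, "rate": 20},
        {"name": "inter 8x8", "distortion": 250, "rate": 60},
        {"name": "intra", "distortion": 600, "rate": 10},
    ]
    for lam in (1, 10, 50):
        print(lam, select_prediction(candidates, lam)["name"])
```

A small λ favors the low-distortion, high-rate subpartitioned prediction, while a large λ favors whichever candidate is cheapest to encode, which is how the weighting factor steers the trade-off between quality and bitrate.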
Historically, model-based compression schemes have also been proposed to avoid the limitations of BBMEC prediction. These model-based compression schemes (the best known of which is perhaps the MPEG-4 Part 2 standard) rely on the detection and tracking of objects or features (defined generally as “components of interest”) in the video and a method for encoding those features/objects separately from the rest of the video frame. Feature/object detection and tracking occur independently of the spatial search in standard motion estimation processes, so feature/object tracks can give rise to a different set of predictions than is achievable through standard motion estimation.