Video compression can be considered the process of representing digital video data in a form that uses fewer bits when stored or transmitted. Video compression algorithms can achieve compression by exploiting redundancies in the video data, whether spatial, temporal, or color-space. Video compression algorithms typically segment the video data into portions, such as groups of frames and groups of pels, to identify areas of redundancy within the video that can be represented with fewer bits than required by the original video data. When these redundancies in the data are exploited, greater compression can be achieved. An encoder can be used to transform the video data into an encoded format, while a decoder can be used to transform encoded video back into a form comparable to the original video data. The implementation of the encoder/decoder is referred to as a codec.
Standard encoders divide a given video frame into non-overlapping coding units or macroblocks (rectangular regions of contiguous pels) for encoding. The macroblocks (herein referred to more generally as “input blocks” or “data blocks”) are typically processed in a traversal order of left to right and top to bottom in a video frame. Compression can be achieved when input blocks are predicted and encoded using previously-coded data. The process of encoding input blocks using spatially neighboring samples of previously-coded blocks within the same frame is referred to as intra-prediction. Intra-prediction attempts to exploit spatial redundancies in the data. The encoding of input blocks using similar regions from previously-coded frames, found using a motion estimation algorithm, is referred to as inter-prediction. Inter-prediction attempts to exploit temporal redundancies in the data. The motion estimation algorithm can generate a motion vector that specifies, for example, the location of a matching region in a reference frame relative to an input block that is being encoded. Most motion estimation algorithms consist of two main steps: initial motion estimation, which provides a first, rough estimate of the motion vector (and corresponding temporal prediction) for a given input block, and fine motion estimation, which performs a local search in the neighborhood of the initial estimate to determine a more precise estimate of the motion vector (and corresponding prediction) for that input block.
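As a rough illustration of intra-prediction, the sketch below predicts a block from the mean of the previously-coded pels directly above and to the left of it (a scheme commonly called DC prediction). The function name `dc_intra_predict`, the representation of a frame as a list of rows, and the mid-gray fallback value of 128 are assumptions made for this example and are not taken from any particular standard.

```python
def dc_intra_predict(frame, top, left, size):
    """Predict a size x size block from the mean of its already-coded neighbors.

    frame: list of rows of integer pel values (hypothetical representation).
    (top, left): position of the block's top-left pel within the frame.
    """
    neighbors = []
    if top > 0:
        # Row of pels directly above the block.
        neighbors.extend(frame[top - 1][left:left + size])
    if left > 0:
        # Column of pels directly to the left of the block.
        neighbors.extend(frame[r][left - 1] for r in range(top, top + size))
    # With no coded neighbors (top-left block), fall back to mid-gray.
    dc = round(sum(neighbors) / len(neighbors)) if neighbors else 128
    # Every pel in the predicted block takes the same DC value.
    return [[dc] * size for _ in range(size)]
```

Because blocks are traversed left to right and top to bottom, the pels above and to the left of a block have already been coded by the time the block is predicted, which is why only those neighbors are consulted here.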
The encoder may measure the difference between the data to be encoded and the prediction to generate a residual. The residual can provide the difference between a predicted block and the original input block. The predictions, motion vectors (for inter-prediction), residuals, and related data can be combined with other processes such as a spatial transform, a quantizer, an entropy encoder, and a loop filter to create an efficient encoding of the video data. The residual that has been transformed and quantized can be inverse-quantized and inverse-transformed, added back to the prediction, assembled into a decoded frame, and stored in a framestore. Details of such encoding techniques for video will be familiar to a person skilled in the art.
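The residual path described above can be sketched as follows. This is a minimal illustration under stated assumptions: the spatial transform, entropy encoder, and loop filter are omitted, the quantizer is a plain uniform quantizer with a hypothetical integer step size `step`, and all function names are invented for the example.

```python
def compute_residual(block, prediction):
    """Per-pel difference between the original block and its prediction."""
    return [[b - p for b, p in zip(brow, prow)]
            for brow, prow in zip(block, prediction)]

def quantize(residual, step):
    """Uniform quantization of each residual value to an integer level."""
    def level(v):
        # Round the magnitude to the nearest multiple of step, keep the sign.
        mag = (abs(v) + step // 2) // step
        return mag if v >= 0 else -mag
    return [[level(v) for v in row] for row in residual]

def reconstruct(prediction, levels, step):
    """Dequantize the levels and add them back to the prediction."""
    return [[min(255, max(0, p + lv * step))  # clip to the 8-bit pel range
             for p, lv in zip(prow, lrow)]
            for prow, lrow in zip(prediction, levels)]

block = [[52, 60], [47, 41]]
prediction = [[50, 50], [50, 50]]
levels = quantize(compute_residual(block, prediction), step=4)
decoded = reconstruct(prediction, levels, step=4)
```

The reconstructed block, rather than the original, is what would be stored in the framestore, so that the encoder predicts from the same data the decoder will have available.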
MPEG-2 (H.262) and H.264 (MPEG-4 Part 10, Advanced Video Coding [AVC]), hereafter referred to as MPEG-2 and H.264, respectively, are two codec standards for video compression that achieve high quality video representation at relatively low bitrates. The basic coding units for MPEG-2 and H.264 are 16×16 macroblocks. H.264 is the most recent widely-accepted standard in video compression and is generally thought to be twice as efficient as MPEG-2 at compressing video data.
The basic MPEG standard defines three types of frames (or pictures), based on how the input blocks in the frame are encoded. An I-frame (intra-coded picture) is encoded using only data present in the frame itself. Generally, when the encoder receives video signal data, the encoder creates I-frames first and segments the video frame data into input blocks that are each encoded using intra-prediction. An I-frame consists of only intra-predicted blocks. I-frames can be costly to encode, as the encoding is done without the benefit of information from previously-decoded frames. A P-frame (predicted picture) is encoded via forward prediction, using data from previously-decoded I-frames or P-frames, also known as reference frames. P-frames can contain either intra blocks or (forward-)predicted blocks. A B-frame (bi-predicted picture) is encoded via bi-directional prediction, using data from both previous and subsequent frames. B-frames can contain intra, (forward-)predicted, or bi-predicted blocks.
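The block types permitted in each frame type, as described above, can be summarized in a small lookup table. The labels `"intra"`, `"forward"`, and `"bi"` and the helper `is_block_allowed` are hypothetical names chosen for this sketch.

```python
# Block types that may appear in each frame type, per the description above.
ALLOWED_BLOCK_TYPES = {
    "I": {"intra"},
    "P": {"intra", "forward"},
    "B": {"intra", "forward", "bi"},
}

def is_block_allowed(frame_type, block_type):
    """Return True if a block of block_type may appear in a frame of frame_type."""
    return block_type in ALLOWED_BLOCK_TYPES[frame_type]
```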
A particular set of reference frames is termed a Group of Pictures (GOP). The GOP contains only the decoded pels within each reference frame and does not include information as to how the input blocks or frames themselves were originally encoded (I-frame, B-frame, or P-frame). Older video compression standards such as MPEG-2 use one reference frame (in the past) to predict P-frames and two reference frames (one past, one future) to predict B-frames. By contrast, more recent compression standards such as H.264 and HEVC (High Efficiency Video Coding) allow the use of multiple reference frames for P-frame and B-frame prediction. While reference frames are typically temporally adjacent to the current frame, the standards also allow reference frames that are not temporally adjacent.
Conventional inter-prediction is based on block-based motion estimation and compensation (BBMEC). The BBMEC process searches for the best match between the target block (the current input block being encoded) and same-sized regions within previously-decoded reference frames. When such a match is found, the encoder may transmit a motion vector. The motion vector may include a pointer to the best match's position in the reference frame. One could conceivably perform exhaustive searches in this manner throughout the video “datacube” (height×width×frame index) to find the best possible matches for each input block, but exhaustive search is usually computationally prohibitive and increases the chances of selecting particularly poor motion vectors. As a result, the BBMEC search process is limited, both temporally in terms of reference frames searched and spatially in terms of neighboring regions searched. This means that “best possible” matches are not always found, especially with rapidly changing data.
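A spatially limited BBMEC search over a single reference frame might look like the sketch below. It assumes frames stored as lists of rows, uses the sum of absolute differences (SAD) as the matching error, and restricts the search to a window of plus or minus `search_range` pels around the target block. The names `sad` and `full_search` and the choice of SAD as the error measure are assumptions for illustration.

```python
def sad(frame_a, ay, ax, frame_b, by, bx, size):
    """Sum of absolute differences between two size x size regions."""
    return sum(
        abs(frame_a[ay + r][ax + c] - frame_b[by + r][bx + c])
        for r in range(size)
        for c in range(size)
    )

def full_search(cur, ref, top, left, size, search_range):
    """Exhaustively search a limited window of ref for the best match.

    Returns (motion_vector, cost), where motion_vector = (dy, dx) is the
    offset of the best-matching region relative to the target block.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidate regions that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(cur, top, left, ref, y, x, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

The number of candidates grows with the square of `search_range`, and again with the number of reference frames searched, which is one way to see why the search is limited in practice.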
The simplest form of the BBMEC algorithm initializes the motion estimation using a (0, 0) motion vector, meaning that the initial estimate of a target block is the co-located block in the reference frame. Fine motion estimation is then performed by searching in a local neighborhood for the region that best matches (i.e., has lowest error in relation to) the target block. The local search may be performed by exhaustive query of the local neighborhood (termed here full block search) or by any one of several “fast search” methods, such as a diamond or hexagonal search.
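As one example of a fast search, the sketch below starts from the (0, 0) motion vector and repeatedly moves to whichever of the four neighboring offsets (up, down, left, right) lowers the matching error, stopping when the current center is the best. This is a simplified small-diamond pattern written for illustration; the function names, the SAD error measure, and the `max_steps` cap are assumptions, and practical codecs use more elaborate patterns.

```python
def block_cost(cur, ref, top, left, size, mv):
    """SAD between the target block in cur and the region of ref offset by mv.

    Returns None if the offset region falls outside the reference frame.
    """
    dy, dx = mv
    y, x = top + dy, left + dx
    if y < 0 or x < 0 or y + size > len(ref) or x + size > len(ref[0]):
        return None
    return sum(
        abs(cur[top + r][left + c] - ref[y + r][x + c])
        for r in range(size)
        for c in range(size)
    )

def small_diamond_search(cur, ref, top, left, size, max_steps=16):
    """Greedy local search from the (0, 0) motion vector.

    Returns (motion_vector, cost). The result is a local minimum of the
    matching error and is not guaranteed to be the global best match.
    """
    center = (0, 0)  # initial estimate: the co-located block
    best = block_cost(cur, ref, top, left, size, center)
    for _ in range(max_steps):
        cy, cx = center
        moved = False
        # Examine the four points of the small diamond around the old center.
        for cand in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
            cost = block_cost(cur, ref, top, left, size, cand)
            if cost is not None and cost < best:
                best, center, moved = cost, cand, True
        if not moved:
            break  # no neighbor improves on the current center
    return center, best
```

Compared with a full block search, this evaluates only a handful of candidates per step, at the price of possibly settling on a local rather than global minimum when the error surface is not smooth.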
An improvement on the BBMEC algorithm that has been present in standard codecs since later versions of MPEG-2 is the enhanced predictive zonal search (EPZS) algorithm [Tourapis, A., 2002, “Enhanced predictive zonal search for single and multiple frame motion estimation,” Proc. SPIE 4671, Visual Communications and Image Processing, pp. 1069-1078]. The EPZS algorithm considers a set of motion vector candidates for the initial estimate of a target block, based on the motion vectors of neighboring blocks that have already been encoded, as well as the motion vectors of the co-located block (and neighbors) in the previous reference frame. The algorithm hypothesizes that the video's motion vector field has some spatial and temporal redundancy, so it is logical to initialize motion estimation for a target block with motion vectors of neighboring blocks, or with motion vectors from nearby blocks in already-encoded frames. Once the set of initial estimates has been gathered, the EPZS algorithm narrows the set via approximate rate-distortion analysis, after which fine motion estimation is performed.
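The idea of choosing an initial estimate from spatial and temporal candidates can be sketched as below. This is only a simplified illustration in the spirit of the description above, not the EPZS algorithm as published: the candidate set, the use of a component-wise median as a predictor, the rate proxy (distance from that predictor), and the weight `lam` are all assumptions made for the example. The `distortion` argument is a hypothetical callable that returns the matching error for a motion vector, or None if the vector points outside the reference frame.

```python
def median_mv(mvs):
    """Component-wise median of a non-empty list of (dy, dx) vectors."""
    ys = sorted(mv[0] for mv in mvs)
    xs = sorted(mv[1] for mv in mvs)
    return ys[len(ys) // 2], xs[len(xs) // 2]

def select_initial_mv(distortion, spatial_mvs, temporal_mvs, lam=4):
    """Pick the candidate with the lowest approximate rate-distortion cost.

    spatial_mvs: motion vectors of already-encoded neighboring blocks.
    temporal_mvs: motion vectors of the co-located block (and neighbors)
                  in a previously-encoded frame.
    Returns (motion_vector, cost).
    """
    predictor = median_mv(spatial_mvs) if spatial_mvs else (0, 0)
    candidates = [(0, 0)] + list(spatial_mvs) + list(temporal_mvs) + [predictor]
    best_mv, best_cost = None, None
    for mv in dict.fromkeys(candidates):  # drop duplicates, keep order
        d = distortion(mv)
        if d is None:
            continue  # candidate points outside the reference frame
        # Crude rate proxy: vectors far from the predictor cost more bits.
        rate = abs(mv[0] - predictor[0]) + abs(mv[1] - predictor[1])
        cost = d + lam * rate
        if best_cost is None or cost < best_cost:
            best_mv, best_cost = mv, cost
    return best_mv, best_cost
```

The selected vector would then serve as the starting point for fine motion estimation, in place of the (0, 0) initialization used by the simplest form of BBMEC.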
Historically, model-based compression schemes have also been proposed to avoid the limitations of BBMEC prediction. These model-based compression schemes (the most well-known of which is perhaps the MPEG-4 Part 2 standard) rely on the detection and tracking of objects or features (defined generally as “components of interest”) in the video and a method for encoding those features/objects separately from the rest of the video frame. These model-based compression schemes, however, suffer from the challenge of segmenting video frames into object vs. non-object (feature vs. non-feature) regions. First, because objects can be of arbitrary size, their shapes need to be encoded in addition to their texture (color content). Second, the tracking of multiple moving objects can be difficult, and inaccurate tracking causes incorrect segmentation, usually resulting in poor compression performance. A third challenge is that not all video content is composed of objects or features, so there needs to be a fallback encoding scheme when objects/features are not present.