Current video compression standards such as MPEG-1/2/4 and H.26x employ a hybrid of block-based motion compensated prediction and transform coding for representing variations in picture content due to moving objects. Each video frame may be compared with one or two other previously encoded frames. These previously encoded frames are referred to as reference frames. In most standards, frames that are encoded only with respect to themselves and without reference to another frame are called Intra coded, or I frames. Predicted (P-) frames are coded with respect to the nearest preceding Intra coded (I-frame) or P-frame. Bi-directionally predicted (B-) frames use the nearest past and future I- or P-frames as reference.
In block-based motion estimation, a current frame is divided into rectangular blocks and an attempt is made to match each block with a block from a reference frame, which would serve as the predictor of the current block. The difference between this predictor block and the current block is then encoded. The (x,y) offset of the current block from the predictor block is characterized as a motion vector. A significant improvement in compression efficiency is achieved since usually the ‘difference block’ has a much lower energy or information content than the original block.
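The block-matching step described above can be sketched as an exhaustive (full) search. This is a minimal illustration only, not the method of any particular encoder. It assumes that frames are 2-D lists of luma samples, that the sum of absolute differences (SAD) is the matching cost, and that the motion vector points from the current block position to the predictor position. The names `sad` and `full_search` are hypothetical.

```python
def sad(cur, ref, bx, by, rx, ry, size):
    """Sum of absolute differences between the size x size block of `cur`
    at (bx, by) and the block of `ref` at (rx, ry)."""
    total = 0
    for j in range(size):
        for i in range(size):
            total += abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
    return total


def full_search(cur, ref, bx, by, size, search_range):
    """Try every offset within +/- search_range samples and return
    (motion_vector, cost) for the lowest-cost predictor block.

    Assumed convention: motion_vector = (dx, dy) such that the predictor
    block sits at (bx + dx, by + dy) in the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = bx + dx, by + dy
            # Skip candidates that fall partly outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

A low SAD for the chosen predictor corresponds to a low-energy difference block, which is what makes the residual cheap to encode. The cost of this search grows with the square of the search range, which is why practical encoders rely on faster search strategies.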
The new ITU H.264/MPEG-4 AVC standard extends the concept of motion compensated prediction in a number of ways. It allows a frame (or blocks therein) to be compared with several other frames (or blocks therein). Up to 16 reference frames or 32 reference fields may be used in the comparison. Moreover, the reference frames no longer have to be the nearest (past or future) I- or P-frame. The reference frames can be located anywhere in the video sequence, as long as they are encoded prior to the frames that use them as a reference. The number of reference frames that can be used to encode a frame is limited by the amount of resources (memory and CPU) available in both the encoder and decoder, subject to the maximum cap of 16 frames imposed by the H.264 specification.
This expanded flexibility in motion compensated prediction provided by the H.264 standard is particularly beneficial in scenes where the video content toggles between multiple cameras, or when objects in a scene follow an oscillatory motion (e.g., a person's head nodding or eyes blinking), or when an object is temporarily occluded by another one. In these situations the most appropriate reference frame for encoding a given block may not be the one immediately preceding or subsequent to the frame to be encoded, but might be several frames away—hence the notion of Long Term Prediction (LTP).
This new flexibility poses two challenges for designers of H.264 and other encoders employing LTP: (1) developing a low-complexity algorithm to intelligently select the best frames that can serve as reference frames from the list of all previously encoded frames and (2) developing a low-complexity algorithm to efficiently search through the selected reference frames.
Despite the progress made in the last two decades on Fast Motion Estimation (FME) algorithms, motion estimation with a single reference frame is already the most expensive operation during video encoding in terms of both CPU and memory usage. Having several reference frames instead of one therefore significantly impacts encoder performance. As such, almost all of the research into LTP has been concentrated on the second challenge noted above—i.e. how to efficiently search through a given set of reference frames. Yet the first challenge—appropriate selection of reference frames—can have a significant impact on the effectiveness of the LTP tool for improving compression efficiency.
Current H.264 implementations employ a sliding window approach in which the N (typically N ≤ 5) frames immediately preceding the current frame are selected as potential reference frames. This selection approach is not always effective, particularly in video sequences where the image content changes (e.g., where the image toggles between two cameras, or where there is periodic occlusion of an object) over a period longer than five frames (about 167 ms at 30 fps) or even sixteen frames (about 0.5 seconds at 30 fps). In other words, this approach is insensitive to low frequency periodic motion, and cannot take advantage of content redundancy outside of the window length.
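The sliding window selection can be illustrated with a short sketch. It assumes that frames are identified by their encoding order index and that every encoded frame is eligible to become a reference. The function names `sliding_window_refs` and `window_span_ms` are hypothetical and are not taken from any encoder implementation.

```python
from collections import deque


def sliding_window_refs(num_frames, n):
    """Return a list of (frame_index, reference_indices) pairs, where the
    references are the (up to) n frames immediately preceding each frame."""
    window = deque(maxlen=n)  # the oldest entry is dropped automatically
    result = []
    for idx in range(num_frames):
        result.append((idx, list(window)))
        window.append(idx)  # the frame just encoded becomes a reference
    return result


def window_span_ms(n, fps):
    """Time span covered by an n-frame window, in milliseconds."""
    return 1000.0 * n / fps
```

With `n = 5`, frame 7 can only refer to frames 2 through 6. Any content that last appeared at frame 1 or earlier is unavailable as a predictor, regardless of how well it would match.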
A variation of the sliding window approach is to use the N most recent frames that have a k-frame separation between them—i.e. every k frames, replace the oldest reference frame with the most recently encoded frame. Yet this approach still suffers from the fact that the reference frames are selected independently of their content, and thus may or may not be good predictors for the frame to be encoded.
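A sketch of this k-frame separation variant follows. It assumes that frames are identified by encoding order index and that a frame is admitted to the reference set when its index is a multiple of k. The choice of phase is an assumption, since the text does not specify it, and the name `spaced_window_refs` is hypothetical.

```python
from collections import deque


def spaced_window_refs(num_frames, n, k):
    """Return a list of (frame_index, reference_indices) pairs for a
    reference set of at most n frames spaced k frames apart."""
    window = deque(maxlen=n)  # appending to a full deque evicts the oldest
    result = []
    for idx in range(num_frames):
        result.append((idx, list(window)))
        # Every k frames, the most recently encoded frame replaces the
        # oldest reference (assumed phase: indices that are multiples of k).
        if idx % k == 0:
            window.append(idx)
    return result
```

With `n = 3` and `k = 3`, frame 10 would refer to frames 3, 6 and 9. The window now spans roughly n times k frames rather than n, but membership is still decided by position alone, not by content.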
Another approach is to select the N frames such that each represents a different scene or shot. This way, when the video content toggles between multiple cameras, there will be a good chance that one of the reference frames will be a good predictor for the current frame. This approach would leverage the large body of research in automatic scene and shot detection algorithms. The use of this scheme brings up several challenges. These include determining how shots or frames are classified, how reference frames are selected, dealing with single shot sequences, and distinguishing between real scene changes and large motion of objects or changes in lighting.