Digital video takes up a significant amount of storage space or bandwidth in its original uncompressed form. Video coding, or video compression, is the process of compressing (encoding) and decompressing (decoding) video. Video compression makes it possible to transmit or store digital video in a smaller, compressed form. Many video compression standards, including MPEG-2, MPEG-4 and H.264, are well known in the art today and provide efficient methods to compress video data.
These standards use a variety of compression techniques, including inter frame prediction. Inter frame prediction (or inter frame coding) is based on the premise that in most video scenes, the background remains relatively stable while action is taking place in the foreground. Thus, typically video images vary little over short durations of time. As such, a significant part of the video information in a stream of images is predictable and therefore redundant. Accordingly, a primary objective of the inter frame prediction technique is to remove the redundant information in a video stream comprising one or more neighboring frames and leave only the true or unpredictable information. Inter frame prediction, therefore, takes advantage of the temporal redundancy between neighboring frames in order to attain higher compression rates.
Rather than storing a separate, complete image for every frame of video, most video compression standards use inter frame prediction, which comprises providing one or more reference frames and building the majority of frames by noting how the reference frames change. For example, in some of the more popular video compression standards, a single complete image is encoded at the beginning of a sequence; such a complete image is described as an intra frame (I frame). The I frame is a reference frame. It is compressed without reference to other frames and thus contains an entire frame of video information. As a result, it can be decoded without any additional information. In most video compression standards, two other frame types, both inter frames, are also used: P frames and B frames.
Predicted frames (or P frames) generally are encoded with reference to a past frame (either an I frame or a previous P frame) and in general are used as a reference for subsequent P frames. B frames provide higher compression than P frames but require both a past and a future reference frame in order to be encoded.
Typically, an I frame needs to be transmitted periodically so that the decoder can synchronize to the video stream; otherwise, it would be impossible to obtain the reference images. Since the images in between the reference I frames typically vary only to a small degree, only the image differences, in the form of P frames and B frames, are captured, compressed and stored. How well a video compression technique performs depends largely on the accuracy of its estimate of these differences.
When a stream of video frames involves a moving object, the estimate also needs to include motion compensation. To this end, each inter coded frame is divided into blocks known as macroblocks, which are regions of image pixels. Typically, each frame can be sub-divided into either 16×16 blocks or 8×8 blocks; however, different encoding techniques use different block partitioning techniques.
In conventional systems, instead of directly encoding raw pixel values for each block, the encoder tries to find a block similar to the one it is encoding in a previously encoded frame, typically the reference frame. This search is performed by a block matching procedure, which is the motion-compensated estimation technique most commonly used in conventional systems because its consistency and simplicity make it well suited to hardware implementation. If the encoder finds a match, the block can be encoded by a vector, known as a motion vector, which points to the position of the matching block in the reference frame. Accordingly, motion vectors are simply displacement measurements of objects between successive video frames and are transmitted as part of the compression scheme to be used in decoding the compressed video frame.
The process used to determine motion vectors is called motion estimation. Applying the motion vectors to an image to synthesize the transformation to the next image is called motion compensation. The combination of motion estimation and motion compensation is an integral part of many well-known video compression protocols.
In most cases, the encoder will succeed in finding a match in the reference frame. However, the block found is not likely to be an exact match to the block being encoded and, as a result, the encoder will compute the differences between the two blocks. These differences are known as the prediction error and are transformed and sent to the decoder along with the motion vector. The decoder uses both the motion vector pointing to the matched block and the prediction error to recover the original block.
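The relationship between the matched block, the prediction error and the recovered block can be illustrated with a minimal Python sketch. It is not taken from any standard: the function names are hypothetical, frames are assumed to be lists of rows of integer sample values, motion vectors are assumed to be (row, column) offsets, and the transform applied to the prediction error before transmission is omitted.

```python
def get_block(frame, top, left, size):
    """Return the size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def prediction_error(target_block, matched_block):
    """Encoder side: per-pixel difference, target minus matched block."""
    return [[t - m for t, m in zip(t_row, m_row)]
            for t_row, m_row in zip(target_block, matched_block)]


def reconstruct_block(reference, top, left, motion_vector, residual):
    """Decoder side: locate the matched block via the motion vector,
    then add the prediction error to recover the original block."""
    dy, dx = motion_vector
    matched = get_block(reference, top + dy, left + dx, len(residual))
    return [[m + r for m, r in zip(m_row, r_row)]
            for m_row, r_row in zip(matched, residual)]


# Illustrative 4x4 reference frame and a 2x2 target block located at (0, 0).
reference = [[10, 10, 10, 10],
             [10, 50, 60, 10],
             [10, 70, 80, 10],
             [10, 10, 10, 10]]
target_block = [[52, 60],
                [70, 79]]

motion_vector = (1, 1)                      # assumed result of block matching
matched = get_block(reference, 1, 1, 2)     # [[50, 60], [70, 80]]
residual = prediction_error(target_block, matched)   # [[2, 0], [0, -1]]
recovered = reconstruct_block(reference, 0, 0, motion_vector, residual)
```

In this example the residual contains mostly zeros, which is what makes it cheaper to code than the raw pixel values. Because the transform step is left out, `recovered` equals `target_block` exactly here.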
FIG. 1 illustrates how a prior art inter-frame prediction process is carried out. Instead of encoding the raw pixel values for macroblock 390 in target frame 360, the encoder uses a block matching procedure to try to find a block similar to it in reference frame 350. After finding macroblock 380 in the reference frame 350 and identifying it as the best match, the encoder will generate a motion vector 310 for macroblock 390 under the assumption that all the pixels within that block have the same motion activity. The block 380 in the reference frame is identified through a search based on a best match selection criterion relative to a block from the target frame 360. The best match selection criterion is typically designed to ensure a minimized estimation difference (or prediction error).
The encoder will then compute the difference between macroblock 380 and macroblock 390 and transmit the calculated prediction error along with the computed motion vector 310 to the decoder 330. Using both the prediction error and the computed motion vector, the decoder can then recover macroblock 390.
Conventional systems employ many types of block matching procedures to find best matches that result in the smallest prediction error between block 390 from the target frame 360 and block 380 located in a search area within the reference frame 350. The most effective search, but also the least efficient and most computationally expensive, is a full exhaustive search, in which every block within the search area of the reference frame is examined and the corresponding computation made. The match selection criterion used for the full search may be the Sum of Absolute Differences (SAD), but other match selection criteria, including mean absolute difference and mean square difference, can also be used.
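A full exhaustive search using SAD as the match selection criterion might be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of any particular encoder: the names `sad` and `full_search` are hypothetical, frames are lists of rows of integer sample values, the search area is assumed to be a square of ±`search_range` pixels around the block's own position, and candidates extending beyond the frame are simply skipped.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def full_search(target, reference, top, left, size, search_range):
    """Examine every candidate position in the search area and return
    (motion_vector, cost) for the candidate with the smallest SAD.

    The motion vector is (dy, dx), relative to the block's own position.
    """
    height, width = len(reference), len(reference[0])
    current = [row[left:left + size] for row in target[top:top + size]]
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate falls outside the reference frame
            candidate = [row[x:x + size] for row in reference[y:y + size]]
            cost = sad(current, candidate)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

With a search range of ±R pixels, this loop evaluates up to (2R + 1)² candidate positions per block, each requiring one absolute difference per pixel in the block, which is the source of the computational expense noted above.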
In order to reduce inefficiencies, processing effort and processing time, other search techniques have also been developed, including cross search, spiral search, three step search, four step search, orthogonal search, hierarchical search and diamond search. These procedures attempt to equal the quality of an exhaustive search without the attendant time and computational effort. However, all such high speed methods of motion estimation are subject to certain shortcomings.
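As one example of these reduced-complexity techniques, a three step search can be sketched in Python as shown below. The sketch follows the commonly described form of the procedure (step sizes of 4, 2 and 1 pixels, giving a reach of ±7) and is an assumption-laden illustration only: the function name, the (row, column) motion vector convention, the use of SAD as the cost, and the skipping of out-of-frame candidates are all choices made here, not details taken from the text above.

```python
def three_step_search(target, reference, top, left, size, initial_step=4):
    """Coarse-to-fine search: test the 8 neighbours of the current best
    position at the current step size, re-centre on the best candidate,
    halve the step, and repeat until the step size of 1 has been used.

    Returns (motion_vector, cost, positions_evaluated).
    """
    height, width = len(reference), len(reference[0])
    current = [row[left:left + size] for row in target[top:top + size]]

    def cost_at(dy, dx):
        """SAD at displacement (dy, dx), or None if outside the frame."""
        y, x = top + dy, left + dx
        if y < 0 or x < 0 or y + size > height or x + size > width:
            return None
        return sum(abs(current[i][j] - reference[y + i][x + j])
                   for i in range(size) for j in range(size))

    # The search starts at the block's own position in the reference frame.
    best_mv, best_cost = (0, 0), cost_at(0, 0)
    evaluated = 1
    step = initial_step
    while step >= 1:
        center = best_mv
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if oy == 0 and ox == 0:
                    continue  # centre was already evaluated
                mv = (center[0] + oy, center[1] + ox)
                cost = cost_at(*mv)
                if cost is None:
                    continue
                evaluated += 1
                if cost < best_cost:
                    best_mv, best_cost = mv, cost
        step //= 2
    return best_mv, best_cost, evaluated
```

With the default step sizes, this sketch evaluates at most 25 positions per block, compared with up to 225 for an exhaustive search over the same ±7 range. The trade-off is that each step commits to the best candidate seen so far, so the search can settle on a local minimum of the matching cost rather than the true best match.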
One shortcoming of conventional motion estimation methods is that their block matching procedures begin searching the reference frame for a best match at the same position that the block to be encoded (the current block) holds in the corresponding target frame. Some procedures also incorporate positions corresponding to neighboring blocks in the search for the current block within the reference frame. However, in real-time encoding where significant transitions occur in the video frames over a short period of time, the encoder may not be able to find the best matching block in time if precious computational time is spent searching the reference frame around the position that the current block holds in the target frame. This makes it difficult to ensure real-time transmission of encoded video data.