A video sequence consists of a number of still image frames presented in time sequence to create the appearance of continuous motion. High quality video usually comprises thirty or more frames per second. Thus, when digitizing a high resolution video clip, the required bandwidth increases rapidly. The amount of data required to represent even a single picture (still image) is the product of the frame's dimensions and the pixel depth. Thus, even 640×480 video with 256 levels per component, that is, 8 bits for each of the RGB or YUV elements of each pixel (24 bits per pixel), would require 0.9216 Megabytes per frame without compression. At thirty frames per second, that is a throughput of 27.648 Mbytes per second. However, because video is merely a sequence of frames, subsequent frames are often very similar in terms of their content and contain a lot of redundant data. When compressing video, this redundant data is removed to achieve data compression.
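The figures above can be checked with a short calculation. The following is a minimal sketch; the function names and the default of three 8-bit components per pixel are assumptions chosen only to mirror the example in the text.

```python
def raw_frame_bytes(width, height, components=3, bits_per_component=8):
    """Uncompressed size of a single frame, in bytes."""
    return width * height * components * bits_per_component // 8


def raw_throughput_bytes(width, height, frames_per_second,
                         components=3, bits_per_component=8):
    """Uncompressed data rate, in bytes per second."""
    per_frame = raw_frame_bytes(width, height, components, bits_per_component)
    return per_frame * frames_per_second


frame_bytes = raw_frame_bytes(640, 480)
rate_bytes = raw_throughput_bytes(640, 480, 30)

print(frame_bytes / 1e6)  # 0.9216 Megabytes per frame
print(rate_bytes / 1e6)   # 27.648 Mbytes per second
```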
In video compression applications, motion compensation describes a current frame in terms of where each block of that frame came from in a previous frame. Motion compensation reduces the amount of data throughput required to reproduce video by describing frames by their measured change from previous and subsequent frames.
Various techniques exist for performing motion compensation. A first approach is to simply subtract a reference frame from a given frame. The difference is called the residual and usually contains less information than the original frame. Thus, rather than encoding the frame, only the residual is encoded. The residual can be encoded at a lower bit-rate without degrading the image quality. The decoder can reconstruct the original frame by simply adding the reference frame back to the residual.
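The subtraction approach can be illustrated as follows. This is a sketch only: frames are modeled as small lists of rows of integer sample values, and the names residual and reconstruct are hypothetical helpers, not part of any standard.

```python
def residual(frame, reference):
    """Per-pixel difference between a frame and its reference frame."""
    return [[f - r for f, r in zip(frame_row, ref_row)]
            for frame_row, ref_row in zip(frame, reference)]


def reconstruct(res, reference):
    """Decoder side: add the reference frame back to the residual."""
    return [[d + r for d, r in zip(res_row, ref_row)]
            for res_row, ref_row in zip(res, reference)]


reference = [[10, 10, 10, 10],
             [10, 50, 50, 10],
             [10, 10, 10, 10]]
current = [[10, 10, 10, 10],
           [10, 52, 50, 10],
           [10, 10, 11, 10]]

res = residual(current, reference)
# Most residual samples are zero, which is why the residual is
# cheaper to encode than the frame itself.
print(res)  # [[0, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0]]
print(reconstruct(res, reference) == current)  # True
```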
Another technique is to estimate the motion of the whole scene and the objects in a video sequence. The motion is described by some parameters that have to be encoded in the bit-stream. The blocks of the predicted frame are approximated by appropriately translated blocks of the reference frame. This gives more accurate residuals than a simple subtraction. However, the bit-rate occupied by the parameters of the motion model can become quite large. This runs contrary to the goal of achieving high compression ratios.
Video frames are often processed in groups. One frame (usually the first) is encoded without motion compensation, just as a normal still image, that is, without reference to any other frame. This frame is called an I-frame or I-picture. The other frames are called P-frames or P-pictures, and each is predicted from the I-frame or P-frame that comes (temporally) immediately before it. Prediction schemes are described, for instance, as IPPPP, meaning that a group consists of one I-frame followed by four P-frames.
Frames can also be predicted from future frames. The future frames then need to be encoded before the predicted frames and thus, the encoding order does not necessarily match the real frame order. Such predicted frames are usually predicted from two directions, i.e. from the I- or P-frames that immediately precede or follow the predicted frame. These bidirectionally predicted frames are called B-frames.
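The reordering described above can be sketched as follows. The function coding_order is a hypothetical helper that takes the frame types of a group in display order and returns the frame indices in the order in which they would be encoded, on the assumption that every B-frame is emitted immediately after the next I- or P-frame that follows it. As a simplification, trailing B-frames with no following I- or P-frame in the group are simply appended at the end.

```python
def coding_order(display_types):
    """Return display-order indices rearranged into encoding order.

    display_types is a string such as "IBBPBBP", one letter per frame.
    """
    order = []
    pending_b = []  # B-frames waiting for the next I- or P-frame
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)
        else:
            # The I- or P-frame is encoded first, then the B-frames
            # that are predicted from it.
            order.append(index)
            order.extend(pending_b)
            pending_b = []
    order.extend(pending_b)  # simplification for trailing B-frames
    return order


print(coding_order("IPPPP"))    # [0, 1, 2, 3, 4]
print(coding_order("IBBPBBP"))  # [0, 3, 1, 2, 6, 4, 5]
```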
In block motion compensation, frames are partitioned into blocks of pixels (e.g. macroblocks of 16×16 pixels in MPEG). Each block is predicted from a block of equal size in the reference frame. The blocks are not transformed in any way apart from being shifted to the position of the predicted block. This shift is represented by a motion vector. The motion vectors are the parameters of this motion compensation model and have to be encoded into the bit-stream.
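One way to find the motion vector of a block is an exhaustive search over a small window of the reference frame. The sketch below is illustrative only: it assumes the sum of absolute differences (SAD) as the matching cost, which the text above does not prescribe, and the names sad and best_motion_vector, the 4×4 block size, and the search range are hypothetical choices made to keep the example small.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at
    (bx, by) and the block of ref displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, n, search):
    """Exhaustive search for the displacement with the lowest SAD."""
    height, width = len(ref), len(ref[0])
    best, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x0, y0 = bx + dx, by + dy
            # Skip candidates that fall outside the reference frame.
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best, best_cost


SIZE = 12
reference = [[x * x + 3 * y * y for x in range(SIZE)] for y in range(SIZE)]
# Current frame: the reference content moved 2 pixels right and 1 down.
current = [[reference[y - 1][x - 2] if y >= 1 and x >= 2 else 0
            for x in range(SIZE)] for y in range(SIZE)]

mv, cost = best_motion_vector(current, reference, bx=4, by=4, n=4, search=3)
print(mv, cost)  # (-2, -1) 0
```

The vector (-2, -1) points from the block in the current frame back to where its content lies in the reference frame, and a cost of zero means the shifted block matches exactly.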
Existing block matching methods may be performed in software, or may be implemented by a special-purpose hardware device. Software implementations have the disadvantage of being slow, whereas hardware solutions often lack the flexibility needed to support a wide range of different video encoding standards. A specific problem associated with both software and hardware techniques is that of memory alignment. To achieve high performance motion estimation, the pixels of the reference frame should be retrieved from memory in groups of 8 or even 16. However, blocks of pixels from the reference frame are not guaranteed to be located in memory at an address that is an integer multiple of 8. This may require non-aligned accesses, with extra hardware and additional memory access cycles, and is therefore one problem with existing methods.
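The alignment problem can be made concrete with a small calculation. The sketch below assumes one byte per pixel, a row-major frame buffer whose base address is itself aligned, and a memory that delivers aligned words of 8 bytes; the stride of 640 bytes and all function names are hypothetical.

```python
def pixel_address(base, stride, x, y):
    """Byte address of pixel (x, y) in a row-major, 1 byte/pixel frame."""
    return base + y * stride + x


def is_aligned(address, word_bytes=8):
    """True if the address is an integer multiple of the word size."""
    return address % word_bytes == 0


def aligned_words_for_row(address, row_bytes, word_bytes=8):
    """Number of aligned words that must be fetched to cover
    row_bytes consecutive bytes starting at address."""
    first_word = address // word_bytes
    last_word = (address + row_bytes - 1) // word_bytes
    return last_word - first_word + 1


BASE, STRIDE = 0, 640

aligned_row = pixel_address(BASE, STRIDE, x=16, y=0)
shifted_row = pixel_address(BASE, STRIDE, x=19, y=0)

# One 16-pixel row of a block starting at an aligned address.
print(is_aligned(aligned_row), aligned_words_for_row(aligned_row, 16))  # True 2
# The same row displaced by a motion vector of 3 pixels.
print(is_aligned(shifted_row), aligned_words_for_row(shifted_row, 16))  # False 3
```

Under these assumptions the displaced row costs one additional word fetch, and the fetched bytes must also be shifted and merged before use.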