The amount of data in digital video is immense. For example, each frame of a progressive-scan 1080p HD video has 2,073,600 pixels (1920×1080), and frames are typically refreshed 60 times per second. If each pixel takes 3 bytes to represent the full color value, the uncompressed data rate is approximately 2,986 Mbit/s. It is apparent, then, that video data must be compressed to be handled efficiently.
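The uncompressed data rate quoted above can be checked with a short calculation. The following is only an illustrative sketch; the function name and its parameters are hypothetical and are not part of any standard or of the described system.

```python
def raw_bitrate_bits_per_second(width, height, bytes_per_pixel, frames_per_second):
    """Return the uncompressed video data rate in bits per second."""
    pixels_per_frame = width * height
    bits_per_frame = pixels_per_frame * bytes_per_pixel * 8  # 8 bits per byte
    return bits_per_frame * frames_per_second


# 1080p progressive scan, 3 bytes per pixel, 60 frames per second
rate = raw_bitrate_bits_per_second(1920, 1080, 3, 60)
print(f"{rate / 1e6:.0f} Mbit/s")  # prints "2986 Mbit/s"
```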
Although the amount of video data is massive, there are two forms of redundancy that can be exploited. Firstly, within each picture, much of the video is a mere repetition of what is already on the screen. Secondly, even in fast-moving scenes, little of the screen changes and most of a screen is reproduced in the next frame, although the data may be shifted or located at another point on the screen. Further helping compression is the fact that the human eye acts as a filter and is, for example, relatively insensitive to high frequencies and to color. All of these factors allow video to be compressed dramatically while maintaining, at least to the human eye, good visual quality.
The most compute-intensive portion of a video compression system, or encoder, is motion estimation. Motion estimation exploits the redundancy between frames by searching adjacent frames for similar areas of picture. Instead of sending the original pixel data, it is much more efficient to send a motion vector indicating where the similar area is, together with a block of (hopefully zero) differences. Each frame is tiled into groups of 16×16 pixels called macroblocks. In modern compression systems such as H.264, a macroblock can have sub-tiles, and each block or sub-block partition in an inter-coded macroblock can have a motion vector. To further compress the vector information, it is assumed that the motion vectors themselves are correlated, as for example in a camera pan. Thus a motion vector of a partition in a current frame can be predicted from its neighbors; it is the difference (often zero) between the prediction and the actual vector that is sent. In the H.264 standard, also known as the International Telecommunication Union Telecommunication Standardization Sector (ITU-T) H.264 standard or ISO/IEC 14496-10, which is incorporated by reference herein, the offset between two motion vectors has a quarter-pixel resolution. This resolution allows natural motions to be tracked more accurately, which increases the probability of a good match and hence coding efficiency, but comes at the expense of having to match 16× the candidates during a search (four positions horizontally by four vertically for each integer-resolution position) to compute the motion vectors.
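The block search and the vector prediction described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the method of any particular encoder: it uses an exhaustive integer-pixel search with a sum of absolute differences (SAD) cost, a 4×4 block instead of a 16×16 macroblock, and a component-wise median of three neighboring vectors as the predictor. All function names and the test pattern are hypothetical. A quarter-pixel search would additionally require interpolating the reference frame, which is omitted here.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in
    cur and the block displaced by (dx, dy) in ref."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Exhaustive integer-pixel search; returns (motion_vector, cost)."""
    height, width = len(ref), len(ref[0])
    best_key, best_mv = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if bx + dx < 0 or by + dy < 0:
                continue
            if bx + dx + n > width or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            # Tie-break on vector length so shorter vectors are preferred.
            key = (cost, abs(dx) + abs(dy))
            if best_key is None or key < best_key:
                best_key, best_mv = key, (dx, dy)
    return best_mv, best_key[0]


def median_predict(a, b, c):
    """Component-wise median of three neighboring motion vectors
    (an assumed predictor, chosen for illustration)."""
    return (sorted([a[0], b[0], c[0]])[1], sorted([a[1], b[1], c[1]])[1])


def pattern(x, y):
    """Hypothetical non-repeating test image content."""
    return x * x + 3 * y * y + x * y


# Reference frame, and a current frame whose content has moved
# 2 pixels right and 1 pixel down relative to the reference.
ref = [[pattern(x, y) for x in range(8)] for y in range(8)]
cur = [[pattern(x - 2, y - 1) for x in range(8)] for y in range(8)]

mv, cost = full_search(cur, ref, bx=2, by=2, n=4, search_range=2)
print(mv, cost)  # (-2, -1) 0

# Predict from three (hypothetical) neighbor vectors; send only the difference.
pred = median_predict((-2, -1), (-2, -1), (-1, 0))
residual = (mv[0] - pred[0], mv[1] - pred[1])
print(residual)  # (0, 0)
```

With a search range of ±2 pixels, the sketch evaluates at most 25 integer candidates per block; at quarter-pixel resolution the same window would contain roughly 16 times as many candidate positions, which illustrates why motion estimation dominates the computational cost.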
The tradeoff in the computation resources required to calculate motion vectors is between computation speed and computation area. A large amount of hardware resources may calculate motion vectors quickly, even in real time, but comes at an enormous cost that is typically reserved for very expensive video delivery systems. At the other end of the spectrum are software systems that are slow, yet effective if speed is not the primary consideration. Some systems may calculate motion vectors for days to produce just a few minutes of compressed video, which is obviously not time efficient but in some cases, such as authoring video before distribution, is acceptable.
Embodiments of the invention address these and other limitations in the prior art.