The present invention relates to digital image processing and, more particularly, to evaluating matches between digital images. The invention provides for high throughput motion estimation for video compression by providing a high-speed image-block-match function.
Video (especially with, but also without, audio) can be an engaging and effective form of communication. Video is typically stored as a series of still images referred to as “frames”. Motion and other forms of change can be represented as small changes from frame to frame as the frames are presented in rapid succession. Video can be analog or digital, with the trend being toward digital due to the increase in digital processing capability and the resistance of digital information to degradation as it is communicated.
Digital video can require huge amounts of data for storage and bandwidth for communication. For example, a digital image is typically described as an array of color dots, i.e., picture elements (“pixels”) each with an associated “color” or intensity described numerically. The number of pixels in an image can vary from hundreds to millions and beyond, with each pixel being able to assume any one of a range of values. The number of values available for characterizing a pixel can range from two to trillions; in the binary code used by computers and computer networks, the typical range is from 1 bit to 32 bits.
In view of the typically small changes from frame to frame, there is a lot of redundancy in video data. Accordingly, many video compression schemes seek to compress video data in part by exploiting inter-frame redundancy to reduce storage and bandwidth requirements. For example, two successive frames will have some corresponding pixel (“picture elements”) positions at which there is change and some pixel positions in which there is no change. Instead of describing the entire second frame pixel by pixel, only the changed pixels need be described in detail—the pixels that are unchanged can simply be indicated as “unchanged”. More generally, there may be slight changes in background pixels from frame to frame; these changes can be efficiently encoded as changes from the first frame as opposed to absolute values. Typically, this “inter-frame compression” results in a considerable reduction in the amount of data required to represent successive frames.
On the other hand, identifying unchanged pixel positions does not provide optimal compression in many situations. For example, consider the case where a video camera is panned 1-pixel to the left while videoing a static scene so that the scene appears (to the person viewing the video) to move 1-pixel to the right. Even though two successive frames will look very similar, the correspondence on a position-by-position basis may not be high. A similar problem arises as a large object moves against a static background: the redundancy associated with the background can be reduced on a position-by-position basis, but the redundancy of the object as it moves is not exploited.
Some prevalent compression schemes, e.g., MPEG, encode “motion vectors” to address inter-frame motion. A motion vector can be used to map one block of pixel positions in a first “reference” frame to a second block of pixel positions (displaced from the first set) in a second “predicted” frame. Thus, a block of pixels in the predicted frame can be described in terms of its differences from a block in the reference frame identified by the motion vector. For example, the motion vector can be used to indicate that the pixels in a given block of the predicted frame are being compared to pixels in a block one pixel up and two to the left in the reference frame. The effectiveness of compression schemes that use motion estimation is well established; in fact, the popular DVD (“digital versatile disk”) compression scheme (a form of MPEG2) uses motion detection to put hours of high-quality video on a 5-inch disk.
Identifying motion vectors can be a challenge. Translating a human visual ability for identifying motion into an algorithm that can be used on a computer is problematic, especially when the identification must be performed in real time (or at least at high speeds). Computers typically identify motion vectors by comparing blocks of pixels across frames. For example, each 16×16-pixel block in a “predicted” frame can be compared with many such blocks in another “reference” frame to find a best match. Blocks can be matched by calculating the sum of the absolute values of the differences of the pixel values at corresponding pixel positions within the respective blocks. The pair of blocks with the lowest sum represents the best match, the difference in positions of the best-matched blocks determines the motion vector. Note that in some contexts, the 16×16-pixel blocks typically used for motion detection are referred to as “macroblocks” to distinguish them from 8×8-pixel blocks used by DCT (discrete cosine transformations) transformations for intra-frame compression.
For example, consider two color video frames in which luminance (brightness) and chrominance (hue) are separately encoded. In such cases, motion estimation is typically performed using only the luminance data. Typically, 8-bits are used to distinguish 256 levels of luminance. In such a case, a 64-bit register can store luminance data for eight of the 256 pixels of a 16×16 block; thirty-two 64-bit registers are required to represent a full 16×16-pixel block, and a pair of such blocks fills sixty-four 64-bit registers. Pairs of 64-bit values can be compared using parallel subword operations. For example, PSAD “parallel sum of the absolute differences” yields a single 16-bit value for each pair of 64-bit operands. There are thirty-two such results, which can be added or accumulated, e.g., using ADD or accumulate instructions. In all, about sixty-three instructions are required to evaluate each pair of blocks. The number of matches to be evaluated varies by orders of magnitude, depending on several factors, but there can easily be millions to evaluate for a pair of frames. In any event, the block matching function severely taxes encoding throughput. What is needed is an approach to reducing the processing burden imposed by motion estimation.