The present invention relates to digital-image processing and, more particularly, to evaluating matches between digital images. The invention provides for high throughput motion estimation for video compression by providing a high-speed image-block-match function.
Video (especially with, but also without, audio) can be an engaging and effective form of communication. Video is typically stored as a series of still images referred to as “frames”. Motion and other forms of change can be represented as small changes from frame to frame as the frames are presented in rapid succession. Video can be analog or digital, with the trend being toward digital due to the increase in digital processing capability and the resistance of digital information to degradation as it is communicated.
Digital video can require huge amounts of data for storage and bandwidth for communication. For example, a digital image is typically described as an array of color dots, i.e., picture elements (“pixels”), each with an associated “color” or intensity represented numerically. The number of pixels in an image can vary from hundreds to millions and beyond, with each pixel being able to assume any one of a range of values. The number of values available for characterizing a video pixel can range from two to trillions; in the binary code used by computers and computer networks, the typical range is from eight to thirty-two bits.
In view of the typically small changes from frame to frame, there is a lot of redundancy in video data. Accordingly, many video compression schemes seek to compress video data in part by exploiting inter-frame redundancy to reduce storage and bandwidth requirements. For example, two successive frames typically have some corresponding pixel (“picture-element”) positions at which there is change and some pixel positions in which there is no change. Instead of describing the entire second frame pixel by pixel, only the changed pixels need be described in detail—the pixels that are unchanged can simply be indicated as “unchanged”. More generally, there may be slight changes in background pixels from frame to frame; these changes can be efficiently encoded as changes from the first frame as opposed to absolute values. Typically, this “inter-frame compression” results in a considerable reduction in the amount of data required to represent video images.
On the other hand, identifying unchanged pixel positions does not provide optimal compression in many situations. For example, consider the case where a video camera is panned one pixel to the left while videoing a static scene so that the scene appears (to the person viewing the video) to move one pixel to the right. Even though two successive frames will look very similar, the correspondence on a position-by-position basis may not be high. A similar problem arises as a large object moves against a static background: the redundancy associated with the background can be reduced on a position-by-position basis, but the redundancy of the object as it moves is not exploited.
Some prevalent compression schemes, e.g., MPEG, encode “motion vectors” to address inter-frame motion. A motion vector can be used to map one block of pixel positions in a first “reference” frame to a second block of pixel positions (displaced from the first set) in a second “predicted” frame. Thus, a block of pixels in the predicted frame can be described in terms of its differences from a block in the reference frame identified by the motion vector. For example, the motion vector can be used to indicate the pixels in a given block of the predicted frame are being compared to pixels in a block one pixel up and two to the left in the reference frame. The effectiveness of compression schemes that use motion estimation is well established; in fact, the popular DVD (“digital versatile disk”) compression scheme (a form of MPEG2) uses motion detection to put hours of high-quality video on a 5-inch disk.
Identifying motion vectors can be a challenge. Translating a human visual ability for identifying motion into an algorithm that can be used on a computer is problematic, especially when the identification must be performed in real time (or at least at high speeds). Computers typically identify motion vectors by comparing blocks of pixels across frames. For example, each 16×16-pixel block in a “predicted” frame can be compared with many such blocks in another “reference” frame to find a best match. Blocks can be matched by calculating the sum of the absolute values of the differences of the pixel values at corresponding pixel positions within the respective blocks. The pair of blocks with the lowest sum represents the best match, the difference in positions of the best-matched blocks determine the motion vector. Note that in some contexts, the 16×16-pixel blocks typically used for motion detection are referred to as “macroblocks” to distinguish them from 8×8-pixel blocks used by DCT (discrete cosine transformations) transformations for intra-frame compression.
For example, consider two color video frames in which luminance (brightness) and chrominance (hue) are separately encoded. In such cases, motion estimation is typically performed using only the luminance data. Typically, 8-bits are used to distinguish 256 levels of luminance. In such a case, a 64-bit register can store luminance data for eight of the 256 pixels of a 16×16 block; thirty-two 64-bit registers are required to represent a full 16×16-pixel block, and a pair of such blocks fills sixty-four 64-bit registers. Pairs of 64-bit values can be compared using parallel subword operations; for example, PSAD “parallel sum of the absolute differences” yields a single 16-bit value for each pair of 64-bit operands. There are thirty-two such results, which can be added or accumulated, e.g., using ADD or accumulate instructions. In all, about sixty-four instructions, other than load instructions, are required to evaluate each pair of blocks.
Note that the two-instruction loop (PSAD+ADD) can be replaced by a one-instruction loop using a parallel sum of the absolute differences and accumulate PSADAC instruction. However, this instruction requires three operands (the minuend register, the subtrahend register, and the accumulate register holding the previously accumulated value). Three operand registers are not normally available in general-purpose processors. However, such instructions can be advantageous for application-specific designs.
The Intel Itanium processor provides for improved performance in motion estimation using one- and two-operand instructions. In this case, a three-instruction loop is used. The first instruction is a PAveSub, which yields half the difference between respective one-byte subwords of two 64-bit registers. The half is obtained by shifting right one bit position. Without the shift, nine bits would be required to express all possible differences between 8-bit values. So the shift allows results to fit within the same one-byte subword positions as the one-byte subword operands.
These half-differences are accumulated into two-byte subwords. Since eight half-differences are accumulated into four two-byte subwords, the bytes at even-numbered byte positions are accumulated separately from bytes at odd-numbered byte positions. Thus, a “parallel accumulate magnitude left” PAccMagL accumulates half-differences at byte positions 1, 3, 5, and 7, while a “parallel accumulate magnitude right” PAccMagR accumulates the half-differences at byte positions 0, 2, 4, and 6. This loop can execute more quickly than the two-instruction loop described above, as a final sum is not calculated within each loop iteration. Instead, the four 2-byte subwords are summed once after the loop iterations end.
The four two-byte subwords can be summed outside the loop using an instruction sequence as follows. First, the final result is shifted to the right thirty-two bits. Then the original and shifted versions of the final result are summed. Then the sum is shifted sixteen bits to the right. The original and shifted versions of the sum are added. If necessary, all but the least-significant sixteen bits can be masked out to yield the desired match measure.
While the foregoing programs for calculating match measures are quite efficient, further improvements in performance are highly desirable. The number of matches to be evaluated varies by orders of magnitude, depending on several factors, but there can easily be millions to evaluate for a pair of frames. In any event, the block matching function severely taxes encoding throughput. Further reductions in the processing burden imposed by motion estimation are desired.