The processing of video data is generally well known in the art. Video data is generally defined by a series of frame images, each having an X by Y pixel resolution. To facilitate efficient processing, pixel (pel) data for each frame is customarily handled in macroblock (MB) groups. Typical MB sizes include 4×16, 8×8 and 16×16 row by column blocks of pel data. Each dimension is preferably a power of two when employing binary digital processing.
For various types of video processing, such as video encoding, frame rate conversion, super-resolution, etc., it is desirable to perform motion estimation by finding where MBs of image data (or altered versions thereof) that appear in a reference frame are located in a subsequent frame. Conventionally, the image data of a MB located at a particular coordinate location in a reference frame is searched for in a subsequent frame within a search area that surrounds the location of the MB in the reference frame. For example, FIG. 1a illustrates MB 10a at a particular location within a reference frame and FIG. 1b illustrates a corresponding search area 12a in a subsequent frame for the MB 10a.
MB searches within search areas are typically performed using a pixel based comparison, such as a sum of absolute differences (SAD) calculation that accumulates the absolute values of the differences between the pel values of the MB 10a and the corresponding pel values of each MB-sized area within the search area 12a. To conduct a search, the pel data for the entire search area must be available. Where it is desirable to search for all of the MBs of a reference frame in a subsequent frame, such processing is highly calculation intensive and becomes more calculation intensive and time sensitive as resolution sizes and frame rates increase.
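By way of illustration only, the comparison described above can be sketched as an exhaustive search in Python. The function names, the representation of a frame as a list of rows, and the `margin` parameter (the number of pixels by which the search area extends beyond the MB on each side) are assumptions made for this sketch and are not taken from any particular implementation.

```python
def sad(ref, cur, rx, ry, cx, cy, w, h):
    """Sum of absolute differences between the w x h block of `ref` at
    column rx, row ry and the w x h block of `cur` at column cx, row cy."""
    total = 0
    for dy in range(h):
        for dx in range(w):
            total += abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
    return total


def full_search(ref, cur, x, y, w, h, margin):
    """Compare the MB at (x, y) in `ref` with every MB-sized area of the
    surrounding search area in `cur`. Returns (best_sad, (dx, dy)), where
    (dx, dy) is the displacement of the best match from (x, y)."""
    frame_h, frame_w = len(cur), len(cur[0])
    best = None
    for dy in range(-margin, margin + 1):
        for dx in range(-margin, margin + 1):
            cx, cy = x + dx, y + dy
            # Skip candidates that fall partly outside the frame.
            if cx < 0 or cy < 0 or cx + w > frame_w or cy + h > frame_h:
                continue
            cost = sad(ref, cur, x, y, cx, cy, w, h)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best


# Hypothetical 16 x 16 test frames: `cur` is `ref` shifted right by two
# pixels and down by one pixel, with zeros where no data is available.
ref = [[x * 3 + y * 11 for x in range(16)] for y in range(16)]
cur = [[ref[y - 1][x - 2] if x >= 2 and y >= 1 else 0 for x in range(16)]
       for y in range(16)]

best_sad, displacement = full_search(ref, cur, 4, 4, 4, 4, 4)
print(best_sad, displacement)  # prints: 0 (2, 1)
```

In this sketch every candidate position requires one pel difference per pixel of the MB, which is the source of the calculation intensity noted above.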
A given search for a MB within a search area will not always obtain a positive result since the MB of the reference frame (or altered version thereof) may simply not be present in a subsequent frame. Quite often, however, the same or an altered version of the same MB will appear in a subsequent frame so that when that image (or altered version) is found in a subsequent frame, a motion vector can be defined for the MB based on the MB location in the reference frame and the MB location in the search frame. Where a motion vector for the MB was previously determined, the prior motion vector can also be used in the determination of an updated motion vector.
Where a reference frame MB (or altered version thereof) is located in the search frame and a motion vector can be determined, that information is useful in facilitating the efficient processing of the video data as a whole. However, there can be instances where the image of a MB is moving at such a high speed that it appears in the search frame outside of the corresponding search area and accordingly is not detected. In such cases, the opportunity for determining a motion vector for that MB of the frame image, and using it to facilitate efficient processing, is lost.
The size of the search area is typically selected to be a number of pixels larger in both height and width than the MB. Although it is possible to search an entire search frame for the MBs of the reference frame, the time needed to perform such searches is prohibitive. Generally, a smaller search area requires less time to search, but carries a greater chance of failing to detect a MB that actually appears in the searched frame beyond the boundaries of the search area.
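The trade-off between search area size and search time can be made concrete with a rough count of the work involved. The following Python sketch assumes an exhaustive search that evaluates every MB-sized position within the region searched; the 16×16 MB, the 24×24 search area, and the 1920×1080 frame are illustrative values only, and the function names are hypothetical.

```python
def candidate_positions(area_w, area_h, mb_w, mb_h):
    """Number of MB-sized areas that fit inside an area_w x area_h region."""
    return (area_w - mb_w + 1) * (area_h - mb_h + 1)


def pel_differences(area_w, area_h, mb_w, mb_h):
    """Pel differences needed to compare one MB at every candidate position."""
    return candidate_positions(area_w, area_h, mb_w, mb_h) * mb_w * mb_h


# Assumed example: one 16 x 16 MB searched in a 24 x 24 search area ...
window_positions = candidate_positions(24, 24, 16, 16)      # 81
window_cost = pel_differences(24, 24, 16, 16)               # 20736

# ... versus the same MB searched across an entire 1920 x 1080 frame.
frame_positions = candidate_positions(1920, 1080, 16, 16)   # 2028825
frame_cost = pel_differences(1920, 1080, 16, 16)            # 519379200

print(window_positions, frame_positions)
```

These counts are per MB; a frame contains many MBs, so the totals scale accordingly.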
The following provides an example of a search area size and relative location. If MB 10a represents pel data for a 4×16 block of pixels, the search area 12a may be selected as 12×24, i.e., 8 pixels greater in both height and width. For convenience, the pixel location of an upper left corner of a MB or search area can be used to define its location within a frame. Where the upper left pixel of MB 10a is located at coordinates Xi, Yj of the reference frame, the upper left pixel of the search area 12a in the search frame for MB 10a can be located at coordinates Xi-4, Yj-4 to provide a surrounding four pixel search area about the relative location of the MB 10a in the reference frame.
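The geometry of this example can be expressed as a short calculation. In the Python sketch below, the function name `search_area` and the example MB location (column 32, row 20) are hypothetical; the 4 row by 16 column MB and the four pixel margin follow the example above. Clamping of the search area at the frame boundaries is not addressed.

```python
def search_area(mb_x, mb_y, mb_w, mb_h, margin):
    """Return (x, y, width, height) of the search area whose upper left
    corner lies `margin` pixels above and to the left of the MB's upper
    left corner at column mb_x, row mb_y."""
    return (mb_x - margin, mb_y - margin,
            mb_w + 2 * margin, mb_h + 2 * margin)


# A MB of 4 rows by 16 columns whose upper left pixel is assumed to be at
# column 32, row 20, surrounded by a four pixel margin.
area = search_area(32, 20, 16, 4, 4)
print(area)  # prints: (28, 16, 24, 12)
```

The resulting area is 24 columns wide and 12 rows high, i.e., 8 pixels greater than the MB in each dimension.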
Similarly, MBs 10b and 10c, which are adjacent to MB 10a in the reference frame depicted in FIG. 1a, have corresponding search areas 12b, 12c in the subsequent frame. Since the MBs 10a, 10b, 10c are adjacent, there is a substantial overlap among the corresponding search areas 12a, 12b, 12c.
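The extent of such overlap can be estimated with a simple rectangle intersection. The Python sketch below is illustrative only: the function names, the MB locations, and the choice of a 4 row by 16 column MB with a four pixel margin are assumptions, and the arrangement of the neighboring MBs is hypothetical rather than taken from FIG. 1a.

```python
def search_area(mb_x, mb_y, mb_w, mb_h, margin):
    """(x, y, width, height) of the search area surrounding a MB."""
    return (mb_x - margin, mb_y - margin,
            mb_w + 2 * margin, mb_h + 2 * margin)


def overlap_area(a, b):
    """Number of pixels shared by two (x, y, width, height) rectangles."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)


MB_W, MB_H, MARGIN = 16, 4, 4                             # assumed sizes
area_a = search_area(32, 20, MB_W, MB_H, MARGIN)          # a first MB
area_right = search_area(32 + MB_W, 20, MB_W, MB_H, MARGIN)  # right neighbor
area_below = search_area(32, 20 + MB_H, MB_W, MB_H, MARGIN)  # lower neighbor

total = area_a[2] * area_a[3]                     # 24 * 12 = 288 pixels
print(overlap_area(area_a, area_right), total)    # prints: 96 288
print(overlap_area(area_a, area_below), total)    # prints: 192 288
```

Under these assumptions, one third of the search area is shared with the horizontally adjacent MB and two thirds with the vertically adjacent MB, so much of the same pel data is needed for neighboring searches.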
One context where motion estimation is typically used is video encoding. FIG. 2 illustrates an example of a conventional video encoder that receives data for video frames as input and outputs an encoded bit stream of encoded video data. One common method for encoding graphics/video involves encoding using discrete-cosine transform (DCT) processing so that the video content is translated into DCT coefficients. To play back/decode such encoded video, inverse discrete-cosine transform (iDCT) processing is one of the required steps.
For MPEG-2 video encoding, for example, the video is defined in frames of pixels represented by YUV values. DCT processing is then performed with respect to blocks of YUV pixel data to result in blocks of DCT coefficients that are quantized and entropy coded using a variable-length code (VLC). The result makes up much of the video data of an MPEG-2 encoded bit stream, which generally also includes motion vector and audio data. To decode the video of such an MPEG-2 bit stream, the processes with respect to the VLC encoded data must be reversed, but some data quality is lost because the encoding quantization process is not fully reversible (i.e., MPEG-2 represents a lossy coding scheme).
Referring to the FIG. 2 example, the input video data generally includes YUV values for each pixel of each frame of a video. Macroblocks of pixel data are processed along a primary encoding path by transform component T that performs DCT processing to produce blocks of DCT coefficients that are then processed by a quantization component Q. The quantized blocks of DCT coefficients are then processed by an entropy encoder to produce the bit stream output.
To include motion vector data, additional components are provided. In particular, a motion estimation/compensation component is provided which executes a comparative search to find a new relative location of MBs of a reference frame in a subsequent video frame. The reference frame is typically generated from quantized blocks of DCT coefficients of a previously processed video frame by processing them through an inverse quantization component Q⁻¹ and an inverse transform component T⁻¹ to generate MBs of pixel data of a given reference frame.
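The path through the transform, quantization, inverse quantization and inverse transform components can be sketched as follows. This Python sketch is a simplification made for illustration: it assumes an 8×8 block, an orthonormal DCT, and a single uniform quantization step, whereas practical encoders typically apply per-coefficient quantization and operate on more than one block. All function names and the sample block are hypothetical.

```python
import math

N = 8  # block size assumed for this sketch


def _scale(k):
    """Normalization factor that makes the DCT orthonormal."""
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)


def _basis(n, k):
    """DCT-II basis value for sample index n and frequency index k."""
    return math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct2(block):
    """Transform component: 2-D DCT of an N x N block of pel values."""
    return [[_scale(u) * _scale(v) * sum(block[y][x] * _basis(y, u) * _basis(x, v)
                                         for y in range(N) for x in range(N))
             for v in range(N)] for u in range(N)]


def idct2(coef):
    """Inverse transform component: 2-D inverse DCT."""
    return [[sum(_scale(u) * _scale(v) * coef[u][v] * _basis(y, u) * _basis(x, v)
                 for u in range(N) for v in range(N))
             for x in range(N)] for y in range(N)]


def quantize(coef, step):
    """Quantization component: map each coefficient to an integer level."""
    return [[round(c / step) for c in row] for row in coef]


def dequantize(levels, step):
    """Inverse quantization component: map levels back to coefficients."""
    return [[q * step for q in row] for row in levels]


# Hypothetical block of pel values and quantization step.
block = [[x * 3 + y * 11 for x in range(N)] for y in range(N)]
step = 16

levels = quantize(dct2(block), step)            # data passed to entropy coding
recon = idct2(dequantize(levels, step))         # pixel data for a reference frame
max_err = max(abs(recon[y][x] - block[y][x])
              for y in range(N) for x in range(N))
print(max_err)  # nonzero: rounding in quantize() cannot be undone
```

Because the reconstruction is built from the quantized coefficients rather than from the original pixels, it matches what a decoder would be able to reproduce from the same data.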
Graphics processing units (GPUs) have been developed to assist in the expedient processing of video data. Their processing functionality has been expanded through configurations that utilize single instruction, multiple data (SIMD) processing engines that include local data storage (LDS) memory and processing components known as shaders. For example, FIG. 3 illustrates a prior art GPU, namely the ATI Radeon HD 5800 series GPU. The Radeon HD 5800 series GPU has approximately 2.72 TeraFLOPS of processing power. This exemplary GPU features 20 SIMD engines, each with LDS memory and 16 processors (shaders), i.e., 320 shaders in total. The Radeon HD 5800 series GPU also includes 80 texture units, 4 per SIMD engine, and a Graphics Double Data Rate (GDDR) memory interface that offers more than 150 GB/sec of peak bandwidth.