Media applications have been driving microprocessor development for more than a decade. In fact, most computing upgrades in recent years have been driven by media applications. These upgrades have predominantly occurred within consumer segments, although significant advances have also been seen in enterprise segments for entertainment-enhanced education and communication purposes. Nevertheless, future media applications will impose even higher computational requirements. As a result, tomorrow's personal computing (PC) experience will be even richer in audio-visual effects and easier to use and, more importantly, computing will merge with communications.
Accordingly, the display of images and the playback of audio and video data, collectively referred to herein as content, have become increasingly popular applications for current computing devices. Bi-linear interpolation and motion compensation are popular techniques for decompression and display of images and video. Quarter-pixel and eighth-pixel motion compensation of luminance content in prior decompression techniques has made use of Finite Impulse Response (FIR) filters for interpolation. However, for fractional-pixel chrominance motion compensation, bi-linear interpolation may be used instead.
Recently, bi-linear interpolation has been proposed by the Joint Video Team of the International Telecommunication Union (ITU) Video Coding Experts Group (VCEG) and the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) for fractional-pixel chrominance motion compensation in accordance with the H.264 standard (see Final Committee Draft of ISO/IEC 14496-10 Advanced Video Coding). Although the number of chrominance components is typically only half of the number of luminance components, the processing required for motion compensation of both types of components may be substantially equal.
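By way of illustration only, bi-linear interpolation of a single chrominance sample at an eighth-pixel offset may be sketched in C as follows. This is a minimal sketch, not a normative implementation: the function name chroma_bilinear_8th and its parameter names are hypothetical, the samples are assumed to be 8 bits wide, and the rounding offset of 32 with a right shift by 6 is assumed from the fact that the four weights always sum to 64. The cited draft should be consulted for the exact form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (name and signature are assumptions for illustration).
 * a, b, c, d are the four integer-position samples surrounding the
 * fractional position: a = top-left, b = top-right,
 * c = bottom-left, d = bottom-right.
 * dx and dy are the horizontal and vertical offsets in eighths
 * of a sample, each in the range 0..7. */
static uint8_t chroma_bilinear_8th(uint8_t a, uint8_t b,
                                   uint8_t c, uint8_t d,
                                   unsigned dx, unsigned dy)
{
    assert(dx < 8 && dy < 8);

    /* Each weight is the product of a horizontal and a vertical
     * distance; the four weights sum to 8 * 8 = 64. */
    unsigned sum = (8 - dx) * (8 - dy) * a
                 + dx       * (8 - dy) * b
                 + (8 - dx) * dy       * c
                 + dx       * dy       * d;

    /* Add half of 64 for rounding, then divide by 64. */
    return (uint8_t)((sum + 32) >> 6);
}
```

Each output sample in this sketch costs four multiplications, three additions, a rounding addition, and a shift, which suggests why chrominance motion compensation may be as demanding as luminance motion compensation despite the smaller number of samples.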
In some computer systems, processors are implemented to operate on values represented by a large number of bits (e.g., 32 or 64) using instructions that produce one result. For example, the execution of an add instruction will add together a first 64-bit value and a second 64-bit value and store the result as a third 64-bit value. However, media applications require the manipulation of large amounts of data, each item of which may be represented in a small number of bits. For example, image data and sound data each typically require 8 or 16 bits. To improve efficiency of media applications, some prior art processors provide packed data formats. A packed data format is one in which the bits typically used to represent a single value are broken into a number of fixed-size data elements, each of which represents a separate value. For example, a 64-bit register may be broken into two 32-bit elements, each of which represents a separate 32-bit value. In addition, these prior art processors provide instructions for separately manipulating each element in these packed data types in parallel. For example, a packed add instruction adds together corresponding data elements from a first packed data and a second packed data. Thus, if a multimedia algorithm requires a loop containing five operations that must be performed on a large number of data elements, it is desirable to pack the data and perform these operations in parallel using packed data instructions. In this manner, these processors can more efficiently process content of media applications.
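The behavior of a packed add may be illustrated in software. The following is a minimal sketch in C, assuming a 64-bit value treated as eight independent 8-bit elements with wrap-around arithmetic; the names packed_add_u8 and get_lane are hypothetical, and the code emulates with ordinary integer operations what a packed add instruction would perform directly in hardware.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical emulation of a packed add on eight 8-bit elements
 * held in one 64-bit value. Each element wraps modulo 256 and no
 * carry crosses from one element into the next. */
static uint64_t packed_add_u8(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ULL; /* top bit of each element */

    /* Add the low 7 bits of every element. Any carry lands in the
     * element's own top bit, so it cannot spill into its neighbor. */
    uint64_t low = (x & ~H) + (y & ~H);

    /* Combine the original top bits with that carry using XOR,
     * which is addition without carry-out. */
    return low ^ ((x ^ y) & H);
}

/* Hypothetical helper: extract element i (0 = least significant). */
static uint8_t get_lane(uint64_t v, unsigned i)
{
    assert(i < 8);
    return (uint8_t)(v >> (8 * i));
}
```

For example, adding 0x01FF and 0x0101 with an ordinary 64-bit add yields 0x0300, because the carry out of the lowest byte propagates upward. The packed form instead yields 0x0200: the lowest element wraps from 0xFF to 0x00 and the next element is computed independently as 0x01 + 0x01.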
Unfortunately, current methods and instructions target the general needs of compression/decompression and are not comprehensive. In fact, many architectures do not support a means for efficient bi-linear interpolation and motion compensation over a range of coefficient sizes and data types. In addition, data ordering within data storage devices such as SIMD registers, as well as a capability for partial data transfers between registers, are generally not supported. As a result, current architectures require unnecessary data type changes, which reduce the number of operations performed per instruction and significantly increase the number of clock cycles required to order data for arithmetic operations.
Therefore, there remains a need to overcome one or more of the limitations existing in the above-described techniques.