As processor technology advances, newer software code is also being generated to run on machines with these processors. Users generally expect and demand higher performance from their computers regardless of the type of software being used. One performance issue can arise from the kinds of instructions and operations that are actually being performed within the processor. Certain types of operations require more time to complete based on the complexity of the operations and/or the type of circuitry needed. This provides an opportunity to optimize the way certain complex operations are executed inside the processor.
Media applications have been driving microprocessor development for more than a decade. In fact, most computing upgrades in recent years have been driven by media applications. These upgrades have predominantly occurred within consumer segments, although significant advances have also been seen in enterprise segments for entertainment-enhanced education and communication purposes. Nevertheless, future media applications will impose even higher computational requirements. As a result, tomorrow's personal computing (PC) experience will be even richer in audio-visual effects as well as easier to use and, more importantly, computing will merge with communications.
Accordingly, the display of images and the playback of audio and video data, collectively referred to herein as content, have become increasingly popular applications for current computing devices. Filtering and convolution operations are some of the most common operations performed on content data, such as image, audio, and video data. As known to those skilled in the art, filtering and correlation calculations are computed with a multiply-accumulate operation that adds the products of data and coefficients. The correlation of two vectors, A and B, consists of the calculation of the sum S:
$$S[k] = \frac{1}{N} \sum_{i=0}^{N-1} a[i] \cdot b[i+k], \qquad \text{Equation (1)}$$
that is very often used with k=0:
$$S[0] = \frac{1}{N} \sum_{i=0}^{N-1} a[i] \cdot b[i] \qquad \text{Equation (2)}$$
In the case of an N-tap filter f applied to a vector V, the sum S to be calculated is the following:
$$S = \sum_{i=0}^{N-1} f[i] \cdot V[i] \qquad \text{Equation (3)}$$
Such operations are computationally intensive, but offer a high level of data parallelism that can be exploited through an efficient implementation using various data storage devices, such as, for example, single instruction multiple data (SIMD) registers.
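For illustration only, the sums of Equations (1) and (3) can be written as scalar multiply-accumulate loops in C. This is a minimal sketch and not an implementation described in this document: the function names correlate and fir_tap_sum, the use of double-precision data, and the requirement that b holds at least n + k elements are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Equation (1): S[k] = (1/N) * sum over i = 0..N-1 of a[i] * b[i + k].
 * Assumption: b holds at least n + k elements, so b[i + k] stays in bounds.
 * Calling with k = 0 gives Equation (2). */
double correlate(const double *a, const double *b, size_t n, size_t k)
{
    assert(n > 0);                      /* avoid dividing by zero */
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i + k];         /* one multiply-accumulate per element */
    }
    return acc / (double)n;
}

/* Equation (3): S = sum over i = 0..N-1 of f[i] * v[i] for an N-tap filter. */
double fir_tap_sum(const double *f, const double *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i) {
        acc += f[i] * v[i];
    }
    return acc;
}
```

Every iteration of either loop forms its product independently of the others, and only the final accumulation ties the iterations together. That independence is the data parallelism referred to above.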
Applications of filtering operations are found in a wide array of image and video processing tasks and communications. Examples of uses of filters are reducing block artifacts in Moving Picture Experts Group (MPEG) video, reducing noise in audio, decoupling watermarks from pixel values to improve watermark detection, correlation for smoothing, sharpening, reducing noise, finding edges and scaling the sizes of images or video frames, upsampling video frames for sub-pixel motion estimation, enhancing audio signal quality, and pulse shaping and equalizing the signal in communications. Accordingly, filtering as well as convolution operations are vital to computing devices which offer playback of content, including image, audio and video data.
Unfortunately, current methods and instructions target the general needs of filtering and are not comprehensive. In fact, many architectures do not support a means for efficient filter calculations for a range of filter lengths and data types. In addition, data ordering within data storage devices such as SIMD registers, the capability of adding adjacent values in a register, and partial data transfers between registers are generally not supported. As a result, current architectures require unnecessary data type changes, which reduce the number of operations per instruction and significantly increase the number of clock cycles required to order data for arithmetic operations.
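To make the role of adjacent-value addition concrete, the following C sketch emulates, in portable scalar code, a register of eight 16-bit lanes in which products are formed lane by lane and each adjacent pair of products is added into a 32-bit lane. It is a hypothetical model for illustration, not an instruction defined in this document: the names LANES, mul_add_adjacent and filter_sum_lanes, the eight-lane width, and the 16-bit and 32-bit data types are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 8   /* hypothetical register width: eight 16-bit lanes */

/* Multiply corresponding lanes, then add each adjacent pair of products.
 * Assumption: each pair sum fits in 32 bits (inputs are not both at the
 * extreme negative 16-bit value in the same pair). */
void mul_add_adjacent(const int16_t x[LANES], const int16_t c[LANES],
                      int32_t out[LANES / 2])
{
    for (size_t j = 0; j < LANES / 2; ++j) {
        out[j] = (int32_t)x[2 * j] * c[2 * j]
               + (int32_t)x[2 * j + 1] * c[2 * j + 1];
    }
}

/* Filter sum over n elements, processed one emulated register at a time. */
int32_t filter_sum_lanes(const int16_t *data, const int16_t *coeff, size_t n)
{
    assert(n % LANES == 0);   /* this sketch handles whole registers only */
    int32_t total = 0;
    for (size_t base = 0; base < n; base += LANES) {
        int32_t pairs[LANES / 2];
        mul_add_adjacent(data + base, coeff + base, pairs);
        for (size_t j = 0; j < LANES / 2; ++j) {
            total += pairs[j];          /* reduce the four pair sums */
        }
    }
    return total;
}
```

In this model the data and coefficients must already sit in matching lane order before the multiply. If they do not, extra steps are needed to reorder them, which is the kind of overhead described above.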