Computer designers are faced with the task of designing systems that must meet continually expanding performance requirements. At an architectural level, many advances either reduce latency (the time between start and completion of an operation), or increase bandwidth (the width and rate of operations). At the semiconductor level, the speed of circuits has increased, while packaging densities have been enhanced to obtain higher performance. However, due to physical limitations on the speed of electronic components, other performance enhancing approaches have also been taken. In fact, a current architectural advance, which provides significant performance improvement in execution bandwidth, was first conceived during the early days of supercomputing.
Early supercomputers realized an architectural advantage by exploiting data parallelism in legacy vector architectures with improved execution bandwidth. Data parallelism arises in many numerical applications in science, engineering and image processing, where a single operation is applied to multiple elements of a data set, usually a vector or matrix. One way of exploiting data parallelism that proved effective in early processors is data pipelining. In this approach, vectors of data stream directly from memory or vector registers to and from the pipelined functional units of the legacy vector architectures.
However, exploiting data parallelism within current architectures requires the conversion of serial code into parallel instructions to achieve optimum performance. One technique that enables simultaneous (or parallel) processing of an instruction on multiple data elements is the single instruction, multiple data (SIMD) technique. Unfortunately, transforming serial code into parallel instructions, such as SIMD instructions, is often cumbersome for programmers. As described herein, rewriting serial code into a form that exploits the parallelism provided by, for example, SIMD instructions, is referred to as “vectorization”.
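By way of illustration only, the following C sketch contrasts a serial loop with a hand-restructured form of the same loop. The function names and the chunk width `VLEN = 4` are assumptions made for this example and do not come from the description above. The restructured loop still uses scalar operations; it only shows the shape that vectorization produces, where each group of independent operations could be carried out by one SIMD instruction.

```c
#include <assert.h>
#include <stddef.h>

/* Serial form: one addition per loop iteration. */
void add_serial(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Assumed chunk width for this sketch (four 32-bit floats per chunk). */
enum { VLEN = 4 };

/* Restructured form: the loop is split into chunks of VLEN elements.
 * The four additions in the main loop are independent of one another,
 * which is the property a single packed add would rely on. */
void add_strip_mined(const float *a, const float *b, float *out, size_t n)
{
    assert(a && b && out);
    size_t i = 0;
    for (; i + VLEN <= n; i += VLEN) {
        out[i + 0] = a[i + 0] + b[i + 0];
        out[i + 1] = a[i + 1] + b[i + 1];
        out[i + 2] = a[i + 2] + b[i + 2];
        out[i + 3] = a[i + 3] + b[i + 3];
    }
    /* Scalar remainder loop for element counts not divisible by VLEN. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```

Both functions compute the same result; the second merely exposes the chunked structure, including the remainder loop that a programmer must otherwise write by hand.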
The SIMD technique provides a significant enhancement to execution bandwidth in mainstream computing. According to the SIMD approach, multiple functional units operate simultaneously on so-called “packed data elements” (relatively short vectors that reside in memory or registers). Since a single instruction processes multiple data elements in parallel, this form of instruction level parallelism provides a new way to utilize the data parallelism first exploited during the early days of supercomputing. Accordingly, recent extensions to computing architectures utilize the SIMD technique to form architectures that support Streaming SIMD Extensions (SSE/SSE2) (“SIMD Extension Architectures”). SIMD Extension Architectures thereby enhance the performance of computationally intensive applications by utilizing a single operation that simultaneously processes different elements in a data set.
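As a hedged illustration of packed data elements, the following C sketch adds eight 16-bit integers with one SSE2 packed-add intrinsic (`_mm_add_epi16` from `<emmintrin.h>`). The function name `add8_i16` is hypothetical, and the scalar fallback for compilers that do not define `__SSE2__` is an assumption added so the example builds on other targets.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 integer intrinsics */
#endif

/* Adds eight packed 16-bit integers: out[i] = a[i] + b[i] for i = 0..7. */
void add8_i16(const int16_t a[8], const int16_t b[8], int16_t out[8])
{
    assert(a && b && out);
#if defined(__SSE2__)
    /* Load 128 bits (eight 16-bit elements) from each input. */
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* One instruction performs all eight additions (wraparound arithmetic). */
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    /* Portable fallback: the same eight additions, one at a time. */
    for (int i = 0; i < 8; i++)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

The unaligned load and store intrinsics are used so that the sketch makes no assumption about the alignment of the caller's arrays.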
In addition to serial code vectorization, exploiting data parallelism generally requires the use of SIMD clipping instructions and SIMD saturation instructions. In fact, implementing the conditional flow of control that is inherent to clipping and saturation operations without branch instructions is an important performance issue for SIMD Extension microarchitectures. Unfortunately, high-level programming languages generally do not include instructions or constructs for performing saturation arithmetic or clipping operations.
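The following C sketch is one possible way to express saturation and clipping without explicit branch statements, using comparison results as bit masks. It is offered only as an illustration: the function names are hypothetical, the mask-select pattern is a common idiom rather than a method taken from the description above, and a compiler remains free to generate whatever instructions it chooses for this source.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned 8-bit saturating add: results above 255 become 255. */
uint8_t sat_add_u8(uint8_t a, uint8_t b)
{
    uint8_t sum = (uint8_t)(a + b);          /* wraps modulo 256 */
    /* If the sum wrapped, it is smaller than a; the comparison yields 1,
     * and negating it gives an all-ones byte (0xFF), otherwise 0x00. */
    uint8_t overflow = (uint8_t)-(sum < a);
    return (uint8_t)(sum | overflow);
}

/* Clips x into the closed range [lo, hi] using mask-based selection. */
int32_t clip_i32(int32_t x, int32_t lo, int32_t hi)
{
    assert(lo <= hi);
    int32_t below = -(int32_t)(x < lo);      /* all ones if x < lo, else 0 */
    x = (x & ~below) | (lo & below);         /* select lo when below */
    int32_t above = -(int32_t)(x > hi);      /* all ones if x > hi, else 0 */
    x = (x & ~above) | (hi & above);         /* select hi when above */
    return x;
}
```

The compare-then-select structure mirrors how packed compare and logical instructions can replace a conditional jump, which is why such forms are of interest when every element of a vector may take a different path.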
As known to those skilled in the art, saturation and clipping constructs are commonly used in, for example, graphics applications to avoid anomalies where standard wraparound arithmetic would suddenly make bright pixels darker instead of brighter. However, due to the lack of saturation and clipping operations in programming languages like C++ and Fortran, such constructs have to be explicitly coded. The explicit coding is generally performed utilizing “if” statements or conditional expressions that test the value of operands before the actual arithmetic operations are performed. Therefore, there remains a need to overcome one or more of the limitations in the above-described, existing art.
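For illustration, the following C sketch shows the wraparound anomaly and the kind of explicit “if” coding described above, applied to 8-bit pixel values. The function names and the choice of an unsigned 8-bit pixel format are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Brightens pixels with ordinary wraparound arithmetic: a bright pixel
 * whose sum exceeds 255 wraps around and becomes dark. */
void brighten_wrap(uint8_t *px, size_t n, uint8_t delta)
{
    assert(px);
    for (size_t i = 0; i < n; i++)
        px[i] = (uint8_t)(px[i] + delta);
}

/* Brightens pixels with explicitly coded saturation: the operand is
 * tested before the addition so the result never exceeds 255. */
void brighten_saturate(uint8_t *px, size_t n, uint8_t delta)
{
    assert(px);
    for (size_t i = 0; i < n; i++) {
        if (px[i] > 255 - delta)
            px[i] = 255;                      /* would overflow: saturate */
        else
            px[i] = (uint8_t)(px[i] + delta); /* safe to add */
    }
}
```

With a pixel value of 200 and a delta of 100, the wraparound version stores 44 (300 modulo 256), whereas the saturating version stores 255. The “if” test inside the loop body is the conditional flow of control that makes such loops harder to map onto SIMD instructions.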