1. Technical Field
The present invention relates to systems for processing data and, in particular, to systems for processing data through single-instruction, multiple-data (SIMD) operations.
2. Background Art
Processor designers are always looking for ways to enhance the performance of microprocessors. Processing multiple operands in parallel provides one avenue for gaining additional performance from today's highly optimized processors. In certain common mathematical calculations and graphics operations, the same operation or sequence of operations is performed repeatedly on each of a large number of operands. For example, in matrix multiplication, the row elements of a first matrix are multiplied by corresponding column elements of a second matrix and the resulting products are summed (multiply-accumulate). By providing appropriate scheduling and execution resources, multiply-accumulate operations may be implemented concurrently on multiple sets of row-column operands. This approach is known as vector processing or single instruction, multiple data stream (SIMD) processing, to distinguish it from scalar or single instruction, single data stream (SISD) processing.
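The multiply-accumulate pattern described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function names multiply_accumulate and matmul are assumptions introduced here, and the loop runs sequentially rather than concurrently.

```python
def multiply_accumulate(row, col):
    """Multiply corresponding row and column elements and sum the products."""
    acc = 0.0
    for a, b in zip(row, col):
        acc += a * b  # one multiply-accumulate step
    return acc


def matmul(first, second):
    """Form each result element from one row-column multiply-accumulate."""
    columns = list(zip(*second))  # columns of the second matrix
    return [[multiply_accumulate(row, col) for col in columns] for row in first]
```

Each call to multiply_accumulate is independent of the others, which is why multiple row-column operand sets can, in principle, be processed at the same time.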
In order to implement SIMD operations efficiently, data is typically provided to the execution resources in a "packed" data format. For example, a 64-bit processor may operate on a packed data block that includes two 32-bit operands. In this example, a vector multiply-accumulate instruction, V-FMA (f1, f2, f3), multiplies each of a pair of 32-bit operands stored in register f1 by the corresponding one of a pair of 32-bit operands stored in register f2 and adds the resulting products to a pair of running sums stored in register f3. In other words, data is stored in the registers f1, f2, and f3 in a packed format that provides two operands from each register entry. If the processor has sufficient resources, it may process two or more packed data blocks, e.g., four or more 32-bit operands, concurrently. The 32-bit operands are routed to different execution units for processing in parallel and subsequently repacked, if necessary.
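As a rough software model of the packed format, the following Python sketch treats each 64-bit register as an integer holding two single-precision values, unpacks the pairs, applies the multiply-accumulate to each pair, and repacks the results. The helper names pack2, unpack2, and v_fma, the little-endian layout, and the use of Python's struct module are all assumptions made for illustration; the arithmetic is done in double precision and rounded on repacking, so it does not model the rounding behavior of any particular hardware unit.

```python
import struct


def pack2(low, high):
    """Pack two 32-bit floating-point operands into one 64-bit register image."""
    return int.from_bytes(struct.pack("<2f", low, high), "little")


def unpack2(register):
    """Split a 64-bit register image into its two 32-bit component operands."""
    return struct.unpack("<2f", register.to_bytes(8, "little"))


def v_fma(f1, f2, f3):
    """Model of a vector multiply-accumulate on three packed registers."""
    a0, a1 = unpack2(f1)
    b0, b1 = unpack2(f2)
    c0, c1 = unpack2(f3)
    # Each component pair could be routed to its own execution unit;
    # here the two lanes are simply evaluated one after the other.
    return pack2(a0 * b0 + c0, a1 * b1 + c1)
```

For instance, with f1 holding (1.5, 2.0), f2 holding (2.0, 4.0), and f3 holding (0.5, 1.0), unpacking the result of v_fma gives (3.5, 9.0).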
Even in graphics-intensive and scientific programming, not all operations are SIMD operations. Much of the software executed by general-purpose processors comprises instructions that perform scalar operations. That is, each source register specified by an instruction stores one operand, and each target register specified by the instruction receives one operand. In the above example, a scalar floating-point multiply-accumulate instruction, S-FMA (f1, f2, f3), may multiply a single 64-bit operand stored in register f1 by a corresponding 64-bit operand stored in register f2 and add the product to a running sum stored in register f3. Each operand processed by the S-FMA instruction is provided to the floating-point multiply-accumulate (FMAC) unit in an unpacked format.
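The scalar case can be modeled in the same hedged way. In the Python sketch below, each 64-bit register image holds exactly one double-precision operand; the names to_register, from_register, and s_fma are hypothetical and are not taken from any instruction set.

```python
import struct


def to_register(value):
    """Store one 64-bit floating-point operand in a 64-bit register image."""
    return int.from_bytes(struct.pack("<d", value), "little")


def from_register(register):
    """Read the single unpacked operand held in a 64-bit register image."""
    return struct.unpack("<d", register.to_bytes(8, "little"))[0]


def s_fma(f1, f2, f3):
    """Model of a scalar multiply-accumulate: one operand per register."""
    return to_register(from_register(f1) * from_register(f2) + from_register(f3))
```

Unlike the packed case, no unpacking into component operands is needed: the whole register entry is a single operand.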
The register files that provide source operands to, and receive results from, the execution units consume a significant amount of a processor's die area. Available die area is a scarce resource on most processor chips. For this reason, processors typically include one register file for each major data type. For example, a processor typically has one floating-point register file that stores both packed and unpacked floating-point operands. Consequently, packed and unpacked operands are designed to fit in the same sized register entries, despite the fact that a packed operand includes two or more component operands.
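To make the shared register entry concrete, the following Python sketch assumes a 64-bit entry and shows that one double-precision operand and a pair of single-precision operands occupy the same number of bytes. The names REGISTER_BYTES, as_unpacked, and as_packed are illustrative assumptions, as is the little-endian byte order.

```python
import struct

REGISTER_BYTES = 8  # assumed width of one register entry (64 bits)

unpacked_entry = struct.pack("<d", 1.0)       # one 64-bit operand
packed_entry = struct.pack("<2f", 1.0, 2.0)   # two 32-bit component operands


def as_unpacked(entry):
    """Interpret a register entry as a single 64-bit operand."""
    return struct.unpack("<d", entry)[0]


def as_packed(entry):
    """Interpret a register entry as two 32-bit component operands."""
    return struct.unpack("<2f", entry)
```

Both entries are REGISTER_BYTES long, but the bits carry no indication of their format: reading unpacked_entry with as_packed yields (0.0, 1.875) rather than anything resembling 1.0. In this model, the instruction that consumes an entry must therefore apply the interpretation that matches how the entry was written.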
Providing execution resources for packed and unpacked operands creates performance/cost challenges. One way to provide high performance scalar and vector processing is to include separate scalar and vector execution units. An advantage of this approach is that the vector and scalar execution units can each be optimized to process data in its corresponding format, i.e. packed and unpacked, respectively. The problem with this approach is that the additional execution units consume silicon die area, which is a relatively precious commodity.
In addition to providing appropriate execution resources, high performance processors must include mechanisms for transferring both packed and unpacked operand data efficiently. These mechanisms include those that transfer operand data to the register file from the processor's memory hierarchy, e.g. caches, and those that transfer operand data from the register file to the execution resources.
The present invention addresses these and other problems with currently available SIMD systems.