1. Field of the Invention
The present invention generally relates to the implementation of microprocessors, and more particularly to an improved processor implementation having a unified scalar and SIMD datapath.
2. Background Description
Contemporary high-performance processors support multimedia processing using single instruction multiple data (SIMD) techniques for exploiting instruction-level parallelism in programs; that is, for executing more than one operation at a time. In general, these processors contain multiple functional units, some of which are directed to the processing of scalar data and some of which are grouped for the processing of structured SIMD vector data. SIMD data streams are often used to represent multimedia datatypes, such as color information, using, for example, the RGB format by encoding the red, green, and blue components in the structured data type, or coordinate information, by encoding position as the quadruple (x, y, z, w). Diefendorff et al., “How Multimedia Workloads Will Change Processor Design”, Computer, Vol. 30, No. 9, IEEE, September 1997, and Conte et al., “Challenges to Combining General-Purpose and Multimedia Processors”, Computer, Vol. 30, No. 12, IEEE, December 1997, give an overview of multimedia processing using SIMD processing techniques. Implementations based on the addition of a full-function SIMD processing block to an existing scalar block lead to large processor cores where multiple units are unnecessarily replicated, each replica dedicated to the processing of either scalar data or one element of the structured multimedia data type.
To date, processors designed for processing multimedia data have typically been implemented by augmenting an existing scalar processor implementation, for instance by adding a SIMD unit. The SIMD unit itself consists of multiple functional units (i.e., fixed point units and floating point units) that mirror the resources available for the processing of scalar data types, with each functional unit type replicated once for each structured element supported by the SIMD architecture. Often, the only units shared between the scalar and SIMD processing units are the issue logic, which issues instructions to either the scalar or SIMD processing blocks, and the load/store unit, which governs access to the memory subsystem.
FIG. 1 is a block diagram depicting an example of a prior art processor containing both scalar processing units and a SIMD unit for processing structured data types, the SIMD unit comprising multiple processing units for each element in the structured data type. This processor implementation is exemplary of prior art systems; in some implementations, some register files may be shared, e.g., a combined integer and floating point register file, or additional register files may be present, such as a condition register file or a predicate register file for comparison results. In general, however, the use of separate scalar and SIMD processors is inefficient and expensive in that such a configuration includes a number of redundant functional units and data paths. Furthermore, such implementations consume an undesirable amount of power: while either the scalar or the SIMD unit is processing data, the other generally sits idle awaiting its next instruction, all the while consuming system power.
During operation of the system of FIG. 1, instructions are fetched by instruction fetch unit 100, and supplied to an instruction decode unit 102. Decoded instructions are passed to an issue/branch unit 104, where branch instructions are resolved and other instructions can be stored in the instruction issue unit thereof (not shown) until they can be executed in one of the functional units of the processor. The instruction issue unit can contain prediction logic, instruction reordering logic, instruction issue buffers and other logic supporting the high-performance issuing of instructions.
Instructions are issued by the issue/branch unit 104 to one or more of the load/store unit 106, the fixed-point unit 108, the floating-point unit 110, or the SIMD processing block 112. Before instructions can be processed by one or more of the processing units, one or more register accesses are usually required in a register file, e.g., the integer register file 114, the floating point register file 116, or the vector register file 118 which is a part of the SIMD multimedia extension found in many contemporary processors.
The SIMD multimedia processing block 112 typically contains a vector register file 118 for storing structured data (usually a vector consisting of four elements). The vector register file may be segmented into four sub-register files, each storing a single field of the structured data. The SIMD multimedia processor block 112 may contain several types of function units, each type being replicated for the number of elements in the structured data type supported by the multimedia extension. In FIG. 1, there are shown fixed point units 119 and floating point units 120 replicated four times to process one structure element each as can be found in the PowerPC™ VMX multimedia extension.
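The replication described above can be pictured with a minimal software sketch. This is an illustrative model only, not the circuitry of FIG. 1: the type name `vec4` and the functions `scalar_fadd` and `simd_fadd` are hypothetical, and a four-element vector of single-precision values is assumed, as in the (x, y, z, w) example. Each loop iteration stands in for one of the four replicated floating point units 120, while the scalar function stands in for the separate floating-point unit 110.

```c
/* Illustrative model only; all names here are hypothetical. */

/* One structured data value: four elements, e.g. (x, y, z, w). */
typedef struct {
    float e[4];
} vec4;

/* Stands in for the scalar floating-point unit 110. */
float scalar_fadd(float a, float b)
{
    return a + b;
}

/* Stands in for the SIMD block 112: one add per structure element,
 * each iteration modeling one of the four replicated units 120. */
vec4 simd_fadd(vec4 a, vec4 b)
{
    vec4 r;
    for (int lane = 0; lane < 4; lane++) {
        r.e[lane] = a.e[lane] + b.e[lane];
    }
    return r;
}
```

In this model the addition logic appears twice, once in `scalar_fadd` and once per lane in `simd_fadd`, which mirrors the duplication of functional units between the scalar and SIMD blocks.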
An alternative implementation style for combining scalar and SIMD processing capabilities, applicable only to very limited SIMD processing, is the use of subword parallelism. With subword parallelism, a single scalar unit can be partitioned into multiple subword units by inserting segmentation logic into the fixed point unit, e.g., by breaking a carry chain for addition and subtraction. However, such systems have been applied only to very limited applications, such as operations that can be trivially bit-sliced, e.g., logic operations like OR and XOR, and simple integer arithmetic, e.g., ADD and SUBTRACT, but not to complex integer arithmetic such as DIVIDE and MULTIPLY, to operations that require shifts, or to floating point arithmetic. An implementation of subword parallelism is described by R. Lee, “Multimedia Enhancements for PA-RISC Processors”, Hot Chips VI, Palo Alto, Calif., August 1994. This approach may also be used in conjunction with the previously described SIMD implementation technique, e.g., to segment four 32-bit integer data fields into sixteen 8-bit data fields when it is desirable to work on sixteen 8-bit data words in parallel instead of on four 32-bit data words, for instance when performing simple graphics rendering operations or string searches.
FIG. 2 shows an example of SIMD subword parallelism achieved by segmenting a scalar processing unit. The figure depicts a 32-bit datapath comprising a register file 200 and an ALU 202 comprising four 8-bit ALU slices. Depending on the mode of operation indicated by the signal labeled ‘segment’ 204, the 32-bit ALU 202 can be segmented into four 8-bit data paths by disabling information from flowing across 8-bit boundaries. This can be achieved in an exemplary manner using the carry signal for a 32-bit addition/subtraction operation, but a similar concept may be applied to other operations, e.g., by segmenting a multiplier array into sub-arrays dedicated to the processing of subword data types. Implementations based on subword parallelism, however, usually do not provide full functionality for the processing of all data types, but rather only for a subset of integer operations, in particular, only those integer operations which can be achieved by partitioning an integer ALU. Such implementations also suffer power inefficiency problems. For instance, when a required vector operation requires the use of fewer than all slices of the ALU 202, remaining idle slices nevertheless continue to consume power throughout the operation.
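The effect of the ‘segment’ signal on an addition can be sketched in software as follows. This is a minimal model under stated assumptions, not the hardware of FIG. 2: the function name `alu_add` and its `segment` parameter are hypothetical, and the carry chain is broken arithmetically by masking rather than by gating a carry signal.

```c
#include <stdint.h>

/* Hypothetical software model of a segmentable 32-bit adder.
 * segment == 0: ordinary 32-bit addition (carries ripple through all bits).
 * segment != 0: four independent 8-bit additions (no carry crosses a
 *               byte boundary; each byte wraps modulo 256). */
uint32_t alu_add(uint32_t a, uint32_t b, int segment)
{
    if (!segment) {
        return a + b;
    }
    /* Add the low 7 bits of every byte. The cleared top bit of each byte
     * absorbs the carry, so nothing propagates into the next byte. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold the top bit of each byte back in with XOR, which adds the
     * bits without generating a carry out of the byte. */
    return low ^ ((a ^ b) & 0x80808080u);
}
```

For example, with inputs 0x01FF01FF and 0x01010101 the unsegmented call returns 0x03000300, because the overflow from each 0xFF byte carries into its neighbor, whereas the segmented call returns 0x02000200, with each byte wrapping independently.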
Yet another attempt to integrate scalar and vector processing functionality is described in U.S. Pat. No. 5,423,051, issued Jun. 6, 1995, to Fuller et al. (the “'051 patent”). The '051 patent describes a processor system having an execution unit which can execute both scalar and vector operations, the system including enhanced load/store units directed towards the loading and storing of scalar or vector data, and a collection of pipelined functional units which can execute on either scalar or vector data.
The system of the '051 patent is directed towards traditional vector processing applications wherein long vectors are processed serially on a single or several pipelined functional units which can be shared among scalar and vector processing, rather than the execution of scalar and vector data on more recent “short vector” machines, as exemplified by the vector extensions found in PowerPC™ VMX, AMD 3DNow™, or Intel x86 MMX™/SSE/SSE2. These recent instruction set extensions are directed in particular towards multimedia processing employing short vectors (typically two to eight elements) and using tightly coupled parallel (and usually pipelined) processing units for each vector element. In addition, the system described in the '051 patent suffers from the same power consumption drawbacks that plague the other systems discussed in this section.
Thus, it is clear that a system and method are needed to control the execution on a tightly coupled parallel vector processing unit with coupled and parallel execution units such that scalar instructions can be processed. Furthermore, in light of the ever increasing need to reduce overall power consumption and heat dissipation in the processor, it is desirable to provide a system and method that can process scalar and vector data in an efficiently-designed processor while reducing power consumption.