1. Technical Field
The present invention relates in general to an improved data processing system, and in particular to an improved vector processor. Still more particularly, the present invention relates to an improved vector processor having a dynamically reconfigurable register file.
2. Description of the Related Art
In the art of data processing system design, the speed and computational power of the data processing system have continuously increased as designers make advances in semiconductor design and manufacturing techniques, and as the architectural design of the central processing unit (CPU) is improved. One such improvement in CPU architecture is the addition of pipelining.
Pipelining increases the speed of processing a sequence of instructions by starting the execution of each instruction before the execution of all previous instructions is completed. For example, the CPU may begin fetching an instruction from memory while another part of the CPU is decoding a previously fetched instruction. Thus, pipelining does not speed up the execution of any one instruction, but it may speed up the processing of a sequence of instructions because succeeding instructions are being processed in the CPU before the processing of prior instructions has been completed.
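The effect described above may be illustrated with a simple cycle count. The following Python sketch is offered only as an illustration under idealized assumptions that do not come from the description itself: every pipeline stage takes one clock cycle, no stalls occur, and the function names are hypothetical.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction passes through every stage
    before the next instruction is started."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed in an idealized pipeline (assumption: one cycle per
    stage, no stalls). The first instruction fills the pipeline, and each
    later instruction completes one cycle after its predecessor."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # illustrative values only
    print("unpipelined:", unpipelined_cycles(n, k))
    print("pipelined:  ", pipelined_cycles(n, k))
```

Under these assumptions a single instruction still requires all `num_stages` cycles, which is consistent with the observation that pipelining does not speed up any one instruction, only the sequence as a whole.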
Another architectural improvement in CPU design is the utilization of special processor functional blocks that are optimized for rapidly performing a limited set of instructions. For example, some CPUs include functional blocks for performing only fixed-point arithmetic, or only floating-point arithmetic, or for processing only branch instructions. These functional blocks, which may be referred to as execution units, may perform their assigned limited functions much faster than a single general purpose processor is able to perform the same function.
When the vertical parallelism achieved by pipelining is combined with the horizontal parallelism achieved by utilizing multiple execution units, the computational power of the CPU is further increased. Such a combination of vertical and horizontal parallelism permits the hardware to take a sequential instruction stream and dispatch (or issue) several instructions per clock cycle into the pipelines associated with the execution units. A CPU that utilizes multiple pipelined execution units may be called a superscalar processor.
FIG. 1 is a high-level block diagram of such a superscalar data processing system. As illustrated, superscalar data processing system 100 includes branch execution unit 102, which is coupled to memory 104 via instruction bus 106 and address bus 108. Branch execution unit 102 may fetch multiple instructions from memory 104 during a single clock cycle and dispatch such instructions to an appropriate execution unit via instruction dispatch buses 110.
Another execution unit within superscalar data processing system 100 is load/store execution unit 112. Load/store execution unit 112, which may be implemented within a fixed-point execution unit, performs address calculations and generates memory requests for instructions requiring memory access. Load/store execution unit 112 provides address information to memory 104 via address bus 114.
Floating-point execution unit 116 may also be included within superscalar data processing system 100. Floating-point execution unit 116 is optimized to receive, buffer, and execute floating-point calculations. Floating-point execution unit 116 may be coupled to memory 104 via data bus 118.
Fixed-point execution unit 120 is yet another execution unit which may be included within superscalar data processing system 100. Fixed-point execution unit 120, which may be coupled to memory 104 via data bus 122, may be utilized to perform integer calculations. In some implementations, fixed-point execution unit 120 may include the load/store functions performed by load/store execution unit 112. One example of such a superscalar data processing system having multiple pipelined execution units is the processor manufactured under the trademark "IBM RISC System/6000 Model 59H" by International Business Machines Corporation (IBM) of Armonk, N.Y.
In many prior art CPUs, a single instruction stream directs the CPU to perform operations on a single data stream. That is, each CPU instruction performs an operation on defined data to produce one calculation per instruction. Such CPUs may be referred to as "single-instruction single-data" (SISD) processors. One problem with SISD processors may be seen during the execution of software which performs the same instruction utilizing multiple data operands. If the application program requires the same CPU instruction to be performed using multiple data operands, the CPU may be programmed to loop through a short software segment many times. That is, the CPU may be programmed to perform a "DO loop" to perform a particular operation on multiple data operands. During such a DO loop, the instruction performed on multiple operands must be recalled from memory in each pass through the DO loop. This process of repeatedly recalling a single instruction may reduce the available instruction bandwidth of the CPU. Such a reduction in available instruction bandwidth means that the CPU may not be able to fetch instructions for other execution units to keep all the pipelines filled.
Another improvement in CPU architecture permits the CPU to utilize a single instruction to operate on multiple data streams or multiple operands. Such a CPU architecture is utilized in a "single-instruction multiple-data" (SIMD) processor. In an SIMD processor, high-level operations, invoked by a single instruction, are performed on vectors, which are linear arrays of numbers. A typical vector operation might add two 64-entry, floating-point vectors to obtain a single 64-entry vector. This vector instruction may be equivalent to an entire DO loop, in which each iteration of the DO loop includes a computation of one of the 64 elements of the result, an update of the indices, and a branch back to the beginning of the DO loop. For an instructional discussion of vector processing, see chapter 7 of Computer Architecture, A Quantitative Approach by John L. Hennessy and David A. Patterson, published by Morgan Kaufmann Publishers, Inc., Palo Alto, Calif., pages 351-379.
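The equivalence between a DO loop and a single vector instruction may be sketched as follows. This Python example is illustrative only: the function names are hypothetical, the two input vectors are assumed to have equal length, and the count of three instructions per loop iteration (compute, index update, branch) is an assumption drawn from the loop description above rather than a measurement of any particular processor.

```python
def scalar_do_loop_add(a, b):
    """Add two equal-length vectors one element at a time, as an SISD
    processor would, while counting the instructions issued."""
    result = []
    instructions_issued = 0
    i = 0
    while i < len(a):
        result.append(a[i] + b[i])  # compute one element of the result
        i += 1                      # update the index
        # Assumption: each pass issues a compute, an index update,
        # and a branch back to the top of the loop.
        instructions_issued += 3
    return result, instructions_issued


def vector_add(a, b):
    """Add two equal-length vectors as one hypothetical vector instruction."""
    result = [x + y for x, y in zip(a, b)]
    return result, 1  # a single instruction is issued for the whole vector


if __name__ == "__main__":
    a = [float(i) for i in range(64)]
    b = [1.0] * 64
    scalar_result, scalar_count = scalar_do_loop_add(a, b)
    vector_result, vector_count = vector_add(a, b)
    print(scalar_result == vector_result, scalar_count, vector_count)
```

Both functions produce the same 64-entry result; only the number of instructions that must be fetched and issued differs.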
Another advantage of using an SIMD-parallel processor is that the computation of each result, or element in a vector, is independent of the computation of a previous result. This allows a very deep pipeline without generating data hazards. Data hazards occur when the execution of an operation depends upon the result of a previously scheduled operation.
Another advantage of using an SIMD-parallel processor is that vector instructions that access memory have a known memory access pattern. For example, if the vector's elements are all stored in adjacent memory locations, then fetching the vector from a set of heavily interleaved memory banks works well. When recalling a vector from main memory, the high latency of initiating a main memory access (compared with the latency of accessing a cache memory) may be amortized over the access for an entire vector rather than a single element. Thus, the cost of the main memory latency may be incurred once for the entire vector, rather than for each word of the vector.
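The amortization of main memory latency may be illustrated with a simple timing model. The sketch below is not taken from any particular memory system; it assumes a fixed startup latency per memory access and assumes that, once an access has started, interleaved memory banks deliver one word per clock cycle. The function names and the sample latency are hypothetical.

```python
def per_word_access_cycles(num_words: int, startup_latency: int) -> int:
    """Cycles when every word is fetched by a separate memory access, so
    the startup latency is paid once per word."""
    return num_words * (startup_latency + 1)


def vector_access_cycles(num_words: int, startup_latency: int) -> int:
    """Cycles when the whole vector is fetched by one access. Assumption:
    after the startup latency, one word arrives per cycle."""
    if num_words == 0:
        return 0
    return startup_latency + num_words


if __name__ == "__main__":
    words, latency = 64, 20  # illustrative values only
    print("per-word accesses:", per_word_access_cycles(words, latency))
    print("one vector access:", vector_access_cycles(words, latency))
```

In this model the latency term appears once in the vector case and once per word in the scalar case, which is the amortization described above.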
Yet another advantage of using an SIMD-parallel processor is that such a single vector instruction may specify a large amount of computational work. Such a single vector instruction may be equivalent to executing an entire DO loop. Thus, the instruction bandwidth requirement of the CPU is reduced, and the "Flynn bottleneck," which is a limitation that prevents fetching and issuing of more than a few instructions per clock cycle, is considerably mitigated.
Because of the advantages described above, vector operations may be executed faster than a sequence of scalar operations on the same number of data elements. However, many vector processors in the prior art use a fixed register configuration, which may not be the optimum register configuration for a particular vector calculation performed by an application. The register configuration of such known vector processors may be fixed in three areas of design: first, the number of available vector registers may not be optimal across a broad range of applications; second, the vector register length (i.e., the number of elements that may be stored in a vector register) may not be large enough for the vector calculations required by an application program; and third, the number of vector registers versus the number of scalar registers may not be optimal for the vector and scalar calculations required by a particular application program.
Many known vector processors include a number of vector registers having a fixed length, which may range from 64 elements to 256 elements. If an application program manipulates vectors having a vector length greater than the length of the vector register, the application program performs two or more smaller vector operations in order to complete a vector operation on such a large vector. Conversely, if a vector processor provides relatively large vector registers, many application programs will not utilize the entire vector capacity during vector calculations involving smaller vectors, thus leaving valuable register storage space unused. Because different applications, or portions of the same application, may require the processing of vectors having different vector lengths, it is difficult to select a vector register length that will satisfy the needs of all vector processing application programs. Moreover, many prior art vector processors will not permit unused storage locations in a long vector register to be utilized to store an additional vector or scalar operand.
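The two mismatches described above may be made concrete with a short sketch. Dividing a long vector into pieces that fit a fixed-length vector register is commonly called strip mining. The Python example below is illustrative only; the function names and the sample lengths are hypothetical and do not describe any particular prior art processor.

```python
def strip_mine(vector_length: int, register_length: int) -> list:
    """Return the lengths of the smaller vector operations needed to
    process a vector with registers of a fixed length."""
    if register_length <= 0:
        raise ValueError("register_length must be positive")
    full_chunks, remainder = divmod(vector_length, register_length)
    chunks = [register_length] * full_chunks
    if remainder:
        chunks.append(remainder)  # final, partially filled operation
    return chunks


def unused_elements(vector_length: int, register_length: int) -> int:
    """Count the register elements left unused across all chunks."""
    chunks = strip_mine(vector_length, register_length)
    return sum(register_length - chunk for chunk in chunks)


if __name__ == "__main__":
    # A vector longer than the register needs several operations.
    print(strip_mine(200, 64), unused_elements(200, 64))
    # A short vector in a long register wastes most of the register.
    print(strip_mine(16, 256), unused_elements(16, 256))
```

The first case shows a long vector being split into several operations, and the second shows a short vector occupying only a small fraction of a long register.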
The selection of a fixed number of vector registers and a fixed number of scalar registers causes a similar problem. This problem occurs because some calculations involving vectors may involve no scalar operands, while other calculations involving vectors may require several scalar operands. The differing requirements make the selection of a fixed number of vector registers and a fixed number of scalar registers impossible to optimize for the various requirements that different application programs may impose.
Thus, the problem remaining in the prior art is to provide a method and system for efficiently and dynamically reconfiguring a register file in a vector processor such that optimal vector register length may be selected, the optimal partition between scalar registers and vector registers may be selected, and vectors having different lengths may be simultaneously stored in the register file of the vector processor.