1. Technical Field
The present invention relates in general to an improved data processing system, and in particular to an improved vector processor. Still more particularly, the present invention relates to an improved vector processor having a vector register interface unit for loading and storing vectors in a plurality of modes.
2. Description of the Related Art
In the art of data processing system design, the speed and computational power of the data processing system have continuously increased as designers make advances in semiconductor design and manufacturing techniques, and as the architectural design of the central processing unit (CPU) is improved. One such improvement in CPU architecture is the addition of pipelining.
Pipelining increases the speed of processing a sequence of instructions by starting the execution of an instruction before the execution of all previous instructions has been completed. For example, the CPU may begin fetching an instruction from memory while another part of the CPU is decoding a previously fetched instruction. Thus, pipelining does not speed up the execution of any one instruction, but it may speed up the processing of a sequence of instructions because succeeding instructions are being processed in the CPU before the processing of prior instructions has been completed.
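The effect described above can be illustrated with a simple cycle count. The following is a minimal sketch that assumes an idealized pipeline in which every stage takes one clock cycle and no stalls occur; the function names and the four-stage example are hypothetical and are not taken from any particular processor.

```python
def unpipelined_cycles(num_instructions, num_stages):
    """Cycles needed when each instruction finishes before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions, num_stages):
    """Cycles needed in an ideal pipeline: the first instruction takes
    num_stages cycles, and each later instruction completes one cycle
    after its predecessor (assumes no stalls)."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


# A single instruction is no faster with pipelining...
single = (unpipelined_cycles(1, 4), pipelined_cycles(1, 4))
# ...but a sequence of ten instructions finishes much sooner.
sequence = (unpipelined_cycles(10, 4), pipelined_cycles(10, 4))
```

Under these assumptions a single instruction takes four cycles either way, while ten instructions take 40 cycles without pipelining and 13 cycles with it.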
Another architectural improvement in CPU design is the utilization of special processor functional blocks which are optimized for rapidly performing a limited set of instructions. For example, some CPUs include functional blocks for performing only fixed-point arithmetic, or for performing only floating-point arithmetic, or for processing only branch instructions. These functional blocks, which may be referred to as execution units, may perform their assigned limited functions much faster than a general purpose processor is able to perform the same function.
When the vertical parallelism achieved by pipelining is combined with the horizontal parallelism achieved by utilizing multiple execution units, the computational power of the CPU is further increased. Such a combination of vertical and horizontal parallelism permits the hardware to take a sequential instruction stream and dispatch (or issue) several instructions per clock cycle into pipelines associated with the execution units. A CPU that utilizes multiple pipelined execution units may be called a superscalar processor.
One example of such a superscalar data processing system having multiple pipelined execution units is the processor manufactured under the trademark "IBM RISC SYSTEM/6000 Model 59H" by International Business Machines Corporation (IBM) of Armonk, N.Y. Examples of execution units contained in the Model 59H include a branch execution unit, a load/store execution unit, a fixed-point execution unit, and a floating-point execution unit. The branch execution unit may be used to fetch multiple instructions from memory during a single clock cycle, and dispatch such instructions to another appropriate execution unit via an instruction dispatch bus. The load/store execution unit, which may be implemented within the fixed-point execution unit, performs address calculations and generates memory requests for instructions requiring memory access. The floating-point execution unit is optimized to receive, buffer, and execute floating-point calculations. The fixed-point execution unit may be utilized to perform integer calculations.
In many prior art CPUs, a single instruction stream directs the CPU to perform operations on a single data stream. That is, each CPU instruction performs an operation on defined data to produce one calculation per instruction. Such CPUs may be referred to as "single-instruction single-data" (SISD) processors. One problem with SISD processors may become apparent during the execution of software which performs the same operation on multiple data operands utilizing the same instruction. If the application program requires that the same CPU instruction be performed utilizing multiple data operands, the CPU may be programmed to loop through a short software segment many times. Such a software segment may be referred to as a "DO loop". Such a DO loop performs a particular operation on multiple data operands by repeatedly recalling a particular instruction from memory in each pass through the DO loop. Such repeated recalling of a single instruction may reduce the instruction bandwidth of the CPU. Such a reduction in available instruction bandwidth means that the CPU may not be able to fetch instructions for other execution units, thereby preventing the pipelines of those other execution units from remaining filled.
Another improvement in CPU architecture permits the CPU to utilize a single instruction to operate on multiple data streams, or multiple operands. Such a CPU architecture is utilized in a "single-instruction multiple-data" (SIMD) processor. In an SIMD processor, high-level operations, invoked by a single instruction, are performed on vectors. A vector is a linear array of numbers, wherein each number may be referred to as an element of the vector.
A typical vector operation might add, say, two 64-entry, floating-point vectors to obtain a single 64-entry vector. Such a vector instruction may be equivalent to an entire DO loop in which each iteration of the DO loop includes a computation of one of the 64 elements of the result, an update of the indices, and a branch back to the beginning of the DO loop. For an instructional discussion of vector processing, see chapter 7 of Computer Architecture, A Quantitative Approach, by John L. Hennessy and David A. Patterson, published by Morgan Kaufmann Publishers, Inc., Palo Alto, Calif., pages 351-379.
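The equivalence between a DO loop and a single vector operation may be sketched in software as follows. This is an illustration only: the function names are hypothetical, and the Python list expression merely stands in for a hardware vector instruction.

```python
def scalar_add(a, b):
    """Element-by-element add written as an explicit loop (the DO-loop form)."""
    result = [0.0] * len(a)
    i = 0
    while i < len(a):            # branch back to the top on each pass
        result[i] = a[i] + b[i]  # compute one element of the result
        i += 1                   # update the index
    return result


def vector_add(a, b):
    """The same computation expressed as one whole-vector operation."""
    return [x + y for x, y in zip(a, b)]


# Two 64-entry floating-point vectors, as in the example above.
a = [float(i) for i in range(64)]
b = [2.0 * i for i in range(64)]
total = vector_add(a, b)
```

Both functions produce the same 64-entry result; the loop form repeats the add, index update, and branch 64 times, whereas the vector form names the whole operation once.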
Recently, the inclusion of an "SIMD execution unit" in a superscalar processor has been proposed. Such an SIMD execution unit is capable of receiving a vector instruction from the branch execution unit and performing a corresponding vector calculation. Such vector calculations include fixed- and floating-point vector calculations. Within the SIMD execution unit, multiple processing elements are coupled to a register file, or register array. Each register in the register file may store an element of a vector. The register array is configured as N rows of registers by M columns of registers, where each of the N rows is coupled to its own processing element for performing calculations and other operations using elements in the coupled row. Vector processing speed is increased by permitting each processing element to simultaneously operate on a row of registers in the register array.
Since one SIMD execution unit instruction may operate on multiple elements of a vector as directed by a single instruction, it becomes a problem to load all of the vector elements into the SIMD execution unit (i.e., the register array), and to store vector elements back to memory once vector calculations have been performed. Thus, the SIMD execution unit must load and store a vast amount of data in the form of vector elements compared to the number of instructions received by the SIMD execution unit. This may be referred to as a data bandwidth problem, as compared to an instruction bandwidth problem.
To solve such a data bandwidth problem, vector processors may recall and transfer several elements at once from memory into the vector processor. Similarly, several elements may be simultaneously stored back to memory. Typically, memory used to store such vectors is a high speed, cache-based memory system, which may provide access to several consecutive memory elements in order of increasing address in a single memory access. Such a memory access may be referred to as a "stride-1" memory access. The stride of the memory access refers to the separation between elements that are to be merged into a single vector (i.e., the address separation between two successive vector elements as those elements are stored in memory).
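The notion of stride may be made concrete by listing the memory locations touched by an access. The sketch below assumes that addresses are counted in units of one vector element rather than in bytes; the function name and the base address of 100 are hypothetical.

```python
def element_addresses(base, stride, count):
    """Return the locations of `count` successive vector elements.

    Addresses are in units of one element (an assumption made for
    simplicity), so a stride of 1 means adjacent locations.
    """
    return [base + i * stride for i in range(count)]


stride_1 = element_addresses(100, 1, 4)   # consecutive locations
stride_2 = element_addresses(100, 2, 4)   # every other location
```

Here `stride_1` is `[100, 101, 102, 103]` and `stride_2` is `[100, 102, 104, 106]`.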
Loading or storing stride-1 vectors efficiently exploits the organization of the cache because consecutive memory locations are accessed in order of increasing address. Thus, vector elements may be loaded from, or stored to, a cache "line," which is a range of memory locations that may be specified with a starting address and a length. Such a cache line is the basic unit for cache memory operations.
However, the problem of interfacing memory to vector processors is not completely solved by loading or storing stride-1 vectors. In certain calculations, it may be necessary to load or store vector elements from every other memory location, in order of increasing address. Such is the case when a vector includes a real part comprised of a plurality of real elements, and an imaginary part comprised of a plurality of imaginary elements. Typically, such a vector is stored in memory with such real and imaginary elements alternating in consecutively numbered memory locations. Thus, complex vectors may be stored in memories as pairs of real and imaginary components or elements that occupy adjacent memory locations.
Vector operations on complex vectors typically require that the real and imaginary elements be placed in separate vector registers in the vector processor. This would require two stride-2 loads: one to load the real elements into consecutive register locations in the register array, and one to load the imaginary elements into consecutive register locations in the register array, thereby forming two vectors in two vector registers. Such stride-2 loads would not fully utilize the memory bandwidth that would be available if the load used a stride-1 access. Cache-based memory systems typically do not support a true stride-2 memory access because it is expensive to implement.
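The interleaved storage of complex vectors, and the separation into real and imaginary vectors that vector operations require, may be pictured with the following sketch. It models memory as a Python list and is not a description of any particular hardware; the function name and the sample values are hypothetical.

```python
def split_complex(interleaved):
    """Separate alternating real/imaginary elements into two vectors.

    Even positions hold real elements and odd positions hold imaginary
    elements, matching the storage layout described above.
    """
    reals = interleaved[0::2]   # locations 0, 2, 4, ... (a stride-2 pattern)
    imags = interleaved[1::2]   # locations 1, 3, 5, ... (a stride-2 pattern)
    return reals, imags


# Three complex numbers stored as adjacent (real, imaginary) pairs.
memory = [1.0, 10.0, 2.0, 20.0, 3.0, 30.0]
reals, imags = split_complex(memory)
```

Each of the two result vectors is drawn from every other memory location, which is why a naive implementation needs two stride-2 accesses over the same range of memory.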
Other data bandwidth problems may arise during certain vector calculations which require loading or storing vector elements from memory locations that are separated in memory by n number of locations, where n is greater than one. Such a loading or storing operation may be referred to as a stride-n memory access. Loading a row of a matrix which has been stored in memory in column-major order is an important example of an operation requiring a stride-n load from memory. Column-major order means that adjacent elements in the same column of a matrix occupy adjacent memory locations in memory. Previous vector processors having cache-based memory systems have required a separate address for each vector element in a row because the elements in a row are not stored in consecutive memory locations, which means that a fast stride-1 access may not be utilized to load a row into the vector processor.
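The stride that results from column-major storage may be worked out with a short index calculation. The sketch below assumes a dense matrix with no padding between columns and counts locations in units of one element; the function names and the 3-by-4 example are hypothetical.

```python
def column_major_index(row, col, num_rows):
    """Memory location of element (row, col) in column-major order."""
    return col * num_rows + row


def row_indices(row, num_rows, num_cols):
    """Locations of all elements in one row: separated by num_rows."""
    return [column_major_index(row, c, num_rows) for c in range(num_cols)]


def column_indices(col, num_rows):
    """Locations of all elements in one column: adjacent (stride 1)."""
    return [column_major_index(r, col, num_rows) for r in range(num_rows)]


# A matrix with 3 rows and 4 columns stored in column-major order.
row_1 = row_indices(1, 3, 4)      # stride-n access with n = 3
col_2 = column_indices(2, 3)      # stride-1 access
```

For this example `row_1` is `[1, 4, 7, 10]` and `col_2` is `[6, 7, 8]`, so loading a row is a stride-n access in which n equals the number of rows in the matrix.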
Algorithms that involve matrix operations often partition the matrices into smaller sub-matrices. These algorithms often require loading a group of adjacent rows of the sub-matrix. A typical matrix operation may use a column (or sub-column) of matrix A as a first operand and a row (or sub-row) of matrix B as a second operand. Such an operation requires that a column from a matrix be placed in one vector register and a row from a matrix be placed in a second vector register so that the two vector registers may be used as source operands in a vector operation.
Loading a matrix column into a first vector register is a straightforward stride-1 memory access. However, loading a matrix row into a second vector register requires a series of stride-n memory accesses, which are typically not efficient in known cache-based vector processing systems.
Even vector processors that do not have a cache can suffer substantial performance problems with stride-n loads and stores. Systems without caches, such as some powerful supercomputers, may use a memory system that supports multiple memory accesses per cycle per processor by providing numerous memory banks that operate independently. While stride-n addressing works well for many values of n, certain values of n may lead to conflicts in the memory system when many accesses require data from the same memory bank rather than spreading accesses across many banks. Therefore, supporting stride-n vector loads and stores having the same performance as a stride-1 load and store is very desirable.
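The bank conflicts mentioned above may be illustrated by counting how many distinct banks a strided access touches. The sketch assumes a simple low-order interleaving in which the bank number is the element address modulo the number of banks; real memory systems may map addresses to banks differently, and the function name and the choice of eight banks are hypothetical.

```python
def banks_touched(stride, num_banks, count, base=0):
    """Number of distinct banks hit by `count` accesses at a given stride.

    Assumes bank = address % num_banks (low-order interleaving).
    """
    return len({(base + i * stride) % num_banks for i in range(count)})


# Eight accesses against a memory system with eight banks.
spread = {n: banks_touched(n, 8, 8) for n in (1, 3, 4, 8)}
```

Under this assumption strides of 1 and 3 spread the eight accesses across all eight banks, a stride of 4 uses only two banks, and a stride of 8 sends every access to the same bank.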
Still another memory access problem may arise during certain calculations which require loading or storing vector elements from adjacent memory locations in reverse order, or in order of decreasing address. To load elements in reverse order (relative to a stride-1 access order), previous vector processors having cache-based memory systems required a separate address for each vector element because the order of the vector elements is reversed compared to a typical stride-1 access. Therefore, supporting stride-(-1) vector loads and stores with the same performance as stride-1 loads and stores is desirable.
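A reverse-order access covers the same adjacent locations as a stride-1 access but delivers the elements in the opposite order. The following sketch models memory as a Python list and shows the relationship between the two; it illustrates the access pattern only, and the function name and sample contents are hypothetical.

```python
def reversed_load(memory, start, count):
    """Return `count` elements beginning at `start`, in decreasing address order.

    The same elements could be fetched as one contiguous block from the
    lowest address and then reversed, which is what this model does.
    """
    lowest = start - count + 1
    if lowest < 0 or start >= len(memory):
        raise ValueError("access falls outside the modeled memory")
    block = memory[lowest:start + 1]   # contiguous, increasing addresses
    return block[::-1]                 # present in decreasing address order


memory = list(range(100, 110))         # location i holds the value 100 + i
vector = reversed_load(memory, 7, 4)   # locations 7, 6, 5, 4
```

In this example `vector` is `[107, 106, 105, 104]`, the contents of locations 7 down to 4.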