The concept of vector processing has been incorporated into computing systems to increase performance for application programs that can be expressed in terms of operations on vector operands. Vector processing systems include special purpose vector instructions for performing consecutive sequences of operations using pipelined execution units. Since multiple operations are implied by a single vector instruction, vector processing systems require fewer instructions to be fetched and decoded by the hardware. Vector processing reduces the frequency of branch instructions since the vector instructions themselves specify repetition of processing operations on different data elements. Utilization of pipelined execution units is maximized since the same operation is repeated on multiple data elements--once enough vector elements have entered the pipeline to fill all its stages, one result is produced each machine cycle. Multiple parallel execution pipelines can be implemented within the vector processing paradigm to process logically adjacent vector elements concurrently.
A certain amount of overhead is normally associated with vector processing. There is overhead to fill the data and execution pipelines, as well as overhead for additional instructions that are needed to set up and control the vector pipeline, such as those for breaking large vector operands into sections that will fit in smaller vector registers. To compensate, prior art vector computers use large registers so that the vector startup overheads are amortized over a large number of pipelined operations. These large registers consume valuable chip space and add significant product cost because registers are implemented in expensive high-speed technologies. Large vector registers can also require saving and restoring thousands of bytes when processing interruptions, imposing additional processing overhead.
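The amortization argument can be illustrated with a simple cycle-count model. The following is a minimal sketch, not a description of any particular machine: it assumes a fixed startup cost per vector section and one result per cycle once the pipeline is full. The function names and the numeric values (1024 elements, 10 startup cycles) are hypothetical and chosen only for illustration.

```python
import math


def vector_cycles(n_elements, section_size, startup_cycles):
    """Estimate cycles to process n_elements in sections.

    Assumed model: each section pays startup_cycles once (pipeline fill and
    setup), then produces one result per cycle.
    """
    sections = math.ceil(n_elements / section_size)
    return sections * startup_cycles + n_elements


def efficiency(n_elements, section_size, startup_cycles):
    """Fraction of cycles that produce a result under the same assumed model."""
    return n_elements / vector_cycles(n_elements, section_size, startup_cycles)


if __name__ == "__main__":
    # Hypothetical workload: 1024 elements, 10 startup cycles per section.
    for size in (8, 64, 512):
        cycles = vector_cycles(1024, size, 10)
        print(size, cycles, round(efficiency(1024, size, 10), 3))
```

Under this model, larger sections pay the startup cost fewer times, which is the reason prior art designs favor large vector registers despite their cost.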
An aspect of prior art vector processors is their ability to supply vector elements from memory in a pipelined fashion in order to keep the vector execution pipeline full. In architectures with large vector sections, this behavior is a natural fallout of the size of the vector registers. Entire sections worth of data can be prefetched without the need to guess address reference patterns. The vector architecture guarantees these operands will be used. There is overhead associated with these prefetched operands that occurs at section boundaries, where the full memory access latency is incurred before the first element is returned. This overhead is incurred for each vector section that must be fetched from memory.
A representative vector processing configuration is disclosed in U.S. Pat. No. 4,791,555 issued Dec. 13, 1988 to Garcia et al. and entitled "Vector Processing Unit". In that disclosure, a separate vector processing functional unit is connected to a general purpose data processing system. The base system provides instruction handling, operand fetch and store capability, and exception handling for the vector unit. This configuration incorporates a dedicated vector register set and separate vector instruction set for operating on vector data. The vector functional unit includes an arithmetic pipeline for operating on vector elements that duplicates the capabilities of the scalar pipeline in the base general purpose system. The present invention, by contrast, utilizes the base scalar instruction set, scalar registers, and scalar execution apparatus for parallel vector element processing.
U.S. Pat. No. 4,745,547 issued May 17, 1988 to Buchholz et al. and entitled "Vector Processing" discloses a technique for processing vector operands in storage that are too large to fit in the vector registers. Vector operations are broken up into sectioning loops, where vector operands are processed in sections that will fit in the vector registers. This technique provides for interruptions between any pair of vector element operations, and exact recovery at the point of interruption. The technique requires an additional instruction within the inner vector processing loop, which imposes a performance overhead. Vector instructions accessing operands in storage automatically update vector operand address registers to step to the next vector section in storage. However, when the same operand occurs more than once in a vector loop, it is necessary either to reload the original operand address for each subsequent use of the operand, incurring a performance overhead, or to represent the same operand address in multiple addressing registers, thereby consuming more of the limited register resource available to the compiler. The full vector length is contained in a general purpose register, rather than a dedicated length register. As a consequence, the data fetch and store pipelines must be restarted for each iteration of the loop.
This invention incorporates large vector registers to cover the vector startup overheads. Since vector instructions specify a large number of operations, interruptability of vector instructions and exception handling at the elemental level are necessities.
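The sectioning loop described above for U.S. Pat. No. 4,745,547 can be sketched in software as follows. This is an illustrative model only, not the patented mechanism: the function name `sectioned_add`, the `section_size` parameter, and the use of Python lists to stand in for storage and vector registers are all assumptions made for the example.

```python
def sectioned_add(a, b, section_size):
    """Add two equal-length vectors one register-sized section at a time.

    Returns the result vector and the number of sections processed.
    section_size stands in for the (hypothetical) vector register length.
    """
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    if section_size <= 0:
        raise ValueError("section_size must be positive")

    result = [0] * len(a)
    start = 0            # models the operand address stepping through storage
    remaining = len(a)   # models the vector length held in a register
    sections = 0

    while remaining > 0:
        count = min(section_size, remaining)
        va = a[start:start + count]            # vector load of one section
        vb = b[start:start + count]            # vector load of one section
        vr = [x + y for x, y in zip(va, vb)]   # element-wise vector add
        result[start:start + count] = vr       # vector store of one section
        start += count                         # step to the next section
        remaining -= count                     # loop-control bookkeeping
        sections += 1

    return result, sections
```

In this sketch a 10-element operand with a section size of 4 is processed in three passes (4, 4, and 2 elements), and the loop-control bookkeeping on each pass corresponds to the per-iteration overhead noted above.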
Vector architectures commonly include instructions for loading and storing vectors that are not arranged according to a fixed stride in storage. Gather and scatter operations and other methods for handling sparse matrices are examples. Instructions such as these are necessary in architectures with large, dedicated vector registers to support the various types of programming constructs that arise in vector programs. Their inclusion is mandated by the separation of vector and scalar data in different register sets. In prior art vector architectures, data in vector registers are not easily accessible to scalar instructions. It may be necessary to store a large vector register to memory so that a scalar instruction can access a single element or to process the vector data using a scalar algorithm that is more economical than an equivalent vector instruction.
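Gather and scatter operations of the kind mentioned above can be illustrated with a short sketch. The function names and the list-based model of memory are assumptions for illustration; real vector architectures implement these as instructions operating on an index vector.

```python
def gather(memory, indices):
    """Load elements from arbitrary (non-strided) positions into a dense vector."""
    return [memory[i] for i in indices]


def scatter(memory, indices, values):
    """Store a dense vector back to arbitrary positions in memory, in place."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    for i, v in zip(indices, values):
        memory[i] = v


if __name__ == "__main__":
    memory = [0, 10, 20, 30, 40, 50]
    dense = gather(memory, [5, 0, 3])           # collect three scattered elements
    scatter(memory, [5, 0, 3], [x + 1 for x in dense])
    print(dense, memory)
```

The index list plays the role of the index vector: elements that are scattered in storage become logically adjacent in the vector register, which is what allows them to be processed by the vector pipeline.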
Superscalar techniques have been used to enable RISC computers to achieve performance equivalent to single-pipeline vector processors, without the additional hardware complexity and cost of dedicated vector registers, instruction sets and execution facilities.
Superscalar computers have multiple independent execution pipelines that can execute different scalar instructions concurrently. A limitation of these systems is the additional hardware required to fetch and decode multiple instructions simultaneously, and to enforce interlocks between in-progress instructions that contain data dependencies as more and more parallel instruction execution is attempted. The present invention incorporates vector techniques into the scalar processing architecture to obtain higher levels of parallel operation. The instruction fetch, decode, and element independence advantages of vector operations are obtained without using dedicated vector instruction sets, vector register sets or dedicated vector execution facilities.
U.S. Pat. No. 5,261,113 issued Nov. 9, 1993 to Jouppi and entitled "Apparatus and Method for Single Operand Register Array for Vector and Scalar Data Processing Operations" discloses a technique for using a shared register file to store vector operands as well as scalar operands. Data in the register file is directly accessible for both vector operations and scalar operations. The shared register file is fixed in size by the fields used to address the file, thereby limiting the size of vector operands that can be addressed. Multiple operations are pipelined through a single pipelined execution unit to achieve one result per cycle under control of a single vector instruction. A new instruction format is defined to cover the range of vector arithmetic operations, but memory load and store operations are performed with scalar instructions in a base general purpose processing system. The new instruction format supporting vector operations includes fields to identify each operand as vector or scalar, and to specify the vector length. This invention identifies a single pipeline configuration and does not facilitate multiple pipeline configurations. The use of arbitrary sequences of scalar registers to form vector registers complicates the dependency interlock logic in a multiple pipeline configuration. The lack of vector load and store instructions requires that parallel loads and stores be done using superscalar techniques.
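The idea of a shared register file, in which a vector operand is a run of consecutive scalar registers and each operand is flagged as vector or scalar, can be sketched as follows. This is a loose illustration of the concept only and does not reproduce the instruction format of U.S. Pat. No. 5,261,113: the function name `vector_add`, its parameters, and the 16-entry register file are hypothetical.

```python
def vector_add(regs, dst, src1, src2, length,
               src1_is_vector=True, src2_is_vector=True):
    """Add operands held in a shared register file, writing regs[dst:dst+length].

    A vector operand occupies `length` consecutive registers starting at its
    register number; a scalar operand is a single register reused for every
    element. Elements are processed in order, one per step, as in a single
    pipeline.
    """
    if dst + length > len(regs):
        raise IndexError("destination vector exceeds the register file")
    if src1_is_vector and src1 + length > len(regs):
        raise IndexError("first source vector exceeds the register file")
    if src2_is_vector and src2 + length > len(regs):
        raise IndexError("second source vector exceeds the register file")

    for i in range(length):
        x = regs[src1 + i] if src1_is_vector else regs[src1]
        y = regs[src2 + i] if src2_is_vector else regs[src2]
        regs[dst + i] = x + y


if __name__ == "__main__":
    regs = list(range(16))            # hypothetical 16-entry shared register file
    vector_add(regs, 8, 0, 4, 4)      # vector + vector
    vector_add(regs, 12, 0, 7, 4, src2_is_vector=False)  # vector + scalar
    print(regs)
```

Because any register can be read by an ordinary scalar operation after the vector operation completes, no store to memory is needed to access a single element. The bounds checks also show how the fixed size of the register file limits the vector length that can be addressed.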