1. Field of the Invention
The present invention relates to digital computers with scalar and vector processing capabilities. More particularly, the invention relates to a digital computer that includes a scalar central processing unit and a vector processor and that transmits vector instructions from the central processing unit to the vector processor through a data path used by the scalar central processing unit for accessing a cache memory.
2. Description of the Related Art
Vector processing is a widely used means of enhancing the performance of computer applications in which the elements of an array can be computed in parallel. Vector processors may be "attached" to the I/O bus of a scalar central processing unit (CPU), or tightly coupled (i.e., "integrated") with the scalar CPU. Attached processors can produce good performance increases for applications that require minimal interaction with the scalar CPU. Most applications, however, require a significant amount of interaction with the scalar CPU, and the overhead of communicating with an attached processor limits the advantages of this type of vector unit.
Integrated vector processors can be classified as either memory-to-memory or register-to-register. In memory-to-memory vector processors, the operands are fetched from memory into one or more vector function units of the vector processor, and the result computed in the vector function unit is returned directly to memory. While this type of vector architecture may work well for applications that use very long vectors, the startup overhead is too costly for most applications to produce the desired increase in performance.
Register-to-register vector architectures work by first loading vector data into high-speed vector registers. Vector operate instructions then specify the vector registers to be operated upon by the vector function units, and the result of each vector function unit is returned to a vector register. Vector store instructions are issued to move the results back to memory. Register-to-register vector architectures have less startup overhead than memory-to-memory architectures, but only a small segment of a long vector can be stored in a vector register, and vector operations between long vectors require multiple load, operate, and store cycles upon segments of the long vectors. Long vector applications are optimized by loading a next segment of a long vector while a previously loaded segment is being operated upon.
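The segmented load, operate, and store cycles described above can be sketched as follows. This is a hypothetical software illustration, not an implementation from the specification; it assumes a 64-element vector register length and models each phase of one segment's processing explicitly.

```python
VECTOR_LENGTH = 64  # elements per vector register (assumed for illustration)

def vector_add_long(a, b):
    """Add two long vectors by load/operate/store cycles on
    register-length segments, as a register-to-register machine would."""
    result = [0] * len(a)
    for base in range(0, len(a), VECTOR_LENGTH):
        seg = min(VECTOR_LENGTH, len(a) - base)
        # vector load: fill vector registers V0 and V1 from memory
        v0 = a[base:base + seg]
        v1 = b[base:base + seg]
        # vector operate: V2 <- V0 + V1 in a vector function unit
        v2 = [x + y for x, y in zip(v0, v1)]
        # vector store: move the result segment back to memory
        result[base:base + seg] = v2
    return result
```

In an optimized machine, the load of the next segment would overlap the operate phase of the current segment rather than run sequentially as shown here.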
Thus, for register-to-register vector architectures, the vector registers serve as a software-controlled first-level cache to the vector function units, and the bandwidth to and from the vector registers is a key factor in system performance.
A specific implementation of a register-to-register vector processor is usually partitioned into: (1) a load/store unit, (2) a vector register file, and (3) one or more vector function units, arranged either in a single arithmetic pipeline or in multiple pipelines for different operations (add, multiply, divide).
In register-to-register implementations, the vector processor typically contains the vector registers and vector function units, and is responsive to commands for loading the vector registers with data from a data bus, controlling the vector function units to operate upon the data in the vector registers, and transmitting data from the vector registers onto the data bus. The vector processor, for example, is comprised of two VLSI chips called vector processing elements (VPEs). Each VPE is partitioned into two sections, and each section contains one-quarter of the vector registers and an arithmetic pipeline. Section #1 of VPE #1 contains elements 0, 4, 8, 12 . . . 60 of each of 16 vector registers. Section #2 of VPE #1 contains elements 1, 5, 9, 13 . . . 61. Section #1 of VPE #2 contains elements 2, 6, 10, 14 . . . 62. Section #2 of VPE #2 contains elements 3, 7, 11, 15 . . . 63. When vector operate instructions are executed by the VPEs, the four pipelines operate in parallel, and thus the VPEs can complete four operations per cycle. The VPEs contain a two-deep instruction queue, allowing the scalar CPU to transfer the next operate instruction to the VPEs while the previous operate instruction is being executed in the VPE pipelines. Upon completion of an operate instruction, the next vector operate instruction enters the VPE arithmetic pipelines without any bubbles if its instruction code is valid in the VPE instruction queue. The vector register file in each VPE has five ports: two for source operands to the arithmetic pipeline, one for the pipeline destination, a load port, and a store port. Vector loads and stores are processed to and from the vector register file in parallel with the execution of vector operate instructions in the arithmetic pipelines.
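The four-way interleaving of vector-register elements across the two VPEs can be captured by a small mapping function. This is an illustrative sketch derived from the element assignments listed above; the function name and the 1-based VPE/section numbering are hypothetical conveniences, not part of the described hardware.

```python
def vpe_location(element):
    """Return the (VPE, section) pair, numbered from 1, that holds a given
    element of a vector register, per the modulo-4 interleave described
    above: lanes 0,1 reside in VPE #1 and lanes 2,3 in VPE #2."""
    lane = element % 4          # four pipelines operate on interleaved lanes
    return (lane // 2 + 1, lane % 2 + 1)
```

Because consecutive elements fall in different pipelines, a stride-1 operation on a 64-element register keeps all four pipelines busy, which is what allows the pair of VPEs to complete four results per cycle.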
For many "integrated" vector processors, the load/store unit is separate from the scalar CPU. When a vector load or store instruction is decoded by the scalar CPU, the instruction is sent to the load/store unit with the appropriate operand information (base address, stride, and source/destination vector register). This separation of the vector load/store functionality, however, results in a high cost for the logic required to support adding vector instructions to a processor. The separate load/store unit requires logic to generate addresses, a memory management unit for vector references, and a memory control unit to transfer the vector references to/from the vector register file. A performance issue also exists with the latency incurred in sending the base address and stride to the remote load/store unit. Another important consideration is the synchronization between the present load/store instruction and subsequent instructions. For systems that require virtual address translation, if the processor issues instructions beyond a load/store instruction that has been sent to the remote load/store unit before determining that all of the addresses in the load/store can be translated without taking a memory management exception, the recovery protocol is extremely complicated. Thus, implementations with a separate load/store unit must choose between the additional latency of waiting for load/store synchronization and complex recovery mechanisms.
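The synchronization hazard above comes down to a strided reference touching multiple virtual pages, each of which must translate without fault before the scalar CPU can safely issue past the vector load/store. A minimal sketch of the pages such a reference touches, assuming byte addresses and the 512-byte VAX page size (the function names are hypothetical):

```python
PAGE_SIZE = 512  # bytes per virtual page (the VAX page size)

def pages_referenced(base, stride, length):
    """Distinct virtual pages touched by a strided vector load/store whose
    element i has byte address base + i * stride.  Every page in this set
    must be translatable without a memory-management exception before
    issue can proceed beyond the instruction."""
    return sorted({(base + i * stride) // PAGE_SIZE for i in range(length)})
```

A unit-stride reference of 64 eight-byte elements may fall entirely within one page, whereas a page-sized stride touches one page per element, which is why pre-validating all translations (or building a complex recovery mechanism) is unavoidable.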
If the load/store unit has a common connection to memory, the cache subsystem is common to the scalar and vector memory controllers. A cache shared by a scalar unit and a vector load/store unit requires a complicated sharing protocol. An alternative is for the load/store unit to access memory via a different path than the scalar CPU, with a memory hierarchy containing a cache for data items that are either read or written by the vector load and vector store instructions. Having separate scalar and vector caches works well for applications where minimal interaction is required between the scalar and vector segments of a program, but can drastically degrade the performance of programs where scalar instructions require access to data in the vector cache and vector load/store instructions need data from the scalar cache.
A specific example of a known system using a scalar CPU and a vector processor is described in Richard A. Brunner et al., "Vector Processing on the VAX 9000 System," Digital Technical Journal, Vol. 2, No. 4, Fall 1990, pp. 61-79; and Fossum et al. U.S. Pat. No. 4,888,679, issued Dec. 19, 1989, entitled "Method and Apparatus Using a Cache and Main Memory for Both Vector Processing and Scalar Processing by Prefetching Cache Blocks Including Vector Data Elements", incorporated herein by reference.