A high-speed computer having vector processors (a "vector computer") requires fast access to data in memory. The largest and fastest of such computers are known as supercomputers. One method of speeding up a computer is "pipelining," wherein the computer's digital logic between an input and an output is divided into several serially connected successive stages. Data are fed into the computer's input stage before data previously input are completely processed through the computer's output stage. There are typically many intermediate stages between the input stage and the output stage. Each stage performs a little more of the overall function desired, adding to the functions performed by previous stages. Thus, multiple pieces of data are in various successive stages of processing at each successive stage between the input and output stages. Since each individual stage performs only a small part of the overall function, the system clock period can be shortened. Each successive clock propagates the data one stage further in the pipeline.
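The staged behavior described above can be sketched as a small software simulation. This is an illustrative model only, not any particular machine's logic; the stage functions and names are hypothetical:

```python
# Illustrative pipeline: on each clock, every stage passes its partial
# result to the next stage while a new datum enters the first stage,
# so several data items are in flight at once.
def run_pipeline(inputs, stages):
    n = len(stages)
    regs = [None] * n                    # pipeline registers after each stage
    outputs = []
    stream = list(inputs) + [None] * n   # extra clocks to drain the pipeline
    for datum in stream:
        out = regs[-1]                   # last stage emits a completed result
        if out is not None:
            outputs.append(out)
        for i in range(n - 1, 0, -1):    # shift partial results forward
            regs[i] = stages[i](regs[i - 1]) if regs[i - 1] is not None else None
        regs[0] = stages[0](datum) if datum is not None else None
    return outputs

# hypothetical stage functions, together computing (x * 2) + 1 in small steps
stages = [lambda x: x, lambda x: x * 2, lambda x: x + 1]
print(run_pipeline([1, 2, 3, 4], stages))  # [3, 5, 7, 9]
```

Note that once the pipeline is full, one result emerges per clock even though each datum spends three clocks in flight, which is the throughput advantage the text describes.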
As a result of pipelining, the system clock can operate at a faster rate than the system clocks of non-pipelined systems. In some computer designs of today, the system clock cycles as fast as once every two nanoseconds ("ns"), allowing up to 500 million operations per second through a single functional unit. Parallel functional units within each processor, and parallel processors within a single system, allow even greater throughput. Achieving high-performance throughputs is only possible, however, if data are fed into each pipeline at close to the system clock rate.
Another way of increasing performance on supercomputers is by going to systems of multiprocessors, where multiple central processing units (CPUs) are coupled together. Some multiprocessor systems share one or more common memory subsystems among the CPUs. Some systems have several CPUs, each of which has several independent vector processors that can execute more than one vector operation simultaneously.
As processor speeds have increased, the size of memory in a typical computer system has also increased drastically, since more powerful processors can handle larger programs and larger quantities of data. In addition, error-correction circuitry is now placed in the memory path to increase reliability. Memory-access speeds have improved over time, but the increased size of memory and the complexity of error-correction circuitry have meant that memory-access time has remained approximately constant. For example, a typical supercomputer system clock rate may have improved from roughly 8 ns to 4 ns to 2 ns over three generations. Over the same time period, memory-access times may have remained at approximately 96 ns. These times mean that the 8-ns processor accesses memory in 12 clocks, the 4-ns processor in 24 clocks, and the 2-ns processor in 48 clocks. A computer which randomly accessed data throughout memory would see almost no speed improvement from the faster system clock rate.
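The clock counts quoted above follow directly from dividing the roughly constant memory-access time by each generation's clock period, as this short sketch of the arithmetic shows:

```python
# Memory-access latency expressed in processor clocks for the example
# generations cited in the text (96-ns access time held constant).
MEMORY_ACCESS_NS = 96
for clock_ns in (8, 4, 2):
    clocks = MEMORY_ACCESS_NS // clock_ns
    print(f"{clock_ns}-ns clock: memory access takes {clocks} clocks")
# 8-ns clock: memory access takes 12 clocks
# 4-ns clock: memory access takes 24 clocks
# 2-ns clock: memory access takes 48 clocks
```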
One solution has been to organize data into vectors, where each element (or datum) of a vector has similar operations performed on it. Computer designers schedule various portions of the memory to simultaneously fetch various elements of a vector, and these fetched elements are fed into one or more parallel pipelines on successive clock cycles. Examples of such designs are described in U.S. Pat. No. 4,128,880 issued Dec. 5, 1978 to Cray (the '880 patent), U.S. Pat. No. 4,661,900 issued Apr. 28, 1987 to Chen et al., and U.S. Pat. No. 5,349,667 issued Sep. 20, 1994 to Cray et al., each of which is assigned to Cray Research Inc., the assignee of the present invention, and each of which is hereby incorporated by reference.
For example, vector elements are loaded through pipelines into vector registers from successive element locations in the vector image in memory. A single CPU may include several vector processors which can operate in parallel. Overlapped with pipelined vector loads from memory, there might be other pipelines taking data from two other vector registers to feed a vector processor, with the resultant vector fed through a pipeline into a third vector register. Overlapped with these operations, there might be still other pipelines taking data from two further vector registers to feed another vector processor, with the resultant vector fed through a pipeline into yet another vector register. In a well-tuned system of this design, using 2-ns pipeline clocks, the throughput can approach 500 million operations per second for each vector functional unit within a processor; parallel functional units within a vector processor, and parallel vector processors within a multiprocessor system, provide enhanced overall performance, even with relatively slow memory-access times.
In the system described in the '880 patent to Cray, a single counter associated with each vector register was used to address elements in that vector register for any one vector operation. Vector operations began with element number zero (the first element of a vector register) and proceeded until the number of elements specified by a vector-length register had been processed. In the process called "chaining," when a succeeding (or second) vector operation needed to use as an operand the result from a preceding (or first) vector operation, the second operation started execution (or "issued") as soon as the result from the first vector operation arrived at the vector register. The second instruction was therefore "chained" to the first instruction. In systems constructed according to the '880 patent, result elements from the first vector operation executing in a first functional unit were passed on to a second functional unit simultaneously with being stored into the result vector register. Since there could be only one operation (either a read or a write) to the vector register occurring per clock cycle in such a system, chaining could only occur if the result write operation for the first instruction went to the vector register, and simultaneously that element value was passed to the second functional unit as if it were read from the same register. There was thus a single clock period during which the second instruction could issue (start execution) and be chained to the first instruction. This single clock period was termed the "chain slot time," and it occurred only once for each vector instruction. If the succeeding instruction could not issue precisely at the chain slot time because of a prior functional-unit or operand-register reservation, then the succeeding instruction had to wait until all element results of the previous operation had been stored in the vector-result register and that register's reservation was released.
In addition, one succeeding element had to be accepted by the second functional unit every clock, since that was how the elements were made available by the first functional unit.
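The all-or-nothing chain-slot rule described for the '880 design can be sketched as a simple issue-time calculation. This is an illustrative model under stated assumptions (clock values are hypothetical), not the patent's actual control logic:

```python
# Sketch of an '880-style chain-slot rule: a dependent vector
# instruction may chain only on the exact clock at which the first
# result element is written; if it is not ready at that one clock,
# it must wait until the whole result register's reservation is
# released (all elements stored).
def issue_clock(chain_slot, ready_clock, reservation_release):
    if ready_clock == chain_slot:
        return chain_slot                             # chained: issues at the slot
    return max(ready_clock, reservation_release)      # missed the slot: must wait

# dependent instruction ready exactly at the chain slot: it chains
print(issue_clock(chain_slot=5, ready_clock=5, reservation_release=70))   # 5
# ready one clock late: waits until the full result vector is stored
print(issue_clock(chain_slot=5, ready_clock=6, reservation_release=70))   # 70
```

The sketch makes the cost of missing the slot concrete: being one clock late can delay issue by the full vector length.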
In the system described in the U.S. Pat. No. 4,661,900 to Chen et al. (the '900 patent), the chaining of the write of the first element from the first vector operation to the read of the first element of a subsequent vector operation was decoupled by providing two separate counters (one for reads and one for writes) associated with each vector register. The counters were used to address elements for read operations and elements for write operations, respectively, in that vector register. In the process called "flexible chaining," successive operations were no longer constrained to start exactly at the chain slot time, but could be issued at any time after the first result element was written to a result vector register which was designated as an operand register for the successive operation. Again every vector operation would begin with element number zero (the first element of a vector register) and proceed until the number of elements specified by a vector-length register had been processed. The array of vector elements was divided into two arrays (even elements and odd elements). A read operation for an even-numbered element would go to the even array in the same clock period as a write operation for an odd-numbered element would go to the odd array. In the following clock period, the next read operation, now for an odd-numbered element, would go to the odd array, and the next write operation, now for an even-numbered element, would go to the even array. In this manner, two operations could be scheduled for each vector register every clock period.
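The even/odd two-counter scheduling described for the '900 design can be sketched as follows. This is an illustrative model only: array selection is simply element-number parity, and the counter offsets are hypothetical:

```python
# Sketch of '900-style flexible-chaining addressing: independent read
# and write counters per vector register, with even-numbered elements
# held in one array and odd-numbered elements in the other, so that
# one read and one write can be scheduled every clock without
# conflicting on the same array.
def schedule(read_start, write_start, n_clocks):
    ops = []
    for clock in range(n_clocks):
        r = read_start + clock                      # read-counter value this clock
        w = write_start + clock                     # write-counter value this clock
        ops.append((clock,
                    ("even" if r % 2 == 0 else "odd", r),
                    ("even" if w % 2 == 0 else "odd", w)))
    return ops

# With the read counter one element behind the write counter, the read
# and the write land in opposite arrays on every clock.
for clock, read_op, write_op in schedule(read_start=0, write_start=1, n_clocks=4):
    print(clock, "read", read_op, "write", write_op)
```

Because both counters advance by one each clock, counters of opposite parity stay in opposite arrays for the whole vector, which is what allows two register operations per clock period.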
In very-high-speed vector processors, such as the Cray Y-MP C90 manufactured by Cray Research Inc., the assignee of the present invention, a computer system includes a number of central processing units ("CPUs"), each of which may have more than one vector processor. In addition, the computer system includes a number of common memories which store the programs and data used by the CPUs. Vector data are often streamed or pipelined into a CPU, and so delays due to long access times can be compensated for by processing many elements on successive cycles as the result of a single request.
One method of enhancing the performance of vector data streaming through a vector processor is to monitor or track which elements of a particular vector register are available or valid, and to stream elements, as they become available, into an arithmetic/logical functional unit (ALFU) for processing by an arithmetic or logical vector operation. Referring to prior-art FIGS. 4 and 5, several successive vector operations may be performed by "flexibly chaining" operations, so that elements in a vector register, for example, are chained into a second operation as an "operand" as soon as they become available (or any time thereafter) in that vector register as a "result" from a first operation. One such system is described in the '900 patent mentioned above. Such an approach is of limited value in situations where data read into a vector register may arrive out-of-sequence (e.g., due to variable latency times in a memory).
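The element-validity tracking mentioned above can be sketched as a valid-bit scoreboard: elements may become valid out of order, but are streamed to the functional unit strictly in element order, each as soon as it is valid. This is an illustrative model under stated assumptions (function and variable names are hypothetical):

```python
# Sketch of streaming in element order from a vector register whose
# elements may arrive (become valid) out of sequence, e.g. due to
# variable memory latency. Each element is forwarded to the functional
# unit as soon as it is the next one needed and its valid bit is set.
def stream_order(arrival_order, length):
    valid = [False] * length
    next_elem = 0                    # next element the functional unit needs
    consumed = []
    for elem in arrival_order:       # elements written out of sequence
        valid[elem] = True
        while next_elem < length and valid[next_elem]:
            consumed.append(next_elem)
            next_elem += 1
    return consumed

# element 0 arrives last, so nothing streams until it becomes valid,
# then all four elements stream out in order
print(stream_order([2, 1, 3, 0], 4))  # [0, 1, 2, 3]
```

The sketch also shows the limitation noted in the text: a design that assumes in-order arrival (a single advancing write counter) cannot exploit elements 2, 1, and 3 being ready early, whereas per-element valid bits can.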
What is needed is an improved vector chaining system for a vector computer system which compensates for variable latency times in a memory, and a method for increasing the performance of processing vector data into, through, and out of a vector processor.