This invention relates to an arithmetic and logical operation unit for performing operations on a series of vector data in a pipelined manner.
In line with current progress in science and technology, there is an urgent demand for large-scale, high-speed technical computation. In particular, achieving fast vector operations is a significant problem. Computers dedicated to vector operations (hereinafter referred to as vector processors), such as the Cray 1 and Cyber 205, have been commercialized; they achieve high performance by performing vector operations in a pipelined fashion.
FIG. 1 shows, in schematic block form, a vector processor which comprises a main storage 1, a storage control 2, an instruction control 3, and m arithmetic and logical units (ALUs) 4-1, 4-2, . . . and 4-m. The instruction control 3 reads instructions from the main storage 1 via the storage control 2, reads operands from the main storage 1 on the basis of decoded information of the read-out instructions, and controls the selection of one or more of the ALUs 4-1 to 4-m for effecting the operation designated by the read-out instructions, the supply of the read-out operands to the selected ALUs and the writing of operation results into the main storage 1. Where a vector operation is designated by the instructions, a pipelined procedure is carried out, as outlined below for the ALU 4-1 with reference to FIG. 2. In FIG. 2, the ALU 4-1 includes stage ALUs 4-1-1, 4-1-2, . . . and 4-1-8 for respectively performing operation stages 1 to 8. In general, an ALU carries out an operation in a pipelined manner by dividing the operation into a plurality of operation stages, with one stage ALU provided for each operation stage. In the example of FIG. 2, the ALU 4-1 has eight such stage ALUs.
In operation, operands $X_i$ and $Y_i$ ($i = 0$ to $n-1$) are read out of the main storage 1 via the storage control 2 and sequentially applied to the ALU 4-1 at a fixed rate called the pipeline pitch. The applied operands are subjected to operations at the stage ALUs 4-1-1 to 4-1-8, and operation results $Z_i$ ($i = 0$ to $n-1$) are written sequentially into the main storage 1 via the storage control 2. The pipeline pitch referred to herein corresponds to the processing time for one operation stage. If the processing of each of the operation stages is completed within one cycle, the pipeline pitch is one cycle.
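The timing described above can be illustrated with a minimal sketch. It assumes an ideal pipeline in which every stage takes exactly one cycle (a pipeline pitch of one cycle) and the elements are independent of one another; the function names `pipelined_schedule` and `total_cycles` are hypothetical and serve only this illustration.

```python
def pipelined_schedule(n_elements, n_stages=8):
    """Map (element, stage) -> cycle for an ideal pipeline with a one-cycle pitch.

    Assumption: element i enters the first stage at cycle i and advances
    one stage per cycle, as with the eight stage ALUs of FIG. 2.
    """
    schedule = {}
    for i in range(n_elements):
        for s in range(n_stages):
            schedule[(i, s)] = i + s
    return schedule


def total_cycles(n_elements, n_stages=8):
    """Number of cycles until the last element leaves the last stage."""
    if n_elements == 0:
        return 0
    schedule = pipelined_schedule(n_elements, n_stages)
    return max(schedule.values()) + 1


if __name__ == "__main__":
    # 100 independent elements through 8 stages: 100 + 8 - 1 cycles.
    print(total_cycles(100))  # 107
```

Under these assumptions, once the pipeline is full, one result emerges per cycle, so the total time grows as n + 7 rather than 8n.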
In an operation wherein the processing of an i-th data element is independent of the operation results of the 1st to (i-1)-th data elements, when the processing of the i-th element is carried out, for example, at the stage ALU 4-1-2, the processing of an (i+1)-th element can simultaneously be carried out at the stage ALU 4-1-1. Thus, the pipelined procedure advantageously permits continuous processing, thereby providing high performance. In contrast with this procedure, assume that the following operations are to be carried out:

$$Z_{i+1} \leftarrow X_i + Y_i \times Z_i \qquad (1)$$

$$Z_{i+1} \leftarrow X_i + Z_i \qquad (2)$$

$$Z_{i+1} \leftarrow X_i \times Z_i \qquad (3)$$

$$S \leftarrow \sum X_i + S \qquad (4)$$

where $X_i$, $Y_i$, $Z_i$ and $Z_{i+1}$ each represent a vector data element and $S$ represents scalar data, and where, in equation (4), the total is obtained by summing the elements of the vector data sequentially, in order from the lowest to the highest element number.
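The data dependence in operations (1) to (4) can be made explicit with a short sketch. The function names and the use of Python lists are assumptions made purely for illustration; the point is that each loop iteration reads the result produced by the preceding iteration.

```python
def recurrence_mul_add(x, y, z0):
    """Equation (1): Z[i+1] <- X[i] + Y[i] * Z[i]."""
    z = [z0]
    for xi, yi in zip(x, y):
        z.append(xi + yi * z[-1])  # needs the previous result z[-1]
    return z


def recurrence_add(x, z0):
    """Equation (2): Z[i+1] <- X[i] + Z[i]."""
    z = [z0]
    for xi in x:
        z.append(xi + z[-1])
    return z


def recurrence_mul(x, z0):
    """Equation (3): Z[i+1] <- X[i] * Z[i]."""
    z = [z0]
    for xi in x:
        z.append(xi * z[-1])
    return z


def running_sum(x, s=0):
    """Equation (4): S <- sum of X[i] + S, accumulated from low to high index."""
    for xi in x:
        s = xi + s
    return s


if __name__ == "__main__":
    print(recurrence_mul_add([1, 2, 3], [2, 2, 2], 1))  # [1, 3, 8, 19]
    print(recurrence_add([1, 2, 3], 0))                 # [0, 1, 3, 6]
    print(recurrence_mul([2, 3, 4], 1))                 # [1, 2, 6, 24]
    print(running_sum([1, 2, 3], 10))                   # 16
```

In every case the (i+1)-th result cannot be computed until the i-th result is available, which is what prevents successive elements from simply being overlapped in the pipeline.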
Conventionally, in such operations, the processing of the i-th element is started only after the operation on the (i-1)-th element has been completed. This raises the problem that the advantages of the pipelined procedure cannot be fully exploited and the operation speed is reduced.
For example, the operation of equation (1) has conventionally been carried out by an ALU as shown in FIG. 3, which comprises a multiplier 5 including eight stage ALUs 5-1, 5-2, . . . and 5-8, a selector 5-9, and an adder 6 including stage adders 6-1 and 6-2. It should be appreciated that the number of stage ALUs, eight in this example, and the number of stage adders, two in this example, may be changed depending on the nature of the operation. The sequence of the operation procedure of equation (1) is as follows.
(1) Operation of 0-th Elements
A data element $Y_0$ is fed via a line 8 to the multiplier 5, a data element $Z_0$ is fed via a line 7 and the selector 5-9 to the multiplier 5, and a product $Y_0 \times Z_0$ obtained from the eight stage ALUs 5-1 to 5-8 is fed to the adder 6. In synchronism with the application of the product to the adder 6, a data element $X_0$ is also fed to the adder 6 via the line 7. These two inputs to the adder 6 are added at the two stage adders 6-1 and 6-2, and the output of the adder, which is equal to a sum $Z_1$, is delivered to the storage control 2.
(2) Operation of the Remaining Elements
When the element $Z_1$ is delivered to the storage control 2, an element $Y_1$ is fed via the line 8 to the multiplier 5 and, at the same time, the element $Z_1$ is fed to the multiplier 5 via the line 7 and the selector 5-9. A product $Y_1 \times Z_1$ obtained from the eight stage ALUs 5-1 to 5-8 is fed to the adder 6. In synchronism with the application of the product $Y_1 \times Z_1$ to the adder 6, the element $X_1$ is also fed to the adder 6 via the line 7, and these two inputs are added at the two stage adders 6-1 and 6-2 to deliver a sum $Z_2$ to the storage control 2. For elements $Z_2$ to $Z_{n-1}$, a similar operation procedure is repeated.
Such a prior art process requires ten stages of multiplier and adder operations (eight multiplier stages followed by two adder stages) to obtain the operation result for each element, so that the advantages of the pipelined procedure are lost and the operation speed is lowered.
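The cost of this serialization can be estimated with a rough sketch. It assumes that every stage takes one cycle and that the next element enters the multiplier only when the previous sum has been delivered, as in the procedure described for FIG. 3. The overlapped figure is the ideal case for independent elements and is given only for comparison. The constant and function names are hypothetical.

```python
MULT_STAGES = 8  # stage ALUs 5-1 to 5-8 of the multiplier 5
ADD_STAGES = 2   # stage adders 6-1 and 6-2 of the adder 6


def serialized_cycles(n_elements):
    """Cycles when each element waits for the previous result (assumed model).

    Every element passes through all ten stages before the next one starts.
    """
    return n_elements * (MULT_STAGES + ADD_STAGES)


def overlapped_cycles(n_elements):
    """Cycles for an ideal ten-stage pipeline fed with independent elements."""
    if n_elements == 0:
        return 0
    return n_elements + MULT_STAGES + ADD_STAGES - 1


if __name__ == "__main__":
    n = 100
    print(serialized_cycles(n))  # 1000
    print(overlapped_cycles(n))  # 109
```

Under these assumptions, the serialized procedure takes roughly ten times as many cycles as a fully overlapped pipeline for long vectors.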