(1) Field of the Invention
The present invention relates to a vector data processing apparatus which carries out a vector calculation.
Pipelining is used to increase the speed of data processing in computer systems. Pipelining is particularly effective in vector data processing, e.g., a vector data calculation A(i)+B(i)=C(i), (i=1 to n).
In particular, in scientific or technological calculations, loop calculations are frequently carried out. The loop calculations can be transformed into vector calculations.
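For illustration only, the transformation of a loop calculation into a vector calculation can be sketched in Python as follows. The function names `add_loop` and `add_vector` are hypothetical and are not part of the described apparatus; the sketch merely shows that both forms compute A(i)+B(i)=C(i) for every i.

```python
# Illustrative sketch only; names are hypothetical, not taken from the specification.

def add_loop(a, b):
    """Loop form: one element is processed per iteration."""
    c = [0] * len(a)
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c


def add_vector(a, b):
    """Vector form: the same calculation stated as one element-wise operation."""
    return [x + y for x, y in zip(a, b)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
print(add_loop(a, b))
print(add_vector(a, b))
```

Because the iterations are independent of one another, the element-wise form is the kind of calculation that a pipeline can process one element per cycle.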
In a high-speed data processing system, such as a supercomputer, a vector data processing apparatus is provided to carry out vector data processing, in addition to a scalar data processing apparatus which is provided to carry out scalar data processing.
(2) Description of the Related Art
FIG. 1 shows an outline of the construction of an example of data processing apparatus comprising a vector data processing apparatus.
In FIG. 1, reference numeral 61 denotes a main storage unit (MSU), 62 denotes a memory control unit (MCU), 63 denotes a scalar unit (SU), 64 denotes a vector unit (VU), 65 denotes a vector execution unit (VEU), 66 denotes a vector control unit (VCU), 650 and 651 each denote a load/store pipeline, 652 denotes a set of vector registers (VR), 653 denotes an adder pipeline, 654 denotes a multiplier pipeline, 655 denotes a divider pipeline, 656 denotes an adder-multiplier pipeline, 657 denotes an adder-divider pipeline, 658 denotes a set of mask registers (MR), 659 denotes a logical sum pipeline, 660 denotes a logical multiplication pipeline, and 661 denotes control lines.
The scalar unit (SU) 63 reads out an instruction from the main storage unit (MSU) 61 under the control of the memory control unit (MCU) 62.
If the instruction is a scalar instruction, i.e., other than a vector instruction, the scalar unit (SU) 63 carries out the scalar instruction. If the instruction is a vector instruction, the scalar unit (SU) 63 sends the vector instruction to the vector control unit (VCU) 66 in the vector unit (VU) 64.
The vector instruction is an instruction which instructs an execution of a vector data processing operation, for example, an addition of two vector data, a multiplication of two vector data, and the like.
When the vector control unit (VCU) 66 receives a vector instruction, the vector control unit (VCU) 66 makes the vector execution unit (VEU) 65 execute the received vector instruction. The vector control unit (VCU) 66 controls all operations carried out in the vector execution unit (VEU) 65.
The load/store pipelines 650 and 651 are memory access pipelines through each of which a data transfer operation is carried out between the main storage unit (MSU) 61 and banks (explained later) of the set of vector registers (VR) 652. When loading data, the data is read out from the main storage unit (MSU) 61 under the control of the memory control unit (MCU) 62 through the load/store pipeline 650 or 651, and is then written into the vector registers (VR) 652.
In addition, when storing data from the vector registers (VR) 652 into the main storage unit (MSU) 61, data held in an address (explained later) of the vector registers (VR) 652 is read out, and is then transferred through one of the load/store pipelines 650 and 651 to the main storage unit (MSU) 61 under the control of the memory control unit (MCU) 62.
Although the detailed construction is not shown, the main function of the load/store pipelines 650 or 651 is to align a plurality of elements of vector data which have been read out from the main storage unit (MSU) 61, to write the elements in the corresponding banks of the vector registers (VR) 652, or to transfer aligned data which have been read out from the banks of the vector registers (VR) 652 to the main storage unit (MSU) 61.
The adder pipeline 653, the multiplier pipeline 654, the divider pipeline 655, the adder-multiplier pipeline 656, and the adder-divider pipeline 657 are calculation pipelines, each of which reads out data from one or more banks of the vector registers (VR) 652, carries out a corresponding calculation using the data (for example, the adder pipeline 653 carries out an addition), and writes a result of the calculation into one bank of the vector registers (VR) 652.
The above corresponding calculation of the adder-multiplier pipeline 656 is a composite calculation of an addition and a multiplication, and the above corresponding calculation of the adder-divider pipeline 657 is a composite calculation of an addition and a division.
Similarly, mask data, which is used for a masking operation of vector data, is transferred between the main storage unit (MSU) 61 and the set of mask registers (MR) 658 through the load/store pipeline 650 or 651.
The logical sum pipeline 659 and the logical multiplication pipeline 660 are logical calculation pipelines for mask data, each of which reads out mask data from one or more banks (explained later) of the set of mask registers (MR) 658, carries out a logical calculation using the mask data (for example, the logical sum pipeline 659 carries out a logical sum operation), and writes a result of the logical calculation into one bank of the set of mask registers (MR) 658.
FIG. 2 shows a detailed construction of the set of vector registers (VR) 652 and the set of mask registers (MR) 658.
The set of vector registers shown in FIG. 2 consists of m (m is an integer) registers, each of which corresponds to an address, and each register is divided into eight banks, B0, B1, ..., B7. In each bank of each register, an element of vector data is held, and each bank is simultaneously and independently accessible by each of the calculation pipelines and the memory access pipelines 650, 651, 653 to 657, 659, and 660.
Similarly, the set of mask registers shown in FIG. 2 consists of m registers, each of which corresponds to an address, and each register is divided into eight banks, B0, B1, ..., B7. In each bank of each register, an element (one bit) of mask data is held, and each bank is simultaneously and independently accessible by each of the logical calculation pipelines 659 and 660, and the memory access pipelines 650 and 651.
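As a rough illustration, the register organization described above (m addresses, each divided into eight banks holding one element each) can be modeled in Python as a two-dimensional array. The names `make_register_set` and `locate` are hypothetical. The placement of element i at address i // 8 and bank i % 8 is an assumption made for this sketch; the text above does not state how the elements of a vector are distributed over the banks.

```python
# Illustrative sketch only; names and the element layout are assumptions.
NUM_BANKS = 8  # banks B0..B7, as described for FIG. 2


def make_register_set(m, fill=0):
    """Model m addresses, each split into NUM_BANKS banks of one element."""
    return [[fill] * NUM_BANKS for _ in range(m)]


def locate(element_index, base_address=0):
    """Assumed layout: element i is held at (base + i // 8, bank i % 8)."""
    address = base_address + element_index // NUM_BANKS
    bank = element_index % NUM_BANKS
    return address, bank


vr = make_register_set(m=4)            # vector registers: one element per bank
mr = make_register_set(m=4, fill=0)    # mask registers: one bit per bank

# Place a 12-element vector (values 100..111) into the modeled vector registers.
for i, value in enumerate(range(100, 112)):
    address, bank = locate(i)
    vr[address][bank] = value
```

Under this assumed layout, consecutive elements fall into consecutive banks, so a pipeline that advances by one bank per cycle touches one new element per cycle.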
Each pipeline can access a bank of the set of vector registers (VR) 652 during an assigned time slot, which is called a bank slot.
As shown in FIG. 3, eight bank slots, K, E3, E2, E1, L, F3, F2, and F1, are defined corresponding to the eight banks, B0, B1, ..., B7, of the set of vector registers (VR) 652. Each of the bank slots K and L is assigned for a memory access pipeline, i.e., the load/store pipeline 650 or 651. The three bank slots E3, E2, and E1 are assigned for a simple (non-composite) calculation pipeline such as the adder pipeline 653, the multiplier pipeline 654, or the divider pipeline 655, and the other three bank slots F3, F2, and F1 are assigned for another simple calculation pipeline.
The pipeline for which the bank slot is assigned can cyclically access each bank of the set of vector registers (VR) 652, as shown in FIG. 4.
Namely, for example, when the bank slot K is assigned for the load/store pipeline 650, the load/store pipeline 650 can access the bank B0 of the set of vector registers (VR) 652 at the timing 0, the bank B1 at the timing 1, ..., the bank B7 at the timing 7, and the bank B0 again at the timing 8, and so on.
When the three bank slots E3, E2, and E1 are assigned for a non-composite calculation pipeline, for example, the adder pipeline 653, the adder pipeline 653 can access the banks B1, B2, and B3 at the timing 0, the banks B2, B3, and B4 at the timing 1, ..., the banks B0, B1, and B2 at the timing 7, and the banks B1, B2, and B3 again at the timing 8, and so on.
In the above case, the non-composite calculation pipeline, such as the adder pipeline 653, can read data used for its own calculation from each bank of the set of vector registers (VR) 652 through the bank slots E3 and E2, and write the result of the calculation in each bank of the set of vector registers (VR) 652 through the bank slot E1. The addresses of the registers during the above accesses are controlled by the vector control unit (VCU) 66.
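The cyclic access described above can be sketched in Python as a modulo-8 rule. The function name `bank_for` is hypothetical. The offsets of the slots K, E3, E2, and E1 follow from the timings given above; the offsets of L, F3, F2, and F1 are assumed to continue in the order in which the slots are listed for FIG. 3.

```python
# Illustrative sketch only; bank_for is a hypothetical helper.
NUM_BANKS = 8
# Slot order as listed for FIG. 3; the list position is used as the bank offset.
SLOT_ORDER = ["K", "E3", "E2", "E1", "L", "F3", "F2", "F1"]


def bank_for(slot, timing):
    """Return the bank number accessible through `slot` at `timing`."""
    return (timing + SLOT_ORDER.index(slot)) % NUM_BANKS


for timing in (0, 1, 7, 8):
    k_bank = bank_for("K", timing)
    e_banks = [bank_for(s, timing) for s in ("E3", "E2", "E1")]
    print(timing, "K: B%d" % k_bank, "E3/E2/E1:", ["B%d" % b for b in e_banks])
```

With this rule, the eight slots address eight different banks at every timing, so pipelines holding different slots never access the same bank in the same cycle.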
FIG. 5 shows an example of the access timing and the read/write data by a non-composite calculation pipeline for which the bank slots E3, E2, and E1 are assigned.
In FIG. 5, it is assumed that the non-composite calculation pipeline carries out a calculation,
$$(R3)_k * (R2)_k = (R1)_k,$$
where $(R3)_k$ and $(R2)_k$ each denote an element of vector data which is used in a calculation carried out in the above non-composite calculation pipeline, "*" indicates a type of the calculation, e.g., "+" indicates an addition in the adder pipeline 653, or "x" indicates a multiplication in the multiplier pipeline 654, and $(R1)_k$ denotes a result of the above calculation using the above element data $(R3)_k$ and $(R2)_k$.
As shown in FIG. 5, element data $(R3)_k$ and $(R2)_k$ are read from the banks B1 and B2 through the bank slots E3 and E2, respectively, at the timing k. Since it takes a few or several cycles to obtain the above calculation result $(R1)_k$ from the timing of reading the above element data $(R3)_k$ and $(R2)_k$, a calculation result $(R1)_{k-r}$ obtained from the element data $(R3)_{k-r}$ and $(R2)_{k-r}$, which were read at the previous timing k-r, is written in the bank B3 through the bank slot E1 at the timing k.
Then, at the next timing k+1, element data $(R3)_{k+1}$ and $(R2)_{k+1}$ are read from the banks B2 and B3 through the bank slots E3 and E2, respectively, and a calculation result $(R1)_{k-r+1}$ obtained from the element data $(R3)_{k-r+1}$ and $(R2)_{k-r+1}$, which were read at the previous timing k-r+1, is written in the bank B4 through the bank slot E1 at the timing k+1.
Thus, a pipeline operation for vector data calculation is carried out in a calculation pipeline for which a necessary number of bank slots are assigned.
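The overlap of reading and writing described with reference to FIG. 5 can be illustrated by the following Python sketch, in which the operands of element k are read at the timing k while the result of element k-r is written at the same timing. The function name `run_pipeline` is hypothetical, and the latency of three cycles used in the example is an assumed value; the description above only states that the latency r is a few or several cycles. Bank and slot selection is deliberately left out of this sketch.

```python
# Illustrative sketch only; run_pipeline and latency=3 are assumptions.
from collections import deque


def run_pipeline(r3, r2, op, latency):
    """Simulate reads at timing k and writes of element k-latency.

    Returns the result vector and a log of (timing, index_read, index_written),
    where None means no read or no write took place at that timing.
    """
    n = len(r3)
    r1 = [None] * n
    in_flight = deque()  # (element index, result) pairs not yet written back
    log = []
    for timing in range(n + latency):
        read_idx = timing if timing < n else None
        if read_idx is not None:
            # Operands are read now; the result becomes writable `latency` cycles later.
            in_flight.append((read_idx, op(r3[read_idx], r2[read_idx])))
        write_idx = None
        if timing >= latency:
            write_idx, value = in_flight.popleft()
            r1[write_idx] = value
        log.append((timing, read_idx, write_idx))
    return r1, log


result, log = run_pipeline([1, 2, 3, 4], [10, 20, 30, 40],
                           lambda x, y: x + y, latency=3)
for entry in log:
    print(entry)
```

Once the pipeline is filled, one pair of operands is read and one result is written at every timing, which is why a calculation pipeline needs its read slots and its write slot at the same time.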
FIG. 6 shows an example of a timing of bank slot assignment when a calculation instruction is received.
When an addition instruction is received, the adder pipeline 653 obtains the next available set of three successive bank slots, so that reading of element data from the bank of the set of vector registers (VR) 652 designated by the instruction can start as early as possible.
For example, when an addition instruction is received, if the bank slot which can currently access the bank holding the element data to be read first for the execution of the addition instruction is F2, the next available three successive bank slots E1, E2, and E3 are assigned for the corresponding addition calculation, i.e., for the adder pipeline 653. Then, the corresponding calculation result is written through the bank slot E1 in the next rotation of the eight bank slots.
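The selection of the next available set of three successive bank slots may be sketched in Python as follows. This is only one possible reading of the behavior described for FIG. 6: the helper names `slot_at` and `next_group` are hypothetical, the slot that reaches a given bank at a given timing is derived from the cyclic access rule (bank = timing + slot offset, modulo 8), and it is assumed that the slot currently at the bank cannot itself be claimed.

```python
# Illustrative sketch only; helper names and the selection policy are assumptions.
NUM_BANKS = 8
SLOT_ORDER = ["K", "E3", "E2", "E1", "L", "F3", "F2", "F1"]
GROUPS = {"E": ("E3", "E2", "E1"), "F": ("F3", "F2", "F1")}


def slot_at(bank, timing):
    """Slot that can access `bank` at `timing` under the cyclic access rule."""
    return SLOT_ORDER[(bank - timing) % NUM_BANKS]


def next_group(bank, timing, busy_groups=()):
    """Find the first free three-slot group reaching `bank` after `timing`.

    Returns (group name, first timing at which the group reaches the bank),
    or None if every group is busy.
    """
    for t in range(timing + 1, timing + 1 + NUM_BANKS):
        window = {slot_at(bank, t + d) for d in range(3)}
        for name, slots in GROUPS.items():
            if name not in busy_groups and window == set(slots):
                return name, t
    return None


# At timing 2 the bank B0 is reached by slot F2, as in the example above.
print(slot_at(0, 2))
print(next_group(0, 2))
print(next_group(0, 2, busy_groups=("E",)))
```

In this model, the slots reach a fixed bank in the order F2, F3, L, E1, E2, E3, so that the group E1, E2, and E3 is the first complete group following F2, in agreement with the example above.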
FIG. 7 shows an example of a timing of bank slot assignment when a vector load instruction is received.
The memory access instruction executed in the vector execution unit (VEU) 65 is a vector load instruction or a vector store instruction. The bank slot K or L is assigned for a memory access pipeline for execution of a vector load or vector store instruction.
The vector store instruction instructs a vector data transfer operation from each bank of the set of vector registers (VR) 652 to the main storage unit (MSU) 61 through a memory access pipeline, i.e., the load/store pipeline 650 or 651.
The execution of a vector store instruction starts from a reading operation of a bank of the set of vector registers (VR) 652. Therefore, similar to the above calculation instruction, when a vector store instruction is received, the next available bank slot K or L, is assigned for the corresponding load/store pipeline.
On the other hand, the vector load instruction instructs a data transfer operation from the main storage unit (MSU) 61 to each bank of the set of vector registers (VR) 652 through a memory access pipeline, i.e., the load/store pipeline 650 or 651.
The execution of a vector load instruction starts from a reading operation from the main storage unit (MSU) 61. The load/store pipeline 650 or 651 must carry out an address calculation, and an address transformation, wait for an allowance of an access to the main storage unit (MSU) 61 by the memory control unit (MCU) 62, and then access the main storage unit (MSU) 61 to read out data from the main storage unit (MSU) 61.
However, in execution of a vector load instruction, the time necessary to read out data from the main storage unit (MSU) 61, in particular, the time necessary to obtain an allowance to access the main storage unit (MSU) 61 in the reading stage, is uncertain because the access to the main storage unit (MSU) 61 may contend with accesses from other data processing units.
Therefore, in the prior art, the timing of writing data in banks of the set of vector registers (VR) 652 is not determined when a vector load instruction is received; the timing is determined just before (a few or several cycles before) the writing operation in the set of vector registers (VR) 652. Thus, the assignment of the bank slot K or L is carried out just before the writing operation.
In the example shown in FIG. 7, the determination (assignment) of the bank slot for a vector load instruction is carried out five cycles (5τ) before the start of the writing operation in the set of vector registers (VR) 652, after a time T has elapsed from the receipt of the vector load instruction. The time T is uncertain at the time the vector load instruction is received.
As explained before with reference to FIG. 5, when a non-composite calculation instruction is received, one of the two sets of three successive bank slots (E3, E2, and E1) or (F3, F2, and F1) can be assigned for the instruction among the eight bank slots K, E3, E2, E1, L, F3, F2, and F1. However, when a composite calculation instruction, such as an addition-multiplication instruction, which instructs an execution of a calculation (A(i)+B(i))×C(i)=D(i), (i=1 to n), is received, four bank slots are necessary for an execution of the addition-multiplication instruction in the adder-multiplier pipeline 656: one bank slot is for reading an element of A(i) from a bank of the set of vector registers (VR) 652, another for reading B(i), another for reading C(i), and the other for writing the result D(i) in the set of vector registers (VR) 652.
As understood from the above explanation, more bank slots than the aforementioned three successive bank slots (E3, E2, and E1) or (F3, F2, and F1) must be assigned for a composite calculation instruction. When a composite calculation instruction which requires four bank slots is received, generally, one of the bank slots K and L is assigned together with the three successive bank slots (E3, E2, and E1) or (F3, F2, and F1), i.e., four successive bank slots (K, E3, E2, and E1), (E3, E2, E1, and L), (K, F3, F2, and F1), or (F3, F2, F1, and L) are assigned for a composite calculation instruction.
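The four candidate sets of four successive bank slots can be reproduced by the following Python sketch, which treats the eight bank slots as a ring and keeps every window of four adjacent slots that contains a complete three-slot calculation group. The names `composite_windows` and `assignable` are hypothetical, and the ring model is an assumption made for illustration.

```python
# Illustrative sketch only; the ring model and helper names are assumptions.
RING = ["K", "E3", "E2", "E1", "L", "F3", "F2", "F1"]
CALC_GROUPS = [{"E3", "E2", "E1"}, {"F3", "F2", "F1"}]


def composite_windows():
    """All windows of four adjacent slots on the ring that contain a full group."""
    n = len(RING)
    windows = []
    for start in range(n):
        window = tuple(RING[(start + d) % n] for d in range(4))
        if any(group <= set(window) for group in CALC_GROUPS):
            windows.append(window)
    return windows


def assignable(free_slots):
    """Windows whose four slots are all currently free."""
    free = set(free_slots)
    return [w for w in composite_windows() if set(w) <= free]


print(composite_windows())
# Example: slot K is already in use, all other slots are free.
print(assignable(["E3", "E2", "E1", "L", "F3", "F2", "F1"]))
```

The windows are listed in ring order, so (F3, F2, F1, K) and (L, F3, F2, F1) correspond to the sets written above as (K, F3, F2, and F1) and (F3, F2, F1, and L), respectively.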
In addition, composite calculation instructions are often executed just after an execution of a vector load instruction for a corresponding element, i.e., just after the loading of a corresponding element of vector data used for a calculation in a corresponding composite calculation pipeline.
FIG. 8 shows a timing of a successive execution of a vector load instruction and an addition-multiplication instruction using the data loaded by the vector load instruction.
To start a composite calculation instruction just after the corresponding element of vector data has been loaded by a vector load instruction, the necessary bank slots must be assigned for the composite calculation instruction, or in other words, for the corresponding composite calculation pipeline.
Since four successive bank slots (K, E3, E2, and E1), (E3, E2, E1, and L), (K, F3, F2, and F1), or (F3, F2, F1, and L) must be assigned for a composite calculation instruction, such as an addition-multiplication instruction, it is necessary to know at an early stage which of the bank slots K and L is available for the composite calculation instruction (pipeline). In other words, it is necessary to know at an early stage which of the bank slots K and L is assigned for the preceding vector load instruction (pipeline).
However, as described before with reference to FIG. 7, in the prior art, the assignment of bank slot K or L for a vector load instruction is carried out just before the writing operation is carried out. Therefore, it is impossible to assign a bank slot for a following composite calculation instruction (pipeline) because it is uncertain which of the bank slots K and L is available.
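The difficulty can be illustrated by the following Python sketch, which lists the four-slot sets that remain usable whichever bank slot the pending vector load instruction finally receives. The function name `windows_safe_for` is hypothetical, and it is assumed for simplicity that the bank slots E3 to E1 and F3 to F1 are otherwise free.

```python
# Illustrative sketch only; windows_safe_for is a hypothetical helper.
# The four sets of four successive bank slots named in the description.
WINDOWS = [
    ("K", "E3", "E2", "E1"),
    ("E3", "E2", "E1", "L"),
    ("K", "F3", "F2", "F1"),
    ("F3", "F2", "F1", "L"),
]


def windows_safe_for(possible_load_slots):
    """Sets that avoid every slot the pending vector load might still take."""
    blocked = set(possible_load_slots)
    return [w for w in WINDOWS if not (set(w) & blocked)]


# Load slot not yet decided: it may become K or L.
print(windows_safe_for({"K", "L"}))
# Load slot already known to be K.
print(windows_safe_for({"K"}))
```

Each of the four sets contains either K or L, so no set can be chosen safely while the bank slot of the preceding vector load instruction is still undecided; once that slot is known, two sets remain.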
Generally, in the prior art it is difficult to dynamically assign a bank slot for instructions following a vector load instruction due to the late assignment of a bank slot.