The present invention relates to a data processing apparatus having a plurality of arithmetic units and a plurality of registers for parallelly executing register-to-register arithmetic operations by simultaneously operating different arithmetic units, and more particularly to a data processing apparatus for executing the above operations for vector data.
In a field of large scale technical calculation, a large volume of arithmetic operations for a large volume of data is required. In a high speed computer, a plurality of arithmetic units are provided and they are operated parallelly.
FIG. 1 shows a schematic configuration of a prior art vector processing apparatus. Numeral 10 denotes a main storage unit (MS), numerals 20 and 21 denote pipelined arithmetic units and numeral 50 denotes registers which can be written and read independently.
For convenience of explanation, the numbers of the registers and the arithmetic units are assumed to be 8 and 2, respectively. Numeral 60 denotes a first selector for parallelly sending out data read from the MS 10 or a vector from the arithmetic unit 20 or 21 to one of the registers 50 in a pipeline fashion, one vector data element at a time. Numeral 70 denotes a second selector for parallelly sending out the vector data serially read from the respective registers 50, one vector data element at a time, to the MS 10 or the arithmetic unit 20 or 21. Any combination of the registers R0-R7 and the arithmetic unit 20 or 21 or the MS 10 can be designated by a control circuit 80 in accordance with a program instruction.
Let us assume that each vector data block consists of a plurality of vector data elements, each consisting of sixteen parallel bits. The circuit of FIG. 1 actually shows the configuration for one bit of the parallel 16-bit vector data elements; therefore, 16 configurations of FIG. 1 should be arranged in parallel. The same is true for the other drawings.
The control circuit 80 serially decodes the instructions. Depending on the resource (arithmetic unit or MS) and the register required by the decoded instruction, it outputs the result register numbers of the registers in which the result data is stored, through lines Ri1-Ri4 which correspond to the lines 91-94, respectively, to the internal registers of the first selector 60, and outputs the operand register numbers of the registers from which the vector data is outputted, through lines Rj1, Rj2, Rk1, Rk2 and Rl1 which correspond to the lines 82, 84, 83, 85 and 90, respectively, to the internal registers of the second selector 70.
A timing for serially reading out the vector data from the vector registers R0-R7 and a timing for storing the result vectors serially outputted from the arithmetic units 20 and 21 to the vector registers R0-R7 are controlled by the control circuit 80.
Such a vector data processing apparatus is disclosed in U.S. application Ser. No. 453,094, now U.S. Pat. No. 4,617,625, based on Japanese patent application No. 56-210392, assigned to the assignee of the present application.
In the vector data processing apparatus of this type, the vector element data is sequentially read out from one or two vector registers designated as the operand registers by an instruction, and the data is sent to the arithmetic unit 20 or 21, and the operation result is sequentially stored in one register designated as the result register by the instruction. The content of the vector register designated by the instruction is stored in the MS 10 or the vector data in the MS 10 is transferred to the vector register designated by the instruction. Accordingly, the operations for a plurality of instructions which use different operand registers, result registers or resources (arithmetic units or MS) are executed parallelly.
When a register designated as the result register by a preceding first instruction is used as the operand register by a succeeding second instruction, the operation by the second instruction is started before the operation by the first instruction ends, that is, in the course of the arithmetic operation of the vector elements relative to the first instruction. As a result, the first and second instructions are partially and parallelly executed. This technique is called chaining. The chaining operation is disclosed in U.S. Pat. No. 4,128,880.
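The element-by-element overlap that chaining provides can be illustrated with a minimal Python sketch. It is only an analogy under stated assumptions: Python generators stand in for the serial element streams, and the names load, add and log are hypothetical and do not appear in the cited patents. The log shows that the addition of each element begins as soon as that element has been loaded, without waiting for the whole vector.

```python
from typing import Iterable, Iterator, List


def load(source: List[int], log: List[str]) -> Iterator[int]:
    """Hypothetical first instruction: stream a vector out one element at a time."""
    for i, value in enumerate(source):
        log.append(f"load[{i}]")
        yield value


def add(xs: Iterable[int], ys: Iterable[int], log: List[str]) -> Iterator[int]:
    """Hypothetical second instruction: consume each element as soon as it arrives."""
    for i, (x, y) in enumerate(zip(xs, ys)):
        log.append(f"add[{i}]")
        yield x + y


log: List[str] = []
x = [1, 2, 3]
a = [10, 20, 30]
result = list(add(load(x, log), a, log))

print(result)  # [11, 22, 33]
print(log)     # ['load[0]', 'add[0]', 'load[1]', 'add[1]', 'load[2]', 'add[2]']
```

Without chaining, the log would instead show all three loads before the first addition.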
An operation of the apparatus of FIG. 1 is now explained, taking the following vector operation as an example:
Z=[{(x+a)*b}+c]*y
where x and y are variable vectors in the MS 10, a, b and c are constant vectors read from the MS 10 and stored in the registers R1, R3 and R5, respectively, and Z is a variable vector which is to be written into the MS 10 as an operation result.
FIG. 2 shows a diagram of a vector operation executed in parallel by the two arithmetic units 20 and 21, the MS 10 and the eight registers R0-R7.
A first instruction to transfer the vector x from the MS 10 to the register R0 is executed. By the execution of this instruction, the vector data elements of the vector x are serially read out from the MS 10 to a line 91. The first selector 60 responds to the signal supplied from the control circuit 80 as a result of decoding of the first instruction to connect the line 91 to the input terminal of the register R0. In this manner, the vector data elements of the vector x are sequentially stored in the register R0 by the control circuit 80. In the meantime, the execution of a second instruction for addition of the vectors in the registers R0 and R1 is started. However, the register R0 requested by the second instruction is used by the first instruction to store the vector therein. Accordingly, the control circuit 80 does not immediately start the transfer of the vectors from the registers R0 and R1. When the head data element of the vector x is stored in the register R0, the control circuit 80 starts the transfer of the head vector data element pair of the vectors stored in the registers R0 and R1. On the other hand, the second selector 70 responds to the signal supplied from the control circuit 80 as a result of decoding of the second instruction to connect the output terminals of the registers R0 and R1 to the lines 82 and 83, respectively. In this manner, the head vector data element pair of the vectors x and a is transferred to the arithmetic unit 20. In a similar manner, the remaining elements of those vectors are sequentially sent to the arithmetic unit 20. The arithmetic unit 20 executes pipeline addition for the serially inputted vector data element pairs, and serially outputs the results.
The first selector 60 responds to the signal supplied from the control circuit 80 as a result of decoding of the second instruction to connect the output line 93 of the arithmetic unit 20 to the register R2 which is designated as the result register by the second instruction. The control circuit 80 serially writes into the register R2 the vector data elements of the result vector (x+a) which are serially sent from the arithmetic unit 20.
In this manner, as soon as the storing of the vector data elements of the vector x into the register R0 is started, the vector operation by the arithmetic unit 20 is executed in parallel with the storing, and the result is stored in the register R2.
In a similar manner, a third instruction which instructs multiplication of the vectors in the registers R2 and R3 is executed in parallel with the storing of the vector data elements of the vector (x+a), and a product vector (x+a)*b is stored serially, one vector data element at a time, in the register R4 designated by the third instruction.
Next, a fourth instruction which instructs addition of the vectors in the registers R4 and R5 is executed. In the vector processing apparatus of FIG. 1, only two arithmetic units are provided. Accordingly, the execution of the fourth instruction is delayed until the arithmetic unit 20 completes the execution of the addition (x+a) by the second instruction. When the arithmetic unit 20 completes the execution of the second instruction, the fourth instruction is executed by the arithmetic unit 20 in parallel with the storing of the vector (x+a)*b in the register R4, and a sum vector (x+a)*b+c outputted by the arithmetic unit 20 is stored in the register R6 designated by the fourth instruction.
A fifth instruction to store the vector y in the register R7 is executed at an appropriate time, and the vector y is serially stored in the register R7 from the MS 10. After the fourth and fifth instructions, a sixth instruction to multiply the contents of the registers R6 and R7 and store the product in the register R0 is executed. Because the data processing apparatus of FIG. 1 has only two arithmetic units, the execution of the sixth instruction is delayed until the arithmetic unit 21 completes the execution of the multiplication by the third instruction. When the arithmetic unit 21 completes the execution of the third instruction, the control circuit 80 reads out the vector data element pairs of the vector (x+a)*b+c and the vector y from the registers R6 and R7. Those vector data element pairs are sent to the arithmetic unit 21 through the second selector 70 and the product is stored in the register R0 through the first selector 60. In a similar manner, the operations for the other vector data element pairs are carried out.
As soon as the head vector data element of the result vector from the arithmetic unit 21 is stored in the register R0, an instruction to transfer the content of the register R0 into the MS 10 is executed and the final result vector is serially stored in the MS 10.
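The instruction sequence described above can be traced with a minimal Python sketch. It is only a functional model under stated assumptions: registers are modeled as Python lists, the instructions run one after another rather than overlapped, and the function name run_program and the dictionary keys are hypothetical names introduced for illustration. The sixth instruction is modeled as the multiplication by y that the operation Z=[{(x+a)*b}+c]*y calls for.

```python
from typing import Dict, List

Vector = List[float]


def run_program(ms: Dict[str, Vector]) -> Vector:
    """Trace the register usage for Z=[{(x+a)*b}+c]*y (sequential model, no overlap)."""
    regs: Dict[str, Vector] = {}

    # Constant vectors a, b and c are assumed to be preloaded into R1, R3 and R5.
    regs["R1"] = list(ms["a"])
    regs["R3"] = list(ms["b"])
    regs["R5"] = list(ms["c"])

    regs["R0"] = list(ms["x"])                                      # 1: load x
    regs["R2"] = [p + q for p, q in zip(regs["R0"], regs["R1"])]    # 2: x+a
    regs["R4"] = [p * q for p, q in zip(regs["R2"], regs["R3"])]    # 3: (x+a)*b
    regs["R6"] = [p + q for p, q in zip(regs["R4"], regs["R5"])]    # 4: (x+a)*b+c
    regs["R7"] = list(ms["y"])                                      # 5: load y
    regs["R0"] = [p * q for p, q in zip(regs["R6"], regs["R7"])]    # 6: R0 is reused
    ms["Z"] = list(regs["R0"])                                      # 7: store Z
    return ms["Z"]


memory = {"x": [1, 2], "a": [1, 1], "b": [2, 2], "c": [3, 3], "y": [10, 100]}
print(run_program(memory))  # [70, 900]
```

The reuse of R0 in step 6 is the register pressure that the description attributes to the small number of registers.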
In the execution of the instructions described above, the first and second selectors 60 and 70 receive the register numbers from the control circuit 80 to control the connection between the lines 82-85, 90 and the registers R0-R7 and the connection between the lines 91-94 and the registers R0-R7.
When the number of arithmetic units in the data processing apparatus is small, the execution of the succeeding instruction is delayed until the arithmetic unit used by the previous instruction becomes available. Accordingly, the chaining is not attained efficiently. Further, when the number of registers is small, some registers (e.g., R0 in the above case) must be used twice or more in the course of execution of a vector operation. While the data processing apparatus of FIG. 1 is simplified for the purpose of better understanding, it is apparent that in an actual data processing apparatus, the larger the number of arithmetic units is, the faster a large volume of operations is executed. In the actual data processing apparatus, a number of registers are also required. Accordingly, a data processing apparatus having a number of arithmetic units and registers is desirable. For example, a data processing apparatus, as shown in FIG. 3, has twice as large a capacity as that of the data processing apparatus of FIG. 1, but is constructed in accordance with the concept of FIG. 1. In FIG. 3, the numbers of the arithmetic units and the registers are 4 and 16, respectively, and two output lines extend from the second selector 70 to the MS 10. Like numerals to those shown in FIG. 1 designate like elements. The second selector 70 of FIG. 3 comprises ten partial selectors for connecting the output of a selected one of the registers R0-R15 to the two input lines 90, 81 for the MS 10 and the eight input lines 82-89 for the four arithmetic units 20-23. Similarly, the first selector 60 comprises six partial distributors for connecting the output lines 91 and 92 from the MS 10 and the four output lines 93-96 from the arithmetic units 20-23 to respective ones of the registers R0-R15.
The control circuit 80 sequentially decodes the instructions and outputs operand register numbers Rj1-Rj4 for designating first inputs of the arithmetic units, operand register numbers Rk1-Rk4 for designating second inputs of the arithmetic units and register numbers Rl1 and Rl2 for storing into the MS 10, to the ten partial selectors of the second selector 70, depending on the resource (arithmetic unit or MS) and the register requested by the decoded instruction. The result register numbers Ri1-Ri6 are outputted to the six partial distributors of the first selector 60.
In the data processing apparatus shown in FIG. 3, when the vector operation Z=[{(x+a)*b}+c]*y is to be carried out, the delay of the execution of the instruction due to the small number of arithmetic units is avoided and efficient chaining is attained. Even when the result register of the preceding instruction and the operand register of the succeeding instruction are common, the execution of the succeeding instruction may be started as soon as the first result vector element for the preceding instruction is obtained, so that the two instructions are effectively executed in parallel.
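The effect of the number of arithmetic units on the chained operation can be estimated with a toy timing model in Python. The model is not taken from the source and rests on loud assumptions: every chained stage starts one cycle after its producer starts, an instruction occupies its arithmetic unit for n cycles (n being the vector length), any free unit may execute any operation, and loads from the MS never stall. The function name finish_time is hypothetical.

```python
def finish_time(num_units: int, n: int) -> int:
    """Cycle at which the last of four chained arithmetic instructions completes.

    Toy assumptions (not from the source): one-cycle chaining latency, n cycles
    of unit occupancy per instruction, interchangeable units, no MS stalls.
    """
    unit_free = [0] * num_units   # cycle at which each arithmetic unit becomes free
    producer_start = 0            # the load of x is assumed to start at cycle 0
    end = 0
    for _ in range(4):            # add, multiply, add, multiply
        # Pick the unit that becomes free earliest.
        u = min(range(num_units), key=lambda i: unit_free[i])
        # Wait both for the first operand element and for a free unit.
        start = max(producer_start + 1, unit_free[u])
        end = start + n
        unit_free[u] = end
        producer_start = start
    return end


print(finish_time(2, 64))  # 130: the third and fourth operations wait for a unit
print(finish_time(4, 64))  # 68: all four operations are chained without waiting
```

Under these assumptions the two-unit configuration takes roughly twice as long as the four-unit configuration for long vectors, which matches the qualitative argument above.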
In FIG. 3, the first and second selectors 60 and 70 can connect any register and any resource. However, such a selector is of large scale and has a limitation on the arrangement of the circuit components and a limitation on the operation speed due to the delay time in the selector.
For example, in the second selector 70, it is desirable, for reducing a signal propagation time on the signal line, that the readout data lines 51 of the registers 50 (that is, the input signal lines to the second selector 70), the data lines 90 and 81 to the MS 10 and the data lines 82-89 to the arithmetic units 20-23 are short and concentrated. In order to meet the above requirements, it is necessary that the MS 10 and the arithmetic units 20-23 are located in the vicinity of the registers 50. However, when the number of the arithmetic units and the number of the registers R0-R15 are increased, such a physical arrangement is very difficult to attain. As a result, the signal propagation time on the signal line cannot be sufficiently reduced. The same is true for the relation between the first selector 60 and the output lines 93-96 of the arithmetic units 20-23.
It is considered that the circuit scale of the selector 60 or 70 is proportional to a product of the number of inputs and the number of outputs. For example, for the selector, the circuit scale of a one-out-of-n selector is proportional to the number n of inputs and an n-input m-output selector can be attained by m n-input selectors. For the distributor, the circuit scale of an m-output distributor is proportional to the number m of outputs and an n-input m-output distributor can be attained by n partial distributors. Accordingly, the first and second selectors 60 and 70 have problems in that the circuit scale increases as the numbers of registers and arithmetic units increase.
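The proportionality stated above can be checked with a short Python calculation using the line and register counts given for FIG. 1 and FIG. 3. The function name selector_scale is hypothetical, and the results are in arbitrary units, since the text gives only a proportionality and no constant.

```python
def selector_scale(num_inputs: int, num_outputs: int) -> int:
    """Relative circuit scale of an n-input, m-output selector or distributor."""
    return num_inputs * num_outputs


# FIG. 1: 8 registers; lines 82-85 and 90 leave the second selector (5 outputs);
# lines 91-94 enter the first selector (4 inputs).
fig1_second = selector_scale(8, 5)
fig1_first = selector_scale(4, 8)

# FIG. 3: 16 registers; lines 81-90 leave the second selector (10 outputs);
# lines 91-96 enter the first selector (6 inputs).
fig3_second = selector_scale(16, 10)
fig3_first = selector_scale(6, 16)

print(fig1_second, fig3_second)  # 40 160
print(fig1_first, fig3_first)    # 32 96
```

Doubling the numbers of registers and arithmetic units thus multiplies the scale of the second selector by four and that of the first selector by three in this estimate.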
For this reason, the signal propagation time in the selectors 60 and 70 inevitably increases as the numbers of registers and arithmetic units increase.