The present invention relates to a program-controlled digital computer, and more particularly to a digital computer suitable for executing vector operations at a high speed (hereinafter referred to as a vector processor).
A vector processor has been proposed for high speed processing such as that required for large scale matrix operations, which frequently appear in scientific and technical calculations.
A vector processor having vector registers and a chaining function, which make effective use of the fast operation speed of a plurality of pipelined ALU's and enhance the transfer capability of operation data, has been proposed (U.S. Pat. No. 4,128,880).
The prior art system has a problem in connection with the chaining function. This problem will be explained with reference to the following simple example of a vector operation.
______________________________________
FORTRAN Sentence
______________________________________
      DO 10 I = 1, L
   10 Y (I) = A (I) + B (I) * C (I)
______________________________________
This process is expressed by vector instructions as follows.
______________________________________
1. Vector Load      VR0 ← A
2. Vector Load      VR1 ← B
3. Vector Load      VR2 ← C
4. Vector Multiply  VR3 ← VR1 * VR2
5. Vector Add       VR4 ← VR0 + VR3
6. Vector Store     VR4 → Y
______________________________________
where VRi is an i-th vector register.
Each vector instruction repeatedly executes the operation and the data transfer for L elements.
In the above example, the product of vectors B and C, which is an intermediate result, is temporarily stored in the vector register VR3, and only the sum Y of the product and the vector A is stored in the main storage. In general, in a vector processor having vector registers, an intermediate result of an operation is temporarily stored in a vector register VRi and only a final result vector is stored in the main storage, so that the number of data transfers to and from the main storage is substantially reduced. Accordingly, by increasing the speed of the read/write operations of the vector registers, the data transfer capability necessary for the operation is assured even if the accessing capability of the main storage is relatively low.
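The six vector instructions above can be modelled in a few lines. The following is a minimal sketch only: the function name `run_vector_program`, the dictionary standing in for the main storage, and the transfer counter are illustrative assumptions and are not taken from the prior art system. It shows that the product held in VR3 never travels to or from the main storage.

```python
def run_vector_program(memory, length):
    """Run the six vector instructions on `length` elements.

    `memory` is a hypothetical dict standing in for main storage.
    Returns the number of element transfers to or from main storage.
    """
    vr = {}          # vector registers VR0..VR4
    transfers = 0    # elements moved to or from main storage

    def vector_load(reg, name):
        nonlocal transfers
        vr[reg] = list(memory[name][:length])
        transfers += length

    vector_load("VR0", "A")                                      # 1. Vector Load
    vector_load("VR1", "B")                                      # 2. Vector Load
    vector_load("VR2", "C")                                      # 3. Vector Load
    vr["VR3"] = [b * c for b, c in zip(vr["VR1"], vr["VR2"])]    # 4. Vector Multiply
    vr["VR4"] = [a + p for a, p in zip(vr["VR0"], vr["VR3"])]    # 5. Vector Add
    memory["Y"] = list(vr["VR4"])                                # 6. Vector Store
    transfers += length
    return transfers
```

In this model the program moves 4L elements between the registers and the main storage. Writing the intermediate product to the main storage and fetching it again would add a further 2L transfers.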
Considering the fourth and fifth vector instructions of the above example, the vector register VR3, which stores the product provided by the preceding instruction 4, also serves as the vector register from which an operand for the succeeding vector add instruction 5 is read. If the vector add instruction 5 is started only after all of the L results have been written into the vector register VR3 by the vector multiply instruction 4, the plurality of ALU's (for example, a multiplier and an adder) are not operated in parallel and the processing time is increased. The same queuing or waiting time, which arises whenever a vector register holding the operation result of a preceding vector instruction or fetched vector data is read as the operand for a succeeding vector instruction, also exists between the vector instruction 2 or 3 and the vector instruction 4, between the vector instructions 1 and 5, and between the vector instructions 5 and 6. The queuing can be relieved by a chaining function. In a prior art chaining operation, the operation result of the preceding vector instruction is written into a vector register VRi and is simultaneously transferred to the ALU as the operand for the succeeding vector instruction. If the chaining operation is possible, the succeeding vector instruction is started as soon as the operation result for the first element of the preceding vector instruction is outputted, without waiting for the completion of the operation for the final element. In this manner, in a multi-term vector calculation, a plurality of ALU's are operated in parallel so that the parallel operation capability is enhanced and a high speed operation is attained.
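The benefit of chaining can be illustrated with a simple cycle count. The sketch below assumes an idealized pipeline that delivers one element per cycle after a fixed latency. The function names and the latency values in the usage note are hypothetical and are not figures from any actual machine.

```python
def finish_cycle(start, latency, length):
    """Cycle in which the last of `length` results is written.

    Assumes element k (counting from 0) is written at start + latency + k.
    """
    return start + latency + length - 1


def multiply_then_add(length, mul_latency, add_latency, chained):
    """Cycle in which a dependent add finishes after a multiply started at cycle 0."""
    mul_start = 0
    if chained:
        # The add starts when the first product reaches the vector register.
        add_start = mul_start + mul_latency
    else:
        # The add waits until every product has been written.
        add_start = finish_cycle(mul_start, mul_latency, length) + 1
    return finish_cycle(add_start, add_latency, length)
```

For example, with L = 64 and assumed latencies of 7 cycles for the multiplier and 6 cycles for the adder, the unchained add finishes in cycle 140 and the chained add in cycle 76 under this model.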
In the prior art chaining, however, the reading of the vector register for the succeeding vector instruction is started in full synchronism with the writing into the vector register of the first element of the operation result vector of the preceding vector instruction or of the fetched vector data. At the start of an instruction, it is checked whether the vector register from which the operand for the instruction is to be read is being used for writing by the preceding instruction, and if the decision is affirmative, the chaining condition is checked. If chaining is possible, the instruction is started, and if chaining is not possible, the start of the instruction is delayed until the writing of all elements by the preceding instruction is completed and the vector register is released. The chainable instruction is started when the first element of the operation result vector of the preceding instruction or of the fetched vector data arrives at the vector register. This time is called the chain slot time. It is determined that the chaining is possible only when it is predicted at the chain slot time that the writing of all subsequent elements by the preceding instruction is fully synchronous with the readout of those elements by the succeeding instruction.
Thus, the prior art chaining operation has the following two restrictions.
(1) The vector instruction decoded after the writing of the first element (that is, after the chain slot time) is determined to be unchainable because the decoding time is after the chain slot time, and under these conditions, the start of the succeeding instruction is delayed until the writing by the preceding instruction is completed. For example, in the above example, when a plurality of vector instructions exist between the vector instructions 4 and 5 and hence the decoding time of the instruction 5 occurs after the chain slot time of the instruction 4, the chaining of the vector register VR3 is not possible and the start of the instruction 5 is delayed until the execution of the vector instruction 4 is completed. As a result, the processing time is increased.
(2) Many of the operations take two vector operands and produce one result. If the two vector registers which store the two operands to be read are both being written by preceding instructions, it must be assured that corresponding elements of those operands are written into those vector registers at the same time in order to chain both registers and cause their contents to be inputted to the same ALU. In the above example, if it is predicted at the time of decoding of the vector instruction 5 that the arrival of the fetched vector data from the main storage at the vector register VR0 by the vector instruction 1 and the arrival of the product vector at the vector register VR3 by the vector instruction 4 are synchronized for all elements, it is determined that the chaining is possible and the instruction is instantly started. Accordingly, it is necessary that the consecutive elements of both vector data arrive at the vector registers consecutively at a pitch of one cycle and that the chain slot times of the two vector registers are identical.
However, the data fetching from the main storage is generally intermittent because of bank conflicts among a plurality of memory accesses and the intervention of channel operations. As a result, fetched vector data obtained by a preceding instruction can seldom be used by a succeeding instruction through chaining. In actual practice, the coincidence of the chain slot times of the two vector data occurs only by chance. Accordingly, the chance of chaining in the prior art vector processor is low.
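The two restrictions can be summarized as a single predicate. The following is an illustrative sketch only. It assumes that the cycle in which each element arrives at a vector register is known in advance, whereas the prior art system predicts the synchronism at the chain slot time. The names `arrives_one_per_cycle` and `prior_art_chainable` are hypothetical.

```python
def arrives_one_per_cycle(arrivals):
    """True if consecutive elements arrive in consecutive cycles."""
    return all(b - a == 1 for a, b in zip(arrivals, arrivals[1:]))


def prior_art_chainable(decode_cycle, arrivals_x, arrivals_y):
    """Simplified prior-art test for chaining two source registers into one ALU.

    arrivals_x, arrivals_y: assumed cycle numbers at which each element
    is written into the two source vector registers.
    """
    # Restriction (1): the instruction must be decoded by the chain slot time.
    decoded_in_time = decode_cycle <= arrivals_x[0]
    # Restriction (2): identical chain slot times and a steady one-cycle pitch.
    same_slot = arrivals_x[0] == arrivals_y[0]
    steady = arrives_one_per_cycle(arrivals_x) and arrives_one_per_cycle(arrivals_y)
    return decoded_in_time and same_slot and steady
```

A single stalled element in the fetched data, for example arrival cycles 10, 11, 13, 14 against a product vector arriving in cycles 10, 11, 12, 13, is enough to make the predicate false, which reflects why chaining of fetched data succeeds only by chance.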
The prior art vector processor has a further problem. It has a plurality of ALU's which are configured for parallel execution, but depending on the mutual relation of a plurality of instructions, the ALU's cannot be used even though they are vacant. In the prior art vector processor, instructions are fetched in the sequence in which they are stored in the main storage, the execution condition of each instruction is checked, and the instruction is executed if execution is possible. If execution is not possible because all of the ALU's are busy, a register is busy, or the like, the instruction execution is delayed until it becomes possible. Even if the instruction next to an unexecutable instruction is executable, its decoding is unconditionally delayed and hence the vacant ALU's are not utilized.
In order to resolve the above problem, it is necessary to provide an instruction execution control system which does not execute the instruction in the sequence of the instruction decoding but stores several decoded instructions and executes the executable instructions while taking the order of utilization of the vector register into consideration.
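The difference between starting instructions strictly in decoding order and starting any stored instruction that is executable can be sketched as follows. This is a deliberately simplified model and not the control system of the invention: there is no chaining, at most one instruction starts per cycle, a register is regarded as ready when its writer has finished, and hazards are checked only against earlier instructions that have not yet started. The names `Instr` and `schedule` are hypothetical.

```python
from collections import namedtuple

# name, functional unit, destination register, source registers, busy cycles
Instr = namedtuple("Instr", "name unit dst srcs cycles")


def schedule(program, in_order):
    """Return {instruction name: start cycle} for a list of Instr."""
    unit_free = {}   # unit -> first cycle in which it is vacant
    reg_ready = {}   # register -> cycle in which its contents are complete
    start = {}
    pending = list(program)
    cycle = 0
    while pending:
        for idx, ins in enumerate(pending):
            earlier = pending[:idx]
            # Preserve the order of register use against earlier, unstarted instructions.
            conflict = any(
                e.dst in ins.srcs or e.dst == ins.dst or ins.dst in e.srcs
                for e in earlier
            )
            ready = (
                not conflict
                and unit_free.get(ins.unit, 0) <= cycle
                and all(reg_ready.get(r, 0) <= cycle for r in ins.srcs)
            )
            if ready:
                start[ins.name] = cycle
                unit_free[ins.unit] = cycle + ins.cycles
                reg_ready[ins.dst] = cycle + ins.cycles
                pending.pop(idx)
                break
            if in_order:
                break   # a blocked instruction holds back everything behind it
        cycle += 1
    return start


program = [
    Instr("mul1", "mul", "VR3", ("VR1", "VR2"), 8),
    Instr("mul2", "mul", "VR5", ("VR1", "VR4"), 8),   # multiplier still busy
    Instr("add1", "add", "VR7", ("VR0", "VR6"), 8),   # independent of mul2
]
print(schedule(program, in_order=True))
print(schedule(program, in_order=False))
```

In this model the independent add instruction starts in cycle 9 when instructions are started in order and in cycle 1 when stored instructions may be started out of order, while the second multiply instruction starts in cycle 8 in both cases.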
Further, in the prior art vector processor, the conditions required to start the instruction execution include the availability of the necessary registers and ALU's as well as the availability of the necessary data. As a result, the control of the start of the execution is complex and the decoding of the next instruction is delayed, resulting in an increase of the instruction execution time and a reduction of the performance.