1. Field of the Invention
The present invention relates to a vector processing unit, and, more particularly, to a vector processing unit having a data buffer between a storage and a vector processor.
2. Description of the Prior Art
In a shared-storage vector processing unit, the physical distance to the storage tends to become longer as the system becomes larger, owing to mounting constraints. Therefore, storage access takes a long time.
The vector processing unit attains high-speed arithmetic processing by continuously processing a large amount of data. That is, after a group of operands is loaded from the storage into a vector register, arithmetic processing is performed by supplying the operands from the vector register to a processor. The result is then stored back in a vector register, the content of which is in turn stored in the storage.
Referring to FIG. 10, there is shown an example of operation in a conventional vector processing unit where vector processing is performed by transferring data between the storage and the vector register without providing any buffer. In the example, the vector processing is performed by using data loaded from the storage with two vector load instructions VLD, and then the result of processing is written to the storage with a vector store instruction VST. Because the processor must wait for each storage access to complete, idle time arises. As the storage access time becomes longer, the idle time of the processor also becomes longer, so that the usage efficiency of the processor deteriorates.
In addition, since a plurality of vector processors are provided in the vector processing unit, storage accesses from the vector processors may contend with one another. Therefore, it is not guaranteed that data is returned in the minimum access time, so that idle time tends to occur more often.
In order to solve this problem, the conventional vector processing unit provides a buffer between the vector register and the storage for separating the storage access from the vector processing. This allows the vector processing unit to load data in advance and, once the data has been stored in the buffer, to continue subsequent vector processing regardless of whether a storage access is in progress.
Referring to FIG. 11, the conventional vector processing unit 1800 comprises a vector processor 1700 having a crossbar 1710, vector registers 1720, 1721, and a processor 1730; load data buffers 1100 with a load data buffer storing circuit 1200, and a load data buffer read circuit 1300 for storing vector data to be sent to the vector processor 1700; and store data buffers 1400 with a store data buffer store circuit 1500, and a store data buffer read circuit 1600 for storing the result of processing by the vector processor 1700.
The number of words in each of the load data buffers 1110 and 1120 is designed for the maximum number of elements of a vector instruction. In addition, although two load data buffers 1110 and 1120 are provided here, the number of load data buffers is designed by estimating the number of load instructions that have been issued but for which data has not yet been returned.
Furthermore, the number of words in each of the store data buffers 1410 and 1420 is also designed for the maximum number of elements of a vector instruction. The number of store data buffers is designed by estimating the number of store instructions that may be executed between issuance of a store instruction and completion of its storing in a storage 1900.
Referring to FIGS. 12 (A) and 12 (B), if the vector length is eight as in FIG. 12 (A), four vector load instructions VLD can be issued before the vector processing can be executed. If the vector length is four as in FIG. 12 (B), seven vector load instructions VLD can be issued before the vector processing can be executed. Therefore, four load data buffers may be sufficient if the vector length is eight, but seven load data buffers are required if the vector length is four.
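The relationship between vector length and the required number of load data buffers can be illustrated with a minimal sketch. The model below is an assumption made only for illustration and is not taken from FIGS. 12 (A) and 12 (B): each load instruction is assumed to occupy the issue path for one cycle per element, and the storage access latency is a hypothetical 28 cycles, chosen because it reproduces the counts of four and seven stated above. The names `loads_in_flight` and `ASSUMED_LATENCY_CYCLES` are likewise hypothetical.

```python
import math

# Hypothetical storage access latency in cycles (an assumption, not a
# value given in the description).
ASSUMED_LATENCY_CYCLES = 28


def loads_in_flight(latency_cycles: int, vector_length: int) -> int:
    """Estimate how many vector loads can be issued before the first
    load's data has returned and vector processing can start.

    Assumes each load occupies the issue path for `vector_length`
    cycles (one element per cycle). Each such load needs its own
    load data buffer.
    """
    return math.ceil(latency_cycles / vector_length)


if __name__ == "__main__":
    for vector_length in (8, 4):
        needed = loads_in_flight(ASSUMED_LATENCY_CYCLES, vector_length)
        print(f"vector length {vector_length}: {needed} load data buffers")
```

Under these assumptions, halving the vector length roughly doubles the number of loads outstanding at once, which is why a buffer count sized for long vectors may be insufficient for short ones.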
Next, the operation of instructions and the result of processing are described by using an example instruction sequence. Here, for the purpose of description, it is assumed that there are two load data buffers and two store data buffers, each with a capacity of 8 bytes × 64 words; that the instructions are an 8-byte load instruction VLD, an upper 4-byte load instruction VLDU, a lower 4-byte load instruction VLDL, an 8-byte store instruction VST, an upper 4-byte store instruction VSTU, a lower 4-byte store instruction VSTL, a fixed point addition VADD, and a floating point addition VFAD; and that the maximum number of vector elements which these instructions can have is 64. Furthermore, the vector registers are referred to as V0 and V1.
Here, it is assumed that the instruction sequence shown in FIG. 9 is executed with a vector length of 16.
Referring to FIG. 13, when the instruction sequence is processed in the conventional vector processing unit by using two load data buffers, the vector load instructions (1) and (2) are first assigned the load data buffers 1110 and 1120, respectively. Then, the load data buffers 1110 and 1120 are also used for the vector load instructions (5) and (6). Similarly, the vector store instructions (4) and (8) are assigned the store data buffers 1410 and 1420, respectively.
In this case, because there are only two load data buffers, the instruction (5) cannot be issued until the load data buffer 1110, which is used by the instruction (1), is released. Thus, delay time as shown in FIG. 13 occurs.
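The stall caused by a fixed pool of load data buffers can be sketched as follows. This is a simplified illustration under stated assumptions, not a reproduction of the timing in FIG. 13: the earliest issue cycles of the four load instructions and the number of cycles a buffer remains occupied (`hold_cycles`) are hypothetical values, and the function name `issue_times` is introduced here only for the example.

```python
def issue_times(load_requests, num_buffers, hold_cycles):
    """Return the issue cycle of each load for a fixed pool of buffers.

    load_requests: list of (name, earliest_cycle) in program order.
    num_buffers:   number of load data buffers in the pool.
    hold_cycles:   cycles a buffer stays occupied after its load is
                   issued (assumed constant for simplicity).
    """
    free_at = [0] * num_buffers  # cycle at which each buffer is released
    issued = {}
    previous_issue = 0  # loads are issued in program order
    for name, earliest in load_requests:
        # Pick the buffer that is released first.
        index = min(range(num_buffers), key=lambda i: free_at[i])
        start = max(earliest, free_at[index], previous_issue)
        issued[name] = start
        free_at[index] = start + hold_cycles
        previous_issue = start
    return issued


if __name__ == "__main__":
    # Hypothetical earliest issue cycles for load instructions (1), (2), (5), (6).
    requests = [("(1)", 0), ("(2)", 16), ("(5)", 32), ("(6)", 48)]
    print(issue_times(requests, num_buffers=2, hold_cycles=60))
    print(issue_times(requests, num_buffers=4, hold_cycles=60))
```

With two buffers and these assumed timings, instruction (5) is held until the buffer taken by instruction (1) is released, whereas with four buffers every load is issued at its earliest cycle.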
Referring to FIG. 14, since the load data buffers 1110, 1120 and the store data buffers 1410, 1420 are configured in a fixed size, there is a possibility that many unused regions arise in the load data buffers 1110, 1120 and the store data buffers 1410, 1420 if the vector length is short.
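The amount of unused region can be estimated with simple arithmetic. The sketch below uses the example capacity assumed earlier (8 bytes × 64 words per buffer) and the example vector length of 16; it assumes that each vector element occupies one 8-byte word, and the helper name `unused_words` is hypothetical.

```python
WORD_BYTES = 8         # bytes per buffer word (example capacity from the text)
WORDS_PER_BUFFER = 64  # words per buffer (example capacity from the text)


def unused_words(vector_length: int) -> int:
    """Words left unused in one fixed-size buffer, assuming each
    vector element occupies exactly one word."""
    if not 0 <= vector_length <= WORDS_PER_BUFFER:
        raise ValueError("vector length exceeds buffer capacity")
    return WORDS_PER_BUFFER - vector_length


if __name__ == "__main__":
    vector_length = 16
    idle = unused_words(vector_length)
    print(f"unused words: {idle} of {WORDS_PER_BUFFER}")
    print(f"unused bytes: {idle * WORD_BYTES}")
    print(f"unused fraction: {idle / WORDS_PER_BUFFER:.2f}")
```

For a vector length of 16, this gives 48 unused words (384 bytes) per buffer, that is, three quarters of each buffer.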
As an example of such a conventional vector processing unit, European Patent Application No. 445,802-A2 describes a vector processing unit with a store buffer.
As described above, because the conventional vector processing unit has a fixed number and capacity of data buffers, it has the disadvantage that usage efficiency is degraded depending on the program configuration. That is, when the vector length is long, there is little impact on performance even if the number of data buffers is relatively small. However, if the vector length is short, the vector processor cannot be efficiently utilized unless a large number of data buffers are provided. Because the vector length depends on the program, a relatively large number of data buffers must be provided for the short vector length case in order to fully extract the performance of the conventional vector processing unit regardless of the vector length, which causes the problem that the amount of hardware increases.