This invention relates to a data processing unit. More particularly, it relates to a technology which enables a data processing unit to access a greater number of registers than the number of registers which can be accessed by instructions. Still more particularly, the present invention relates to a technology which suppresses the drop of performance due to data transfer from a main memory during so-called "vector processing", in which large-scale data, for which a cache is not very effective, is processed continuously, and which enables pseudo-vector processing to be accomplished by an ordinary data processing unit.
JP-A-57-166649 describes a technology for enabling a data processing unit to access a greater number of registers than the number of registers accessible by instructions. According to this reference, registers referred to as "hardware registers" are provided in a greater number than the general purpose registers accessible by instructions, and when a plurality of load instructions designating different main memory addresses are issued to the same general purpose register, the loaded data are stored in as many hardware registers as there are load instructions. When the number of general purpose registers accessible by a program is 16, for example, sixteen hardware registers are prepared for each general purpose register, or in other words, 256 hardware registers in total, and the hardware registers Nos. 0 to 15, for example, are allotted to the general purpose register No. 0. When load instructions designating 16 different main memory addresses are executed for the general purpose register No. 0, the data from the 16 load instructions are stored in the hardware registers Nos. 0 to 15. There is also provided a memory mechanism for registering the main memory addresses of the load instructions that have been executed in the past and the numbers of the hardware registers storing the data loaded at that time. When the main memory address of a load instruction issued by the program coincides with a main memory address registered in this memory mechanism, the data is read out not from the main memory but from the corresponding hardware register. According to this system, the number of lookups to the main memory can be reduced and the drop of performance due to collision of the lookup registers between the instructions can be prevented.
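The behavior described above can be illustrated by the following minimal sketch. It is not taken from the cited reference: the class name, the dictionary standing in for the main memory, and the round-robin reuse of a bank once all of its hardware registers are occupied are assumptions made only for illustration, and store instructions are not modeled.

```python
class HardwareRegisterFile:
    """Toy model: each general purpose register owns a bank of hardware registers."""

    def __init__(self, memory, num_gprs=16, bank_size=16):
        self.memory = memory                        # address -> value (stands in for main memory)
        self.bank_size = bank_size
        self.hw = [None] * (num_gprs * bank_size)   # 16 * 16 = 256 hardware registers by default
        self.addr_to_hw = {}                        # "memory mechanism": address -> hardware register No.
        self.next_slot = [0] * num_gprs             # next bank slot to fill, per general purpose register
        self.memory_reads = 0                       # number of lookups to the main memory

    def load(self, gpr, address):
        """Execute a load to general purpose register `gpr` from `address`."""
        hw_no = self.addr_to_hw.get(address)
        if hw_no is None:
            # Address not registered: read the main memory into the next
            # hardware register of the bank allotted to this register.
            slot = self.next_slot[gpr]
            hw_no = gpr * self.bank_size + slot
            # Assumption: when the bank is full, slots are reused round-robin.
            self.next_slot[gpr] = (slot + 1) % self.bank_size
            stale = [a for a, h in self.addr_to_hw.items() if h == hw_no]
            for a in stale:
                del self.addr_to_hw[a]
            self.hw[hw_no] = self.memory[address]
            self.memory_reads += 1
            self.addr_to_hw[address] = hw_no
        # Address registered: the data comes from the hardware register.
        return self.hw[hw_no]
```

In this sketch, a second load from an address that has already been loaded returns the value held in the hardware register and leaves `memory_reads` unchanged, which corresponds to the reduction in main memory lookups described above.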
Another known technology enables a data processing unit to access a greater number of registers than the number of registers accessible by instructions. Such a technology is described in J. L. Hennessy & D. A. Patterson, "Computer Architecture: A Quantitative Approach", pages 450-454, Morgan Kaufmann Publishers, Inc. (1990). According to this reference, registers called "physical registers" are provided in a greater number than the registers accessible by programs, and the physical registers are divided into a plurality of segments referred to as "windows". In other words, each window comprises a plurality of physical registers. The reference assumes that the registers accessible by the program are numbered from No. 1 to No. n and that the physical registers, n*m in number ("*" means multiplication), are numbered from No. 1 to No. n*m. When m windows, that is, windows Nos. 1 to m, are provided, the window No. 1 can be allotted to the physical registers Nos. 1 to n, and the window No. 2 to the physical registers Nos. n+1 to 2n, for example. Though physical registers common to all the windows, physical registers common to adjacent windows, etc., are provided in practice, this example is simplified. Each window holds the registers used by one program. In other words, to look up a register accessible by a certain program is, in practice, to look up a physical register belonging to a certain window. For instance, if the window 2 is allotted to a certain program in the example given above and the register k is designated by this program, the physical register to be looked up is the physical register n+k.
This window is used in the following way. When a program to which the window j is allotted calls another program, the window j+1 is allotted to the called program. When a program to which the window j is allotted returns to the program that called it, the window j-1 is allotted to the program returned to. The use of the window in this way provides the following effects. Namely, in a system having only the same number of registers as the number of registers accessible by the programs, the data stored in the registers must be stored in the main memory whenever a call such as described above occurs, so as to preserve the data at the time of the call, and the data stored in the main memory must be written back to the registers whenever a return occurs, so as to resume the calling program. In the system having the window mechanism described above, on the other hand, programs to which different windows are allotted look up different physical registers. For this reason, storage from the registers into the main memory and re-writing from the main memory to the registers become unnecessary, and processing can be sped up correspondingly.
In the system having such a window mechanism, however, control must be exercised so that a window overflow interrupt is generated when a call is made from the program having the greatest window number, and a window underflow interrupt is generated when a return is made from the program having the smallest window number.
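The window mechanism described above can be sketched as follows, under the same simplification as the example in the text (no physical registers shared among windows). The class and method names are hypothetical, and Python exceptions merely stand in for the window overflow and window underflow interrupts.

```python
class WindowOverflow(Exception):
    """Stands in for the window overflow interrupt."""


class WindowUnderflow(Exception):
    """Stands in for the window underflow interrupt."""


class RegisterWindows:
    """n program-visible registers, m windows, n*m physical registers (1-based)."""

    def __init__(self, n, m):
        self.n = n
        self.m = m
        self.window = 1          # window currently allotted to the running program

    def physical(self, k):
        """Physical register number looked up when the program designates register k."""
        if not 1 <= k <= self.n:
            raise ValueError("register number out of range")
        # Window 1 -> registers 1..n, window 2 -> n+1..2n, and so on.
        return (self.window - 1) * self.n + k

    def call(self):
        """A call allots window j+1 to the called program."""
        if self.window == self.m:
            raise WindowOverflow()
        self.window += 1

    def ret(self):
        """A return allots window j-1 to the program returned to."""
        if self.window == 1:
            raise WindowUnderflow()
        self.window -= 1
```

With window 2 allotted, `physical(k)` yields n+k, in agreement with the example given above; no register contents are saved or restored on `call` or `ret`, which is the source of the speed-up.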
The major part of scientific and technological calculations consists of vector calculations such as the following:

A(i) = B(i)*S   (i = 1, ..., N)   (1)

where A and B are vectors of an element number N and S is a scalar.
In the following description, the data width of the floating point registers is 8 bytes.
When the equation (1) is computed by a general purpose computer, a program such as shown in FIG. 6 can be obtained.
The function of each of the instructions shown in FIG. 6 will be explained below.
FLDM a(GRm), FRn
(Function) 8-byte data is read out from the main memory address indicated by the value of the general register m and is stored in the floating point register n. Thereafter, a is added to the value of the general register m.

FMLT FRj, FRm, FRn
(Function) The product of the value of the floating point register m and the value of the floating point register n is stored in the floating point register j.

FSTM a(GRm), FRn
(Function) The value (8 bytes) of the floating point register n is stored at the main memory address indicated by the value of the general register m. Thereafter, a is added to the value of the general register m.

BCNT GRm, t
(Function) 1 is subtracted from the value of GRm. If the result is not zero, the program branches to the address t. If it is zero, the program does not branch.
It will be assumed here that the vector B is stored in a continuous region starting from the main memory address ad1 before the execution of the program shown in FIG. 6. In other words, B(1) and B(2) are stored at the main memory addresses ad1 and ad1+8, respectively. It will also be assumed that the vector A is stored similarly in a continuous region starting from the main memory address ad3. Further, ad1, ad3 and N are assumed to be stored in advance in the general register 1, the general register 3 and the general register 4, respectively. S is assumed to be stored in advance in the floating point register 7.
As can be understood from FIG. 6, B(i) is loaded to the floating point register 8 by the FLDM instruction No. 1, the product of the value of this floating point register and the value of the floating point register 7 is stored in the floating point register 10 by the FMLT instruction No. 2, and the value of this floating point register is stored in A(i) by the FSTM instruction No. 3.
In other words, when a loop comprising the four instructions is executed once, the result for one element can be determined, and all the elements can be calculated by executing this loop N times.
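The loop just described may be sketched as follows. FIG. 6 is not reproduced in this text, so the instruction operands shown in the comments are reconstructed from the description above and should be read as assumptions; in particular, the address increment of 8 is inferred from the 8-byte data width. The main memory is modeled by a dictionary keyed by byte address, and the function name is hypothetical.

```python
def scalar_loop(memory, ad1, ad3, n, s):
    """Compute A(i) = B(i)*S one element per loop pass (n must be at least 1)."""
    gr = {1: ad1, 3: ad3, 4: n}   # general registers 1, 3 and 4
    fr = {7: s}                   # floating point register 7 holds S

    while True:
        fr[8] = memory[gr[1]]     # 1: FLDM 8(GR1), FR8   load B(i)
        gr[1] += 8                #    then add 8 to GR1
        fr[10] = fr[8] * fr[7]    # 2: FMLT FR10, FR8, FR7
        memory[gr[3]] = fr[10]    # 3: FSTM 8(GR3), FR10  store A(i)
        gr[3] += 8                #    then add 8 to GR3
        gr[4] -= 1                # 4: BCNT GR4, loop top
        if gr[4] == 0:            #    falls through when the count reaches zero
            break
    return memory
```

Each pass uses only the floating point registers 7, 8 and 10, and every instruction consumes the result of the instruction immediately preceding it.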
Here, the execution time of one loop becomes a problem. First, the data is loaded from the main memory to the floating point register by the FLDM instruction No. 1, and the FLDM instruction can be completed in a small number of cycles when the data exists in the cache. However, when the data does not exist in the cache, it must be read out from the main memory, which is considerably slower than the cache, and a longer time is necessary than when the data exists in the cache. Next, the FMLT instruction No. 2 uses the value of the floating point register 8. Therefore, its execution cannot be started until the load described above is completed. The FSTM instruction No. 3 uses the value of the floating point register 10, but the value of the floating point register 10 is not determined before the execution of the preceding FMLT instruction is completed. Accordingly, its execution cannot be started, either.
In other words, two factors that reduce performance, i.e. (1) the data read time and (2) collision of the registers, prolong the execution time of the loop. In particular, (1) the data read time is a critical problem in computations handling enormous amounts of data, because the necessary data cannot be fully stored in the cache and the drop of performance becomes greater.
One of the means for solving this problem is loop unrolling, which is shown in FIG. 7. This technique processes a plurality of elements (= n) in one loop and thereby reduces the number of times of looping to 1/n of that in the case where one element is processed by one loop. FIG. 7 shows a system which processes four elements by one loop.
It will be assumed that the vector B is stored in advance in a continuous region starting from the main memory address ad1 before the execution of the program shown in FIG. 7. In other words, B(1) is stored at the main memory address ad1 and B(2) is stored at the main memory address ad1+8. Similarly, the vector A is assumed to be stored in a continuous region starting from the main memory address ad3. It will also be assumed that ad1, ad3 and N/4 are stored in advance in the general register 1, the general register 3 and the general register 4, respectively. It will be assumed that S is stored in advance in the floating point register 7.
As can be understood from FIG. 7, when a loop comprising 13 instructions is executed once, the results for four elements can be determined, and when this loop is executed N/4 times, all the elements can be calculated.
As can also be understood from FIG. 7, load is effected by the FLDM instruction No. 1, multiplication by the FMLT instruction No. 5 and store by the FSTM instruction No. 9, for the ith element. Similarly, load is effected by the FLDM instruction No. 2, multiplication by the FMLT instruction No. 6 and store by the FSTM instruction No. 10, for the (i+1)th element. Similarly, load is effected by the FLDM instruction No. 3, multiplication by the FMLT instruction No. 7 and store by the FSTM instruction No. 11, for the (i+2)th element. Similarly, load is effected by the FLDM instruction No. 4, multiplication by the FMLT instruction No. 8 and store by the FSTM instruction No. 12, for the (i+3)th element. In comparison with FIG. 6, therefore, the load, multiplication and store for the element indicated by a certain element number are separated from one another in the instruction string, and the two major factors causing the drop of performance, that is, (1) the data read time and (2) collision of the registers, can be reduced. For example, load of B(i) is effected by the FLDM instruction No. 1, but it is only after four instructions that the load result is used. Accordingly, if the data read time is within four cycles, the FMLT instruction No. 5 using this load result is not brought into the waiting state. Further, it is only after four instructions that the result of the multiplication B(i)*S by the FMLT instruction No. 5 is used. Accordingly, if the time necessary for multiplication is within four cycles, the FSTM instruction No. 9 is not brought into the waiting state.
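The effect of separating dependent instructions can be checked with a small timing model, given here only as a sketch. It assumes an in-order machine that issues at most one instruction per cycle and holds an instruction until its source registers are ready; updates of the general registers and overlap between successive loop passes are ignored. Since FIG. 7 is not reproduced in this text, the floating point register numbers in the unrolled program (8 to 11 for the loads, 12 to 15 for the products) are hypothetical.

```python
def loop_cycles(program, latency):
    """Cycles needed to issue one pass of `program`.

    program: list of (opcode, destination register or None, source registers).
    latency: opcode -> cycles from issue until the destination can be used.
    """
    ready = {}     # register -> first cycle at which its new value is usable
    cycle = 0      # earliest cycle at which the next instruction may issue
    for op, dest, sources in program:
        # Wait for every source register; registers never written are ready at 0.
        issue = max([cycle] + [ready.get(r, 0) for r in sources])
        if dest is not None:
            ready[dest] = issue + latency.get(op, 1)
        cycle = issue + 1
    return cycle


# One element per pass, as described for FIG. 6.
ROLLED = [
    ("FLDM", "FR8", []),
    ("FMLT", "FR10", ["FR8", "FR7"]),
    ("FSTM", None, ["FR10"]),
    ("BCNT", None, []),
]

# Four elements per pass, as described for FIG. 7 (register numbers assumed).
UNROLLED = (
    [("FLDM", f"FR{8 + i}", []) for i in range(4)]
    + [("FMLT", f"FR{12 + i}", [f"FR{8 + i}", "FR7"]) for i in range(4)]
    + [("FSTM", None, [f"FR{12 + i}"]) for i in range(4)]
    + [("BCNT", None, [])]
)

LATENCY = {"FLDM": 4, "FMLT": 4}   # assumed: both results usable after four cycles
```

Under these assumptions, `loop_cycles(ROLLED, LATENCY)` gives 10 cycles for one element, whereas `loop_cycles(UNROLLED, LATENCY)` gives 13 cycles for four elements, that is, the 13 instructions issue without any waiting state.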
Though loop unrolling can improve system performance, it has the drawback that a large number of registers are necessary. Whereas the program shown in FIG. 6 needs three floating point registers, the program shown in FIG. 7 needs nine floating point registers. If the data read time is much longer or if the calculation time is longer, a greater number of elements must be processed in one loop, so that a greater number of registers become necessary.
Generally, a register comprises an active device (that is, a device which is not a memory device) and can provide a large number of read/write ports (that is, data input/output ports). Therefore, the register has an operation speed far higher than that of a so-called "memory device", which can read or write only one data item in one operation cycle. To improve the operation speed, therefore, a system should include registers having a sufficient capacity in comparison with not only the main memory but also the cache. Nonetheless, the number of registers is relatively small in conventional systems because the cost per bit is high and the length of the register number field in the instruction format is limited, as illustrated below. Although the problem of cost has now been solved to some extent by LSIs, the latter problem has yet to be solved.
The number of registers which can be accessed by a program is limited by the architecture. For example, if the register designation field in an instruction word has five bits, the number of registers which can be addressed is 32 (2^5). The number of registers accessible by the program can be increased by increasing the number of bits of the register designation field, but this is unrealistic because the instruction format changes and hence the existing programs must be changed.
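The interaction between the register designation field and loop unrolling can be put in numbers as follows. This is a sketch resting on an inference rather than on a statement in the text: the register counts of three for one element per loop and nine for four elements per loop are consistent with one register for S plus one load register and one product register per element, i.e. 2n+1 registers for n elements per loop. The function names are hypothetical.

```python
def addressable_registers(field_bits):
    """Registers that a register designation field of `field_bits` bits can name."""
    return 2 ** field_bits


def fp_registers_needed(elements_per_loop):
    """Assumed register demand of the unrolled loop: S + n loads + n products."""
    return 2 * elements_per_loop + 1


def max_elements_per_loop(field_bits):
    """Largest n whose assumed demand of 2n+1 registers still fits the field."""
    return (addressable_registers(field_bits) - 1) // 2
```

On this assumption, a five-bit field allows at most 15 elements per loop even if every addressable floating point register is devoted to the loop, so a data read time that requires a greater separation between load and use cannot be hidden by unrolling alone.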
Therefore, a system has become necessary which enables a data processing unit to access a greater number of registers than the number of registers accessible by an instruction, without changing the architecture of the data processing unit. According to the first prior art technology described above, the operation speed can be improved when a load instruction is executed again for a main memory address for which a load instruction has been executed in the past. However, in most of the vector calculations expressed by the equation (1), a load request for each data item in the main memory is issued only once, as in the program shown in FIG. 6. In other words, the first prior art technology cannot improve the operation speed in this case.
According to the second prior art technology described above, it is only the physical registers belonging to a certain window that can be used by one program, and the number of such registers is equal to the number of registers that can be accessed by the program. Therefore, an operation executed by one program cannot be sped up. In other words, the aforementioned window function can improve the processing speed only when call and return of programs occur, but cannot improve the processing speed when processing is completed within one program, as in the vector calculation of the equation (1). Another problem is that the interrupts of window overflow and window underflow serve no purpose when processing is completed within one program and call and return of programs do not occur, as in the vector calculation of the equation (1).