In the field of microprocessors such as CPUs (Central Processing Units), there has recently been active research on speeding up computation. Examples of techniques for the speeding up include pipelining, superscalar execution, out-of-order execution, and register renaming.
Pipelining is a technique in which execution of an instruction is divided into a plurality of stages and instructions are executed concurrently, as on an assembly line. Superscalar is a technique in which two or more circuits execute instructions in parallel. Out-of-order execution is a technique in which executable instructions are picked out of a sequence of instructions and executed irrespective of the prescribed order of instructions. Register renaming is a technique in which, in a CISC (Complex Instruction Set Computer) processor, for instance, the opportunity for parallelism is increased by increasing the number of general registers, while the compatibility of instructions with conventional processors is maintained.
As described above, parallel execution of instructions is important for speeding up computation in microprocessors. Nonetheless, programs typically include dependency relations in which whether an instruction is executed depends on the result of another instruction, i.e. they typically include branches. If such a branch is included, the result of an instruction executed ahead of time in parallel may be nullified as a result of the branch. This diminishes the effect of the speeding up.
To solve this problem, there has been various research on techniques to predict the result of a branch so as to reduce the probability of nullifying the results of instructions executed ahead of time and to improve the speedup. Such techniques are termed branch prediction.
In a case where speculative instruction execution is carried out based on the branch prediction, however, the following problems typically occur: firstly, since it is necessary to verify the validity of a prediction at all times, execution times for a preceding sequence of instructions are not shortened; secondly, since it is necessary to nullify all results of preceding computations based on an erroneous prediction, sizable hardware costs are required for increasing the number of instructions to be subjected at once to speculative processing; and thirdly, an increase in the number of dependency relations among instructions requires multiple speculative processing, causing the verification of the validity of a prediction and the nullification of computation based on an erroneous prediction to be enormously complicated.
Aside from the branch prediction, there has been proposed a speeding up technique termed value reuse. This value reuse is arranged such that, an input value and an output value regarding a part of a program are registered in a reuse table, and when the same part is executed again, the registered output value is output if the input value is identical with the input value registered in the reuse table. The value reuse is advantageous in the following points: (1) if the input value is identical with the input value registered in the reuse table, it is unnecessary to verify the execution result; (2) since hardware costs are determined only in accordance with the total number of input and output values, the lengths of omissible sequences of instructions are not limited; (3) the number of dependency relations among instructions is unrelated to the complexity of the reuse mechanism; and (4) redundant load/store instructions are eliminated, and power consumption is reduced accordingly.
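The value reuse described above can be illustrated in software. The following is only a minimal sketch, assuming a reuse table keyed by the tuple of input values; the names ReuseTable, run_with_reuse, and region are hypothetical and do not appear in the conventional art, which realizes the table in hardware.

```python
# Hypothetical software sketch of value reuse for one program region.
# ReuseTable, run_with_reuse and region are illustrative names only.

class ReuseTable:
    def __init__(self):
        self.entries = {}  # maps a tuple of input values to the recorded output

    def lookup(self, inputs):
        return self.entries.get(tuple(inputs))

    def register(self, inputs, output):
        self.entries[tuple(inputs)] = output


def run_with_reuse(table, region, inputs):
    """Return (output, reused). The region is executed only on a table miss."""
    hit = table.lookup(inputs)
    if hit is not None:
        return hit, True            # inputs match: output the registered value
    output = region(*inputs)        # miss: execute, then register the pair
    table.register(inputs, output)
    return output, False


calls = []  # records every real execution of the region

def region(a, b):
    calls.append((a, b))
    return a * b + 1

table = ReuseTable()
first = run_with_reuse(table, region, (3, 4))   # executed and registered
second = run_with_reuse(table, region, (3, 4))  # served from the table
```

In this sketch the second call does not execute the region at all, which corresponds to advantage (1): a matching input makes verification of the execution result unnecessary.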
Non-Patent Document (“Speedup Technique with Function Level Value Reuse and Parallel Precomputation”, Yasuhiko Nakashima, Katsuya Ogata, Shingo Masanishi, Masahiro Goshima, Shin-ichiro Mori, Toshiaki Kitamura and Shinji Tomita, Information Processing Society of Japan journal: High-Performance Computing System, HPS5, pp. 1-12, September (2002), published on Sep. 15, 2002) discloses a technique in which the value reuse is applied to functions in a program. This conventional art takes advantage of the fact that a load module is typically generated based on an ABI (Application Binary Interface), especially based on the SPARC (Scalable Processor ARChitecture) ABI. The value reuse is achieved by specifying inputs and outputs of functions, based on the ABI. That is, the value reuse does not require a compiler to embed dedicated instructions, and hence this conventional art can be applied to conventional load modules.
Also, since the nesting structure of functions is grasped dynamically, a local variable on an in-function local register or a stack is excluded from input/output values to be reused. This improves efficiency. As to a function, in particular, up to six register inputs and four register outputs are available, and reuse and precomputation by registering minimum main storage values exclusive of a local variable are feasible, no matter how complicated the function is. The following describes this conventional art in detail.
First, a mechanism for clarifying, as to one function, what is input and what is output and for performing one-level reuse is discussed. In a program, functions typically have a multiple structure. FIG. 46(a) shows how a Function-A calls a Function-B.
Globals may be used as input/output (Ain/Aout) of the Function-A and/or input/output (Bin/Bout) of the Function-B. A local variable (Locals-A) cannot serve as input/output of the Function-A, but can serve as input/output of the Function-B on account of a pointer. An argument (args) from the Function-A to the Function-B may serve as an input to the Function-B. A return value (Ret.Val.) from the Function-B to the Function-A may serve as an output from the Function-B. It is noted that a local variable (Locals-B) of the Function-B is not included in the input/output of the Function-A and Function-B.
To reuse the Function-B without depending on the context, it is necessary to register, as input/output, only Bin/Bout of the Function-B, at the time of executing the Function-B. In relation to this, FIG. 46(b) shows a memory map of the main memory at the time of executing the program structure shown in FIG. 46(a). In this memory map, Locals-B is the only area where the Bin/Bout is not included. Therefore, to identify the Bin/Bout, it is necessary to specify (i) the border between Globals and Locals-B and (ii) the border between Locals-B and Locals-A. As to the former border, since an OS (Operating System) typically determines the upper limits of a data size and stack size during execution, the border between Globals and Locals-B is determined based on the limit (LIMIT) determined by the OS. As to the latter border, it is possible to determine the border between Locals-A and Locals-B by using a value (SP in A) of a stack pointer immediately before the call of the Function-B.
Now, the following describes a method for identifying (i) whether a given main memory address is a global variable or a local variable, and (ii) if the address is a local variable, to which function the local variable belongs. The load module is assumed to satisfy the following conditions (1)-(3) defined by SPARC ABI. It is noted that % fp indicates a frame pointer, while % sp indicates a stack pointer.
(1) In the area not less than % sp, % sp+0 to 63 is a register save area, and % sp+68 to 91 is an argument save area. Neither of these areas is input/output of a function.
(2) An implicit argument (Implicit Arg.) in a case where a structure is output is stored in % sp+64 to 67.
(3) An explicit argument (Explicit Arg.) is placed on % sp+92 or higher.
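Conditions (1)-(3) amount to a fixed partition of byte offsets from % sp. The following is a minimal sketch of that partition; the function names classify_sp_offset and is_function_io are hypothetical, and the sketch merely restates the offsets given above.

```python
# Hypothetical helper restating conditions (1)-(3): which part of the frame
# a byte offset from %sp falls into. Names are illustrative only.

def classify_sp_offset(offset):
    if offset < 0:
        raise ValueError("only the area not less than %sp is described")
    if offset <= 63:
        return "register save area"   # %sp+0 to 63, condition (1)
    if offset <= 67:
        return "implicit argument"    # %sp+64 to 67, condition (2)
    if offset <= 91:
        return "argument save area"   # %sp+68 to 91, condition (1)
    return "explicit argument"        # %sp+92 or higher, condition (3)


def is_function_io(offset):
    """The two save areas are never input/output of a function."""
    return classify_sp_offset(offset) not in ("register save area",
                                              "argument save area")
```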
To distinguish global variables from local variables, the following conditions are set, for the reason that an OS typically determines the upper limits of a data size and stack size during execution.
(1) A global variable is placed in an area of less than LIMIT.
(2) Since % sp is not lower than LIMIT, an area of LIMIT to % sp is invalid.
FIG. 47 outlines arguments and frames in a memory map, in a case where the Function-A calls the Function-B while the conditions above are satisfied. Referring to this figure, the following describes a method of identifying local variables of the Function-A and local variables of the Function-B.
In the figure, indicated by (a) is a state during the execution of the Function-A. An area less than LIMIT, which is circumscribed by thick lines, stores Instructions and Global Vars, and an area of not less than % sp stores valid values. % sp+64 stores the leading address of the structure, as an implicit argument in a case where the Function-B outputs the structure. The leading six words of an explicit argument for the Function-B are stored in the registers % o0 to 5, while the seventh word and the following words are stored in an area of not less than % sp+92. If an operand % sp+92 with the base register % sp appears, the area is the seventh word of the argument, i.e. a local variable of the Function-B. Meanwhile, if the operand % sp+92 does not appear, the area is a local variable of the Function-A. In this manner, during the state (a), the local variable of the Function-A is distinguished from the local variable of the Function-B, by checking the operand.
On the other hand, (b) indicates a state where the Function-B is executed. An argument may be an input, a return value may be an output, and a global variable and a local variable of the Function-A may be input/output. However, since the Function-B may accept a variable argument, basically it is not possible to determine whether an area of not less than % fp+92 is an area for a local variable of the Function-A or an area for a local variable of the Function-B.
To distinguish local variables, first, in the state (a), a function call in which the seventh word and the following words of the argument are detected is not the target of reuse, and as to a function call in which the seventh word and the following words are not detected, a value % sp+92 is recorded immediately before the call. Note that, since function calls that use the seventh word and the following words are assumed not to appear frequently, the performance deterioration due to excluding such function calls can be considered almost negligible.
Because of the above, the main storage reference address in the state (b) is found to be either: a local variable of the Function-A if the address is not lower than the % sp+92 which has been recorded in advance; or a local variable of the Function-B if the address is lower than the % sp+92. In a case where the Function-B is executed, a local variable of the Function-A and a global variable are registered to the reuse table, while a local variable of the Function-B is excluded therefrom.
Since a local variable of the Function-B is excluded from the input/output at the time of the reuse, the address of a local variable of the Function-B is not required to correspond to the table. On this account, being independent of the context, it is possible to carry out the reuse if the inputs correspond to the table. Note that, however, as to a global variable to which the Function-B refers and a local variable of the Function-A, both the address and data must completely match with the content of the reuse table. That is, how main memory addresses to be compared are grasped before the execution of the Function-B is important.
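The address classification during the execution of the Function-B can be sketched as follows. This is an illustration only, assuming that LIMIT, the current % sp of the Function-B, and the % sp+92 recorded immediately before the call are all available as integers; the function names and the concrete addresses are hypothetical.

```python
# Hypothetical sketch of classifying a main memory address while Function-B
# runs. limit is the OS-determined LIMIT, sp_in_b is Function-B's %sp, and
# recorded_sp_plus_92 is the value of %sp+92 recorded before the call.

def classify_address(addr, limit, sp_in_b, recorded_sp_plus_92):
    if addr < limit:
        return "global"        # global variables lie below LIMIT
    if addr < sp_in_b:
        return "invalid"       # the area from LIMIT to %sp holds no valid data
    if addr >= recorded_sp_plus_92:
        return "local-A"       # not lower than the recorded %sp+92
    return "local-B"           # lower than the recorded %sp+92


def is_registered(kind):
    """Globals and locals of Function-A are registered; locals of B are not."""
    return kind in ("global", "local-A")


LIMIT = 0x10000000             # illustrative values, not taken from the source
SP_IN_A = 0x7FFF0100
SP_IN_B = 0x7FFF0000
RECORDED = SP_IN_A + 92

kinds = [classify_address(a, LIMIT, SP_IN_B, RECORDED)
         for a in (0x00002000, 0x50000000, 0x7FFF0040, RECORDED, 0x7FFF0200)]
```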
Addresses of a global variable to which the Function-B refers and of a local variable of the Function-A are based on (i) an address constant generated by the Function-B and (ii) a pointer having its roots in a global variable/argument. Therefore, after an entry in the reuse table, which entry has a completely matched argument, is selected, all of the related main memory addresses are referred to and equal comparison is performed. As a result, main memory addresses to which the Function-B refers are found. Only in a case where all of the inputs match with the table, registered outputs (return value, global variable, and local variable of the Function-A) are reusable.
To achieve the function reuse, a function management table (RF) and an input/output recording table (RB) are provided as the reuse table. FIG. 48 shows a hardware configuration required for the reuse of one function. To reuse a plurality of functions, the same number of the configurations are required.
In the table, V stored in the RF and RB is a flag that indicates whether or not an entry is valid. LRU (Least Recently Used) is a hint for the replacement of an entry. Apart from V and LRU, the RF stores a leading address (Start) of the function and a main memory address (Read/Write) to be referred to. Apart from V and LRU, the RB stores % sp (SP) immediately before a function call, an argument (Args.) (V: valid entry, Val: value), a main memory value (Mask: valid bytes of Read/Write address, Value: value), and a return value (Return Values) (V: valid entry, Val: value).
Assume that the return value is stored in % i 0 to 1 (% o 0 to 1 in terms of leaf function) or in % f 0 to 1, and a return value (double-extended precision floating-point number) using % f 2 to 3 does not exist in the target program. Read addresses are collectively managed by the RF, and Mask and Value are managed by the RB. With this, the Read addresses and a plurality of entries in the RB are compared to the table at once, by a CAM (Content-Addressable Memory).
To reuse one function, first, at the time of executing the function, input/output information regarding arguments, return values, global variables, and local variables of upper functions is registered to the reuse table, while local variables of the function itself are excluded from the registration. A value of an argument register that is read before being written is registered as an input of the function, while a value written into a return value register is registered as an output of the function. Values of other registers need not be registered. In a similar manner, as to a reference to the main memory, a value at an address that is read before being written is registered as an input, while a write is registered as an output.
Entries registered in the input/output table are enabled at the time of executing the return instruction, if a disturbance does not occur. Examples of the disturbance include (i) the next function is called before returning from the present function, (ii) inputs/outputs to be registered exceed the capacity of the reuse table, (iii) the seventh word of an argument is detected, and (iv) a system call or an interruption occurs midway.
Referring to FIG. 48, the following describes how omission of the execution of a function is carried out: before the call of the function, (1) a leading address of the function is looked for; (2) an entry which has a completely matched argument is selected; (3) all of related main memory addresses, i.e. Read addresses each having at least one enabled Mask, are referred to; and (4) equal comparison is performed. If all of the inputs match with the entry, (5) registered outputs (return value, global variable, and local variable of the Function-A) are written in.
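Steps (1) to (5) above can be sketched in software as follows. This is a minimal sketch under stated assumptions: the RF and the RB entries are modeled as dictionaries, main memory as a dictionary of 32-bit words, and the masks as 32-bit integers with FF for each valid byte. The function name try_reuse and all field names are hypothetical; the conventional art performs the comparison in hardware with a CAM.

```python
# Hypothetical software model of the reuse test for one function.

def try_reuse(rf, rb_entries, func_addr, args, memory):
    """Return the registered outputs on a hit, or None if the function must run."""
    if not rf["valid"] or rf["start"] != func_addr:        # step (1)
        return None
    for entry in rb_entries:
        if not entry["valid"] or entry["args"] != args:    # step (2)
            continue
        columns = zip(rf["read_addrs"], entry["masks"], entry["values"])
        # Steps (3)-(4): only addresses with an enabled mask are read, and
        # only the valid bytes are compared.
        if all(mask == 0 or (memory[addr] & mask) == (value & mask)
               for addr, mask, value in columns):
            return entry["outputs"]                        # step (5)
    return None


rf = {"valid": True, "start": 0x1000, "read_addrs": [0x2000, 0x2004]}
rb_entries = [{
    "valid": True,
    "args": (7,),
    "masks": [0xFF00FF00, 0x00000000],    # second address unused by this entry
    "values": [0x02002200, 0x00000000],
    "outputs": {"return": 0x44},
}]
memory = {0x2000: 0x02AB22CD, 0x2004: 0xDEADBEEF}

hit = try_reuse(rf, rb_entries, 0x1000, (7,), memory)
memory[0x2000] = 0x03AB22CD               # a valid input byte changes
miss = try_reuse(rf, rb_entries, 0x1000, (7,), memory)
```

In the sketch, bytes outside the mask ("Don't Care" bytes) may differ without preventing the reuse, whereas a change in any valid byte causes a miss.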
An example of an instruction region is discussed. In the example, an instruction region shown in FIG. 49 is executed with the RF and RB arranged as shown in FIG. 48. In FIG. 49, PC indicates a PC value at the start of the instruction region. That is, the address of the start of the instruction region is 1000. FIG. 50 briefly shows an input address, input data, output address, and output data, which are registered in the RB, in a case where the instruction region shown in FIG. 49 is executed. FIG. 51 shows how the registration to the RB is actually carried out.
A first-row instruction (hereinafter first instruction; other instructions are also abbreviated in the same manner) causes a register R0 to be set to an address constant A1. A second instruction causes a register R1 to store 4-byte data (00110000) which is loaded from the main memory at the address held in the register R0. In this case, the address A1, mask (FFFFFFFF) (in the mask, F indicates a valid byte while 0 indicates an invalid byte), and data (00110000) are registered, as inputs, in the first column on the Input side of the RB. Meanwhile, the register number R1, mask (FFFFFFFF), and data (00110000) are registered, as outputs, in the Output-side first column of the RB.
A third instruction causes the register R0 to be set to an address constant A2. A fourth instruction causes a register R2 to store one-byte data (02) which is loaded from the main memory at the address held in the register R0. In this case, the address A2, mask (FF000000), and data (02) are, as inputs, registered in the Input-side second column of the RB. On this occasion, the remaining 3 bytes of the word at the address A2 are “-” which indicates “Don't Care”. The register number R2, mask (FFFFFFFF), and data (00000002) are, as outputs, registered in the Output-side second column of the RB.
A fifth instruction causes the register R2 to store one-byte data (22) loaded from an address (A2+R2). Since the register R2 has a value (02), the address (A2+02) and data (22) are additionally registered, as inputs, in the Input-side second column of the RB. On this occasion, the registration is carried out in a part corresponding to the address (A2+02), while parts corresponding to the addresses (A2+01) and (A2+03), respectively, are kept at “-” which indicates “Don't Care”. Therefore, the mask corresponding to the address A2 is (FF00FF00). The register number R2, mask (FFFFFFFF), and data (00000022) are, as outputs, overwritten into the Output-side second column of the RB.
A sixth instruction causes the register R0 to be set to an address constant A3. A seventh instruction causes a register R3 to store one-byte data (33) which is loaded from the main memory at the address held in the register R0. In this case, the address A3, mask (00FF0000), and data (33) are, as inputs, registered in the Input-side third column of the RB. The register number R3, mask (FFFFFFFF), and data (00000033) are, as outputs, registered in the Output-side third column of the RB.
An eighth instruction causes a register R4 to store one-byte data (44) loaded from an address (R1+R2). In this case, since the registers R1 and R2 have been overwritten within the instruction region, the registers R1 and R2 are not inputs of the instruction region. Meanwhile, an address A4 generated by the address (R1+R2) is an input of the instruction region. Therefore, the address A4, mask (00FF0000), and data (44) are registered, as inputs, in the Input-side fourth column of the RB. The register number R4, mask (FFFFFFFF), and data (00000044) are, as outputs, registered in the Output-side fourth column of the RB.
By a ninth instruction, a value is read out from a register R5, and a result of adding one to the value is stored in the register R5. In this case, the register R5, mask (FFFFFFFF), and data (00000100) are, as inputs, registered in the Input-side fifth column of the RB. Meanwhile, the register number R5, mask (FFFFFFFF), and data (00000101) are, as outputs, registered in the Output-side fifth column of the RB.
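The way the mask of the Input-side second column grows from (FF000000) to (FF00FF00) can be sketched as follows. This is an illustration only, assuming that one column covers a 4-byte word and that byte offset 0 corresponds to the leftmost pair of mask digits, as the masks quoted above suggest; the function name register_input_byte and the dictionary layout are hypothetical.

```python
# Hypothetical model of one Input-side column of the RB covering one 4-byte
# word. Each registered byte enables its mask byte and stores its value.

def register_input_byte(column, offset, byte):
    shift = 8 * (3 - offset)          # offset 0 maps to the leftmost mask byte
    if column["mask"] & (0xFF << shift):
        return                        # already registered: the first read wins
    column["mask"] |= 0xFF << shift
    column["data"] |= byte << shift


column_a2 = {"mask": 0, "data": 0}

register_input_byte(column_a2, 0, 0x02)            # fourth instruction: byte at A2
mask_after_fourth = format(column_a2["mask"], "08X")

register_input_byte(column_a2, 2, 0x22)            # fifth instruction: byte at A2+02
mask_after_fifth = format(column_a2["mask"], "08X")
```

The bytes at offsets 1 and 3 are never registered in this sketch, which corresponds to the “Don't Care” positions of the column.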
As described above, the following processes are carried out to perform the readout from the memory/register at the time of executing the instruction.
(1) The Output side of the RB is searched. If the address/register number thus read out has already been registered, the process terminates while the address/register number is not registered on the Input side.
(2) If the address/register number thus read out is not found in the Output side of the RB, the Input side of the RB is searched. If the address/register number thus read out has already been registered, the process terminates while the address/register number is not registered.
(3) If the address/register number thus read out is not found on the Input side of the RB, a new entry is added to the RB, and the address/register number and the value thus read out are registered.
For the writing into the memory/register at the time of executing the instruction, the following processes are carried out.
(1) The Output side of the RB is searched. If the address/register number to be written has already been registered, the registered value is updated and the process terminates.
(2) If the address/register number to be written is not found on the Output side of the RB, a new entry is added, and the address/register number and the value to be written are registered.
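The read and write rules above can be sketched as follows. This is a minimal sketch, assuming that each register or memory location is registered as a whole (the byte masks are omitted for brevity); the class name RBEntry and the method names on_read and on_write are hypothetical.

```python
# Hypothetical model of the registration rules for one RB entry.
# Keys are register names or memory addresses.

class RBEntry:
    def __init__(self):
        self.inputs = {}
        self.outputs = {}

    def on_read(self, key, value):
        if key in self.outputs:      # read rule (1): produced inside the region
            return
        if key in self.inputs:       # read rule (2): already an input
            return
        self.inputs[key] = value     # read rule (3): new input

    def on_write(self, key, value):
        # Write rules (1) and (2): update an existing output or add a new one.
        self.outputs[key] = value


rb = RBEntry()
rb.on_read("A1", 0x00110000)     # second instruction loads from address A1
rb.on_write("R1", 0x00110000)
rb.on_write("R2", 0x02)          # fourth instruction
rb.on_write("R2", 0x22)          # fifth instruction overwrites R2
rb.on_read("R1", 0x00110000)     # eighth instruction reads R1 and R2, which
rb.on_read("R2", 0x22)           # were written in the region: not inputs
rb.on_read("R5", 0x100)          # ninth instruction reads R5, then writes it
rb.on_write("R5", 0x101)
```

Because the Output side is searched first on a read, registers that were overwritten earlier in the region, such as R1 and R2 at the eighth instruction, never appear as inputs.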
Patent Document (Japanese Laid-Open Patent Application No. 2004-258905 (Published on Sep. 16, 2004)) discloses a technique to perform parallel precomputation by using a plurality of processors, in the aforesaid arrangement for reuse. The document also discloses, as a technique to predict inputs in the parallel precomputation, such an arrangement that a stride prediction is carried out based on a difference between (i) the last-appeared argument and (ii) a pair of recently-appeared arguments.
Performing the above-described prediction makes it possible to carry out the reuse effectively, based on a result obtained in advance, in a case where input parameters change monotonically in a continuous fashion.
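A stride prediction of this kind can be sketched as follows. The sketch assumes the simplest reading, in which the next input is the last-appeared value plus the difference between the two most recent values; the Patent Document may define the prediction differently, and the name predict_next is hypothetical.

```python
# Hypothetical stride predictor: next value = last value + most recent difference.

def predict_next(history):
    if len(history) < 2:
        return None                  # no stride can be formed yet
    stride = history[-1] - history[-2]
    return history[-1] + stride


predicted = predict_next([2, 3, 4])  # an argument that increases by one per call
```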
According to the conventional art, however, in the RB, entries must be registered as different entries, if the content of at least one item of each entry is different. For this reason, the memory is not efficiently used in the RB.
Also, the reuse cannot be performed if at least one input pattern of the function to be executed is different from an input pattern in each entry of the RB.
FIG. 52 shows an example of histories registered on the input side of the RB, in a case where the instruction region shown in FIG. 49 is repeatedly executed. In this example, the instruction region is executed each time Time shifts to the next, from 1 to 4. Each time the instruction region is executed, the content of the address A2 changes to (02), (03), (04), and (05). In accordance with these changes, values of other input items also change.
Indicated by “diff” between the neighboring histories is a variation of a corresponding input item. The aforesaid conventional input prediction is carried out based on the diff. FIG. 53 shows a result of a prediction based on the conventional input prediction.
For example, the content of a monotonically changing address (the address A2 in the aforesaid example), e.g. a loop control variable, is correctly predicted. However, if the instruction region includes an array element, generally the value of the array element does not always change monotonically even if the subscript of the array changes monotonically. In the example shown in FIG. 52, a value loaded from the address A2 is the subscript of the array. When a reference to the main memory uses the subscript as an address, this address changes and hence the number of the input items registered as history also changes. In this case the changes in one column are not orderly, so that the precision of the prediction significantly deteriorates, as the column corresponding to the address A3 in FIG. 53 shows.
In an input prediction, predicting a value for an address whose content does not change is a waste of hardware resources. In a case where a change of a value has no regularity, a prediction has to be carried out with the assumption that the difference is 0. Such a forced prediction, however, may further decrease the precision of the prediction. In the example shown in FIG. 53, the position of a mask must be predicted, regarding the address corresponding to A2+4. However, it is difficult to predict the change in the position of the mask. In such a case, direct reference to a main memory value is preferable to the prediction.
The problems above occur because all of the registered addresses are dealt with uniformly.
The present invention was made to solve the problems above, and the objective of the present invention is to provide a data processing device which can register, in instruction region storage means, an input/output group that is appropriate for reuse.
Also, the present invention was made to solve the problems above, and the objective of the present invention is to provide a data processing device which has a relatively simple structure but can register, in instruction region storage means, an input/output group suitable for reuse.
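The contrast between a monotonically changing subscript and the array element it selects can be sketched as follows. The subscript values (02) to (05) follow the example above, whereas the element values are purely hypothetical and are not taken from FIG. 52; the function names are likewise illustrative.

```python
# Hypothetical illustration: per-column differences between neighboring
# histories, for a subscript column and an array-element column.

def diffs(column):
    return [later - earlier for earlier, later in zip(column, column[1:])]


def has_constant_stride(column):
    """True when every neighboring pair differs by the same amount."""
    return len(set(diffs(column))) == 1


subscript = [0x02, 0x03, 0x04, 0x05]   # content of address A2 at Time 1 to 4
element = [0x22, 0x07, 0x51, 0x13]     # assumed array contents, irregular

subscript_diffs = diffs(subscript)
element_diffs = diffs(element)
```

In this sketch the subscript column has a constant difference and is predictable by a stride, while the element column has no constant difference, so a prediction based on the diff would be unreliable for it.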