1. Field of the Invention
The present invention relates to a data providing unit for a processor and to a processor having the data providing unit. More particularly, the present invention relates to a technology for improving the processing efficiency of a processor by processing load instructions at high speed in a data providing unit that provides the processor with the data to be read by a load instruction.
2. Description of the Related Art
The technology for improving the processing efficiency of a processor by using pipeline processing has been put to practical use. "Pipeline processing" can be defined as a scheme in which a plurality of instructions are executed concurrently by shifting their processing stages sequentially by one cycle (i.e., one pipeline pitch).
FIG. 1 shows the respective stages of a standard five-stage pipeline in a RISC (Reduced Instruction Set Computer) type processor. As disclosed in the literature, e.g., "Computer Architecture" (Hennessy et al.; Morgan Kaufmann Publishers, Inc.), this type of pipeline is employed in very basic processors.
In this pipeline process, one arithmetic instruction is divided into five stages and then executed. As shown in FIG. 1, these five stages are instruction fetch (IF) stage, instruction decode (ID) stage, execution (EX) stage, memory access (MA) stage, and write back (WB) stage. In the IF stage, an instruction is fetched from an instruction memory. In the ID stage, the instruction is interpreted, while an operand necessary for execution is fetched by accessing a register file. In the EX stage, an arithmetic operation is executed. In this case, when instructions (load instruction, store instruction, etc.) for accessing a data memory are executed, a data address is calculated in the EX stage. In the MA stage, the data memory is accessed and then data are fetched from the data memory by using the address which is calculated in the EX stage. In the WB stage, executed results and data read from the data memory are written back into a register file.
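The overlap of the five stages described above can be illustrated with a minimal sketch. The function names `schedule` and `total_cycles` are hypothetical and are introduced only for illustration; the sketch assumes an ideal pipeline with no stalls, in which one instruction enters the IF stage in every cycle and cycles are counted from zero.

```python
STAGES = ["IF", "ID", "EX", "MA", "WB"]

def schedule(num_instructions):
    """Map each instruction to the cycle in which it occupies each stage.

    Instruction i enters IF in cycle i, so every stage of instruction i
    is shifted by one cycle (one pipeline pitch) relative to instruction i - 1.
    """
    return [
        {stage: i + k for k, stage in enumerate(STAGES)}
        for i in range(num_instructions)
    ]

def total_cycles(num_instructions):
    """Cycles needed to complete all instructions when nothing stalls."""
    if num_instructions == 0:
        return 0
    return num_instructions + len(STAGES) - 1

# Print the stage timing of three back-to-back instructions.
for i, timing in enumerate(schedule(3)):
    print(i, timing)
```

Under these assumptions, three instructions complete in seven cycles rather than the fifteen that strictly sequential execution of five stages each would require.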
Next, an operation in the pipeline process when the load instruction is to be executed will be explained hereunder. For easy understanding, the operation in the pipeline process will be explained by using an example of a simple scalar processor which can execute only one instruction at a time.
FIGS. 2A and 2B show the behavior of the pipeline process when instructions are processed successively. Arrows in FIGS. 2A and 2B indicate bypasses of arithmetic results. As shown in FIG. 2A, when the preceding instruction is a standard arithmetic operation instruction (the add instruction in FIG. 2A), succeeding instructions can be executed successively. On the other hand, as shown in FIG. 2B, when the preceding instruction is a load instruction that accesses the data memory (the Load Word (lw) instruction in FIG. 2B), the situation is different. The load instruction cannot acquire its data until the MA stage has finished. Therefore, the succeeding instruction (the add instruction in FIG. 2B) cannot acquire the data necessary for its operation by the time its own EX stage would normally start. In other words, the succeeding instruction (add instruction) must delay its EX stage until execution of the load instruction has been completed. Execution of the load instruction involves two operations, i.e., the data address calculation and the memory access. Therefore, an instruction that uses the result of a load instruction has a longer period of data dependency than an instruction that uses the result of another operation. This data dependency stalls the pipeline and hinders improvement in processor performance.
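The difference between the two cases of FIGS. 2A and 2B can be sketched as a small calculation. The sketch assumes full bypassing, so that a result can be used in the cycle after the stage that produces it: the EX stage for an arithmetic instruction and the MA stage for a load instruction. The names `stall_cycles`, `RESULT_READY_STAGE`, and the `distance` parameter are hypothetical.

```python
# Index of the stage (IF = 0) at whose end the producer's result exists.
RESULT_READY_STAGE = {"alu": 2, "load": 3}  # EX for arithmetic, MA for load
EX_INDEX = 2

def stall_cycles(producer_kind, distance=1):
    """Stall cycles seen by a consumer issued `distance` instructions later.

    Cycles are counted relative to the producer's IF stage (cycle 0).
    """
    # First cycle in which the bypassed result can feed an EX stage.
    earliest_ex = RESULT_READY_STAGE[producer_kind] + 1
    # Cycle in which the consumer would reach EX if nothing stalled.
    nominal_ex = distance + EX_INDEX
    return max(0, earliest_ex - nominal_ex)

print(stall_cycles("alu"))   # add followed by a dependent instruction
print(stall_cycles("load"))  # lw followed by a dependent instruction
```

In this model the dependent instruction immediately after an arithmetic instruction does not stall, whereas the one immediately after a load instruction waits one cycle; placing one independent instruction in between (`distance=2`) hides the wait.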
Next, an operation in the pipeline process when the store instruction and the load instruction are to be executed successively will be explained hereunder. In this case, the operation will be explained by using an example of an out-of-order type processor in which the instructions can be rearranged dynamically at the time of execution.
FIG. 3 shows an example of an instruction sequence in which the store instruction and the load instruction are issued successively. In FIG. 3, the store instruction is depicted as a Store Word (sw) instruction and the load instruction is depicted as a Load Word (lw) instruction. In the instruction sequence in FIG. 3, assume that the value in register r2, which is used to calculate the address of the preceding store (sw) instruction, is not yet determined, while the value in register r3, which is used to calculate the address of the load (lw) instruction, is determined. Assume that the values in registers r20 and r21, which are the operands of the add instruction, are also determined. The sw instruction must wait because its operands are not ready. The add instruction can start execution and overtake the sw instruction because its operands are ready. It would seem that the lw instruction can also start execution because its operands are ready; nevertheless, the lw instruction cannot actually start execution because its dependency upon the sw instruction has not been resolved. In other words, until the data address into which data are to be stored by the preceding store instruction has been determined, the succeeding load instruction cannot be executed. This is because, if the data address calculated by the store instruction and the data address calculated by the load instruction coincide, the load instruction must read the data which the store instruction is about to store. Therefore, the load instruction cannot overtake a store instruction that is on standby, even if these instructions employ different registers and the operands of the load instruction are ready. For this reason, even in an out-of-order type processor, the load instruction cannot overtake the store instruction, the stand-by time for execution of instructions increases, and the performance of the processor cannot be improved.
This problem in execution stand-by of the load instruction is also applicable for the above scalar processor.
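The conventional issue rule described for FIG. 3 can be sketched as follows. The `Instr` record, the `issuable` function, and the register r1 used as the store data operand are hypothetical; only r2, r3, r20, and r21 come from the description. The sketch assumes that a load may issue only when its own operands are determined and the address register of every earlier store is determined as well.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str                # "sw", "lw", or "add"
    sources: tuple         # registers whose values must be determined
    addr_reg: str = None   # register used for the data address (sw/lw only)

def issuable(program, ready):
    """Return, per instruction, whether it may start execution now."""
    result = []
    for i, ins in enumerate(program):
        ok = all(r in ready for r in ins.sources)
        if ok and ins.op == "lw":
            # Conventional rule: every earlier store address must be known,
            # because it might coincide with the load address.
            ok = all(p.addr_reg in ready
                     for p in program[:i] if p.op == "sw")
        result.append(ok)
    return result

program = [
    Instr("sw", ("r1", "r2"), "r2"),   # r1 is an assumed data register
    Instr("add", ("r20", "r21")),
    Instr("lw", ("r3",), "r3"),
]
print(issuable(program, {"r1", "r3", "r20", "r21"}))  # r2 undetermined
```

With r2 undetermined, only the add instruction may proceed, although the operand of the lw instruction is ready; once r2 is determined, all three instructions may proceed.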
A technology that can improve the processing efficiency of the load instruction by using the correspondence between the store instruction and the load instruction in successive execution has been disclosed in "Dynamic Speculation and Synchronization of Data Dependence" ("Proceedings of the 24th Annual International Symposium on Computer Architecture", A. I. Moshovos, et al., 1997). In this technology, the correspondence between a store instruction and a load instruction which depend on particular data stored at the same memory address is held in advance, and this correspondence is then checked at execution. If this check detects no correspondence between the store instruction and the load instruction, the load instruction can be executed without waiting for execution of the store instruction.
However, in this technology, if the correspondence between the store instruction and the load instruction is detected, i.e., if these instructions access the data stored at the same memory address, the load instruction stalls until execution of the store instruction has been completed, as in the conventional scheme. Therefore, this technology has not been able to improve the execution efficiency of the load instruction sufficiently.
As discussed above, there have been the following problems in the conventional scheme.
More particularly, first, there have been the problems that the processing efficiency of the load instruction is low and that the succeeding load instruction cannot be executed unless the preceding store instruction is executed. Since execution of the load instruction needs two operations, namely the address calculation and the memory access, the dependency path between the load instruction and other instructions becomes longer than that of other instructions.
Second, there has been the problem that, when the store instruction and the load instruction correspond to each other through data stored at the same address, the load instruction cannot be executed until after the address of the data to be accessed has been calculated.
These problems have reduced the degree of parallel processing in the pipeline process and have hindered the extraction of instruction level parallelism. For this reason, these problems have brought about the disadvantage that the execution performance of the processor is severely degraded.
The present invention has been made in light of the above problems, and it is an object of the present invention to provide a data providing unit for a processor which can acquire data for load instructions predictively and thereby allow succeeding instructions to be executed speculatively, so as to improve the performance of the processor.
It is another object of the present invention to provide a data providing unit for a processor which can improve the degree of parallel processing in a pipeline process by reducing the data dependency period between a store instruction and a load instruction when the store instruction and the load instruction are executed successively.
In order to achieve the above objects, a feature of the present invention resides in that correspondences between store instructions and load instructions are held in advance as history data of past instructions, and the data corresponding to a load instruction are then provided to the processor as predictive data.
A configuration to achieve these functions comprises, as shown in FIG. 4 for example: a first address converter 100 for holding an address of a store instruction in correspondence with an address of data, based on the execution history of the store instruction; a second address converter 200 for holding the address of the store instruction in correspondence with an address of a load instruction, based on the execution history of the load instruction; a data storing unit 300 for holding data in correspondence with the address of the store instruction, based on the execution history of the store instruction; and a data providing controller 700 for retrieving, from the first address converter 100 and the second address converter 200, the load instruction and the store instruction that look up the same data address, retrieving from the data storing unit 300, based on the address of the load instruction, the data employed by the store instruction corresponding to the load instruction, and providing the data to the processor as predictive data which the load instruction is predicted to access.
The history data of the above instructions can be implemented, for example, in the two address converters described above and the data storing unit.
The data providing controller 700 can acquire the address of the store instruction corresponding to the load instruction by looking up the LIST (Load Index Storing Table) 200 (referred to as the second address converter in claims described later) by using the address of the load instruction when the load instruction is executed. The data providing controller 700 can acquire data values corresponding to the store instruction by looking up the SIVT (Store Index Value Table) 300 (referred to as the data storing unit in claims) by using the address of the store instruction. Resultant data are provided for the processor as predictive data.
In other words, data values to be accessed are predicted according to the instruction address of the load instruction and then such data are provided for the processor as the predictive data. Therefore, the load instruction can acquire the data before calculation of the data address has been completed. Consequently, execution of the load instruction is accelerated, and also process performances of the processor can be improved by executing succeeding instructions speculatively by using the predictive data.
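The two-step lookup described above (load instruction address to store instruction address via the LIST 200, then store instruction address to data value via the SIVT 300) can be sketched with plain dictionaries. This is a minimal illustration under stated assumptions and not the disclosed hardware: the class `DataProvidingUnit`, its method names, and the update policy (record on every executed store and load, with unbounded tables) are hypothetical.

```python
class DataProvidingUnit:
    """Dictionary-based sketch of the three history tables."""

    def __init__(self):
        self.first_converter = {}  # data address -> store instruction address (100)
        self.list_table = {}       # load instruction address -> store instruction address (LIST 200)
        self.sivt = {}             # store instruction address -> data value (SIVT 300)

    def record_store(self, store_pc, data_addr, value):
        """Update history when a store instruction has executed."""
        self.first_converter[data_addr] = store_pc
        self.sivt[store_pc] = value

    def record_load(self, load_pc, data_addr):
        """Link a load to the store that last wrote the same data address."""
        store_pc = self.first_converter.get(data_addr)
        if store_pc is not None:
            self.list_table[load_pc] = store_pc

    def predict(self, load_pc):
        """Return predictive data using instruction addresses only.

        No data address is needed here, so the value is available before
        the load's address calculation has completed. Returns None when
        no history exists for this load.
        """
        store_pc = self.list_table.get(load_pc)
        if store_pc is None:
            return None
        return self.sivt.get(store_pc)

unit = DataProvidingUnit()
unit.record_store(store_pc=0x10, data_addr=0x1000, value=7)
unit.record_load(load_pc=0x40, data_addr=0x1000)
print(unit.predict(0x40))
```

Because `predict` is keyed only by the instruction address of the load, the returned value is a prediction; it still has to be checked against the value actually read from memory.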
In addition, the data providing controller 700 can detect the correspondence between the store instruction and the load instruction without using the data address. Therefore, the data providing controller 700 can improve the processing performance of the processor by executing the store instruction and the load instruction simultaneously or by allowing the load instruction to overtake the store instruction.
As shown in FIG. 4, the configuration according to the present invention can further comprise a state holding unit 400 and a comparator 500.
The state holding unit 400 can hold the state of the processor as it was before the processor looks up the data providing unit according to the present invention. This state of the processor contains at least the values of the program counter 600 and of the respective registers. The comparator 500 can compare the predictive data with the actual data value obtained by actually accessing the memory, and can output the comparison result.
The data providing controller 700 can restore the processor by using the state held by the state holding unit 400 if the comparison result shows that the predictive data and the actual data do not coincide. Accordingly, the comparison result can be used to keep the process consistent.
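The roles of the state holding unit 400 and the comparator 500 can be sketched as follows. The function `run_speculatively`, the dictionary layout of the processor state, and the `successor` callback are hypothetical; the sketch assumes that the held state consists of the program counter and the register values, and it leaves the held state untouched by working on a copy.

```python
import copy

def run_speculatively(state, dest, predicted, actual, successor):
    """Execute `successor` with predictive data; restore on a mismatch.

    `state` plays the role of the contents of the state holding unit:
    it is never modified. Returns (resulting_state, prediction_was_correct).
    """
    working = copy.deepcopy(state)
    working["regs"][dest] = predicted   # load result supplied predictively
    successor(working)                  # succeeding instruction runs speculatively
    if predicted == actual:             # comparator: predictive vs. actual data
        return working, True
    return state, False                 # mismatch: fall back to the held state

def add_r2_r1(s):
    """Example successor: r2 = r2 + r1, then advance the program counter."""
    s["regs"]["r2"] += s["regs"]["r1"]
    s["pc"] += 4

held = {"pc": 8, "regs": {"r1": 0, "r2": 5}}
print(run_speculatively(held, "r1", predicted=3, actual=3, successor=add_r2_r1))
print(run_speculatively(held, "r1", predicted=3, actual=4, successor=add_r2_r1))
```

In the second call the prediction is wrong, so the speculative register and program counter updates are discarded and the held state is returned unchanged.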
A history of the comparison results can be held, for example as counters, in the above tables, e.g., the LIST 200. By looking up the counter value, the data providing controller 700 can suppress ill-advised speculative execution for which the coincidence rate is low. Accordingly, the reduction in processing performance caused by failed speculative execution and the subsequent restoration of the original state can be suppressed.
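One possible form of such a counter is a saturating counter kept per table entry, sketched below. The class name `ConfidenceCounter`, the counter range of 0 to 3, and the threshold of 2 are assumptions made for illustration; the description does not specify the counter width or the decision rule.

```python
class ConfidenceCounter:
    """Saturating counter recording how often predictions have matched."""

    def __init__(self, maximum=3, threshold=2):
        self.maximum = maximum      # assumed saturation value
        self.threshold = threshold  # assumed minimum value for speculation
        self.value = 0

    def update(self, matched):
        """Record one comparison result (True if predictive == actual)."""
        if matched:
            self.value = min(self.maximum, self.value + 1)
        else:
            self.value = max(0, self.value - 1)

    def allow_speculation(self):
        """Speculate only when past predictions have mostly been correct."""
        return self.value >= self.threshold

counter = ConfidenceCounter()
for matched in (True, True, False, True):
    counter.update(matched)
    print(counter.value, counter.allow_speculation())
```

With these assumed parameters, a load is executed speculatively only after its predictions have matched at least twice more often than they have recently failed, which limits the number of costly restorations.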
In addition, the address of the data may be held in the above tables, for example in the LIST 200. The data providing controller 700 can obtain a predictive data address quickly by looking up this address of the data. Therefore, the speculative execution can be started quickly.
Further, the address of the data may be held in the above tables, for example in the SIVT 300. The data providing controller 700 can then decide quickly whether the speculative execution using the predictive data has succeeded or failed. Accordingly, the state can be restored quickly if the speculative execution has failed.
Also, it is preferable that the data providing controller 700 looks up the second address converter 200 and the data storing unit 300 in different pipeline stages. By looking them up in different pipeline stages, an increase in the cycle time of the processor can be avoided.
Various further and more specific objects, features and advantages of the invention will appear from the description given below, taken in connection with the accompanying drawings illustrating by way of example preferred embodiments of the invention.