The operating frequency of a microprocessor, which is one example of a data processor, generally depends on the cache access delay. Because much of the cache access delay is wiring delay, it does not decrease much even when the semiconductor manufacturing process is scaled down. Pipelining the cache access is therefore an effective way to increase the operating frequency of a microprocessor (see, e.g., JP-A-05-313893, particularly FIGS. 10 and 11 thereof).
However, pipelining the cache access increases the number of cycles until the access result is obtained, and thus delays the fixing of the load result. In a typical program, the result of a load instruction that accesses the cache is often used as an input of an execution instruction. Hence, when pipelining the cache access delays the fixing of the load result, the number of wait cycles until the data input required by the execution instruction is fixed increases, and the number of cycles required to run the program increases accordingly.
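The effect on the run cycle count can be illustrated with a minimal sketch. It assumes a hypothetical single-issue, in-order pipeline in which every load-use pair places the dependent execution instruction immediately after the load, and it ignores pipeline fill and all other stalls. The function name `run_cycles` and its parameters are illustrative and do not come from the cited references.

```python
def run_cycles(n_instructions: int, n_load_use_pairs: int, load_latency: int) -> int:
    """Estimate run cycles on a hypothetical single-issue in-order pipeline.

    load_latency is the number of cycles from issuing a load until its
    result can feed a dependent instruction (1 means no wait at all).
    Assumption: each load-use pair has the consumer directly after the load.
    """
    wait_per_pair = max(0, load_latency - 1)
    return n_instructions + n_load_use_pairs * wait_per_pair


# Deepening the cache pipeline by one stage adds one wait cycle per pair.
shallow = run_cycles(n_instructions=100, n_load_use_pairs=20, load_latency=2)
deep = run_cycles(n_instructions=100, n_load_use_pairs=20, load_latency=3)
print(shallow, deep)
```

Under these assumptions, each additional cache pipeline stage costs one extra cycle for every load whose result is consumed by the next instruction.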
As described above, pipelining the cache access to increase the operating frequency of a microprocessor increases the number of cycles needed to execute an execution instruction, so performance commensurate with the increase in frequency cannot be attained. To attain such performance, a measure is needed that avoids an increase in the number of run cycles of an execution instruction even when the fixing of the load result is delayed by pipelining the cache access.
Techniques widely used as such a measure include the out-of-order system (see, e.g., JP-A-2001-236222, particularly the second to eighth paragraphs thereof). In the out-of-order system, a string of instructions to be run later is fetched in advance, and an instruction that does not use the result of the load instruction currently being executed is found in that string and run. Basically, when an instruction or operand dependency arises, execution of the dependent instruction is suspended until the dependency is resolved, and an instruction having no dependency is executed ahead of it, out of the order in which the code is written.
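The selection step described above can be sketched as follows. This is a simplified illustration, not the mechanism of the cited reference: it assumes no register renaming, a small window of not-yet-issued instructions held in program order, and a set of registers whose load results are not yet fixed. The names `Instr` and `pick_ready` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Set, Tuple


@dataclass(frozen=True)
class Instr:
    name: str
    dest: Optional[str]      # destination register, or None
    srcs: Tuple[str, ...]    # source registers


def pick_ready(window: Sequence[Instr], pending: Set[str]) -> Optional[int]:
    """Return the index of the oldest instruction that may issue now.

    window  : not-yet-issued instructions in program order.
    pending : registers still waiting for an in-flight load result.
    """
    for i, cand in enumerate(window):
        # The candidate must not read or overwrite a register with a load in flight.
        if any(s in pending for s in cand.srcs) or cand.dest in pending:
            continue
        older = window[:i]
        # Without renaming, the candidate must not conflict with older waiting instructions.
        raw = any(o.dest is not None and o.dest in cand.srcs for o in older)
        waw = any(cand.dest is not None and o.dest == cand.dest for o in older)
        war = any(cand.dest is not None and cand.dest in o.srcs for o in older)
        if not (raw or waw or war):
            return i
    return None


# A load into r1 is in flight; "add" needs r1 and "sub" needs the result of "add".
window = [
    Instr("add", "r2", ("r1", "r3")),
    Instr("sub", "r4", ("r2", "r5")),
    Instr("mul", "r6", ("r7", "r8")),
]
chosen = pick_ready(window, pending={"r1"})
print(window[chosen].name)
```

In this example only `mul` is independent of the in-flight load, so it is the instruction issued ahead of the two older ones.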
However, because the out-of-order system may execute an instruction ahead of its position in the code, a mechanism for ensuring the memory access order and the like is required. Many additional mechanisms are therefore needed, for example for fetching a string of instructions in advance, for finding an instruction that can be executed, and for ensuring the memory access order. For this reason, applying the out-of-order system to low-cost microprocessors is regarded as difficult in view of the manufacturing cost.
As a substitute for the out-of-order system, a delayed ALU system is known (see, e.g., M. Ozawa et al., “Pipeline Structure of SH-X Core for Achieving High Performance and Low Power,” in Proc. of COOL Chips VII, pp. 239-254, April 2004). In this technique, the start of execution of an execution instruction is set one cycle after the start of execution of a load instruction. According to the delayed ALU system, the operation of the ALU (Arithmetic and Logic Unit), which performs arithmetic and logic operations, is placed in the start stage of the cache access, so that the readout of the ALU inputs can be delayed by one cycle. Thus, the number of wait cycles incurred when the result of a cache access is used as an input of an execution instruction can be reduced by one cycle.
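The one-cycle saving can be illustrated with a small timing sketch. It assumes, purely for illustration, that the load issues in cycle 0, that the dependent execution instruction issues in cycle 1, and that `result_ready_cycle` is the first cycle in which the load result can be read as an ALU input. The function name `wait_cycles` and the cycle numbers are hypothetical and are not taken from the cited paper.

```python
def wait_cycles(result_ready_cycle: int, alu_input_delay: int = 0) -> int:
    """Wait cycles seen by an execution instruction that uses a load result.

    Assumptions: the load issues in cycle 0 and the dependent instruction
    issues in cycle 1. alu_input_delay is 0 for a conventional pipeline and
    1 when the ALU input readout is placed one cycle later (delayed ALU).
    """
    alu_read_cycle = 1 + alu_input_delay
    return max(0, result_ready_cycle - alu_read_cycle)


conventional = wait_cycles(result_ready_cycle=3, alu_input_delay=0)
delayed = wait_cycles(result_ready_cycle=3, alu_input_delay=1)
print(conventional, delayed)
```

With the load result readable from cycle 3, the conventional arrangement waits two cycles while the delayed arrangement waits one, which matches the one-cycle reduction described above.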