1. Field of the Invention
The present invention relates to a data processing device and, in particular, to an instruction fetch control device that fetches an instruction and feeds it to an instruction execution unit (instruction execution processing unit), and to a method thereof.
2. Description of the Related Art
In a data processing device that adopts an advanced instruction processing method, such as pipeline processing or its successors, performance has been improved by speculatively fetching a subsequent instruction (instruction fetch) and speculatively processing it without waiting for the execution of the current instruction to complete.
In such a data processing device, for example, an instruction fetch unit is provided that is completely decoupled from the instruction execution unit and closely coupled with a branch prediction mechanism. This realizes the instruction fetch capacity that the instruction execution unit needs and improves instruction fetch performance.
There is also a configuration that uses a trace cache in the instruction fetch unit. In this configuration, a decoded instruction string, formed into packets, is stored in the trace cache in advance. When the same instruction string is needed again, the stored instruction string is used, so the processes from instruction fetch through decoding are omitted.
However, since this configuration stores decoded information, the entry size (capacity) becomes large, which is a disadvantage. Furthermore, if this configuration is linked to branch prediction, such information must be stored for each branch prediction destination. A large trace cache capacity is therefore consumed for a specific instruction area, and it is difficult to secure enough trace cache capacity to cover a wide instruction area.
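As a rough illustration of why trace cache capacity grows when it is linked to branch prediction, the following minimal sketch models a trace cache as a table keyed by a start address together with the predicted branch outcomes along the trace. The class name, the key layout, and the strings standing in for decoded packets are all assumptions made for illustration; they are not taken from any actual design.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key: (start address, predicted taken/not-taken outcome per branch).
TraceKey = Tuple[int, Tuple[bool, ...]]


class TraceCache:
    """Toy trace cache holding already-decoded instruction packets."""

    def __init__(self) -> None:
        self._entries: Dict[TraceKey, Tuple[str, ...]] = {}

    def store(self, start: int, outcomes: Tuple[bool, ...],
              decoded: Tuple[str, ...]) -> None:
        # Decoding must already have happened before an entry can be registered.
        self._entries[(start, outcomes)] = decoded

    def lookup(self, start: int,
               outcomes: Tuple[bool, ...]) -> Optional[Tuple[str, ...]]:
        # A hit skips fetch and decode; a miss (None) falls back to the normal path.
        return self._entries.get((start, outcomes))

    def entry_count(self) -> int:
        return len(self._entries)


cache = TraceCache()
# The same start address needs one entry per predicted path.
cache.store(0x1000, (True,), ("decoded:add", "decoded:branch", "decoded:load"))
cache.store(0x1000, (False,), ("decoded:add", "decoded:branch", "decoded:sub"))
```

In this sketch a single start address already occupies two entries, one per branch prediction destination, and each entry holds decoded data rather than raw instruction bytes.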
Furthermore, since the configuration using a trace cache must store decoded information, a decoding process must be performed before the information is stored in the trace cache (that is, when instruction fetch data from an instruction cache or the like is registered). Therefore, in the case of mis-trace caching (when branch prediction fails), a process of storing the information in the trace cache must be performed in addition to the normal processes of fetching the instruction from an instruction cache and decoding it. The overhead of mis-trace caching is therefore large.
For these reasons, the configuration using a trace cache is remarkably effective for the narrow target instruction area of, for example, a single small benchmark test. However, in an actual environment where a plurality of applications are executed, as in a server performing large-scale transaction processing, even an instruction cache incurs misses. A trace cache, which covers a narrower instruction area than an instruction cache, therefore mis-traces frequently. In other words, a configuration using a trace cache is not effective in an actual environment; worse, processing performance degrades markedly.
Furthermore, a trace cache is a kind of cache memory and requires cache coherence control, like a general instruction cache, which makes its control complex. For example, when an instruction string must be rewritten because of a store instruction, the processes for updating and nullifying the trace cache become complex, and the control unit requires a correspondingly complex design, which is another disadvantage.
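One source of this complexity can be sketched as follows: because a trace may span a branch, a single instruction address can appear in several traces, so a store to that address has to nullify every trace that contains it. The table layout and the function name below are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List

# Hypothetical model: start address of a trace -> instruction addresses it covers.
traces: Dict[int, List[int]] = {
    0x1000: [0x1000, 0x1004, 0x2000, 0x2004],  # follows a taken branch to 0x2000
    0x2000: [0x2000, 0x2004, 0x2008],
    0x3000: [0x3000, 0x3004],
}


def invalidate_on_store(trace_table: Dict[int, List[int]],
                        store_addr: int) -> List[int]:
    """Remove every trace containing the rewritten address.

    Returns the start addresses of the traces that were removed.
    """
    stale = [start for start, addrs in trace_table.items() if store_addr in addrs]
    for start in stale:
        del trace_table[start]
    return stale


# A store that rewrites the instruction at 0x2004 hits two different traces.
removed = invalidate_on_store(traces, 0x2004)
```

A plain instruction cache would only need to check the one line that holds the rewritten address, whereas this sketch has to search every trace.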
In contrast, when an instruction is fetched using an instruction buffer, there is no such problem, since that configuration was originally developed for mainframes.
However, in the configuration using an instruction buffer, the instruction buffers are statically allocated, so their average use efficiency is low, which is another problem.
For example, there is a configuration which fetches an instruction using a plurality of instruction buffers that are grouped into a plurality of systems.
FIG. 1 shows an example of such a configuration for fetching an instruction using a plurality of instruction buffers.
In the example shown in FIG. 1, there are a plurality of instruction buffers (I-Buffer #0, #1 and #2) for each of three systems (A, B and C). In this case, when a branch has not been predicted, the instruction buffers belonging to one system are used. When a branch has been predicted, the instruction buffers belonging to a system other than the one that has been in use are used. Therefore, in this configuration, when a branch has not been predicted, only one of the three systems is used, and the use efficiency of the instruction buffers is one third. When branches are predicted frequently, all three systems are used, but only a part of the instruction buffers of each system is used, so the use efficiency is again low. Thus, in a static instruction buffer configuration, in which the instruction buffers to be used are fixedly determined depending on whether a branch has been predicted, the use efficiency of the instruction buffers varies with the number of branch predictions and remains low, which is a problem.
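The arithmetic behind these figures can be sketched as follows, assuming the three systems of three buffers each shown in FIG. 1. The function name and the buffer counts used for the frequent-branch case are illustrative assumptions, not values given in the description.

```python
from fractions import Fraction

SYSTEMS = 3              # systems A, B and C in FIG. 1
BUFFERS_PER_SYSTEM = 3   # I-Buffer #0, #1 and #2 in each system


def use_efficiency(buffers_used_per_system):
    """Fraction of all instruction buffers that are actually in use."""
    total = SYSTEMS * BUFFERS_PER_SYSTEM
    return Fraction(sum(buffers_used_per_system), total)


# No branch predicted: one system is filled, the other two stay idle.
no_branch = use_efficiency([3, 0, 0])

# Frequent branch prediction (assumed numbers): every system is switched to,
# but only one buffer of each is used before the next predicted branch.
frequent_branch = use_efficiency([1, 1, 1])
```

With these assumed counts both cases use three of the nine buffers, so spreading the work across all three systems does not by itself raise the use efficiency.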
Long-distance wiring, which inevitably results from the larger configurations made possible by the recent miniaturization of LSI processes, has become a major obstacle to high-speed operation, so increasing the number of trace caches and instruction buffers carries a great disadvantage. However, in order to improve performance, as many instruction buffers as possible are required to prevent delay in transferring instructions from main memory, cache memory and the like to the instruction execution unit.