1. Field of the Invention
The present invention relates to increasing the speed and efficiency of a microprocessor by increasing its instruction level parallelism (ILP). More particularly, the present invention is a technique for increasing the number of instructions executed per clock cycle (IPC) by adding distributed reservation stations that are organized in accordance with the basic blocks of code that are to be executed on the microprocessor.
2. Description of Related Art
In the computer industry there is a constant demand for ever faster and more efficient systems. Computer processing speed depends on the speed and efficiency of the microprocessor that controls the basic functions of the computer system. Today's microprocessors, such as the Pentium and PowerPC, include multiple execution units, such as integer or fixed point units (also referred to herein as arithmetic logic units, or ALUs), floating point units (FPUs), Load/Store units, and the like, which allow instructions to be executed in parallel. One method of increasing computer performance has been to design microprocessors with additional execution units. However, in spite of the added execution resources, the number of instructions executed per clock cycle has remained at an average of 0.9 on an integer benchmark based on three (3) ALUs. Ideally, for three (3) ALUs, the IPC should be three, i.e. one instruction executes on each ALU during each clock cycle.
Typically, reduced IPC is due either to inefficient coding, i.e. the compiler is not optimized to increase instruction level parallelism, or to memory subsystem latency, i.e. microprocessor instructions must wait until information is stored to or loaded from memory before they can be executed. In most computer systems the speed of the memory bus, which transmits information between the execution units and memory, is significantly slower than the microprocessor clock. The ratio of microprocessor core frequency to bus frequency is often three or four to one. For example, while a microprocessor clock may run at 133 MHz, the system bus may only operate at 33 MHz. Therefore, it can be seen that instructions which are dependent on a memory operation may take four times as long to complete as instructions which are independent of memory. One example is a cache miss, where the required data is not contained in the level one (L1) cache, which is typically contained in the microprocessor core. In this case, the data must be retrieved from a level two (L2) cache that is usually on a separate integrated circuit chip. If the data is not in the L2 cache (L2 cache miss), then it must be retrieved from main memory. Those skilled in the art will understand that there is a very high cost in terms of system speed and performance due to memory latency, particularly cache misses.
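The clock ratio in the example above can be checked with a short calculation. The following is only an illustrative sketch; the function names are hypothetical and the model assumes that a memory-dependent instruction simply waits a whole number of bus cycles, ignoring cache and controller latencies.

```python
def core_to_bus_ratio(core_mhz: float, bus_mhz: float) -> float:
    """Number of core clock cycles that elapse during one bus clock cycle."""
    return core_mhz / bus_mhz


def core_cycles_waiting(bus_cycles: int, core_mhz: float, bus_mhz: float) -> float:
    """Core cycles spent waiting while a transfer occupies `bus_cycles` bus cycles.

    Simplifying assumption: the wait is exactly the bus time, nothing more.
    """
    return bus_cycles * core_to_bus_ratio(core_mhz, bus_mhz)


# Figures taken from the text: a 133 MHz core and a 33 MHz system bus.
ratio = core_to_bus_ratio(133.0, 33.0)
stall = core_cycles_waiting(1, 133.0, 33.0)
print(f"core/bus ratio is about {ratio:.2f}")
print(f"one bus cycle costs about {stall:.2f} core cycles")
```

With these figures the ratio is slightly above four, which is consistent with the statement that a memory-dependent instruction may take roughly four times as long to complete.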
Modern microprocessors may include a reservation station, which is basically a queue that stores instructions which are awaiting execution. When a cache miss occurs, the particular instruction(s) awaiting the operand information from memory will wait in the reservation station until the information is available. This wait period will have a detrimental effect on system performance. Most common architectures use a centralized reservation station scheme that buffers the instructions to be scheduled for execution. The depth of a conventional reservation station can be on the critical path if an instruction is to be scheduled in a single cycle in a high frequency processor. As the depth of the reservation station is increased, the time it takes to look up and fetch an instruction that is ready for execution also increases.
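The relationship between reservation station depth and look up cost can be illustrated with a simple software model. This is a sketch under stated assumptions, not a description of any actual hardware: the `Instruction` type and `find_ready` function are hypothetical, and the sequential scan stands in for selection logic whose delay in real hardware grows with the number of entries rather than being a literal loop.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Instruction:
    name: str
    ready: bool  # True once all operands are available


def find_ready(station: List[Instruction]) -> Tuple[int, Optional[Instruction]]:
    """Scan a reservation station for the first ready instruction.

    Returns (entries examined, instruction found or None).
    """
    for examined, instr in enumerate(station, start=1):
        if instr.ready:
            return examined, instr
    return len(station), None


def make_station(depth: int) -> List[Instruction]:
    """Worst case for the scan: only the last entry is ready."""
    return [Instruction(f"i{n}", ready=(n == depth - 1)) for n in range(depth)]


deep_examined, deep_hit = find_ready(make_station(24))      # centralized, deep
shallow_examined, shallow_hit = find_ready(make_station(3))  # shallow station
print(deep_examined, shallow_examined)
```

In this model the number of entries examined in the worst case equals the depth of the station, which is the sense in which a deeper station lengthens the look up.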
Other types of architectures use individual reservation stations, where each execution unit in the microprocessor has an assigned reservation station. That is, each floating point unit (FPU), fixed point unit (FXU) or the like will have a corresponding reservation station. These reservation stations are usually shallow and can hold 2-3 instructions. Another type of reservation station configuration is the group reservation station. In this case the same reservation station holds instructions for a whole group of execution units, each of which executes the same type of instructions. For example, one reservation station may hold instructions for one or more FPUs, while another reservation station may hold integer instructions for multiple FXUs. In this case, each reservation station will only hold those specific types of instructions that can be executed by the units in the group.
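The group reservation station arrangement described above amounts to routing each instruction by its type. The sketch below is illustrative only; the opcode names, the `GROUP_FOR_OPCODE` table, and the "LSU" group label are assumptions introduced for the example and are not taken from any particular architecture.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from opcode to the execution unit group that can run it.
GROUP_FOR_OPCODE = {
    "fadd": "FPU",
    "fmul": "FPU",
    "add": "FXU",
    "sub": "FXU",
    "load": "LSU",
    "store": "LSU",
}


def dispatch_to_groups(opcodes: List[str]) -> Dict[str, List[str]]:
    """Place each instruction in the reservation station of its unit group."""
    stations: Dict[str, List[str]] = defaultdict(list)
    for op in opcodes:
        stations[GROUP_FOR_OPCODE[op]].append(op)
    return dict(stations)


stations = dispatch_to_groups(["add", "fadd", "sub", "load", "fmul"])
print(stations)
```

Each station in this model holds only one type of instruction, regardless of which part of the program the instruction came from.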
However, none of these current architectures include reservation stations that are organized at the basic block level, which provides independence between instructions at each station, thereby increasing instruction level parallelism and decreasing overhead associated with look up time. Thus, it can be seen that a need exists for a microprocessor that minimizes the time instructions are waiting to be executed.
In contrast to the prior art, the present invention utilizes a distributed reservation station which stores basic blocks of code in the form of microprocessor instructions.
The present invention is capable of distributing basic blocks of code to the various distributed reservation stations. Due to the smaller number of entries in the distributed reservation stations, the look up time required to find a particular instruction is much less than in a centralized reservation station.
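As a rough illustration of distributing basic blocks, the sketch below splits an instruction stream into blocks and places one block in each station. It is a simplified model under several assumptions: a basic block is taken to end at a branch instruction (branch targets, which also begin a block, are ignored), the opcode names and the `BRANCH_OPCODES` set are hypothetical, and the one-block-per-station placement policy is chosen only for the example.

```python
from typing import List, Tuple

# Hypothetical branch opcodes that terminate a basic block.
BRANCH_OPCODES = frozenset({"beq", "bne", "jmp"})


def split_basic_blocks(opcodes: List[str]) -> List[List[str]]:
    """Split a linear instruction stream into blocks that end at a branch."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for op in opcodes:
        current.append(op)
        if op in BRANCH_OPCODES:
            blocks.append(current)
            current = []
    if current:  # trailing instructions with no closing branch
        blocks.append(current)
    return blocks


def distribute(blocks: List[List[str]], num_stations: int) -> Tuple[List[List[str]], List[List[str]]]:
    """Place one block per station; return (station contents, blocks still waiting)."""
    return blocks[:num_stations], blocks[num_stations:]


program = ["add", "load", "beq", "sub", "mul", "jmp", "store"]
blocks = split_basic_blocks(program)
stations, waiting = distribute(blocks, num_stations=2)
largest_station = max(len(block) for block in stations)
print(stations, waiting, largest_station)
```

In this example each station holds at most three entries, whereas a single centralized station would hold all seven, which is the source of the shorter look up described above.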
Further, additional instruction level parallelism is achieved by maintaining single basic blocks of code in the distributed reservation stations. Generally, more independent instructions are found in basic blocks. This is because instructions which are grouped together are less likely to use the same resources, e.g. registers and memory locations; therefore, they will exhibit more data, control and resource independence. In contrast, when instructions are not associated with one another (e.g. in different basic blocks), they are more likely to use the same processing resources (execution units) and data resources (registers) and to be subject to control dependencies (branching), thus causing a greater chance of dependency that may force instructions to wait for resources to become available.
Also, with a distributed reservation station, an independent scheduler can be used for each one of the distributed reservation stations. When an instruction is ready for execution, the scheduler will remove the instruction from the distributed reservation station and arbitrate for ownership of the appropriate execution unit. When ownership is awarded to the scheduler, it will queue the instruction(s) for immediate execution at that particular execution unit. It can be seen that multiple independent schedulers will provide greater efficiency than a single scheduler which must contend with approximately 20-24 instructions that have increased dependency on one another.
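The interaction between independent schedulers and the execution units can be sketched as a single scheduling cycle. This is a minimal model under stated assumptions, not the arbitration scheme of the invention: the `Instr` type and `schedule_cycle` function are hypothetical, each scheduler issues at most one instruction per cycle, and ties are resolved by a fixed priority in which lower-numbered stations win.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Instr:
    name: str
    unit: str    # execution unit type required, e.g. "ALU" or "FPU"
    ready: bool  # True once all operands are available


def schedule_cycle(stations: List[List[Instr]], free_units: Dict[str, int]) -> List[Tuple[int, str]]:
    """Run one cycle of per-station scheduling.

    Each station's scheduler picks its first ready instruction and arbitrates
    for a free unit of the required type. Assumed policy: lower station index
    wins. Returns a list of (station index, instruction name) issued.
    """
    available = dict(free_units)  # copy so the caller's counts are untouched
    issued: List[Tuple[int, str]] = []
    for station_id, station in enumerate(stations):
        candidate = next((i for i in station if i.ready), None)
        if candidate is None:
            continue  # nothing ready in this station this cycle
        if available.get(candidate.unit, 0) > 0:
            available[candidate.unit] -= 1  # ownership awarded
            station.remove(candidate)       # instruction leaves the station
            issued.append((station_id, candidate.name))
    return issued


stations = [
    [Instr("load", "LSU", ready=False), Instr("add", "ALU", ready=True)],
    [Instr("sub", "ALU", ready=True)],
    [Instr("fmul", "FPU", ready=True)],
]
issued = schedule_cycle(stations, {"ALU": 1, "FPU": 1})
print(issued)
```

In this example the schedulers for stations 0 and 1 both request the single ALU; station 0 is granted ownership and station 1 retries in a later cycle, while station 2 issues to the FPU in the same cycle without contention.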
Therefore, in accordance with the previous summary, objects, features and advantages of the present invention will become apparent to one skilled in the art from the subsequent description and the appended claims taken in conjunction with the accompanying drawings.