I. Field of the Invention
This invention relates generally to computer technology, and more particularly, to improving processor performance in a computer system.
II. Background Information
Modern processors implement a variety of techniques to increase the performance of executing instructions, including superscalar execution, pipelining, and out-of-order instruction execution. More specifically, superscalar processors are capable of processing multiple instructions within a common clock cycle. Pipelined processors divide the processing (from fetch to retirement) of an operation into separate pipestages and overlap the pipestage processing of subsequent instructions in an attempt to achieve single pipestage throughput performance. Out-of-order processing within a processor involves allowing instructions without true data dependencies to be processed out of their original program order to gain performance, increase parallelism, and use resources more efficiently.
The following two types of hazards may cause pipeline stalls: a structural hazard and a data hazard. A structural hazard arises from resource conflicts (e.g., execution unit conflicts) when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution. A data hazard arises when an instruction depends on the results of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
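The two hazard types may be illustrated with a minimal sketch. The `Instr` record, the unit names, and the two checking functions below are hypothetical and are introduced only for illustration; they assume a simplified model in which each instruction names one required execution unit and at most one destination register.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record used only for this illustration."""
    text: str
    unit: str                    # execution unit the instruction requires
    dest: Optional[str]          # register written, if any
    srcs: Tuple[str, ...] = ()   # registers read


def has_structural_hazard(a, b, unit_counts):
    """True if both instructions need the same unit type in the same cycle
    and fewer than two copies of that unit exist."""
    return a.unit == b.unit and unit_counts.get(a.unit, 0) < 2


def has_data_hazard(earlier, later):
    """True if the later instruction reads a register the earlier one writes."""
    return earlier.dest is not None and earlier.dest in later.srcs


# Assumed machine: two ALUs, one multiplier.
unit_counts = {"alu": 2, "multiplier": 1}

mul_a = Instr("mul r1, r2, r3", "multiplier", "r1", ("r2", "r3"))
add_b = Instr("add r4, r1, r5", "alu", "r4", ("r1", "r5"))
mul_c = Instr("mul r6, r7, r8", "multiplier", "r6", ("r7", "r8"))

print(has_data_hazard(mul_a, add_b))                     # add reads r1 written by mul
print(has_structural_hazard(mul_a, mul_c, unit_counts))  # both need the single multiplier
```

In this sketch the add depends on the result of the first multiply (a data hazard), while the two multiplies compete for the only multiplier (a structural hazard).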
Instruction steering is the process of sending an instruction to the execution unit that it requires. Steering the instructions in the pipestages between the instruction cache and the execution unit, however, unnecessarily couples the execution throughput to the throughput at which the steering can be performed.
If, however, the steering is done prior to the instruction cache, the manipulations could be performed at a slower throughput because the throughput of fetching an instruction from main memory to the instruction cache is much slower than the throughput of instructions executing in the pipeline (typically, the pipeline is executing multiple instructions every clock cycle). Therefore, if the steering is done prior to the instruction cache, it could be performed concurrently with the instruction fetch from main memory and thus the steering could be performed at the slower rate without appreciably degrading the overall system performance.
If instruction steering is done prior to the instruction cache, the steering could be captured in the instruction cache, and therefore, the instruction steering would only have to be repeated in the case of an instruction cache miss. If, however, the steering is done in the pipestages, then the steering would have to be done for all dynamic instances of the instruction. The number of dynamic instances of an instruction is the number of times that the instruction at a specific address is executed. The number of static instances is the number of instructions in a code segment, where each instruction (at each address) is counted only once. For example, in a program loop (e.g., a “for” loop) that performs a multiply instruction one thousand times, the multiply instruction has one thousand dynamic instances but only one static instance. Performing the instruction steering prior to the instruction cache allows the steering to be performed at a reduced throughput because fewer instructions would be manipulated per processor cycle. The results of the steering can be reused multiple times, thereby reducing the total number of steering operations performed during the processing of instructions. Any increase in the time needed to perform the steering prior to the instruction cache (as discussed in the prior paragraph) can be amortized over multiple executions.
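The loop example above can be worked through with a short sketch. The function names, the address `0x40`, and the cache model (a cache that never evicts, so each address misses exactly once) are assumptions made only for illustration.

```python
from collections import Counter


def count_instances(trace):
    """Return (static, dynamic) instance counts for a trace of executed addresses."""
    per_address = Counter(trace)
    return len(per_address), sum(per_address.values())


def steering_operations(trace, steer_before_cache):
    """Count steering operations under the two placements of the steering logic.

    Assumes an instruction cache that never evicts, so steering before the
    cache happens once per address (on the first miss); steering in the
    pipestages happens on every execution.
    """
    if not steer_before_cache:
        return len(trace)
    cached = set()
    steered = 0
    for address in trace:
        if address not in cached:   # cache miss: steer while filling the cache
            steered += 1
            cached.add(address)
    return steered


MULTIPLY_ADDRESS = 0x40                 # hypothetical address of the multiply
trace = [MULTIPLY_ADDRESS] * 1000       # the loop body executes 1000 times

print(count_instances(trace))               # (1, 1000)
print(steering_operations(trace, False))    # 1000
print(steering_operations(trace, True))     # 1
```

Under these assumptions, the cost of the single pre-cache steering operation is amortized over all one thousand executions.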
Typically, instructions are steered as quickly as possible to the execution unit that the instruction requires without consideration of other instructions. Steering the instructions prior to the instruction cache, however, allows for better steering because the steering can be performed on a window of instructions. The start of the steering activity may be deferred until a given number of instructions have been accumulated in the instruction window. Using the instruction window increases the quality of the steering because a larger number of instructions is taken into account when a particular instruction is steered.
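The benefit of an instruction window may be illustrated with a minimal sketch. It assumes, purely for illustration, that some instructions can execute on more than one unit and that each unit accepts one instruction per cycle; the function names and the "most constrained first" ordering are hypothetical and are not taken from any particular implementation.

```python
def steer_one_at_a_time(instrs, units):
    """Steer each instruction as it arrives, ignoring later instructions."""
    free = set(units)
    assignment = {}
    for name, capable_units in instrs:
        choice = next((u for u in capable_units if u in free), None)
        assignment[name] = choice          # None means the instruction stalls
        if choice is not None:
            free.discard(choice)
    return assignment


def steer_with_window(instrs, units):
    """Steer after the whole window is known: most constrained instruction first."""
    free = set(units)
    assignment = {}
    for name, capable_units in sorted(instrs, key=lambda item: len(item[1])):
        choice = next((u for u in capable_units if u in free), None)
        assignment[name] = choice
        if choice is not None:
            free.discard(choice)
    return assignment


units = ["A", "B"]
window = [("i1", ["A", "B"]),   # i1 may use either unit
          ("i2", ["A"])]        # i2 can only use unit A

print(steer_one_at_a_time(window, units))  # {'i1': 'A', 'i2': None}
print(steer_with_window(window, units))    # {'i2': 'A', 'i1': 'B'}
```

Steering without the window sends the first instruction to unit A and leaves the second with no free unit, whereas considering both instructions together lets each be issued in the same cycle.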
The more aggressive the implementation of a superscalar, out-of-order, pipelined processor, the more manipulations have to be applied to each instruction as it is passed down the pipestages of the processor. Two of the most complex manipulations performed in a pipelined processor involve register renaming and scheduling. Register renaming is the process of assigning new registers for new values produced in order to remove false data dependencies (dependencies caused by the reuse of storage locations even though the conflicting instructions are otherwise independent). Scheduling involves rearranging the instructions to avoid pipeline stalls (i.e., to overcome structural and data hazards). Performing the complex manipulations and steering the instructions prior to the instruction cache may allow instructions to be transferred directly from the instruction cache to the execution unit that the instruction requires without further steering in the pipestages.
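Register renaming as described above may be sketched as follows. The tuple encoding of instructions, the `p0`, `p1`, ... physical register names, and the assumption of an unlimited supply of physical registers are hypothetical simplifications made only for illustration.

```python
from itertools import count


def rename_registers(program):
    """Give every produced value a fresh physical register.

    `program` is a list of (dest, srcs) tuples using architectural names.
    Sources with no earlier producer in the program keep their original names.
    """
    fresh = count()
    latest = {}                 # architectural register -> newest physical name
    renamed = []
    for dest, srcs in program:
        new_srcs = tuple(latest.get(s, s) for s in srcs)  # read newest values
        new_dest = f"p{next(fresh)}"                      # new value, new register
        latest[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("r1", ("r2", "r3")),   # r1 = r2 op r3
    ("r4", ("r1", "r5")),   # r4 = r1 op r5  (true dependency on r1)
    ("r1", ("r6", "r7")),   # r1 = r6 op r7  (false dependency: r1 is reused)
]

for instruction in rename_registers(program):
    print(instruction)
# ('p0', ('r2', 'r3'))
# ('p1', ('p0', 'r5'))
# ('p2', ('r6', 'r7'))
```

After renaming, the second instruction still reads the value produced by the first (the true dependency is preserved), while the third instruction writes a different register and is therefore free to execute independently of the first two.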
A crossbar is a switch that may provide a path between devices and also may steer the instructions. For example, a crossbar may provide paths between a reservation station and multiple execution units and steer instructions to the proper execution unit. The number of wires needed in a crossbar is proportional to the number of execution units multiplied by the number of instructions being steered per cycle. As the size of the crossbar increases, the resistance and capacitance in the crossbar also increase, resulting in timing delays that decrease the rate at which the instructions are steered. As the size of the crossbar decreases, the rate at which the instructions are steered increases. If the instructions are steered prior to the instruction cache, the crossbar between the instruction cache and the execution units may be eliminated or its size significantly reduced, thus resulting in, among other things, a quicker rate at which instructions are steered.
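The proportionality stated above may be illustrated with a short calculation. The function name and the `wires_per_path` constant (the number of wires in each path between an instruction slot and an execution unit) are hypothetical; the example sizes are chosen only for illustration.

```python
def crossbar_wires(num_execution_units, instructions_per_cycle, wires_per_path=1):
    """Wire count proportional to (execution units) x (instructions steered per cycle).

    `wires_per_path` is an assumed proportionality constant.
    """
    return wires_per_path * num_execution_units * instructions_per_cycle


small = crossbar_wires(num_execution_units=4, instructions_per_cycle=2)
large = crossbar_wires(num_execution_units=8, instructions_per_cycle=4)

print(small)  # 8
print(large)  # 32
```

In this sketch, doubling both the number of execution units and the number of instructions steered per cycle quadruples the wire count, which is why reducing or eliminating the crossbar becomes more attractive as the processor widens.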
For the foregoing reasons, there is a need for an instruction cache that supports pre-steered instructions.