I. Field of the Invention
This invention relates generally to computer technology, and more particularly, to improving processor performance in a computer system.
II. Background Information
In high-performance processors, one of the keys to improving performance is reducing execution latency, i.e., the number of clock cycles an instruction takes to execute. One way to reduce execution latency is to use specialized execution units. Each specialized execution unit executes only a subset of the architectural instructions; several different specialized execution units are implemented in conjunction with each other on the microprocessor to cover execution of the entire instruction set. Since each specialized execution unit performs only a small number of functions, that execution unit can operate faster than a fully comprehensive execution unit.
The disadvantage of utilizing specialized execution units is the necessity to steer various instructions to their appropriate execution units. This steering function becomes exponentially more difficult with an increase in the degree of superscalar dispatch (i.e., dispatching multiple instructions simultaneously per clock cycle to the execution units) for the processor. Steering instructions to specialized execution units is handled by a full crossbar. This full crossbar provides a path for each instruction to travel to each execution unit. The number of paths in a full crossbar is proportional to the number of execution units multiplied by the number of instructions being steered per cycle. Depending on the degree of superscalar dispatch the processor employs, the crossbar can become quite cumbersome in terms of the number of routing wires needed and/or silicon area. In addition, for a processor running at high frequencies, it may take several cycles for the instructions to be routed through this extensive crossbar; this increase in pipeline depth lowers processor performance (i.e., if latches are used because an instruction cannot reach its destination in one clock cycle, then the latches contribute to an increase in pipeline depth, which results in a decrease in processor performance).
FIG. 1 shows an example of dispatching instructions to execution units in a prior art processor implementation. In this example, a cache line 158 (cache line 158 may have a dispatch buffer that extends from the cache line) of an instruction cache includes four positions, each of which stores an instruction. A crossbar 152 steers instructions and provides a path between each of the four positions of cache line 158 and each of the specialized execution units 143a-e. Each of the positions of cache line 158 has a path to all specialized execution units 143a-e because any type of instruction may be stored in any of the positions of cache line 158, and thus all positions should have access to all specialized execution units 143a-e in order to dispatch any type of instruction to any of execution units 143a-e.
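The path count of the full crossbar in FIG. 1 can be illustrated with a minimal sketch. The reference numerals come from the description above; modeling positions and execution units as Python lists, and the function name `full_crossbar_paths`, are assumptions made purely for illustration and are not part of the prior art implementation.

```python
from itertools import product

# Labels follow FIG. 1 as described above; the list model is an assumption.
CACHE_LINE_POSITIONS = [0, 1, 2, 3]
EXECUTION_UNITS = ["143a", "143b", "143c", "143d", "143e"]


def full_crossbar_paths(positions, units):
    """Return every (position, unit) pair that a full crossbar must wire."""
    return list(product(positions, units))


paths = full_crossbar_paths(CACHE_LINE_POSITIONS, EXECUTION_UNITS)

# Path count is the number of positions multiplied by the number of units.
print(len(paths))  # 4 positions * 5 units = 20 paths
```

Under these assumptions, adding a fifth cache line position or a sixth execution unit adds a full row or column of paths, which is why the crossbar grows with both the dispatch degree and the number of specialized execution units.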
Each of specialized execution units 143a-e includes a corresponding one of schedule queues 155a-e. Each of the schedule queues 155a-e, among other functions, stores instructions in one or more entries until the instructions can be executed by a particular processing unit within the execution unit (e.g., the processing unit may be an arithmetic logic unit (“ALU”), a memory unit (“MEM”), or a complex operation unit (“CMPLX”)). A write port writes the instructions to the one or more entries (the write ports correspond to the arrows entering a particular one of specialized execution units 143a-e). The number of write ports within the schedule queue depends on the number of instructions that may be dispatched to the execution unit in one clock cycle. In FIG. 1, assuming that the processor employs a degree-4 superscalar dispatch (i.e., four instructions are dispatched simultaneously in one clock cycle), each of the schedule queues 155a-e has four write ports. Here, four write ports are used because in any one clock cycle, up to four instructions may be dispatched to a particular one of specialized execution units 143a-e.
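The write-port requirement described above can be sketched in the same way. The dispatch degree of four and the unit labels come from FIG. 1; the function names `writes_per_unit` and `write_ports_required`, and the choice of unit 143c for the worst case, are hypothetical and serve only to illustrate the reasoning.

```python
from collections import Counter

# Degree-4 superscalar dispatch and unit labels as in FIG. 1.
DISPATCH_WIDTH = 4
EXECUTION_UNITS = ["143a", "143b", "143c", "143d", "143e"]


def writes_per_unit(targets):
    """Count how many instructions dispatched in one cycle land on each unit."""
    if len(targets) > DISPATCH_WIDTH:
        raise ValueError("more instructions than the dispatch width allows")
    return Counter(targets)


def write_ports_required(units, dispatch_width):
    """Without re-ordering, any unit may receive every instruction of a cycle,
    so each schedule queue needs one write port per dispatched instruction."""
    return {unit: dispatch_width for unit in units}


# Worst case: all four instructions of a cycle target the same unit
# (143c is chosen arbitrarily for illustration).
worst_case = writes_per_unit(["143c"] * DISPATCH_WIDTH)
ports = write_ports_required(EXECUTION_UNITS, DISPATCH_WIDTH)
total_ports = sum(ports.values())

print(worst_case["143c"])  # 4 writes to one schedule queue in one cycle
print(total_ports)         # 5 queues * 4 ports = 20 write ports
```

Because the worst case must be provisioned for every schedule queue, the total number of write ports in this sketch scales with the dispatch degree multiplied by the number of execution units, even though most cycles use far fewer.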
If the instructions are re-ordered prior to loading them into the instruction cache, the size of the crossbar and the number of write ports within an execution unit may be significantly reduced, resulting in improved processor performance. For the foregoing reasons, there is a need for re-ordering instructions prior to loading them into an instruction cache.