This invention relates to the field of microprocessor architecture, more particularly to an architecture that makes efficient use of instruction execution units in a multi-cluster system.
Early microprocessors operated at relatively low clock frequencies. As users demanded faster microprocessors, designers responded by increasing the clock frequency. In some designs, the higher clock frequency did not interfere with the correct logical operation of the microprocessor. In other designs, the higher clock frequency caused subsystems in the microprocessor to fail. These failures were addressed in several ways. Some failures were corrected by packing the logic devices more densely on the chip in order to decrease signal path lengths between the logic devices. Others were corrected by implementing the design in a faster technology, such as gallium arsenide. As clock frequencies continued to increase, these strategies became more difficult and costly to implement, and other strategies evolved to satisfy the user's demand for faster microprocessors.
One such strategy involved designing multiple instruction execution units into a single microprocessor. A microprocessor having multiple instruction execution units can execute more instructions per unit of time than a microprocessor having a single instruction execution unit. This strategy evolved to a point where multiple instruction execution units were grouped or clustered to further increase microprocessor performance. However, the performance improvement in these multi-cluster microprocessors comes at the cost of increased complexity in the scheduler, the microprocessor subsystem that routes instructions to the clusters in an attempt to improve the utilization of the instruction execution units. An additional problem arises when the results produced by a first cluster are required for use by a second cluster. In that case, a delay in waiting for the results produced by the first cluster to be available to the second cluster reduces the throughput of the microprocessor.
Referring to FIG. 1, a block diagram of a prior art microprocessor system is shown. Memory 100 is provided for storing instructions. Coupled to memory 100 is instruction fetch 105. The purpose of instruction fetch 105 is to retrieve instructions from memory 100 and present them to scheduler 110. Scheduler 110 routes instructions to either first cluster 115 or second cluster 120. First execution unit 125 and second execution unit 130 are provided for executing instructions routed to first cluster 115. Third execution unit 135 and fourth execution unit 140 are provided for executing instructions routed to second cluster 120. Retirement unit 145 is coupled to the outputs of first cluster 115 and second cluster 120 and couples the architectural state via write back bus 160 to first cluster 115 and second cluster 120. The architectural state is the bit configuration of all the registers in retirement unit 145 at a given time. First cluster fast results bypass 150 is provided to couple the output of first cluster 115 to the input of first cluster 115, for use in first cluster 115, prior to commitment in retirement unit 145. Likewise, second cluster fast results bypass 155 is provided to couple the output of second cluster 120 to the input of second cluster 120, for use in second cluster 120, prior to commitment in retirement unit 145.
In operation, instruction fetch 105 retrieves instructions from memory 100 and delivers the instructions to scheduler 110. Scheduler 110 attempts to route instructions to first cluster 115 and second cluster 120 in a way that provides high utilization of execution units 125, 130, 135, and 140. Unfortunately, when a read instruction is executed in second cluster 120 after a write instruction was executed in first cluster 115, the results of the write instruction are not immediately available to the read instruction, since the results of the write instruction must be fed back to second cluster 120 from the architectural state in retirement unit 145 via write back bus 160.
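The cross-cluster delay described above can be illustrated with a small timing model. This is only a sketch: the cycle counts and the function name operand_ready_cycle are assumptions chosen for illustration, since the description gives no latency figures for the fast results bypasses or for write back bus 160.

```python
# Hypothetical latencies in cycles; the description gives no actual numbers.
BYPASS_LATENCY = 1      # same-cluster fast results bypass (150 or 155)
WRITEBACK_LATENCY = 3   # path through retirement unit 145 and write back bus 160


def operand_ready_cycle(producer_done_cycle, producer_cluster, consumer_cluster):
    """Return the cycle at which a consumer can read a producer's result.

    A result stays on the fast bypass only when producer and consumer sit in
    the same cluster; otherwise it must travel through the retirement unit.
    """
    if producer_cluster == consumer_cluster:
        return producer_done_cycle + BYPASS_LATENCY
    return producer_done_cycle + WRITEBACK_LATENCY


# A write finishing in first cluster 115 at cycle 10, followed by a dependent read.
same_cluster = operand_ready_cycle(10, producer_cluster=1, consumer_cluster=1)
cross_cluster = operand_ready_cycle(10, producer_cluster=1, consumer_cluster=2)
print(same_cluster, cross_cluster)
```

Under these assumed latencies, the dependent read waits two extra cycles when it is routed to the other cluster, which is the throughput loss the prior art arrangement incurs.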
For these and other reasons there is a need for the present invention.
In one embodiment, an apparatus for routing computer instructions comprises a plurality of queues to buffer instructions to a plurality of clusters, a chain affinity unit to store information, and a dispersal unit to route instructions to the plurality of queues based on information to be stored in the chain affinity unit.
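One way to picture such an apparatus is sketched below. The summary does not say what information the chain affinity unit stores or how the dispersal unit uses it, so this sketch assumes the unit records which cluster last produced each register, and that an instruction follows the cluster of its first known source operand. The class names, the Instruction fields, and the shortest-queue fallback are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instruction:
    name: str
    dest: Optional[str] = None        # register written, if any
    sources: Tuple[str, ...] = ()     # registers read


class ChainAffinityUnit:
    """Assumed content: register name -> cluster that last produced it."""

    def __init__(self):
        self._cluster_of = {}

    def lookup(self, reg):
        return self._cluster_of.get(reg)

    def record(self, reg, cluster):
        self._cluster_of[reg] = cluster


class DispersalUnit:
    """Routes instructions to per-cluster queues using the affinity unit."""

    def __init__(self, num_clusters):
        self.queues = [deque() for _ in range(num_clusters)]
        self.affinity = ChainAffinityUnit()

    def route(self, instr):
        cluster = None
        # Follow the dependency chain: prefer the producer's cluster.
        for reg in instr.sources:
            cluster = self.affinity.lookup(reg)
            if cluster is not None:
                break
        if cluster is None:
            # No known producer: fall back to the shortest queue (assumption).
            cluster = min(range(len(self.queues)),
                          key=lambda i: len(self.queues[i]))
        self.queues[cluster].append(instr)
        if instr.dest is not None:
            self.affinity.record(instr.dest, cluster)
        return cluster


unit = DispersalUnit(num_clusters=2)
first = unit.route(Instruction("write r1", dest="r1"))
second = unit.route(Instruction("write r5", dest="r5"))
third = unit.route(Instruction("read r1", dest="r2", sources=("r1",)))
```

In this sketch the dependent read of r1 lands in the same queue as the write that produced r1, so its operand could arrive over that cluster's fast results bypass rather than through the retirement unit.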