Computer technology continues to advance at a remarkable pace, with numerous improvements being made both to the design and architecture of processors--the "brains" of a computer--and to the integrated circuit, or "chip", fabrication techniques used to manufacture such processors.
At the most basic level, a processor includes thousands or millions of transistors coupled together by conductive "wires". The transistors operate as logical switches that are arranged together in a particular manner to perform specific functions. To enable a processor to perform a particular task, the processor must be supplied with a sequence of instructions that form a computer program, which the processor then executes to perform the desired task. To execute an instruction, the processor typically "fetches" the instruction from an external memory, decodes the instruction to determine what function is to be performed by the instruction, and then performs that function to manipulate data stored in the computer. The function is typically carried out by an execution unit in the processor, which is often also referred to as a "pipeline" since instructions typically proceed through a sequence of stages as they are processed.
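The fetch, decode, and execute steps described above can be illustrated with a minimal sketch. The toy machine below is purely hypothetical: the tuple instruction format, the "add" and "mul" opcodes, and the register names are assumptions made for illustration and do not correspond to any real processor.

```python
# Hypothetical toy machine; the instruction format and opcodes are invented.
def run(program, registers):
    """Fetch, decode, and execute each instruction in program order."""
    pc = 0  # program counter: index of the next instruction to fetch
    while pc < len(program):
        instruction = program[pc]               # fetch
        opcode, dest, src1, src2 = instruction  # decode
        if opcode == "add":                     # execute
            registers[dest] = registers[src1] + registers[src2]
        elif opcode == "mul":
            registers[dest] = registers[src1] * registers[src2]
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        pc += 1
    return registers


program = [
    ("add", "r2", "r0", "r1"),  # r2 = r0 + r1
    ("mul", "r3", "r2", "r2"),  # r3 = r2 * r2
]
result = run(program, {"r0": 2, "r1": 3, "r2": 0, "r3": 0})
```

A real processor overlaps these steps across several instructions at once, which is why the execution unit is described as a pipeline; the sketch handles one instruction at a time for clarity.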
Design and architecture improvements affect processor performance principally by increasing the number of instructions that are processed each "cycle" of the processor--that is, the smallest time frame between successive processor operations. For example, one manner of increasing the number of instructions processed each cycle involves dispatching instructions to multiple execution units that operate concurrently with one another, so that a group of execution units is capable of processing multiple instructions in parallel.
Processor performance can also be affected by integrated circuit fabrication techniques, e.g., by decreasing the minimum size of the transistors used in a given processor. Decreasing the size of the transistors permits a larger number of transistors to be used in a given processor design, which increases the number of operations that a processor can perform during any given cycle. For example, additional transistors can be used to implement additional execution units so that greater parallelism may be achieved in a processor.
Decreasing the size of the transistors can also shorten the lengths of the conductive wires connecting the transistors together. Conductive wires have an inherent delay that limits the speed of a processor, in effect setting a minimum duration for each processor cycle. Should the delay in a particular conductive wire exceed the duration of a processor cycle, it is possible that a value transmitted from one end of the wire would not reach the other end of the wire during the same cycle, possibly resulting in an error that could cause unpredictable results or "crash" the processor. The delay of a wire is proportional to its length, and thus, if all conductive wires can be shortened, a shorter processor cycle can often be used, thereby increasing overall processor performance.
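The relationship between wire length and minimum cycle duration can be sketched numerically. The example below assumes the simple linear delay model stated above and ignores transistor switching delays; the function names, the wire lengths, and the delay of 1 ps per micrometer are hypothetical values chosen only for illustration.

```python
def wire_delay_ps(length_um, ps_per_um):
    """Delay of one wire under a linear (delay proportional to length) model."""
    return length_um * ps_per_um


def min_cycle_ps(wire_lengths_um, ps_per_um):
    """Shortest cycle in which every wire can still deliver its value."""
    return max(wire_delay_ps(length, ps_per_um) for length in wire_lengths_um)


lengths_um = [200, 500, 1200]  # hypothetical wire lengths in micrometers
before = min_cycle_ps(lengths_um, ps_per_um=1)
# Halve every wire, as might follow from a smaller transistor size.
after = min_cycle_ps([length // 2 for length in lengths_um], ps_per_um=1)
```

Under these assumptions the longest wire alone sets the minimum cycle, so halving every wire halves the minimum cycle duration, from 1200 ps to 600 ps.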
While transistor size and wire delays are both decreasing as fabrication technologies improve, wire delays are not scaling down proportionately with transistor size. In particular, as a larger number of execution units are used in a design due to reduced transistor sizes, it becomes necessary to communicate information between different execution units using conductive wires coupled between the execution units, a process known as result forwarding. Moreover, the relative lengths, and thus the inherent delays, of these conductive wires typically increase due to the need to route the wires around a larger number of execution units.
For example, some instructions, referred to as "consumer" instructions, must be provided with the results of one or more earlier instructions, referred to as "producer" instructions, before the consumer instructions can be processed. For this reason, consumer instructions are said to be "dependent" upon producer instructions.
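The producer and consumer relationship can be made concrete with a short sketch that finds, for each instruction, the earlier instructions whose results it needs. The representation of an instruction as a (destination register, source registers) pair and the function name find_dependencies are assumptions made for illustration only.

```python
def find_dependencies(program):
    """Map each consumer's index to the indices of its producer instructions."""
    last_writer = {}  # register name -> index of the latest instruction writing it
    deps = {}
    for i, (dest, sources) in enumerate(program):
        # A source register written by an earlier instruction makes this
        # instruction a consumer of that earlier (producer) instruction.
        producers = sorted({last_writer[s] for s in sources if s in last_writer})
        if producers:
            deps[i] = producers
        last_writer[dest] = i
    return deps


program = [
    ("r1", ["r0"]),        # 0: producer of r1
    ("r2", ["r1"]),        # 1: consumer of instruction 0
    ("r3", ["r1", "r2"]),  # 2: consumer of instructions 0 and 1
    ("r4", ["r0"]),        # 3: depends on no earlier instruction
]
deps = find_dependencies(program)
```

Here instruction 0 has two consumers (instructions 1 and 2), and instruction 1 is both a consumer of instruction 0 and a producer for instruction 2.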
If a consumer instruction and a producer instruction are being processed by separate execution units, the relative delay associated with forwarding the result of the producer instruction from the producer instruction's execution unit to that of the consumer instruction can become a performance limiting factor. For this reason, a number of processor designs attempt to minimize the delays associated with result forwarding by grouping related execution units together into "microclusters". A microcluster is typically designed to ensure that the delays associated with the conductive wires that forward results between execution units within the microcluster will not exceed the duration of the processor cycle. Thus, it is ensured that all result forwarding occurring within a particular microcluster will meet the processor cycle requirements.
By grouping related execution units together into microclusters, it is often possible to ensure that a significant portion of the result forwarding occurring within a processor will occur solely within the microclusters. However, with any design, at least a portion of the results may need to be forwarded between execution units in different microclusters--often using conductive wires having inherent delays that exceed the duration of the processor cycle.
One conventional attempt to address this problem is to add one or more latches along the conductive wires between microclusters, in effect breaking each conductive wire up into a sequence of shorter wires having individual delays that meet the processor cycle requirements. However, adding such latches introduces a penalty of one or more processor cycles, which can delay the processing of dependent consumer instructions, erode the gains achieved by using a larger number of execution units, and consequently decrease the performance of the processor.
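The cost of latching a long wire can be estimated with a small calculation. The sketch below assumes that latches split the wire into segments of equal delay, that latch overhead is negligible, and that each latch costs exactly one extra cycle; the function name and the delay values are hypothetical.

```python
def latches_needed(wire_delay_ps, cycle_ps):
    """Fewest latches such that every wire segment fits within one cycle."""
    # Integer ceiling division gives the number of segments required.
    segments = -(-wire_delay_ps // cycle_ps)
    # N segments are separated by N - 1 latches; a wire that already fits
    # within one cycle needs none.
    return max(segments - 1, 0)


cycle_ps = 1000
short_wire = latches_needed(800, cycle_ps)   # fits in one cycle
long_wire = latches_needed(2500, cycle_ps)   # must be split into three segments
```

Under these assumptions the 2500 ps wire needs two latches, so a consumer in another microcluster receives the result two cycles later than it would over an idealized single-cycle wire.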
Another conventional attempt to address this problem is to dispatch dependent instructions to the same execution unit or microcluster to attempt to minimize result forwarding. Different techniques, such as heuristics, may be used to determine where particular instructions should be dispatched. However, such techniques are often limited in scope, e.g., examining only a few instructions immediately following the particular instruction being analyzed, so the predictions made may not be particularly reliable.
Also, it has been found that many producer instructions have more than one consumer. If all consumers for a given producer instruction were dispatched to the same execution unit or microcluster, the efficiency of the processor would often be adversely impacted, since one execution unit or microcluster would be busy while other execution units or microclusters sat idle. The performance of a multi-execution unit processor is maximized whenever all available execution units are busy processing instructions. Thus, simply dispatching all dependent instructions for a given producer instruction to the same execution unit or microcluster would often result in sub-par processor performance.
The alternative to dispatching multiple consumer instructions to the same execution unit or microcluster is to distribute the instructions to different execution units or microclusters to better balance the workload of the processor. However, doing so only increases the frequency of inter-execution unit and inter-microcluster result forwarding, again decreasing processor performance due to the additional cycles needed to forward results between execution units and microclusters.
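The tension between the two dispatch strategies can be illustrated with a deliberately simple cost model. The sketch below assumes one execution unit per microcluster, single-cycle instructions, a producer that runs on cluster 0 during cycle 0, and a fixed forwarding penalty for any consumer placed on another cluster. The function name, its parameters, and the round-robin placement are all hypothetical and are not drawn from any particular processor.

```python
def finish_time(num_consumers, num_clusters, forward_penalty, spread):
    """Cycle at which the last consumer completes under the toy model."""
    producer_done = 1  # producer occupies cluster 0 during cycle 0
    # next_free[c] is the first cycle in which cluster c can start an instruction.
    next_free = [0] * num_clusters
    next_free[0] = producer_done
    finish = producer_done
    for i in range(num_consumers):
        # Either distribute consumers round-robin or keep them with the producer.
        cluster = i % num_clusters if spread else 0
        # The result is usable at once on cluster 0, later on any other cluster.
        ready = producer_done + (forward_penalty if cluster != 0 else 0)
        start = max(ready, next_free[cluster])
        next_free[cluster] = start + 1
        finish = max(finish, start + 1)
    return finish


# Four consumers, two clusters, one-cycle forwarding penalty.
same_small_penalty = finish_time(4, 2, 1, spread=False)
spread_small_penalty = finish_time(4, 2, 1, spread=True)
# Two consumers, two clusters, three-cycle forwarding penalty.
same_large_penalty = finish_time(2, 2, 3, spread=False)
spread_large_penalty = finish_time(2, 2, 3, spread=True)
```

In this model, distributing four consumers with a one-cycle penalty finishes at cycle 4 rather than cycle 5, whereas distributing two consumers with a three-cycle penalty finishes at cycle 5 rather than cycle 3. Neither strategy wins in every case, which is the trade-off described above.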
Therefore, a significant need exists in the art for a manner of maximizing the performance of a multi-execution unit processor by reducing the delays associated with result forwarding for dependent instructions, while balancing the workloads of the various execution units in the processor.