Prior art microprocessor designs employ a variety of architectural features in order to increase instruction processing speed. For instance, superscalar microprocessors employ parallel execution units and are therefore capable of processing multiple instructions within a single clock cycle. Pipelined microprocessors divide the processing of an operation into separate pipe stages and overlap the pipe-stage processing of subsequent instructions to achieve execution throughput of at least one instruction per cycle.
Another prior mechanism by which the speed of a microprocessor can be increased is out of order execution of instructions. Non-data dependent instructions are dispatched to parallel execution units without regard to the order in which the instructions appear in the user code, thus allowing instructions to be executed earlier. One example of such a processor is the Pentium® Pro processor from Intel Corporation of Santa Clara, Calif., the corporate assignee of the present invention. Out of order execution is described in detail in the Pentium Pro Family Developer's Manual, volumes 1-3 (1996), available from Intel Corporation of Santa Clara, Calif.
An example computer system including an out of order execution microprocessor is illustrated in FIG. 1. Instructions and data enter and exit the microprocessor 120 on the system bus 110. Moreover, data is transferred to and from the second level (L2) cache via the system bus 110. The system bus 110 may also be coupled to one or more microprocessors, memory devices, or peripherals.
The bus interface unit 121 of the microprocessor 120 communicates with the system bus 110 and the L2 cache 130. The L2 cache 130 may reside outside of the microprocessor 120, as shown in FIG. 1, or may reside on the same piece of silicon as the microprocessor 120. The first level (L1) instruction cache 122 and data cache 123 are also coupled to the bus interface unit 121 in order to communicate with devices residing on the system bus 110.
Instructions are fetched by the bus interface unit 121 in the form of an instruction stream. The instruction stream is typically a block of data from a memory device residing on the system bus 110. The fetch and decode unit 124 accepts the instruction stream from the bus interface unit 121 and decodes the instructions into a series of micro-operations. These micro-operations are represented as instruction pool 127.
The dispatch and execute unit 125 accepts instructions 127 and resolves data dependencies between them. The instructions are then scheduled to be executed by one or more execution units. Because the microprocessor is an out of order execution engine, the results from the execution units are merely speculative at this point. The retire unit 126 accepts the speculative results of the execution and determines which of the speculative results can "retire," or be committed to the microprocessor architectural state.
The out of order execution microprocessor 120 illustrated in FIG. 1 is capable of executing instructions quickly because several instructions 127 may be simultaneously executed by the dispatch and execute unit 125. FIG. 2 illustrates the arrangement of the execution units within the dispatch and execute unit 125.
The micro-operations from the instruction pool 127 are delivered to ports 220-224. Each port 220-224 represents a bus that is coupled to one or more of the execution units 231-239. Because some ports, for example ports 0 and 1, are coupled to more than one execution unit, there must be a bus arbitration scheme to prevent bus contention while micro-operations are being scheduled to the execution units. One micro-operation may be scheduled to each port during each clock cycle. Thus, for the example shown in FIG. 2, a maximum of five micro-operations may be dispatched in parallel.
The micro-operations are dispatched to the execution units 231-239 when there are no more data dependencies. The micro-operation is no longer data dependent once the required data fields of the micro-operation are ready for execution. Each micro-operation includes fields for two data operands, called "sources," and a "destination" in which to store the result. Both source fields must be ready with data before the micro-operation is dispatched to the execution unit. Dispatch of micro-operations to the execution units will be discussed in more detail below.
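The readiness condition described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `MicroOp` record, its field names, and the use of `None` to mark an operand that has not yet arrived are hypothetical and are not taken from any actual microprocessor implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MicroOp:
    """Hypothetical micro-operation record: two sources and one destination."""
    name: str
    src1: Optional[int] = None  # None marks an operand that has not arrived yet
    src2: Optional[int] = None
    dest: str = ""


def is_ready(uop: MicroOp) -> bool:
    """A micro-operation may be dispatched only when both sources hold data."""
    return uop.src1 is not None and uop.src2 is not None


add = MicroOp("add", src1=3, src2=4, dest="r1")
mul = MicroOp("mul", src1=5, src2=None, dest="r2")  # still waiting on one source

# Only micro-operations with both sources present are candidates for dispatch.
ready = [u.name for u in (add, mul) if is_ready(u)]
print(ready)
```

In this sketch, `mul` would become a dispatch candidate as soon as its missing source is written, for example when an earlier micro-operation produces the value it depends on.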
For the example shown in FIG. 2, port 0 is coupled to an execution unit 231, a floating point execution unit 232 and a multimedia execution unit 233. Execution unit 231 is capable of executing a variety of integer micro-operations, and floating point execution unit 232 is capable of executing floating point micro-operations. Multimedia execution unit 233 is capable of executing micro-operations decoded from multimedia instructions. One example of a multimedia instruction format is the single instruction multiple data (SIMD) format. Thus, three different types of micro-operations (integer, floating point, and multimedia) may be scheduled to port 0. Port 1 is coupled to two execution units 234 and 235, and execution unit 235 is a multimedia execution unit. Ports 2, 3, and 4 are each coupled to only one execution unit. One or more of the execution units coupled to ports 2, 3, and 4 may be an address generation unit (AGU), for executing micro-operations that are used to compute addresses.
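The port arrangement and the one-micro-operation-per-port rule can be modeled as a small Python sketch. Several details here are assumptions made for illustration only: the text does not state the type of the second unit on port 1 (assumed integer), it says only that the units on ports 2, 3, and 4 may be address generation units (all three are assumed to be), and the first-free-port policy in the hypothetical `dispatch_cycle` function stands in for whatever bus arbitration scheme an actual design uses.

```python
# Kinds of micro-operation each port can accept, following the FIG. 2
# description. Assumptions: port 1's first unit is an integer unit, and
# ports 2-4 each hold an address generation unit (AGU).
PORT_KINDS = {
    0: {"integer", "fp", "multimedia"},
    1: {"integer", "multimedia"},
    2: {"agu"},
    3: {"agu"},
    4: {"agu"},
}


def dispatch_cycle(ready_kinds):
    """Assign at most one ready micro-operation to each port for one cycle.

    Returns (dispatched, deferred): dispatched maps port number to the kind
    of micro-operation sent to it; deferred lists the kinds that found no
    free compatible port and must wait for a later cycle.
    """
    dispatched = {}
    deferred = []
    for kind in ready_kinds:
        # Pick the lowest-numbered free port that can execute this kind.
        port = next(
            (p for p, kinds in PORT_KINDS.items()
             if kind in kinds and p not in dispatched),
            None,
        )
        if port is None:
            deferred.append(kind)
        else:
            dispatched[port] = kind
    return dispatched, deferred


ready = ["fp", "integer", "multimedia", "agu", "agu", "agu", "integer"]
dispatched, deferred = dispatch_cycle(ready)
print(dispatched)  # {0: 'fp', 1: 'integer', 2: 'agu', 3: 'agu', 4: 'agu'}
print(deferred)    # ['multimedia', 'integer']
```

With seven ready micro-operations, five are dispatched, one per port, and the remaining two wait because ports 0 and 1 are already occupied in this cycle.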
As mentioned previously, five micro-operations may be dispatched to the execution units at one time, provided the required resources, i.e. the data operands and the specific execution unit, are available. Conceivably, therefore, at times five execution units may operate simultaneously.
One disadvantage of operating multiple execution units simultaneously is that the power consumed by the microprocessor correspondingly increases. The microprocessor power fluctuates at various times, depending on the number of execution units that are operating at one time. The "peak power" is defined as the amount of power that the microprocessor consumes at its peak, which may occur when all of the execution units are operating at one time.
Some execution units consume more power than others. For example, a Pentium Pro processor was tested at an operating voltage of 2.5 volts and a frequency of 200 megahertz (MHz). The integer execution unit consumed approximately 550 milliwatts (mW) of power. The address generation unit consumed approximately 660 mW of power. The floating point unit consumed approximately 1.37 watts of power, roughly 2 to 2.5 times as much as either of the other two types of execution units. Thus it can be appreciated that the power consumed by the microprocessor will fluctuate, depending on the types of execution units that are concurrently executing.
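The arithmetic behind these figures can be worked through in a short Python sketch. The per-unit values are the ones quoted above. Summing them to estimate the combined draw of concurrently active units is a simplifying assumption made here for illustration; it ignores every other part of the microprocessor, and `combined_power_mw` is a hypothetical helper.

```python
# Approximate per-unit power from the measurements quoted above
# (2.5 V, 200 MHz), in milliwatts.
POWER_MW = {"integer": 550, "agu": 660, "fp": 1370}

# Ratio of floating point unit power to each of the other two units.
fp_vs_integer = POWER_MW["fp"] / POWER_MW["integer"]
fp_vs_agu = POWER_MW["fp"] / POWER_MW["agu"]
print(round(fp_vs_integer, 2), round(fp_vs_agu, 2))  # 2.49 2.08


def combined_power_mw(active_units):
    """Sum the power of the listed units (simplification: units only)."""
    return sum(POWER_MW[unit] for unit in active_units)


without_fp = combined_power_mw(["integer", "agu"])
with_fp = combined_power_mw(["integer", "agu", "fp"])
print(without_fp, with_fp)  # 1210 2580
```

Under this simplification, adding the floating point unit to an integer unit and an address generation unit that are already active more than doubles the combined execution unit power, from 1210 mW to 2580 mW.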
Peak power is an important parameter to take into consideration when designing a microprocessor. Many modern computers such as laptops, subcompacts and handhelds rely on battery power. Therefore, to increase battery life, it is essential to reduce the amount of power consumed by a microprocessor. Moreover, microprocessor packaging and cooling devices, such as heat sinks, must be designed with the peak power of the microprocessor in mind. As the peak power of a microprocessor increases, the design of the microprocessor package and the cooling devices becomes correspondingly more complex.
It is therefore desirable to provide a method of dispatching instructions to multiple execution units such that microprocessor peak power is reduced. Moreover, it is further desirable to provide a method of dispatching instructions such that one high-power execution unit is powered down while another is executing.