1. Field of the Invention
The present invention generally relates to the reduction and control of power consumption in a microprocessor or system comprised of a plurality of clocked components or units.
2. Description of the Related Art
Semiconductor technology and chip manufacturing advances have resulted in a steady increase of on-chip clock frequencies, the number of transistors on a single chip, and the chip die size itself, along with a corresponding decrease in chip supply voltage (Vdd). Generally, the active power consumed by a given clocked unit is primarily from switching chip capacitive loads and increases linearly with the clock frequency. Thus, notwithstanding the decrease of chip supply voltage, active chip power consumption has increased as well.
Moreover, independent of operating frequency, chip leakage or standby power increases linearly with the number of chip transistors. Especially for chips and circuits in the insulated gate field effect transistor (FET) technology commonly referred to as CMOS, a substantial portion of chip leakage is subthreshold leakage. Subthreshold leakage is current that flows (drain to source) through the FET channel even when the FET gate to source voltage is insufficient to turn on the FET, i.e., below the threshold voltage (Vt) of the FET. S. Borkar, “Design Challenges of Technology Scaling,” IEEE Micro, vol. 19, no. 4, July/August 1999, pp. 23–29, describes how subthreshold leakage power is increasing as a percentage of total power dissipation. This percentage increase is occurring because as Vdd is falling, Vt must remain more or less constant. Thus, even if chip active power is reduced to zero, i.e., effectively the chip is shut down, subthreshold leakage continues to consume power.
Consequently, both at the chip and system levels cooling and packaging costs have escalated as a natural result of these chip power increases. For low end systems (e.g., handhelds, portable and mobile systems), where battery life is crucial, reducing net power consumption is important; but, it must come without degrading performance to unacceptable levels. Thus, particularly with state of the art central processing units (CPUs), even with the advances in CPU architecture, whether a scalar, superscalar, vector, or some other type of processor, this increase in microprocessor power dissipation has become a major stumbling block for performance gains.
A scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands. Each such operand is a single or atomic data value or number. Pipelining within a scalar processor introduces what is known as concurrency, i.e., processing multiple instructions in a given clock cycle, while preserving the single-issue paradigm.
A superscalar processor can fetch, issue and execute multiple instructions in a given machine cycle. In addition, each instruction fetch, issue and execute path is usually pipelined to enable further concurrency. Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium Pro (P6) processor family from Intel Corporation, the UltraSPARC processors from Sun Microsystems and the PA-RISC and Alpha processor families from Hewlett Packard (HP) Company.
A vector processor typically is pipelined and can perform one operation on an entire array of numbers in a single architectural step or instruction. For example, a single instruction can add each entry of array A to the corresponding entry of array B and store the result in the corresponding entry of array C. Vector instructions are usually supported as an extension of a base scalar instruction set. Only those code sections that can be vectorized within a larger application are executed on the vector engine. The vector engine can be a single, pipelined execution unit; or, it can be organized as an array or single instruction multiple data (SIMD) machine, with multiple, identical execution units concurrently executing the same instruction on different data. For example, typically, Cray supercomputers are vector processors.
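The array-add example above can be sketched in software as follows. This is only an illustrative model, not a description of any particular vector engine; the function names and the "instructions issued" count are assumptions introduced here to contrast the two styles.

```python
def scalar_add(a, b):
    """Scalar style: one add instruction is issued per element."""
    c = []
    issued = 0
    for x, y in zip(a, b):
        c.append(x + y)
        issued += 1  # each element costs one issued instruction
    return c, issued


def vector_add(a, b):
    """Vector style: modeled as ONE architectural instruction over whole arrays."""
    if len(a) != len(b):
        raise ValueError("arrays A and B must have the same length")
    c = [x + y for x, y in zip(a, b)]  # element-wise C[i] = A[i] + B[i]
    return c, 1  # a single vector instruction covers every element


A = [1, 2, 3]
B = [10, 20, 30]
scalar_result, scalar_issued = scalar_add(A, B)
vector_result, vector_issued = vector_add(A, B)
```

Both functions produce the same array C; the difference modeled here is only the number of architectural instructions needed to produce it.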
A synchronously clocked processor or system has a single, global master clock driving all the units or components comprising the system. Occasionally, ratioed derivatives of the clock, e.g., from clock doubling, may cycle a particular sub-unit faster or slower than the main or master clock frequency. Normally by design, such clocking decisions are predetermined and preset statically. For example, the Intel Pentium 4 processor clocks its integer pipe twice as fast as the chip master clock, ostensibly using what is known in the art as double-pumping or wave-pipelining. Such clock doubling techniques boost processor execution rates and performance. However, bus and off-chip memory speeds have not kept pace with the processor computing logic core. So, most state of the art processors have off-chip buses and caches that operate at frequencies that are integral sub-multiples of the main processor clock frequency.
Usually, these clock operating frequencies are fixed during system design. This is the reason current generation processor complexes may have multiple clocking rates. Occasionally, double pumping and wave-pipelining are used in higher end machines to alleviate any performance mismatch between the processor and external buses or memories.
Typically, clock gating is used to reduce active power. A. Chandrakasan and R. Brodersen, ed., “Low-Power CMOS Design,” IEEE Press, 1998, describes power reduction using synchronous clock-gating wherein the clock may be disabled at a point of regeneration, i.e., within a local clock buffer (LCB) feeding a particular chip region, component or latch. At a coarser level of control, clocks are gated along functional boundaries. At a finer level of control, clocks are gated at individual latches. For example, H. Sanchez, “Thermal management system for high performance PowerPC microprocessors,” Digest of Technical Papers, IEEE COMPCON, 1997, teaches gating clocks to different execution units based on instructions dispatched and executed in each cycle.
Coarse idle control can be synthesized during code generation by the compiler inserting special instructions, included in the instruction set architecture; alternately, these instructions can be issued dynamically by the operating system, e.g., when servicing a special interrupt or at certain context-switch times. At the coarsest control level, a special sleep-type instruction or command can be issued; this special sleep command can generate a disable signal that stops the clock to a selected portion of the chip for a period of time. This same special sleep command can be used to disable the instruction fetch process. Likewise, an implicit wake up begins when the disable signal is negated or after the sleep period; or, the wake up can be accomplished with an explicit, asynchronous interrupt. As is well known in the art, various power-down modes can be provided (e.g., nap, doze or sleep) with the clock distribution tree selectively disabled at various levels of the LCB hierarchy. At the next finer level of granularity, the compiler can insert special instructions to start gating off the clock(s) to a given unit, e.g., the floating point unit, whenever the compiler can statically predict the computation phases.
A hardware idle self-detect mechanism may be included. The idle self-detect logic can be designed to detect localized processor idle periods. Upon detection the local unit triggers clock-disabling and/or local supply voltage reduction (Vdd and/or ground) for some or all of the idling unit region(s). Each unit disables its own clock and/or local supply voltages for a period of time. A wake-up is similarly self-initiated, based on new work received by the disabled or sleeping unit.
For finer idle control, dynamically defined signals gate local clocks (but, previously not supply voltages) cycle-by-cycle. For a typical superscalar machine for example, the processor determines during instruction decode which functional unit pipes could be clock-gated during the subsequent execute cycles. This works well in a processor with “in-order” issue mechanisms, so that the gating decision can be made unambiguously and sufficiently ahead of time, i.e., at decode or dispatch time. If the instruction class information is preserved in a centralized issue queue on an entry-by-entry basis, then such gating signals can also be generated at issue time even for an out-of-order issue queue.
In any pipelined data path, redundant clocking can be detected dynamically and selectively prevented, e.g., by propagating a Data Valid flag or bit along the logic pipeline; this Data Valid flag is set only when the data generated on a cycle is valid. Then, the Data Valid flag for each logic stage can be used as a clock enable for setting the stage's output latches. Thus, invalid data is not unnecessarily clocked through the succeeding pipeline stages in what may be referred to as fine-grain, valid-bit based, pipeline stage-level clock gating.
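A minimal behavioral sketch of valid-bit based stage-level clock gating follows. It is a software model under stated assumptions, not a circuit description: the function name `simulate_valid_bit_gating`, the use of `None` to represent a bubble (invalid data), and the per-stage clocking counters are all hypothetical and introduced only for illustration.

```python
def simulate_valid_bit_gating(inputs, num_stages=3):
    """Model a pipeline whose data latches are clocked only when Data Valid is set.

    inputs: list of values, with None standing for a bubble (assumption).
    Returns (outputs, gated_clocks, ungated_clocks), where the two counters are
    the number of data-latch clockings with and without valid-bit gating.
    """
    data = [None] * num_stages
    valid = [False] * num_stages
    gated_clocks = 0
    ungated_clocks = 0
    outputs = []

    # Feed the inputs, then append bubbles so the pipeline drains.
    stream = list(inputs) + [None] * num_stages
    for item in stream:
        # Capture the last stage's contents before they are overwritten.
        if valid[-1]:
            outputs.append(data[-1])
        # Update from the last stage back to the first so that each stage
        # samples the upstream state from the previous cycle.
        for s in range(num_stages - 1, -1, -1):
            in_valid = valid[s - 1] if s > 0 else (item is not None)
            in_data = data[s - 1] if s > 0 else item
            ungated_clocks += 1          # an ungated latch is clocked every cycle
            if in_valid:
                data[s] = in_data        # Data Valid acts as the clock enable
                gated_clocks += 1
            valid[s] = in_valid          # the valid bit itself always propagates
    return outputs, gated_clocks, ungated_clocks


outputs, gated, ungated = simulate_valid_bit_gating([1, None, 2])
```

In this model, each valid item clocks each stage's data latch exactly once, so bubbles and drain cycles contribute nothing to the gated count while still being counted in the ungated one.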
U.S. Pat. No. 6,247,134 B1 to Sproch et al., entitled “Method and System for Pipe Stage Gating Within an Operating Pipelined Circuit for Power Savings,” Jun. 12, 2001, teaches a processor with logic that identifies as inconsequential any newly received operand that would not change the result computed in the prior cycle by the first stage of logic. Detection of such an invariance condition can be used to disable the clock to the first stage and, then, successively to the following stages.
Ohnishi, M., Yamada, A., Noda, H. and Kambe, T. “A Method of Redundant Clocking Detection and Power Reduction at the Rt Level Design,” Proc. Int'l. Symp. On Low Power Electronics and Design (ISLPED), 1997, pp. 131–136, discuss other, more elaborate idle detection mechanisms to prevent various kinds of redundant latch clocking.
Coarse-grain unit-level clock-gating is beneficial in cases when the processor is executing a sequence of a certain functional class of instructions, e.g., integer-only or floating-point-only instructions. When the input workload is such that the processor sees integer code only, the clock regenerator(s) to the floating point unit may be disabled. Similarly, during the floating-point-only operation, clocks to the integer unit can be disabled. Coarse idle control is normally effected locally with software through serial instructions or using hardware to detect idle periods. Fine idle control, normally, is effected also locally during instruction decode by avoiding unnecessarily propagating invalid or inconsequential data.
There are at least two problems arising from coarse idle control that must be addressed. These are especially a concern when supply voltage gating is employed. First, large transient current drops and gains can cause unacceptable levels of inductive (LdI/dt) noise in on-chip supply voltage. Second, overhead cycles are required for gating off and on processes to maintain correct functional operation. Switching between gated and enabled modes too frequently for finer grain phase changes in the workload results in an unacceptable performance hit.
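The first problem can be illustrated with the standard inductor relation V = L·dI/dt. The sketch below assumes a linear current ramp; the inductance, current step and ramp times are hypothetical values chosen only to show how ramp time trades off against noise, and are not taken from any cited reference.

```python
def inductive_noise_volts(inductance_h, delta_current_a, ramp_time_s):
    """Magnitude of V = L * dI/dt for a current change that ramps linearly."""
    if ramp_time_s <= 0:
        raise ValueError("ramp time must be positive")
    return inductance_h * abs(delta_current_a) / ramp_time_s


# Hypothetical numbers for illustration only.
SUPPLY_INDUCTANCE_H = 1e-11   # assumed effective supply-path inductance
CURRENT_STEP_A = 10.0         # assumed current change when a unit is gated

sharp = inductive_noise_volts(SUPPLY_INDUCTANCE_H, CURRENT_STEP_A, 1e-9)
graceful = inductive_noise_volts(SUPPLY_INDUCTANCE_H, CURRENT_STEP_A, 100e-9)
```

Stretching the ramp by a factor of 100 cuts the noise magnitude by the same factor in this model, but the longer ramp is exactly the kind of overhead that the second problem describes.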
Further, state of the art fine idle control relies on locally generated gating signals or conditions for pipeline stage-level clock-gating, e.g., based on a data-invalid or inconsequential-operand condition. These state of the art approaches do not generate the gating signal on a predictive or anticipatory basis. So, the timing requirements are often critical because the gating signal must be available in advance of assertion and asserted for a suitable duration for error-free clock-gating operation. Gowan, M. K., Biro, L. L. and Jackson, D. B., “Power considerations in the design of the Alpha 21264 microprocessor,” Proc. 1998 ACM/IEEE Design Automation Conference, pp. 726–731, (June 1998) discuss how these constraints can significantly complicate design timing analysis, even resulting in a degraded clock-frequency performance. Although clock-gating may, in spite of these problems, reduce average active (or “switching”) power in a processor, it still does not reduce static or standby power.
Instead, supply voltage gating (also called power or Vdd gating) may be used for reducing static or leakage power. Even when a FET or CMOS circuit block is inactive (off), current leakage from Vdd to ground still occurs as subthreshold leakage. So, these CMOS circuits consume power even with clocks disabled or held constant, i.e., high or low. As noted above, this subthreshold leakage component of total power is rising due to technology scaling effects, reducing the gap between Vdd and the device threshold voltage, Vt. Supply voltage gating gates Vdd or ground (GND) to the FET/circuit, eliminating the current flow path. So, an additional “header” or “footer” FET or device is placed in the circuit current flow path from Vdd to ground. The header/footer device is on during normal activity, i.e., when the circuit is gated active. When gated off or idle, the header/footer device is turned off to electrically isolate the complementary FET pair from either supply rail, i.e., the Vdd rail or the GND rail. L. Wei, K. Roy, V. De, “Low Voltage Low Power CMOS Design Techniques for Deep Submicron IC's,” Proc. of IEEE Int'l. Conf. On VLSI Design, January 2000, pp. 24–29, describes a straightforward application of Vdd gating.
However, application of the Roy method may result in a large performance degradation with unreliable circuit operation from large (uncontrolled) surges in the power supply lines and, potentially, an increase in total average power. There are two reasons for the performance degradation. First, adding the supply gating control circuitry increases the basic delay of the gated circuit block because it adds impedance in the circuit's supply path. If the performance overhead is reduced by allowing sharp turn-ons and turn-offs, then LdI/dt noise may be considerable. In any case, supply voltage gating adds some performance overhead, which is the second source of performance degradation. Gating circuits on and off with a graceful ramp-up/down to minimize LdI/dt can incur a delay of as much as several hundred processor cycles in resuming/stopping normal operation. As a result, total average power consumption can actually increase from being gated on and off too frequently. In particular, even if the average utilization is low, average power consumption can increase if the added switching or active power of the gating control device and related circuitry is more than the leakage power saved.
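The break-even condition in the last two sentences can be expressed with a simplified energy model. This is a sketch under assumptions, not the analysis of the cited reference: it treats the leakage power saved while gated off as constant and lumps all switching energy of the gating device and related circuitry into a single per-event overhead. The function names and numeric values are hypothetical.

```python
def net_energy_saved_j(leak_power_w, idle_time_s, gating_overhead_j):
    """Leakage energy saved during one gated-off interval minus gating overhead.

    A negative result means gating this idle interval costs more than it saves.
    """
    return leak_power_w * idle_time_s - gating_overhead_j


def break_even_idle_s(leak_power_w, gating_overhead_j):
    """Shortest idle interval for which one off/on gating event pays for itself."""
    if leak_power_w <= 0:
        raise ValueError("leakage power must be positive")
    return gating_overhead_j / leak_power_w


# Hypothetical values for illustration only.
LEAK_POWER_W = 0.5          # assumed leakage power of the gated unit
GATING_OVERHEAD_J = 1e-6    # assumed energy to switch the unit off and back on

threshold_s = break_even_idle_s(LEAK_POWER_W, GATING_OVERHEAD_J)
short_idle = net_energy_saved_j(LEAK_POWER_W, 1e-6, GATING_OVERHEAD_J)
long_idle = net_energy_saved_j(LEAK_POWER_W, 4e-6, GATING_OVERHEAD_J)
```

Under this model, gating idle intervals shorter than the break-even threshold increases total energy, which is why gating on and off too frequently can raise average power even when utilization is low.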
Current gating methods and, especially, supply gating methods, whether coarse- or fine-grain, are termed non-predictive. The typical voltage gating signal is generated locally based on events and logical conditions that are tracked within a temporal window of a few cycles.
Thus, there exists a need for gated power supply designs that are able to hide the currently large performance overheads, especially in processor designs, without impaired circuit reliability such as from increased inductive noise on the supply voltage rails.