1. Field of the Invention
The present invention generally relates to the reduction and control of power consumption in a microprocessor or system comprised of a plurality of clocked components or units.
2. Description of the Related Art
Advances in semiconductor technology and chip manufacturing have resulted in a steady increase in on-chip clock frequencies, the number of transistors on a single chip, and the die size itself, along with a corresponding decrease in chip supply voltage. Generally, the power consumed by a given clocked unit increases linearly with the frequency of switching within it. Thus, notwithstanding the decrease in chip supply voltage, chip power consumption has increased as well. At both the chip and system levels, cooling and packaging costs have escalated as a natural result of this increase in chip power. At the low end (e.g., handheld, portable and mobile systems), where battery life is crucial, net energy reduction is important, provided that performance is not degraded to unacceptable levels. Thus, the increase in microprocessor power dissipation has become a major stumbling block for future performance gains.
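The linear dependence of power on switching frequency can be illustrated with the standard first-order CMOS switching-power relation, P = a * C * Vdd^2 * f. The following is a minimal sketch only; the function name and all numeric values (activity factor, capacitance, voltages, frequencies) are assumptions chosen for illustration and are not taken from any particular processor.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS switching power in watts: P = a * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Illustrative (assumed) operating points.
base = dynamic_power(0.2, 1e-9, 1.2, 1e9)       # 1.2 V at 1 GHz
doubled_f = dynamic_power(0.2, 1e-9, 1.2, 2e9)  # same voltage, twice the frequency
lower_v = dynamic_power(0.2, 1e-9, 0.9, 2e9)    # reduced voltage, twice the frequency

print(doubled_f / base)  # power scales linearly with frequency
print(lower_v / base)    # voltage reduction offsets only part of the increase
```

With these assumed values, lowering the supply voltage from 1.2 V to 0.9 V does not fully offset a doubling of frequency, which is consistent with the trend described above.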
A scalar processor fetches and issues/executes one instruction at a time. Each such instruction operates on scalar data operands. Each such operand is a single or atomic data value or number. Pipelining within a scalar processor introduces what is known as concurrency, i.e., processing multiple instructions in a given clock cycle, while preserving the single-issue paradigm.
A superscalar processor can fetch, issue and execute multiple instructions in a given machine cycle. In addition, each instruction fetch, issue and execute path is usually pipelined to enable further concurrency. Examples of superscalar processors include the Power/PowerPC processors from IBM Corporation, the Pentium Pro (P6) processor family from Intel Corporation, the UltraSPARC processors from Sun Microsystems, the PA-RISC processors from Hewlett Packard Company (HP), and the Alpha processor family from the erstwhile Compaq Corporation (now merged with HP).
A vector processor typically is pipelined and can perform one operation on an entire array of numbers in a single architectural step or instruction. For example, a single instruction can add each entry of array A to the corresponding entry of array B and store the result in the corresponding entry of array C. Vector instructions are usually supported as an extension of a base scalar instruction set. Only those code sections that can be vectorized within a larger application are executed on the vector engine. The vector engine can be a single, pipelined execution unit, or it can be organized as an array or single instruction multiple data (SIMD) machine, with multiple, identical execution units concurrently executing the same instruction on different data. For example, Cray supercomputers typically are vector processors.
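The array-addition example above can be sketched in software as follows. This is only a model of the architectural semantics, not of any hardware: the function names and the instruction counts they return are illustrative assumptions. The scalar version issues one add per element, whereas the vector version is treated as a single architectural instruction over the whole array.

```python
def scalar_add(a, b):
    """Scalar model: one add instruction is issued per array element."""
    c = []
    instructions_issued = 0
    for x, y in zip(a, b):
        c.append(x + y)
        instructions_issued += 1
    return c, instructions_issued


def vector_add(a, b):
    """Vector model: a single instruction produces C[i] = A[i] + B[i] for all i."""
    if len(a) != len(b):
        raise ValueError("vector operands must have the same length")
    c = [x + y for x, y in zip(a, b)]
    return c, 1  # counted as one architectural instruction


A = [1, 2, 3, 4]
B = [10, 20, 30, 40]
scalar_result, scalar_count = scalar_add(A, B)
vector_result, vector_count = vector_add(A, B)
```

Both versions produce the same array C; they differ only in the number of architectural instructions counted.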
A synchronously clocked processor or system has a single, global master clock driving all the units or components comprising the system. Occasionally, ratioed derivatives of the clock may cycle a particular sub-unit faster or slower than the main or master clock frequency. Normally, by design, such clocking decisions are predetermined and preset statically. For example, the Intel Pentium 4 processor clocks its integer pipe twice as fast as the chip master clock, ostensibly using what is known in the art as double-pumping or wave-pipelining. Such clock doubling techniques boost processor execution rates and performance. However, bus and off-chip memory speeds have not kept pace with the speed of the processor's core computing logic. So, most state-of-the-art processors have off-chip buses and caches that operate at frequencies that are integral sub-multiples of the main processor clock frequency. Usually, these clock operating frequencies are fixed during system design. This is the reason current generation processor complexes may have multiple clocking rates. Occasionally, double pumping and wave-pipelining are used in higher end machines to alleviate any performance mismatch between the processor and external buses or memories.
Rabaey, Jan M. and Pedram, Massoud, ed., Low Power Design Methodologies, (Kluwer Academic Publishers, 1996) describes power reduction using synchronous clock-gating, wherein the clock may be disabled at a point of regeneration, i.e., within a local clock buffer (LCB) feeding a particular chip region, component or latch. At a coarser level of control, clocks are gated along functional boundaries. At a finer level of control, clocks are gated at individual latches. For example, Gerosa et al., “A 2.2 W, 80 MHz, superscalar RISC microprocessor,” IEEE Journal of Solid State Circuits, vol. 29, no. 12, Dec. 1994, pp. 1440–1454, teach gating clocks to different execution units based on instructions dispatched and executed in each cycle.
Coarse-grain, unit-level clock-gating is beneficial when the processor is executing a sequence of instructions from a single functional class, e.g., integer-only or floating-point-only instructions. When the input workload is such that the processor sees integer code only, the clock regenerator(s) to the floating point unit may be disabled. Similarly, during floating-point-only operation, clocks to the integer unit can be disabled. This can save a considerable amount of chip power. Coarse idle control normally is effected locally, either with software through serial instructions or with hardware that detects idle periods. Fine idle control normally is also effected locally, during instruction decode, by avoiding unnecessary propagation of invalid or inconsequential data. A causal flow of gating-control information from its initial point of origin to downstream stages or units is referred to as feed-forward flow. Such a flow path may include loops, with apparent backward flow, but the cause-to-effect information flow is still deemed to be a feed-forward process. Thus, both coarse and fine idle control are self-triggered, feed-forward mechanisms.
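The hardware variant of coarse idle control described above can be sketched as a simple idle-period detector. The sketch below is an assumption-laden model rather than any published design: the function name, the two-unit machine ("int" and "fp"), and the idle_threshold parameter are hypothetical, and the overhead cycles needed to gate a unit off and on are ignored.

```python
def coarse_clock_gating(stream, idle_threshold=2):
    """Return per-cycle clock enables for the 'int' and 'fp' units.

    stream: instruction class dispatched in each cycle ('int', 'fp' or None).
    A unit's clock is disabled once the unit has been idle for
    idle_threshold consecutive cycles (hypothetical parameter) and is
    re-enabled as soon as an instruction of its class is dispatched.
    """
    units = ("int", "fp")
    idle_cycles = {u: 0 for u in units}
    schedule = []
    for op in stream:
        enables = {}
        for u in units:
            if op == u:
                idle_cycles[u] = 0    # unit is busy this cycle
            else:
                idle_cycles[u] += 1   # unit saw no work this cycle
            enables[u] = idle_cycles[u] < idle_threshold
        schedule.append(enables)
    return schedule


# Integer-only phase followed by a floating-point-only phase.
stream = ["int", "int", "int", "int", "fp", "fp", "fp"]
schedule = coarse_clock_gating(stream)
fp_gated_cycles = sum(1 for e in schedule if not e["fp"])
int_gated_cycles = sum(1 for e in schedule if not e["int"])
```

In this model the gating decision for each unit depends only on that unit's own recent activity, which is the self-triggered, locally generated behavior attributed above to coarse idle control.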
Using downstream pipeline stall signals to regulate feed-forward flow, on the other hand, constitutes a feedback control system. Here, control information flows from downstream “effect” to upstream “cause.” Coarse and fine grain stall control are used primarily to prevent over-writing of valid, stalled data in the pipelined processor; but such mechanisms can also be used to reduce power consumption. For example, Jacobson et al., “Synchronous interlocked pipelines,” IEEE ASYNC-2002 conference, April 2002, propose a fine-grain stall propagation mechanism for reducing power in synchronous pipelines; this complements the more conventional, fine-grain feed-forward mechanism of clock-gating using “valid” bits, as in Gerosa et al. referred to earlier; see also, Gowan et al., “Power considerations in the design of the Alpha 21264 microprocessor,” Proc. 1998 ACM/IEEE Design Automation Conference, pp. 726–731 (June 1998). Published fine-grain stall gating (feedback) mechanisms, however, as in Jacobson et al., are not used to control information flow rates (via clock or bus bandwidth throttling) as in the present invention.
There are at least two problems arising from coarse idle control that must be addressed. First, large transient current drops and gains can cause unacceptable levels of inductive (Ldi/dt) noise in the on-chip supply voltage. Second, overhead cycles are required for the gating off and gating on processes to maintain correct functional operation. Switching between gated and enabled modes too frequently, in response to finer grain phase changes in the workload, results in an unacceptable performance hit.
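The first problem follows from the basic inductor relation V = L * di/dt: the faster a given current change occurs, the larger the induced supply noise. The following sketch uses assumed, purely illustrative values for the supply inductance, the current step, and the transition times; none of them comes from a real design.

```python
def inductive_noise_v(inductance_h, delta_i_a, delta_t_s):
    """Magnitude of the induced supply noise in volts: V = L * (di/dt)."""
    return inductance_h * delta_i_a / delta_t_s


# Assumed values: 10 pH effective supply inductance, 5 A current step.
abrupt = inductive_noise_v(10e-12, 5.0, 1e-9)    # unit gated off or on within 1 ns
gradual = inductive_noise_v(10e-12, 5.0, 10e-9)  # same step spread over 10 ns

print(abrupt, gradual)
```

Under these assumptions, spreading the same current change over ten times the interval reduces the induced noise by a factor of ten, which is why abrupt, all-at-once gating of a large unit is the problematic case.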
Further, state-of-the-art fine idle control relies on locally generated gating signals or conditions for pipeline stage-level clock-gating, e.g., based on a data-invalid or inconsequential-operand condition. These state-of-the-art approaches do not generate the gating signal on a predictive or anticipatory basis. So, the timing requirements are often critical, because the gating signal must be available in advance of assertion and must be asserted for a suitable duration for error-free clock-gating operation. Gowan, M. K., Biro, L. L. and Jackson, D. B., “Power considerations in the design of the Alpha 21264 microprocessor,” Proc. 1998 ACM/IEEE Design Automation Conference, pp. 726–731, (June 1998) discuss how these constraints can significantly complicate design timing analysis, even resulting in degraded clock-frequency performance.
Whether the basic control mechanism is feed-forward (cause-to-effect flow) or feedback (effect-to-cause flow), state-of-the-art clock-gating techniques, whether coarse or fine, also provide spatial control only. This is because utilization information is used to eliminate redundant clocking in the affected region(s) without regard to temporal activity or history in the region(s) or elsewhere in the machine. Activity states and events in downstream (consumer) units and stages (e.g., execution pipes or issue queues) are not fed back to adjust upstream (producer) clocking or information flow rates in non-adjacent regions (e.g., instruction fetch or dispatch units). Similarly, activity states and events in upstream producer regions are not fed forward to adjust the downstream consumer clocking or information flow rates. Also, gating off clock signals typically is all or nothing: the clock signal is either enabled or disabled.
Thus, there exists a need for improved clock control for connected pipelined units that can operate at a fine-grain spatial and temporal granularity, without incurring a performance (overhead) penalty and without large current/voltage swings to the underlying circuits.