1. Field of the Invention
2. Description of the Related Art
Semiconductor technology and chip manufacturing advances have resulted in a steady increase in on-chip clock frequencies, the number of transistors on a single chip and the die size itself, accompanied by a corresponding decrease in chip supply voltage. Generally, the power consumed by a given clocked unit (e.g., a latch, register, register file, functional unit, etc.) increases linearly with the frequency of switching within the unit. Thus, notwithstanding the decrease in chip supply voltage, chip power consumption has increased as well. In current microprocessor designs, over 70% of the power consumed is attributable to the clock alone. Typically, over 90% of this clock power is consumed in local clock splitters/drivers and latches.
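The scaling described above can be illustrated with the standard first-order model of CMOS switching power, in which power is proportional to switched capacitance, the square of the supply voltage and the switching frequency. This is a minimal sketch only: the model is a textbook approximation rather than anything stated in this description, and the function name and all numeric values are hypothetical. The last lines combine the two percentages quoted above into a lower bound on the share of total chip power consumed in local clock splitters/drivers and latches.

```python
def dynamic_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order CMOS switching power in watts: a * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical operating points, chosen only to show the scaling behavior.
base = dynamic_power(0.5, 1e-9, 2.0, 500e6)
faster = dynamic_power(0.5, 1e-9, 2.0, 1e9)    # frequency doubled
lower_v = dynamic_power(0.5, 1e-9, 1.0, 1e9)   # supply voltage halved

print(faster / base)     # power scales linearly with frequency
print(lower_v / faster)  # power scales with the square of the supply voltage

# Shares quoted in the text: over 70% of chip power goes to the clock, and
# over 90% of that goes to local clock splitters/drivers and latches.
clock_share = 0.70
local_share = 0.90
local_clock_fraction = clock_share * local_share
print(local_clock_fraction)  # lower bound on the share of total chip power
```

Under this model a lower supply voltage reduces power per switched node, but higher frequencies and more switched capacitance (more transistors, larger die) can more than offset the saving.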
At both the chip and system levels, cooling and packaging costs have escalated as a natural result of this increase in chip power. For low-end systems (e.g., handheld, portable and mobile systems), it is crucial to reduce net energy consumption to extend battery life, but without degrading performance to unacceptable levels. Thus, the increase in microprocessor power dissipation has become a major stumbling block for future performance gains.
Accordingly, clock gating techniques that selectively stop functional unit clocks have become the primary approach to reducing clock power. Typically, clock gating is applied in an ad hoc fashion, which makes verification and clock skew management difficult. These difficulties are not expected to abate with ever larger and more complex designs unless a clearly defined and structured clock gating approach is developed.
A typical state of the art synchronous pipeline includes multiple stages, at least some of which may be separated by logic. Each stage includes an N-latch register, with at least one latch for each data bit propagating down the pipeline, and all of the stages are synchronously clocked by a single global clock. A simple example of a pipeline is a first-in first-out (FIFO) register. A FIFO is an M stage by N bit register file, typically used as an M-clock cycle delay. Each cycle, the FIFO receives an N-bit word from input logic and passes an M-cycle old, N-bit word to output logic. On each clock cycle (i.e., every other rising or falling clock edge), each N-bit word in the FIFO advances one stage. Typical examples of much more complex synchronous pipelines include state of the art microprocessors or functional units (e.g., an I-unit or an E-unit) within a state of the art microprocessor.
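The FIFO behavior described above can be sketched as a simple behavioral model. This is an illustration under stated assumptions, not a description of any particular circuit: the class name Fifo, the method name clock and the choice of an all-zero initial state are hypothetical.

```python
from collections import deque


class Fifo:
    """Behavioral model of an M stage by N bit FIFO (hypothetical helper)."""

    def __init__(self, stages, width):
        self.mask = (1 << width) - 1
        # Index 0 is the input side; the last entry feeds the output logic.
        # Assumption: every stage starts out holding zero.
        self.regs = deque([0] * stages, maxlen=stages)

    def clock(self, word_in):
        """Advance every word one stage; return the word leaving the FIFO."""
        word_out = self.regs[-1]
        # With maxlen set, appendleft drops the word at the output end.
        self.regs.appendleft(word_in & self.mask)
        return word_out


fifo = Fifo(stages=4, width=8)
outputs = [fifo.clock(word) for word in range(1, 9)]
print(outputs)  # [0, 0, 0, 0, 1, 2, 3, 4]
```

With M = 4, the word presented on the first cycle emerges on the fifth, i.e., each word leaves the FIFO M cycles after it entered.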
Traditionally, synchronous pipelines have been stalled globally, where all stages of either the entire pipeline, or a multistage unit, are stalled at the same time. However, cycle time and switching current constraints limit the number of stages that can be stalled during the same cycle, so larger pipelines must be stalled progressively, one group of stages at a time. A difficulty with progressively stalling synchronous pipelines is that data is lost at stall boundaries. Further, as wire delays increase and become a concern, propagating a stall signal throughout a unit or between units, for example, may cause excessive signal delay, both from long wires and signal buffering requirements. Heretofore, achieving local clock gating based on stall conditions has not been possible because stalled data may be overwritten by data progressing through the pipeline from an earlier stage.
FIG. 1A shows an example of a four stage portion of a synchronous pipeline 10 (e.g., in the middle of a FIFO or in a microprocessor) with stages 12, 14, 16, 18 holding data items D, C, B, A, respectively. A stall boundary 20 indicates a point in the pipeline 10 where, because of placement and cycle time constraints, the next clock edge arrives at upstream stages before stall signal 22, thus providing insufficient time to disable the clock at those upstream stages. The stall signal 22 reaches downstream stage 16 and subsequent stages (not shown) with sufficient disable time, and those stages halt correctly. However, because stages 12, 14 and any other stages upstream of the boundary 20 do not receive the stall signal in time, they incorrectly latch new data on the clock edge, potentially losing data that should be held there. So, in this example, stages 16 and 18 are stalled, trapping data items B and A, respectively. Stages 12 and 14, however, do not see the stall signal in time and therefore latch data items E and D, respectively, in the next clock cycle. Consequently, data item C is overwritten and lost instead of being trapped in stage 14.
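The data loss in the FIG. 1A example can be reproduced with a small behavioral sketch. The function name clock_edge and the list-based representation are hypothetical; the stage contents and the stall pattern follow the example above, with index 0 standing for stage 12 and index 3 for stage 18.

```python
def clock_edge(stages, new_word, stalled):
    """Return the stage contents after one clock edge.

    stages[0] is the most upstream stage. stalled[i] is True when stage i
    received the stall signal in time and therefore holds its data.
    """
    nxt = list(stages)
    for i in range(len(stages)):
        if stalled[i]:
            continue  # gated clock: the stage keeps its word
        nxt[i] = new_word if i == 0 else stages[i - 1]
    return nxt


pipeline = ["D", "C", "B", "A"]        # stages 12, 14, 16, 18
stalled = [False, False, True, True]   # only stages 16 and 18 see the stall
after = clock_edge(pipeline, "E", stalled)
lost = sorted(set(pipeline) - set(after))

print(after)  # ['E', 'D', 'B', 'A']
print(lost)   # ['C']
```

Stage 14 latches D from stage 12 while stage 16 still holds B, so C has nowhere to go and is overwritten.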
FIG. 1B shows a traditional approach to handling progressive stalls, wherein buffer stages 23 (often referred to as staging latches) are inserted in parallel to the pipeline at selected stall boundaries, e.g., 20. During a stall, the staging latches 23 temporarily store data that would otherwise be overwritten. Unfortunately, because staging latches 23 add area, power, and delay overhead, stalls have traditionally been performed at a coarse level, i.e., staging latches are placed only at predicted stall boundaries. However, as noted above for globally propagated stall signals, increased wire delays, increased load on the stall signal from the additional latches needed for deeper pipelines (more stages), and demand for shorter cycle time combine to restrict how far the stall signal can propagate before it impacts cycle time. So, providing staging latches at a finer granularity, e.g., for stalling stage by stage, would introduce enough extra buffer stages to double the number of latches in a pipeline. Clearly, the added staging latch area and power, as well as the increased chip complexity, render this solution impractical at other than a very coarse granularity.
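The role of a staging latch at a stall boundary can be sketched in the same behavioral style. This is an assumption-laden illustration, not the circuit of FIG. 1B: the function name clock_edge_with_staging and the boundary parameter are hypothetical, and the sketch models only the capture of the endangered word, not how it is later returned to the pipeline when the stall is released.

```python
def clock_edge_with_staging(stages, new_word, stalled, boundary):
    """One clock edge with a staging latch in parallel at a stall boundary.

    boundary is the index of the first stage downstream of the stall
    boundary. Returns (next_stage_contents, staging_latch_contents).
    """
    staging = None
    # The endangered word sits in the last stage upstream of the boundary:
    # that stage is about to latch new data while its successor is stalled.
    if stalled[boundary] and not stalled[boundary - 1]:
        staging = stages[boundary - 1]
    nxt = list(stages)
    for i in range(len(stages)):
        if not stalled[i]:
            nxt[i] = new_word if i == 0 else stages[i - 1]
    return nxt, staging


pipeline = ["D", "C", "B", "A"]        # stages 12, 14, 16, 18
stalled = [False, False, True, True]   # stall boundary between 14 and 16
after, staging = clock_edge_with_staging(pipeline, "E", stalled, boundary=2)
preserved = set(pipeline) <= set(after) | {staging}

print(after, staging)  # ['E', 'D', 'B', 'A'] C
print(preserved)       # True
```

Each boundary protected this way costs one additional N-bit register, which is why protecting every stage boundary would roughly double the latch count.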
Thus, there exists a need for fine grained, pipeline stage level clock gating for synchronous pipelines, in which the decision whether to gate the clock can be made locally at each stage rather than at the global level, while avoiding costly extra buffers.