1. Field of the Invention
This invention relates to a self-timed and self-enabled distributed clock. More particularly, this invention relates to a self-timed and self-enabled distributed clock for a pipeline-processing unit of a microprocessor.
2. Brief Description of the Related Technology
In general, microprocessors (processors) achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle. The term "clock cycle" refers to an interval of time accorded to various stages of a processing pipeline within the microprocessor. Storage devices (e.g., registers and arrays) capture their values according to a rising or falling edge of a clock signal defining the clock cycle. The storage devices store the values until a subsequent rising or falling edge of the clock signal, respectively. The phrase "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may include any number of stages, where each stage processes at least a portion of an instruction, instruction processing generally includes the steps of: decoding the instruction, fetching data operands, executing the instruction, and storing the execution results in the destination identified by the instruction.
One problem inherent in a pipeline processing unit is stalling. If a subsequent pipeline stage generates a stall condition, then all previous pipeline stages must stall and retain all data until the stall condition is resolved and removed. The stall condition is derived from the unavailability of resources normally associated with a speed-path circuit. Furthermore, the stall signal, which indicates the stall condition, must traverse all previous pipeline stages to have the previous stages conditionally stall and refresh their data. The physical routing of the stall signal and associated "stall" logic increases the clock cycle time and decreases the clock frequency for the entire pipeline.
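By way of illustration only, the stall behavior described above can be sketched with a small software model. The function name `step`, the list-of-stages representation, and the use of `None` to denote an empty stage (a "bubble") are assumptions made for this sketch and are not part of any described circuit.

```python
def step(stages, new_input, stall_at=None):
    """Advance a modeled pipeline by one clock cycle.

    stages[i] holds the item currently in stage i (None means empty).
    Stage 0 is the first stage; the last list entry is the final stage.
    If stall_at is k, stage k and every previous stage retain their data,
    later stages keep advancing, and an empty slot enters stage k + 1.
    Returns (next_stages, retired_item, input_accepted).
    """
    n = len(stages)
    if stall_at is None:
        # No stall: every item moves forward one stage.
        return [new_input] + stages[:-1], stages[-1], True

    next_stages = []
    for i in range(n):
        if i <= stall_at:
            next_stages.append(stages[i])      # stalled: hold current data
        elif i == stall_at + 1:
            next_stages.append(None)           # nothing arrives from stage k
        else:
            next_stages.append(stages[i - 1])  # downstream stages advance
    # If the final stage itself is stalled, nothing leaves the pipeline.
    retired = None if stall_at == n - 1 else stages[-1]
    return next_stages, retired, False


if __name__ == "__main__":
    pipeline = ["I3", "I2", "I1", "I0"]   # I0 is the oldest instruction
    print(step(pipeline, "I4"))              # normal cycle
    print(step(pipeline, "I4", stall_at=1))  # stage 1 raises a stall
```

In this model a stall raised at stage 1 forces stage 0 to hold as well, which mirrors the requirement that the stall signal reach every previous stage within the same cycle.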
Because processor clock frequencies are expected to reach the gigahertz range by the end of the century, clock skew and jitter may account for up to 15% of the clock cycle period, leaving 85% of the period for computation logic. Clock skew is the difference in arrival times of clock edges at different parts of the circuit. Clock skew bounds the minimum and maximum delay through computation logic. Interconnection delays do not scale linearly with increasing clock frequencies, with the result that clock skew takes an increasingly large portion of the useful clock cycle. The clock signal is also a major contributor to power consumption and noise, especially when there is no other activity. The clock can account for up to 25% of the total power consumption.
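As a worked example of the figures above, the following sketch computes how much of a clock period remains for computation logic once skew and jitter are deducted. The function name `cycle_budget_ps` and the choice of 1 GHz as the sample frequency are assumptions for illustration; the 15% overhead is the upper bound stated above.

```python
def cycle_budget_ps(clock_hz, overhead_fraction=0.15):
    """Split one clock period into overhead and usable time, in picoseconds.

    overhead_fraction is the share of the period lost to clock skew and
    jitter (0.15 corresponds to the "up to 15%" figure in the text).
    Returns (period_ps, overhead_ps, usable_ps).
    """
    period_ps = 1e12 / clock_hz
    overhead_ps = period_ps * overhead_fraction
    return period_ps, overhead_ps, period_ps - overhead_ps


if __name__ == "__main__":
    period, overhead, usable = cycle_budget_ps(1e9)  # assumed 1 GHz clock
    print(f"period {period:.0f} ps, overhead {overhead:.0f} ps, "
          f"usable {usable:.0f} ps")
```

At an assumed 1 GHz clock the period is 1000 ps, so a 15% overhead leaves roughly 850 ps for logic.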
The harmonic noise over the spectrum of frequencies is related to the clock frequency of a synchronous clock. Electromagnetic interference (EMI) is becoming increasingly important in the portable communications marketplace, both from a technical perspective, with interference in analog, memory, and RF components, and from a regulatory perspective, with increasingly rigorous EMI compliance specifications and testing. The logic gates in an integrated circuit, or chip, are forced to switch states simply because they are driven by a synchronous clock and not because they are performing any useful operation. In power-down mode at high clock frequency, special logic is used to gradually decrease or increase the clock frequency to the lowest level or the highest level, respectively. This is done because instantaneously turning the high-frequency clock on or off causes a power surge that would cause failure in CMOS devices. Time is wasted in ramping the frequency down and up. These problems are mostly solved with asynchronous processing design.
Asynchronous processing design, however, inherently has another associated set of problems, mostly related to verification, testing, and marketing. An advantage of a synchronous clock design is that all components start together and generate output in a predetermined and predictable fashion. It is much easier to verify a synchronous design. For an asynchronous design, if each component is working at its own pace, the verification process is very difficult. The outputs of the processor are not deterministic due to actual silicon process variations. Additionally, since the gate delay varies based on the process technology, it is difficult to verify and test an output. A glitch in an asynchronous design can cause an incorrect output, whereas in a synchronous design the state of a signal matters only at the next clock edge.
One prior art approach to asynchronous processing design is the design technique used on Advanced RISC Machines (ARM) processors at the University of Manchester, United Kingdom. The asynchronous ARM design technique uses a request-and-acknowledge handshake protocol for synchronization between processing blocks. This technique requires several logic gate delays between the blocks for the handshake protocol. The ARM technique arguably does not show an improvement in performance over synchronous designs, but it shows an advantage over synchronous designs in reducing power dissipation.
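By way of illustration only, a request-and-acknowledge handshake between two blocks can be sketched as follows. The text above does not state which signaling convention the prior art design uses; this sketch assumes a four-phase (return-to-zero) convention, and the names `four_phase_transfer`, `req`, and `ack` are hypothetical.

```python
def four_phase_transfer(items):
    """Model a sender passing items to a receiver over req/ack wires.

    Assumes a four-phase handshake: req rises, ack rises, req falls,
    ack falls. Returns the items received and the ordered list of
    signal transitions as (signal_name, new_level) pairs.
    """
    req = ack = 0
    received = []
    trace = []
    for item in items:
        data = item                     # sender drives the data lines
        req = 1                         # phase 1: sender asserts request
        trace.append(("req", req))
        received.append(data)           # receiver latches the data
        ack = 1                         # phase 2: receiver acknowledges
        trace.append(("ack", ack))
        req = 0                         # phase 3: sender releases request
        trace.append(("req", req))
        ack = 0                         # phase 4: receiver releases ack
        trace.append(("ack", ack))
    return received, trace


if __name__ == "__main__":
    got, events = four_phase_transfer(["op_a", "op_b"])
    print(got)
    print(len(events), "signal transitions")
```

Under this assumed convention each transfer costs four signal transitions between the blocks, which is consistent with the observation that the handshake adds several logic gate delays per transfer.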
Traditionally, a two-phase synchronous clock design has been used in processor design. In a two-phase clock design, one phase is normally used for precharging while the other is used for processing, which is a very convenient method for designing with dynamic circuits. However, in a two-phase clock design, the penalty due to the clocks is worse at high clock frequencies because clock skew and jitter occur during both phases of a cycle and the two phases must be non-overlapping.
A single-edged clock design is better suited for a higher-performance, higher-speed processor. If the second edge is needed within a functional block, it is generated locally. It is more efficient for the functional block to generate its own phase for the required logical function. Multiple clock edges can be created for blocks with many cascaded dynamic (precharge-discharge) circuits. However, the design of multiple clock edges within a clock cycle can become complex and can require redesigning the clock edges whenever the logic is modified.
Therefore, a need exists for an asynchronous clock design having the heretofore typically mutually exclusive advantages of low power dissipation and an easily verifiable output, where all components start together and generate output in a predetermined and predictable fashion.