1. Field of the Invention
This invention relates to a self-timed and self-enabled clock for a functional unit with variable execution time. More particularly, this invention relates to a self-timed and self-enabled distributed clock for a pipeline-processing unit.
2. Brief Description of the Related Technology
In general, microprocessors (processors) achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle. The term "clock cycle" refers to an interval of time accorded to the various stages of a processing pipeline within the microprocessor. Storage devices (e.g., registers and arrays) capture their values according to a rising or falling edge of a clock signal defining the clock cycle. The storage devices store the values until a subsequent rising or falling edge of the clock signal, respectively. The phrase "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may include any number of stages, where each stage processes at least a portion of an instruction, instruction processing generally includes the steps of decoding the instruction, fetching data operands, executing the instruction, and storing the execution results in the destination identified by the instruction.
With processor clock frequencies expected to reach the gigahertz range by the end of the century, clock skew and jitter may account for up to 15% of the clock cycle period, leaving 85% of the period for computation logic. Clock skew is the difference in arrival times of clock edges at different parts of the circuit. Clock skew bounds the minimum and maximum delay through computation logic. Interconnection delays do not scale linearly with increasing clock frequencies, so clock skew takes an increasingly large portion of the useful clock cycle. The clock signal is also a major contributor to power consumption and noise, especially when there is no other activity. The clock can account for up to 25% of the total power consumption. Furthermore, all functions are forced to operate at the same worst-case frequency. For example, most operations of an arithmetic-logical unit (ALU) require only 30% of the clock cycle to produce their results, yet the clock frequency of the ALU is set by the rare worst-case operation. Asynchronous processing design solves most of these problems.
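The percentages above can be turned into a rough timing budget. The following Python sketch is illustrative only: the function names are hypothetical, the 1 GHz frequency is an example value, and the 5% share of worst-case operations is an assumption that does not come from the text above. Only the 15% overhead and 30% typical-operation figures are taken from the preceding paragraph.

```python
def cycle_budget(freq_hz, overhead_fraction=0.15):
    """Split one clock period into skew/jitter overhead and useful time.

    overhead_fraction defaults to the 15% upper bound quoted in the text.
    Returns (period, overhead, useful time), all in seconds.
    """
    period_s = 1.0 / freq_hz
    overhead_s = period_s * overhead_fraction
    return period_s, overhead_s, period_s - overhead_s


def mean_self_timed_fraction(worst_case_share, typical_fraction=0.30):
    """Mean fraction of a clock cycle an ALU would need if each operation
    took only as long as it requires.

    worst_case_share is the (assumed) fraction of operations that need the
    full cycle; all other operations need typical_fraction of the cycle.
    """
    return worst_case_share * 1.0 + (1.0 - worst_case_share) * typical_fraction


if __name__ == "__main__":
    period, overhead, useful = cycle_budget(1e9)  # example: 1 GHz clock
    print(f"period   = {period * 1e9:.2f} ns")
    print(f"overhead = {overhead * 1e9:.2f} ns")
    print(f"useful   = {useful * 1e9:.2f} ns")
    # Assumed for illustration: 5% of ALU operations are worst case.
    print(f"mean ALU time = {mean_self_timed_fraction(0.05):.3f} of a cycle")
```

Under the assumed 5% worst-case share, the mean operation would need about one third of the clock cycle (0.05 + 0.95 × 0.30 = 0.335), which indicates how much time a fixed worst-case clock can leave unused.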
Asynchronous processing design, however, inherently has another associated set of problems, mostly related to verification, testing, availability of computer-aided-design tools, design methodology, and marketing. An advantage of a synchronous clock design is that all components start together and generate output in a predetermined and predictable fashion, which makes a synchronous design much easier to verify. In an asynchronous design, where each component works at its own pace, the verification process is very difficult. The outputs of the processor are not deterministic due to actual silicon process variations. Additionally, since the gate delay varies based on the process technology, it is difficult to verify and test an output. A glitch in an asynchronous design can cause an incorrect output, whereas in a synchronous design the state of a signal matters only at the next clock edge.
One prior art approach to asynchronous processing design is the design technique used on Advanced RISC Machines (ARM) processors at the University of Manchester, United Kingdom. This asynchronous design technique uses a request-and-acknowledge handshake protocol for synchronization between processing blocks. The handshake protocol requires several logic gate delays between the blocks. This ARM technique arguably does not show an improvement in performance over synchronous designs, but shows an advantage over synchronous designs in reducing power dissipation. Because each functional unit completes its operation in the time the operation actually requires, there is a potential for increased performance.
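To illustrate why a request-and-acknowledge handshake adds delay between blocks, the following Python sketch models one common form of such a protocol, the four-phase (return-to-zero) handshake. The text above does not state which handshake variant the ARM design uses, so the four-phase choice, the function name, and the signal names `req` and `ack` are assumptions made for illustration; this is not a description of the Manchester implementation.

```python
def four_phase_transfer(data_items):
    """Model a sender passing items to a receiver with a four-phase handshake.

    Returns (received, trace), where trace is the ordered list of
    (signal, level) transitions on the request and acknowledge wires.
    """
    received = []
    trace = []
    for item in data_items:
        # Phase 1: sender drives the data and raises request.
        trace.append(("req", 1))
        # Phase 2: receiver latches the data and raises acknowledge.
        received.append(item)
        trace.append(("ack", 1))
        # Phase 3: sender sees acknowledge and lowers request.
        trace.append(("req", 0))
        # Phase 4: receiver sees request low and lowers acknowledge.
        trace.append(("ack", 0))
    return received, trace


if __name__ == "__main__":
    received, trace = four_phase_transfer(["op_a", "op_b"])
    print(received)
    for signal, level in trace:
        print(signal, level)
```

In this model every transfer costs four signal transitions, each of which must propagate through control logic in real hardware. That is one way to picture the several logic gate delays between blocks noted above.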
Therefore, a need exists for an asynchronous clock design that combines the heretofore typically mutually exclusive advantages of low power dissipation, a functional unit that operates close to its optimal timing, and an easily verifiable output where all components start together and generate output in a predetermined and predictable fashion.