1. Field
Embodiments of the invention relate to the field of generating clock signals for a digital system. More specifically, the invention relates to methods and apparatuses for generating and distributing a clock signal between components within an integrated circuit.
2. Background
FIG. 10 shows what is called a Mealy machine. The Mealy machine reduces computation to an instructive abstraction. The Mealy machine shows that computation is simply the controlled updating of state (state is simply the data that records the progress of a computation) depending on the value of the current state and some inputs.
The Mealy machine illustrates four elements of computing. Most prominent is the computation cloud. In VLSI systems, computation is performed by logic gates constructed from transistors. Next is the state holding element. Traditionally state holding elements are flip-flops, although they could be latches. The third element is the clock that determines when the state holding element updates. Last is the communication represented by the wire from the output of the state holding element to the computation cloud.
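The four elements described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `run_mealy` and `parity_step` are hypothetical, the loop iteration stands in for the clock, and the running-parity function is an arbitrary example of a computation cloud rather than anything taken from FIG. 10.

```python
from typing import Callable, Iterable, List, Tuple

# Computation cloud: maps (current state, input) to (next state, output).
StepFn = Callable[[int, int], Tuple[int, int]]

def run_mealy(step: StepFn, state: int, inputs: Iterable[int]) -> Tuple[int, List[int]]:
    """Drive a Mealy machine over a sequence of inputs."""
    outputs: List[int] = []
    for x in inputs:                  # each iteration stands for one clock edge
        state, y = step(state, x)     # the state holding element updates here
        outputs.append(y)             # the new state is fed back on the next pass
    return state, outputs

def parity_step(state: int, bit: int) -> Tuple[int, int]:
    """Hypothetical computation cloud: running parity of the input bits."""
    next_state = state ^ bit
    return next_state, next_state

final_state, outputs = run_mealy(parity_step, 0, [1, 0, 1, 1])
print(final_state, outputs)
```

The sketch updates a single state variable at a single point in the program, which is exactly the simplification the abstraction makes.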
The abstraction might lead one to believe that the state of the computer is located, manipulated and updated at a single physical location. Rather, the state holding and computation are distributed across a large plane. Communication is not carried on a single wire, but on many wires that branch and merge and form long and short channels. These realities do not disturb the model as long as each of the state holding elements receives its update signal at substantially the same time and all of the computation is completed when it is time to update to the next state. Synchronous computing evolved from this model.
Unfortunately, the factors that contribute to the speed of computing have changed since the Mealy machine model was adopted. The detail that seems insignificant in the Mealy machine, communication, has grown in importance while the most emphasized property, computation, has diminished. The Mealy machine was introduced when chips were relatively small and communication costs were negligible. Clock cycles were on the order of 50–100 gate delays and slight perturbations in the clock arrival time resulted in error margins that were a fraction of a percent of the clock cycle time.
Transistor mismatches, fabrication imperfections, unstable supplies, and a host of other phenomena make it very difficult to copy a signal to a multitude of locations over a large chip clocked in the gigahertz range to an accuracy that supports the Mealy model. High performance microprocessors have clocks that switch many billions of times per second. The cycle time is typically on the order of 8–10 gate delays. This high speed clock signal is copied through many millimeters of interconnect and is sometimes amplified by 20+ buffers. The skew between two copies of a signal derived through millimeters of interconnect and 20+ buffers begins to approach an 8–10 gate delay cycle time.
The synchronous paradigm is built upon the assumption that clock and data signals have deterministic delays. The clock tree assumes that a signal that is buffered through physically separate yet identically designed paths produces identical signals at the end of those paths. Very little certainty exists in modern transistor processes, and each new process has even less certainty than the last. Transistors and interconnect of equivalent dimensions will have different delays. These differences are no longer negligible.
Typically, the clock signal is generated at a single source and is distributed through chains of inverters of equal length to the individual latches. It is important that the clock signal arrives at each data latch at nearly the same time, so that operations that take place in one part of a circuit are properly synchronized with operations in other parts of the circuit.
However, it is impossible to match exactly the delay of all paths from the source of the clock signal to the individual latches. Cross-die processing variations and imprecision in the alignment of the fabrication equipment make this impossible. To complicate matters, die sizes are becoming larger, resulting in greater die variations and longer inverter chains, which result in greater path disparities.
As clock speeds increase, these disparities consume an increasingly larger fraction of the clock period. The disparity in the arrival time of a clock signal between latches is called “skew.” Note that skew causes uncertainty about the time that data is latched. Furthermore, note that calculations cannot be performed during periods when it is not certain that the data is valid. As clock speeds increase, the skew between latches remains approximately constant. Hence, a smaller fraction of the clock period can be used for calculations.
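The arithmetic behind this observation can be sketched as follows. The 50 ps skew and the clock frequencies are hypothetical values chosen only for illustration, and the function name `usable_fraction` is likewise an assumption. The sketch treats the skew as a fixed slice of every period during which data cannot be trusted.

```python
def usable_fraction(clock_hz: float, skew_s: float) -> float:
    """Fraction of one clock period left for computation after reserving the skew."""
    period_s = 1.0 / clock_hz
    return max(0.0, (period_s - skew_s) / period_s)

ASSUMED_SKEW_S = 50e-12  # hypothetical constant skew of 50 ps, not from the text

for clock_hz in (1e9, 2e9, 4e9):
    frac = usable_fraction(clock_hz, ASSUMED_SKEW_S)
    print(f"{clock_hz / 1e9:.0f} GHz: {frac:.0%} of the period usable")
```

With these assumed numbers the usable fraction falls from about 95% at 1 GHz to about 80% at 4 GHz, even though the skew itself has not changed.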
The traditional method for distributing a clock signal is to use an H-tree topology. A square area of the integrated circuit is divided into quadrants and the centers of the quadrants are connected by an ‘H’ interconnect topology. Each of the three segments of the ‘H’ is equal to half the length of a side of the square integrated circuit. The distance of the path from each prong to the center of the perpendicular segment, or the root, of the ‘H’ is equivalent. The prongs are called leaves in keeping with the tree image.
An area can be divided into 16 regions by superimposing an ‘H’ onto a square integrated circuit and then centering four ‘H’s, each half the size of the initial ‘H’, onto the leaves of the first ‘H’. A square integrated circuit can be divided into 4^n regions, for any positive integer n, by recursively applying this method. A signal applied at the root of the largest ‘H’ is copied to all the leaves at substantially the same time.
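The recursive construction can be sketched as below. This is a geometric illustration only, assuming an ideal square of side `side` centered at `(cx, cy)`. The function names are hypothetical, and wire length is measured along the ‘H’ segments with buffers ignored.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def h_tree_leaves(cx: float, cy: float, side: float, levels: int) -> List[Point]:
    """Leaf positions after recursively applying the 'H' `levels` times."""
    if levels == 0:
        return [(cx, cy)]
    q = side / 4.0  # each leaf sits at the center of one quadrant
    leaves: List[Point] = []
    for dx in (-q, q):
        for dy in (-q, q):
            # Each leaf becomes the root of an 'H' half the size.
            leaves.extend(h_tree_leaves(cx + dx, cy + dy, side / 2.0, levels - 1))
    return leaves

def root_to_leaf_length(side: float, levels: int) -> float:
    """Wire length from the root to any leaf: half a bar plus half an upright per level."""
    return sum((side / 2.0) / (2 ** k) for k in range(levels))

leaves = h_tree_leaves(0.0, 0.0, 8.0, 2)
print(len(leaves), root_to_leaf_length(8.0, 2))
```

Two levels yield 16 leaves, matching the 4^n count, and every leaf has the same nominal wire length from the root because each level contributes the same two half-segments to every path.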
Note that although the path from the root to each leaf is equivalent by design, there will be some disparity between all paths due to physical irregularities and fabrication resolutions. Although each path from the root to the leaves contains interconnect of equivalent length, and gates of equivalent size and number, separate paths are equal only to within the resolution of the fabrication equipment. The more the paths from root to leaf diverge, the more skew tends to accumulate.
Note that there will be a place in an H-tree system where two adjacent signals will be derived through maximally different routes through the tree. This is typically where the skew is at a maximum.
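One way to see where such a pair occurs is to count how many levels of the tree two neighboring leaves do not share. The sketch below assumes a regular H-tree whose 4^n leaves form a 2^n by 2^n grid, addressed by integer coordinates `(i, j)`. That addressing scheme and the function names are assumptions made for illustration, not part of the described system.

```python
from typing import Tuple

Leaf = Tuple[int, int]  # (column, row) index of a leaf in the 2^n x 2^n grid

def divergent_levels(a: Leaf, b: Leaf, levels: int) -> int:
    """Number of H-tree levels NOT shared by the root-to-leaf paths of a and b."""
    (ia, ja), (ib, jb) = a, b
    for k in range(levels):
        shift = levels - 1 - k
        # The leading k+1 coordinate bits select the quadrant at depth k+1.
        if (ia >> shift, ja >> shift) != (ib >> shift, jb >> shift):
            return levels - k  # paths split here and never rejoin
    return 0

def worst_adjacent_divergence(levels: int) -> int:
    """Largest divergence over all horizontally or vertically adjacent leaves."""
    n = 2 ** levels
    worst = 0
    for i in range(n):
        for j in range(n):
            for neighbor in ((i + 1, j), (i, j + 1)):
                if neighbor[0] < n and neighbor[1] < n:
                    worst = max(worst, divergent_levels((i, j), neighbor, levels))
    return worst

print(divergent_levels((0, 0), (1, 0), 3))  # neighbors inside one smallest 'H'
print(divergent_levels((3, 0), (4, 0), 3))  # neighbors across the chip's center line
print(worst_adjacent_divergence(3))
```

In this model, neighbors within the same smallest ‘H’ differ only in the final level, while neighbors on opposite sides of the center line share nothing but the root, so their paths differ at every level.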
Clock skew can be compensated for by adding a timing margin to the clock cycle time. However, this added timing margin can become a significant fraction of the clock period, and can hence limit system performance.
One way to deal with this problem is to divide an integrated circuit into multiple clock domains, where each clock domain operates from an independent clock. This relieves some of the difficulty in copying a signal across a large area of silicon to arrive at separate locations at substantially the same time. However, dividing an integrated circuit into multiple independent clock domains creates problems in synchronizing communications or data transfers between the different clock domains.
Another solution is to provide larger buffers and to use less resistive interconnect in the clock distribution circuitry. This solution uses more power and causes stronger electromagnetic fields to be emitted from the clock net, which are seen as noise by other signals. Power consumption and signal noise are both limiting factors for processor performance.