Clock skew adjustment and verification are an important part of digital circuit design and, more specifically, microprocessor design. A clock signal provides the timing reference for all data exchanges inside an integrated circuit (IC) or “chip.” This clock signal is provided from a single clock signal generator, which can be either off-chip or on-chip, and is distributed over the entire chip to every circuit element that requires a timing reference, for example, a flip-flop. The time required for the clock signal to propagate to a particular clocked element is known as the clock insertion delay of that clocked element. The difference between the insertion delays of two elements capable of exchanging data is known as the clock skew for these two elements. Depending on the circumstances, clock skew may cause the clock signal to arrive either too early or too late at one of the two elements exchanging data, relative to the other. Clock skew is accordingly classified as one of two types, known as maxtime skew and mintime skew. Excessive clock skew can decrease the performance and increase the size and power consumption of an IC.
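The relationship between insertion delay and clock skew can be illustrated with a minimal sketch. The sign convention (capturing element minus launching element), the function names, and the delay values in picoseconds are assumptions made for illustration only and do not come from the described system.

```python
def clock_skew(launch_delay, capture_delay):
    """Signed skew: insertion delay of the capturing element minus
    that of the launching element (assumed sign convention)."""
    return capture_delay - launch_delay


def skew_type(skew):
    """Classify skew under the assumed sign convention.

    A negative value means the capturing clock is early relative to the
    launching clock (maxtime concern); a positive value means it is late
    (mintime concern).
    """
    if skew < 0:
        return "maxtime"
    if skew > 0:
        return "mintime"
    return "none"


# Hypothetical insertion delays in picoseconds.
insertion_delay_ps = {"FFA": 420, "FFB": 395}

skew_ps = clock_skew(insertion_delay_ps["FFA"], insertion_delay_ps["FFB"])
kind = skew_type(skew_ps)
```

With these assumed values the capturing element FFB receives its clock 25 ps before the launching element FFA, which corresponds to the maxtime case.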
Turning first to FIG. 1, a block diagram and a timing diagram exemplifying maxtime type clock skew is shown. The block diagram shows a first flip-flop (FFA) 10, a second flip-flop (FFB) 12, and a logic device 14 connected as shown. The Clock signal line is shown to have a skew 16 which makes the signal to the FFB 12 early and the signal to the FFA 10 late, relatively speaking. The timing diagram shows a propagation delay time of the FFA 10, TpdFFA, a logic delay time of the logic device 14, Tpdlogic, a setup time of the FFB 12, TsetupFFB, and a time of the skew 16, Tskew. A combination of these times determines the usable cycle time, Tusable_cycle, from a cycle time, Tcycle, according to the following equation.
Tusable_cycle = Tcycle − Tskew ≥ TpdFFA + Tpdlogic + TsetupFFB  (1)
The value of the maxtime skew Tskew determines the usable cycle time. The greater the clock skew, the smaller the usable cycle time. Therefore, it is essential for the performance of the microprocessor to analyze the clock skew for all possible paths in the circuit and to adjust the skew to achieve maximum performance.
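Equation (1) can be checked numerically with a short sketch. The function name and all timing values (in picoseconds) are hypothetical; only the relation itself is taken from the description above.

```python
def setup_slack(t_cycle, t_skew, t_pd_ffa, t_pd_logic, t_setup_ffb):
    """Return the margin left by equation (1).

    The usable cycle is the cycle time reduced by the maxtime skew; the
    path meets timing when the returned slack is zero or positive.
    """
    t_usable_cycle = t_cycle - t_skew
    required = t_pd_ffa + t_pd_logic + t_setup_ffb
    return t_usable_cycle - required


# Hypothetical values in picoseconds.
slack_small_skew = setup_slack(t_cycle=1000, t_skew=50,
                               t_pd_ffa=100, t_pd_logic=750, t_setup_ffb=80)
slack_large_skew = setup_slack(t_cycle=1000, t_skew=80,
                               t_pd_ffa=100, t_pd_logic=750, t_setup_ffb=80)
```

In this assumed example, raising the skew from 50 ps to 80 ps turns a 20 ps margin into a 10 ps violation, so the cycle time would have to be lengthened or the skew reduced.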
Turning now to FIG. 2, a block diagram and a timing diagram exemplifying mintime type clock skew is shown. The block diagram shows FFA 10 connected to FFB 12. This time, the Clock signal skew 16 makes the signal to the FFA 10 early and the signal to the FFB 12 late, relatively speaking. The timing diagram shows a propagation delay time of the FFA 10, TpdFFA, a hold time of the FFB 12, TholdFFB, and a time of the skew 16, Tskew. These times must satisfy the following relation.
TpdFFA ≥ TholdFFB + Tskew  (2)
If the natural propagation delay of the FFA 10 is insufficient to achieve the necessary hold time, then additional circuitry must be added between the FFA 10 and the FFB 12 to increase the total propagation delay. This results in more die area and power being consumed. Further, the additional circuitry has to be added before the circuit is fabricated in order to prevent potential functional failures. This increases production costs and design times.
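The amount of additional delay implied by relation (2) can be estimated with a minimal sketch. The function name and the timing values (in picoseconds) are assumptions for illustration; the sketch only rearranges relation (2) and does not model how the delay circuitry would be implemented.

```python
def hold_padding_needed(t_pd_ffa, t_hold_ffb, t_skew):
    """Extra propagation delay needed so that relation (2) holds.

    Returns 0 when the natural propagation delay of FFA already covers
    the hold time of FFB plus the mintime skew.
    """
    shortfall = t_hold_ffb + t_skew - t_pd_ffa
    return max(0, shortfall)


# Hypothetical values in picoseconds.
padding_large_skew = hold_padding_needed(t_pd_ffa=60, t_hold_ffb=40, t_skew=50)
padding_small_skew = hold_padding_needed(t_pd_ffa=60, t_hold_ffb=40, t_skew=10)
```

Under these assumed numbers, a 50 ps mintime skew requires 30 ps of added delay between the flip-flops, whereas a 10 ps skew requires none.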
In both of the clock skew cases described above, an accurate analysis of the clock insertion delay for substantially every single clocked element is valuable to achieving high performance in a microprocessor design. The cost to analyze the insertion delay for a given path increases in general more than linearly with the size of the problem. Analyzing the insertion delay of a large path is generally much more computationally expensive than dividing the large path into several smaller paths and analyzing each of these smaller paths separately. The sum of all of the computational costs for each of the smaller tasks is typically only a fraction of the cost for the entire problem processed as a single task. In addition, several of the smaller paths can potentially be processed in parallel, so that the total runtime cost can be reduced even further. The analysis of all of the insertion delays in a microprocessor design is typically an extremely large computational task, which exceeds any available computational resources as a single analysis task. It can better be solved by dividing this task into a large number of independent smaller tasks.
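The benefit of partitioning under a superlinear cost can be made concrete with a small sketch. The power-law cost model, the exponent of 2, the problem size, and the worker count are all assumptions chosen for illustration; the description above states only that cost grows more than linearly with problem size.

```python
import math


def analysis_cost(size, exponent=2):
    """Assumed cost model: cost grows as size ** exponent (exponent > 1)."""
    return size ** exponent


def partitioned_cost(size, parts, exponent=2):
    """Total cost when the problem is split into equal, independent parts.

    Assumes size is evenly divisible by parts.
    """
    return parts * analysis_cost(size // parts, exponent)


def parallel_runtime(size, parts, workers, exponent=2):
    """Runtime when the parts are processed in rounds on a fixed worker pool."""
    rounds = math.ceil(parts / workers)
    return rounds * analysis_cost(size // parts, exponent)


whole = analysis_cost(1200)              # single task
split = partitioned_cost(1200, 12)       # twelve independent sub-tasks
runtime = parallel_runtime(1200, 12, 4)  # four sub-tasks at a time
```

With the assumed quadratic model, twelve equal sub-tasks together cost one twelfth of the single task, and running four of them at a time reduces the elapsed cost by a further factor of four.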
In conventional microprocessor design, a clock distribution network is tree-based, grid-based, or a hybrid of both. The tree-based clock net has a network of branches from a synthesized clock source to each clocked element, so that one and only one path can be traced from the source to any given clocked element. Each path can be analyzed separately, which makes the calculation of the insertion delay relatively simple and accurate. For a large number of clocked elements these calculations are still time consuming, but the exceptionally high computational cost of simulating all of the paths simultaneously is avoided.
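Because a tree-based net has exactly one path per clocked element, the insertion delay of each element can be computed independently by walking that path. The following is a minimal sketch under simplifying assumptions: the node names, the parent map, and the per-segment delays (in picoseconds) are hypothetical, and each segment delay is treated as a fixed number rather than the result of a circuit simulation.

```python
def insertion_delay(element, parent, segment_delay):
    """Sum the segment delays along the unique path from the source.

    parent maps each node to the node that drives it; the clock source
    has no entry. segment_delay[node] is the assumed delay of the
    segment that drives node.
    """
    total = 0
    node = element
    while node in parent:
        total += segment_delay[node]
        node = parent[node]
    return total


# Hypothetical two-branch tree: src -> buf1 -> ffa, src -> buf2 -> ffb.
parent = {"buf1": "src", "buf2": "src", "ffa": "buf1", "ffb": "buf2"}
segment_delay = {"buf1": 100, "buf2": 110, "ffa": 30, "ffb": 25}

delay_ffa = insertion_delay("ffa", parent, segment_delay)
delay_ffb = insertion_delay("ffb", parent, segment_delay)
```

Each call touches only the nodes on one path, which is why the per-element analysis of a tree can be run as separate tasks.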
The grid-based clock net has a wire grid spanning over the entire chip, for example, at distribution level two or L2. At higher distribution levels, that is, for example, levels three through ten or L3-L10, the clock net has a pre-grid distribution net that resembles a tree. At L2, the clock drivers are shorted together by the grid to equalize arrival times. The result is that there is not one and only one path that can be traced directly to the clocked element. Furthermore, the clock arrival time at every clocked element is influenced by the load created by other clocked elements in the neighborhood. Therefore, it is not generally possible to analyze each clocked element separately. Instead the entire grid or at least a large cluster of the grid should be analyzed together to reflect the interaction of the clocked elements on the arrival time of the clock signal on the grid. Since conventionally the computation task cannot be separated easily into sub-tasks as with the tree-based clock net above, analyzing the clock insertion delay in a grid-based design is much more difficult than in a tree-based design and requires potentially a much higher computational cost.
Turning now to FIG. 3, a block diagram of a grid-based clock distribution system 18 is shown. The system 18 includes a phase-locked loop (PLL) 20 and a grid-based clock net 22 having levels ten through one. Levels ten through three form a pre-grid clock net or a global net, and levels two and one form a local net. For simplicity, only nine rows and one level one are shown. The exact number will depend on the circumstances. A source clock signal from a source clock (not shown) is fed to the PLL 20, which produces a synthesized clock signal that is fed down through the grid-based clock net 22 from level ten to level one to the clocked elements (not shown).
Turning now to FIG. 4, a schematic diagram of the grid-based clock net 22 of FIG. 3 is shown. The column made up of levels ten through six is shown above and one example row of levels five through two is shown below. Each level includes a plurality of buffers 24. The number and layout of the columns, rows, levels, and buffers will depend on the particular application. In this diagram one can see how, to an extent, the pre-grid distribution net resembles a tree.
Turning now to FIG. 5, a layout diagram of the grid-based clock distribution system 18 of FIG. 3 is shown. The system 18 is shown in a substantially idealized form. This form is rarely if ever achieved in a practical application. The non-ideal form introduces random and systematic skew components. As a result, one must verify the skew based on the actual layout. In this diagram one can see the wire grid spanning over the entire chip.
In both of the clock skew cases described above with respect to FIGS. 1 and 2, it is valuable to analyze the clock insertion delay for each element to predict the clock skew for a given data transfer path and, if necessary, improve performance by adjusting the insertion delays of the involved elements. In addition to the obvious conductor lengths, the clock insertion delay depends in part on parasitic effects such as coupling capacitances to other metal lines in the vicinity of the clock line. Therefore, the clock skew analysis has to be done after the entire microprocessor has been designed and all of the structures are present in a manufacturable form. Because all of the structures in the vicinity of the clock distribution network that might show parasitic interaction with the clock net have to be included, the clock skew analysis is typically very costly in terms of time and computational resources. Furthermore, the clock skew analysis requires circuit simulation tools with a high degree of accuracy. Any uncertainty in the clock insertion delay results caused by the limited accuracy of the simulation tools has to be accounted for as “unknown additional clock skew,” thereby limiting the analysis and the resulting system performance. Similarly, the demand for high accuracy increases the cost in terms of time and computational resources. For a standard microprocessor design, that is, one having more than ten million transistors, there comes a point when simulating the complete clock distribution net at one time with high accuracy tools becomes unmanageable with conventional means. The simulation time would be unacceptable and the tools are typically not capable of dealing with such large quantities of data with high accuracy.
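The effect of limited tool accuracy on the skew budget can be sketched as follows. The bounding rule used here, in which each insertion delay is assumed to be known only to within plus or minus a fixed tool uncertainty and the two errors are assumed to add in the worst case, is an illustrative assumption and not a method stated in the description. The function name and the values (in picoseconds) are likewise hypothetical.

```python
def worst_case_skew(delay_a, delay_b, tool_uncertainty):
    """Upper bound on skew between two elements under the assumed rule.

    The nominal skew is the difference of the simulated insertion
    delays; the uncertainty of both results is added as unknown
    additional clock skew.
    """
    nominal = abs(delay_a - delay_b)
    return nominal + 2 * tool_uncertainty


# Hypothetical simulated insertion delays and tool uncertainty, in ps.
budgeted_skew = worst_case_skew(delay_a=420, delay_b=395, tool_uncertainty=5)
exact_skew = worst_case_skew(delay_a=420, delay_b=395, tool_uncertainty=0)
```

In this assumed case, a 5 ps uncertainty per insertion delay raises the skew that must be budgeted from 25 ps to 35 ps, which directly reduces the usable cycle time.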