This relates to the design of integrated circuit (IC) chips and more particularly to the design of IC chips having on the order of 10,000,000 gates.
Multi-million gate chip designs require hierarchy to be achievable in manageable and predictable schedules. Ensuring that timing will converge to a chosen goal requires early timing budgets, abstraction of simplified block routing and timing models, and proper margins along several axes.
The challenge for on-schedule physical implementation of multi-million gate chips starts with early floor planning and partitioning, and continues throughout the design flow with appropriate abstraction and approximations to get the most benefit out of all work expended. In particular, to achieve timing closure, each stage of the process must include sufficient margin, and the overall project must keep moving forward rather than getting stuck spending too much time on premature optimizations. As a design closes in on tapeout, several different issues must converge simultaneously, and a useful concept is to incrementally relax the added margins towards the desired target goals.
True Hierarchy
Improvement in the flat capacity of CAD tools still doesn't obviate the need for a hierarchical flow for top-end designs. A “truly hierarchical” flow supports making replicated instances of blocks that share a single abstraction for each of their logic, timing, routing, and port location models. If different instances must end up having separate models anyway, then much of the advantage of hierarchy is lost. The point is that replicated block instances should be used as truly identical objects so that the various design and analysis tasks done at a block level do not have to be repeated for each instance.
Another important characteristic of a truly hierarchical design flow is the containment of complexity by choosing an approach and algorithms such that no tool must see all of the flattened data except for final physical DRC/LVS (Design-Rule-Check/Layout-Versus-Schematic) mask checking.
In particular, the physical routing, parasitic extraction, and static timing analysis steps need to be separated such that top-level runs use only abstractions of instantiated blocks without seeing the full transistor, gate, or polygon databases within each of the blocks. This divide-and-conquer approach is essential to truly limit the amount of data seen by any one run and to ensure that virtual memory usage and job runtime stay reasonable.
Floorplanning and Partitioning
Partitioning of a design serves to break it into manageable pieces that can benefit from the parallelized effort of the individuals in a team. The goal is to allow separable progress of the work both for individual blocks and concurrently at the top-level. Another reason why flat approaches fail to provide timing convergence is that there is no good way to benefit from parallel work if final timing or routing is bottlenecked through single large runs. Instead, partitioning allows multiple people and licenses to be applied effectively. A good goal is to seek “equalized pain” between blocks and their parent in a hierarchical design, so that the block size is small enough that the effort of routing and timing closure at a block level is about the same as the effort required for their parent.
If some of the blocks are so large that they are much more difficult than their parent, then the overall project's closure will be aided by cutting those blocks into smaller pieces, and “pushing” some of the complexity of their connections into the parent. Conversely, if the parent jobs for routing, extraction, or timing are too large, then it is worth encapsulating and pushing downwards some of the complexity by combining blocks or introducing another level of hierarchy. A good metric for the use of hierarchy is the “hierarchical re-use factor”, which is the ratio of the number of block instances to the number of block types. Although very design dependent, a good partitioning should seek a re-use of around 3 to 4, which averages out so that some blocks are replicated many times while some unique blocks have only one instance.
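As a minimal sketch of the re-use metric just described, the following computes the ratio of block instances to block types for a floorplan. The block and instance names are hypothetical and exist only for illustration.

```python
from collections import Counter


def reuse_factor(instances):
    """Return (number of block instances) / (number of block types).

    `instances` is an iterable of (instance_name, block_type) pairs.
    """
    per_type = Counter(block_type for _, block_type in instances)
    if not per_type:
        raise ValueError("no block instances given")
    return sum(per_type.values()) / len(per_type)


# Hypothetical floorplan: all names below are invented for this example.
floorplan = (
    [(f"cpu{i}", "cpu_core") for i in range(6)]    # replicated many times
    + [(f"l2_{i}", "l2_slice") for i in range(4)]
    + [("mc0", "mem_ctrl"), ("io0", "io_ring")]    # unique blocks
)

factor = reuse_factor(floorplan)
print(f"{len(floorplan)} instances, re-use factor = {factor:.1f}")
```

Here twelve instances of four block types give a factor of 3, with two heavily replicated types balancing two unique blocks.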
Another good principle in choosing the granularity of partitioning is to ensure that no individual run takes more than 20 hours. By using a hierarchical divide-and-conquer approach to cut and partition so that each individual computer job for routing, extraction, and timing completes in less than a day, based upon the speed of computers available to the design team, progress can be assured. Work must not become irrelevant by the time it completes. Further, there are huge project risks in depending upon any job that would take multiple days of runtime, because it might never complete due to power failures or program bugs that don't become apparent until too late. By abstraction and partitioning of blocks so that individual runs take a day or less, forward progress and valuable iterations proceed with reasonable cycle times of a few days per turn, including designer time to analyze results.
Given today's tools, we have found that a good rule-of-thumb for block size is to seek blocks that have about 150K placeable instances, or around 400K gates (nand2 equivalent standard gates). Even though tools could support blocks several times this size, it is less optimal for the overall project because individual blocks with 1M gates just take too long in runtime for all steps, and are too close to failing completely due to lack of real or virtual memory, even on machines with many gigabytes of memory.
Port location assignment is part of overall floorplanning and partitioning. As early judgment is important, a powerful concept is to use a linear “signal velocity” metric that allows top-level timing before the actual placement of repeaters. By assuming good repeaters will be inserted later, floorplanning tuning can proceed without getting lost in the wire resistance-capacitance (RC) delay effects that scale quadratically with the length of wires without repeaters. For a given technology and wire width and spacing to neighbors, there is an optimal spacing between repeaters for long wires, and a corresponding idealized velocity of signals traveling down a wire that has perfectly positioned repeaters.
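The quadratic-versus-linear behavior can be illustrated with a first-order Elmore-delay sketch. The symbols are assumptions introduced here, not taken from the text: $r$ and $c$ are wire resistance and capacitance per unit length, $L$ is the total wire length, $\ell$ is the repeater spacing, and $t_b$ is a fixed delay per repeater. The model deliberately ignores the interaction between repeater drive strength and wire load, so it indicates the trend rather than giving a sign-off formula.

```latex
% Unrepeated distributed RC wire: delay grows with the square of length.
t_{\text{unrepeated}} \approx \tfrac{1}{2}\, r\, c\, L^{2}

% Wire cut into L/\ell segments, each driven by one repeater: linear in L.
t_{\text{repeated}} \approx \frac{L}{\ell}\left(t_b + \tfrac{1}{2}\, r\, c\, \ell^{2}\right)
                    = L\left(\frac{t_b}{\ell} + \tfrac{1}{2}\, r\, c\, \ell\right)

% Minimizing the per-unit-length delay over \ell gives the optimal spacing
% and the corresponding idealized signal velocity.
\ell^{*} = \sqrt{\frac{2\, t_b}{r\, c}}, \qquad
v_{\text{ideal}} = \frac{L}{t_{\text{repeated}}}\bigg|_{\ell = \ell^{*}}
                 = \frac{1}{\sqrt{2\, r\, c\, t_b}}
```

Under these assumptions the idealized velocity depends only on the technology and wire geometry, which is what makes a single linear metric usable during floorplanning.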
By de-rating the idealized velocity (for example, by cutting it in half), a design team can choose a velocity that will have enough margin for the realities of non-ideal repeater locations, non-ideal actual wire paths due to routing congestion, and fan-out prior to the next repeater along each branch. A top-level timing analysis can then combine the input arrival times and departure times of each block's ports with a timing of the parent's signals using the de-rated signal velocity on the Manhattan (x+y) distance pairwise between each output port and every input port it drives. In this way, top-level timing budgets and wirelengths can be driven to meet design goals and to prevent wires that have to cross the full chip, since this would take several clock cycles at today's frequencies.
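A minimal sketch of this velocity-based top-level timing check follows. The port names, coordinates, budgets, clock period, and the idealized velocity of 16 um/ps are all hypothetical values chosen for illustration. The single `budget_ps` field is also an assumption: for an output port it stands for the departure time after the launching clock edge, and for an input port it stands for the time needed from the port to the receiving register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    name: str
    x_um: float
    y_um: float
    budget_ps: float  # output: departure time; input: port-to-register time


def wire_delay_ps(src: Port, dst: Port, velocity_um_per_ps: float) -> float:
    """Delay over the Manhattan (x + y) distance at a fixed signal velocity."""
    manhattan_um = abs(src.x_um - dst.x_um) + abs(src.y_um - dst.y_um)
    return manhattan_um / velocity_um_per_ps


def top_level_slack_ps(src: Port, dst: Port, period_ps: float,
                       velocity_um_per_ps: float) -> float:
    """Clock period minus (departure + top-level wire + input budget)."""
    used = src.budget_ps + wire_delay_ps(src, dst, velocity_um_per_ps) + dst.budget_ps
    return period_ps - used


IDEAL_VELOCITY_UM_PER_PS = 16.0                          # hypothetical value
DERATED_VELOCITY_UM_PER_PS = IDEAL_VELOCITY_UM_PER_PS / 2  # cut in half for margin

out_port = Port("blk_a/dout", x_um=1000.0, y_um=2000.0, budget_ps=400.0)
in_port = Port("blk_b/din", x_um=4000.0, y_um=6000.0, budget_ps=500.0)

delay = wire_delay_ps(out_port, in_port, DERATED_VELOCITY_UM_PER_PS)
slack = top_level_slack_ps(out_port, in_port, 2000.0, DERATED_VELOCITY_UM_PER_PS)
print(f"wire delay = {delay:.0f} ps, slack = {slack:.0f} ps")
```

In a real flow the same check would be repeated for every output port and each input port it drives, and negative slacks would prompt moving ports or blocks closer together.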
Abstraction
One of the classic traps that some design approaches fall into, even after doing the hard work of partitioning and floorplanning, is to then choose analysis methods that don't preserve the isolation between parent and child in the hierarchy. For example, detailed capacitance extraction shouldn't attempt to combine parent and child data, to “see” the wires above or below. Instead, it is better to simplify by crafting the extraction flows to just assume that worst-case conditions exist. For minimum capacitance values, this means assuming no wires, and for maximum capacitance values, assuming “gridded-sheets” of wires spaced at the routing pitch, in the preferred direction of each appropriate metal layer. While modeling solid-sheets of metal would be even more conservative, it is overkill, since the reality in the routed environment will be, at worst, the full-density grids of wires at the routing pitch.
Another classic trap that makes hierarchy less effective is the handling of the timing of the signals crossing the hierarchical boundaries. The quality of separation is lost if the child timing depends on how much load is applied to a signal by a parent, or if the parent timing depends upon how much internal routing is put on a wire before it gets to an output port.
We have found that the hierarchical abstraction can be strengthened by adding a buffer at every input and output port of every block, and placing that buffer physically very close to the port location. Even if a buffer is unneeded for other reasons, the isolation between parent and child nets is worth the negligible area cost to aid in timing convergence by being able to separate parent and child timing.
Complementary Timing Analysis
In modern 0.18 micron or smaller technology, minimum-pitched wires are taller than they are wide, and this means that cross-coupling to neighboring signals can often be in excess of 50% of a wire's total capacitance. The effect of simultaneous switching cannot be ignored, but it is also unrealistic to seek a precise determination of when every coupling combination can occur over the range of process spread. Therefore, safe and successful timing convergence requires conservative choices that bound delay calculation by minimum and maximum values rather than hopelessly seeking to find a single “exact” value.
Even after making conservative choices for the handling of cross-coupling, it is still important to add additional margin to account for the effects on timing of many other factors such as process spread, variations in dielectric thickness or permittivity, on-chip process tilt, on-chip variation in power-supply voltage-drops, and inaccuracies in extraction and transistor characterizations. Continuing with the theme of bounded minimum and maximum capacitance values, multiplicative margin can be added with respect to the nominal values, making good equations for $C_{\mathrm{eff,min}}$ and $C_{\mathrm{eff,max}}$ as follows:

$$C_{\mathrm{eff,min}} = (1 - x)\,(C_{\mathrm{vss,vdd}} + 0 \cdot C_{\mathrm{coupling}}) \tag{1}$$

$$C_{\mathrm{eff,max}} = (1 + x)\,(C_{\mathrm{vss,vdd}} + 2 \cdot C_{\mathrm{coupling}}) \tag{2}$$

The value of $x$ can be chosen as desired, typically in the range of 0.05 to 0.10, where the margin introduced is, in effect, $2x$. Remember, that since many of the otherwise-not-explicitly-counted factors are individually 10% effects, this margin is very much needed. Similar multiplicative margins are also suitable to apply to the wire resistance parasitics to account for the process variation in wire thickness and width, and for the on-chip localized temperature variations due to self-heating.
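Equations (1) and (2) translate directly into a small helper. This is only a sketch: the function name, the femtofarad units, and the example capacitances are assumptions made for illustration.

```python
def effective_cap_bounds(c_vss_vdd_ff: float, c_coupling_ff: float,
                         x: float = 0.10) -> tuple[float, float]:
    """Return (Ceff_min, Ceff_max) per equations (1) and (2).

    c_vss_vdd_ff  : capacitance to the power and ground rails, in fF
    c_coupling_ff : capacitance to neighboring signal wires, in fF
    x             : multiplicative margin, typically 0.05 to 0.10
    """
    if not 0.0 <= x < 1.0:
        raise ValueError("margin x must be in [0, 1)")
    # Minimum: neighbors switch with the victim, so coupling counts 0 times.
    c_min = (1.0 - x) * (c_vss_vdd_ff + 0.0 * c_coupling_ff)
    # Maximum: neighbors switch against the victim, so coupling counts twice.
    c_max = (1.0 + x) * (c_vss_vdd_ff + 2.0 * c_coupling_ff)
    return c_min, c_max


# Hypothetical net with 40 fF to the rails and 60 fF of coupling.
c_min, c_max = effective_cap_bounds(40.0, 60.0, x=0.10)
print(f"Ceff_min = {c_min:.1f} fF, Ceff_max = {c_max:.1f} fF")
```

For a net dominated by coupling, as in this example, the bounds are far apart, which is why both values need to be carried through delay calculation.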
The minimum and maximum resistances and capacitances for every net allow us to calculate minimum and maximum delays for every net which can be backannotated using .sdf files into timing analyses. In a register (edge-triggered flop) based design, think of every setup and hold check as determining the results of a race ending at the receiving register. Both sides should be viewed as complementary duals of each other, where every setup check is based upon using the maximum delays through the launching register and combinational logic up to the receiving register, and the exact same paths with minimum delays are used for a hold check.
Furthermore, the paths used for these checks need to include the clock trees going back to the point of reconvergence between the parts of the clock distribution tree feeding the launching and receiving registers. Thus, ordinary setup and hold checks also validate the quality of the clock distribution tree, emphasizing local skew instead of global skew, and allow for the possibility of using “useful skew”, as well as not penalizing bad skew where it wouldn't make any difference to the affected setup or hold checks anyway.
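The complementary setup and hold races can be sketched as two small functions over minimum and maximum delays. The pairing of the capture clock's minimum delay with the setup check and its maximum delay with the hold check is an assumption of this sketch, following the usual bounded-delay convention. All delay values are hypothetical, and clock delays are measured from the reconvergence point of the two clock branches.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MinMax:
    min_ps: float
    max_ps: float


def setup_slack_ps(launch_clk: MinMax, data: MinMax, capture_clk: MinMax,
                   period_ps: float, setup_ps: float) -> float:
    """Setup race: slowest launch clock and data against fastest capture clock."""
    arrival = launch_clk.max_ps + data.max_ps
    required = period_ps + capture_clk.min_ps - setup_ps
    return required - arrival


def hold_slack_ps(launch_clk: MinMax, data: MinMax, capture_clk: MinMax,
                  hold_ps: float) -> float:
    """Hold race: fastest launch clock and data against slowest capture clock."""
    arrival = launch_clk.min_ps + data.min_ps
    required = capture_clk.max_ps + hold_ps
    return arrival - required


# Hypothetical path. `data` covers the launching register's clock-to-output
# delay plus the combinational logic up to the receiving register.
launch_clk = MinMax(100.0, 150.0)
capture_clk = MinMax(120.0, 180.0)
data = MinMax(200.0, 1500.0)

setup = setup_slack_ps(launch_clk, data, capture_clk, period_ps=2000.0, setup_ps=50.0)
hold = hold_slack_ps(launch_clk, data, capture_clk, hold_ps=30.0)
print(f"setup slack = {setup:.0f} ps, hold slack = {hold:.0f} ps")
```

Because only the clock delay after the reconvergence point enters either check, skew between unrelated parts of the tree never affects the result.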
This rigorous and safe approach to clock tree analysis allows a much lower-power clock distribution scheme using tall clock trees with limited average fan-out, rather than the overkill of more common short-fat solutions that sometimes expend a third of the switching capacitance of the chip just in distributing a core clock.
Where hold checks do show violations with respect to a chosen added amount of margin, delay elements can be inserted. Sometimes, because of the conservative handling of minimum and maximum delay calculations, it can be a tricky balancing act to add delay elements in just the right way to fix the hold violations without degrading the setup paths too much. A useful concept is to write scripts that search the nets selected by fan-in cones of each receiving register input that has a hold violation (using minimum delay calculations) for the upstream point with the maximum setup slack (under the maximum delay calculations), and insert the delay buffer into that net. Often this means the hold-fix delay element must be inserted at a point that is adjacent to neither the launching register's output nor the receiving register's input, which is where most simplistic hold-fix algorithms usually would have placed them.
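A hold-fix search of this kind might look like the sketch below. The netlist representation is an assumption made for illustration: `fanin` maps each net to the nets feeding the gate that drives it, and `setup_slack` holds the worst setup slack through each net under maximum delays. A production script would also confirm that the violating minimum-delay path actually passes through the chosen net.

```python
def best_hold_fix_net(endpoint: str, fanin: dict[str, list[str]],
                      setup_slack: dict[str, float]) -> str:
    """Walk the fan-in cone of a hold-violating register input and return
    the net with the largest setup slack, where a delay buffer costs least."""
    best = endpoint
    seen = {endpoint}
    stack = [endpoint]
    while stack:
        net = stack.pop()
        if setup_slack[net] > setup_slack[best]:
            best = net
        for upstream in fanin.get(net, []):
            if upstream not in seen:      # nets can be reached by several routes
                seen.add(upstream)
                stack.append(upstream)
    return best


# Hypothetical cone: register outputs q_a and q_b feed n1 and n2, which
# merge into n3, which drives the violating register input d_in.
fanin = {"d_in": ["n3"], "n3": ["n1", "n2"], "n1": ["q_a"], "n2": ["q_b"]}
setup_slack = {
    "d_in": 50.0, "n3": 50.0,   # shared with the setup-critical path via n2
    "n2": 50.0, "q_b": 50.0,
    "n1": 400.0,                # short branch with plenty of setup slack
    "q_a": 120.0,               # q_a also feeds another, slower path
}

chosen = best_hold_fix_net("d_in", fanin, setup_slack)
print(f"insert hold-fix delay on net {chosen}")
```

In this example the chosen net is adjacent to neither the launching register's output nor the receiving register's input, matching the observation above.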
In the later stages of timing closure, the actual results from block level static timing analysis runs can be used to create models of the blocks that are used as objects in a top-level static timing analysis run. Of course, replicated instances will only need a single timing model for the block type, consistent with the choice that timing analysis of the blocks is not complicated by the instance-specific wires of overlying routing in the parent.
Moreover, since the paths leading from input ports to receiving registers, and the paths leading from the final launching register up to output ports, will have both minimum and maximum delays, both types of delay arcs need to be represented in the models constructed for use in the parent. Even though no commercial flow does this automatically, internal scripts can be written to combine these arcs into block level models so that timing analysis at a parent level can use the same duality for all setup and hold checks.
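One way such an internal script could combine arcs is sketched below. The record layout, port names, and clock name are hypothetical. The idea is to reduce every block-level path between a port and a register to one arc per port and clock that keeps the shortest minimum delay and the longest maximum delay.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortArc:
    port: str
    clock: str
    min_ps: float   # shortest minimum-delay path, used for hold checks
    max_ps: float   # longest maximum-delay path, used for setup checks


def build_block_model(paths):
    """Combine per-path results into one min/max arc per (port, clock).

    `paths` is an iterable of (port, clock, min_delay_ps, max_delay_ps)
    tuples taken from block-level static timing analysis.
    """
    bounds = {}
    for port, clock, dmin, dmax in paths:
        lo, hi = bounds.get((port, clock), (dmin, dmax))
        bounds[(port, clock)] = (min(lo, dmin), max(hi, dmax))
    return [PortArc(port, clock, lo, hi)
            for (port, clock), (lo, hi) in sorted(bounds.items())]


# Hypothetical block-level results: two paths from `din`, one to `dout`.
sta_paths = [
    ("din", "clk", 80.0, 310.0),
    ("din", "clk", 120.0, 450.0),
    ("dout", "clk", 150.0, 380.0),
]

model = build_block_model(sta_paths)
for arc in model:
    print(f"{arc.port} ({arc.clock}): min {arc.min_ps:.0f} ps, max {arc.max_ps:.0f} ps")
```

Since the model depends only on the block type, the same list of arcs can be reused for every replicated instance in the parent.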
Closure
Adding sufficient margins early allows a design to converge with only a handful of iterations for each block type and for the top-level. By taking a conservative design style with full complementary checking of both sides of every setup and hold race, success is built into the process rather than risking being bitten by unaccounted-for coupling or noise issues found only after the silicon returns. The conservative delay calculation metrics discussed in this article also mean that there is upside that can be expected.
By choosing to base the delay calculations on worst-case values, a typical process spread will actually produce most parts that run well above the speed predicted by the worst-case model. In addition, the extra multiplicative timing margins built in by the factor x in equations (1) and (2) provide further upside between the “guaranteed” goal stated at tapeout and the actual attained clock frequencies proven by testing of the finished packaged parts across environmental conditions.