1. Field of the Invention
This invention relates to methods for tuning the digital design and design automation of high-performance digital integrated circuits. The invention particularly is directed to the problem of developing an integrated circuit design optimization methodology which exploits circuit tuning of individual macros. The tuning of individual macros is conducted by optimizing transistor sizes over a defined continuous design space. To further optimize for performance, circuits with low or high threshold voltage transistors are selectively substituted for regular threshold ones.
2. Description of Background
Typically the datapath and array design sections of a high-speed microprocessor design are logically well-defined. In addition, these circuit sections are typically custom-designed circuits which are electrically and physically designed much earlier in the design cycle to assure high clocking performance. The remaining sections of the design, the control logic sections, are often changed late in the microprocessor design cycle to reach the required logical function. Other objectives such as timing closure place additional constraints on these sections, making automated design closure techniques extremely valuable. Automated techniques deliver several advantages such as improved circuit performance, higher quality and correctness, and reduced time-to-market.
The control logic is contained in physical entities called random logic macros or RLMs, where the term random does not imply true randomness, but instead a lack of regular structure as is found in datapath and array circuitry. Due to the unstructured nature of the logic, synthesis and place/route tools are employed to read in a logical description and transform it into primitive logical gates. Hereafter the term gate is understood to include a collection of transistors which is to be treated as a single logical circuit element. These gates are adjusted in drive strength to achieve timing objectives and placed legally while a wiring tool routes the connections between these gates to complete the physical design.
Prior-art methods of circuit tuning towards timing closure are illustrated in FIG. 1 (flow 100) in the case of a “flat” design methodology and FIG. 2 (flow 200) in the case of a “hierarchical” design methodology. For the majority of semiconductor designs, FIG. 1 is typically employed where designs are analyzed and optimized in a flat representation. The flow starts with a logic design specification (labeled “Customer Logic Drop” in box 110) whereby a logic synthesis tool (box 120) synthesizes the input logic description into proper logical gates and structures to assure logical correctness. Logic restructuring of the gates (box 130) maps the logical gates into technology-approved gates required for manufacturing. An initial timing analysis of the design is typically applied whereby the gates are put through a process of powering up and down (box 140), depending on the current drive needed to move towards cycle-time goals. This process continues with the physical design (box 150) which entails the physical placement of the gates and global routing of the design which is needed to provide a rough understanding of the real wiring of the design. The global routing provides parasitic extraction details which, along with the placed design, are provided as input to static timing analysis (box 160) to predict the performance of the current design. Many iterations of this process (box 160 to box 120) are attempted with different adjustments of the entire process to achieve timing closure. If cycle time or the physical constraints are not met via this process, the design may take on new engineering logic changes (box 170), and if so the design is started once again through the synthesis, placement, timing process as described (boxes 120–160). If no logic changes are required, the gates of the design are connected via routing, the electrical parasitics of the routing are extracted, and the design proceeds through a final static timing (box 180).
Other checks are applied at this time but typically the design must be adjusted logically to achieve cycle time goals and the design is presented back to the synthesis, placement and timing process of boxes 120–160. Upon final acceptance of the design, it is sent to manufacturing (box 190).
The other prior-art methodology to achieve timing closure is illustrated in FIG. 2 (flow 200) and is usually implemented for more complex and dense semiconductor designs such as a microprocessor. This hierarchical methodology presents many benefits in the spirit of “divide and conquer.” It permits a parallel approach to tackling the complexity of the design by enabling the design team to work on macro partitions of the overall design. Each macro partition has a space budget and a timing budget associated with it. Once the macro partition has been designed, its actual timing is represented as a timing abstract, which is a simplified representation of its salient overall timing characteristics. This method improves design turnaround time and time-to-market. It also permits an analysis of the performance of the chip at any stage of the design: for portions of the chip that have been designed, full timing abstract models are substituted for the estimated or budgeted timing models created during high-level design planning, while portions that have not yet been designed continue to use the budgeted models.
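The substitution of timing abstracts for budgeted models described above can be illustrated by the following minimal Python sketch. It is not taken from the prior-art flows themselves; the macro names, the delay values in picoseconds, and the reduction of each macro to a single delay number are assumptions made purely for illustration.

```python
from typing import Dict, List


def macro_delay(name: str,
                budgets: Dict[str, float],
                abstracts: Dict[str, float]) -> float:
    """Return the abstract delay if the macro has been designed,
    otherwise fall back to its budgeted delay."""
    return abstracts.get(name, budgets[name])


def path_delay(path: List[str],
               budgets: Dict[str, float],
               abstracts: Dict[str, float]) -> float:
    """Sum the delays of the macros along one chip-level path."""
    return sum(macro_delay(m, budgets, abstracts) for m in path)


# Hypothetical budgets (ps) set during high-level design planning.
budgets = {"regfile": 150.0, "adder": 120.0, "ctrl_rlm": 200.0}
# Only the adder has been designed so far; its abstract replaces its budget.
abstracts = {"adder": 110.0}

chip_path = ["regfile", "adder", "ctrl_rlm"]
total = path_delay(chip_path, budgets, abstracts)  # 150 + 110 + 200
```

As further macros are completed, their entries would be added to the abstracts table and the same chip-level analysis repeated without other changes.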
The boxes on the left of FIG. 2 (boxes 205 to 230) pertain to a custom design flow. Custom design is typically applied to arithmetic or dataflow circuits, or other circuits requiring careful hand-design. The boxes on the right of FIG. 2 (boxes 235–265) pertain to a design flow for automatically synthesized and placed random logic macros. The boxes in the middle (boxes 270 to 290) pertain to the steps in the methodology that bring together the custom and random logic macros for global integration and timing of the chip. Thus flow 200 of FIG. 2 embodies two levels of timing closure: one at the individual macro level, and another at the global chip level. The following four paragraphs explain the three main sections of this prior-art hierarchical design methodology.
Functions like adders and arrays are typically implemented as custom macros. It is widely known that efficient implementations of such macros cannot be totally developed by automatic computer-aided design (CAD) software and that in-depth engineering is required in many steps to assure performance. Therefore a custom macro's logic architecture (box 205) is described but the architecture is restructured logically and modified by hand for optimal timing performance (box 210). A rough physical placement is constructed of the major building blocks (box 215) to estimate overall size and to minimize parasitic element constraints within the design. The schematic is developed with these constraints and static or dynamic timing analysis is performed (box 220) by employing a static timing analysis tool or circuit simulator, respectively. The macro design is physically engineered, parasitics are extracted and the design is timed again (box 225). As necessary, the processes of boxes 210 to 225 are iterated to improve the timing characteristics of the macro.
For the purposes of hierarchical analysis, a timing macro model, called a timing abstract (box 275), is produced. A timing abstract is a simplified model that represents the timing behavior of the entire macro. The timing abstract typically contains the timing behavior of timing arcs to and from macro boundary pins to latch points within the macro design. These timing abstracts are incorporated at the global level to enable the chip-level timing analysis (box 270) for the entire chip design. Timing assertions (box 230) or constraints are fed back and applied on the macro pins from the global timing analysis. These timing assertions when placed on the custom macro during timing analysis could result in the need for additional logic restructuring for timing optimization and the process is started again (box 225 back to box 210). To help speed this process it is extremely effective to produce the timing abstracts in the beginning of the design from the schematic timing analysis (dotted arrow from box 220 to box 275).
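By way of illustration only, the following Python sketch shows one way a timing abstract of the kind described above might be derived: the worst-case (longest) delay from each macro boundary pin to each latch point is computed over the macro's internal timing graph, and only those pin-to-latch arcs are retained. The graph representation, node names, and delay values are hypothetical, and the sketch assumes an acyclic graph with a single delay per edge.

```python
from typing import Dict, List, Optional, Tuple

Edges = Dict[str, List[Tuple[str, float]]]  # node -> [(successor, delay_ps)]


def longest_delay(edges: Edges, src: str, dst: str,
                  memo: Dict[str, Optional[float]]) -> Optional[float]:
    """Worst-case delay from src to dst in an acyclic timing graph,
    or None if dst is unreachable. memo is valid for one fixed dst."""
    if src == dst:
        return 0.0
    if src in memo:
        return memo[src]
    best: Optional[float] = None
    for succ, delay in edges.get(src, []):
        rest = longest_delay(edges, succ, dst, memo)
        if rest is not None and (best is None or delay + rest > best):
            best = delay + rest
    memo[src] = best
    return best


def build_timing_abstract(edges: Edges, boundary_pins: List[str],
                          latch_points: List[str]) -> Dict[Tuple[str, str], float]:
    """Keep only the worst-case pin-to-latch arcs; internal nodes are dropped."""
    abstract = {}
    for dst in latch_points:
        memo: Dict[str, Optional[float]] = {}
        for src in boundary_pins:
            d = longest_delay(edges, src, dst, memo)
            if d is not None:
                abstract[(src, dst)] = d
    return abstract


# Hypothetical macro: two input pins, two internal gates, one latch.
edges = {
    "A": [("g1", 10.0)],
    "B": [("g1", 12.0), ("g2", 5.0)],
    "g1": [("g2", 20.0), ("latch1", 8.0)],
    "g2": [("latch1", 15.0)],
}
abstract = build_timing_abstract(edges, ["A", "B"], ["latch1"])
```

The resulting table contains one arc per boundary pin and latch point, which is far smaller than the internal graph and is what the chip-level timing analysis would consume.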
In parallel to the custom flow, the Random Logic Macro (RLM) flow starts with an initial logic specification (labeled “RLM Logic Drop” in box 235). This part of the flow is very similar to FIG. 1 (flow 100). The logic is synthesized (box 240), logic is restructured for technology adaptation (box 245), gates are powered up or down for required drive strengths (box 250), and the resulting netlist is physically designed (box 255) for minimal wire length on critical paths. Once completed, parasitic extraction and timing analysis are executed (box 260) on the design. Constraints are typically adjusted throughout this process and the loop is iterated many times (box 260 back to box 240) to produce optimal performance. As with custom designs, timing abstracts (box 275) for the RLMs are produced for global timing analysis. Timing assertions (box 265) are fed back to the RLM design loop as additional constraints on the design.
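The powering up of gates to reach a delay target may be pictured with the following Python sketch of a greedy, discrete heuristic. It is an illustrative assumption and not a description of any particular synthesis tool: the drive-strength library, the delay model (intrinsic delay plus load divided by drive), and the omission of the added input capacitance that a larger gate presents to its driver are all simplifications.

```python
from typing import Dict, List

DRIVES = [1, 2, 4, 8]  # hypothetical discrete drive strengths in the library


def gate_delay(gate: Dict[str, float]) -> float:
    """Simplified delay model: intrinsic delay plus load over drive strength."""
    return gate["intrinsic"] + gate["load"] / DRIVES[int(gate["idx"])]


def path_total(path: List[Dict[str, float]]) -> float:
    return sum(gate_delay(g) for g in path)


def power_up(path: List[Dict[str, float]], target: float) -> bool:
    """Repeatedly upsize the gate giving the largest delay reduction until
    the path meets the target. Returns False if no upsizing remains."""
    while path_total(path) > target:
        best, best_gain = None, 0.0
        for g in path:
            i = int(g["idx"])
            if i + 1 < len(DRIVES):
                gain = g["load"] / DRIVES[i] - g["load"] / DRIVES[i + 1]
                if gain > best_gain:
                    best, best_gain = g, gain
        if best is None:
            return False  # every gate is already at maximum drive
        best["idx"] += 1
    return True


# Hypothetical two-gate path with a 50 ps target.
path = [
    {"intrinsic": 10.0, "load": 40.0, "idx": 0},
    {"intrinsic": 10.0, "load": 16.0, "idx": 0},
]
met = power_up(path, target=50.0)
```

Because the drive strengths are discrete and each step is chosen locally, such a loop gives no guarantee of optimality.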
The third main component of flow 200 comprises the boxes in the middle of FIG. 2 in which the various custom and RLM macros are integrated to obtain a final chip design. Global timing results (box 270) and new logic changes (box 280) would require either custom or RLM macros or both to be redesigned depending on the change required. Evaluation of a final timing run, physical design requirements and other checks (box 285) might also require a loop through either the custom or RLM flows. Until all checks and analysis are satisfied, the chip cannot be released for manufacturing (box 290).
The two described prior-art methodology flows have various strengths and weaknesses, which are discussed below. Analyzing a flat design as in FIG. 1 (flow 100) enables a small design team to close in on the cycle time quickly with the help of a fully automated chip development system. Typically the optimal cycle time for a given technology is not achieved due to the imperfections within this system, including approximations made along the way and shortcuts taken throughout the process to achieve a quick turnaround from the design system. Inaccuracies include, for example, the use of approximate timing models rather than transistor-level simulation. Shortcuts include sub-optimal and heuristic optimization methods, a relatively small library to keep synthesis turnaround time to a minimum, and relatively simplistic modeling and optimization techniques. These limitations or imperfections force the system in FIG. 1 (flow 100) to iterate many times before reaching a successful conclusion on the achievable performance. Detailed “cross-section analysis” of the worst case paths of the design is typically not conducted in such a flow.
A flat design flow as in FIG. 1 (flow 100) implies very large optimization problems with very large design spaces. Various commercial and in-house CAD (computer-aided design) tools have been developed to solve the optimization problems in this flow. Due to the large number of optimization variables and the discrete nature of these variables, heuristic methods have been favored in these optimization tools. The use of heuristic methods leads to a number of difficulties including problems in achieving performance, a large number of design iterations, and incomplete exploitation of the technology, all of which lead to long design times and sub-optimal designs.
As digital designs become more complex and dense as in leading-edge microprocessors and SoCs (systems on a chip), FIG. 1 (flow 100) is insufficient to obtain timing closure and system objectives, hence a hierarchical methodology like that of FIG. 2 (flow 200) is applied. While FIG. 2 (flow 200) controls the design complexity with the use of macro partitions, timing closure must now be accounted for at two levels of the design, both locally within the macro and globally. Referring back to FIG. 2 (flow 200), the difficulty with this flow is the management of the timing budgets between macros and global connections. The process requires careful management of these budgets through the use of the wiring and buffering solutions of the global nets and careful logic implementation of the macro paths connecting between these global paths. Cycle time closure on the critical paths could require logic re-design to reduce overall delays of these timing paths. This redesign could affect connections at the global level, and new solutions of wiring and buffering must then be implemented. Therefore logic and circuit solutions at both the global and macro levels must be analyzed and developed to achieve optimum cycle time performance. Another potential problem is that typically only the topmost critical path or paths are searched, broken down and analyzed for the required solution, but as soon as that is done, other sub-critical paths which were not exposed before may now be limiting. The procedure is therefore time-consuming and not guaranteed to produce a design that meets the required system-level performance.
Thus, the hierarchical design flow of FIG. 2 (flow 200) suffers from two main problems. The first is the difficulty of iteratively adjusting the budgets of the individual macros so as to meet global timing requirements, while giving individual macros reachable targets. The second is that focusing on just a few critical paths is not sufficient to meet overall timing and leads to a great deal of re-design.
As described above, both flows limit the progression of the design due to the nature of the heuristic iterations that each flow applies, and therefore retard the circuit performance which can be obtained by the flows. This slow convergence rate towards cycle time objectives limits the flexibility of the design team to introduce functional and timing changes throughout the design process, and in particular during the crucial period late in the design cycle. In all cases, only a small sub-set of critical paths is exposed towards an optimal solution.
Modern technologies allow multiple threshold voltage transistors, whereby transistors with different threshold voltages can be integrated on the same chip. Low threshold voltage (Low Vt) transistors offer faster performance, but at the cost of increased leakage power. High threshold voltage (High Vt) transistors offer significant reduction in leakage power, but at the cost of lower performance. It is therefore beneficial to sparingly use Low Vt devices on the critical paths to achieve higher performance, but limit the usage of such devices to limit leakage power. It is also beneficial to use High Vt devices on the non-critical paths to reduce leakage power, but not to the extent that the non-critical paths slow down and turn into critical paths.
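The trade-off described above may be illustrated by the following Python sketch of a simple slack-based assignment rule. It is a hypothetical heuristic offered for illustration and is not the method of the invention or of any specific CAD tool. In particular, it treats the slack of each gate as independent, whereas in practice slack is shared along a path and timing would have to be re-analyzed after each substitution; the gate names, slack values, and delay penalties are assumed.

```python
from typing import Dict


def assign_vt(slack_ps: Dict[str, float],
              high_vt_penalty_ps: Dict[str, float],
              margin_ps: float = 0.0) -> Dict[str, str]:
    """Choose a threshold-voltage flavor per gate from its timing slack.

    - negative slack: critical, use Low Vt for speed (more leakage)
    - slack covers the High Vt slowdown plus a margin: use High Vt
    - otherwise: keep the regular Vt device
    """
    assignment = {}
    for gate, slack in slack_ps.items():
        if slack < 0:
            assignment[gate] = "low"
        elif slack >= high_vt_penalty_ps[gate] + margin_ps:
            assignment[gate] = "high"
        else:
            assignment[gate] = "regular"
    return assignment


# Hypothetical slacks and High Vt delay penalties, in picoseconds.
slacks = {"g1": -5.0, "g2": 3.0, "g3": 40.0}
penalties = {"g1": 10.0, "g2": 10.0, "g3": 10.0}
vt = assign_vt(slacks, penalties, margin_ps=5.0)
```

The margin parameter reflects the requirement that High Vt substitution should not slow non-critical paths to the point that they become critical.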
Physical synthesis CAD tools employ heuristic methods to introduce multiple threshold devices with the dual objectives of achieving higher performance and limiting leakage power. The prior-art heuristic methods limit the ability to optimally adjust the performance and leakage of the circuits.
The prior art therefore suffers from several problems and weaknesses as summarized below:
a) Prior art design flows generally require excessive iteration.
b) Prior art design flows focus on optimizing only the single most critical path or a very small subset of the most critical paths, which in turn leads to long design times and slow convergence towards cycle time objectives.
c) In prior art hierarchical flows, the interaction between imperfect budgeting across the hierarchy at the global level and imperfect optimization at the macro level can lead to poor circuit performance and excessive redesign effort.
d) Prior art flows do not include continuous optimization techniques during RLM design, despite the ability of such techniques to obtain optimal solutions. Further, prior-art flows do not include specialized techniques to deal with the relatively low capacity and high run times of such continuous optimization methods.