Static Timing Analysis (STA) is a CAD application used in the synthesis, optimization and analysis of digital electronic circuits consisting of a plurality of active circuit elements, e.g., gates or transistors, connected through networks of wires, referred to as nets.
Referring to FIG. 1, a digital electronic circuit is illustrated, having shapes labeled NAND0, NAND1, NAND2, NOT0, NOR0, and NOT1 representing circuit elements, wherein gates referenced as NAND, NOT and NOR represent specific logic functions. Arrows between shapes represent nets. The labels I1-I5 and O1-O3 represent the inputs and outputs of the digital circuit.
An objective of STA is to predict the behavior of circuits in terms of bounds on the times at which certain events occur, e.g., the transition of a signal at the output of a particular gate from a logic “0” state to a logic “1” state, in order to ensure certain time constraints on those events, such as their relative order, or comparison with an externally supplied specification (e.g., overall circuit frequency). Furthermore, STA predicts circuit timing by: (1) asserting the time of certain events (e.g., transitions on circuit input signals), (2) using a variety of simulation techniques to predict the time delay between input events and output events of circuit elements or nets, and (3) computing bounds on the time of events other than the asserted input event times by means of event propagation, in which the time of an event at the output of a circuit element or net is computed to be some combination operation (e.g., maximum or minimum) applied to a set of propagated input event times, each determined as the sum of an input event time and the delay of the circuit element or net.
The detailed operation of the preceding steps is best understood by way of a timing graph, a directed graph whose nodes are referred to as timing points and whose edges are referred to as propagate segments. Timing points correspond to the inputs and outputs of circuit elements (e.g., gates) and nets. Propagate segments correspond to connections between timing points and are of two types: device propagate segments correspond to the internal connections between a circuit element's inputs and outputs, and net propagate segments correspond to the connections between circuit elements, which comprise nets.
Referring to FIG. 2, a timing graph is illustrated that corresponds to the digital electronic circuit shown in FIG. 1. Small dots with labels starting with “tp” represent timing points, while lines with labels starting with “ps” represent propagate segments. As an example, timing points labeled tp5 and tp6 represent the two inputs to the NAND0 circuit element shown in FIG. 1, and timing point tp10 represents the output of NAND0. Propagate segments labeled ps6 and ps7 represent the internal connections between the NAND0 inputs and output. Propagate segment ps11 represents the external connection (net) between timing point tp10, the output of circuit element NAND0, and timing point tp13, one of the inputs of circuit element NOR0. Timing points labeled I1-I4 and O1-O3 represent the inputs and outputs of the overall circuit.
An STA application typically defines data structures that correspond to timing points, to propagate segments, and to the overall timing graph. The timing point data structures potentially hold information about bounds on the times at which certain events, e.g., logic transitions of a specified type (e.g., logic “0” to logic “1”, or logic “1” to logic “0”) occur. The propagate segment data structures potentially hold information about the delay between events at its input timing point and its output timing point; they may also contain information about how that delay value is to be computed, e.g., the modeled propagation time within circuit elements, or the resistance and capacitance of the electrical connection between circuit elements.
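By way of illustration only, the data structures described above might be sketched as follows. The class names, field names and the add_segment helper are assumptions introduced for this sketch and are not taken from any particular STA application; the timing point labels in the usage lines follow the tp5, tp6, tp10 and tp13 example of FIG. 2.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TimingPoint:
    """Holds bounds on event times at one input or output (hypothetical layout)."""
    name: str
    late_at: Optional[float] = None   # upper bound on arrival time, if computed
    early_at: Optional[float] = None  # lower bound on arrival time, if computed


@dataclass
class PropagateSegment:
    """Directed connection between two timing points."""
    source: str
    sink: str
    kind: str                         # "device" (inside an element) or "net"
    delay: Optional[float] = None     # filled in once the delay is calculated


@dataclass
class TimingGraph:
    points: dict = field(default_factory=dict)
    segments: list = field(default_factory=list)

    def add_segment(self, source, sink, kind, delay=None):
        # Create the end points on first use, then record the segment.
        for name in (source, sink):
            self.points.setdefault(name, TimingPoint(name))
        segment = PropagateSegment(source, sink, kind, delay)
        self.segments.append(segment)
        return segment


# Usage: the NAND0 inputs, its output, and the net leading to NOR0.
graph = TimingGraph()
graph.add_segment("tp5", "tp10", "device")
graph.add_segment("tp6", "tp10", "device")
graph.add_segment("tp10", "tp13", "net")
```

In this sketch the delay and arrival time fields start out empty, reflecting that such values are potentially held by the structures and are filled in as the analysis proceeds.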
STA applications typically begin by inputting a description of the circuit to be analyzed and constructing a corresponding timing graph data structure consisting of a multiplicity of timing point and propagate segment data structures. In standard practice, the directed timing graph is converted to a directed acyclic timing graph (hereinafter, DAG) by breaking loops, i.e., by “snipping” certain propagate segments. Possible timing errors missed by the removal of the propagate segments can be detected by adding certain constraints or tests applied to the remaining DAG elements. Once the timing graph has been converted to a DAG, one aspect of the STA application, referred to as arrival time propagation, may be described as follows. Arrival time propagation computes for each timing point an estimate of the worst case time at which the effects of an event on one or more of the overall circuit inputs propagate to the specified timing point, i.e., a bound on the arrival time at the timing point.
Arrival time (AT) can be calculated for a given timing point by: (1) enumerating all paths leading from any input to the timing point, (2) calculating the AT at the end of each path as the AT of the input plus the delays of all the propagate segments on the path, and (3) in late (early) timing, computing an upper (lower) bound on the AT by taking the maximum (minimum) of the ATs along all paths and assigning that as the AT at the selected timing point. Since the computational complexity of this method may be exponential in the number of propagate segments, the computation is typically performed in a block based manner, wherein each timing point AT can be computed once the ATs of its immediately preceding timing points and the delays of the propagate segments connecting them to the selected timing point are known. For the so called late mode propagation, for example, the AT is the maximum of the sums of each preceding timing point AT and the corresponding propagate segment delay. The foregoing is illustrated in FIGS. 3a-3e using a part of the timing graph shown in FIG. 2.
By way of example, in FIG. 3a, ATs are asserted at the overall inputs, as indicated by the numbers 0, 1 and 0, corresponding to time units. In FIG. 3b, delay values have been added to each of the propagate segments leading from the input timing points, and each of the timing points at the ends of those propagate segments can then be assigned the maximum of the sums of the AT and delay at each input. FIGS. 3c-3e show how these operations can be used to eventually assign ATs to all the timing points of the circuit. Analogous techniques are used to propagate required arrival times (RATs) “backward” from circuit output timing points to circuit inputs; once this is done, a “slack” (in late mode, slack=RAT−AT) can be computed at each timing point, and negative slack values indicate potential timing violations.
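The forward AT pass, the backward RAT pass and the late mode slack computation can be sketched as below. The four-point graph, its delays and the asserted values are hypothetical and do not reproduce FIGS. 3a-3e. The backward rule used here (the minimum over outgoing segments of the sink RAT minus the segment delay) is the usual late mode counterpart of the forward maximum and is stated as an assumption, since the text above describes the backward pass only as analogous.

```python
# Hypothetical timing graph: (source, sink) -> delay in time units.
edges = {("a", "c"): 2, ("b", "c"): 1, ("c", "d"): 3}
order = ["a", "b", "c", "d"]   # one valid topological order of the timing points

at = {"a": 0, "b": 1}          # asserted ATs at the circuit inputs
rat = {"d": 4}                 # asserted RAT at the circuit output

# Forward pass: late mode AT is the max of (source AT + delay) over incoming segments.
for point in order:
    arrivals = [at[src] + delay for (src, dst), delay in edges.items() if dst == point]
    if arrivals:
        at[point] = max(arrivals)

# Backward pass: late mode RAT is the min of (sink RAT - delay) over outgoing segments.
for point in reversed(order):
    required = [rat[dst] - delay for (src, dst), delay in edges.items() if src == point]
    if required:
        rat[point] = min(required)

# Late mode slack; a negative value flags a potential timing violation.
slack = {point: rat[point] - at[point] for point in order}
```

With these illustrative numbers the AT at the output is 5 against a RAT of 4, so every timing point on the path ends up with a slack of −1.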
A critical aspect of the block based computation of ATs is that an AT for a given timing point can be computed if and only if the ATs and delays of its preceding timing points and propagate segments have been computed. One way to ensure this is by pre-levelizing the timing graph. As illustrated in FIG. 4, it is possible to associate with each timing point in a loop free timing graph (more generally, a DAG) a number (hereinafter referred to as AT level), which indicates the length of the longest path from any input to that timing point; this can be computed in time linear in the number of graph edges by topological sorting.
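One possible way to compute AT levels in a single topological pass is sketched below using Kahn's algorithm. The function name at_levels and the edge-list input format are assumptions made for this sketch; the graph is assumed to have had its loops snipped already, and the function raises an error if it has not.

```python
from collections import defaultdict, deque


def at_levels(edges):
    """Return {timing point: AT level} for a DAG given as (source, sink) pairs.

    The AT level is the length, in propagate segments, of the longest path
    from any input to the timing point.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    level = {node: 0 for node in nodes}
    ready = deque(node for node in nodes if indegree[node] == 0)
    visited = 0
    while ready:
        src = ready.popleft()
        visited += 1
        for dst in successors[src]:
            # Longest path: keep the largest predecessor level plus one.
            level[dst] = max(level[dst], level[src] + 1)
            indegree[dst] -= 1
            if indegree[dst] == 0:
                ready.append(dst)

    if visited != len(nodes):
        raise ValueError("timing graph still contains a loop")
    return level
```

Each propagate segment is examined exactly once, which is what makes the computation linear in the number of graph edges.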
Referring to FIG. 5, an algorithm for a block based STA is shown.
In the first step 005, the AT level is computed for all timing points. The second step 010 consists of an iteration over an integer index ranging from 0 to the maximum AT level in the graph. Within each iteration, there is a second nested iteration starting at step 015 over all the timing points with the respective AT level. For each point, there is a third nested iteration starting at step 020 over all the propagate segments incident on the given timing point. For each such segment, in step 025 its delay is calculated and added to the AT of its source timing point, and in step 030 the maximum (minimum) of those values over all such segments is computed and stored as the late (early) mode AT value for the respective timing point. The process finishes at step 035 when all the AT levels have been processed. The topological ordering ensures that when a given timing point is processed, all of its predecessor timing points, which have lower AT levels, will already have been processed. Another technique to guarantee a proper dependency ordering consists of satisfying an AT query by performing a depth first recursive search through the DAG, starting at the query point and working backwards to the frontier of timing points whose ATs have either already been calculated or are known via a user supplied assertion. Once the frontier of previously known ATs has been encountered, AT values can be recursively updated in the fan-in cone of the original query point.
It may be appreciated that the computation time for large digital electronic circuits, e.g., comprising millions of gates and nets, may be considerable. It may also be appreciated that there is a potential to accelerate this computation through parallel processing, a technique in which a computation may be divided among a multiplicity of computing devices so that the overall elapsed time of the computation is reduced. One example, referred to as task level parallelization, is to perform STA on two or more independent digital circuits simultaneously, each on its own respective computing device. Another technique exploits the fact that within a given timing graph, corresponding to a single digital circuit, there is the possibility of performing certain computations simultaneously. By way of example, and referring back to FIG. 2, all the delay calculations on the first rank of propagate segments can be performed independently, hence simultaneously. Similarly, once those are complete, the ATs of the second rank of timing points can be calculated simultaneously.
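The two orderings just described can be sketched as follows. The function names, the dictionary-based inputs and the use of precomputed segment delays (in place of the delay calculation of step 025) are assumptions made for brevity. The sketch also assumes that every timing point without incoming segments has an asserted AT. The step numbers in the comments refer to FIG. 5 as described above.

```python
def build_incoming(segments):
    """Map each timing point to its incoming (source, delay) pairs."""
    incoming = {}
    for (src, dst), delay in segments.items():
        incoming.setdefault(dst, []).append((src, delay))
    return incoming


def propagate_ats(levels, segments, asserted, late=True):
    """Levelized block based AT propagation over precomputed AT levels."""
    combine = max if late else min
    incoming = build_incoming(segments)
    at = dict(asserted)
    for level in range(max(levels.values()) + 1):                     # step 010
        for point in (p for p, l in levels.items() if l == level):    # step 015
            sums = [at[src] + delay                                   # steps 020, 025
                    for src, delay in incoming.get(point, [])]
            if sums:
                at[point] = combine(sums)                             # step 030
    return at


def query_at(point, incoming, known, late=True):
    """Demand driven alternative: recurse backward to the frontier of known ATs."""
    if point not in known:
        combine = max if late else min
        known[point] = combine(
            query_at(src, incoming, known, late) + delay
            for src, delay in incoming[point]
        )
    return known[point]
```

The recursive form only visits the fan-in cone of the query point, at the cost of recursion depth proportional to the longest path; for very deep graphs an explicit stack would be needed in place of Python recursion.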
It is also to be noted that there are still constraints that limit the number of computations that can be performed simultaneously, namely the requirement that the ATs and delays of a particular timing point's predecessor timing points and propagate segments be computed before the subject timing point's ATs can be computed.
Existing parallel implementations of STA enforce this constraint through the use of a parallel version of the aforementioned levelized analysis technique, as shown in FIG. 6. While the loop over timing levels (Box 040 in FIG. 6) is sequential, within a given level, all the timing points on that level can be divided (050) among a multiplicity of computing elements operating simultaneously (055-080). The main computation waits for all of the computing elements to complete their respective computations (080), at which time the next level is processed (045) until all levels have been completed (095). This technique has been demonstrated to reduce overall STA runtime, but suffers from a problem common to many parallel algorithms: load balancing. This problem occurs when workload is divided unevenly among parallel computing elements, resulting in elements performing no computation while waiting for other elements to complete. In the case of parallel levelized STA, this may occur because the number of timing points to be processed for a given AT level is fewer than the number of available computing elements, or because the amount of computation assigned to each computing element is not balanced because it varies between different timing points; for example, a timing point preceded by a very large net (e.g., power or clock) may take considerably more time for delay computation than a timing point preceded by a single input, single output net.
The overall effect of load imbalance is that the parallel speedup (the ratio of serial runtime to parallel runtime) approaches a limit in spite of adding more and more computing elements.
It may be appreciated that the levelized technique is unduly restrictive with respect to the underlying precedence constraint. Timing points in two disjoint paths or sub-graphs can proceed in parallel without regard to their specific AT levels. This observation leads to other approaches to parallelization, so called dynamically scheduled parallel STA. Another prior art method teaches a static timing analysis performed in parallel using multiple “compute modules” which can span AT levels. However, a deficiency of the prior art is that it relies on a central “control module” to perform all updates on the timing graph and to determine a set of next available work based on the most recent results. To the extent that such implementations rely on a single, sequential (non-parallel) mechanism, they are subject to the well known Amdahl's Law, which states that parallel speedup is limited by the fraction of the computation that is non-parallel.
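The levelized parallel scheme of FIG. 6 can be sketched with a thread pool as below. This is an illustration of the control structure only: the function name, the input dictionaries and the use of precomputed delays are assumptions, and threads in standard Python would not by themselves yield a real speedup for compute bound delay calculation. Every timing point at level 0 is assumed to have an asserted AT.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_levelized_ats(levels, segments, asserted, workers=4):
    """Late mode ATs, one level at a time, with the points of a level in parallel."""
    incoming = {}
    for (src, dst), delay in segments.items():
        incoming.setdefault(dst, []).append((src, delay))
    by_level = {}
    for point, level in levels.items():
        by_level.setdefault(level, []).append(point)

    at = dict(asserted)

    def late_at(point):
        # Reads only ATs of lower levels, which are complete at this stage.
        sums = [at[src] + delay for src, delay in incoming.get(point, [])]
        return point, (max(sums) if sums else at[point])

    with ThreadPoolExecutor(max_workers=workers) as pool:
        for level in sorted(by_level):          # sequential loop over AT levels
            # list() drains the whole level before continuing: the per-level barrier.
            results = list(pool.map(late_at, by_level[level]))
            for point, value in results:
                at[point] = value
    return at
```

The barrier at the end of each level is where idle time accumulates: a level holding fewer points than workers, or one point with an unusually costly delay calculation, leaves the remaining workers waiting.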
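The limit imposed by a sequential component can be made concrete with the usual statement of Amdahl's Law, speedup = 1 / (s + (1 − s) / N), where s is the non-parallel fraction of the computation and N is the number of computing elements. The function name and the 10% figure below are illustrative assumptions only.

```python
def amdahl_speedup(serial_fraction, workers):
    """Upper bound on parallel speedup for a given non-parallel fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)


# With 10% of the work in a sequential control step, the bound stays below 10x
# however many computing elements are added.
for workers in (2, 8, 64, 1024):
    print(workers, round(amdahl_speedup(0.10, workers), 2))
```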
A further aspect of static timing analysis involves reacting to design changes and recomputing timing results based thereon. As an example, an optimization program may perform slack based design closure wherein an initial timing analysis is performed on the entire design, and then logic/physical design changes are made with the goal of improving slack. In such scenarios, it can be appreciated that numerous such changes, each followed by a query for updated ATs and/or RATs, may be performed in the inner loop of optimization, and that accordingly it is imperative that such incremental recalculations of timing data occur in an efficient manner. One technique for efficiently performing incremental timing calculations involves the use of level limited recalculation queues, wherein the aforementioned AT and RAT levels are used to determine a potential dependency of a given AT (RAT) query on a pending design change. Generally, after a design change is first detected, corresponding nodes are eventually inserted into the appropriate AT (RAT) recalculation queues, which are sorted based on AT (or RAT) level. Once a query is made, queues are processed in a levelized manner, during which additional nodes may be inserted into the recalculation queues as a result of propagated change effects. Two key advantages of levelized processing are: i) dominance limiting criteria can be used to limit the forward propagation of changes in the case of AT, or the backward propagation in the case of RAT (e.g., in the case of AT propagation, if a propagated change has no effect on a downstream AT due to another side input fully determining said downstream value, then no further local propagation of changes need occur), and ii) level limiting can be used to prevent propagation of changes beyond the point to which one needs to propagate in order to satisfy a particular query.
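A level limited AT recalculation queue with dominance limiting might be sketched as below for late mode. The function name, its arguments and the convention that the changed points are non-input timing points whose incoming delays were modified are all assumptions of this sketch. A binary heap keyed by AT level stands in for the sorted queues.

```python
import heapq


def incremental_update(at, levels, segments, changed, query_point):
    """Update late mode ATs in place as far as needed to answer query_point.

    Returns the set of timing points still pending beyond the query level.
    """
    incoming, successors = {}, {}
    for (src, dst), delay in segments.items():
        incoming.setdefault(dst, []).append((src, delay))
        successors.setdefault(src, []).append(dst)

    queued = set(changed)
    queue = [(levels[point], point) for point in queued]
    heapq.heapify(queue)
    limit = levels[query_point]

    while queue and queue[0][0] <= limit:            # level limiting
        _, point = heapq.heappop(queue)
        queued.discard(point)
        new_at = max(at[src] + delay for src, delay in incoming[point])
        if new_at == at[point]:
            continue                                 # dominance limiting
        at[point] = new_at
        for nxt in successors.get(point, []):
            if nxt not in queued:
                queued.add(nxt)
                heapq.heappush(queue, (levels[nxt], nxt))
    return queued
```

Because a successor always has a strictly higher AT level than its predecessor, every pending predecessor of a timing point has been recomputed by the time that point is taken from the queue.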
In the case of RAT calculation, it is to be furthermore appreciated that since RATs generally depend on ATs (e.g., RAT at the data end of a test depends on AT for the corresponding clock reference signal), a query for RAT may result in the need to compute further downstream ATs. Fortunately, such processing of AT changes in response to a RAT query can use similar levelized queue processing mechanisms as described above (and in particular, the processing of AT dependencies due to RAT calculation can be level limited if one can determine a priori a maximum AT level for which a given RAT query depends on AT).
One method for performing parallel level limited incremental analysis uses message passing between computer processors that do not necessarily share memory, and assumes a partitioning of a design into separate DAGs or sub-circuits, which prevents efficient load balancing. However, it is often the case that the number of nodes on a given level of the AT or RAT queue at any given point in the analysis is quite small. Therefore, prior art techniques for parallel static timing analysis which involve synchronizing at each AT (RAT) level suffer from particularly poor scaling, especially in cases where load imbalance causes compute resources to wait idle during the synchronization step at each graph level.
Accordingly, there is a need for a method that efficiently performs parallel static timing analysis, which enables efficient incremental re-analysis in response to design changes without imposing limitations of load imbalance introduced with partitioned and levelized configurations and, furthermore, which does not require a centralized control module, but can be adapted to scale to a large number of parallel processors.