1. Technical Field
The present invention relates to electronic design automation and more particularly to timing analysis during automatic scheduling of operations in the high-level synthesis of digital systems.
2. Description of the Prior Art
High-level synthesis (HLS) automates certain subtasks of digital system design in an electronic design automation (EDA) system. A system architect begins by designing and validating an overall algorithm to be implemented, e.g., using C, C++, a specialized language, or a capture system. The resulting architectural specification is partitioned into boards, chips, and blocks. Each block is a single process having its own control flow. There are usually tens to hundreds of such blocks in a modern large-scale chip design. Typical blocks represent whole filters, queues, pipeline stages, etc. Once a chip has been partitioned into its constituent blocks, any needed communication protocols have to be constructed. Such protocols depend on cycle-by-cycle communication between blocks.
So-called "scheduling" and "allocation" are applied one block at a time. Scheduling assigns operations such as additions and multiplications to states of a finite-state machine (FSM). The FSM describes the control flow of the algorithm performed by the block being synthesized. Some operations are locked into particular states, and represent communication with other blocks. These input/output operations cannot be moved, or rescheduled, from one state to another, because to do so would probably upset the block-to-block communication protocol.
However, some other operations can be moved from one state to another. Moving operations from states that have many operations to states that have few allows hardware resources to be more evenly shared among operations. Timing problems can sometimes be resolved by moving certain operations from states in which operation delays cause timing problems into states in which such problems do not exist.
Allocation maps the operations of a scheduled FSM to particular hardware resources. For example, three addition operations can be scheduled to require only a single adder. An appropriate adder is constructed, and the operations are assigned to the adder. But complications can arise when more than one hardware resource of a given bit-width and function is needed, because which resource to use for each operation must then be decided. Considerations include multiplexing cost, the creation of false timing paths, register assignment, and even the use of large resources for small operations. Hardware resources can be used for multiple functions. Calculating a minimum set of resources for an entire process is difficult but rewarding. Sometimes alternative implementations will be possible. It is often possible to choose implementations that meet the overall timing constraints and minimize the gate count. Resource allocation also includes mapping resources (abstract functions) to gate-level implementations.
Allocation includes calculating a register set and assigning data to registers for use in later states. For example, temporary variables are used to store intermediate results in a larger calculation. But the contents of such temporary variables could share a common register in different states. The contents are only needed in one state each. So it is possible to save on hardware by assigning the data that needs to be stored to such storage elements. But register and storage allocations can be complicated if data values can form mutually exclusive sets or can share storage. Data values often drive functional resources, and in turn are often produced by functional resources. A good assignment of data to storage will result in reduced multiplexing costs and delays. The allocation is also made more complex if any register and functional hardware interact.
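The register sharing described above can be illustrated with a minimal sketch. It assumes each data value has a known lifetime given as a first and last state, and it uses a simple left-edge style interval packing, which is a common heuristic and is not taken from the source. The value names and lifetimes are hypothetical.

```python
def assign_registers(lifetimes):
    """Pack data values into registers so that values with
    non-overlapping lifetimes share one register.

    lifetimes: dict mapping value name -> (first_state, last_state), inclusive.
    Returns a dict mapping value name -> register index.
    """
    busy_until = []   # last state in which each register is occupied
    assignment = {}
    # Visit values in order of the state in which they first appear.
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for reg, last in enumerate(busy_until):
            if last < start:          # register is free again: reuse it
                busy_until[reg] = end
                assignment[name] = reg
                break
        else:                         # no free register: allocate a new one
            busy_until.append(end)
            assignment[name] = len(busy_until) - 1
    return assignment

# Hypothetical temporaries: t1, t2, and t4 are each needed in one state only.
lifetimes = {"t1": (1, 1), "t2": (2, 2), "t3": (2, 3), "t4": (4, 4)}
assignment = assign_registers(lifetimes)
register_count = len(set(assignment.values()))
```

In this example t1, t2, and t4 end up in the same register, while t3 overlaps t2 and needs a second one. The sketch ignores multiplexing cost and interaction with functional hardware, which the text identifies as real complications.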
Technology-independent, or Boolean, optimization follows scheduling and allocation. The circuit design comprises generic AND and OR gates connected in a netlist. Technology-independent optimization minimizes the number of literals in the netlist. An abstraction of Boolean gates lends itself to a highly mathematical treatment based on Boolean arithmetic. For example, the Boolean identity AB+AC=A(B+C) can be used to reduce the corresponding gate network.
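As a small check of the identity cited above, the following sketch enumerates the full truth table for both forms. The function names are illustrative only. The sum-of-products form AB+AC has four literals, and the factored form A(B+C) has three.

```python
from itertools import product

def sum_of_products(a, b, c):
    # AB + AC: four literals (A, B, A, C)
    return (a and b) or (a and c)

def factored(a, b, c):
    # A(B + C): three literals (A, B, C)
    return a and (b or c)

# Compare the two forms on all eight input combinations.
equivalent = all(
    sum_of_products(a, b, c) == factored(a, b, c)
    for a, b, c in product([False, True], repeat=3)
)
literals_saved = 4 - 3
```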
Technology mapping follows Boolean optimization. The abstract Boolean gates of the circuit are mapped to standard cells from a technology library. Standard library cells include simple AND, OR, or NOT functions, and much more complex functions, for example, full adders, and-or-invert gates, and multiplexers. Technology-library gates are available in a variety of drive strengths, delays, input loadings, etc. Technology mapping is made more complex by the fact that there are many ways to map an individual Boolean gate, each way having its own advantages.
Technology mapping can sometimes be avoided by constructing custom gate layouts for the gates of a circuit, instead of selecting cells from a library of preconstructed and precharacterized cells. But this method is not commonly associated with automatic synthesis.
The layout tasks of cell placement and net routing follow technology mapping. The physical position of each cell on the chip is established (placement), and the nets necessary to interconnect the cells are laid out (routing). In application service provider 104, the design intellectual property is downloaded to the user for placing and routing.
The usual input for an HLS system is a VHDL or Verilog process. Single conceptual units are represented with single threads of control, well-defined input and output operations, and a well-defined sequence of operations that defines behavior. The output of the typical HLS system includes three interlinked parts and its input behavior. First, a finite-state machine (FSM) comprising a finite set of states "S", an alphabet of input symbols "I", an alphabet of output symbols "O", and a transition function "F" mapping (S×I) → (S×O). This is the so-called Mealy representation of an FSM. Another common representation, the Moore representation, is logically equivalent. Second, a resource graph, comprising a collection of hardware resources, which are abstract representations of registers and combinational logic elements, and interconnections between the resources. Third, a mapping of the process's operations and data values to the states and alphabets of the FSM and to the resources. This mapping can be thought of either as two mappings, e.g., operations and data to the FSM, and operations and data values to resources, or as a single ternary mapping whose tuples are of the form (operation, symbol, resource) or (value, state, resource). Each three-tuple of the ternary mapping can be thought of as describing a linkage between an operation or data value, a state or transition of the FSM, and a register or combinational resource. In other words, what, when, and where.
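The Mealy representation described above can be sketched as a lookup table from a (state, input) pair to a (next state, output) pair. The two-state machine, its input symbols, and its output symbols below are hypothetical and serve only to show the shape of the transition function F.

```python
def run_mealy(transition, start, inputs):
    """Step a Mealy machine through a sequence of input symbols.

    transition: dict mapping (state, input) -> (next_state, output),
                i.e. the function F from S x I to S x O.
    Returns the final state and the list of outputs, one per transition.
    """
    state = start
    outputs = []
    for symbol in inputs:
        state, out = transition[(state, symbol)]
        outputs.append(out)
    return state, outputs

# Hypothetical machine: the input is a condition bit, the output names
# the operation performed on the transition taken.
F = {
    ("s1", True):  ("s1", "add"),
    ("s1", False): ("s2", "idle"),
    ("s2", True):  ("s1", "idle"),
    ("s2", False): ("s1", "idle"),
}

final_state, outputs = run_mealy(F, "s1", [True, False, True])
```

Because outputs are attached to transitions rather than to states, an operation annotated on an arc is performed only when that arc is taken.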
Event control statements like "@ (posedge clock)" in Verilog directly map to corresponding states in the FSM. The control flow between event control statements maps directly to state transitions, or "arcs". Statements like "c=a+b" that occur along a control flow between two event control statements can be mapped onto corresponding FSM arcs, e.g., as operation annotations. An HLS system effectively takes Verilog fragments and a technology library as input, and extracts a skeleton FSM of states and arcs. It then extracts the operations and links them to the FSM. The HLS system constructs a resource set, e.g., registers to contain the values, adders, comparators, etc. The HLS system can then construct tuples that describe the what-when-where linkage between the operations, the FSM, and the resources. For example, the Verilog fragment:
input [7:0] a, b;
output [7:0] c;
always begin: process
  reg [7:0] x, y;
  @ (posedge clock); // s1
  if (x > y) begin
    c = a + b;
  end else begin
    @ (posedge clock); // s2
  end
end
will result in three tuples being constructed: (1) a tuple containing the addition "c=a+b", an arc from s1 to itself, and an adder, (2) a tuple containing the data value "a", a state s1, and a register r1, and (3) a tuple containing the data value "b", a state s1, and a register r2.
The HLS system uses the original description and such tuples to interconnect the resources. In the example, register r1's output must drive one input of the adder, and the adder must drive an output port "c". Once the FSM has been completely described and the resources connected together, the design can be output.
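The what-when-where tuples and the interconnection step can be sketched as follows. The `Link` type, the `operands` table, and the `connections` helper are hypothetical names introduced for illustration; only the three tuples themselves and the resulting connections come from the example in the text.

```python
from typing import NamedTuple

class Link(NamedTuple):
    what: str    # operation or data value
    when: str    # FSM state or arc
    where: str   # register or combinational resource

# The three tuples from the example.
links = [
    Link("c=a+b", "s1->s1", "adder"),
    Link("a", "s1", "r1"),
    Link("b", "s1", "r2"),
]

# Hypothetical operand table taken from the original description:
# operation -> (values it reads, value it writes).
operands = {"c=a+b": (("a", "b"), "c")}

def connections(links, operands):
    """Derive (driver, load) pairs from the tuples and the operand table."""
    where_is = {link.what: link.where for link in links}
    wires = []
    for op, (sources, dest) in operands.items():
        unit = where_is[op]
        for src in sources:
            wires.append((where_is[src], unit))   # register drives the unit
        wires.append((unit, "port " + dest))      # unit drives the output port
    return wires

wires = connections(links, operands)
```

Here `wires` lists r1 and r2 driving the adder and the adder driving port c, matching the connections named in the text.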
Downstream tools can then be used for state assignment on the states of the FSM. Such tools construct the next-state and output logic of the FSM, optimize the logic associated with the FSM and the resources, technology-map the circuit, and perform place and route, test insertion, and even power analysis.
HLS systems accept a process as input. They output an FSM, a resource graph, and a table of relationships that describes what is happening during each state or transition, and where it is happening. If the input description contains more than one process, each process can be scheduled independently. If the description contains other logic, the other logic can be passed through to back-end tools without being changed.
In the simple example given herein, an HLS system cannot do any significant optimization. However, in more complex designs, there will be opportunities to improve the original design. For example, if there were more states and more additions, it might be possible to distribute the additions among the states so that only one adder would be needed. And if the data values "x" and "y" were not permanent, their registers r1 and r2 could be used to store other data values in different states.
Scheduling is an optimizing transformation that assigns operations to states and transitions of an FSM. When operations are assigned to transitions, the underlying FSM is a "Mealy" machine. In the prior art, the problem in scheduling is one of assigning operations to abstract c-steps, which represent single clock cycles, without specifying whether states or transitions are the ultimate target.
The value of scheduling can be seen in the following example. Consider a process having ten states. Assume that in the first state, twenty-one data values are present on the process's inputs. The design must sum up the twenty-one data values and deliver the result to an output variable in the tenth state. Assuming two-argument addition operations, twenty additions need to be made in the ten states available. These operations can be assigned to transitions in any of a number of ways. For a minimum number of adders, two additions per transition makes sense. But, to minimize storage, all of the data should be summed in the first transition, and one register then used to store the sum over the remaining nine states. If the inputs are stable over more than one state, the summation can be done during those states to minimize both storage and the number of adders needed.
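The trade-off in this example can be put in numbers with a rough sketch. It assumes ten scheduling slots, that every value still needed after a slot must be held in a register, and that each two-argument addition reduces the number of live values by one. Chaining delay and stable inputs are ignored. The function names are hypothetical.

```python
def balanced_schedule(num_ops, num_slots):
    """Spread operations as evenly as possible over the slots."""
    base, extra = divmod(num_ops, num_slots)
    return [base + (1 if i < extra else 0) for i in range(num_slots)]

def cost(schedule, num_values):
    """Return (adders, registers) needed by a schedule.

    adders:    the largest number of additions in any one slot.
    registers: the largest number of values carried across a slot boundary.
    """
    adders = max(schedule)
    live = num_values
    carried = []
    for ops in schedule[:-1]:
        live -= ops              # each addition merges two values into one
        carried.append(live)
    return adders, max(carried)

balanced = balanced_schedule(20, 10)   # two additions per slot
eager = [20] + [0] * 9                 # everything summed in the first slot

balanced_cost = cost(balanced, 21)
eager_cost = cost(eager, 21)
```

Under these assumptions the balanced schedule needs 2 adders but up to 19 registers, while the eager schedule needs 20 adders and a single register, which is the tension the example describes.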
The scheduling problem is made more complex if the state graph is not linear. State graphs normally include branches, loops, and alternate flows. The operations to be scheduled are seldom homogeneous. They can be of different types and bit-widths. Data arrival times can vary across the output bits of a resource, and it must be possible to do the operations scheduled on one transition within a single clock cycle.
If there is a conditional control-flow branch, the condition bit computation must settle in time to set up the inputs of the state bits of the FSM. If the branching transitions include operations, the condition bit computation must settle even earlier to enable those operations to complete before cycle end. The aggregate timing of a collection of operations needs to be accommodated, because an operation "O" scheduled in transition "T" may consume data that is produced by other operations also scheduled in "T". A computation "x=a+b+c" might be scheduled in a single transition. In one design, a first addition determines "a+b", and a second addition adds "c" to that sum to produce the output "x".
When two or more operations with a data dependency are scheduled in the same transition, the operations are "chained". As a practical matter, chaining cannot be avoided in HLS systems because there are always data and control dependencies. In a data dependency, the result of one operation is directly consumed by another. In a control dependency, needing to do one operation depends on the result of another, as in:
if (a < b) begin
x=y +
end
In this case, the addition operation need not be done unless "a" is less than "b". So doing the addition at all depends on the result of the comparison.
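The timing requirement on chained operations can be sketched as a longest-path check over the operations scheduled in one transition. The delays, the clock period, and the operation names below are hypothetical. A control dependency is treated like a data dependency for timing purposes, which is a simplifying assumption.

```python
def chain_delay(delays, deps):
    """Longest accumulated delay over operations chained in one transition.

    delays: dict mapping operation -> its own delay.
    deps:   dict mapping operation -> operations whose results it needs.
    """
    memo = {}

    def finish(op):
        # An operation finishes after its slowest predecessor plus its own delay.
        if op not in memo:
            latest_input = max((finish(p) for p in deps.get(op, ())), default=0.0)
            memo[op] = latest_input + delays[op]
        return memo[op]

    return max(finish(op) for op in delays)

# x = a + b + c scheduled in a single transition: add2 consumes add1's sum.
delays = {"add1": 2.0, "add2": 2.0}      # hypothetical delays, in ns
deps = {"add2": ["add1"]}

total = chain_delay(delays, deps)
fits_in_cycle = total <= 5.0             # hypothetical 5 ns clock period
```

With these numbers the chain takes 4.0 ns and fits in the cycle. A third chained adder of the same delay would not.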
An HLS system constructs a map of operations and data values to hardware resources, together with the timing constructs. The allocation task constructs the resources and the mapping of the operations and data to the resources. Allocation comprises operator and register allocation. Operator allocation maps the operations to combinational hardware, and register allocation maps the data values to registers. A first step includes deciding on what hardware to allocate, e.g., how many adders and registers, and of what bit widths. A second step assigns particular operations and data values to identified resources after the resources have been constructed.
Allocation can be done before, during, or after scheduling. A suitable candidate set of combinational resources can be constructed before scheduling. These resources should represent an adequate set to implement the entire design. It is possible to do operation assignment during scheduling, so a realistic assessment of the timing of a chain of operations can be made when a chain is committed. This will practically guarantee that the corresponding combinational concatenation of resources will run at the desired speed.
Briefly, a design-timing-determination method embodiment of the present invention for an electronic design automation system approximates the timing of a whole design quickly and on-the-fly. This allows a scheduling system to construct operation schedules that are ultimately realizable. A timing analysis is applied each time an individual operation is scheduled, and may be called many times to get a single operation scheduled. A graph representing combinational logic is partitioned into a collection of logic trees with nodes that represent gates and terminals, and arcs that represent connections. A compacted model of each logic tree is constructed by replacing it with an equivalent tree having no interior nodes. The timing of the original circuit is analyzed along each path from the leaves to the root. A propagation delay for each path is determined, and is annotated onto the corresponding arc of the simplified tree. Any dependency of the propagation delay in the original circuit on the slew rate of its input signals is annotated onto the corresponding leaf of the simplified tree. Capacitive loads can also be copied from the logic-tree leaves and annotated on the simplified-tree leaves. Any load/delay response curves of the output gate at the apex of the logic tree are copied to the root of the simplified tree. The entire delay calculation is thereby collapsed into a simple edge-weighted longest-path traversal, which is much simpler than computing the slew rates and delays for each cell in a circuit.
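The compaction idea can be illustrated with a simplified sketch. It assumes each gate has one fixed delay and leaves out the slew-rate dependencies, capacitive loads, and load/delay response curves that the method also annotates. The tree, the gate names, and the delay values are hypothetical, and the code is not the claimed implementation.

```python
def compact(tree, gate_delay, root):
    """Replace a logic tree by leaf-to-root arcs carrying path delays.

    tree:       dict mapping gate -> list of children (gates or leaf terminals).
    gate_delay: dict mapping gate -> its propagation delay.
    Returns a dict mapping each leaf -> total delay of its path to the root.
    """
    arcs = {}

    def walk(node, accumulated):
        if node not in tree:                 # a leaf terminal
            arcs[node] = accumulated
            return
        for child in tree[node]:
            walk(child, accumulated + gate_delay[node])

    walk(root, 0)
    return arcs

def root_arrival(arcs, leaf_arrival):
    """Edge-weighted longest path on the compacted tree."""
    return max(leaf_arrival[leaf] + delay for leaf, delay in arcs.items())

def full_arrival(tree, gate_delay, node, leaf_arrival):
    """Reference: gate-by-gate traversal of the original tree."""
    if node not in tree:
        return leaf_arrival[node]
    latest = max(full_arrival(tree, gate_delay, c, leaf_arrival) for c in tree[node])
    return latest + gate_delay[node]

tree = {"g3": ["g1", "g2"], "g1": ["a", "b"], "g2": ["c", "d"]}
gate_delay = {"g1": 3, "g2": 1, "g3": 2}
leaf_arrival = {"a": 0, "b": 1, "c": 4, "d": 0}

arcs = compact(tree, gate_delay, "g3")
compacted_result = root_arrival(arcs, leaf_arrival)
reference_result = full_arrival(tree, gate_delay, "g3", leaf_arrival)
```

The compacted tree is built once, after which each timing query is a single pass over the leaves. In this example both traversals give the same arrival time at the root.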