1. Field of the Invention
The present invention generally relates to allocation of computational resources, and more particularly to a method of partitioning an integrated circuit design to enable and optimize parallel simulation of the design.
2. Description of the Related Art
Integrated circuits are used for a wide variety of electronic applications, from simple devices such as wristwatches, to the most complex computer systems. A microelectronic integrated circuit (IC) chip can generally be thought of as a collection of logic cells with electrical interconnections between the cells, formed on a semiconductor substrate (e.g., silicon). An IC may include a very large number of cells and require complicated connections between the cells. A cell is a group of one or more circuit elements such as transistors, capacitors, resistors, inductors, and other basic circuit elements combined to perform a logic function. Cell types include, for example, core cells, scan cells, input/output (I/O) cells, and memory (storage) cells. Each of the cells of an IC may have one or more pins, each of which in turn may be connected to one or more other pins of the IC by wires. The wires connecting the pins of the IC are also formed on the surface of the chip. For more complex designs, there are typically at least four distinct layers of conducting media available for routing, such as a polysilicon layer and three metal layers (metal-1, metal-2, and metal-3).
An IC chip is created by first conceiving the logical circuit description, and then converting that logical description into a physical description, or geometric layout. This process is usually carried out using a netlist, which is a record of all of the nets, or interconnections, between the cell pins, including information about the various components such as transistors, resistors and capacitors. A layout typically consists of a set of planar geometric shapes in several layers. The layout is then checked to ensure that it meets all of the design requirements, particularly timing requirements. The result is a set of design files known as an intermediate form that describes the layout. The design files are then run through a dataprep process that is used to produce patterns called masks by an optical or electron beam pattern generator. During fabrication, these masks are used to etch or deposit features in a silicon wafer in a sequence of photolithographic steps using a complex lens system that shrinks the mask image. The process of converting the specifications of an electrical circuit into such a layout is called the physical design.
Cell placement in integrated circuit design involves a determination of where particular cells should optimally (or near-optimally) be located on a surface of an integrated circuit device. Due to the large number of components and the details required by the fabrication process for very large scale integrated (VLSI) devices, physical design is not practical without the aid of computers. As a result, most phases of physical design extensively use computer-aided design (CAD) tools, and many phases have already been partially or fully automated. Automation of the physical design process has increased the level of integration, reduced turn-around time and enhanced chip performance. Several different programming languages have been created for electronic design automation (EDA), including Verilog, VHDL and TDML. A typical EDA system receives one or more high level behavioral descriptions of an IC device, and translates this high level design language description into netlists of various levels of abstraction.
It is important to ensure that an integrated circuit design will work properly before proceeding with fabrication preparation. A variety of tests can be performed to evaluate the design, but simulation remains the dominant strategy for functionally verifying high-end microprocessors. A design-under-test is driven by vectors of inputs, and the states encountered while walking through the sequence are checked for properties of correctness. This process can be (and often is) performed by software simulation tools; however, such programs cannot compete with the cycle times offered by hardware-accelerated simulation. Hardware accelerators are custom-built machines that can increase simulation performance by several orders of magnitude, reducing otherwise month-long software simulations to days or even hours. This improvement is due in part to specialized logic processors and instruction memories, but also to the parallelism inherent in the hardware realization of logic designs. The cost of building and maintaining a fleet of hardware accelerators is typically on the order of millions of dollars, and thus represents a significant portion of the verification budget.
As models approach the billion-transistor mark, the ability of accelerator capacity to scale with design size is critical to the success of microprocessor verification. To exploit locality, large netlists must be decomposed into smaller groups that span several chips, boards, or systems. Likewise, the evaluation of each gate must be routed to its downstream successors without incurring excessive delay along any one path. These compilation concerns echo many of the problems faced by physical synthesis, a broad subject that concerns the placement and routing of standard cells and macros on silicon to concurrently optimize timing, power, area, etc. For instance, the allocation of gate primitives to discrete resources of the layout (a process known as partitioning) has a rich history in the context of cell placement. The area of a physical chip design is recursively divided into many sub-regions, and gates are grouped and split among these regions to optimize an objective function such as half-perimeter wire-length. The most popular algorithms in the literature are variants of the Fiduccia-Mattheyses technique, which operates on a hypergraph representation of the netlist. A hypergraph is a generalization of a graph, having vertices which can represent gates or cells, and having hyperedges representing interconnections between the vertices; unlike an ordinary edge, a hyperedge may connect more than two vertices. Hypergraph partitioners used in cell placement typically attempt to minimize the number (or weighted number) of hyperedges cut by the partitionment. A prevalent extension of this algorithm is the Multi-Level Fiduccia-Mattheyses (MLFM) algorithm, which improves both solution quality and runtime of partitioning large hypergraphs by clustering tightly connected components and partitioning the resulting smaller hypergraph using gain-based cell movement and repeated bisection.
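By way of illustration only, the cut objective and the gain-based cell movement mentioned above can be sketched as follows. This is a minimal sketch and not a complete Fiduccia-Mattheyses implementation: it omits gain buckets, cell locking, balance constraints, and clustering. The names `nets`, `side`, `cut_size`, and `move_gain`, and the four-net example, are assumptions introduced here for illustration.

```python
from collections import Counter

def cut_size(nets, side):
    """Count hyperedges whose vertices lie on more than one side."""
    return sum(1 for pins in nets.values()
               if len({side[v] for v in pins}) > 1)

def move_gain(nets, side, v):
    """Reduction in cut size if vertex v moves to the other side.

    Assumes a bisection (two sides) and that each vertex appears
    at most once per hyperedge.
    """
    src = side[v]
    gain = 0
    for pins in nets.values():
        if v not in pins:
            continue
        count = Counter(side[u] for u in pins)
        from_count = count[src]              # vertices on v's side (incl. v)
        to_count = len(pins) - from_count    # vertices on the other side
        if from_count == 1:
            gain += 1   # v is alone on its side: moving it uncuts the net
        if to_count == 0:
            gain -= 1   # net is entirely on v's side: moving v cuts it
    return gain

# Hypothetical example: four hyperedges over vertices a..d.
nets = {
    "n1": ["a", "b"],
    "n2": ["b", "c"],
    "n3": ["c", "d"],
    "n4": ["a", "b", "c"],
}
side = {"a": 0, "b": 0, "c": 1, "d": 1}

cut_before = cut_size(nets, side)      # n2 and n4 are cut
gain_c = move_gain(nets, side, "c")    # predicted improvement for moving c
side_after = dict(side, c=0)
cut_after = cut_size(nets, side_after)
```

In this example the predicted gain for moving vertex c equals the actual change in cut size, which is the property that gain-based movement relies on when selecting the next cell to move.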
The approach taken by prior art placement partitioners does not, however, address the specific needs of partitioning for hardware-accelerated functional verification. A key distinction in the realm of simulation is that the partitioned logic must ultimately be scheduled; hence, the objective of a verification partitioner is not necessarily to reduce cut, but rather to minimize final simulation depth. The output of each gate depends on the result of its inputs, and hence its evaluation must be deferred until after the evaluation of its sources. Classical partitioning ignores such temporal dependencies, and may divide the netlist into a completely unparallelizable quotient. FIG. 1 illustrates an example of how the solution with the best cut for placement can be the worst cut for scheduling. A simplified circuit design 2 is shown which includes a set of twelve early vertices (e.g., gates) 4a, a set of twelve late vertices 4b, and two bottleneck vertices 4c and 4d. The best partitionment 6 for placement (only one cut between the bottleneck vertices) is the worst partitionment in terms of parallelism and is entirely unsuitable for simulation. An optimum partitionment 8 for simulation bisects the sets of early and late vertices, and has a considerably higher cut count (four).
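The tension between cut and simulation depth can be sketched with a small model. The circuit below is a hypothetical, scaled-down analogue of the situation described above (four early vertices, two bottleneck vertices, four late vertices) and is not a reproduction of FIG. 1. The machine model is likewise an assumption: two processors, one gate evaluated per processor per step, no communication delay, and a simple greedy schedule. The function names are introduced here for illustration.

```python
def simulation_depth(preds, proc_of):
    """Steps needed by a greedy schedule, one gate per processor per step.

    A gate may be evaluated only in a step strictly later than the
    steps in which all of its sources were evaluated.
    """
    done_at = {}
    step = 0
    while len(done_at) < len(proc_of):
        step += 1
        busy = set()
        for gate in sorted(proc_of):
            p = proc_of[gate]
            if gate in done_at or p in busy:
                continue
            if all(done_at.get(s, step) < step for s in preds.get(gate, ())):
                done_at[gate] = step
                busy.add(p)
        if not busy:
            raise ValueError("no gate is ready; the netlist has a cycle")
    return step

def nets_cut(preds, proc_of):
    """Count driver nets with at least one sink on another processor."""
    fanout = {}
    for gate, sources in preds.items():
        for s in sources:
            fanout.setdefault(s, set()).add(gate)
    return sum(1 for drv, sinks in fanout.items()
               if any(proc_of[t] != proc_of[drv] for t in sinks))

early = ["e1", "e2", "e3", "e4"]
late = ["l1", "l2", "l3", "l4"]
# Early vertices feed bottleneck c, c feeds bottleneck d, d feeds late vertices.
preds = {"c": early, "d": ["c"], **{l: ["d"] for l in late}}

# Partition A: cut only between the two bottleneck vertices.
part_a = {**{e: 0 for e in early}, "c": 0, "d": 1, **{l: 1 for l in late}}
# Partition B: early and late sets are each bisected across the processors.
part_b = {"e1": 0, "e2": 0, "e3": 1, "e4": 1, "c": 0, "d": 1,
          "l1": 0, "l2": 0, "l3": 1, "l4": 1}

cut_a, depth_a = nets_cut(preds, part_a), simulation_depth(preds, part_a)
cut_b, depth_b = nets_cut(preds, part_b), simulation_depth(preds, part_b)
```

Under these assumptions, partition A has the smaller cut but leaves one processor idle while the other works, so it needs more steps than partition B, which cuts more nets but keeps both processors busy during the early and late phases.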
Classical models and methods for partitioning are incapable of distinguishing the temporal distribution of vertices. In addition, minimizing the raw number of nets cut neglects the directionality of the connections, whereas a proper formulation must consider the connectivity limits mandated by the machine architecture. Furthermore, the fundamental building blocks of the accelerator (e.g., its memories, arithmetic logic units, etc.) are often not distributed homogeneously. Specific portions of the netlist, especially arrays, may be restricted to a subset of processing units. These restrictions are further compounded by complex constraints that limit the number of cycles and bits used collectively by those entities. Finally, existing partitioning algorithms fail to account for changes in problem formulation that can occur at intermediate levels; for instance, arrays may be assigned to different memory classes as a result of the partial assignment, and the criticality of edges may also change depending on where in the topology nets are cut. Because existing methods for partitioning fail in each of these cases, they threaten to undermine the feasibility and efficacy of compiling large models for hardware acceleration.
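The point about directionality can be illustrated with a short sketch that counts cut signals separately as outgoing and incoming for each processing unit, rather than as a single undirected total. The limits `max_out` and `max_in` are hypothetical stand-ins for whatever connectivity limits a given machine architecture imposes; the function names and the example netlist are likewise assumptions made for illustration.

```python
from collections import Counter

def directional_cut(preds, proc_of):
    """Per-processor counts of outgoing and incoming cut signals.

    A signal whose driver and some sink sit on different processors is
    counted once as outgoing for the driver's processor and once as
    incoming for each distinct remote processor that consumes it.
    """
    remote = {}
    for gate, sources in preds.items():
        for s in sources:
            if proc_of[gate] != proc_of[s]:
                remote.setdefault(s, set()).add(proc_of[gate])
    outgoing, incoming = Counter(), Counter()
    for drv, procs in remote.items():
        outgoing[proc_of[drv]] += 1
        for q in procs:
            incoming[q] += 1
    return outgoing, incoming

def violations(outgoing, incoming, max_out, max_in):
    """Processors exceeding the (hypothetical) directional limits."""
    procs = set(outgoing) | set(incoming)
    return sorted(p for p in procs
                  if outgoing[p] > max_out or incoming[p] > max_in)

# Hypothetical netlist: gate -> list of source gates.
preds = {"x": ["a"], "y": ["a"], "z": ["b"], "w": ["c"]}
proc_of = {"a": 0, "c": 0, "b": 1, "x": 1, "w": 1, "y": 2, "z": 0}

outgoing, incoming = directional_cut(preds, proc_of)
over_limit = violations(outgoing, incoming, max_out=1, max_in=2)
```

Here three nets are cut in total, a figure that by itself does not reveal that processor 0 drives two of them; the directional counts expose the imbalance, and only processor 0 exceeds the assumed outgoing limit.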
In light of the foregoing, it would be desirable to devise an improved method of partitioning an integrated circuit design for hardware-accelerated simulation. It would be further advantageous if the method could take into consideration temporal and directional dependencies in the design.