In the field of LSI and VLSI circuits, increased circuit complexity and increased density of the devices on a single chip have resulted in an ever-increasing need for computer-aided circuit simulation capable of carrying out the necessary circuit simulation in a reasonable period of time. Computer programs such as SPICE and ASTAP have been employed for such circuit simulation. However, these have proven to be less and less adequate as the circuits on a single chip become denser and more complex.
A variety of algorithms for the efficient and accurate simulation of complex circuits have been proposed, for example: Nagel, "SPICE2: A Computer Program to Simulate Semiconductor Circuits," University of California, Berkeley, Memo No. ERL-M520, May 1975; Weeks, "Algorithms for ASTAP-A Network Analysis Program," IEEE Transactions on Circuit Theory, Volume CT-20, pp. 628-634, November 1973; Saleh, Kleckner and Newton, "Iterated Timing Analysis and SPICE," Proceedings of the IEEE International Conference on Computer Aided Design, Santa Clara, CA, pp. 139-140, September 1983; White and Sangiovanni-Vincentelli, "Relax 2.1-A Waveform Relaxation Based Circuit Simulation Program," Proceedings, International Custom Integrated Circuits Conference, Rochester, NY, pp. 232-236, June 1984; and Sakallah and Director, "An Activity Directed Circuit Simulation Algorithm," Proceedings IEEE Conference on Circuits and Computers, pp. 1032-1035, October 1980.
More recently the implementation of some of these algorithms has been suggested in a multi-processing environment. Circuit simulation algorithms based on iterative methods, such as Gauss-Jacobi, however, while easy to parallelize, usually have slow convergence rates unless some of the parallelization utility is sacrificed. Iterative methods, moreover, are not appropriate for tightly coupled circuits. The RELAX simulator lumps together tightly coupled nodes and solves the sub-networks using a direct method. Such partitioning algorithms also have to deal with conflicting constraints when convergence properties, processor number and load balancing among processors need to be taken into account.
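The trade-off described above can be illustrated with a minimal Gauss-Jacobi sketch. It is not taken from any of the cited simulators; the function names, the two 2-by-2 example systems, and the tolerance are assumptions chosen for illustration. Each unknown is updated from the previous iterate only, so the updates are independent of one another (and therefore easy to distribute), but the iteration count grows sharply as the off-diagonal coupling approaches the diagonal in magnitude.

```python
def jacobi_step(a, b, x):
    """One Gauss-Jacobi sweep: every new x[i] uses only the previous iterate."""
    n = len(b)
    x_new = [0.0] * n
    for i in range(n):  # each i is independent, so this loop could be split across processors
        off_diag = sum(a[i][j] * x[j] for j in range(n) if j != i)
        x_new[i] = (b[i] - off_diag) / a[i][i]
    return x_new


def jacobi_solve(a, b, tol=1e-10, max_iter=10_000):
    """Iterate until successive iterates differ by less than tol (hypothetical stopping rule)."""
    x = [0.0] * len(b)
    for iteration in range(1, max_iter + 1):
        x_next = jacobi_step(a, b, x)
        if max(abs(p - q) for p, q in zip(x_next, x)) < tol:
            return x_next, iteration
        x = x_next
    return x, max_iter


# Hypothetical systems: the same right-hand side with weak and with tight coupling.
weak = [[4.0, 1.0], [1.0, 4.0]]    # off-diagonal terms small relative to the diagonal
tight = [[4.0, 3.9], [3.9, 4.0]]   # off-diagonal terms nearly as large as the diagonal
rhs = [1.0, 1.0]

x_weak, iters_weak = jacobi_solve(weak, rhs)
x_tight, iters_tight = jacobi_solve(tight, rhs)
print(iters_weak, iters_tight)
```

In this sketch the tightly coupled system needs far more sweeps than the weakly coupled one to reach the same tolerance, which is the behavior that motivates lumping tightly coupled nodes together and solving them with a direct method.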
It has also been suggested in the past that parallel processing, using a plurality of computers or computers having a plurality of central processing units which operate in parallel with each other, with the plurality of processors closely coupled to a shared memory, is a solution to the increasingly complex problem of circuit simulation. One such solution is proposed in Jacob, Newton, and Pederson, "Direct-Method Circuit Simulation Using Multiprocessors," Proceedings of the International Symposium on Circuits and Systems, May 1986. Another such solution is proposed in White, "Parallelizing Circuit Simulation--A Combined Algorithmic and Specialized Hardware Approach," Proceedings of the ICCD, Rye, New York, 1985.
Each of these references, however, recognizes a significant drawback to the utilization of parallel processing for circuit simulation. Since the processors are sharing the same memory, and the processors are in parallel computing values for different circuit elements throughout the LSI or VLSI circuit, or some defined sub-portion of such circuit, the processors will quite often be solving for values at the same node within the circuit at the same time. Since the processing algorithms also typically involve iterative processes, which update values for a given node based on computations relating to circuit elements connected to the node, and a memory location within the computer is used to store the values at any given time for a particular node, the problem arises that the access to the shared memory must be strictly controlled. Otherwise, more than one processor may write information into a memory location corresponding to a particular node, at the same time another processor is attempting to read previously stored information related to that node or write new information relating to that node, based on its computations with respect to a different circuit element attached to the node.
The solution in the past has been to require special locking mechanisms within the programs controlling the multiple processors in order to prevent such occurrences. This in effect lengthens the total CPU processing time required, in that much time is spent waiting for the access to the particular memory location to be opened to another of the processing units operating in parallel, which desires to read or write information into the particular memory location. In fact, the proposed solution in the past has been to block write access to the entire matrix when one CPU is writing to the matrix.
The Jacob et al. paper, noted above, investigates the parallel implementation of a direct method circuit simulator using SPLICE, and using up to eight processors of a shared memory machine. An efficiency close to forty-five percent was reported. Much of the failure to achieve a higher percentage efficiency is the result of synchronization of the multiple processors in their access to common shared memory locations.
Parallelization employing such simulation programs as SPICE uses a list of circuit components and parameters, which may be constants or variables (for example, in the case of MOSFET transistors, geometric parameters, currents, charges, capacitances, etc.). These are used to calculate the individual components or terms, the sum of which equals the value for each entry at a position in a matrix. The matrix is used to solve a number of simultaneous equations for the variables, for example, voltages, at each node within the circuit. The matrix is typically an X by X matrix, with X approximately equal to the number of nodes in the circuit. Circuit elements are typically connected to at least two nodes, and may often be connected to more than two nodes. The parallelization techniques employed in the past therefore suffer from the drawback that the summation of the terms which defines the value at the matrix location for a given node may be influenced by more than one circuit element, so that the parallel processors may be computing the effect of two or more circuit elements on the same matrix node location at the same time. This results in the need for some technique, for example, interlocks, to prevent simultaneous writing or reading of data by more than one processor into or from the memory location corresponding to the matrix location.
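The conflict described above can be seen in a small serial sketch of the matrix load phase. It assumes a simplified nodal-analysis "stamp" for two-terminal conductances with node 0 as ground; the element list, the helper name stamp_conductance, and the bookkeeping dictionary are hypothetical and are not drawn from SPICE itself. The sketch records which elements contribute to each matrix entry, so that entries with more than one contributor (the ones that would need an interlock if elements were processed in parallel) are visible.

```python
from collections import defaultdict


def stamp_conductance(node_a, node_b, g):
    """Return (row, col, value) terms for a conductance g between two nodes; node 0 is ground."""
    terms = []
    for r, c, v in ((node_a, node_a, g), (node_b, node_b, g),
                    (node_a, node_b, -g), (node_b, node_a, -g)):
        if r != 0 and c != 0:  # the ground row and column are not part of the matrix
            terms.append((r - 1, c - 1, v))
    return terms


# Hypothetical circuit: (node_a, node_b, conductance) for each element.
elements = [(1, 2, 1.0), (2, 0, 0.5), (2, 3, 0.25), (3, 0, 0.125)]
size = 3  # number of non-ground nodes

matrix = [[0.0] * size for _ in range(size)]
writers = defaultdict(list)  # (row, col) -> indices of the elements that add a term there

for index, (node_a, node_b, g) in enumerate(elements):
    for r, c, v in stamp_conductance(node_a, node_b, g):
        matrix[r][c] += v  # read-modify-write: unsafe without protection if done in parallel
        writers[(r, c)].append(index)

# Entries that receive terms from more than one element.
shared = {entry: idx for entry, idx in writers.items() if len(idx) > 1}
print(shared)
```

Here the diagonal entry for node 2 accumulates terms from three different elements, so three processors working on those elements concurrently would all contend for the same memory location.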
Once the matrix is loaded, it is known in the art, using SPICE, to solve the matrix using sparse matrix LU decomposition. Parallelization of the matrix solution phase is also important, and presents unique problems. For larger circuits, the CPU time needed for the matrix solution phase will dominate over that needed for the matrix load phase. Efficient parallelization schemes are known for full matrices, as is reported in Thomas, "Using the Butterfly to Solve Simultaneous Linear Equations", BBN Laboratories Memorandum, March 1985. However, sparse matrices are more difficult to decompose efficiently in parallel. The LU decomposition algorithm has a sequential dependency, and the amount of concurrent work which can be done at each step, using SPICE, in a sparse matrix is small. Algorithms detecting the maximum parallelism at each step have been proposed for vectorized circuit simulation. Yamamoto and Takahashi, "Vectorized LU Decomposition Algorithms for Large Scale Circuit Simulation", IEEE Transactions on Computer-Aided Design, Vol. CAD-4, No. 3, pp. 232-239, July 1985. Algorithms based upon a pivot dependency graph and task queues have been proposed. Jacob et al., supra. The overhead associated with task queues makes the efficiency of these algorithms questionable.
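The sequential dependency and the limited concurrency in a sparse matrix can be illustrated with a minimal in-place LU sketch. It assumes a dense list-of-lists storage, no pivoting, and a hypothetical tridiagonal example; a production simulator would use sparse storage and a pivot ordering, neither of which is shown. At each pivot step the sketch counts how many rows actually need updating, which is the amount of work that could be done concurrently at that step.

```python
def lu_in_place(a):
    """Doolittle LU without pivoting; L (unit diagonal) and U overwrite a.

    Returns the number of rows updated at each pivot step.
    """
    n = len(a)
    rows_per_step = []
    for k in range(n):  # sequential: step k needs row k to be final first
        updated = 0
        for i in range(k + 1, n):  # rows below the pivot are independent of each other
            if a[i][k] == 0.0:
                continue  # structurally zero entry: no work for this row
            a[i][k] /= a[k][k]  # multiplier, stored in the L part
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
            updated += 1
        rows_per_step.append(updated)
    return rows_per_step


# Hypothetical sparse (tridiagonal) matrix.
a = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]

rows_per_step = lu_in_place(a)
print(rows_per_step)
```

For this matrix only one row is updated at each pivot step, whereas a full 4-by-4 matrix would offer three, two, and one rows at successive steps; this is the sense in which the concurrent work available per step is small for sparse matrices.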
Recognizing the need for an improved circuit simulation apparatus and method, it is a general object of the present invention to provide a circuit simulation apparatus and method for simulating LSI and VLSI circuits which eliminates the costly synchronization requirements of the prior art and, in addition, more efficiently implements the LU decomposition in parallel using multiple processors.
A feature of the present invention relates to providing memory locations corresponding both to the matrix location and to a specific one of a plurality of terms which determine the ultimate value for the entry at the matrix location, and summing the terms from the plurality of memory locations corresponding to a matrix entry to define a single value for the matrix entry.
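One way this feature might be laid out is sketched below; it is an illustrative assumption, not the implementation of the preferred embodiment. Each (element, term) pair is given its own memory slot in a preprocessing pass, so that during the load phase no two workers ever write the same location and no locks are used; the slots feeding each matrix entry are summed afterwards. The conductance stamp, the element list, the helper names, and the use of a thread pool are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def stamp_positions(node_a, node_b):
    """Return (row, col, sign) for each term of a two-terminal element; node 0 is ground."""
    positions = []
    for r, c, sign in ((node_a, node_a, 1.0), (node_b, node_b, 1.0),
                       (node_a, node_b, -1.0), (node_b, node_a, -1.0)):
        if r != 0 and c != 0:
            positions.append((r - 1, c - 1, sign))
    return positions


# Hypothetical circuit: (node_a, node_b, conductance) for each element.
elements = [(1, 2, 1.0), (2, 0, 0.5), (2, 3, 0.25), (3, 0, 0.125)]
size = 3

# Preprocessing: one private slot per (element, term), grouped by matrix entry.
slot_of = {}          # (element index, term index) -> slot index
slots_for_entry = {}  # (row, col) -> list of slot indices feeding that entry
for index, (node_a, node_b, _) in enumerate(elements):
    for term, (r, c, _) in enumerate(stamp_positions(node_a, node_b)):
        slot = len(slot_of)
        slot_of[(index, term)] = slot
        slots_for_entry.setdefault((r, c), []).append(slot)

slots = [0.0] * len(slot_of)


def load_element(index):
    """Compute this element's terms and store each in its own slot (no lock needed)."""
    node_a, node_b, g = elements[index]
    for term, (_, _, sign) in enumerate(stamp_positions(node_a, node_b)):
        slots[slot_of[(index, term)]] = sign * g  # only this element ever writes this slot


with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(load_element, range(len(elements))))

# Summation pass: each matrix entry is the sum of the slots assigned to it.
matrix = [[0.0] * size for _ in range(size)]
for (r, c), slot_indices in slots_for_entry.items():
    matrix[r][c] = sum(slots[s] for s in slot_indices)
print(matrix)
```

The cost of this arrangement is extra memory (one location per term rather than one per matrix entry) and a separate summation pass, in exchange for a load phase that needs no synchronization on individual matrix locations.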
A further feature of the present invention relates to synchronizing the parallel processors during matrix decomposition by assigning a single processor to a given row within the matrix and setting a flag when the values in a row extending from a diagonal matrix element are ready for use.
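A minimal sketch of flag-based row synchronization follows. It is an assumption about how such a scheme could look, not the method of the preferred embodiment: it uses one thread per row (rather than a fixed pool of processors each owning several rows), dense storage, no pivoting, and Python's threading.Event as the per-row flag. Each worker eliminates its own row against the pivot rows above it, waiting on a pivot row's flag before reading it, and sets its own flag once its row is final.

```python
import threading


def parallel_lu(a):
    """In-place LU without pivoting, one worker per row, synchronized by per-row flags."""
    n = len(a)
    ready = [threading.Event() for _ in range(n)]  # ready[k] set once row k is final

    def process_row(i):
        for k in range(i):
            ready[k].wait()  # block until pivot row k may be read
            if a[i][k] != 0.0:
                a[i][k] /= a[k][k]  # multiplier, stored in the L part
                for j in range(k + 1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        ready[i].set()  # row i is final from the diagonal element rightward

    threads = [threading.Thread(target=process_row, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a


# Hypothetical 3-by-3 example with nonzero pivots.
a = parallel_lu([[4.0, 2.0, 1.0],
                 [2.0, 5.0, 3.0],
                 [1.0, 3.0, 6.0]])
print(a)
```

Because only the owner of a row ever writes to it, and other workers read a row only after its flag is set, no lock on the matrix is required. In CPython the global interpreter lock prevents a real speedup from this code; the sketch is meant only to show the ordering constraints.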
The features of the present invention discussed above are not intended to be all-inclusive, but rather are representative of the features of the present invention, presented in order to enable those skilled in the art to better appreciate the advance made over the art and to understand the invention. Other features will become more apparent upon reference to the detailed description of the preferred embodiment which is contained below and with reference to the figures of the drawing.