This invention relates to the simulation art, and more particularly to the art of distributed discrete event simulation.
Computer simulation has become very important in recent years because of the many applications where simulation of systems is highly beneficial. One such application is the use of simulation in the design of complex systems. These may be electronic systems such as a telecommunications switching network, robot-based flexible manufacturing systems, process control systems, health care delivery systems, transportation systems, and the like. Design verification through simulation plays an important role in speeding up the design and ensuring that it conforms to the specification. Another application is the use of simulation in analyzing, and tracking down, faults appearing in operating systems. Still another application is optimizing the operation of existing systems through repeated simulations, e.g., the operation of a manufacturing facility, the operation of the telecommunications network, scheduling and dispatching, etc. Yet another application is the use of simulation to predict the operation of systems which for various reasons cannot be tested (e.g., response to catastrophe).
Simulations can be classified into three types: continuous time, discrete time, and discrete event. Discrete event simulation means simulation of a system in which phenomena of interest change value or state at discrete moments of time, and no changes occur except in response to an applied stimulus. For example, a bus traveling a prescribed route defines a discrete event system in which the number of passengers can change only when the bus arrives at a bus stop along the route.
Of the three simulation classes, from a computational standpoint it appears that discrete event simulation is potentially the least burdensome approach because simulation of time when nothing happens is dispensed with. Of course, synchronization of the event simulations must be considered when parallelism is employed. Most often, a discrete event simulator progresses by operating on an event list. An event at the top of the list is processed, possibly adding events to the list in the course of processing, and the simulation time is advanced. Thereafter, the processed event at the top of the list is removed. This technique limits the speed of simulation to the rate at which a single processor is able to consider the events one at a time. In a parallel scheme, many processors are simultaneously engaged in the task, creating a potential for speeding up the simulation. Although techniques for performing event list manipulation and event simulation in parallel have been suggested, large scale performance improvements are achieved only by eliminating the event list in its traditional form. This is accomplished by distributed simulations.
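By way of illustration only, the conventional event-list technique described above can be sketched as follows. The function name `run_event_list`, the handler table, and the bus-style event names are assumptions introduced for this sketch and do not appear in the cited art. The sketch removes an event from the list just before processing it rather than just after, which yields the same processing order.

```python
import heapq

def run_event_list(initial_events, handlers, end_time):
    """Process events one at a time, in time-stamp order, from one event list.

    initial_events: iterable of (time, name) pairs.
    handlers: maps an event name to a function of the current time that
              returns a list of (delay, new_name) pairs to schedule.
    """
    event_list = []
    seq = 0  # tie-breaker so events with equal times keep insertion order
    for time, name in initial_events:
        heapq.heappush(event_list, (time, seq, name))
        seq += 1

    def no_handler(_time):
        return []

    clock = 0.0
    trace = []
    while event_list and event_list[0][0] <= end_time:
        time, _, name = heapq.heappop(event_list)  # event at the top of the list
        clock = time                               # advance the simulation time
        trace.append((clock, name))
        # Processing an event may add further events to the list.
        for delay, new_name in handlers.get(name, no_handler)(clock):
            heapq.heappush(event_list, (clock + delay, seq, new_name))
            seq += 1
    return clock, trace

# Hypothetical bus route: the bus dwells 5 time units, then travels 10.
bus_handlers = {
    "arrive": lambda t: [(5.0, "depart")],
    "depart": lambda t: [(10.0, "arrive")],
}
final_clock, bus_trace = run_event_list([(0.0, "arrive")], bus_handlers, 20.0)
```

Because the loop removes and processes exactly one event per iteration, the simulation can proceed no faster than a single processor can handle events one at a time, which is the limitation noted above.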
In a distributed simulation, a number of parallel processors form a simulation multicomputer network, and it is the entire network that is devoted to a simulation task. More specifically, each processor within the network is devoted to a specific portion of the system that is simulated; it maintains its own event list and communicates event occurrences to appropriate neighbor processors. Stated conversely, if one views a simulated system as a network of interacting subsystems, distributed simulation maps each subsystem onto a processor of the multicomputer network.
Although distributed simulation provides parallelism which has the potential for improving the simulation speed, allocation and synchronization of work among the processors is a major concern which may impede the realization of the improvement goals. One well known approach for distributed simulation has been proposed by Chandy and Misra in "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs," IEEE Transactions on Software Engineering, Vol. SE-5, No. 5, September 1979, pp. 440-452, and by Chandy, Holmes and Misra in "Distributed Simulation of Networks," Computer Networks, Vol. 3, No. 1, February 1979, pp. 105-113. In this approach, they recognize that physical systems to be simulated are composed of independent but interacting entities, and that those entities should be mapped onto a topologically equivalent system of logical nodes. Interaction between nodes is accomplished by the exchange of time-stamped messages which include the desired message information and identify the simulation time of the sending node. In accordance with the Chandy-Holmes-Misra approach, the nodes interact only via messages. There are no global shared variables, each node is activated only in response to a message, each node maintains its own clock, and finally, the time-stamps of the messages generated by each node are non-decreasing (in time). In this arrangement, each of the nodes works independently to process the events assigned to it in the correct simulated order. Thus, independent events can be simulated in parallel, within different nodes, even if they occur at different simulated times.
The time stamping is required, of course, to maintain causality so that in a message-receiving node an event that is scheduled for time T is not simulated when other incoming messages can still arrive with a time-stamp of less than T. Because of this, when a particular node is able to receive input from two sender nodes, it cannot safely simulate an event until it has received a message from both sender nodes, since a message with an earlier time-stamp could still arrive from the sender that has not yet been heard from. Waiting to receive a message from all inputs slows the simulation process down substantially and can easily result in a deadlock cycle where each node waits for a previous node, which amounts to the situation of a node waiting for itself.
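The waiting rule described above may be illustrated by the following minimal sketch. It assumes one first-in, first-out queue per sender and non-decreasing time-stamps on each channel; the class name `LogicalNode` and its methods are hypothetical and are not taken from the cited references.

```python
from collections import deque

class LogicalNode:
    """A receiving node with one FIFO input channel per sender node."""

    def __init__(self, senders):
        self.inputs = {s: deque() for s in senders}
        self.last_stamp = {s: 0.0 for s in senders}  # latest stamp per channel
        self.clock = 0.0                             # the node's own clock

    def receive(self, sender, timestamp, payload):
        # Each sender's time-stamps must be non-decreasing.
        if timestamp < self.last_stamp[sender]:
            raise ValueError("time-stamps on a channel must be non-decreasing")
        self.last_stamp[sender] = timestamp
        self.inputs[sender].append((timestamp, payload))

    def next_safe_event(self):
        """Return (sender, timestamp, payload), or None if the node must wait."""
        # An empty channel could still deliver a message with an earlier
        # time-stamp, so nothing can be simulated safely yet.
        if any(not queue for queue in self.inputs.values()):
            return None
        sender = min(self.inputs, key=lambda s: self.inputs[s][0][0])
        timestamp, payload = self.inputs[sender].popleft()
        self.clock = timestamp
        return sender, timestamp, payload

node = LogicalNode(["A", "B"])
node.receive("A", 5.0, "x")
blocked = node.next_safe_event()   # None: nothing has arrived from "B" yet
node.receive("B", 3.0, "y")
ready = node.next_safe_event()     # the message stamped 3.0 is now safe
```

In this sketch the node blocks again as soon as either channel runs empty. If every node in a cycle is blocked in this way, none of them can send the message its successor is waiting for, which is the deadlock situation described above.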
To remedy the wait problem, artisans have been employing recovery and avoidance techniques. In the recovery technique, proposed by Chandy and Misra, upon detecting a deadlock, the processors in the network exchange messages in order to determine which of the waiting nodes can process their events in spite of the apparent deadlock. This is described in K. M. Chandy and J. Misra, "Asynchronous Distributed Simulation via a Sequence of Parallel Computations," Communications of the ACM, Vol. 24, No. 4, April 1981, pp. 198-206. In the avoidance technique, on the other hand, certain types of nodes send null messages under specific conditions even when no instructions for other nodes are called for. By this technique, nodes can be advanced more quickly in their simulation time. Jefferson and Sowizral, in "Fast Concurrent Simulation Using the Time Warp Mechanism," Distributed Simulation, 1985, The Society for Computer Simulation Multiconference, San Diego, Calif., suggest a different technique where each node is allowed to advance in its simulation time "at its own risk," but when a message arrives that would have prevented some events from being simulated, a "roll-back" is executed to undo the simulation that was done. Roll-back of a node may not be difficult, perhaps, but the fact that the simulated event(s) that need to be rolled back may have caused messages to be sent to other nodes does complicate the task substantially. To achieve the roll-back, Jefferson et al. suggest the use of "anti-messages," which are messages that parallel the original messages, except that they cause the performance of some action that "undoes" the original action.
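The roll-back idea may be illustrated, in greatly simplified form, by the sketch below. It is not the mechanism as specified by Jefferson and Sowizral. It assumes a toy node whose state is a running sum of event payloads, saves the prior state with every simulated event, and records an anti-message for each message sent by an event that is later undone. The class name `OptimisticNode`, its fields, and the message tuples are hypothetical.

```python
class OptimisticNode:
    """Toy node that simulates events at its own risk and rolls back if needed."""

    def __init__(self):
        self.state = 0      # toy state: running sum of payloads
        self.clock = 0.0
        self.history = []   # (timestamp, payload, state_before), in time order
        self.outbox = []    # messages and anti-messages emitted so far

    def _execute(self, timestamp, payload):
        self.history.append((timestamp, payload, self.state))
        self.state += payload
        self.clock = timestamp
        self.outbox.append(("msg", timestamp, self.state))

    def receive(self, timestamp, payload):
        undone = []
        # Undo every event that was simulated past the arriving time-stamp.
        while self.history and self.history[-1][0] > timestamp:
            t, p, state_before = self.history.pop()
            # The anti-message parallels the message the undone event sent.
            self.outbox.append(("anti", t, state_before + p))
            self.state = state_before
            undone.append((t, p))
        self._execute(timestamp, payload)
        # Re-simulate the undone events in time-stamp order.
        for t, p in reversed(undone):
            self._execute(t, p)

node = OptimisticNode()
node.receive(10.0, 1)
node.receive(20.0, 2)
node.receive(15.0, 5)   # arrives late and forces a roll-back of the event at 20.0
```

Even in this toy form, the late arrival causes one event to be simulated twice and one extra message to be sent solely to cancel an earlier one, which suggests why roll-back becomes costly when the undone events have already affected other nodes.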
None of these techniques is very good because each potentially expends an inordinate amount of computation time in making sure that the overall simulation advances properly. The null message approach expends computing resources in generating, sending, and reading the null messages; the recovery approach expends computing resources to detect and recover from a deadlock; and the roll-back approach expends computing resources in simulating events and then undoing the work that was previously done.