Simulation is used to understand the behavior of systems. It is particularly beneficial for studying the behavior of complex systems without having to build and test an actual working system. For instance, in computer networks, it may be useful to model the network as a group of attached nodes, each of which forwards data to other nodes and communicates routing information between the nodes. Simulation is especially valuable for large networks (e.g., the Internet), as it is not realistic to assemble and test a network of such a large scale. Network-specific simulation systems such as BONeS (available from Cadence, Inc., San Jose, Calif.) and OPNET (available from OPNET Technologies, Inc., Bethesda, Md.) may be used to model such network systems. Other simulation systems are also available.
Other simulation systems have been developed to solve particular problems (e.g., analog/digital circuit design). For instance, it may be useful to model and simulate components of an integrated circuit, system-on-a-chip, or other processing system, so that the workings of the chip can be understood in a simulation before the chip is produced. By simulating the chip and identifying issues before production, design issues that may affect the quality of the chip (e.g., performance) may be avoided. Further, supporting systems (e.g., other chips, systems, etc.) with which the chip may interact may also be simulated. Still other simulation systems address other particular problems (e.g., Flexsim (available from FlexSim Software Products, Inc.) to simulate business processes, SPADES (available from SourceForge.net) to simulate parallel-executing software agents, etc.). Also, there are a number of general-purpose simulation systems and languages (e.g., CSMP, CSSL, SIMULA, etc.) that can be programmed to simulate different types of systems having a variety of properties.
However, simulation is a time-consuming process, both in the programming necessary to develop the simulation and in the resources (time, processing capability of the simulation system) needed to execute it. Generally, a simulation is executed on one or more computer systems having one or more processors (e.g., a personal computer, workstation, mainframe, etc.). These systems are limited in their capability to process multiple parallel events, and as the number of entities being modeled increases, the amount of memory and the number of processing cycles required for the simulation also increase. When simulating large systems (e.g., a network having thousands of nodes), these limitations become more apparent. Thus, there is a need to increase the performance of simulation systems, and in particular to enable a simulation system to handle a higher number of parallel events.
There are different types of simulation that can be performed depending on the system being simulated. Continuous time simulation is used to model time-dependent systems (e.g., a server that responds to one or more clients) whose behavior has some type of time relationship. Other simulations model systems whose behavior is driven by events rather than by a continuous flow of time; this type of simulation is referred to as discrete event simulation. A simulation system may implement one or both of these simulation techniques.
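The discrete event technique described above can be illustrated with a minimal event loop: pending events sit in a priority queue keyed by timestamp, and the simulator repeatedly pops and executes the earliest one. This is a generic sketch (the `Simulator` class and its methods are illustrative, not taken from any particular simulation system).

```python
import heapq

class Simulator:
    """Minimal discrete event simulator: a clock plus a timestamp-ordered queue."""

    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0  # tie-breaker so events with equal timestamps pop in insertion order

    def schedule(self, delay, action):
        # Schedule `action` to run `delay` time units after the current clock.
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self, until=float("inf")):
        # Pop and execute events in non-decreasing timestamp order.
        while self._queue and self._queue[0][0] <= until:
            self.now, _, action = heapq.heappop(self._queue)
            action(self)

log = []
sim = Simulator()
sim.schedule(2.0, lambda s: log.append(("b", s.now)))
sim.schedule(1.0, lambda s: (log.append(("a", s.now)),
                             s.schedule(5.0, lambda s2: log.append(("c", s2.now)))))
sim.run()
print(log)  # [('a', 1.0), ('b', 2.0), ('c', 6.0)]
```

Note that events execute strictly in timestamp order regardless of the order in which they were scheduled; an event handler may schedule further events, advancing the simulation.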
An entity being modeled may be aware of time, responding to events and scheduling events in time. Other entities may not be aware of time (e.g., a FIFO, a calculator) and are generally responsive to other entities in the simulation environment, which may or may not themselves be time-aware. There are also entities that maintain their own simulation time independently and interact with other entities that likewise maintain their own simulation times. Generally, modeling and simulating such time-independent systems is more complex than modeling time-dependent systems, as more processing is involved in maintaining an independent simulation time for each entity and in simulating the parallel processing of events at each entity. Processing of time-independent systems is referred to in the art as Parallel Discrete Event Simulation, or PDES.
Problems arise when simulating systems that process discrete events in parallel. Some simulators execute events serially, but this generally becomes inefficient as the number of parallel-operating entities increases. Multiple computers, processors, or logical processes (hereinafter, processing "entities") may be used to perform parallel simulations to increase performance, but coordination between the parallel processing entities becomes problematic. In particular, it is difficult to execute events concurrently on different processing entities without knowing the exact causal relationship between those executions. More simply stated, the processing of a first event on one processing entity may affect the execution of a second event on another processing entity; executing one event (e.g., the first event) in the wrong order with respect to the other (e.g., the second event) may cause a simulation error to occur.
Several different methods (protocols) exist for coordinating the simulation of parallel entities on different processors. For example, there is a general class of protocols, referred to in the art as lookahead protocols, that try to predict the future within the simulation. That is, lookahead protocols attempt to predict future events by having processing entities exchange messages identifying the lowest timestamp of any event each can send, thereby communicating the status of the other processing entities. Generally, each processing entity waits to process additional events until it determines with certainty that events do not affect one another. These protocols are generally conservative, as the individual processing entities tend to wait for one another before processing events, and this characteristic makes the simulation proceed non-optimally.
Another class of parallel event simulation protocols includes what are referred to in the art as "optimistic" protocols, which assume that the processing of events by one processing entity does not affect processing at other processing entities. Because causally related entities do affect each other, this assumption is not true in general, and such protocols must maintain the ability to correct the past when a processed event turns out to affect another processing entity (a causal event). In this case, the simulation is "rolled back" to correct the processing of the causal event, and the simulation is then advanced again from that rolled-back point in simulation time. To make this possible, each processing entity retains a record of every event it processes locally, so that it can recover from an out-of-order execution; generally this is done by saving changes in state in a change list or other structure used to track changes. As can be appreciated, this rolling back of simulation time is inefficient, and as the number of simulated entities increases, each having one or more causal relationships with the others, the number of rollbacks also increases. Rollbacks are computationally expensive, as the processing cycles spent on the rolled-back events are lost and the work must be repeated. Also, the amount of memory required by the simulation system(s) to store the changes corresponding to each processed event becomes prohibitive, especially when there are a large number of modeled entities. What is needed, therefore, is a more efficient protocol for simulating the execution of parallel events.
Researchers have long realized that Parallel Discrete Event Simulation (PDES) is an effective approach to simulating large-scale complex systems. Research on PDES has been going on for more than twenty years. The main difficulty in this area is to achieve high efficiency of parallel execution while preserving the causality order between events in a simulation carried out on multiple processors. The logical process paradigm, widely used in the PDES community, assures that no causality errors will occur if each logical process adheres to the local causality constraint, i.e., if each logical process executes its events in non-decreasing timestamp order. Therefore, to preserve the causality order, it is sufficient, but not necessary, that each logical process find and execute the future event with the smallest timestamp.
The advent of PDES was marked by the invention of conservative protocols, the first of which was the null message protocol, the so-called Chandy/Misra/Bryant protocol, developed in 1979. In most cases, conservative protocols require each logical process to broadcast to its neighbors, in the form of null messages, a lower bound on the timestamp of events it will send to other logical processes, or Earliest Output Time (EOT). By listening to the null messages from all of its neighbors, each processing entity can determine the lowest timestamp it may receive in a future message, or Earliest Input Time (EIT). If this timestamp is greater than that of the earliest event in its local event list, the process is sure that this earliest event can be processed without violating the causality constraint. Otherwise, the processing entity must block until this condition is met (i.e., until the in-transit message carrying the event with the smallest timestamp is received and that event is placed on the local event list).
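The EOT/EIT test just described can be sketched as follows. This is a simplified, illustrative model (the class and method names are assumptions, and the EOT computation assumes each process guarantees a fixed minimum "lookahead" delay on its outgoing events, a common simplification in the literature rather than the only possible formulation).

```python
import heapq

class LogicalProcess:
    """Sketch of one logical process under a conservative null-message protocol."""

    def __init__(self, name, lookahead):
        self.name = name
        self.lookahead = lookahead   # guaranteed minimum delay on outgoing events
        self.events = []             # local event list: (timestamp, payload) heap
        self.neighbor_eot = {}       # neighbor name -> EOT from its last null message

    def eit(self):
        # Earliest Input Time: lowest timestamp any future incoming event can carry,
        # determined by listening to neighbors' advertised EOTs.
        return min(self.neighbor_eot.values(), default=float("inf"))

    def eot(self):
        # Earliest Output Time: lower bound on timestamps of events this process
        # may send, advertised to neighbors in null messages.
        earliest_local = self.events[0][0] if self.events else float("inf")
        return min(self.eit(), earliest_local) + self.lookahead

    def safe_to_process(self):
        # The earliest local event is safe only if no in-transit message can still
        # arrive with a smaller timestamp; otherwise the process must block.
        return bool(self.events) and self.events[0][0] < self.eit()

lp = LogicalProcess("A", lookahead=1.0)
heapq.heappush(lp.events, (5.0, "pkt"))
lp.neighbor_eot["B"] = 4.0
print(lp.safe_to_process())   # False: B might still send an event at t=4.0
lp.neighbor_eot["B"] = 6.0
print(lp.safe_to_process())   # True: no incoming event can precede t=5.0
```

The blocking behavior of conservative protocols falls out of `safe_to_process`: until every neighbor's advertised EOT exceeds the earliest local timestamp, the process simply waits.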
In 1985, Jefferson published a paper describing a construct referred to as Virtual Time, which proposed a new synchronization paradigm called Time Warp. In Time Warp and other optimistic protocols, a logical process is allowed to aggressively process events in its local event list, and during event execution new messages can be sent to other logical processes. However, when an event arrives from another logical process with a timestamp smaller than the local simulation time, it triggers a causality error. As a result, all processed events having a larger timestamp must be rolled back, and anti-messages must be sent to other logical processes to counteract the messages sent during the erroneous computation. Ironically, although they are called optimistic, these protocols are actually quite pessimistic, in the sense that they must save every change made to the state in order to recover from erroneous computation, because they assume that every operation is unsafe and subject to rollback. The Global Virtual Time (GVT) gives a lower bound on the timestamp of the earliest event that a logical process may receive. Therefore, any event processed earlier than the GVT is regarded as committed, because it will never be rolled back. For such events, the logical process can reclaim the memory used to store the associated state (or state changes, if incremental state saving is used).
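The rollback mechanism above can be sketched with a toy optimistic process that snapshots its state before each event, so that a "straggler" message with a smaller timestamp can undo later events. This is an illustrative simplification, not Jefferson's actual implementation: it uses full-state saving rather than incremental saving, and it omits the sending of anti-messages noted in the comment.

```python
import copy

class OptimisticLP:
    """Toy Time Warp-style logical process with state saving and rollback."""

    def __init__(self):
        self.now = 0.0
        self.state = {"count": 0}
        self.processed = []   # (timestamp, state snapshot taken before the event)

    def process(self, ts):
        # Optimistically execute, saving the prior state for a possible rollback.
        self.processed.append((ts, copy.deepcopy(self.state)))
        self.state["count"] += 1
        self.now = ts

    def receive(self, ts):
        if ts < self.now:        # straggler: causality error detected
            self.rollback(ts)
        self.process(ts)

    def rollback(self, ts):
        # Undo every event with timestamp >= the straggler's; a full Time Warp
        # would also emit anti-messages for output produced by those events.
        while self.processed and self.processed[-1][0] >= ts:
            _, saved = self.processed.pop()
            self.state = saved
        self.now = self.processed[-1][0] if self.processed else 0.0

    def fossil_collect(self, gvt):
        # Events earlier than GVT are committed; their snapshots can be reclaimed.
        self.processed = [(t, s) for t, s in self.processed if t >= gvt]

lp = OptimisticLP()
for t in (1.0, 4.0, 7.0):
    lp.receive(t)
lp.receive(3.0)            # straggler: rolls back the events at t=4.0 and t=7.0
print(lp.now, lp.state)    # 3.0 {'count': 2}
```

The `fossil_collect` method reflects the memory-reclamation role of GVT described above: snapshots for committed events can be discarded, since those events can never be rolled back.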
Research on PDES has been largely dominated by studies of conservative and optimistic protocols and comparisons of their performance. Unfortunately, both types of protocols have their strengths and weaknesses. The efficiency of conservative protocols in parallel execution is limited by the amount of lookahead in the simulation model, which corresponds to the difference between the Earliest Output Time (EOT) and the Earliest Input Time (EIT). Both EIT and EOT are known exactly only at run-time, and in many real-world applications it is difficult to derive bounds for the difference between these two values, which defines the usable lookahead. Moreover, the large number of null messages required to collaboratively advance the simulation clock in conservative protocols often incurs significant overhead; as a result, parallelized execution may be slower than even sequential execution. Optimistic protocols, on the other hand, do not depend on lookahead and null messages; however, state saving usually requires storing and accessing large amounts of memory, which negatively impacts execution speed because of the relatively slow improvement in memory access speed within current VLSI technology. The handling of anti-messages also complicates simulation model development. Furthermore, optimistic models may exhibit unexpected behavior caused by inconsistent messages resulting from rollback inconsistencies and stale states.