Simulation is one method to predict the behavior of a target system (we use the term “target” to mean the system that is being simulated). A simulator mimics some or all of the behavior of the target. Simulation is often used when measuring the target system itself is undesirable for a variety of reasons including target unavailability, target cost, or the inability to appropriately measure the target.
Simulators are used in almost all fields and are implemented using a variety of technologies. Two examples are wind tunnels used to measure the coefficient of drag on miniature models of automobiles, and war games that use live participants to test the capabilities of soldiers, commanders and military machinery. There is even a class of simulator games, such as SimCity, that simulates the growth and health of a city under the guidance of a player who acts as the city planner.
Though simulators can be implemented in a variety of ways, many current simulator hosts are computers (we use the term “host” to mean the system that runs the simulator). A computer simulation is reproducible, can be duplicated so that many copies run at once, does not require physical objects, and is generally easy to observe.
In addition to commonly serving as simulation hosts, computers are also simulation targets. Computers have long been sufficiently complex to require simulation to model behavior with any precision. Predicting the behavior of a computer system is useful for a variety of purposes including but not limited to (i) evaluating architectural alternatives, (ii) verifying a design, and (iii) developing, debugging and tuning compilers, operating systems and applications. Behaviors ranging from performance to power to reliability are all useful to predict.
Virtually all computer simulators (we use the term “computer simulators” to mean “simulators of computer systems”) run on computer hosts. One significant issue facing computer simulators is simulation speed, an important concern for all simulators. For example, a weather simulator that runs slower than real time has limited efficacy. One may argue, however, that as successive generations of computers get faster over time, the simulators that run on those computers will also get faster. In fact, simulators of the physical world do run faster as the host computer increases in speed because the physical world does not increase in complexity over time.
Simulators of computers, however, do not. The problem is rooted in the fact that the more complex the simulated target becomes, the more activity it engages in per unit simulated time. This results in an increase in the computation per simulated unit time that must be performed by the host. The greater the computation per simulated unit time, the slower the simulator. Unlike the physical world, computer targets grow in complexity as fast as or faster than the computer hosts improve in performance. Thus, the increased speed of the host is consumed by the increased complexity of the target, resulting in computer simulation speeds remaining roughly constant over time.
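The argument above can be sketched as a simple ratio: simulation speed is the host's throughput divided by the host work needed per simulated cycle. The following is a minimal illustration only; the function names, starting values and growth factors are assumptions chosen for readability, not measurements.

```python
def simulated_cycles_per_second(host_ops_per_second, host_ops_per_target_cycle):
    """Simulation speed = host throughput / host work per simulated target cycle."""
    return host_ops_per_second / host_ops_per_target_cycle


def speed_over_generations(generations, host_growth, target_growth,
                           host_ops_per_second=1e9,
                           host_ops_per_target_cycle=1e5):
    """Simulation speed for successive host/target generations.

    All numeric defaults are hypothetical and purely illustrative.
    """
    speeds = []
    for _ in range(generations):
        speeds.append(simulated_cycles_per_second(host_ops_per_second,
                                                  host_ops_per_target_cycle))
        host_ops_per_second *= host_growth          # the host gets faster
        host_ops_per_target_cycle *= target_growth  # the target gets more complex
    return speeds


# A physical-world target does not grow in complexity, so speed tracks the host.
physical = speed_over_generations(4, host_growth=2.0, target_growth=1.0)
# A computer target grows as fast as the host, so speed stays flat.
computer = speed_over_generations(4, host_growth=2.0, target_growth=2.0)
print(physical)
print(computer)
```

Under these assumed growth rates, the first list doubles each generation while the second stays constant, which is the behavior the text describes.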
Computers are complex systems consisting of one or more components that run concurrently and interact with each other. These components include processors, memory, disk, video, network interfaces, and so on. Each component is itself a complex system, making it very difficult to predict almost all aspects of its behavior, including performance, power consumption and even functional correctness. Thus, in order to accurately simulate a computer's behavior, we need to faithfully model the interactions between each component and the components it interacts with. One such component in a computer system is the processor, which is essentially special-purpose hardware designed to execute programs expressed in a specific instruction set architecture (ISA). An ISA is a specification that includes a set of instructions, such as ADD, BRANCH and LOAD/STORE, as well as some model of the storage of the processor such as a register specification and a memory specification. All processors implement some ISA, allowing programs that assume that ISA to be executed by that processor.
Different processor families have different ISAs. For example, one of the most common ISAs is the Intel IA-32, which is often called x86. Processors made by companies such as Intel, AMD, Centaur/VIA, and Transmeta implement the IA-32 instruction set. Different ISAs are not only possible but, at one time, they proliferated. The Sun SPARC ISA, the Motorola/IBM PowerPC ISA, the DEC/Compaq/HP Alpha ISA, the IBM 360 ISA and the MIPS ISA are all ISAs that were supported by real processors.
ISAs tend to evolve over time. The original x86 instruction set, for example, did not include floating point instructions. As the need for floating point became clear and its implementation became practical, however, floating point instructions were added. Many other instructions were later added to the x86 instruction set, including the MMX and SSE SIMD extensions.
Though all processors implement an ISA, different processors implementing the same ISA may have very different organizations. The underlying organization of a processor is called that processor's micro-architecture. The micro-architecture consists of hardware and potentially software components that implement the ISA including instructions and memory. The micro-architecture can be logically broken up into components such as an instruction decode unit, registers, execution units, caches, branch prediction units, reorder buffers, and so on. Some components, such as the instruction decode unit and registers, are essential to the correct operation of the processor. Other units, such as caches, are not essential to correctness but are important for optimizing some behavior such as performance. Each component can often be implemented in many different ways that result in different behavioral characteristics and resource requirements.
To understand how a micro-architectural component can change the performance behavior of a processor, consider an instruction cache. A cache automatically stores data recently accessed from memory and services future requests for that data as long as it remains in the cache. Accessing the cache is faster than accessing the memory. Since the cache is smaller than memory, it relies on a replacement policy that decides which instructions to keep in the cache and which instructions to replace with newly accessed instructions. The first time some code is executed, that code is not in the instruction cache and must be obtained from memory. The second time the code is executed, there is a chance that it is in the cache, in which case the access is faster. Since the cache is limited in size, however, the code in question may have been replaced before it is used again. Cache behavior is heavily dependent on the dynamic usage of that cache. Thus, without running the program and somehow modeling the instruction cache, it is very difficult to determine whether or not the code is in the cache.
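The dependence of cache behavior on dynamic usage can be illustrated with a small model. The sketch below assumes a fully associative cache with a least-recently-used (LRU) replacement policy; the text does not name a policy, so LRU is an assumption, and the class name, function name and traces are all hypothetical.

```python
from collections import OrderedDict


class InstructionCache:
    """Toy fully associative instruction cache with LRU replacement (assumed policy)."""

    def __init__(self, num_lines):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # block address -> None, least recently used first

    def access(self, block_address):
        """Return True on a hit; on a miss, fetch the block and return False."""
        if block_address in self.lines:
            self.lines.move_to_end(block_address)  # mark as most recently used
            return True
        if len(self.lines) >= self.num_lines:
            self.lines.popitem(last=False)         # evict the least recently used block
        self.lines[block_address] = None
        return False


def count_hits(trace, num_lines):
    """Run a trace of instruction block addresses through a fresh cache."""
    cache = InstructionCache(num_lines)
    return sum(cache.access(block) for block in trace)


# Two loops over instruction blocks, each run against a two-line cache.
small_loop = [0, 1, 0, 1, 0, 1]   # working set fits in the cache
large_loop = [0, 1, 2, 0, 1, 2]   # working set is one block too large
hits_small = count_hits(small_loop, num_lines=2)
hits_large = count_hits(large_loop, num_lines=2)
print(hits_small, hits_large)
```

In this model the loop that fits in the cache hits on every access after the first pass, while the slightly larger loop never hits because each block is evicted just before it is reused. Whether a given fetch hits cannot be read off the code alone; it depends on the access history.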
There are many more components and features within a processor that contribute to behavioral variance, such as superscalar and out-of-order execution, branch prediction, parallel execution, and virtual memory. In addition to the processor, there are many more components within a computer system that also contribute to behavioral variance. Added together, there can be a significant amount of behavioral variation that depends on a large number of variables, including the programs currently being run, the programs that ran in the past, and external events such as the arrival of a network packet or a keystroke.
The most accurate model of a computer is the computer itself. It is often the case, however, that it is impractical and/or impossible to use the computer itself to predict its own behavior. For example, the computer is not available to be measured before it is manufactured. Running applications on an existing system and using its behavior to directly predict the behavior of a next generation system is generally inaccurate since the new system will be different than the old one.
Due to the complexity of computer systems, their behavior is generally predicted using simulators. Most simulators are written entirely in software and executed on regular computers. Simulators can model computer system behavior at a variety of levels. For example, some simulators only model the ISA and peripherals at a “functional” level, that is, at a detail level sufficient to implement functionality but not to predict timing. Such simulators are often able to boot operating systems and run unmodified applications and can be useful to provide visibility when debugging operating systems and software.
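To make the notion of a “functional” simulator concrete, the following is a minimal sketch of one for a toy ISA. The ISA here (four opcodes, four registers, word-addressed memory) is hypothetical and invented for illustration; it is not any real instruction set. The simulator computes correct architectural state and counts instructions, but carries no notion of cycles or timing.

```python
def run(program, memory, num_registers=4, max_steps=1000):
    """Functionally execute a program for a hypothetical toy ISA.

    Each instruction is a tuple (opcode, a, b, c); unused fields are 0.
    Returns the final register values and the number of instructions executed.
    """
    regs = [0] * num_registers
    pc = 0
    executed = 0
    while pc < len(program) and executed < max_steps:
        op, a, b, c = program[pc]
        pc += 1
        executed += 1
        if op == "LOAD":        # regs[a] <- memory[b]
            regs[a] = memory[b]
        elif op == "STORE":     # memory[b] <- regs[a]
            memory[b] = regs[a]
        elif op == "ADD":       # regs[a] <- regs[b] + regs[c]
            regs[a] = regs[b] + regs[c]
        elif op == "BRANCH":    # jump to instruction b if regs[a] is non-zero
            if regs[a] != 0:
                pc = b
        else:
            raise ValueError(f"unknown opcode: {op}")
    return regs, executed


# Sum the integers 3 + 2 + 1 with a countdown loop and store the result.
memory = [3, -1, 0]             # loop count, decrement, result slot
program = [
    ("LOAD", 0, 0, 0),          # r0 <- memory[0]  (counter)
    ("LOAD", 1, 1, 0),          # r1 <- memory[1]  (-1)
    ("ADD", 2, 2, 0),           # r2 <- r2 + r0    (accumulate)
    ("ADD", 0, 0, 1),           # r0 <- r0 + r1    (decrement counter)
    ("BRANCH", 0, 2, 0),        # loop back to the accumulate step while r0 != 0
    ("STORE", 2, 2, 0),         # memory[2] <- r2
]
regs, executed = run(program, memory)
print(regs, executed, memory)
```

The result says what the program computed and how many instructions it executed, but nothing about how long those instructions would take on a particular micro-architecture. Predicting that requires the more detailed models discussed next.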
Other simulators model computer systems at a detail level sufficient to accurately predict the behavior of the computer system at a cycle-by-cycle level. Such simulators must accurately model all components that could potentially affect timing. They are often written by architects during the design of a computer system to help evaluate architectural mechanisms and determine their effect on overall performance. Most processors today are implemented in hardware description languages (HDLs) that enable the specification of the processor at the register transfer level (RTL). Such specifications can also be simulated very accurately.
There are, however, issues with cycle-accurate simulators. For the most part, they are extremely slow. Most truly cycle-accurate simulators run at approximately 10K cycles per second or slower. RTL cycle-accurate simulations run at a few cycles per second at best. Though computers have been getting faster, the complexity of the machines that they were simulating has also gone up, keeping simulation speed fairly constant over time. With the proliferation of chip multiprocessors (CMPs), however, it is likely that simulation performance will drop rapidly unless simulators can be efficiently parallelized. Simulating multiple processors obviously takes longer than simulating a single processor on the same host hardware resources.
Current simulator speeds are far too slow to run full operating systems and applications. For example, a simulator running at 10K cycles per second takes 402 days to simulate a two-minute OS boot. Such times are far too long, forcing users to extract kernels that are intended to accurately model longer runs. Such kernels, however, are difficult to choose and often do not exercise all of the behavioral complexity. It would be far easier if accurate simulators were fast enough to run full, unmodified operating systems and applications.
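The arithmetic behind the boot-time example can be checked directly. The text does not state the target's clock rate, so the sketch below infers the rate implied by the 402-day figure and, for comparison, assumes a hypothetical 3 GHz target. Both are inferences made here, not values given in the text, and the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(target_seconds, target_hz, simulated_cycles_per_second):
    """Host days needed to simulate target_seconds of target execution."""
    target_cycles = target_seconds * target_hz
    return target_cycles / simulated_cycles_per_second / SECONDS_PER_DAY


def implied_target_hz(days, target_seconds, simulated_cycles_per_second):
    """Target clock rate implied by a given simulation time (an inference)."""
    return days * SECONDS_PER_DAY * simulated_cycles_per_second / target_seconds


# Clock rate implied by 402 days for a 120-second boot at 10K cycles per second.
implied_hz = implied_target_hz(402, 120, 10_000)
# Simulation time for the same boot on an assumed 3 GHz target.
days_at_3ghz = simulation_days(120, 3e9, 10_000)
print(implied_hz / 1e9, days_at_3ghz)
```

The 402-day figure is consistent with a target clocked at roughly 2.9 GHz. An assumed 3 GHz target would take about 417 days, so the conclusion does not depend on the exact clock rate.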
Thus, computer system simulation is a difficult problem with no satisfactory full solutions.