1. Technical Field of the Invention
The present invention generally relates to computer systems. More particularly, and not by way of any limitation, the present invention is directed to a system and method for increasing performance in a multi-CPU simulator environment.
2. Description of Related Art
Architectural simulators are often used to simulate a target hardware platform, whereby a simulated execution environment can “execute” a particular piece of software intended for the target hardware as if it were run on the actual machine itself. The target hardware platform may comprise any known computer architecture. For various reasons, simulators operable to simulate multiprocessor (MP) computer systems are becoming more prevalent. Because the teachings of the present invention are particularly exemplified within the context of MP platforms, a brief introduction thereto is immediately set forth below.
In the most general sense, multiprocessing may be defined as the use of multiple processors to perform computing tasks. The term could apply to a set of networked computers in different locations, or to a single system containing several processors. As is well known, however, the term is most often used to describe an architecture where two or more linked processors are contained in a single or partitioned enclosure. Further, multiprocessing does not occur just because multiple processors are present. For example, having a stack of personal computers in a rack is not multiprocessing. Similarly, a server with one or more “standby” processors is not multiprocessing, either. The term “multiprocessing” is typically applied, therefore, only to architectures where two or more processors are designed to work in a cooperative fashion on a task or set of tasks.
There exist numerous variations on the basic theme of multiprocessing. In general, these variations relate to how independently the processors operate and how the workload among these processors is distributed. In loosely-coupled multiprocessing architectures, the processors perform related tasks but they do so as if they were standalone processors. Each processor is typically provided with its own private memory and may have its own mass storage and input/output (I/O). Further, each loosely-coupled processor runs its own copy of an operating system (OS), and communicates with the other processor or processors through a message-passing scheme, much like devices communicating over a local area network. Loosely-coupled multiprocessing has been widely used in mainframes and minicomputers, but the software to effectuate MP activity is closely tied to the hardware design. For this reason, among others, it has not gained the support of software vendors and is not widely used in today's high performance server systems.
In tightly-coupled multiprocessing, on the other hand, operation of the processors is more closely integrated. They typically share main memory, and may even have a shared cache. The processors need not be identical to one another, and may or may not perform similar tasks. However, they typically share other system resources such as mass storage and I/O. Additionally, instead of a separate copy of the OS for each processor, they run a single copy, with the OS handling the coordination of tasks between the processors. The sharing of system resources makes tightly-coupled multiprocessing platforms somewhat less expensive, and tightly-coupled multiprocessing is the dominant multiprocessor architecture in the business-class servers currently deployed.
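By way of illustration only, the contrast between the message-passing scheme of loosely-coupled architectures and the shared main memory of tightly-coupled architectures may be sketched as follows. The sketch uses Python threads as stand-ins for processors, which is an assumption made purely for brevity; the function names `loosely_coupled_sum` and `tightly_coupled_sum` are hypothetical and do not correspond to any particular platform.

```python
import queue
import threading


def loosely_coupled_sum(chunks):
    """Each worker keeps private data and reports its result as a message."""
    mailbox = queue.Queue()  # stands in for the message-passing link

    def worker(private_chunk):
        # The worker touches only its own private data, then sends one message.
        mailbox.put(sum(private_chunk))

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The coordinator combines the messages it received.
    return sum(mailbox.get() for _ in chunks)


def tightly_coupled_sum(chunks):
    """All workers update one shared memory location under a lock."""
    shared = {"total": 0}    # stands in for shared main memory
    lock = threading.Lock()  # coordination of access to the shared location

    def worker(chunk):
        partial = sum(chunk)
        with lock:
            shared["total"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["total"]
```

Both functions compute the same result; they differ only in whether the workers exchange messages or write to a common location.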
Hardware architectures for tightly-coupled MP platforms can be further divided into two broad categories. In symmetrical MP (SMP) systems, system resources such as memory, disk storage and I/O are shared by all the microprocessors in the system. The workload is distributed evenly to available processors so that one does not sit idle while another is heavily loaded with a specific task. Further, the SMP architecture is highly scalable, i.e., the performance of SMP systems increases, at least theoretically, as more processor units are added.
In asymmetrical MP (AMP) systems, tasks and resources are managed by different processor units. For example, one processor unit may handle I/O and another may handle network OS (NOS)-related tasks. Thus, it should be apparent that an asymmetrical MP system may not balance the workload and, accordingly, it is possible that a processor unit handling one task can be overworked while another unit sits idle.
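The difference in workload balance between SMP and AMP systems may be illustrated with the following minimal sketch. It assumes a simple greedy "least-loaded processor" policy for the SMP case and a fixed task-type-to-processor mapping for the AMP case; both policies, the task costs, and the names `smp_assign` and `amp_assign` are hypothetical and chosen only for illustration.

```python
def smp_assign(task_costs, n_procs):
    """SMP-style: any processor may take any task; pick the least loaded one."""
    loads = [0] * n_procs
    for cost in task_costs:
        target = loads.index(min(loads))  # first processor with the lowest load
        loads[target] += cost
    return loads


def amp_assign(tasks, role_to_proc, n_procs):
    """AMP-style: each task type is bound to one dedicated processor."""
    loads = [0] * n_procs
    for role, cost in tasks:
        loads[role_to_proc[role]] += cost
    return loads


# Hypothetical workload: three expensive I/O tasks and one cheap network task.
tasks = [("io", 5), ("io", 5), ("io", 5), ("net", 1)]

smp_loads = smp_assign([cost for _, cost in tasks], n_procs=2)
amp_loads = amp_assign(tasks, {"io": 0, "net": 1}, n_procs=2)

print(smp_loads)  # [10, 6]
print(amp_loads)  # [15, 1]
```

Under these assumptions, the AMP mapping leaves the processor dedicated to network tasks nearly idle while the I/O processor carries almost the entire load.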
SMP systems are further subdivided into two types, depending on the way cache memory is implemented. “Shared-cache” platforms, where off-chip (i.e., Level 2, or L2) cache is shared among the processors, offer lower performance in general. In “dedicated-cache” systems, every processor unit is provided with a dedicated L2 cache, in addition to its on-chip (Level 1, or L1) cache memory. The dedicated L2 cache arrangement accelerates processor-memory interactions in the multiprocessing environment and, moreover, facilitates better scalability.
Regardless of which type of target platform is simulated, a simulator is typically run on a host machine that may itself be a high-performance computer system having MP capability. During the execution of the simulator and any code run on the simulated platform, the host machine expends its own resources, e.g., processor cycles, memory accesses, and the like, in order to execute the simulator software. Clearly, how effective and efficient a simulator is with respect to a target hardware platform depends on how it is run on the host machine, which in turn is based on how the host resources are utilized in the process of effectuating a simulation environment.
It should be readily apparent that consumption and conservation of host resources can be particularly critical where a multi-CPU platform is being simulated for executing a specific piece of software (also referred to as code under simulation). In a conventional simulation environment, when a simulated processor is in an idle loop during the execution of the code under simulation, the host machine resources continue to be used up, thereby reducing the performance of the simulator. This situation is especially wasteful in a single-threaded simulator environment supporting a multi-CPU target platform because each simulated CPU consumes the same amount of available host resources even when executing idle loops.
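The waste described above may be illustrated with a minimal sketch of a conventional single-threaded simulator loop. The sketch assumes a simple round-robin policy in which every simulated CPU receives an equal share of host time slices; the class `SimulatedCPU`, the function `run_round_robin`, and the workload figures are hypothetical and are not drawn from any particular simulator.

```python
from dataclasses import dataclass


@dataclass
class SimulatedCPU:
    """Hypothetical simulated CPU with `work` useful steps left to execute."""
    work: int
    useful_steps: int = 0
    idle_steps: int = 0

    def step(self) -> None:
        # Every call costs the host one time slice, whether or not work remains.
        if self.work > 0:
            self.work -= 1
            self.useful_steps += 1
        else:
            self.idle_steps += 1  # the idle loop still consumes a host slice


def run_round_robin(cpus, total_slices):
    """Single-threaded host loop giving each simulated CPU an equal share."""
    for i in range(total_slices):
        cpus[i % len(cpus)].step()
    return cpus


def simulate_example():
    # One busy simulated CPU and three idle ones share 16 host time slices.
    cpus = [SimulatedCPU(work=8)] + [SimulatedCPU(work=0) for _ in range(3)]
    run_round_robin(cpus, total_slices=16)
    useful = sum(c.useful_steps for c in cpus)
    idle = sum(c.idle_steps for c in cpus)
    return useful, idle


print(simulate_example())  # (4, 12)
```

Under these assumptions, twelve of the sixteen host time slices are spent on idle loops, and the one busy simulated CPU completes only half of its eight units of work.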