1. Field of the Invention
The present invention generally relates to computer systems, and in particular to a method for emulating the memory sharing behavior of a multiprocessing computer system on another multiprocessing computing system with an intrinsically different memory sharing behavior.
2. Description of the Related Art
A major motivation for emulation is to allow systems written for a particular architecture to execute on another architecture with a minimum loss of performance. Clearly, then, the efficiency of the emulation process and the quality of the resulting “host” code sequence are of paramount importance.
A computing system typically includes several portions, such as the processors, the memory, and the input/output devices. It is often necessary to emulate the behavior of one computing system on another. One of the principal reasons for emulation is to enable programs written for one system (the “target computing system”) to produce the same results on another system (the “host computing system”).
Several conventional techniques have been developed to emulate the instruction set of one processor using the instruction set of another processor (e.g., SimOS as disclosed by Stephen A. Herrod, “Using Complete Machine Simulation to Understand Computer System Behavior,” Ph.D. Thesis, Stanford University, February 1998; or MIMIC as disclosed in Cathy May, “Mimic: A Fast System/370 Simulator,” Proceedings of the Object Oriented Programming Systems Languages and Applications Conference (OOPSLA), Orlando, Oct. 4-8, 1987, Special Issue of Sigplan Notices, vol. 22, No. 12, December 1987, vol. 22, No. 7, June 24).
To perform the emulation faithfully, it is also necessary to emulate the behavior of memory in such a system. Typically, this behavior includes more than reading and writing locations in memory at program-specified addresses.
More particularly, when the contents of some location are changed by a processor in a multiprocessing system, the rules governing when the change should be observed by all processors in the system are well-defined in the architecture. In this respect, most systems today behave almost identically when there is only one processor in the system. That is, the systems enforce program order, which simply means that if a statement S1 precedes another statement S2 in the sequence of instructions presented in a program, the processor must behave as though S1 completes its execution before S2 begins its execution. This implies that S2 must know of changes made by S1 to any resource, including registers and memory.
Therefore, if a multiprocessing system is emulated on a uniprocessing system as in the above-mentioned SimOS and SimICS techniques, and if both the target system and host system obey program order, the memory accesses during emulation can be viewed as a series of epochs 101, each epoch 101 representing a processor in the target system, as shown in FIG. 1. That is, the number in each area indicates the identification of an emulated processor, whereas the shaded areas 100 indicate that the system was performing other functions.
Since there is no simultaneous interaction between the emulated memory accesses of two or more processors during an epoch 101, program order can be guaranteed by simply performing a correct uniprocessor emulation of a target processor on the host processor.
Hence, in FIG. 1, when multiple processors are emulated on a single processor, the emulation of the different target processors is interleaved in time. Thus, processor 1 is emulated for a certain time, then processor 2 is emulated for a while, then the emulation returns to processor 1 for a time, and then there is a time (e.g., shaded area 100) when no emulation occurs and the single processor performs other functions. At any given time, therefore, the single processor is emulating only one of the target processors, not all of them.
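The epoch interleaving described above can be sketched as a simple round-robin loop. This is only an illustrative sketch: the function name `emulate_in_epochs`, the fixed `epoch_len`, and the instruction labels are assumptions made for the example and do not come from the cited simulators.

```python
from collections import deque

def emulate_in_epochs(programs, epoch_len):
    """Round-robin emulation of several target processors on one host.

    programs: dict mapping processor id -> list of instruction labels.
    Returns a list of (processor id, instructions run in that epoch).
    """
    pending = deque((pid, deque(instrs)) for pid, instrs in programs.items())
    trace = []
    while pending:
        pid, instrs = pending.popleft()
        # One epoch: only this emulated processor makes progress.
        ran = [instrs.popleft() for _ in range(min(epoch_len, len(instrs)))]
        trace.append((pid, ran))
        if instrs:
            pending.append((pid, instrs))  # unfinished: gets another epoch later
    return trace

# Hypothetical instruction streams for two emulated processors.
trace = emulate_in_epochs({1: ["a1", "a2", "a3"], 2: ["b1", "b2"]}, epoch_len=2)
for pid, ran in trace:
    print(f"epoch for processor {pid}: {ran}")
```

Because each epoch runs exactly one emulated processor, no two target processors ever access the emulated memory at the same instant in this sketch.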
The above interleaving operation is extremely inefficient, because only one emulated processor makes progress at any given time. However, an advantage is that the “critical section” of one processor can never be interleaved with a critical section of another processor. Here, a “critical section” is a section of code that two processors cannot enter at the same time.
On a uniprocessor, nothing special needs to be done when performing the emulation, since the critical section will never be entered by two processors at the same time. In a sense, the uniprocessor automatically satisfies the condition that the critical section will not be entered by two processors simultaneously. The situation changes dramatically, however, when the target processors are emulated on a multiprocessing system, where problems may arise.
For example, FIG. 2 illustrates an application of Dekker's algorithm in which two processors 201, 202 share memory and each attempts to enter a critical section. Again, on a uniprocessor, access to a critical section (i.e., one that only one processor can enter at a time) does not become problematic, since the uniprocessor operates sequentially through the epochs shown in FIG. 1.
Hence, if a program using Dekker's algorithm was written for a multiprocessing target machine and that machine is emulated on a uniprocessor, there is no problem. However, if the same target machine is emulated by a multiprocessing system, a problem may arise. In the following, Dekker's algorithm should be viewed as running on processors 201, 202 of the target system.
Specifically, the first processor 201 sets a variable x and then ensures that the other processor 202 has not set its own variable y before entering the critical section. The second processor 202 sets y and ensures that x is not set before entering the critical section. There are several ways in which the entire application can play out when emulated on a uniprocessor, two of which are shown in FIG. 3. As noted, in Order 1, P1 is emulated until the epoch change, at which time P2 is emulated. Thus, in the two-processor orders of FIG. 3, whenever there is an epoch change, emulation switches to the other processor, and it is clear that when one processor (e.g., P1) is in the critical section, the other processor (P2) cannot enter the critical section.
In every such ordering, it is impossible for the critical section of one processor to be interleaved with the critical section of the second processor, since this would imply that the host processor sees (x=1; y=0) at one time and (x=0; y=1) at another without a write having occurred in between.
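The impossibility argument can be checked by brute force for the two-instruction fragment described in the text (set own flag, then test the other flag). The following is a minimal sketch, assuming a single shared memory in which every store is immediately visible and each processor's instructions stay in program order; the helper names are hypothetical.

```python
from itertools import combinations

# Program-order instruction streams, following the description of FIG. 2.
P1 = [("store", "x"), ("load", "y")]  # P1: x = 1; enter if y == 0
P2 = [("store", "y"), ("load", "x")]  # P2: y = 1; enter if x == 0

def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's internal order."""
    n = len(a) + len(b)
    for slots in combinations(range(n), len(a)):
        ia, ib = iter(a), iter(b)
        yield [(1, next(ia)) if i in slots else (2, next(ib)) for i in range(n)]

def entrants(schedule):
    """Run one schedule against one shared memory; return who enters."""
    mem = {"x": 0, "y": 0}
    entered = set()
    for pid, (op, var) in schedule:
        if op == "store":
            mem[var] = 1
        elif mem[var] == 0:  # the load saw the other processor's flag clear
            entered.add(pid)
    return entered

results = [entrants(s) for s in interleavings(P1, P2)]
print(results)
```

Of the six possible orderings, none lets both processors enter, which matches the argument above for hosts that preserve program order and make writes visible at once.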
FIG. 4 illustrates a situation that could lead to incorrect behavior of the Dekker's algorithm fragment of FIG. 2. System 400 of FIG. 4 includes a shared memory 410, as well as caches 420A, 420B respectively provided for first and second processors 430A, 430B. In the situation of FIG. 4, both processors could be executing in the critical section at the same time, thereby causing problems.
That is, suppose that the target system has a consistency model that ensures that this situation can never happen (e.g., by waiting after stores to memory until an acknowledgment is received from other processors). Examples of such systems include the IBM System/390® and the Intel x86®.
Also suppose that the host system, in an attempt to make its implementation relatively fast, has a more relaxed consistency model and does not guarantee atomic recording of writes to memory. Incorrect behavior may therefore result when the example above is emulated by such a host system.
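One way such incorrect behavior can arise is sketched below with a toy model. The model is an assumption made purely for illustration: each host processor holds its stores in a private buffer until they are drained to shared memory, which is one common way for writes to become visible late. The class `RelaxedHost` and its methods are hypothetical and do not describe any particular machine.

```python
class RelaxedHost:
    """Toy host: each processor's stores wait in a private buffer until drained."""

    def __init__(self, n_procs):
        self.mem = {"x": 0, "y": 0}
        self.buf = {p: [] for p in range(1, n_procs + 1)}

    def store(self, pid, var, val):
        self.buf[pid].append((var, val))  # not yet visible to other processors

    def load(self, pid, var):
        # A processor always sees its own most recent buffered store.
        for v, val in reversed(self.buf[pid]):
            if v == var:
                return val
        return self.mem[var]

    def drain(self, pid):
        # Make all buffered stores of this processor globally visible.
        for var, val in self.buf[pid]:
            self.mem[var] = val
        self.buf[pid].clear()

host = RelaxedHost(2)
host.store(1, "x", 1)               # P1: x = 1 (buffered)
host.store(2, "y", 1)               # P2: y = 1 (buffered)
p1_enters = host.load(1, "y") == 0  # P1 still sees y == 0
p2_enters = host.load(2, "x") == 0  # P2 still sees x == 0
host.drain(1)
host.drain(2)
print(p1_enters, p2_enters)
```

In this model both tests succeed before either store reaches shared memory, so both emulated processors would enter the critical section together.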
Systems that have such relaxed consistency models, like the PowerPC®, invariably offer special synchronizing instructions, or memory barrier instructions, which ensure that the results of memory actions before the barrier are seen by all processors before the results of later instructions are seen.
Thus, one way to ensure that the simulation is done correctly is to follow every memory instruction with a memory barrier instruction. Unfortunately, such instructions take several cycles to complete. Moreover, memory instructions (e.g., loads and stores from and to memory) are quite frequent (e.g., oftentimes as many as a third of all instructions).
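The cost of this naive approach can be estimated with a short sketch. The opcode names, the instruction sequence, and the cycle counts below are hypothetical values chosen for illustration only; they are not measurements of any real host.

```python
MEMORY_OPS = {"load", "store"}  # hypothetical opcode names

def insert_barriers(instrs):
    """Naive translation: follow every memory instruction with a barrier."""
    out = []
    for op in instrs:
        out.append(op)
        if op in MEMORY_OPS:
            out.append("barrier")
    return out

def slowdown(instrs, barrier_cycles, other_cycles=1):
    """Ratio of cycles with barriers to cycles without, under assumed costs."""
    base = len(instrs) * other_cycles
    barriers = sum(op in MEMORY_OPS for op in instrs)
    return (base + barriers * barrier_cycles) / base

# Hypothetical target sequence in which one third of the instructions touch memory.
target = ["load", "add", "mul", "store", "add", "cmp"]
host_code = insert_barriers(target)
ratio = slowdown(target, barrier_cycles=6)  # assumed barrier cost of 6 cycles
print(host_code)
print(ratio)
```

Under these assumed costs the translated sequence takes three times as many cycles as the original, which illustrates why a barrier after every memory instruction is unattractive.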
It therefore becomes desirable to find a solution to the above problem. Further, it becomes desirable to minimize the cost of emulating the memory consistency behavior of a multiprocessing system on another multiprocessing system, especially when the host multiprocessing system supports a “weaker” (more relaxed) consistency model than the more stringent (“strong”) model of the target multiprocessing system. It is noted that the terms “weaker” (or “relaxed”) and “strong” are believed to be well-known to those of ordinary skill in the art (e.g., see Sarita Adve et al., “Shared Memory Consistency Models: A Tutorial,” IEEE Computer, vol. 29, no. 12, December 1996, pp. 66-76, and L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs,” IEEE Transactions on Computers, C-28, 9, September 1979, pp. 690-691).
Thus, in the conventional emulation methods and techniques, various levels of translation may be employed to enhance the performance of the host instructions produced by the emulator. However, notwithstanding all the current techniques, there remains much room for improvement.
Hence, in the conventional techniques and prior to the present invention, there has been no method and apparatus for emulating memory consistency of one system on another system, especially when the host is supporting the above-mentioned relaxed consistency model and the target system uses a strong consistency model.