1. Field of the Invention
The present invention relates to processors in data processing systems and, in particular, to out-of-order execution of multi-threaded programs, and, more specifically, to an out-of-order execution scheme for executing multiple threads of instructions in a meta-program-based processor.
2. Description of the Related Art
A sequential program includes a number of processor instructions that have a single control flow of execution (or single thread of execution) at any given time. In contrast, a multi-threaded program includes a number of interacting sequential threads, each specifying a concurrent thread of execution (or control flow).
A multi-threaded program specifies the interaction between its concurrent threads, using synchronization primitives to control the order in which the instructions of concurrent threads are interleaved during execution. The synchronization primitives, often implemented as atomic memory access instructions or serialization instructions, are essential for the correctness of multi-threaded programs. They are also the building blocks used to define “critical sections” in programs in order to prevent simultaneous updates to shared data by multiple threads.
Clearly, a processor executing instructions of a multi-threaded program must execute instructions in such a way that it honors all the synchronization primitives specified by the program, in order to avoid non-deterministic and erroneous program execution. In general, in a traditional multithreaded program execution environment, the following two primitive operations are involved in synchronizing or communicating between a pair of threads A and B:
(1) Thread A signals thread B by writing to a shared address space; and
(2) If thread B is blocked, it resumes execution by reading from the shared address space and comparing the data written by thread A to an expected value.
In a processor, both of the above operations involve:
(a) computing the address of the shared memory location; and
(b) carrying out the memory access (load or store) operation.
One problem inherent in this process is that the latency of a memory access operation is unpredictable, often taking several cycles, causing slow synchronization/communication between the threads. Therefore, it is important to make the thread synchronization/communication operations faster and more efficient to achieve higher performance for multi-threaded programs.
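As an illustration only, the two primitive operations described above can be sketched in software, with one thread acting as the signaler and the other as the waiter. The names shared_memory, EXPECTED, signaler, waiter and run_handshake are hypothetical and chosen for this sketch; a Python dictionary stands in for the shared address space, so the address computation and load/store latency of a real processor are not modeled.

```python
import threading
import time

EXPECTED = 1  # hypothetical value the waiting thread compares against


def run_handshake():
    # A dictionary stands in for the shared address space (assumption).
    shared_memory = {"flag": 0, "payload": None}
    observed = []

    def signaler():
        # Operation (1): signal the other thread by writing to shared memory.
        shared_memory["payload"] = "data from signaler"
        shared_memory["flag"] = EXPECTED

    def waiter():
        # Operation (2): read the shared location and compare the data
        # with the expected value until the comparison succeeds.
        while shared_memory["flag"] != EXPECTED:
            time.sleep(0)  # yield; each iteration is another memory read
        observed.append(shared_memory["payload"])

    b = threading.Thread(target=waiter)
    a = threading.Thread(target=signaler)
    b.start()
    a.start()
    a.join()
    b.join()
    return observed[0]
```

Every pass through the loop in waiter is a separate memory read, which is where the unpredictable access latency discussed above enters the synchronization path.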
The document entitled “Prescott New Instructions—Software Developer's Guide”, No. 252490-001, February, 2003, published by Intel Corporation, describes a technique to speed up inter-thread synchronization using two new instructions, MONITOR and MWAIT. The MONITOR instruction is used for defining (i.e., setting up) the address range to be monitored, whereas the MWAIT instruction waits for a write to the address range specified by the MONITOR instruction. These two instructions are used to implement a “triggering address range.”
Any write to the specified address range is detected by the MWAIT instruction without accessing or reading the data stored in that range, and, thus, the instruction pair can be used for providing slightly faster synchronization. This technique avoids operation (2) above by monitoring the addresses written by thread A instead of reading the data.
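The idea of a triggering address range can be illustrated with a small software model. This is a toy sketch, not the MONITOR and MWAIT instructions themselves: the class AddressMonitor and its methods arm, observe_store and wait_satisfied are hypothetical names, and a real MWAIT stalls the thread rather than returning a flag.

```python
class AddressMonitor:
    """Toy model of a triggering address range (assumption, not real hardware)."""

    def __init__(self):
        self.armed_range = None   # (low, high), inclusive; None when not armed
        self.triggered = False

    def arm(self, low, high):
        # Plays the role of setting up the address range to be monitored.
        self.armed_range = (low, high)
        self.triggered = False

    def observe_store(self, address):
        # Called with the address of every store; the stored data is
        # never read, only the address is compared against the range.
        if self.armed_range is not None:
            low, high = self.armed_range
            if low <= address <= high:
                self.triggered = True

    def wait_satisfied(self):
        # Reports whether a write to the armed range has been seen.
        return self.triggered
```

For example, after arm(0x1000, 0x103F), a store to 0x2000 leaves wait_satisfied() false, while a store to 0x1010 makes it true, regardless of the value stored.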
Meta-Program-Based Processors
A meta-program-based computing system provides an even more efficient and faster synchronization technique by eliminating operation (1), as well as both the associated address generation and memory access operations.
That is, meta-program-based computing provides a fundamentally different way of executing multi-threaded programs. In a meta-program-based computing system, such as exemplarily described in the above-identified co-pending application, the instruction address of the main program is used by the meta-program to monitor and follow the execution path (or, more precisely, the control flow graph) of the main program and to change its execution behavior.
Threads in a meta-program-based computing system do not have to write to a specific memory location for a pair of “threads” to synchronize. Instead, thread B can just monitor one or more specific instruction addresses of thread A (where the store instructions would have been in the traditional model of multi-threaded computation). Note that the instruction address to be compared against is generated for free by the next-instruction-address generation logic of the processor.
Thus, meta-program-based computing provides a simpler technique for thread synchronization because it eliminates both operations (1) and (2). The meta-program-based thread synchronization is faster too, because it involves a simple comparison of two addresses, an operation that can be done early in the pipeline.
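The address-comparison style of synchronization described above can be sketched as a simple trace-driven model. This is a minimal illustration under stated assumptions, not the mechanism of the co-pending application: the function run_with_address_triggers, the trace of instruction addresses and the action labels are all hypothetical.

```python
def run_with_address_triggers(instruction_addresses, triggers):
    """Follow main-program instruction addresses and fire monitored actions.

    instruction_addresses: main-program addresses in fetch order
        (a hypothetical trace).
    triggers: dict mapping a monitored instruction address to a label
        for the action the monitoring thread takes (assumption).
    """
    fired = []
    for step, next_address in enumerate(instruction_addresses):
        # The only synchronization work is comparing the next instruction
        # address with the monitored addresses; no store is issued and no
        # shared memory location is read.
        action = triggers.get(next_address)
        if action is not None:
            fired.append((step, action))
    return fired
```

With the trace [0x100, 0x104, 0x108, 0x10C, 0x104] and the trigger table {0x104: "resume_thread_b"}, the function returns [(1, "resume_thread_b"), (4, "resume_thread_b")]: the action fires each time the main program reaches the monitored address.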
A meta-program-based computing system may be developed using either an in-order processor or an out-of-order processor. An out-of-order processor, such as the IBM POWER4™ processor, tries to execute instructions from a thread in an order different from the order in which they appear in the program. In an out-of-order processor with simultaneous multithreading (SMT), such as an IBM POWER5™ processor, instructions are executed from multiple threads out of order.
Several conventional techniques have been developed to ensure that multithreaded programs execute correctly on such out-of-order processors. All of the conventional techniques used by these contemporary processors describe different ways either to delay committing the results of the oldest instruction until an acknowledge signal arrives from the synchronization point in the memory hierarchy, or to enforce strictly serialized execution for certain instructions.
In a meta-program-based system, since shared memory is not used for synchronization, there is no need to wait for such a signal. Instead, to implement the correct thread synchronization and communication operations as specified by the threads, new techniques are needed.
Furthermore, out-of-order processors often fetch instructions speculatively from predicted execution paths. Since a meta-program follows the execution of the main program (a different thread), the speculatively fetched instructions of the main program may cause the meta-program fetch (MP-fetch) stage to speculatively fetch meta-program instructions, if any, along the speculative execution path of the main program.
Clearly, an out-of-order execution engine is needed that can allow the speculative execution of both meta-program and main program instructions and discard the speculative execution results of both threads whenever the speculation turns out to be wrong. While the implementation of such an instruction-level light-weight synchronization execution model needed for meta-program-based computing on an in-order processor is straightforward, as exemplified by the description in the above co-pending application, an efficient implementation of an out-of-order processing engine that would not degrade its performance, while executing programs concurrently with meta-programs, is not obvious.
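The requirement stated above, that speculative results of both threads be discarded together on a wrong prediction, can be illustrated with a toy model. This sketch only restates the requirement; it is not the execution engine of the invention, which this section does not describe. The names InFlight, branch_tag and resolve_branch are hypothetical, and nested speculation is deliberately ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InFlight:
    thread: str                # "main" or "meta"
    address: int               # instruction address
    branch_tag: Optional[int]  # predicted branch this entry depends on, or None


def resolve_branch(in_flight, branch_tag, predicted_correctly):
    """Return the in-flight entries that survive resolution of one branch."""
    if predicted_correctly:
        # Speculation confirmed: dependent entries of both threads
        # become non-speculative.
        return [
            InFlight(e.thread, e.address, None) if e.branch_tag == branch_tag else e
            for e in in_flight
        ]
    # Misprediction: discard every entry fetched under this branch,
    # whether it belongs to the main program or to the meta-program.
    return [e for e in in_flight if e.branch_tag != branch_tag]
```

For instance, if a main-program instruction and the meta-program instruction it triggered both carry branch_tag 7, resolving branch 7 as mispredicted removes both, while an older main-program instruction with no tag is kept.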
A naive solution to this problem would be to “halt” instruction fetch (possibly by steering the instructions into a side buffer, as done in some high-frequency processor pipeline designs) until the synchronization condition or the outcome of the prediction is resolved. However, such a solution not only requires additional hardware structures but also degrades performance by inhibiting concurrent, speculative and out-of-order execution opportunities for both meta-program and main-program instructions.
Therefore, a need exists for a method to efficiently synchronize multiple threads in an out-of-order processor, particularly one that implements speculative execution, without degrading performance.
The current invention provides a better solution to this problem via a new method and apparatus for implementing a meta-program execution engine for an out-of-order processor.