A xe2x80x9cthreadxe2x80x9d is a stream of instructions being executed by a processor. Software that is multithreaded has multiple threads of control that cooperate to perform a task.
A simultaneous multithreaded (SMT) central processor unit (CPU) provides, on a single CPU, the capability of executing instructions from multiple threads simultaneously.
On a simultaneously-multithreaded processor, the hardware provides facilities for executing multiple threads as if each thread were executing on its own CPU. This abstract thread processor is called a thread processing unit, or TPU. To the outside world, a TPU has all the capabilities of a conventional CPU. It holds a full process context while a process or thread is executing on that TPU. The term xe2x80x9cprocessorxe2x80x9d is used herein to refer either to a TPU or a conventional CPU.
For example, a 4-way issue CPU might have two functional units executing instructions from one thread, while the other two functional units are executing instructions from an unrelated thread. This is accomplished by providing enough registers and other process-specific resources on the CPU to support as many threads as can run simultaneously, and then choosing among the threads to determine which specific instructions will be executed. The threads may be related, where they are cooperatively doing work, or they may be entirely unrelated.
FIG. 1 compares sample execution sequences for superscalar, multithreading, and simultaneous multithreading architectures. Each row represents the issue slots for a single execution cycle: a filled box indicates that the processor found an instruction to execute in that issue slot on that cycle. An empty box denotes an unused slot. The unused slots can be characterized as horizontal or vertical waste. Horizontal waste occurs when some, but not all, of the issue slots in a cycle can be used. It typically occurs because of poor instruction-level parallelism. Vertical waste occurs when a cycle goes completely unused. This can be caused by a long latency instruction, such as a memory access, that inhibits further instruction issue.
Sequence (a) 2 corresponds to a conventional superscalar. As in all superscalars, it is executing a single program, or thread, from which it attempts to find multiple instructions to issue each cycle. When it cannot, the issue slots go unused, and both horizontal 3A and vertical waste 3B are incurred.
Sequence (b) 4 corresponds to a multithreaded architecture. Multithreaded processors contain hardware state, i.e., a program counter and registers, for several threads. On any given cycle, a processor executes instructions from just one of the threads. On the next cycle, it switches to a different thread context and executes instructions from the new thread. The primary advantage of multithreaded processors is that they better tolerate long-latency operations, effectively eliminating vertical waste. However, they cannot remove horizontal waste. Consequently, as instruction issue width continues to increase, multithreaded architectures will ultimately suffer the same fate as superscalars: they will be limited by the instruction-level parallelism in a single thread.
Sequence (c) 6 corresponds to a simultaneous mutlithreaded archictecture and shows how each cycle in an SMT processor selects instructions for execution from all threads. It exploits instruction-level parallelism by selecting instructions from any thread that can potentially issue. The processor then dynamically schedules machine resources among the instructions, providing the greatest chance for the highest hardware utilization. If one thread has high instruction-level parallelism, that parallelism can be satisfied; if multiple threads each have low instruction-level parallelism, they can be executed simultaneously to compensate. In this way, SMT can recover issue slots lost to both horizontal and vertical waste.
Simultaneous multithreading is advantageous because it allows the CPU to get better throughput. Resources which would lie idle due to limited parallelism in one thread can be utilized by other threads.
A software program can be compiled, or decomposed, into multiple threads, with the purpose of achieving improved performance through parallel execution of those threads. The threads may be executed on different processors in a multiprocessor, or they may be executed on different thread processing units within an SMT CPU.
When programs are multithreaded in this way, locking protocols are used to control access to shared data. Assigning a special memory location, called a lock, to a section of data, controls access to that section of data. A thread can only update the data when it owns the lock.
An integral part of many locking protocols is a busy wait loop, often referred to as a xe2x80x9cspin lock.xe2x80x9d In a spin lock, a process loops, looking at a particular memory location, i.e., the lock, and waiting for it to change to a specific value before proceeding. Once the value has changed, the process is then free to attempt to obtain the lock via an atomic update of the location.
In a conventional multiprocessor, the CPU resources and memory bandwidth consumed by a task in a spin lock are not simultaneously shared with any other tasks. Thus, while the task is spinning there is no resource contention within the CPU, and no reason not to let the task spin. Various studies have shown that approximately 15% of processor time is spent in spin loops.
In a simultaneous multithreaded CPU, however, the resources consumed by the spinning task are being denied to the other threads that are or could be doing useful work. In fact, Applicants have found that under these circumstances, the simultaneously multithreaded CPU provided no performance increase to the decomposed application, and can actually degrade the performance of the application.
One multithreaded computer uses fine-grained multithreading, which is different from SMT, and addresses the synchronization problem with a hardware retry which traps the thread after some number of failures and deschedules it. This is described in xe2x80x9cExploiting Heterogeneous Parallelism on a Multithreaded Multiprocessor,xe2x80x9d 1992, which can be found at the tera.com web site archives (i.e., archives/library/psdocs.html).
Patent application Ser. No. 08/775,553 by Emer et al, xe2x80x9cA Multi-threaded Processor and Method That Selects Threads Based On An Attribute,xe2x80x9d (name amended in February, 1999), filed Dec. 31, 1996, assigned to a common assignee as the present invention and incorporated by reference herein in its entirety, describes an SMT architecture.
Many papers have been published about Simultaneous Multithreading. For a fairly complete list, see the website cs.washington.edu/research/smt/. The University of Washington has done much work on efficient synchronization on SMT. See, for example, xe2x80x9cSupporting Fine-Grained Synchronization on a Simultaneous Multithreading Processor,xe2x80x9d 1995, available at the website cs.washington.edu/research/smt/papers/hpca.ps. A longer version of the paper, UCSD CSE Technical Report #CS98-587, is available at the website cs.washington edu/research/smt/papers/smt.synch.ps
These papers propose a synchronization xe2x80x9clock-boxxe2x80x9d mechanism which has the primary goal of providing faster synchronization between threads. The lock itself is memory-based, but once the lock is obtained by a thread on a particular CPU, the lock-box passes the lock among the threads on that CPU, if they require it. If a thread fails to acquire a lock and must wait for it to become available, the thread""s instructions are flushed from the pipeline to prevent that thread consuming resources on the CPU. The mechanisms of synchronizing and possibly flushing the instructions are combined into one xe2x80x9cAcquirexe2x80x9d instruction, and all actions required by that instruction are carried out strictly in hardware.
U.S. Pat. No. 5,524,247 is a software patent on scheduling to avoid locks. It does not involve hardware and it is not related to SMT architecture.
The present invention resolves the problem of spin-lock in an SMT architecture by halting, or xe2x80x9cquiescing,xe2x80x9d spin-locking threads while they are waiting for some event, i.e., the availability of a lock.
In accordance with an embodiment of the present invention, a method for halting execution of a program""s instructions while the program is waiting for one or more events to occur in a simultaneous multithreaded processor or multiprocessor environment includes arming an event monitor associated with the program by identifying one or more events to be monitored. Each thread preferably has its own event monitor.
An event may be, for example, a modification to some identified memory location or group of locations, such as a change of access state or a change of value stored in the location. A change of access state may be, for example, from shared to exclusive. Such an event is typically caused by another program.
A change of value can be observed by monitoring a memory bus. In one embodiment, a write to the identified memory location, rather than observing a change in actual value stored therein, is sufficient to recognize the event.
The expiration of a timer is another example of an event that may be monitored.
Preferably, the arming of an event monitor is performed by executing an arm instruction which identifies the memory location to be monitored. The physical address of the memory location is recorded in a working register associated with the program, and an indicator such as a flag is set to a first state which enables the event monitor to monitor for the event. The indicator is set to a second state if a change to the memory location whose address is recorded in the working register is observed by the event monitor. Preferably, a lock value is loaded from the identified memory location by the same arm instruction.
The method further includes requesting, by executing a quiesce instruction after executing the arm instruction, that the program be halted until the event is observed by the event monitor. There is no requirement that the program be halted. However, if execution of the program is halted, the event monitor monitors for the event. Subsequent to observation of the event by the event monitor, but not necessarily immediately after, execution of the program is resumed.
Preferably, it is the responsibility of the program to check whether the event has occurred when the program resumes execution from the quiescent state. Thus, to ease implementation, hardware is permitted to release a thread from its quiescent state occasionally even if the event has not occurred.
After requesting that the program be halted, i.e., the indicator has been set to the first state, its execution is halted, if at all, only if the event has not yet occurred since the arming. If the indicator is set to the second state, the program is not halted in response to the request to halt.
Preferably, upon halting execution of the program, program instructions subsequent to the quiesce instruction are flushed from the instruction pipeline.
To allow for a quick restart when execution of the quiesced program or thread resumes, while the program is halted, program instructions are fetched into an instruction buffer and allowed to propagate into the instruction pipeline. The instruction buffer could be managed in various ways. For example, the percentage or absolute number of the thread""s instructions allowed into the buffer could be limited, or different instruction buffers could be allocated to different threads.
Preferably, upon halting execution of the program, a timer associated with the program is set to time a predetermined time interval, and started. If program execution has not been resumed for other reasons, for example if the monitored event has not yet occurred, program execution is resumed upon expiration of the time. Preferably, the timer is stopped if execution of the program is resumed due to observation of the event by the event monitor.
Halting execution of, or quiescing, the program results in a reduction of power consumption, and allows other executing programs to utilize available resources.