1. Technical Field
This invention relates to the field of microprocessors and, in particular, to systems and methods for selecting instructions for execution in simultaneous multi-threaded processors.
2. Background Art
The operating system (OS) regulates access to a computer's central processing unit ("processor") by different programs running on the computer. Most OSs employ priority-based scheduling algorithms for this purpose. Priorities are typically assigned to programs according to the importance and/or urgency of the functions they perform on behalf of the computing system. The OS uses these priorities to determine when and for how long a program or a unit of executable code within the program (hereafter, "thread") is granted access to the processor. Generally, priority-based scheduling allocates processor time to optimize the computer system's performance by, for example, minimizing response time to user input, maximizing throughput, and/or guaranteeing predictable (deterministic) execution times for application programs.
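As an illustration of the priority-based scheduling described above, the following Python sketch models a run queue that always grants the processor to the highest-priority runnable thread. The class name, the convention that a larger number means higher priority, and FIFO tie-breaking within a priority level are assumptions for illustration; the sketch also ignores time slices, so it covers only the "when", not the "for how long", of OS scheduling.

```python
from __future__ import annotations

import heapq
import itertools


class PriorityRunQueue:
    """Illustrative OS run queue: highest priority first, FIFO within a priority level."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._arrival = itertools.count()  # tie-breaker that preserves arrival order

    def make_runnable(self, thread: str, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._arrival), thread))

    def schedule(self) -> str | None:
        """Return the thread granted the processor next, or None if nothing is runnable."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

For example, after `make_runnable("batch", 1)`, `make_runnable("ui", 10)` and `make_runnable("logger", 1)`, successive calls to `schedule()` return `"ui"`, `"batch"`, `"logger"`, and then `None`.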
Once the OS schedules a thread for execution, most processors simply execute instructions in the thread as rapidly as possible. Execution proceeds until all instructions in the thread have been executed or the OS suspends the thread to execute instructions from another thread. Different processor architectures employ different strategies to speed thread execution, including executing multiple instructions on each cycle of the processor clock. For example, wide-issue superscalar processors are designed to identify sequential instructions within a thread that can be executed in parallel, i.e. simultaneously. Symmetric multi-processor (SMP) systems include multiple processors, each of which executes instructions from a thread assigned to it by the OS according to its priority scheme. Provided the instructions from different threads do not interfere with each other, they can be executed in parallel. The individual processors of an SMP system may or may not be wide-issue superscalar processors.
As noted above, OSs periodically suspend executing threads and replace them with different threads in response to I/O, user, or system input. In most processor architectures, switching to a different thread requires saving the processor state produced by the last instruction executed in the current thread and replacing it with the processor state produced by the last instruction executed in the different thread. The processor state, which is also known as the hardware context of a thread, includes thread-specific data, instructions, and status information that is updated on each clock cycle.
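The save-and-restore step of a conventional context switch can be sketched as follows. This is a minimal Python model under stated assumptions: `HardwareContext`, `SingleContextProcessor`, and the particular fields (program counter, eight registers, a status word) are hypothetical simplifications of real processor state.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HardwareContext:
    """Hypothetical, greatly simplified per-thread processor state."""
    pc: int = 0                                                    # program counter
    registers: list[int] = field(default_factory=lambda: [0] * 8)  # general-purpose registers
    status: int = 0                                                # status/flags word

    def copy(self) -> HardwareContext:
        return HardwareContext(self.pc, list(self.registers), self.status)


class SingleContextProcessor:
    """A conventional processor with exactly one live hardware context."""

    def __init__(self) -> None:
        self.live = HardwareContext()

    def context_switch(self, saved: dict[str, HardwareContext], current: str, nxt: str) -> None:
        # Save the state produced by the last instruction executed in the current thread.
        saved[current] = self.live.copy()
        # Replace it with the state produced by the last instruction executed in the next thread.
        self.live = saved[nxt].copy()
```

Every switch copies the full context out and back in; this copying is the overhead that designs keeping several contexts resident in hardware aim to reduce.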
Thread context switches are often used to hide latencies, such as slow memory accesses, in an executing thread. That is, a new thread is given control of the processor while data is retrieved from memory for the previously executing thread. However, context switches can be time-consuming in their own right. Fine-grained multi-threaded (MT) processors are designed to speed switching between different threads and their associated contexts.
A common feature of the above-described processor architectures is that each processor executes one thread at a time. Since these threads are scheduled onto the processor by the OS, the priority-based scheduling of the OS is preserved.
Simultaneous multithreading (SMT) processors allow threads from multiple hardware contexts to execute simultaneously on a single processor. The OS schedules multiple threads onto an SMT processor, and on each clock cycle, the SMT processor selects instructions for execution from among the scheduled threads. For example, an 8-issue SMT processor, i.e. a processor capable of issuing up to 8 instructions per clock cycle, has 8 instruction slots that can be filled on each clock cycle. The SMT processor selects these instructions from different threads scheduled by the OS. Selection is made using a variety of heuristics to identify the best, e.g. most efficient, instructions for processing. The potential advantages of SMT architectures are discussed, for example, in Lo et al., "Converting Thread-Level Parallelism To Instruction-Level Parallelism Via Simultaneous Multithreading", available at www.cs.washington.edu/research/smt/index.html#publications.
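To make the per-cycle slot filling concrete, the Python sketch below fills the 8 issue slots of one cycle from several scheduled threads. It assumes each thread's queue holds instructions that are already ready and independent, and it uses plain round-robin as a hypothetical stand-in for the selection heuristics mentioned above; `fill_issue_slots` and `ISSUE_WIDTH` are illustrative names.

```python
from __future__ import annotations

from collections import deque

ISSUE_WIDTH = 8  # an 8-issue SMT processor has 8 instruction slots per cycle


def fill_issue_slots(ready: dict[str, deque[str]], width: int = ISSUE_WIDTH) -> list[tuple[str, str]]:
    """Fill up to `width` slots in one cycle, drawing instructions from several threads.

    Round-robin across threads is a hypothetical heuristic chosen for simplicity.
    """
    slots: list[tuple[str, str]] = []
    threads = [t for t in ready if ready[t]]  # threads with at least one ready instruction
    while len(slots) < width and threads:
        for t in list(threads):  # iterate over a copy so exhausted threads can be removed
            if len(slots) == width:
                break
            slots.append((t, ready[t].popleft()))
            if not ready[t]:
                threads.remove(t)
    return slots
```

With thread A holding five ready instructions, B two, and C three, one cycle issues A, B, C, A, B, C, A, C, leaving two of A's instructions for the next cycle.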
With an SMT processor architecture, OSs can concurrently schedule threads that have different priorities onto the same processor. Because SMT processors select instructions from among threads with different priorities, they can have a substantial impact on the rate at which a particular thread executes. In general, the heuristics employed to select instructions from among the different scheduled threads are designed to maximize the total instruction throughput of the processor. There is no guarantee that these heuristics preserve the priority-based scheduling implemented by the OS. In fact, the heuristics may actually subvert the thread priorities that the OS attempts to enforce. These heuristics are discussed, for example, in Tullsen et al., "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor", Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, Pa., May 1996.
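One heuristic evaluated by Tullsen et al., ICOUNT, favors the threads with the fewest instructions in the front end and instruction queues. The Python sketch below is a simplified ICOUNT-style ordering in which a single per-thread in-flight count stands in for the stage-by-stage counts; the thread names and numbers are hypothetical. Because OS priority never enters the decision, the low-priority thread is favored here.

```python
from __future__ import annotations


def icount_order(in_flight: dict[str, int]) -> list[str]:
    """ICOUNT-style ordering: threads with the fewest in-flight instructions come first.

    OS priority is deliberately absent from the decision.
    """
    return sorted(in_flight, key=in_flight.__getitem__)


os_priority = {"editor": 10, "batch": 1}  # known to the OS, ignored by the heuristic
in_flight = {"editor": 12, "batch": 3}
order = icount_order(in_flight)  # ["batch", "editor"]: the low-priority thread is favored
```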
One attempt to reflect the OS priorities of threads in SMT processors simply assigns priorities to different thread contexts, e.g. the registers and memory locations used to track processor states for a thread, and assigns the highest-priority context to the thread with the highest OS priority. This static assignment strategy ensures rapid execution of the highest-priority thread, but it ignores execution dynamics that affect overall processor efficiency and reduces the opportunities for executing instructions in parallel. This strategy is also incompatible with OSs that support SMP, since it can lead to livelock. For example, suppose the thread assigned to the high-priority context spins (busy-waits) on a spin lock held by the thread executing in the low-priority context. The high-priority thread, i.e. the thread operating in the high-priority context, then prevents the low-priority thread from executing any instructions, including those needed to release the spin lock, so neither thread makes progress.
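The livelock can be reproduced in a toy single-issue model. In this Python sketch the policy names, the single-issue simplification, and the `release_after` parameter are assumptions; round-robin is included only as a contrast showing that any policy granting the low-priority thread some issue slots lets the lock be released, and it is not the policy of the invention.

```python
from __future__ import annotations


def simulate(policy: str, cycles: int, release_after: int = 5) -> tuple[int, bool]:
    """Single-issue model: 'high' spins on a spin lock held by 'low'.

    `policy` is "static" (always favor the high-priority context) or
    "round_robin" (alternate contexts). Returns (instructions executed by
    'low', whether the lock was released).
    """
    lock_held = True
    low_executed = 0
    for cycle in range(cycles):
        if not lock_held:
            break  # the high-priority thread can now acquire the lock
        # While the lock is held, both threads always have a ready instruction:
        # 'high' re-tests the lock, 'low' works toward releasing it.
        if policy == "static":
            pick = "high"
        else:
            pick = "high" if cycle % 2 == 0 else "low"
        if pick == "low":
            low_executed += 1
            if low_executed >= release_after:
                lock_held = False  # the low-priority thread releases the spin lock
        # pick == "high": one more busy-wait iteration, no forward progress
    return low_executed, not lock_held
```

Under the static policy, `simulate("static", 1000)` returns `(0, False)`: after a thousand cycles the low-priority thread has executed nothing and the lock is still held.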
The present invention is an SMT processor architecture that combines thread execution heuristics with OS priorities to provide a dynamic priority for each thread scheduled on an SMT processor. Thread execution heuristics, based on efficiency or other criteria, are adjusted by a priority-dependent scaling function, coupling the OS scheduling policy to the SMT processor's thread selection. This ensures that instructions from higher priority threads are executed as quickly as possible without significantly reducing the total instruction throughput of the processor.
In accordance with the present invention, an instruction from a thread that is characterized by a priority is selected for processing by monitoring an indication of the execution state of the thread, adjusting the indication according to a scaling factor determined from the thread priority, and selecting the instruction for processing according to the priority-adjusted indication.
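The three steps (monitor an execution-state indication, scale it by a priority-derived factor, select by the adjusted value) can be sketched in Python as below. This is a minimal illustration, not the claimed implementation: it assumes the indication is an in-flight instruction count, that a larger priority number means a more important thread, that lower adjusted values are preferred (as in ICOUNT), and it uses the hypothetical scaling factor 1 / (1 + priority).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ThreadState:
    name: str
    priority: int   # OS priority; assumed here: larger number = more important
    in_flight: int  # monitored execution-state indication, e.g. instructions in the pipeline


def scaling_factor(priority: int) -> float:
    """Hypothetical priority-dependent scaling: higher priority shrinks the indication."""
    return 1.0 / (1 + priority)


def adjusted_indication(t: ThreadState) -> float:
    return t.in_flight * scaling_factor(t.priority)


def select_thread(threads: list[ThreadState]) -> ThreadState:
    """Pick the thread whose priority-adjusted count is lowest (ICOUNT-style preference)."""
    return min(threads, key=adjusted_indication)
```

With an `editor` thread (priority 4, 12 instructions in flight, adjusted value 2.4) and a `batch` thread (priority 0, 3 in flight, adjusted value 3.0), the editor is selected even though it has more instructions in flight. If the editor's count grows to 20 (adjusted value 4.0), the batch thread is selected instead, so the choice stays dynamic rather than statically locking out the low-priority thread.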
In one embodiment of the present invention, the execution state indication is a per-thread counter that tracks events related to the efficiency of thread execution. The tracked quantities include, for example, the number of outstanding branch code instructions the thread has in the processor pipeline, the number of outstanding data cache misses for the thread, and the number of outstanding instructions the thread has in the pipeline.
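A per-thread counter bank of this kind might look like the following Python sketch. It assumes each tracked event increments its counter when it begins (branch fetched, data cache miss issued, instruction enters the pipeline) and decrements it when it completes; `Event` and `ThreadCounters` are hypothetical names.

```python
from __future__ import annotations

from collections import Counter
from enum import Enum


class Event(Enum):
    BRANCH = "outstanding branches"
    DCACHE_MISS = "outstanding data cache misses"
    INSTRUCTION = "outstanding instructions"


class ThreadCounters:
    """Hypothetical per-thread counters of outstanding (in-flight) events."""

    def __init__(self) -> None:
        self.counts: Counter[Event] = Counter()

    def begin(self, event: Event) -> None:
        self.counts[event] += 1  # e.g. branch fetched, miss issued, instruction dispatched

    def end(self, event: Event) -> None:
        if self.counts[event] == 0:
            raise ValueError(f"no {event.value} to retire")
        self.counts[event] -= 1  # e.g. branch resolved, miss serviced, instruction retired

    def indication(self, event: Event) -> int:
        """Current value used as the execution-state indication for this event type."""
        return self.counts[event]
```

Any one of these counters, or a combination of them, can then serve as the indication that is scaled by the thread's priority.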
In another embodiment of the invention, the scaling function may be a linear or log function of the thread priority or an anti-linear or anti-log function of the priority, according to whether the scheduling priority increases or decreases with increasing thread priority.
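The Python sketch below shows one reading of these scaling functions. It assumes the adjusted indication is `count * S(p)` with lower values preferred, and it interprets "anti-linear" and "anti-log" as the reciprocals of the linear and log forms; that interpretation, the constants, and the function names are assumptions. Under this reading, reciprocal forms suit a convention in which a larger priority number means a more important thread, while the plain forms suit a convention in which a smaller number means more important.

```python
from __future__ import annotations

import math
from typing import Callable

# Hypothetical scaling functions S(p) for a non-negative priority value p.
# The adjusted indication is count * S(p); lower adjusted values are preferred.


def linear(p: float, a: float = 1.0, b: float = 1.0) -> float:
    return a * p + b  # b > 0 keeps S(0) from zeroing out the count


def log_scale(p: float) -> float:
    return math.log(p + 2)  # +2 keeps the result strictly positive for p >= 0


def anti_linear(p: float, a: float = 1.0, b: float = 1.0) -> float:
    return 1.0 / linear(p, a, b)


def anti_log(p: float) -> float:
    return 1.0 / log_scale(p)


def pick_scaling(larger_number_is_more_important: bool, use_log: bool) -> Callable[[float], float]:
    if larger_number_is_more_important:
        # Shrink the counts of important threads as their priority number grows.
        return anti_log if use_log else anti_linear
    # E.g. a nice-style scale, where a small number denotes an important thread.
    return log_scale if use_log else linear
```

Keeping every S(p) strictly positive matters: a scaling function that could reach zero would pin the adjusted count of the most important thread at zero, reproducing the static-priority behavior that can lead to livelock.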