1. Field of the Invention
The present invention relates generally to increasing utilization and overall performance in multi-threading microprocessors. More particularly, the present invention relates to more effectively scheduling threads to optimize a wide in-order processor.
2. Description of the Related Art
In a conventional computer system, microprocessors run several different processes. The computer system utilizes an operating system (OS) to direct the microprocessor to run each of the processes based on priority and on whether the process is waiting on an event (e.g., a disk access or a user keypress) before it can continue. The simplest type of priority system merely directs the OS to run the programs in sequence (i.e., the last program to be run has the lowest priority). In other systems, the priority of a program may be assigned based on other factors, such as the importance of the program, how efficient it is to run the program, or both. Through priority, the OS is then able to determine the order in which programs or software threads or contexts are executed by the processor. It takes a significant amount of time, typically more than the time required to execute several hundred instructions, for the OS to switch from one running process to another running process.
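By way of illustration only, the selection rule described above (run the highest-priority process that is not waiting on an event) may be sketched as follows. The names `Process` and `pick_next`, and the convention that a larger number means a higher priority, are assumptions of this sketch and are not taken from any particular operating system.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Process:
    name: str
    priority: int   # assumption: a larger value means a higher priority
    waiting: bool   # True while blocked on an event such as a disk access


def pick_next(processes: Sequence[Process]) -> Optional[Process]:
    """Return the highest-priority process that is not waiting, or None."""
    runnable = [p for p in processes if not p.waiting]
    if not runnable:
        return None
    return max(runnable, key=lambda p: p.priority)


procs = [
    Process("editor", priority=3, waiting=True),    # blocked on a keypress
    Process("backup", priority=1, waiting=False),
    Process("compiler", priority=2, waiting=False),
]
chosen = pick_next(procs)
print(chosen.name)  # compiler
```

Although "editor" has the highest priority, it is waiting on an event, so the sketch selects "compiler", the highest-priority process that is able to run.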
Because of the overhead incurred by each process switch, the OS will only switch out a process when it knows the process will not be ready to run again for a significant amount of time. However, with the increasing speed of processors, there are events that make the process unexecutable for an amount of time that is not long enough to justify an OS-level process switch. When the program is stalled by such an event, such as a cache miss (e.g., when a long latency memory access is required), the processor experiences idle cycles for the duration of the stalling event, decreasing overall system performance. Because newer and faster processors are always being developed, the number of idle cycles experienced by processors is also increasing. Although memory access speed is also being improved, it has not increased at the same rate as microprocessor speeds; therefore, processors are spending an increasing percentage of time waiting for memory to respond.
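The arithmetic behind this observation may be sketched as follows. The memory latency, the clock rates, and the process-switch cost used here are hypothetical figures chosen only for illustration, and the function names are likewise assumptions of this sketch.

```python
def miss_cycles(latency_ns: int, clock_mhz: int) -> int:
    """Processor cycles that elapse during a memory access of fixed duration."""
    return latency_ns * clock_mhz // 1000


def worth_os_switch(stall_cycles: int, switch_cost_cycles: int) -> bool:
    """An OS-level switch only pays off if the stall outlasts the switch itself."""
    return stall_cycles > switch_cost_cycles


LATENCY_NS = 100            # hypothetical memory latency
SWITCH_COST_CYCLES = 1000   # hypothetical cost of one OS process switch

for clock_mhz in (500, 1000, 2000):
    stall = miss_cycles(LATENCY_NS, clock_mhz)
    # The same 100 ns access costs 50, 100, then 200 idle cycles as the
    # clock rate rises, yet never enough to justify an OS-level switch.
    print(clock_mhz, stall, worth_os_switch(stall, SWITCH_COST_CYCLES))
```

Under these assumed figures, the idle cycles per miss grow with the clock rate while remaining too short for the OS to hide by switching processes.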
Recent developments in processor design have allowed for multi-threading, where two or more distinct threads are able to make use of available processor resources. A Simultaneous Multi-Threading (SMT) microprocessor allows multiple threads to share and to compete for processor resources at the same time. The threads are scheduled concurrently, and therefore operations from all of the threads progress down the pipeline simultaneously. If a thread in an SMT system is stalled and waiting for memory, the other threads will continue execution, thus allowing the SMT system to continue executing useful work during a cache miss.
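The benefit described above may be illustrated with a toy model. In this sketch each thread is represented as a string in which "X" marks a cycle with an instruction ready to issue and "." marks a cycle spent stalled (for example, on a cache miss). The representation, the function name `idle_cycles`, and the assumption that the machine is wide enough for the threads never to compete are all simplifications made for illustration.

```python
def idle_cycles(*timelines: str) -> int:
    """Count cycles in which no thread has an instruction to issue."""
    length = max(len(t) for t in timelines)
    # A thread that has finished is treated as idle for the remaining cycles.
    padded = [t.ljust(length, ".") for t in timelines]
    return sum(all(t[c] == "." for t in padded) for c in range(length))


thread_a = "XXX....XXX"   # stalls for four cycles in the middle
thread_b = "X.XXXX...X"   # stalls at different points

print(idle_cycles(thread_a))            # 4
print(idle_cycles(thread_a, thread_b))  # 1
```

Running thread_a alone leaves the processor idle for four cycles. With both threads active, the processor is idle only in the single cycle in which both happen to be stalled.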
Because multiple threads are able to issue instructions during each cycle, an SMT system typically results in a dramatic increase in system throughput. However, the performance improvement is subject to certain boundary conditions. The effectiveness of SMT decreases as the number of threads increases because the underlying machine resources are limited and because of the exponential cost increase of inspecting and tracking the status of each additional thread.
A major problem with scheduling threads in an SMT system occurs when developers attempt to build an SMT system with an in-order machine rather than an out-of-order machine. As with threads in any single-threaded system, the instructions to be executed in an SMT system must be given an order of execution, determined by whether a particular instruction is dependent on another. For example, if a second instruction depends on a result from a first instruction, the processor must execute the first instruction before executing the second.
An out-of-order machine includes built-in hardware that determines whether or not instructions in a thread are dependent on the result of another instruction. If two threads are independent of each other, it is unnecessary to coordinate their scheduling of execution relative to each other. However, if an instruction is dependent upon another, then the out-of-order machine schedules the dependent instruction to be executed after the instruction on which it depends. After examining many instructions, the out-of-order machine is able to create chains of dependencies for the processor within its execution profile. Because the two threads are always independent in an SMT system, the existing hardware in the out-of-order machine may be extended to schedule the threads to execute in parallel.
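The dependency tracking described above may be sketched in software as follows. This is a simplified model and not a description of any particular hardware: each instruction is reduced to a destination register and its source registers, every instruction is assumed to take one cycle, issue width is assumed to be unlimited, and only true (read-after-write) dependencies are tracked. The name `dataflow_schedule` and the register names are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# (destination register, source registers); a hypothetical encoding
Instr = Tuple[str, Tuple[str, ...]]


def dataflow_schedule(instrs: Sequence[Instr]) -> List[int]:
    """Earliest issue cycle of each instruction, limited only by dependencies."""
    ready_at: Dict[str, int] = {}   # register -> cycle its value is available
    cycles: List[int] = []
    for dest, srcs in instrs:
        # An instruction may issue once all of its sources have been produced.
        cycle = max((ready_at.get(s, 0) for s in srcs), default=0)
        cycles.append(cycle)
        ready_at[dest] = cycle + 1  # assumption: unit latency
    return cycles


program = [
    ("r1", ()),             # independent
    ("r2", ("r1",)),        # depends on the first instruction
    ("r3", ()),             # independent
    ("r4", ("r2", "r3")),   # depends on the second and third instructions
]
print(dataflow_schedule(program))  # [0, 1, 0, 2]
```

In this sketch the third instruction issues in cycle 0, ahead of the second instruction that precedes it in program order, because nothing it needs is outstanding. The chain r1, r2, r4 is an example of a dependency chain.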
An in-order machine does not include hardware to determine instruction dependency. Instead, instructions are simply presented in memory in the same order that the compiler or program places them. Therefore, the instructions must be executed in exactly the same order in which they were placed into memory. Because in-order machines cannot determine the dependency of each instruction, an in-order machine is not able to properly reorder instructions from different threads in an SMT system. An additional in-order scheduling problem arises when the processor is not wide enough and does not have the bandwidth to execute the multiple threads in parallel.
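The effect of strict program-order issue may be sketched as follows. This is an illustrative model only: each instruction is reduced to a destination register and its source registers, every instruction is assumed to take one cycle, and issue is assumed simply to wait until the operands of the next instruction are available. How a real in-order machine detects that wait is not described in the text and is an assumption here, as are the name `in_order_schedule` and the `width` parameter.

```python
from typing import Dict, List, Sequence, Tuple

# (destination register, source registers); a hypothetical encoding
Instr = Tuple[str, Tuple[str, ...]]


def in_order_schedule(instrs: Sequence[Instr], width: int = 2) -> List[int]:
    """Issue cycle of each instruction when none may pass an earlier one."""
    ready_at: Dict[str, int] = {}   # register -> cycle its value is available
    cycles: List[int] = []
    cycle = 0
    issued_this_cycle = 0
    for dest, srcs in instrs:
        earliest = max((ready_at.get(s, 0) for s in srcs), default=0)
        if earliest > cycle:
            # Operands not ready: this and all later instructions wait.
            cycle = earliest
            issued_this_cycle = 0
        elif issued_this_cycle == width:
            # All issue slots in this cycle are taken.
            cycle += 1
            issued_this_cycle = 0
        cycles.append(cycle)
        issued_this_cycle += 1
        ready_at[dest] = cycle + 1  # assumption: unit latency
    return cycles


program = [
    ("r1", ()),             # independent
    ("r2", ("r1",)),        # depends on the first instruction
    ("r3", ()),             # independent, but queued behind the second
    ("r4", ("r2", "r3")),   # depends on the second and third instructions
]
print(in_order_schedule(program, width=2))  # [0, 1, 1, 2]
```

The third instruction depends on nothing, yet it cannot issue until cycle 1 because it is queued behind the second instruction. With `width=1` the same program yields `[0, 1, 2, 3]`, which illustrates the additional limitation of a processor that is not wide enough.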
While SMT systems are able to process more than two threads simultaneously (some developers have tried to schedule as many as eight threads at a time), each additional thread requires an increase in machine cost. For example, a large parallel logic array (PLA) may be required to coordinate and schedule all of the threads if an SMT system is complex enough. Therefore, it is often not an efficient use of processing power to execute more than two threads at the same time. Furthermore, such additional overhead is often completely unwarranted because few machines are wide enough or have the resources to support more than two active threads.
In view of the foregoing, it is desirable to have a method and apparatus that provide a system able to maximize the use of wide processor resources in an in-order machine. In particular, it is desirable to have an in-order SMT system because in-order machines are simpler than out-of-order machines, thereby conserving valuable chip space, consuming less power, and generating less heat. It is also desirable to have an in-order SMT system with minimal circuit impact.