The present application relates generally to an improved data processing apparatus and method and, more specifically, to the design of multithreaded microprocessors.
Modern microprocessors are built using complementary metal oxide semiconductor (CMOS) technology, which has heretofore obeyed Moore's law, the prediction that the number of transistors within a given chip area would double roughly every 18 months. This doubling comes from predictable and continuous improvement in lithography, which allows the mask of a CMOS chip to improve in resolution by a factor of two every 18 months. Microprocessor design has benefited greatly from this progress, which has translated over the years into improved processor performance, as smaller transistors allow faster switching, which in turn allows the processor to run at increasing frequency.
Furthermore, designers have used techniques that allow a processor to execute the instructions of a program in an order different from the one specified in the application code. This mode, called out-of-order processing, enables processors to extract more performance than is possible by exploiting frequency improvement alone. In its simplest form, the hardware examines a plurality of instructions that are about to run on the processor and executes as many of them in parallel as it can, provided it can determine that the resulting execution is equivalent to a sequential execution of the code. This enables the processor to extract instruction-level parallelism (ILP) from application code, resulting in improved performance at the expense of complexity in processor design and more power consumption.
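By way of illustration only, the dependency check described above can be sketched in software. The following Python sketch is a simplified model and not a description of any particular hardware: the `Instr` record, the register names, and the `schedule` function are hypothetical, every instruction is assumed to take one cycle, and no register renaming is modeled, so any read-after-write, write-after-write, or write-after-read hazard conservatively forces the later instruction into a later cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, several sources."""
    name: str
    dest: str
    srcs: tuple


def depends_on(later, earlier):
    """True if 'later' must not be reordered ahead of or alongside 'earlier'."""
    raw = earlier.dest in later.srcs   # later reads what earlier writes
    waw = earlier.dest == later.dest   # both write the same register
    war = later.dest in earlier.srcs   # later overwrites what earlier reads
    return raw or waw or war


def schedule(window):
    """Place each instruction in the earliest cycle that follows every
    earlier instruction it depends on; instructions sharing a cycle are
    treated as executing in parallel."""
    cycle = []
    for i, ins in enumerate(window):
        earliest = 0
        for j in range(i):
            if depends_on(ins, window[j]):
                earliest = max(earliest, cycle[j] + 1)
        cycle.append(earliest)
    groups = [[] for _ in range(max(cycle, default=-1) + 1)]
    for ins, c in zip(window, cycle):
        groups[c].append(ins.name)
    return groups


# Example window of four instructions in program order.
window = [
    Instr("i0", "r1", ("r2", "r3")),  # r1 = r2 op r3
    Instr("i1", "r4", ("r1", "r5")),  # reads r1, so it must follow i0
    Instr("i2", "r6", ("r7", "r8")),  # independent of i0 and i1
    Instr("i3", "r9", ("r4", "r6")),  # reads r4 and r6, so it follows i1 and i2
]
groups = schedule(window)
```

In this example, `i2` is independent of the instructions before it and shares the first cycle with `i0`, so the four instructions complete in three cycles rather than four while preserving the result of a sequential execution.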
The technique was refined further to include speculative execution (e.g., branch prediction, data prefetching, etc.), in which the processor speculatively executes instructions further down the stream in the hope that prior instructions in flight will not violate the equivalence to a sequential execution. If the speculative assumptions hold, the result is faster execution, as more instructions are executed per unit of time, whereas if the speculative assumptions turn out to be invalid, the results of the speculation are simply discarded. These techniques exploit the available avenues for improving the performance of a single-threaded application, at the expense of more complexity and power consumption.
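By way of illustration only, the commit-or-discard behavior of speculative execution can be sketched as follows. This Python sketch is a toy software model under stated assumptions: the function `run_branch`, the dictionary used as a register file, and the callables standing in for the two branch paths are all hypothetical and do not correspond to any specific processor mechanism.

```python
def run_branch(regs, condition, predicted_taken, taken_path, fallthrough_path):
    """Speculatively run the predicted path on a shadow copy of the state,
    then commit the shadow copy if the prediction was right, or discard it
    and run the correct path if the prediction was wrong."""
    shadow = dict(regs)  # speculative results never touch 'regs' directly
    predicted_path = taken_path if predicted_taken else fallthrough_path
    predicted_path(shadow)

    actual_taken = condition(regs)  # the branch outcome resolves later
    if actual_taken == predicted_taken:
        regs.update(shadow)  # speculation held: commit the results
        return "committed"

    # Speculation failed: the shadow copy is simply dropped, and the
    # correct path runs against the unmodified architectural state.
    correct_path = taken_path if actual_taken else fallthrough_path
    correct_path(regs)
    return "squashed"


def taken(r):
    r["y"] = r["x"] * 2


def fallthrough(r):
    r["y"] = 0


def is_positive(r):
    return r["x"] > 0


good = {"x": 5}
good_outcome = run_branch(good, is_positive, True, taken, fallthrough)

bad = {"x": 5}
bad_outcome = run_branch(bad, is_positive, False, taken, fallthrough)
```

Both runs end with the same state in `good` and `bad`, which reflects the requirement that speculation, whether it succeeds or fails, remain equivalent to a sequential execution; only the amount of wasted work differs.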
Recently, however, it has become difficult to obtain additional frequency increases from transistor miniaturization, as the heat dissipated by a transistor at higher frequency becomes so concentrated in such a small area that it cannot be removed effectively. As a result, the frequency growth of microprocessors has reached a limit, and designers have resorted to using the additional devices on the processor chip to increase the number of processor cores, compensating for the limited speed of a single core by providing more cores. Additionally, designers have resorted to increasing the number of hardware threads that run in each core, again compensating for the limited speed of a single core by providing more contexts within the core to run additional application code.
Additionally, the techniques for speculative execution and the overhead necessary to identify ILP have added to the power consumption of the processor. These techniques have become unattractive because of the limited ability to supply power to a single chip, due to the physical characteristics of the power supply connections, and the decreasing ability to remove heat concentrated in smaller and smaller devices. These limitations have driven processor designers to focus on simpler cores that run instructions in the order of the sequential code specified by the application. These cores, typically called in-order cores, are usually simple in design and consume less power, but are unable to exploit ILP. Designers have compensated for these limitations by increasing the number of threads per core and the number of cores per processor chip.
Increasing the number of cores and the number of threads in a core is beneficial for applications that show natural parallelism, such as throughput-oriented workloads (e.g., Web servers). However, legacy application code and applications that are not amenable to parallelization cannot benefit from multi-core or multithreaded processors. These applications have traditionally enjoyed improved performance by relying on the processor design to extract ILP and on frequency increases to run faster. Such features are no longer dependable due to the limits on power delivery and heat removal mentioned above, and thus single-threaded applications cannot benefit from newer processors. These newer processors are designed for low power consumption and benefit throughput-oriented applications at the expense of single-thread performance. Therefore, there is a need for a method to allow single-threaded applications to benefit from newer multi-core and multithreaded processors that have limited single-thread performance.