Microprocessor designers employ many techniques to increase microprocessor performance. Most microprocessors operate using a clock signal running at a fixed frequency. Each clock cycle, the circuits of the microprocessor perform their respective functions. According to Hennessy and Patterson (see Computer Architecture: A Quantitative Approach, 3rd Edition), the true measure of a microprocessor's performance is the time required to execute a program or collection of programs. From this perspective, the performance of a microprocessor is a function of its clock frequency, the average number of clock cycles required to execute an instruction (or, stated alternatively, the average number of instructions executed per clock cycle), and the number of instructions executed in the program or collection of programs. Semiconductor scientists and engineers continually make it possible for microprocessors to run at faster clock frequencies, chiefly by reducing transistor size, which results in faster switching times. The number of instructions executed is largely fixed by the task to be performed by the program, although it is also affected by the instruction set architecture of the microprocessor. Large performance increases have been realized through architectural and organizational techniques that improve the instructions executed per clock cycle, in particular techniques that exploit parallelism.
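This relationship, often called the iron law of processor performance, can be sketched as a small calculation; the instruction count, CPI, and clock rate below are purely illustrative assumptions:

```python
def execution_time(instruction_count, cycles_per_instruction, clock_hz):
    """Execution time = instruction count x average CPI / clock frequency."""
    return instruction_count * cycles_per_instruction / clock_hz

# Illustrative numbers: a 1-billion-instruction program on a 2 GHz processor.
t_baseline = execution_time(1e9, 1.5, 2e9)   # 0.75 seconds
# Halving the average CPI (e.g., via greater parallelism) halves the time
# at the same clock frequency.
t_parallel = execution_time(1e9, 0.75, 2e9)  # 0.375 seconds
```

The example shows why improving instructions per clock is as valuable as raising the clock frequency: both enter the execution time equally.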
One form of parallelism that has improved the clock frequency of microprocessors is pipelining, which overlaps the execution of multiple instructions within the pipeline stages of the microprocessor. In the ideal case, each clock cycle one instruction moves down the pipeline to a new stage, which performs a different function on the instruction. Thus, although each individual instruction takes multiple clock cycles to complete, the multiple cycles of the individual instructions overlap. Because the circuitry of each individual pipeline stage need only perform a small function relative to the sum of the functions required of a non-pipelined processor, the clock cycle of the pipelined processor may be shortened. The performance improvements of pipelining are realized to the extent that the instructions in the program permit it, namely to the extent that an instruction does not depend upon its predecessors in order to execute and can therefore execute in parallel with them; this property is commonly referred to as instruction-level parallelism. Another way contemporary microprocessors exploit instruction-level parallelism is by issuing multiple instructions for execution per clock cycle; such microprocessors are commonly referred to as superscalar microprocessors.
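The ideal overlap described above can be illustrated with simple cycle counts; this is a sketch with assumed instruction and stage counts, ignoring hazards and stalls:

```python
def cycles_nonpipelined(n_instructions, n_stages):
    # Each instruction occupies the entire datapath for n_stages cycles.
    return n_instructions * n_stages

def cycles_pipelined(n_instructions, n_stages):
    # The first instruction fills the pipeline in n_stages cycles; in the
    # ideal case, one instruction completes per cycle thereafter.
    return n_stages + (n_instructions - 1)

# 1000 instructions through a 5-stage pipeline:
# non-pipelined: 5000 cycles; ideally pipelined: 1004 cycles.
```

As the instruction count grows, the pipelined cycle count approaches one cycle per instruction, which is the idealized limit that dependences between instructions erode in practice.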
What has been discussed above pertains to parallelism at the individual instruction level. However, the performance improvement that may be achieved through exploitation of instruction-level parallelism is limited. Various constraints imposed by limited instruction-level parallelism, along with other performance-constraining issues, have recently renewed interest in exploiting parallelism at the level of blocks, sequences, or streams of instructions, commonly referred to as thread-level parallelism. A thread is simply a sequence, or stream, of program instructions. A multithreaded microprocessor concurrently executes multiple threads according to some scheduling policy that dictates the fetching and issuing of instructions of the various threads, such as interleaved, blocked, or simultaneous multithreading. A multithreaded microprocessor typically allows the multiple threads to share the functional units of the microprocessor (e.g., instruction fetch and decode units, caches, branch prediction units, and load/store, integer, floating-point, SIMD, and other execution units) in a concurrent fashion. However, multithreaded microprocessors include multiple sets of resources, or contexts, for storing the unique state of each thread, such as multiple program counters and general purpose register sets, to facilitate the ability to quickly switch between threads to fetch and issue instructions. In other words, because each thread context has its own program counter and general purpose register set, the multithreaded microprocessor does not have to save and restore these resources when switching between threads, thereby potentially reducing the average number of clock cycles per instruction.
One example of a performance-constraining issue addressed by multithreaded microprocessors is the relatively long latency of accesses to memory outside the microprocessor that must be performed due to a cache miss. It is common for the memory access time of a contemporary microprocessor-based computer system to be between one and two orders of magnitude greater than the cache hit access time. Instructions dependent upon data that misses in the cache stall in the pipeline waiting for the data to arrive from memory. Consequently, some or all of the pipeline stages of a single-threaded microprocessor may sit idle, performing no useful work, for many clock cycles. Multithreaded microprocessors may solve this problem by issuing instructions from other threads during the memory fetch latency, thereby enabling the pipeline stages to make forward progress performing useful work, somewhat analogously to, but at a finer level of granularity than, an operating system performing a task switch on a page fault. Other examples of performance-constraining issues addressed by multithreaded microprocessors are pipeline stalls and their accompanying idle cycles due to a data dependence; due to a long-latency instruction such as a divide instruction, floating-point instruction, or the like; or due to a limited hardware resource conflict. Again, the ability of a multithreaded microprocessor to issue instructions from independent threads to pipeline stages that would otherwise be idle may significantly reduce the time required to execute the program or collection of programs comprising the threads.
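The benefit of overlapping one thread's miss latency with other threads' work can be sketched with back-of-the-envelope utilization figures; the cycle counts below are assumptions chosen for illustration, and the model assumes ideal interleaving with no other resource conflicts:

```python
def utilization_single_thread(run_cycles, miss_latency):
    # One thread does run_cycles of useful work, then stalls for
    # miss_latency cycles waiting on memory.
    return run_cycles / (run_cycles + miss_latency)

def utilization_multithreaded(run_cycles, miss_latency, n_threads):
    # While one thread waits on memory, the other threads' run periods
    # can fill the otherwise-idle cycles, up to full utilization.
    busy = min(n_threads * run_cycles, run_cycles + miss_latency)
    return busy / (run_cycles + miss_latency)

# With 20 useful cycles between misses and a 100-cycle memory latency,
# one thread keeps the pipeline only ~17% busy; six such threads can
# keep it fully busy.
```

The crossover point (here, six threads) is simply the latency-to-run-period ratio plus one, which is why long memory latencies motivate hardware support for many thread contexts.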
The need for increased performance by microprocessors has developed in parallel with the need for reduced energy consumption by microprocessors and the systems that contain them. For example, portable devices such as laptop computers, cameras, MP3 players, and a host of others employ batteries as an energy source in order to facilitate their portability. It is desirable in these types of devices to reduce energy consumption in order to lengthen the time between battery recharging or replacement. Additionally, reduced energy consumption is needed in large data centers, which house high concentrations of server computers and network devices, in order to reduce device failures and energy costs.
A significant technique that has been employed to reduce energy consumption is commonly referred to as dynamic voltage scaling (DVS). The active power consumption of most microprocessors is the product of the collective switching capacitance (C), the switching frequency (f), and the square of the supply voltage (VDD) of the microprocessor, or P = C * f * VDD^2. Because power varies with the square of the voltage, lowering the voltage has the greatest effect on lowering the power consumption of the microprocessor. However, lowering the voltage increases the propagation delay of signals within the microprocessor. Thus, as the voltage is decreased, the frequency must also be decreased to enable the microprocessor to function properly. Reducing the frequency also reduces the power consumption; however, it also reduces the performance of the microprocessor. DVS attempts to dynamically scale down the voltage and frequency of the microprocessor during periods in which it is acceptable for the microprocessor to perform at a lower level, and to scale up the voltage and frequency during periods in which higher performance is needed.
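The roughly cubic payoff of scaling voltage and frequency together can be seen numerically; the capacitance, frequency, and voltage values below are assumptions for illustration only:

```python
def dynamic_power(capacitance, frequency, vdd):
    """Active (dynamic) power: P = C * f * Vdd^2."""
    return capacitance * frequency * vdd ** 2

p_full = dynamic_power(1e-9, 2e9, 1.2)  # full frequency, full voltage
p_half = dynamic_power(1e-9, 1e9, 0.6)  # half frequency at half voltage
# Halving both f and Vdd cuts power to one eighth: p_half / p_full == 0.125
```

Halving the frequency alone would only halve the power; it is the accompanying voltage reduction, entering as the square, that yields the additional factor of four.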
It has been noted that, with many applications, the performance required of the microprocessor may vary relatively widely and frequently. Stated alternatively, the applications may utilize the processing power of the microprocessor relatively fully for periods intermixed with periods in which they utilize the processing power relatively sparingly. The periods between significant changes in utilization may be relatively short, such as on the order of hundreds of nanoseconds. Thus, the finer the granularity at which a DVS implementation can scale the voltage and frequency, the larger the potential energy savings that may be realized; otherwise, much of the potential energy savings is lost to the coarseness of the granularity.
However, the voltage-frequency scaling granularity has historically been limited by the time required for the power supply to change the operating voltage, which has typically been on the order of hundreds of microseconds. DVS has thus far typically been implemented in software; that is, system software controls the voltage and frequency scaling. The granularity of software implementations of DVS has been commensurate with these historically large voltage changing times. However, current power supply trends, such as fast on-chip voltage converters and the notion of voltage islands, promise to reduce this time to on the order of hundreds of nanoseconds in the near future. At that point, software DVS solutions that were fast enough for the larger voltage changing times will become too slow to take advantage of the smaller ones.
Software DVS solutions are too slow for at least two reasons. First, they typically involve multiple layers, including one or more calls to the operating system, which typically involve switches into and out of a privileged execution mode, requiring large amounts of time relative to the fast voltage changing times. Second, since the DVS software consists of program instructions that must be executed by the microprocessor, the DVS software actually increases the performance demand on the microprocessor and, further, consumes processor bandwidth that could be used by the application programs running on the microprocessor. To take advantage of the fine-grained voltage switching times anticipated in the near future, it appears that software DVS solutions would have to consume an even larger percentage of the microprocessor's bandwidth than they do currently.
Therefore, what is needed is a voltage-frequency scaling scheme for a multithreaded microprocessor that is capable of taking advantage of the potential energy savings that may be achieved by fine-grained voltage-frequency changes.