The present disclosure relates to reducing power consumption in computing environments, and more particularly to a performance-aware algorithm that dynamically adapts processor operating voltage and frequency settings to achieve significant power reduction with minimal impact on workload performance.
Computing performance (e.g., processor clock frequency) continues to improve at the expense of higher power consumption. Moore's Law, first articulated in 1965, predicts that the number of transistors on a processor doubles roughly every two years. However, with each doubling in the number of transistors comes a corresponding increase in the power consumption of compute nodes. High power consumption burdens the electrical supply and increases operating costs, with negative economic and environmental consequences for society. In addition, when processor clock frequency is increased, processors tend to generate more heat, and increased heat can cause computing system reliability and productivity to deteriorate exponentially.
Power consumption by processors in a computing environment can be managed using Dynamic Voltage and Frequency Scaling (DVFS) techniques. Dynamic voltage scaling is a power management technique in computer architecture in which the voltage supplied to a component is increased or decreased depending upon circumstances. Increasing the voltage is known as overvolting; decreasing it is known as undervolting. Undervolting is done in order to conserve power, but it can lead to circuit failures if the reduced voltage is not matched with a corresponding decrease in clock frequency. Overvolting is done in order to allow the clock frequency of the processor to be increased, which in turn can improve computing performance. Dynamic frequency scaling is a related power management technique in which a processor is run at less than its maximum frequency in order to conserve power.
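The leverage that DVFS offers can be illustrated with the standard first-order model of dynamic CMOS power, P = C·V²·f, where C is switched capacitance, V is supply voltage, and f is clock frequency. The sketch below is illustrative only and is not taken from the disclosure; the capacitance, voltage, and frequency values are assumptions for the example.

```python
def dynamic_power(c_eff, voltage, frequency):
    """Approximate dynamic CMOS power: P = C * V^2 * f."""
    return c_eff * voltage ** 2 * frequency

# Undervolting with a matched frequency reduction compounds the savings:
# halving both V and f cuts dynamic power to one eighth of the original.
full = dynamic_power(1.0, 1.2, 3.0e9)    # hypothetical full setting
scaled = dynamic_power(1.0, 0.6, 1.5e9)  # half voltage, half frequency
print(scaled / full)  # 0.125
```

Because voltage enters quadratically, a frequency-voltage pair scaled down together saves far more power than frequency scaling alone, which is why DVFS adjusts the two settings jointly.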
DVFS techniques are commonly used in laptops and other mobile devices, where energy comes from a battery and is therefore limited. DVFS is also used in quiet computing settings and to decrease energy and cooling costs for lightly loaded computing machines; lower heat output, in turn, allows the system cooling fans to be throttled down or turned off, further decreasing power consumption. DVFS is likewise used to reduce heat in poorly cooled computing systems when the temperature reaches a certain level; most computing systems affected by increased heat are inadequately cooled overclocked systems. DVFS allows a processor to switch between different frequency-voltage settings at run time under the control of software. Examples of software employing DVFS techniques include PowerNow! (AMD) and SpeedStep (Intel).
However, the power-performance tradeoffs provided by DVFS techniques should be used judiciously, since a computer user is seldom willing to sacrifice performance in exchange for lower power consumption. Thus, one goal of a DVFS-based power management methodology is to create a schedule of processor clock frequency-voltage settings over time that reduces processor power consumption while minimizing performance degradation. A DVFS scheduling algorithm must determine when to adjust the current frequency-voltage setting (i.e., the scaling point) and which new frequency-voltage setting (i.e., the scaling factor) to apply. For example, a DVFS scheduling algorithm may set the scaling points at the beginning of each fixed-length time interval and determine the scaling factors by predicting the upcoming processor workload based on past history.
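An interval-based scheduler of the kind described above can be sketched as follows. This is a hypothetical, minimal illustration, not the algorithm of the disclosure: the frequency levels, the moving-average predictor, and the window size are all assumptions chosen for the example.

```python
FREQ_LEVELS = [0.25, 0.50, 0.75, 1.00]  # assumed fractions of max frequency

def predict_utilization(history, window=3):
    """Moving-average prediction of the next interval's utilization."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def choose_scaling_factor(history):
    """At each scaling point, pick the lowest frequency level whose
    capacity covers the predicted demand (the scaling factor)."""
    predicted = predict_utilization(history)
    for level in FREQ_LEVELS:
        if level >= predicted:
            return level
    return FREQ_LEVELS[-1]

# With recent utilizations of 40%, 45%, and 35%, the predictor averages
# 0.40 and the scheduler selects the 50% frequency-voltage setting.
print(choose_scaling_factor([0.40, 0.45, 0.35]))  # 0.5
```

The two decisions the text names map directly onto the sketch: the scaling point is the fixed interval boundary at which `choose_scaling_factor` is invoked, and the returned level is the scaling factor.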
Existing DVFS algorithms possess a number of drawbacks. For example, some DVFS algorithms are too pessimistic in predicting future processor workload and miss significant opportunities to exploit DVFS for maximum power savings. Many existing DVFS algorithms assume that the performance of an application scales perfectly with processor clock frequency, i.e., that the computing system's performance will be halved if the processor clock frequency is reduced by half. In reality, it is only in the worst case that execution time doubles when the clock frequency is halved. A DVFS scheduling algorithm based on such a model will therefore schedule a faster processor clock frequency and complete a task far ahead of its deadline, when a slower clock frequency could have been scheduled that still meets the performance deadline (e.g., a guaranteed transaction rate) while consuming less power.
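The pessimism of a perfect-scaling model can be made concrete with a simple two-component execution-time model in which only the core-bound portion of a task stretches with 1/f. The numbers and the frequency levels below are hypothetical and chosen for illustration.

```python
def exec_time(core_time, offcore_time, freq_fraction):
    """Core-bound work stretches with 1/f; off-core stalls do not."""
    return core_time / freq_fraction + offcore_time

def slowest_meeting_deadline(core_time, offcore_time, deadline, levels):
    """Return the lowest frequency fraction that still meets the deadline."""
    for f in sorted(levels):
        if exec_time(core_time, offcore_time, f) <= deadline:
            return f
    return max(levels)

# A task with 2 ms of core-bound work and 6 ms of off-core stalls, with a
# 12 ms deadline: a perfect-scaling model would insist on a fast clock, but
# 50% frequency yields 2/0.5 + 6 = 10 ms, comfortably inside the deadline.
print(slowest_meeting_deadline(2.0, 6.0, 12.0, [0.25, 0.5, 0.75, 1.0]))  # 0.5
```

Under the perfect-scaling assumption the whole 8 ms would be doubled to 16 ms at half frequency, so the algorithm would refuse to slow down; the split model reveals the slack.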
In addition, other existing DVFS algorithms are geared to executing High Performance Computing (HPC) applications and assume a workload having a fairly constant degree of frequency sensitivity. These HPC-driven DVFS algorithms apply curve/line fitting techniques to a single set of performance data collected over a predetermined range of allowed operating frequencies. Because only the entry for the most recent performance metric reading at a test frequency is updated, many of the remaining values within the set may be significantly outdated, erroneously reflecting the frequency-performance relationship of an earlier-executed workload or an earlier phase of the current workload. As a result, such algorithms may respond slowly to actual and frequent changes in workload frequency sensitivity.
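The staleness problem described above can be sketched with a hypothetical per-frequency performance table that, as in the algorithms criticized here, refreshes only one entry per measurement. All table values and frequencies are invented for the example.

```python
# Table of performance readings indexed by test frequency, populated while
# an earlier, frequency-sensitive workload phase was running.
perf_table = {0.25: 40.0, 0.50: 80.0, 0.75: 120.0, 1.00: 160.0}

def record_reading(table, freq, metric):
    """Only the entry for the most recent test frequency is refreshed."""
    table[freq] = metric

# The workload enters a memory-bound phase (performance ~60 regardless of
# frequency), but only the 1.00 entry is updated; a line fit over the table
# still slopes steeply, so the algorithm reacts slowly to the phase change.
record_reading(perf_table, 1.00, 60.0)
stale = [f for f, m in perf_table.items() if f < 1.00]
print(len(stale))  # 3 entries still describe the earlier phase
```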
Computer system performance can depend on whether the operations being executed by the processor are core-bound (or processor-bound) operations or non-core-bound (or non-processor-bound) operations. Core-bound operations do not have to go outside the core for their completion. When a processor is executing core-bound instructions, the rate at which the processor can complete the instructions is directly proportional to how fast the processor is clocked.
In contrast, non-core-bound operations must go outside the core for their completion. Non-core-bound operations generally refer to high-latency operations/instructions that have a stronger likelihood of inducing processor pipeline bubbles. For example, retrieving data from the L2 and L3 caches, while on-chip, can incur moderately long latencies of 8-60 cycles, and DRAM accesses can incur even longer latencies (e.g., 200+ cycles). To improve performance and avoid such bottlenecks as waiting to retrieve data or waiting for an input signal, instruction pipelining is employed in processors to allow overlapping execution of multiple instructions with the same circuitry. There are instances, however, when an instruction in the pipeline depends on the completion of a previous instruction in the pipeline.
For example, FIG. 1A shows a stack of instructions 100 that are processed at a maximum operating frequency, according to one embodiment. At row 101, the first instruction is a load operation in which a value X is loaded into register R0. The subsequent instructions at rows 102-105 do not depend on the completion of a previous instruction for their execution. However, the instruction at row 106 is a multiplication operation that requires the updated value of R0 obtained by instruction 101. If the completion of load instruction 101 is delayed (e.g., value X must be loaded from a memory that is outside the core), a load latency is created and the processor must wait until instruction 101 completes. Because the processor is operating at its maximum frequency, there are cycles that are potentially wasted (i.e., no instructions are executed during these cycles) while waiting for instruction 101 to complete. The wasted cycles present in the example are referred to as the architectural slack of the processor. In the example shown, there is an 8-cycle load latency in waiting for instruction 101 to complete when the processor is running at its maximum (100%) operating frequency.
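A simple cycle-accounting sketch of the stall in FIG. 1A may help; the issue model (one instruction per cycle, full overlap of independent work with the outstanding load) and the cycle counts are assumptions for illustration, not a model of any particular microarchitecture.

```python
LOAD_LATENCY = 8  # cycles for the load at row 101 at 100% frequency

def cycles_to_complete(num_independent, load_latency):
    """Independent instructions (rows 102-105) overlap with the load;
    the dependent multiply (row 106) must wait out any remaining
    latency -- the architectural slack."""
    overlap = min(num_independent, load_latency)
    stall = load_latency - overlap  # wasted cycles with nothing to issue
    # 1 cycle to issue the load, the independent instructions, the stall,
    # and 1 cycle for the dependent multiply.
    return 1 + num_independent + stall + 1

# With 4 independent instructions overlapping an 8-cycle load, 4 cycles of
# slack remain before the multiply can execute.
print(cycles_to_complete(4, LOAD_LATENCY))  # 10
```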
In such instances as described in FIG. 1A, lowering the operating frequency may not significantly impair performance because of the inherent load latency in completing a previous instruction. Referring now to FIG. 1B, the same instruction stack shown in FIG. 1A is processed, except that the processor is now running at 50% of its maximum operating frequency. Thus, instead of the 8-cycle load latency present when the processor runs at 100% of its maximum operating frequency, the load latency is reduced by half (i.e., a 4-cycle load latency) when the processor runs at 50% of its maximum operating frequency. Notably, halving the operating frequency does not necessarily imply a 50% reduction in the instructions completed within a given time period; such processing systems can be described as being “frequency insensitive.” In practice, main memory latency can be much larger (e.g., 200+ cycles) than what is represented in the example shown in FIGS. 1A and 1B. Thus, larger power consumption savings can be realized by taking advantage of the architectural slack in a processing system, while not significantly compromising processing performance.
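The frequency-insensitivity argument above reduces to one piece of arithmetic: the wall-clock load latency is fixed by the memory system, not by the core clock, so the same absolute latency spans half as many cycles at half the frequency. The specific clock speed and latency below are hypothetical values chosen so the full-speed case matches the 8-cycle example.

```python
def latency_in_cycles(latency_ns, freq_ghz):
    """Convert a fixed wall-clock latency into core cycles at a given clock."""
    return latency_ns * freq_ghz

FULL_GHZ = 4.0    # hypothetical maximum operating frequency
LATENCY_NS = 2.0  # fixed wall-clock load latency -> 8 cycles at full speed

print(latency_in_cycles(LATENCY_NS, FULL_GHZ))      # 8.0 cycles at 100%
print(latency_in_cycles(LATENCY_NS, FULL_GHZ / 2))  # 4.0 cycles at 50%
```

The stall still costs the same wall-clock time at either setting, but at the lower frequency fewer issue slots are wasted on it, which is precisely the architectural slack that a performance-aware DVFS algorithm can convert into power savings.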