1. Field
The present invention relates to techniques for enhancing the performance of computer systems. More specifically, the present invention relates to a method and apparatus for managing the performance of a computer system.
2. Related Art
As the power consumption of semiconductor chips has increased significantly due to technology scaling, design trends have shifted toward building multiprocessor systems-on-chip (MPSoCs). MPSoCs are able to provide higher throughput per watt and can also support thread-level parallelism, which brings opportunities to reduce power consumption and manage temperature more efficiently. Thermal hot spots and high temperature gradients are among the major challenges in MPSoC design, since they can degrade reliability, increase the load average and cooling costs, and complicate circuit design. Note that load average is one metric for evaluating system response time, and a lower load average indicates a faster system.
More specifically, thermal hot spots can increase cooling costs while potentially accelerating failure mechanisms such as electromigration, stress migration, and dielectric breakdown, which can cause permanent device failures. Increased temperatures can also affect the load average, since the effective operating speed of devices decreases with higher temperatures. For these reasons, expert policies for computer systems, such as conventional dynamic thermal management (DTM) techniques, generally focus on keeping the temperature below a critical threshold to prevent hot spots. Examples of conventional DTM techniques are clock gating, voltage/frequency scaling, thread migration, and applying proportional-integral-derivative (PID) control to maintain safe and stable temperatures. These techniques can prevent thermal hot spots but typically involve a considerable increase in the load average.
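As an illustration of the PID-based approach mentioned above, the following is a minimal sketch of how a temperature error could be turned into a reduced clock frequency. It is not taken from any particular DTM implementation: the names `PIDController` and `throttle_frequency`, the gains, the 80 degree setpoint, and the frequency limits are all hypothetical, and the sketch omits anti-windup and sensor filtering. It also shows why such control tends to raise the load average: every degree above the setpoint is paid for with a lower operating frequency.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook PID loop; gains and setpoint are illustrative assumptions."""
    kp: float
    ki: float
    kd: float
    setpoint: float          # target temperature in degrees C (hypothetical)
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, measured: float, dt: float) -> float:
        # Positive error means the core is hotter than the target.
        error = measured - self.setpoint
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def throttle_frequency(controller: PIDController, temperature_c: float,
                       f_max_mhz: float, f_min_mhz: float,
                       dt: float = 0.1) -> float:
    """Map the PID output to a clock frequency clamped to [f_min, f_max]."""
    correction = controller.update(temperature_c, dt)
    # Only slow down when the controller asks for cooling (positive output).
    freq = f_max_mhz - max(0.0, correction)
    return max(f_min_mhz, min(f_max_mhz, freq))


if __name__ == "__main__":
    pid = PIDController(kp=50.0, ki=0.0, kd=0.0, setpoint=80.0)
    for temp in (70.0, 90.0, 120.0):
        print(temp, throttle_frequency(pid, temp, 2000.0, 800.0))
```

With a proportional-only controller as configured here, a core below the setpoint runs at full frequency, while a core above it is slowed in proportion to the overshoot until the minimum frequency is reached.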
Moreover, since DTM techniques do not focus on balancing the temperature across the chip, they can create large spatial gradients in temperature. These spatial gradients can lead to an increase in the load average, accelerate logic failures, decrease the efficiency of cooling, and in some cases, cause reliability issues.
Another issue with expert policies, such as DTM or dynamic power management (DPM) methods, is that they do not prevent thermal cycling and sometimes exacerbate it. Thermal cycles (i.e., temporal fluctuations) of high magnitude and frequency can cause package fatigue and plastic deformations, and can lead to permanent failures. In addition to low-frequency power changes (i.e., system power on/off), cycles are created by workload rate changes and power management decisions. Note that thermal cycling can be especially accelerated by DPM methods that turn off cores, because cores have a significantly lower temperature in the sleep state than in the active state.
Some of the foregoing reliability challenges have been addressed by expert policies that optimize power management decisions for a given reliability constraint. Unfortunately, existing DTM methods typically cannot guarantee effectiveness for all execution periods, because the trade-off between temperature and load average can vary markedly between different types of workloads.
Many expert policies used in the computing industry today have specific optimization goals and, as such, their advantages vary in terms of saving power, achieving better temperature profiles, or decreasing the load average. For example, DPM can reduce thermal hot spots while saving power. However, when the workload arrival rate increases, typical DPM schemes significantly increase thermal cycling and cannot effectively optimize power, reliability, and load average under dynamically varying workload profiles.
Hence, what is needed is a method and apparatus for managing the performance of a computer system without the problems described above.