1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system for unified management of power, performance, and thermals in computer systems.
2. Description of Related Art
Power and thermal issues have joined performance concerns in dictating the design and management of computer systems. Increasing circuit density in semiconductor chips and component density in computing systems, coupled with increases in operating frequencies, have considerably increased component and system power consumption, as well as local power and heat densities within systems.
Increasing circuit densities are realized by decreasing transistor dimensions. This, however, causes increased variability in transistor features and decreased predictability in the realized semiconductor chips, including their power consumption and reliability characteristics.
Increasing power densities mean that more heat must be removed from a given area or volume. However, computer component and system cooling solutions have not kept pace in a cost-effective manner. The increasing mismatch between power dissipation and heat removal raises operating temperatures, which can cause intermittent and permanent circuit and component failures.
With increased component variability, wider margins are required in power distribution systems to accommodate the wider ranges of power consumption by the components. Coupled with an increase in supply requirements, this implies increased loss from inefficiencies in the power system.
To decrease power consumption, component and circuit designers employ a wide variety of techniques, such as clock gating and power gating, which cause power consumption to track activity. Thus, consumption now also varies with activity or usage. This increases the variability in consumption and consequently exacerbates the problem of predicting consumption and designing power supply and delivery systems to match anticipated consumption requirements.
Increasing component and system power consumption also raises cooling requirements, which in turn increases energy consumption. This has reached the point where data centers find it difficult to replace their existing systems with new machines without a major overhaul and/or redesign of their facilities.
System management solutions have to avoid failures due to thermal and power supply/distribution issues. Unpredictability in requirements leads to considerable over-provisioning in cooling and supply solutions to avoid failures. However, this is not cost-effective. Consequently, most modern systems employ some form of emergency actuation that engages when their statically determined margins prove inadequate. Both static margins and emergency actuations constrain performance and may prevent the system from realizing peak performance efficiency. Nevertheless, to ensure safe operation, system designers routinely trade off system performance by employing static margins, emergency actuations, or both.
The industry offers a number of solutions that address different aspects of the power/thermal problems facing computer systems. A common shortcoming of currently proposed solutions is that they target only a subset of the problems and do not take a comprehensive approach. On-chip thermal emergency throttling solutions built into many current microprocessor chips are designed to address only thermal problems detected as temperature sensor overdrive on the chip.
Demand-based switching, or load-aware energy saving, solutions detect when the application load on the processor is very low and, in such situations, adopt operating points that are lower power and more energy-efficient but also lower performance. These solutions address the need for saving energy, but cannot be relied upon for safe operation under process-induced, environmental, or workload-induced variability.
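By way of illustration only, the load-aware selection described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular product: the operating-point table, the utilization thresholds, and the names OperatingPoint and select_operating_point are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    """A hypothetical frequency/voltage pair."""
    freq_mhz: int
    volts: float


# Hypothetical table, ordered from lowest to highest performance.
OPERATING_POINTS = [
    OperatingPoint(800, 0.90),
    OperatingPoint(1600, 1.05),
    OperatingPoint(2400, 1.20),
]


def select_operating_point(utilization, low=0.30, high=0.70):
    """Pick an operating point from processor utilization in [0, 1].

    The thresholds are assumed values chosen for illustration.
    """
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if utilization < low:
        return OPERATING_POINTS[0]   # light load: save energy
    if utilization < high:
        return OPERATING_POINTS[1]   # moderate load: middle point
    return OPERATING_POINTS[-1]      # heavy load: full performance
```

Note that the decision depends on load alone. Neither temperature nor power draw is consulted, which is why such a scheme cannot by itself guarantee safe operation.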
Some microcontrollers employ on-chip ammeters and temperature sensors to detect chip-level thermal and power supply problems. These solutions react to any indication of either problem by employing a dynamic voltage and frequency scaling mechanism to lower the chip operating point, reducing or even eliminating chip-level failures from power/thermal overdrive. Alternatively, these solutions can be viewed as a way to extract higher performance, by boosting frequency, when there is sufficient headroom with respect to on-chip power/thermal constraints. While appropriate for reliable operation and increased performance at the microprocessor level, these solutions must be coupled with other solutions to provide the same benefits at the system level and to achieve energy savings even at the chip level.
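By way of illustration only, one step of such a sensor-driven scaling policy might look like the following. This is a sketch under stated assumptions rather than the behavior of any specific chip: the frequency steps, the temperature and current limits, the 10 percent headroom margin, and the name next_freq_index are all hypothetical.

```python
# Hypothetical frequency steps, lowest to highest.
FREQ_STEPS_MHZ = [800, 1200, 1600, 2000, 2400]


def next_freq_index(index, temp_c, current_a,
                    temp_limit_c=85.0, current_limit_a=60.0,
                    headroom=0.10):
    """Return the frequency-step index to use for the next interval.

    Steps down when either sensor reaches its limit, steps up when
    both readings sit below their limits by the headroom margin, and
    otherwise holds the current step. All limits are assumed values.
    """
    over_limit = temp_c >= temp_limit_c or current_a >= current_limit_a
    if over_limit:
        # Protect the chip: lower the operating point, if possible.
        return max(index - 1, 0)

    has_room = (temp_c <= temp_limit_c * (1.0 - headroom)
                and current_a <= current_limit_a * (1.0 - headroom))
    if has_room:
        # Sufficient headroom: boost frequency, if possible.
        return min(index + 1, len(FREQ_STEPS_MHZ) - 1)

    return index
```

The policy consults only chip-level sensors and always moves toward the highest safe frequency, so it neither accounts for system-level limits nor lowers the operating point to save energy when the load is light.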
In effect, existing solutions address only part of the fallout created by power and thermal issues.