Embodiments of the inventive subject matter generally relate to the field of computer systems, and, more particularly, to failure recovery of distributed control of power and thermal management.
In the field of computing (e.g., servers), power and thermal management can be an important issue because of the continually increasing demands on processing rate and cooling capacity. A key element of power management is a centralized entity that implements power and thermal management across the different devices (e.g., processors, memories, etc.) in the system. Centralized management can be challenging when sudden changes in power consumption cause devices to exceed power limits. For example, a statistically rare event can arise when all devices happen to increase their usage simultaneously. As the number of devices in a computer system (e.g., a server) that contribute to power swings increases, the processing load on the power management entity also increases.