1. Technical Field
The present invention relates in general to thermal management of electronic devices and, in particular, to managing the operation of electronic devices in response to a thermal stress condition. More particularly, the present invention relates to a system and method for co-operative thermal management of independent electronic devices housed within a common enclosure.
2. Description of the Related Art
In computing environments where resources are shared, there is always a concern that a failure of a shared resource may affect all of the devices that depend upon it. One example is the case where a failure of a cooling fan jeopardizes the operation of multiple servers in a server blade environment. Products often include multiple fans to allow for the possibility that one might fail. However, when a fan fails, the remaining fans may not be able to cool the entire configuration of server blades. It may therefore be necessary to turn off all of the devices within the enclosure to reduce the temperature, since continued operation would risk damaging all of the devices in that enclosure. Ideally, it should be possible to deal with this situation without taking all of the servers in the affected enclosure out of service. For example, it would be better to address the failure of a single fan by taking selected servers completely out of service or by gracefully degrading the performance of all servers.
In some cases, an arrangement of server blades within a chassis may include a separate service processor that handles chassis-level management functions, and the service processor may turn some devices off to reduce the thermal load. However, having a centralized service processor designated to handle the management functions introduces a single point of failure that may be catastrophic in the event that the service processor fails for any reason. With the loss of the service processor, thermal management of the server blades may cease which, in turn, may result in the shut-down of all the server blades to prevent any potential thermal stress conditions from adversely affecting the server blades. Furthermore, lower cost products may opt to eliminate this service processor and its associated management functions. In addition, circumstances may prevent the individual server blade devices from communicating with each other. For example, they may not be connected on a common network, or they may be running different sets of applications under different operating systems. This makes it difficult for the server blades to co-operate in dealing with chassis-level problems such as a fan failure.
Individual server blades may be capable of detecting an over-temperature condition and shutting themselves down when a programmed temperature threshold is exceeded. However, this can still result in all of the server blades in an enclosure powering down. Because the temperature in the enclosure changes slowly, the server blades may all sense the over-temperature condition and decide to power themselves down before the reduction in thermal load can bring the enclosure's internal temperature back down to acceptable levels.
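The cascade described above can be illustrated with a minimal simulation sketch. It is not drawn from any particular product: the first-order thermal model, the fixed lag between a load change and its effect at the sensors, the round-robin polling order, and every numeric value and name (such as simulate, steady_state, and LAG_STEPS) are assumptions chosen only for illustration.

```python
from collections import deque

# All values below are hypothetical and chosen only to illustrate the effect.
AMBIENT_C = 25.0         # assumed inlet air temperature
RISE_PER_BLADE_C = 8.0   # assumed steady-state rise per active blade after a fan failure
THRESHOLD_C = 50.0       # assumed shutdown threshold programmed into every blade
ALPHA = 0.02             # assumed fraction of the gap to steady state closed per step
LAG_STEPS = 8            # assumed delay before a load change is felt at the sensors


def steady_state(active_blades):
    """Temperature the enclosure would settle at with this many blades running."""
    return AMBIENT_C + RISE_PER_BLADE_C * active_blades


def simulate(num_blades=4, start_temp=45.0, steps=400):
    """Run independent threshold-based shutdown with no coordination between blades."""
    powered = [True] * num_blades
    # Active-blade counts for the last LAG_STEPS steps; the oldest entry
    # is the load that currently drives the sensed temperature.
    history = deque([num_blades] * LAG_STEPS, maxlen=LAG_STEPS)
    temp = start_temp
    peak = temp
    for step in range(steps):
        delayed_load = history[0]
        temp += ALPHA * (steady_state(delayed_load) - temp)
        peak = max(peak, temp)
        # One blade polls its sensor per step; it knows nothing about the others.
        blade = step % num_blades
        if powered[blade] and temp > THRESHOLD_C:
            powered[blade] = False
        history.append(sum(powered))
    return powered, peak, temp


if __name__ == "__main__":
    powered, peak, final = simulate()
    print("blades still powered:", sum(powered))
    print("steady state with one blade shed:", steady_state(3))
```

Under these assumed values, the steady-state temperature with three blades running is below the threshold, so shedding a single blade would have been sufficient in this model. Every blade nevertheless powers down, because each one polls while the sensed temperature is still above the threshold and before the effect of the first shut-down has reached the sensors.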
Accordingly, what is needed in the art is an improved method by which devices, such as server blades, can co-operate to resolve thermal problems within their shared enclosure without the need for a coordinating service processor or communication between the devices.