The use of multiple servers in computing environments has increased over the last several years. Typically, these multiple servers work in conjunction as a “server farm” or “server cluster.” These server clusters often have needs on a scale that would not ordinarily concern a user with a single computer. One such need is managing the heat generated by the server cluster and maintaining an acceptable operating temperature for each server within the server cluster.
In managing the heat generated by the server cluster, any number of temperatures may be monitored. One temperature that is usually monitored is the thermal exhaust temperature of a given server or the exhaust temperatures of multiple servers. Generally, the thermal exhaust temperature is the temperature of the air that exits from a server's exhaust. The thermal exhaust temperature may provide an indication of how hot the server is operating at any given time.
At certain times, the thermal exhaust temperature of a given server or a group of servers may exceed an acceptable operating temperature. When this occurs, several problems can arise. One problem is that the internal hardware of the given server or group of servers, such as processors, hard drives, memories, or other types of hardware, may start to fail. The hardware may fail because it is rated not to exceed a given temperature during operation.
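As an illustration of the comparison described above, the following is a minimal sketch that flags servers whose exhaust temperature exceeds an acceptable limit. The limit value, the server names, and the function name `servers_over_limit` are assumptions made for this example; actual ratings depend on the hardware involved.

```python
from typing import Dict, List

# Hypothetical limit in degrees Celsius; real ratings vary by component and vendor.
MAX_EXHAUST_C = 60.0


def servers_over_limit(exhaust_temps_c: Dict[str, float],
                       limit_c: float = MAX_EXHAUST_C) -> List[str]:
    """Return the sorted names of servers whose exhaust temperature exceeds limit_c."""
    return sorted(name for name, temp in exhaust_temps_c.items() if temp > limit_c)


# Made-up readings for three illustrative servers.
readings = {"srv-a": 48.5, "srv-b": 63.2, "srv-c": 60.0}
over = servers_over_limit(readings)
print(over)
```

In this sketch a reading exactly at the limit is not flagged, since the text speaks of temperatures that exceed the acceptable value.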
Another problem that may occur is “throttling.” Throttling is a phenomenon whereby an overheating server curtails its hardware operations to reduce heat. As a result, the throttling server typically performs worse than non-throttling servers. Moreover, when the number of servers that are throttling reaches a critical level, the entire server cluster may have reduced performance.
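The notion of a critical level of throttling servers might be sketched as follows. Expressing the critical level as a fraction of the cluster, the value 0.25, and the name `cluster_degraded` are all assumptions for illustration; the text does not specify how the critical level is defined.

```python
from typing import Iterable

# Hypothetical critical level, expressed as a fraction of servers that are throttling.
CRITICAL_FRACTION = 0.25


def cluster_degraded(throttling_flags: Iterable[bool],
                     critical_fraction: float = CRITICAL_FRACTION) -> bool:
    """Return True when the share of throttling servers reaches critical_fraction."""
    flags = list(throttling_flags)
    if not flags:
        # An empty cluster has nothing to throttle.
        return False
    return sum(flags) / len(flags) >= critical_fraction
```

For example, `cluster_degraded([True, False, False, False])` evaluates to `True` under these assumptions, because one throttling server out of four reaches the assumed 25 percent level.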
Furthermore, the throttling of one or more servers may indicate that a more serious failure is imminent. Throttling may be only the first stage in a typical failure progression. It may be followed by server failures that introduce correctable errors, and then by server failures that introduce uncorrectable errors. Server faults and, ultimately, server shutdown may be the final stages of the progression. Thus, throttling may be an indication that more failures, or more severe failures, are expected or imminent.
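The progression described above can be represented as an ordered set of stages. The following is a sketch only; the enum name `FailureStage`, the added `NORMAL` stage, and the helper `worst_stage` are hypothetical and are introduced here for illustration.

```python
from enum import IntEnum
from typing import Iterable


class FailureStage(IntEnum):
    """Stages ordered from least to most severe, following the progression in the text."""
    NORMAL = 0                # assumed baseline, not named in the text
    THROTTLING = 1
    CORRECTABLE_ERRORS = 2
    UNCORRECTABLE_ERRORS = 3
    FAULT = 4
    SHUTDOWN = 5


def worst_stage(stages: Iterable[FailureStage]) -> FailureStage:
    """Return the most severe stage observed, or NORMAL if none were observed."""
    return max(stages, default=FailureStage.NORMAL)


observed = [FailureStage.THROTTLING, FailureStage.CORRECTABLE_ERRORS]
print(worst_stage(observed).name)
```

Using an integer-backed enum lets stages be compared directly, so a monitor could treat any stage at or above `THROTTLING` as an early warning.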
Although monitoring the thermal exhaust temperature of the servers in the server cluster is a typical solution to the overheating and throttling problem, there are complexities associated with monitoring the thermal exhaust temperature. For example, the individual servers of a server cluster may have different configurations, server clusters may operate in different environments, the power requirements for server clusters may vary, and the inlet temperature (the temperature of the air that a server takes in) may vary from server to server or from server cluster to server cluster.