Computer units generate heat as a result of conducting electrical currents within their various semiconductor integrated circuits, power supply and other internal components. The amount of heat generated and the criticality of that heat vary depending on the component. A central processing unit (CPU) generates considerable heat because of its very high frequency of operation and its continual operation. A CPU usually has its own directly attached heat sink to dissipate its heat. Other components, such as memory integrated circuits (ICs), generate more moderate amounts of heat, due to the more intermittent nature of the electrical currents they conduct. Generally speaking, less intensively used integrated circuits do not require separate heat sinks but are attached in groups to circuit boards or to a main circuit board called a motherboard. Other components within the computer unit enclosure may include backplanes, data transfer buses and specific devices connected to the data buses. Devices connected to the bus may generate significant heat. The power supply, which converts conventional AC power to DC power used by the computer components, is usually a significant source of heat.
The heat generated by the internal computer components must be removed. Otherwise, the components within the computer will degrade, their performance will become unreliable, and under extreme conditions the components will self-destruct. Consequently, computer units utilize cooling systems to remove the heat. The most prevalent type of cooling system is an air cooling system. Cooling air is drawn through a vent in an enclosure for the computer unit which surrounds and encases the computer components. One or more cooling fans or blowers draw in intake cooling air and force the cooling air through the enclosure and over the internal components of the computer unit. In some circumstances, the cooling fan or blower is integrated with the power supply. The heat sink attached to the CPU may have its own dedicated cooling fan to remove the higher concentration of heat generated by the CPU. The temperature of the cooling air has an effect on cooling the computer unit. A higher air temperature reduces the cooling effect.
Because of the critical need for cooling, modern computer units include a capability for monitoring thermal conditions. Temperature monitoring capability is particularly important for high-end servers because their reliability in communicating data is directly related to the operating temperature of their critical components. The reliability of a server's internal components must be protected from harsh thermal conditions which might allow or cause those components to exceed their acceptable operating limits.
Higher performance computer servers typically employ multiple thermal sensors to assure reliable and safe operation. Critical components such as CPUs have a dedicated sensor embedded in their integrated circuitry for monitoring the die temperature during operation. At the subsystem level, such as on the motherboard, the backplane, any devices connected to an internal bus, and the power supply, on-board sensors monitor the local operating temperature of these components. Finally, at the system level, such as for the cooling air which flows through the enclosure, thermal sensors safeguard the overall system operating environment.
The conventional practice in monitoring the thermal conditions of computer units is straightforward in terms of making decisions based on the temperature signals supplied by the multiple thermal sensors. So long as the temperature indications fall within a normal operating range, the computer unit continues its operation in the normal way. However, if any one of the temperatures sensed exceeds the normal operating range, warnings are issued and/or the operation of the computer unit is shut down.
While the conventional practice is generally reliable in preventing damage to the computer unit, problems of reliability have arisen as a result of permitting each individual temperature sensor to control the continued operation of the computer unit. An intermittent or permanent sensor failure or malfunction cannot be accounted for, because the indications from each individual sensor have the capability of individually shutting down the computer unit. Individual sensor indications are not evaluated for accuracy or reliability. The chances of false decision-making are increased, with the result that the system performance is adversely affected by limiting or reducing system uptime and availability while increasing maintenance costs.
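By way of illustration only, the conventional practice described above, and its sensitivity to a single faulty sensor, might be sketched as follows. The sensor names, temperature limits and function names are hypothetical and are chosen solely for this example; they are not taken from any particular computer unit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SensorReading:
    name: str             # hypothetical sensor label
    temperature_c: float  # reported temperature, degrees Celsius
    low_c: float          # assumed lower bound of the normal operating range
    high_c: float         # assumed upper bound of the normal operating range


def out_of_range(readings: List[SensorReading]) -> List[str]:
    """Return the names of all sensors reporting outside their normal range."""
    return [r.name for r in readings
            if not (r.low_c <= r.temperature_c <= r.high_c)]


def conventional_decision(readings: List[SensorReading]) -> str:
    """Any single out-of-range sensor is enough to trigger a warning or shutdown."""
    return "warn_or_shut_down" if out_of_range(readings) else "continue"


# All sensors within their (assumed) limits: operation continues.
healthy = [
    SensorReading("cpu_die", 65.0, 0.0, 85.0),
    SensorReading("intake_air", 24.0, 5.0, 40.0),
    SensorReading("backplane", 38.0, 0.0, 60.0),
]

# Same thermal environment, but the backplane sensor has malfunctioned
# and reports an implausible value; the decision logic cannot tell.
faulty = healthy[:2] + [SensorReading("backplane", 127.0, 0.0, 60.0)]

print(conventional_decision(healthy))  # continue
print(conventional_decision(faulty))   # warn_or_shut_down
```

In this sketch the second case results in a warning or shutdown even though the CPU and intake air readings are unchanged, because no reading is evaluated for accuracy or weighed against the others.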
The degree of importance of the indications from the different sensors is not differentiated. For example, the CPU temperature and the intake cooling air temperature, when beyond limits, can create immediate and serious consequences. On the other hand, an occasional increase in temperature above the upper limits of less critical components can be more readily tolerated. Due to the unique airflow, thermal and fluid dynamic characteristics of each different computer unit and its use at different installation sites, the sensors will experience different temperatures. As a consequence, some of the sensors will be more prone to exceed normal operating ranges, while other sensors will be less prone to do so. The conventional practice does not recognize these significant differences.
These and other similar and related problems have led to system shut-downs and to the delivery of automatic support (ASUP) messages to system administrators reporting abnormal operating conditions under circumstances where the thermal operating environment was within acceptable limits. Proper system operation and availability have been needlessly and adversely affected, and the costs associated with maintenance and monitoring of the computer unit have been unnecessarily increased, among other undesirable consequences.