Modern computer systems typically comprise a number of integrated circuits and other active electronic devices. These integrated circuits are generally fabricated from a semiconductor material such as silicon and encapsulated in an integrated circuit package for attachment to a printed circuit board. It is well known in the art of integrated circuits and computer systems that the circuits' maximum possible performance may be correlated with the temperature of the device itself. The temperature of the device is driven by the ambient temperature of the air surrounding the device, the altitude of the device, airflow across the device, and self-heating of the device itself during operation. Most integrated circuits may be operated at higher speeds in a cool environment than in a hot environment. When integrated circuits are tested, often some portion of the test is performed at an elevated temperature simulating the maximum allowable temperature during operation in order to provide assurance that the circuit will work properly at its maximum speed in an environment including its maximum allowable temperature. Often, the same device will be capable of performing properly at greater speeds in environments that include temperatures lower than its maximum allowable temperature.
It is also well known in the art of integrated circuits and computer systems that these electronic devices produce heat during their normal operation. Most integrated circuits produce more heat at higher operating frequencies than they do at lower operating frequencies. In many computer systems comprising one or more integrated circuits, cooling these integrated circuits is necessary to ensure an operating environment within the allowable temperature range. Cooling may be accomplished by a variety of methods. Many computer systems include fans to move air across the integrated circuit packages. Some integrated circuit packages include heat sinks to help dissipate heat from the integrated circuit through the package and heat sink and into the air moving across the heat sink. Other integrated circuit packages, particularly for circuits dissipating large amounts of power, include channels for water or another liquid to flow through the package, removing heat from the circuit. Still other integrated circuits are cooled by immersion cooling, spray cooling, and micro-channel cooling on the actual silicon die.
In addition to the desire to control the environment within a computer system, there is a desire to control the environment surrounding the computer system since the fans in a typical computer system simply take air from the environment surrounding the computer system and move it across the electronic devices. If the air surrounding the computer system is very warm, this warm air may be all that is available to cool the computer system and because of the higher ambient temperature, the devices within the computer system may operate at a higher temperature. When large numbers of computer systems are placed in physical proximity to each other, cooling the surrounding air may become critical to ensure that the devices inside each of the computer systems are operating within their temperature specifications. Thus, many users of multiple computer systems place the computer systems together in one room or area that may be cooled sufficiently to allow operation of all of the computer systems within their temperature specifications. These special rooms are often called ‘data centers.’
Many data centers include special refrigeration equipment that cools the air within the data center to a level ensuring the proper operation of the computer systems within the data center. This special equipment is necessary since many computer systems produce large amounts of heat during operation, and without the additional refrigeration equipment, the normal building air conditioning might be unable to remove enough of this heat from the air to allow the computer systems to operate within their temperature specifications. Other facilities include liquid refrigeration equipment plumbed to the computer systems to provide liquid cooling to the devices within the computer systems.
Problems arise when portions of this refrigeration equipment break down. The cooling capacity of the refrigeration equipment may be reduced and the air within the data center may rise above the maximum temperature allowed by the computer systems. Most computer systems run at a fixed clock frequency. When the device temperature of their integrated circuits rises, the actual switching speed of the integrated circuits decreases. Since the latches or registers of these circuits are clocked at a fixed frequency, when the switching slows down too far, the latches and registers may set before their inputs arrive, causing them to store incorrect data. This incorrect data may culminate in incorrect results or may cause the computer to shut down and require a reboot.
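By way of illustration only, the timing failure described above may be sketched as a comparison between a fixed clock period and a signal path delay that grows with temperature. The linear delay model, the function names, and every numeric value below are assumptions chosen for the example and are not taken from any particular device.

```python
# Illustrative sketch only. The linear delay model and all numbers are
# hypothetical; real devices are characterized by their manufacturers.

def path_delay_ns(delay_at_ref_ns, temp_c, ref_temp_c=25.0, slowdown_per_c=0.002):
    """Assumed model: path delay grows linearly with device temperature."""
    return delay_at_ref_ns * (1.0 + slowdown_per_c * (temp_c - ref_temp_c))


def meets_timing(clock_mhz, delay_at_ref_ns, temp_c, setup_ns=0.1):
    """Return True if the register input arrives at least setup_ns before
    the next edge of the fixed-frequency clock."""
    period_ns = 1000.0 / clock_mhz
    return path_delay_ns(delay_at_ref_ns, temp_c) + setup_ns <= period_ns


if __name__ == "__main__":
    # Same circuit, same fixed 500 MHz clock, rising device temperature.
    for temp_c in (25, 85, 125):
        print(temp_c, meets_timing(clock_mhz=500, delay_at_ref_ns=1.6, temp_c=temp_c))
```

Under these assumed values the path meets timing at 25 and 85 degrees Celsius but not at 125 degrees Celsius, at which point the register would capture its input too early.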
Other data center problems may arise when the data center is not properly designed, or is used outside of its capabilities. If proper airflow is not maintained throughout the data center, some of the computer systems may have a higher ambient air temperature than other systems. When computer systems are placed in close proximity to each other, it is possible that the air intake of one machine may be very near the outflow of an adjacent machine, which may blow hot air into the air intake, causing overheating. The warmer computer systems may be more prone to failure than the cooler systems.
Some computer systems include temperature-sensing circuitry controlling fans within the system. When the temperature rises, these systems increase fan speed to better cool the electronic devices. As the temperature falls, these systems decrease fan speed to save power and reduce the noise of the system fans. However, these systems can only move a limited quantity of air over their circuits and are dependent on the outside environment for their cool air. If the outside environment is too warm, it is possible that the temperature within the computer system will continue rising beyond the cooling capability of the system fans. Once the internal temperature rises above the maximum allowable temperature, the computer system may give a warning and then shut itself down to prevent computing errors or possible damage to the system. Further, reliability may be reduced when computer systems are operated at temperatures outside of their ranges. It is well known in the art that metal migration within integrated circuits increases at elevated temperatures and over time. The longer an integrated circuit is run at an elevated temperature, the greater the chances that a physical failure of the device will occur. Thus, it is desirable to avoid operating integrated circuits at high temperatures for extended periods whenever possible.
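For illustration, the fan behavior described above may be sketched as a simple policy that maps a measured temperature to a fan duty cycle and an action. The thresholds, the duty-cycle range, and the function name are hypothetical and are chosen only to make the example concrete.

```python
# Illustrative sketch only. All thresholds and duty-cycle values are
# hypothetical, not taken from any particular system.

MIN_DUTY = 20   # percent, quiet low-power fan setting (assumed)
MAX_DUTY = 100  # percent, the most air the fans can move


def fan_policy(temp_c, fan_on_c=40.0, fan_max_c=70.0, shutdown_c=85.0):
    """Return (fan_duty_percent, action) for a measured internal temperature."""
    if temp_c >= shutdown_c:
        # Above the assumed maximum allowable temperature: shut down.
        return MAX_DUTY, "shutdown"
    if temp_c >= fan_max_c:
        # Fans are already at full speed and cannot cool any further.
        return MAX_DUTY, "warn"
    if temp_c <= fan_on_c:
        return MIN_DUTY, "ok"
    # Ramp fan speed linearly between the two thresholds.
    frac = (temp_c - fan_on_c) / (fan_max_c - fan_on_c)
    return round(MIN_DUTY + (MAX_DUTY - MIN_DUTY) * frac), "ok"


if __name__ == "__main__":
    for temp_c in (30, 55, 75, 90):
        print(temp_c, fan_policy(temp_c))
```

The sketch makes the limitation visible: once the fans reach full speed, the only remaining responses are a warning and a shutdown.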
Another problem with air-cooled computer systems is that at high elevations, the air is less dense and therefore less effective at carrying heat away from the devices. Computer systems must be designed to operate properly at high elevations even though the vast majority of users never operate their computer systems in such an environment. Thus, a computer system designed to work at 10,000 feet elevation may have the ability to perform at a higher frequency at sea level due to the better cooling capabilities of the dense air at sea level. This computer system used at sea level would then be performing below its actual capabilities, depriving the user of some portion of its performance capabilities.
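The difference in air density between sea level and 10,000 feet may be estimated with the International Standard Atmosphere model, which is introduced here as an assumption and is not part of the description above. The sketch below computes only the density ratio; it does not attempt to predict device temperature or achievable clock frequency.

```python
# Illustrative sketch only, using the International Standard Atmosphere
# (troposphere) model with its usual constants. Valid below about 11 km.

T0 = 288.15       # sea-level standard temperature, K
LAPSE = 0.0065    # temperature lapse rate, K/m
G = 9.80665       # gravitational acceleration, m/s^2
M = 0.0289644     # molar mass of dry air, kg/mol
R = 8.31446       # universal gas constant, J/(mol*K)

FEET_TO_METERS = 0.3048


def air_density_ratio(altitude_m):
    """Air density at altitude_m divided by air density at sea level."""
    exponent = G * M / (R * LAPSE) - 1.0
    return (1.0 - LAPSE * altitude_m / T0) ** exponent


if __name__ == "__main__":
    ratio = air_density_ratio(10_000 * FEET_TO_METERS)
    print(f"density at 10,000 ft relative to sea level: {ratio:.2f}")
```

Under this model the air at 10,000 feet is roughly three quarters as dense as air at sea level, so a fan moving the same volume of air moves correspondingly less mass across the devices.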
Many computer systems include extra fans to allow a margin of safety in the event of one or more of the fans failing. Also, many data centers are designed to include extra refrigeration capacity allowing an additional margin of safety in the event that one of the refrigeration units fails. However, even with these precautions, failures still occur, causing the air temperature to rise above the maximum allowed by the computer systems. In these situations, the computer servers may perform improperly or shut down and require a reboot, causing great difficulty for their users. Also, it is possible that a fan failure would result in a heat rise in one part of the system and not another.
Along with extra fans, some computer systems include extra power supplies to provide sufficient power to the system should one or more of the power supplies fail. However, these precautions are very costly and, even if used, may still not be sufficient to allow for full performance of the computer system in the event of one or more failures. For example, a system built with one extra power supply may have two power supply failures and not have sufficient current capability remaining to power the system at maximum performance.
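The power supply example above reduces to simple arithmetic, sketched below. The supply count, the per-supply current rating, and the load are hypothetical values chosen so that two supplies exactly carry the full load and a third is the spare.

```python
# Illustrative sketch only. Supply counts, ratings, and load are hypothetical.

def can_power_full_load(installed, failed, supply_amps, load_amps):
    """Return True if the supplies still working can deliver the full load."""
    working = max(installed - failed, 0)
    return working * supply_amps >= load_amps


if __name__ == "__main__":
    # A 100 A load needs two 50 A supplies; a third is installed as a spare.
    for failed in (0, 1, 2):
        ok = can_power_full_load(installed=3, failed=failed,
                                 supply_amps=50.0, load_amps=100.0)
        print(f"{failed} failed -> full performance possible: {ok}")
```

With one spare, a single failure is tolerated, but a second failure leaves only half of the required current available.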