Computer data processing chips such as CPUs (central processing units), GPUs (graphics processing units) and the like are becoming increasingly powerful. This increase in performance has been achieved by raising clock frequencies, shrinking geometries within integrated circuits, and adding logic for more features.
Current high performance data processing chips generate significant amounts of heat. For example, some state of the art CPUs dissipate more than 80 watts as heat. Since excessive temperatures can damage integrated circuits, it is common to provide active cooling systems for CPUs and similar devices. For example, it is common to attach a large heat sink to a CPU chip and to provide a fan to ensure that there is adequate cooling air flow through the heat sink at all times while the computer is operating. If the air flow is interrupted for as little as a minute or two, the CPU can be destroyed by excessive heat buildup.
The fan may be mounted directly on the CPU heat sink to push air past the fins of the heat sink. The fan may alternatively be mounted elsewhere in the computer or on the surface of the computer's case. The fan is typically mounted in such a way that its air flow is directed to the vicinity of the CPU.
Like any other device with moving mechanical parts, a cooling fan can fail. If the cooling fan fails, air flow is interrupted. As a result, heat builds up in the CPU and the CPU's temperature can rise quickly to critical levels. Many modern computers prevent destruction of the CPU in such an eventuality by providing a system for monitoring the die temperature in the CPU. If the temperature of the die increases beyond a threshold temperature, the CPU is shut down. Shutdown of the CPU typically occurs very abruptly, with no warning to software. The CPU essentially crashes. After the computer is restarted, it is necessary to return the CPU to an appropriate state and/or clean up any corrupted data resulting from the CPU crash before the computer can resume its intended role. The computer could be out of service for a significant period of time before the fan failure is detected and corrected.
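The threshold behaviour described above can be illustrated with a minimal sketch. It assumes a series of die temperature readings in degrees Celsius and a threshold value; the function name `first_trip_index`, the readings and the 100 degree threshold are hypothetical and chosen only for illustration. A real protection circuit typically acts in hardware rather than in software.

```python
from typing import Iterable, Optional


def first_trip_index(die_temps_c: Iterable[float],
                     threshold_c: float) -> Optional[int]:
    """Return the index of the first reading that exceeds the threshold.

    Models the point at which a thermal protection system would shut the
    CPU down. Returns None if no reading exceeds the threshold.
    """
    for index, temperature in enumerate(die_temps_c):
        # "Increases beyond a threshold" is read here as strictly greater.
        if temperature > threshold_c:
            return index
    return None


# Hypothetical readings taken after a fan stops: the temperature climbs
# until the (assumed) 100 degree threshold is crossed.
readings = [60.0, 75.0, 90.0, 101.0, 110.0]
trip = first_trip_index(readings, threshold_c=100.0)
```

In this sketch nothing is reported until the threshold is actually crossed, which mirrors the abrupt shutdown described above: software receives no earlier warning.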
In recent years, cooling fans have been improved such that incipient failures can be detected. Many cooling fans have voltage sensors and fan speed sensors. If the fan speed drops slowly over time, this may indicate that the fan is becoming clogged with dust and requires cleaning. An increase in the fan voltage which is not accompanied by a corresponding increase in the fan speed may indicate that the fan's bearings are starting to fail. With these improvements, it is sometimes possible to detect emerging problems before the fan fails. Computers are increasingly provided with software that monitors these sensors while the computer is operating. When such a problem is detected, it is possible to shut down the computer gracefully and replace the fan instead of waiting for the computer to crash after the fan fails. If a graceful shutdown is achieved, the computer will be out of service for a shorter interval.
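The two warning signs described above can be sketched as a simple check on logged sensor readings. This is only an illustration under stated assumptions: the `FanSample` type, the `diagnose_fan` function and all three threshold fractions are hypothetical, and the comparison of only the first and last samples is a simplification. Actual monitoring software would use whatever sensor interface and limits the fan and platform provide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FanSample:
    voltage: float  # voltage applied to the fan, in volts
    rpm: float      # measured fan speed, in revolutions per minute


def diagnose_fan(samples: List[FanSample],
                 speed_drop_fraction: float = 0.15,
                 voltage_rise_fraction: float = 0.10,
                 speed_rise_fraction: float = 0.05) -> str:
    """Compare the oldest and newest samples and report a likely problem.

    All threshold fractions are illustrative assumptions, not values
    taken from any particular fan or monitoring product.
    """
    if len(samples) < 2:
        return "insufficient data"
    first, last = samples[0], samples[-1]
    if first.voltage <= 0 or first.rpm <= 0:
        return "insufficient data"

    voltage_change = (last.voltage - first.voltage) / first.voltage
    speed_change = (last.rpm - first.rpm) / first.rpm

    # Voltage rose but speed did not follow: bearings may be failing.
    if (voltage_change >= voltage_rise_fraction
            and speed_change < speed_rise_fraction):
        return "possible bearing wear"
    # Speed fell noticeably over the logged period: fan may be clogged.
    if speed_change <= -speed_drop_fraction:
        return "possible dust clogging"
    return "ok"


# Hypothetical log: same voltage, speed down by 20 percent.
history = [FanSample(12.0, 3000.0), FanSample(12.0, 2400.0)]
status = diagnose_fan(history)
```

A result other than "ok" would be the cue for scheduling a graceful shutdown and fan replacement, as described above.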
Some computers are required to operate continuously for long periods, in so-called “24×7” operation. For example, a computer may process sales orders for an online shopping web site. If such a computer is shut down to replace a cooling fan, revenue may be lost in direct proportion to the length of time that the computer is out of service. It is highly desirable to avoid shutting down the computer altogether or at least to minimize the length of time that the computer is out of service.
As another example, modern high performance computing systems (i.e., supercomputers) typically consist of hundreds or thousands of interconnected rack-mounted computers. Such computer systems often run a compute-intensive application for hours or days across all of the computers making up the system. The application runs a program on each of the computers. The programs communicate among themselves to share intermediate results. If one computer fails, the whole application will stop executing or fail. This may result in the loss of several hours' or days' worth of results.
To satisfy the needs of 24×7 operation, high performance computing systems, and other situations with similar requirements, it is desirable to find a way to change a cooling fan without interrupting the operation of a computer and without risking destruction of the CPU due to excessive heat.