This application contains subject matter related to a co-pending application entitled "A Method and Apparatus for Clearing Obstructions from Computer System Cooling Fans" by Benjamin D. Osecky et al., which has been assigned Hewlett-Packard Docket Number 10011795-1. This application is hereby incorporated by reference, is assigned to the same assignee as the present application, and was filed on Jan. 31, 2001, which is also the date on which the present application was filed.
The present invention relates to processor cooling in highly available multiprocessor computer systems. More specifically, the present invention relates to detecting the impending failure of a processor cooling device, suspending processes executing on the processor cooled by the cooling device, resuming execution of the processes on one or more other processors, and reducing power to the affected processor.
In the art of computing, it is desirable to maximize the availability of a computer system. This is known in the art as high availability (HA) computing. Companies desiring to market HA computing systems have set very high goals. For example, Hewlett-Packard Company has announced a goal of achieving 99.999% availability for high-end server platforms. This translates to about five minutes of downtime per year.
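The "about five minutes" figure follows directly from the availability target. As a quick check, the arithmetic can be sketched as below; the function name and the choice of a 365-day year are illustrative assumptions, not part of the original text.

```python
# Assumption: a non-leap year of 365 days (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, permitted by an availability target."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # 99.999% availability leaves 0.001% of the year, roughly 5.26 minutes.
    print(round(annual_downtime_minutes(99.999), 2))
```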
The design of an HA computer encompasses many of the computer's subsystems. For example, redundant and uninterruptible power supplies, error correcting memories, disk array subsystems, and robust software are all critical to the design of an HA computer system.
One popular redundancy technique is known in the art as "N+1" redundancy. The concept behind N+1 redundancy is that if N devices are needed to operate a system, N+1 devices are provided. If one of the devices fails, the failure is detected and the failed device can be replaced before one of the other devices fails. N+1 redundancy has been used successfully to provide redundancy for power supplies, hard disk drives in disk array subsystems, as well as many other devices.
N+1 redundancy has also been used to provide redundant cooling to processors in multiprocessor computer systems. For example, if three cooling fans are required to cool the processors, a fourth fan is provided. Typically, some type of manifold or plenum is used to direct the airflow to heat sinks attached to the processors. If one of the fans fails, the failure is detected and signaled to the operator. Thereafter, the failing fan can be replaced. Note that the computer system can continue operating without interruption.
In computing systems where high availability is less important, such as desktop workstations, it has become common to use a turbo cooler fan coupled directly to each processor. One popular line of turbo cooler fans is the ArctiCooler family of turbo cooler fans, which are a product of Agilent Technologies, Inc. Of course, turbo cooler fans are also available from many other companies.
Typically, a heat sink is coupled to the processor using a thermal interface material, and the fan is mounted to the heat sink. Turbo coolers have several advantages. First, turbo coolers are very effective since the fan is mounted in close proximity to the processor and heat sink. Second, the turbo cooler is often integrated with the heat sink and processor to form a single field replaceable unit. Third, high-end turbo coolers have become highly reliable, and often have low failure rates that are comparable to that of the processor itself. Fourth, turbo coolers are volumetrically efficient because they require little space, and manifolds or plenums are not required. Finally, turbo coolers are inexpensive. As is known in the art, there are substantial competitive forces to continually lower the cost of computer systems.
While turbo coolers have proven to be an inexpensive, efficient, and reliable solution to the problem of processor cooling, the use of turbo coolers in HA computer systems has proven to be controversial because the failure of a single turbo cooler fan has the potential to bring down the whole computer system. Accordingly, what is needed in the art is a way to use turbo coolers to cool processors, while providing the level of redundancy traditionally associated with providing N+1 cooling fans.
The present invention allows a multiprocessor computer system to continue operation after the failure of a non-redundant turbo cooler fan, or other non-redundant per-processor cooling device, that is coupled to a central processing unit (CPU). The present invention is implemented via several hardware and software components. The hardware components include a cooling device monitoring and control unit which monitors one or more signals indicative of cooling device health, and the ability to deallocate a CPU.
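As an illustration of how a health signal might be monitored, the sketch below assumes the signal is a fan tachometer reading in RPM and flags an impending failure after several consecutive low readings. The class name, the 80% threshold, and the three-sample rule are hypothetical choices made for this example; the text above does not specify which signals are monitored or how they are evaluated.

```python
from dataclasses import dataclass, field


@dataclass
class FanHealthMonitor:
    """Hypothetical monitor that watches fan speed for signs of impending failure."""

    nominal_rpm: float
    warn_fraction: float = 0.8      # assumed threshold: below 80% of nominal is "low"
    samples_required: int = 3       # assumed: consecutive low readings needed to flag
    _low_count: int = field(default=0, init=False)

    def update(self, measured_rpm: float) -> bool:
        """Record one tachometer reading; return True if failure looks imminent."""
        if measured_rpm < self.nominal_rpm * self.warn_fraction:
            self._low_count += 1
        else:
            # A healthy reading resets the streak, which filters out brief dips.
            self._low_count = 0
        return self._low_count >= self.samples_required


if __name__ == "__main__":
    monitor = FanHealthMonitor(nominal_rpm=5000)
    for rpm in [5000, 3900, 3800, 4900, 3900, 3800, 3700]:
        print(rpm, monitor.update(rpm))
```

Requiring several consecutive low samples is one simple way to avoid deallocating a CPU because of a single noisy reading; a real monitoring unit could equally rely on other signals, such as temperature or fan current.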
The software components include routines that interface with the cooling device monitoring and control unit and deallocation hardware, and provide the ability to detect an impending cooling device failure. If an impending failure is detected, the software components cause all user and operating system processes to be moved from the CPU coupled to the failing cooling device to one or more other CPUs. The system state is altered so that interrupts are no longer received and processed by the affected CPU, and all memory caches associated with the affected CPU are flushed back to main memory to ensure cache coherency.
At this point, the CPU is either powered-down, or placed in a low-power mode that allows the CPU to operate without the cooling device. The system operator or other management software is then notified that the cooling device has failed. Note that the field replaceable unit may be the fan only, a fan/heat sink assembly, or a fan/heat sink/CPU assembly. After the cooling device has been replaced and is operating normally, the CPU can be powered back up, interrupts can be enabled, and the CPU can once again be available to execute user and operating system processes.
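The deallocation and recovery sequence described above can be sketched as a small simulation. Everything here is a simplified model under stated assumptions: the `Cpu` class, the least-loaded migration policy, and the `notify` callback are hypothetical, and real hardware and operating system interfaces would look quite different. The sketch models only the low-power option, although the text also allows powering the CPU down entirely.

```python
from dataclasses import dataclass, field
from enum import Enum


class CpuState(Enum):
    ONLINE = "online"
    LOW_POWER = "low_power"


@dataclass
class Cpu:
    """Hypothetical model of one CPU and the work assigned to it."""
    cpu_id: int
    processes: list = field(default_factory=list)
    interrupts_enabled: bool = True
    cache_dirty: bool = True
    state: CpuState = CpuState.ONLINE


def deallocate_cpu(cpus, failing_id, notify=print):
    """Move work off the CPU whose cooling device is failing, then power it down."""
    failing = next(c for c in cpus if c.cpu_id == failing_id)
    healthy = [c for c in cpus
               if c.cpu_id != failing_id and c.state is CpuState.ONLINE]
    if not healthy:
        raise RuntimeError("no healthy CPU available to take over the workload")

    # 1. Migrate each process to the least-loaded healthy CPU (assumed policy).
    while failing.processes:
        target = min(healthy, key=lambda c: len(c.processes))
        target.processes.append(failing.processes.pop(0))

    # 2. Stop delivering interrupts to the affected CPU.
    failing.interrupts_enabled = False

    # 3. Flush its caches to main memory to preserve cache coherency.
    failing.cache_dirty = False

    # 4. Place the CPU in a low-power mode that tolerates the lost cooling.
    failing.state = CpuState.LOW_POWER

    # 5. Tell the operator or management software about the failure.
    notify(f"cooling device for CPU {failing_id} has failed; CPU deallocated")


def reallocate_cpu(cpu):
    """Bring the CPU back once the cooling device has been replaced."""
    cpu.state = CpuState.ONLINE
    cpu.interrupts_enabled = True


if __name__ == "__main__":
    system = [Cpu(0, ["a", "b", "c"]), Cpu(1, ["d"]), Cpu(2)]
    deallocate_cpu(system, failing_id=0)
    reallocate_cpu(system[0])
```

The ordering matters in this model: work is migrated and interrupts are disabled before the caches are flushed, so that nothing can dirty the caches again before power is reduced.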
The present invention allows a multiprocessor computer system to be constructed with non-redundant, efficient, low cost, and highly effective turbo cooler fans or other types of non-redundant cooling devices, while ensuring continued operation of the computer system in the event of a failure of a cooling device. Accordingly, the present invention is ideally suited for use in low-cost high availability computer systems. In essence, the present invention moves the point of redundancy from the cooling device to the redundancy provided by other CPUs in a multiprocessor computer system.