1. Field of the Invention
The present invention relates to fault tolerance in computer systems, and more particularly to an apparatus for swapping, removing or adding processors in a computer system while the computer system continues operating.
2. Related Art
Continuous operation and high reliability are essential for some computer systems. A failure, or even a temporary cessation of operation, can have catastrophic consequences for electronic funds transfer systems or air traffic control systems, for example. To this end, fault-tolerant computing systems have been developed that allow "hot swapping" of computer system components. Hot swapping involves removing and replacing a failed computer system component while the computer system continues to operate. This potentially allows a computer system with a failed component to be repaired without shutting the computer system down.
Hot swapping is typically applied to devices that plug into a computer system's peripheral bus, such as a disk drive. This allows peripheral devices to be replaced without shutting the computer system down. However, more centrally located components, such as central processing units (CPUs), cannot be replaced in this way. This is because most computer systems are uniprocessor systems with only one central processing unit. Hence, removing the central processing unit will prevent the computer system from functioning. Furthermore, CPUs are typically deeply integrated into the motherboard, or center of a computer system, and cannot easily be removed. Additionally, CPUs are harder to initialize, and are more tightly bound into the computer system's operating system and interrupt structure than are peripheral devices, such as disk drives. Consequently, it is much harder to facilitate removal and re-insertion of a CPU in an operating computer system.
As a result, when a central processing unit fails or needs to be upgraded for additional performance, the computer system must be shut down to replace the CPU. Furthermore, restarting the computer system typically requires a lengthy rebooting process to re-initialize the operating system and other computer system components.
What is needed is a computer system that allows a CPU to be removed without shutting the computer system down.
Additionally, what is needed is a computer system that allows a CPU to be inserted and initialized while the computer system is operating.
One embodiment of the present invention provides a computer system that allows a processor module to be removed while the computer system is operating. This computer system includes a connector for connecting the processor module to the computer system. It also includes a power switch coupled between a power source and the connector, for selectively removing power from the processor module in the connector while power is maintained to other components of the computer system. The computer system additionally includes a mechanism that modifies the operating system so that the computer system will continue to function without the processor module. Thus, this embodiment of the present invention allows the processor module to be removed, replaced and reinitialized without shutting down the computer system.
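The removal sequence described above can be illustrated with a minimal software sketch. The class names (`Slot`, `System`) and the method `remove_processor` are hypothetical and are not taken from the specification; the sketch only models the ordering of the two steps, namely taking the processor out of the operating system's view and then opening the power switch for that connector alone.

```python
from dataclasses import dataclass, field


@dataclass
class Slot:
    """Hypothetical model of one processor connector and its power switch."""
    powered: bool = True


@dataclass
class System:
    """Hypothetical model of the operating system's view of the machine."""
    slots: dict = field(default_factory=dict)
    online_cpus: set = field(default_factory=set)

    def remove_processor(self, cpu_id):
        # Assumption: the system must keep at least one processor online.
        if self.online_cpus == {cpu_id}:
            raise RuntimeError("cannot remove the only online processor")
        # Step 1: modify the OS view so no work is scheduled on cpu_id.
        self.online_cpus.discard(cpu_id)
        # Step 2: open the power switch for this connector only;
        # every other slot keeps its power.
        self.slots[cpu_id].powered = False


system = System(slots={0: Slot(), 1: Slot()}, online_cpus={0, 1})
system.remove_processor(1)
```

In this sketch, power to slot 0 is left untouched, which mirrors the requirement that power be maintained to other components of the computer system.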
Another embodiment of the present invention includes a plurality of isolation buffers, for isolating electrical pathways between the processor module in the connector and the computer system.
Yet another embodiment of the present invention includes a mechanism that activates preparation of the computer system for removal of the processor module. In a variation on this embodiment, this mechanism includes a switch. In another variation, this mechanism receives a command to activate the preparation from a computer program. In yet another variation, the mechanism includes resources that detect a problem in the processor module before activating preparation of the computer system for removal of the processor module.
One embodiment of the present invention includes a mechanism that saves state from the processor module to a first location in the computer system. In a variation on this embodiment, the first location includes another processor in the computer system. In another variation, the first location includes a storage area in the computer system. In yet another variation, the computer system includes a mechanism that overwrites boot code with code that restores state from the first location in the computer system.
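As a rough illustration of saving state and restoring it through replaced boot code, the following sketch models the first location as a dictionary and the boot code as a callable. All names (`save_state`, `make_restore_boot_code`, the `pc` and `sp` register labels) are assumptions made for the example and do not come from the specification.

```python
def save_state(cpu_registers, storage):
    """Copy the departing processor's state to a storage area."""
    storage["saved_state"] = dict(cpu_registers)


def make_restore_boot_code(storage):
    """Return a callable standing in for boot code that has been
    overwritten with code that restores the saved state."""
    def boot():
        # Instead of a cold start, hand back a copy of the saved state.
        return dict(storage["saved_state"])
    return boot


storage = {}  # hypothetical "first location"
old_cpu = {"pc": 0x1000, "sp": 0x7FF0}
save_state(old_cpu, storage)

boot_vector = make_restore_boot_code(storage)  # replaces the normal boot code
new_cpu = boot_vector()  # the replacement processor starts from saved state
```

The replacement processor in this sketch resumes with the same register values that the removed processor held, which is the effect the overwritten boot code is intended to achieve.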
Another embodiment of the present invention includes a mechanism that modifies an interrupt structure in the computer system so that the processor module will not receive interrupts. In a variation on this embodiment, the interrupts are redirected to another processor in the computer system.
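The interrupt redirection described above can be sketched as a rewrite of a routing table. The table layout (interrupt number mapped to processor number) and the function name `redirect_interrupts` are hypothetical simplifications; an actual interrupt controller would be programmed through its own registers.

```python
def redirect_interrupts(routing, departing_cpu, target_cpu):
    """Return a new routing table in which every interrupt that was
    delivered to departing_cpu is delivered to target_cpu instead."""
    return {
        irq: (target_cpu if cpu == departing_cpu else cpu)
        for irq, cpu in routing.items()
    }


# Hypothetical routing table: interrupt number -> processor number.
routing = {1: 0, 4: 1, 9: 1, 14: 0}
new_routing = redirect_interrupts(routing, departing_cpu=1, target_cpu=0)
```

After the rewrite, no entry in the table names the departing processor, so it will not receive interrupts while it is being removed.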
Another embodiment of the present invention includes a mechanism that waits for a bus transaction involving the processor module to complete before preparing the computer system for removal of the processor module. Yet another embodiment includes a mechanism that waits for a computational task involving the processor module to complete before preparing the computer system for removal of the processor module.
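The waiting behavior described above amounts to draining outstanding work before removal begins. The sketch below models pending bus transactions as a queue; the function name `quiesce`, the `step` callback and the transaction labels are assumptions for illustration only.

```python
from collections import deque


def quiesce(pending, step):
    """Complete every outstanding transaction involving the processor
    module, in order, and return the results once none remain."""
    completed = []
    while pending:
        completed.append(step(pending.popleft()))
    return completed


# Hypothetical in-flight bus transactions involving the processor module.
pending = deque(["bus-read@0x10", "bus-write@0x20"])
done = quiesce(pending, step=lambda txn: txn + ":done")

# Preparation for removal starts only after the queue is empty.
ready_for_removal = not pending
```

The same pattern applies to waiting for a computational task: preparation for removal is deferred until the list of work involving the processor module is empty.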