1. Field of the Invention
The present invention relates to fault-tolerance mechanisms in computer systems. More specifically, the present invention relates to a method and an apparatus for automatically integrating a module into an operating computer system to replace a module that has failed.
2. Related Art
Hot maintained computer systems are designed to facilitate removal and replacement of broken modules while the computer system is operating. When a module fails within a redundant hot maintained computer system, a secondary module takes over for the failed module. This allows the computer system to continue operating without interruption. The failed module is subsequently removed from the computer system and a replacement module is inserted in its place. Once the replacement module has been inserted, a technician manually enters commands to integrate the module into the computer system. This integration process typically involves: making preliminary checks on the replacement module; powering on the replacement module; running functional tests on the replacement module; and loading state information into the replacement module.
The fact that the integration process requires commands to be entered manually can give rise to a number of problems. First, the technician must find the system console in order to enter integration commands. Second, the technician must remember the integration commands. If the technician forgets a command or inadvertently enters a wrong command, the technician can potentially cause the computer system to crash. Furthermore, leaving the integration process under the technician's control can invite sloppy service practice. In some situations a service technician may try to integrate a questionable failed module into the computer system on the chance that it will operate properly, instead of returning the failed module to a service depot for testing.
What is needed is a method and an apparatus for automatically integrating a replacement module into an operating computer system without requiring a technician to explicitly enter integration commands.
One embodiment of the present invention provides a system that automatically integrates a module into a computer system to replace a module that has failed. The system operates by detecting an insertion of the module into the computer system. In response to this insertion, the system reads information from the module in order to identify what type of module has been inserted into the computer system. If the newly inserted module cannot perform the functions of the prior module, the system signals an error condition. The system additionally reads information from the module in order to determine if the module has failed since it was first shipped or last repaired. This information was originally written by this or another system upon detection of a failure. If the module has failed since it was first shipped or last repaired, the system signals an error condition. Finally, if no error condition is signaled, the system integrates the module into the computer system.
In a variation on the above embodiment, this integration process involves running functional tests on the module, and loading configuration information into the module.
Thus, the present invention speeds up the reintegration process by dispensing with the need to manually enter integration commands into the computer system. This creates fewer opportunities for error because a technician is not required to memorize integration commands and will not inadvertently enter the wrong commands. The present invention also fosters proper service practice by encouraging a technician to return a failed module to a service depot, instead of simply cycling the injection switch of a module to “repair” the unit.
Note that the present invention is not limited to hot maintained or redundant computer systems. It can generally be used in any computer system with a processor that can observe insertion and removal of a module during maintenance.
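The sequence of checks described in the embodiment above can be illustrated with a minimal sketch. It is not the claimed implementation. The names `ModuleInfo`, `Outcome`, and `integrate_on_insertion` are hypothetical, the compatibility check is simplified to a comparison of type strings, and the handling of a failed functional test is an assumption that the description does not spell out.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Outcome(Enum):
    INTEGRATED = auto()
    WRONG_TYPE = auto()      # module cannot perform the prior module's functions
    PRIOR_FAILURE = auto()   # module failed since it was shipped or last repaired
    TEST_FAILED = auto()     # assumption: a failed functional test blocks integration


@dataclass
class ModuleInfo:
    """Hypothetical contents of the information read from the inserted module."""
    module_type: str
    failed_since_repair: bool  # written by this or another system when a failure was detected


def integrate_on_insertion(info, expected_type, run_functional_tests, load_configuration):
    """Run the checks in the order described and integrate only if none fails."""
    # Simplification: compatibility is modeled as an exact type match.
    if info.module_type != expected_type:
        return Outcome.WRONG_TYPE
    if info.failed_since_repair:
        return Outcome.PRIOR_FAILURE
    # Variation: integration runs functional tests, then loads configuration.
    if not run_functional_tests():
        return Outcome.TEST_FAILED
    load_configuration()
    return Outcome.INTEGRATED
```

In this sketch the two callables stand in for whatever test and configuration routines a particular system provides; no configuration is loaded unless every earlier check passes.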
In another variation, detecting the insertion of the module into the computer system includes receiving information from an electrical circuit that detects the presence of the module in the computer system. In another variation, detecting the insertion of the module into the computer system includes periodically polling the module to determine whether the module is present in the computer system.
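As a rough illustration of the polling variation, the sketch below treats insertion as a transition from "absent" to "present" across successive presence samples. The function name `detect_insertions` and the representation of samples as booleans are assumptions made for illustration only; a presence-detect circuit could feed the same transition logic one signal at a time instead of being polled.

```python
def detect_insertions(presence_samples, initially_present=False):
    """Yield the index of each sample at which a module newly appears.

    presence_samples: iterable of booleans, one per poll of the module slot
    (hypothetical representation of the polling result).
    """
    previously_present = initially_present
    for index, present in enumerate(presence_samples):
        # An insertion is an absent-to-present transition, not mere presence.
        if present and not previously_present:
            yield index
        previously_present = present
```

Tracking the previous sample ensures that a module which simply remains in its slot does not trigger integration again on every poll.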
In a further variation on the above embodiment, the system allows a human operator to bypass the automatic integration process by receiving integration commands entered manually by the human operator.