The present invention relates to an apparatus and method for fault resilient booting in a multi-processor computer system.
Multi-processor computer systems may experience problems when booting if one or more of the processors fails during a reset. A processor fails when it does not successfully execute the reset instruction; a failed processor may not respond to further instructions or may provide erroneous output. Booting involves starting the computer system, for example, by turning on the power to it. In response to the application of power, the processors in the system execute preliminary instructions at a pre-designated address in an attempt to initialize themselves and enter an operational mode in which they may execute programs or applications. If any of the processors fails during booting, the entire system may deadlock and be unable to operate. Booting may also involve a warm reset, which is a software or hardware reset of a processor that is already running or to which power is already applied.
One of the processors in a multi-processor system is typically pre-designated as a boot strap processor. The boot strap processor functions to initialize the other processors during the booting process. If the boot strap processor fails during booting, the entire system may again deadlock and be unable to operate.
Accordingly, a need exists for an improved apparatus and method for fault resilient booting of a multi-processor system.
A first method consistent with the present invention may be used to boot a computer system having a plurality of processors. The method includes performing a cold reset of the processors and determining if any of the processors failed during the cold reset. The method also includes performing a warm reset of the processors and isolating any of the processors that failed in conjunction with performing the warm reset.
A first apparatus consistent with the present invention boots a computer system having a plurality of processors. The apparatus performs a cold reset of the processors and determines if any of the processors failed during the cold reset. The apparatus also performs a warm reset of the processors and isolates any of the processors that failed in conjunction with performing the warm reset.
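The two-phase sequence described above (cold reset, failure detection, then warm reset with the failed processors isolated) can be sketched as a small simulation. This is only an illustrative model under stated assumptions, not the disclosed implementation: the `Processor` class, its `healthy` and `isolated` flags, and the function `fault_resilient_boot` are hypothetical names introduced here, and a failure is modeled simply as a reset that never completes.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Processor:
    """Hypothetical model of one processor (not taken from the source)."""
    cpu_id: int
    healthy: bool = True    # assumption: an unhealthy processor never completes a reset
    isolated: bool = False  # set once the processor is removed from the system

    def reset(self) -> bool:
        """Return True if the processor completes the reset sequence."""
        return self.healthy and not self.isolated


def fault_resilient_boot(processors: List[Processor]) -> Tuple[List[Processor], List[Processor]]:
    """Cold reset, record failures, then warm reset with the failures isolated."""
    # Phase 1: cold reset of every processor; note which ones failed.
    failed = [p for p in processors if not p.reset()]

    # Phase 2: isolate the failed processors in conjunction with the warm reset,
    # so that only the remaining processors take part in it.
    for p in failed:
        p.isolated = True
    running = [p for p in processors if not p.isolated and p.reset()]
    return running, failed


if __name__ == "__main__":
    cpus = [Processor(0), Processor(1, healthy=False), Processor(2)]
    running, failed = fault_resilient_boot(cpus)
    print("running:", [p.cpu_id for p in running])
    print("failed:", [p.cpu_id for p in failed])
```

In this sketch the system still comes up with the surviving processors instead of deadlocking, which is the behavior the summary aims for.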
A second method consistent with the present invention includes performing a cold reset of a plurality of processors within each node of a multi-processor system. The cold reset involves attempting to identify one of the processors in each plurality of processors as a node boot strap processor. The method further includes attempting to identify one of the node boot strap processors as a system boot strap processor and using the system boot strap processor to perform a warm reset of the plurality of processors in each of the nodes. In conjunction with performing the warm reset, any of the processors that failed are isolated.
A second apparatus consistent with the present invention performs a cold reset of a plurality of processors within each node of a multi-processor system. In conjunction with performing the cold reset, the apparatus attempts to identify one of the processors in each plurality of processors as a node boot strap processor. The apparatus also attempts to identify one of the node boot strap processors as a system boot strap processor and uses the system boot strap processor to perform a warm reset of the plurality of processors in each of the nodes. In conjunction with performing the warm reset, the apparatus isolates any of the processors that failed.
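The hierarchical selection described above, in which each node first identifies a node boot strap processor and one of those is then identified as the system boot strap processor, can be illustrated with a short sketch. The selection policy shown here (lowest-numbered surviving processor in each node, then the node boot strap processor of the lowest-numbered node that has one) is an assumption made for illustration; the text does not specify how the choice is made. The names `elect_node_bsp` and `boot_system` are likewise hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def elect_node_bsp(cpu_ids: List[int], failed: Set[int]) -> Optional[int]:
    """Pick a node boot strap processor among the processors that survived the cold reset.

    Assumed policy: the lowest-numbered survivor wins. Returns None if every
    processor in the node failed.
    """
    survivors = [c for c in cpu_ids if c not in failed]
    return min(survivors) if survivors else None


def boot_system(
    nodes: Dict[int, List[int]], failed: Set[int]
) -> Tuple[Optional[int], Dict[int, Optional[int]], Dict[int, List[int]]]:
    """Return (system BSP, node BSP per node, active processors per node)."""
    # Cold-reset phase: each node attempts to identify its own node BSP.
    node_bsps = {n: elect_node_bsp(cpus, failed) for n, cpus in nodes.items()}

    # Attempt to identify one node BSP as the system BSP (assumed policy:
    # the node BSP of the lowest-numbered node that produced one).
    candidates = {n: b for n, b in node_bsps.items() if b is not None}
    if not candidates:
        return None, node_bsps, {}
    system_bsp = candidates[min(candidates)]

    # Warm-reset phase, driven by the system BSP: failed processors are
    # isolated, so only the survivors remain active in each node.
    active = {n: [c for c in cpus if c not in failed] for n, cpus in nodes.items()}
    return system_bsp, node_bsps, active


if __name__ == "__main__":
    nodes = {0: [0, 1], 1: [2, 3]}
    system_bsp, node_bsps, active = boot_system(nodes, failed={0, 2})
    print("system BSP:", system_bsp)
    print("node BSPs:", node_bsps)
    print("active:", active)
```

Because the boot strap roles are assigned among the survivors rather than fixed in advance, a failure of any single processor, including one that would otherwise have been a boot strap processor, does not prevent the remaining processors from booting in this model.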