1. Technical Field
The present invention relates to an improved data processing system and, in particular, to a method and system for data processing system reliability, and more specifically, for location of faulty components.
2. Description of Related Art
As computers become more sophisticated, diagnostic and repair processes have become more complicated and require more time to complete. Diagnostic procedures generally specify several possible solutions to an error or problem in order to guide a service engineer to a determination and subsequent resolution of the problem. The service engineer may perform several corrective steps for each diagnostic procedure while attempting to resolve the problem. The service engineer may “chase” errors through lengthy diagnostic procedures in an attempt to locate one or more components that may be causing errors within the computer.
For example, a diagnostic procedure may indicate an installed component or field replaceable unit (FRU) that is a likely candidate for the error, and the installed FRU may be replaced with a new FRU. The reported problem may be considered resolved at that point. If, after further testing of the previously installed FRU, the FRU is later determined to be reliable, the original problem has not actually been resolved and may remain unresolved until the next error is reported.
Diagnosing errors during initial program load (IPL) is especially difficult because the operating system, which may contain sophisticated error logging functions, has not yet been loaded at that stage of system initialization, and the IPL code is purposefully devoid of most diagnostic functions in order to keep the IPL code efficient. If the system suffers from a freeze or hang condition in which the system simply stops responding during IPL, the only way to diagnose the error may be to direct the service engineer to replace one FRU at a time and then reboot the system to see whether the system successfully completes the IPL.
The potential for misdiagnosis is compounded if the system has multiple, identical FRUs and the diagnostic procedure indicates that any one of the multiple FRUs could be a likely candidate for the error. For example, in a multiprocessor system, any one of the processor FRUs with associated IPL code may cause an error. In this situation, the service engineer may attempt, through trial and error, to resolve a problem by replacing each FRU in turn and then retesting the system. In the worst case, the time required for diagnosing the problem is multiplied by the number of identical FRUs. Isolating defective FRUs through trial and error is time consuming and costly. In addition to paying for unnecessary components, a business must also pay for the recurring labor costs of the service engineer and lost productivity of the user of the error-prone system.
Therefore, it would be advantageous to provide a method and apparatus for efficiently diagnosing problems during IPL within multiprocessor data processing systems.
A method and apparatus for detecting an error condition during initialization of a multiprocessor data processing system are provided. A master processor identification indicator is initialized to an initial value by a service processor in the data processing system. The master processor identification indicator may be a location in nonvolatile RAM to protect data integrity. The initial value may be a spoof number, the presence of which indicates that no master processor has yet written its unique processor identification value. One of the plurality of processors in the multiprocessor system is selected to be the master processor by being released by the service processor and winning the “race condition” to fetch the first instruction from memory for program execution. This processor then sets the master processor identification indicator to its unique processor identification value. At some later point in time, the service processor detects a freeze or hang condition in the data processing system. The service processor then reads the value of the master processor identification indicator and reports that value to indicate which processor among the plurality of processors in the data processing system was selected as the master processor prior to the detection of the hang condition.
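The sequence summarized above may be illustrated by the following minimal simulation, which is a sketch under stated assumptions and not the actual firmware. The names `Nvram`, `SPOOF_VALUE`, `service_processor_init`, `processor_boot`, `simulate_ipl`, and `report_on_hang` are hypothetical, as is the choice of 0xFF for the spoof number. Threads stand in for the processors, and a lock that is acquired once and never released stands in for winning the race to fetch the first instruction.

```python
import threading

# Hypothetical spoof number; it must not equal any real processor ID.
SPOOF_VALUE = 0xFF


class Nvram:
    """Stand-in for the nonvolatile RAM location holding the indicator."""

    def __init__(self):
        self.master_id = None


def service_processor_init(nvram):
    """Service processor writes the spoof number before releasing processors."""
    nvram.master_id = SPOOF_VALUE


def processor_boot(cpu_id, nvram, first_fetch, winners):
    """Each released processor tries to win the race for the first fetch."""
    # A non-blocking acquire succeeds for exactly one caller because the
    # lock is never released; that caller is treated as the master.
    if first_fetch.acquire(blocking=False):
        nvram.master_id = cpu_id  # master records its unique ID
        winners.append(cpu_id)


def simulate_ipl(cpu_ids):
    """Initialize the indicator, release all processors, return the results."""
    nvram = Nvram()
    service_processor_init(nvram)
    first_fetch = threading.Lock()
    winners = []
    threads = [
        threading.Thread(target=processor_boot,
                         args=(cpu_id, nvram, first_fetch, winners))
        for cpu_id in cpu_ids
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return nvram, winners


def report_on_hang(nvram):
    """Service processor reads the indicator after detecting a hang."""
    if nvram.master_id == SPOOF_VALUE:
        return "hang occurred before any processor recorded itself as master"
    return f"master processor at time of hang: {nvram.master_id}"


if __name__ == "__main__":
    nvram, _ = simulate_ipl([0, 1, 2, 3])
    print(report_on_hang(nvram))
```

Which processor wins varies from run to run, which mirrors the situation described above: the identity of the master is not known in advance, so recording it in the indicator lets the service processor name a single candidate FRU after a hang. If the spoof number is still present, the hang occurred before any processor recorded itself.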