1. Field of the Invention
This invention relates to computer systems and more particularly to processor failure detection and recovery techniques employed within multi-processing computer systems.
2. Description of the Relevant Art
Computer systems which employ multiple processing units hold the promise of economically accommodating performance capabilities that surpass those of current single-processor based systems. Within a multi-processing environment, rather than concentrating all the processing for an application within a single processor, tasks are divided into groups or "threads" that can be individually handled by separate processors. The overall processing load may thus be distributed among several processors, and the distributed tasks may be executed in parallel. The operating system software divides various portions of the program code into the separately executable threads, and typically assigns a priority level to each thread.
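By way of illustration only, the distribution of prioritized threads among processors may be sketched as follows. The dispatch policy shown (most urgent thread first, assigned to whichever processor becomes free earliest), the function name `dispatch`, the thread names, and the cost figures are all assumptions made for this sketch; the description above does not specify any particular scheduling policy.

```python
import heapq

def dispatch(threads, num_processors=2):
    """Assign threads to processors (hypothetical policy, not from the text).

    threads: list of (name, priority, cost) tuples. A lower priority
    number is treated as more urgent, which is an assumption.
    Returns a dict mapping processor index to the ordered thread names
    it executes.
    """
    # Ready queue ordered by priority, then by original submission order.
    ready = [(prio, order, name, cost)
             for order, (name, prio, cost) in enumerate(threads)]
    heapq.heapify(ready)

    # Each processor is tracked by the time at which it next becomes free.
    procs = [(0, p) for p in range(num_processors)]
    heapq.heapify(procs)

    schedule = {p: [] for p in range(num_processors)}
    while ready:
        _, _, name, cost = heapq.heappop(ready)
        free_at, p = heapq.heappop(procs)      # earliest-free processor
        schedule[p].append(name)
        heapq.heappush(procs, (free_at + cost, p))
    return schedule

# Four hypothetical threads spread over two processors.
example = dispatch([("ui", 0, 2), ("io", 1, 1), ("log", 2, 1), ("calc", 1, 3)])
```

In this sketch both processors draw from one shared ready queue, which mirrors the notion that neither processor is favored when work is handed out.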
FIG. 1 is a simplified block diagram of a so-called symmetrical dual processing system 10 including a pair of processing units 12A and 12B. Processing units 12A and 12B are each coupled to a main memory 20 via a processor bus 22. An I/O device 24 is further coupled to processor bus 22.
The multi-processing system 10 is symmetrical in the sense that both processing units 12A and 12B share the same memory space (i.e., main memory 20) and access that memory space using the same address mapping. The multi-processing system 10 is further symmetrical in the sense that both processing units 12A and 12B share equal access to the same I/O subsystem.
In general, a single copy of the operating system software as well as a single copy of each user application file is stored within main memory 20. Each processing unit 12A and 12B executes from these single copies of the operating system and user application files. Although processing units 12A and 12B may be processing instructions simultaneously, it is noted that only one of the processing units 12A or 12B may assume mastership of the processor bus 22 at a given time. Thus, a bus arbitration mechanism (not shown) is typically provided to arbitrate concurrent bus requests of the processing units and to grant mastership to one of the processing units based upon a predetermined arbitration algorithm. A variety of bus arbitration techniques are well known. Each processing unit 12A and 12B is also typically associated with a dedicated internal cache memory subsystem, the operation and function of which are also well known.
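As one possible illustration of such an arbitration mechanism, a round-robin scheme may be sketched as follows. Round-robin is chosen here only as an example of a predetermined arbitration algorithm; the description does not state which algorithm is used, and the class name `RoundRobinArbiter` and its methods are hypothetical.

```python
class RoundRobinArbiter:
    """Grants bus mastership to one requester at a time (illustrative only)."""

    def __init__(self, masters):
        self.masters = list(masters)
        # Start so that the first listed master is considered first.
        self.last = len(self.masters) - 1

    def grant(self, requests):
        """Return the master granted the bus, or None if nobody requests it.

        requests: set of master names currently requesting the bus.
        The search starts just after the most recently granted master,
        so concurrent requesters take turns.
        """
        n = len(self.masters)
        for offset in range(1, n + 1):
            idx = (self.last + offset) % n
            if self.masters[idx] in requests:
                self.last = idx
                return self.masters[idx]
        return None

# Both processing units request the bus on two consecutive cycles.
arbiter = RoundRobinArbiter(["12A", "12B"])
first = arbiter.grant({"12A", "12B"})
second = arbiter.grant({"12A", "12B"})
```

Only one name is ever returned per call, reflecting the constraint that a single processing unit holds mastership of processor bus 22 at any given time.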
For the dual processing system of FIG. 1, one of the processing units 12A or 12B is designated as a lead-off master processor. The lead-off master processor is the first processor to execute code upon system reset, and is otherwise essentially identical to the other processor which is referred to as a slave processor. For dual processing systems based on Pentium model P54C microprocessors, the master is designated by a pin strapping option referred to as "CPUTYPE". Upon system reset, each processor detects the logic level applied at its respective CPUTYPE pin and responsively assumes operation as either master or slave depending upon the detected logic level. A low logic level invokes operation as master while a high logic level invokes operation as slave. The designated master processor thereafter executes code to begin initialization of the system. At a certain point in the initialization code, a wake-up call (i.e., an interrupt) is provided to the slave processor to thus initiate dual processing operations.
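The reset sequence described above may be modeled, purely for illustration, as follows. The pin polarity (low selects master, high selects slave) is taken from the description; the names `Role`, `role_from_cputype`, and `boot`, and the event strings, are hypothetical and do not correspond to any actual firmware interface.

```python
from enum import Enum
from typing import Dict, List

class Role(Enum):
    MASTER = "master"
    SLAVE = "slave"

def role_from_cputype(pin_level: int) -> Role:
    """Map the sampled CPUTYPE logic level to a role: low = master, high = slave."""
    if pin_level not in (0, 1):
        raise ValueError("pin level must be 0 (low) or 1 (high)")
    return Role.MASTER if pin_level == 0 else Role.SLAVE

def boot(pin_levels: Dict[str, int]) -> List[str]:
    """Return the ordered reset events for the given strapping (simplified model)."""
    roles = {name: role_from_cputype(level) for name, level in pin_levels.items()}
    masters = [n for n, r in roles.items() if r is Role.MASTER]
    slaves = [n for n, r in roles.items() if r is Role.SLAVE]
    if len(masters) != 1:
        # Assumption of this sketch: exactly one processor is strapped as master.
        raise RuntimeError("exactly one processor must be strapped as master")
    master = masters[0]
    events = [f"{master}: begin initialization"]
    for slave in slaves:
        # The wake-up call is an interrupt issued from the initialization code.
        events.append(f"{master}: send wake-up interrupt to {slave}")
        events.append(f"{slave}: start executing")
    return events

# Processing unit 12A strapped low (master), 12B strapped high (slave).
events = boot({"12A": 0, "12B": 1})
```

Note that in this model the slave never starts unless the master reaches the wake-up step, so a master that fails at reset leaves the whole system idle.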
Within such a dual processing system, if the master processor fails to reset or experiences a hard failure during operation, the system may be incapable of resuming operation under control of the slave processor. Thus, the user is left with a dead system in such situations even though a perfectly functional slave processor may remain within the system. Although the system administrator could power down the system upon such failure, remove the faulty processor, and replace it with the functional processor from the slave socket, the functional processor could be damaged during handling. Furthermore, the system must be powered down while the system administrator responds to the problem.
Another solution to this problem employs a jumper block which may be used to select the master processor. If the system fails, the system administrator could remove the cover of the machine and move a selection jumper to interchange the designations of the slave and master processors. However, this solution still requires the attention of the system administrator to address the problem.