This invention relates to computer central processors and, more particularly, to the swapping of physical processors, when one is found defective, without having to reboot the operating system.
As personal computers and workstations have become more and more powerful, makers of mainframe computers have undertaken to provide features which cannot readily be matched by these smaller machines in order to stay viable in the marketplace. One such feature may be broadly referred to as fault tolerance, which means the ability to withstand and promptly recover from hardware faults and other faults without the loss of crucial information. The central processing units (CPUs) of mainframe computers typically have error and fault detection circuitry, and sometimes error recovery circuitry, built in at numerous information transfer points in the logic to detect and characterize any fault which might occur.
The CPU(s) of a given mainframe computer comprise many registers logically interconnected to achieve the ability to execute the repertoire of instructions characteristic of the CPU(s). In this environment, the achievement of genuinely fault tolerant operation, in which recovery from a detected fault can be instituted at a point in a program immediately preceding the faulting instruction/operation, requires that one or more recent copies of all the software visible registers (and supporting information also subject to change) be maintained and constantly updated. This procedure is typically carried out by reiteratively sending copies of the registers and supporting information (safestore information) to a special, dedicated memory or memory section.
When a fault occurs and analysis determines that recovery is possible, the safestore information is used to reestablish the software visible registers in the CPU with the contents they held shortly before the fault occurred, so that restart can be instituted or tried from the corresponding place in program execution.
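The safestore and restart sequence described above can be illustrated with a minimal software sketch. It is an illustration only and not the circuitry of any particular CPU: the names `RegisterSnapshot`, `SafestoreBuffer`, `ToyCpu`, and `HardwareFault`, the two-register machine, and the "add a constant to a register" instruction format are all assumptions introduced for the example. The sketch keeps a single recent copy of the software visible state, whereas a real CPU may keep more than one and would also save supporting information.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


class HardwareFault(Exception):
    """Hypothetical signal raised when fault detection logic trips."""


@dataclass(frozen=True)
class RegisterSnapshot:
    """Immutable copy of the software visible state (assumed layout)."""
    program_counter: int
    registers: Tuple[int, ...]


class SafestoreBuffer:
    """Stands in for the dedicated memory section holding safestore information."""

    def __init__(self) -> None:
        self._snapshot: Optional[RegisterSnapshot] = None

    def save(self, program_counter: int, registers: List[int]) -> None:
        # Copy the registers so later changes cannot alter the saved state.
        self._snapshot = RegisterSnapshot(program_counter, tuple(registers))

    def load(self) -> RegisterSnapshot:
        if self._snapshot is None:
            raise RuntimeError("no safestore information available")
        return self._snapshot


class ToyCpu:
    """Toy machine: each instruction is (register_index, increment)."""

    def __init__(self, program: List[Tuple[int, int]]) -> None:
        self.program = program
        self.pc = 0
        self.registers = [0, 0]
        self.safestore = SafestoreBuffer()

    def step(self, inject_fault: bool = False) -> None:
        # Update the safestore copy before the instruction changes any state.
        self.safestore.save(self.pc, self.registers)
        reg_index, increment = self.program[self.pc]
        self.registers[reg_index] += increment
        if inject_fault:
            # Simulate a fault that corrupts the result mid-instruction.
            self.registers[reg_index] = -1
            raise HardwareFault("fault detected during instruction")
        self.pc += 1

    def recover(self) -> None:
        # Reestablish the state held immediately before the faulting instruction.
        snapshot = self.safestore.load()
        self.pc = snapshot.program_counter
        self.registers = list(snapshot.registers)
```

In this sketch a caller would catch `HardwareFault`, call `recover()`, and then call `step()` again, so that execution is retried from the point immediately preceding the faulting instruction.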
Typically, when one processor in a data processing system fails, at best, the process running on that processor is aborted. In many cases, including the case where the operating system (OS) had control of the processor when the processor failed, the entire operating system crashes. When the system recovers, typically after a reboot, it will run in degraded mode, with the failed processor being disabled until it can be replaced or repaired. Obviously, if this is the only processor in the data processing system, the system is down until the repair or replacement can be accomplished. In all cases, though, the loss of the failed processor results in degraded performance.
It would be advantageous, then, for a data processing system to be able to recover from the failure of a single processor. In particular, it would be advantageous if the data processing system could recover so that no processes are lost and no performance is lost.