Various computer manufacturers have an interest in high-availability systems. Typically, these systems implement a hardware error recovery mechanism that automatically and transparently recovers from most transient errors. However, this error recovery will not succeed in most cases of solid (non-transient) errors. Various mechanisms developed within IBM, such as Processor Availability Facility (PAF), Concurrent CP Sparing, and System Assist Processor (SAP) Reassignment, provide for the recovery of a failed processor's work on a different processor. All of these prior mechanisms have limitations.
Note that Amdahl has used the term "dynamic" in conjunction with its CP Sparing. However, to the best of our knowledge, its implementation is more analogous to a combination of IBM's Processor Availability Facility (PAF) and IBM's Concurrent CP Sparing, as currently implemented on the IBM 9672 G4, than to what is described here as transparent processor sparing. (IBM and S/390 are trademarks of International Business Machines Corporation.)
IBM's S/390 division, Hitachi, and Fujitsu (Amdahl) are the companies currently very active in this arena. Other competitors that may attempt to produce mainframe-class systems using other kinds of processors, such as HP and Intel, may be interested in employing our development once they understand it.
When a CP in a multiprocessor system encounters an error and enters a checkstop state, it is very desirable not to lose the work being done on that processor but instead to move that work to another processor that is still operating in the system. In an S/390 system, several methods have previously been used to attempt to solve this problem:
Processor Availability Facility (PAF) moves the S/390 architected state of the failed processor to another currently operating (on-line) processor in the system with the help of the Operating System (OS). However, it has several major limitations: 1) since the mechanism uses the OS to perform the function, the customer is aware that the incident occurred; 2) if the CP happened to be executing in millimode at the time of the checkstop, it is not possible to invoke PAF, since PAF works only at the S/390 architected state, not the micro-architected state, which is a capability of G4 type S/390 systems (see e.g. U.S. Pat. No. 5,584,617); and 3) the customer has still lost the use of one of his CPs.
Concurrent CP Sparing, as currently implemented on the IBM 9672 G4 models, uses a spare processor so that the customer does not lose access to one of his CPs when a checkstop occurs. It is used in conjunction with PAF. However, the customer is fully aware that a processor had a problem, and customer intervention (VARY a CP online) is required in some environments. It also may not work in some Logical Partition (LPAR) environments where certain processors are dedicated to certain partitions. Finally, it relies on PAF for application recovery, and PAF will not be successful if the CP checkstop occurred while the processor was executing in millimode.
Although not directly related to preserving CP function, IBM's System Assist Processor (SAP) Reassignment, as currently implemented on the IBM 9672 G4 models, uses a spare processor to take over when a System Assist Processor (SAP) encounters an error. This mechanism cannot be used for normal, non-SAP, CPs.
These mechanisms do not work if a normal CP (non-SAP) was executing in millimode at the time of the failure.
All the above solutions are visible to the customer, who may then be concerned that his hardware is "unreliable".
Concurrent CP Sparing may not work in certain LPAR environments (e.g. dedicated uni-processor environments).
These mechanisms will not work for uni-processor configurations even if a spare CP is available.
In summary, the mechanisms described above work well as a whole but have limitations.
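The applicability rules stated in this background can be summarized in a short sketch. The Python below is illustrative only: the `Failure` record, its field names, and the `applicable_mechanisms` function are hypothetical and do not correspond to any IBM interface. It also assumes, as a simplification, that Concurrent CP Sparing never works for a CP dedicated to a partition, whereas the text says only that it may not work in such environments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Failure:
    """Hypothetical description of a processor checkstop (illustrative names)."""
    is_sap: bool           # failed processor was a System Assist Processor
    in_millimode: bool     # CP was executing in millimode when it failed
    other_cp_online: bool  # another CP is on-line (False for a uni-processor)
    spare_available: bool  # a spare processor exists in the system
    dedicated_lpar: bool   # failed CP was dedicated to a logical partition


def applicable_mechanisms(f: Failure) -> List[str]:
    """Return the prior mechanisms that could apply to this failure."""
    if f.is_sap:
        # SAP Reassignment covers SAPs only and relies on a spare processor.
        return ["SAP Reassignment"] if f.spare_available else []

    result: List[str] = []
    # PAF moves only the architected state, so it cannot run after a
    # millimode checkstop, and it needs another on-line CP as the target.
    paf_ok = (not f.in_millimode) and f.other_cp_online
    if paf_ok:
        result.append("PAF")
        # Concurrent CP Sparing builds on PAF and needs a spare processor.
        # Assumption: treat "may not work" for dedicated CPs as "does not".
        if f.spare_available and not f.dedicated_lpar:
            result.append("Concurrent CP Sparing")
    return result
```

Under these assumptions, a checkstop taken in millimode on a normal CP yields an empty list, as does any checkstop in a uni-processor configuration, even when a spare is available. These are the gaps the limitations above describe.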