1. Technical Field
The present invention relates generally to an improved data processing system, and in particular to a method and apparatus for handling failed hardware in a data processing system.
2. Description of Related Art
Modern computer systems with high availability requirements use many design methods to recover from hardware errors and resume normal system operation. Some errors can be recovered with no effect on the operating system or user applications, and some with minimal effect, as described in pending US patent application docket number AUS920010117US1, “Method and Apparatus for Parity Error Recovery.” Hardware errors that cannot be recovered without exposing customer data integrity to risk result in system termination. To recover from system termination, a method of automatic system reboot recovery has been described, for example, in U.S. Pat. No. 5,951,686. To prevent hardware with errors from further affecting system operation after reboot, methods have been devised for persistent deconfiguration of the processor and memory in a computer system, such as those taught in U.S. Pat. Nos. 6,223,680 and 6,234,823.
However, the existing persistent deconfiguration methods handle only errors that are internal to the processor or memory subsystems. They do not work with hardware errors on the interface buses between subsystems in the computer. Therefore, during automatic system reboot recovery after an error, thorough diagnostic testing of the system hardware is required to ensure that the system can be rebooted successfully. Thorough hardware testing during system recovery lengthens recovery time, thus reducing system availability. Also, some hardware errors in the computer system are intermittent in nature, so brief diagnostic testing during automatic system recovery may not always detect them. The same error may then reappear and cause another system outage.
Therefore, it would be beneficial to have a way to identify all hardware errors after system termination and to fence off the hardware exhibiting those errors from the system configuration.