1. Technical Field
The present invention relates generally to data processing systems and in particular to a system and method for recovering from an internal processor failure. More particularly, the present invention relates to a processor failure recovery technique applicable in a multiprocessor environment employing system management and predictive failure analysis techniques.
2. Description of the Related Art
Computer failures can result from malfunctioning disk drives, memory or processors, conflicts between hardware components, and software errors, among other things. Solutions to such failures have included, for example, Predictive Failure Analysis (PFA), which provides autonomous monitoring of specified system parameters or failure conditions to predict device failures and issue alerts warning of actual or imminent failures. This allows a system administrator to either hot-swap the faulty component or schedule downtime at low-impact periods for the component to be fixed or replaced.
While PFA has provided substantial gains in preventing data loss and minimizing runtime interruption for disk drive systems such as RAID systems, neither PFA nor other system failure warning or recovery techniques have adequately addressed data loss and system interruption caused by internal processor failures. Since processors provide the fundamental processing functions of a system, including those required for system recovery, runtime protection facilities such as PFA have been limited to issuing alerts and/or automatically resetting (rebooting) the system responsive to detected processor performance degradation.
The lack of runtime processor recovery solutions that would allow preservation of current state and unsaved data and enable the system to continue operating with minimal interruption is evident from recently proposed processor error recovery solutions. Current processor-specific PFA, for example, monitors processor-related faults such as L2 cache error corrections, and responsive to the frequency of such errors exceeding a specified threshold, a system management processor generates an alert that may then be utilized by a system administrator to schedule processor replacement as part of a maintenance cycle. Another recent example of the dearth of autonomic recovery solutions to processor failures is U.S. Patent application Ser. No. 20040034816 A1, which discloses a computer failure recovery and notification system. The recovery described therein generally comprises use of a timer mechanism that monitors the relative activity or “heartbeat” from the operating system. The absence of the periodic heartbeat signal is interpreted by the system as a system hang or failure, and the recovery action taken in response thereto is to reboot the system, thus resulting in a loss of operating state data and an interruption of runtime processing. Other recently proposed solutions involve using dedicated error handling hardware in a multi-processor environment to monitor and record internal processor errors. Responsive to an error status reported for one or more of the multi-processors, the non-functional processors are disabled and, similar to the system described in U.S. Patent application Ser. No. 20040034816 A1, the recovery further includes restarting the system.
In summary, present state of the art systems addressing internal processor errors are largely operating system reliant and/or result in the present operating state of a failing processor being lost, such as via a system restart. Accordingly, there remains a need for an improved processor recovery system and method that address these and other problems unaddressed by the prior art.