This application contains subject matter which is related to the subject matter of the following applications, each of which is assigned to the same assignee as this application and filed on the same day as this application. Each of the below listed applications is hereby incorporated herein by reference in its entirety:
“INPUT/OUTPUT RECOVERY METHOD WHICH IS BASED UPON AN ERROR RATE AND A CURRENT STATE OF THE COMPUTER ENVIRONMENT,” by Fitzpatrick et al., Ser. No. 09/137,947; and
“INPUT/OUTPUT RECOVERY SYSTEM WHICH IS BASED UPON AN ERROR RATE AND A CURRENT STATE OF THE COMPUTER ENVIRONMENT,” by Fitzpatrick et al., Ser. No. 09/138,104.
This invention relates, in general, to processing within a computer environment and, in particular, to determining error conditions within the computer environment and to recovering from those error conditions.
Increasing pressure to provide highly available and continuously available computer systems places a great deal of emphasis on error detection and recovery. It is very important for errors to be detected and for recovery to be performed before the computer system crashes or is otherwise seriously impacted.
There are various types of errors and even more types of recovery processes. For example, missing interrupts and hot input/outputs (I/Os) are just two types of error conditions recognized by the Multiple Virtual Storage (or OS/390) operating system offered by International Business Machines Corporation.
A missing interrupt is an error that indicates that an input/output request has been initiated, but no response has been received for the request. A missing interrupt can be symptomatic of many different types of problems and there are different recovery processes to cover those different types of problems.
A hot I/O condition occurs when there are continuous unsolicited I/O interrupts. These interrupts are typically caused by an I/O device, control unit or channel path. Thus, recovery processes are provided to isolate the cause of the interrupts and to attempt recovery.
There are also other types of errors that do not fall within the above categories. These errors, as well as the above errors, may cause critical system resources to become exhausted, thereby causing the computer system to crash. This is particularly devastating when several computer systems are coupled to one another and all of the systems crash.
Therefore, a need exists for an enhanced recovery capability that takes into account different types of errors. Further, a need exists for a recovery capability that monitors critical system resources, and takes action to avoid exhaustion of those resources. A yet further need exists for a recovery capability that provides enhanced system availability.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of recovering from errors in a computer environment. The method includes, for example, determining whether an error rate is above a predefined threshold; determining whether there is at least a potential shortage of a resource of the computer environment; and performing a recovery action when the error rate is above the predefined threshold and the at least potential shortage exists.
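By way of illustration only, the two determinations and the conditional recovery action described above might be sketched as follows. The names `SubsystemState`, `ERROR_RATE_THRESHOLD`, `STORAGE_SHORTAGE_PCT`, and `should_recover`, as well as the numeric values and units, are hypothetical and are not taken from this application.

```python
from dataclasses import dataclass


@dataclass
class SubsystemState:
    """Hypothetical snapshot of one subsystem and the monitored resource."""
    error_rate: float        # errors per second (assumed unit)
    free_storage_pct: float  # percentage of the monitored storage still free


# Assumed values; the application only says the threshold is "predefined".
ERROR_RATE_THRESHOLD = 5.0
STORAGE_SHORTAGE_PCT = 20.0


def potential_shortage(state: SubsystemState) -> bool:
    """Treat low free storage as at least a potential resource shortage."""
    return state.free_storage_pct < STORAGE_SHORTAGE_PCT


def should_recover(state: SubsystemState) -> bool:
    """Recover only when BOTH conditions hold: high error rate and shortage."""
    return state.error_rate > ERROR_RATE_THRESHOLD and potential_shortage(state)
```

In this sketch, a high error rate alone does not trigger recovery; the recovery action is reserved for the case in which the errors coincide with an actual or potential shortage of the resource.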
As examples, the resource being monitored is storage, and the error rate is associated with a subsystem of the computer environment. As a further example, the method further includes computing the error rate.
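One possible way to compute a per-subsystem error rate is to count the errors observed within a sliding time window. The sketch below assumes that approach; the class name `ErrorRateMonitor`, the window length, and the use of caller-supplied timestamps are illustrative assumptions rather than details from this application.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window error counter for a single subsystem."""

    def __init__(self, window_seconds: float = 10.0) -> None:
        self.window_seconds = window_seconds
        self._timestamps = deque()  # times of errors still inside the window

    def _evict(self, now: float) -> None:
        # Drop errors that are older than the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()

    def record_error(self, now: float) -> None:
        """Record one error observed at time `now` (seconds)."""
        self._timestamps.append(now)
        self._evict(now)

    def rate(self, now: float) -> float:
        """Return errors per second over the most recent window."""
        self._evict(now)
        return len(self._timestamps) / self.window_seconds
```

The value returned by `rate` could then be compared against the predefined threshold for the subsystem.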
In one embodiment, the recovery action includes simulating status of an error detected for the subsystem and/or slowing down activity to the subsystem.
In yet a further embodiment, the simulation of status includes performing one or more functions depending on the type of error. For instance, if the error is a channel error, a permanent error condition is indicated. Similarly, if the error is a unit check, a permanent error condition is indicated, and a selective reset is issued at a device of the subsystem. Further, if the error is an unsolicited error, an unsolicited device end indicator is set, and an isolation routine for a component of the subsystem is invoked.
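The per-error-type handling described above might be organized as a simple dispatch, sketched below. The enumeration `ErrorType`, the function `simulate_status`, and the action names it returns are hypothetical placeholders; a real implementation would invoke the corresponding system services rather than return strings.

```python
from enum import Enum, auto
from typing import List


class ErrorType(Enum):
    """Hypothetical labels for the three error types named in the text."""
    CHANNEL_ERROR = auto()
    UNIT_CHECK = auto()
    UNSOLICITED = auto()


def simulate_status(error_type: ErrorType) -> List[str]:
    """Return the (placeholder) actions taken when simulating status."""
    if error_type is ErrorType.CHANNEL_ERROR:
        # Channel error: indicate a permanent error condition.
        return ["indicate_permanent_error"]
    if error_type is ErrorType.UNIT_CHECK:
        # Unit check: permanent error plus a selective reset at the device.
        return ["indicate_permanent_error", "issue_selective_reset"]
    if error_type is ErrorType.UNSOLICITED:
        # Unsolicited error: set the indicator, then isolate the component.
        return ["set_unsolicited_device_end", "invoke_isolation_routine"]
    raise ValueError(f"unsupported error type: {error_type!r}")
```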
In another aspect of the present invention, at least one program storage device readable by a machine, tangibly embodying at least one program of instructions executable by the machine to perform a method of recovering from errors in a computer environment is provided. The method includes, for example, determining whether an error rate is above a predefined threshold; determining whether a resource of the computer environment is below a predetermined threshold; and performing a recovery action when the error rate is above the predefined threshold and the resource is below the predetermined threshold.
In yet another aspect of the present invention, an article of manufacture, which includes at least one computer usable medium having computer readable program code means embodied therein for causing the recovery from errors of a computer environment, is provided. The computer readable program code means in the article of manufacture includes, for instance, computer readable program code means for causing a computer to determine whether an error rate is above a predefined threshold; computer readable program code means for causing a computer to determine whether there is at least a potential shortage of a resource of the computer environment; and computer readable program code means for causing a computer to perform a recovery action when the error rate is above the predefined threshold and the at least potential shortage exists.
In a further aspect of the present invention, an article of manufacture, including at least one computer usable medium having computer readable program code means embodied therein for causing the recovery from errors of a computer environment, is provided. The computer readable program code means in the article of manufacture includes, for example, computer readable program code means for causing a computer to determine whether an error rate is above a predefined threshold; computer readable program code means for causing a computer to determine whether a resource of the computer environment is below a predetermined threshold; and computer readable program code means for causing a computer to perform a recovery action when the error rate is above the predefined threshold and the resource is below the predetermined threshold.
The error recovery capability of the present invention advantageously takes into account different types of error conditions. Additionally, it monitors critical system resources, and takes action to avoid exhaustion of those resources. The error recovery capability of the present invention advantageously uses a statistical threshold of the number of errors over time for deciding when a device is abnormally disrupting the computer environment. Further, the present invention is able to quiesce activity at a subsystem level. Additionally, the present invention advantageously limits any outages to those applications and subsystems using the devices in error. Thus, the present invention provides enhanced system availability.
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.