This invention relates, in general, to processing within a computer environment and, in particular, to determining error conditions within the computer environment and to recovering from those error conditions.
Increasing pressure to provide highly available and continuously available computer systems places a great deal of emphasis on error detection and recovery. It is very important for errors to be detected and for recovery to be performed before the computer system crashes or is otherwise seriously impacted.
There are various types of errors and even more types of recovery processes. For example, missing interrupts and hot input/outputs (I/Os) are just two types of error conditions recognized by the Multiple Virtual Storage (or OS/390) operating system offered by International Business Machines Corporation.
A missing interrupt is an error that indicates that an input/output request has been initiated, but no response has been received for the request. A missing interrupt can be symptomatic of many different types of problems and there are different recovery processes to cover those different types of problems.
A hot I/O condition occurs when there are continuous unsolicited I/O interrupts. These interrupts are typically caused by an I/O device, control unit or channel path. Thus, recovery processes are provided to isolate the cause of the interrupts and to attempt to recover from it.
There are also other types of errors that do not fall within the above categories. These errors, as well as the above errors, may cause critical system resources to become exhausted, thereby causing the computer system to crash. This is particularly devastating when several computer systems are coupled to one another and all of the systems crash.
Therefore, a need exists for an enhanced recovery capability that takes into account different types of errors. Further, a need exists for a recovery capability that monitors critical system resources, and takes action to avoid exhaustion of those resources. A yet further need exists for a recovery capability that provides enhanced system availability.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a system of recovering from errors in a computer environment. The system includes, for instance, means for determining whether an error rate is above a predefined threshold; means for determining whether there is at least a potential shortage of a resource of the computer environment; and means for performing a recovery action when the error rate is above the predefined threshold and there exists at least a potential for the shortage.
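By way of illustration only, the combined condition described above might be sketched as follows. The names `SubsystemState`, `should_recover`, `error_rate_threshold` and `resource_low_water`, as well as the units, are hypothetical and are not taken from any actual implementation; a potential shortage is assumed here to mean that the free amount of the resource has dropped below a low-water mark.

```python
from dataclasses import dataclass


@dataclass
class SubsystemState:
    """Hypothetical snapshot of one subsystem and one monitored resource."""
    error_rate: float    # errors per unit of time (unit is an assumption)
    free_resource: int   # free units of the monitored resource


def should_recover(state: SubsystemState,
                   error_rate_threshold: float,
                   resource_low_water: int) -> bool:
    """Return True only when BOTH conditions hold:
    the error rate is above its threshold, and the resource
    shows at least a potential shortage (below the low-water mark)."""
    rate_exceeded = state.error_rate > error_rate_threshold
    potential_shortage = state.free_resource < resource_low_water
    return rate_exceeded and potential_shortage
```

In this sketch, a high error rate alone does not trigger the recovery action, nor does a resource shortage alone; the action is taken only when the two coincide.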
In one example, the error rate is associated with a subsystem of the computer environment, and the system includes means for computing the error rate.
In a further example, the recovery action includes one of the following: simulating status of an error detected for the subsystem, in which the simulating is devoid of a need for a large amount of the resource; and slowing down activity to the subsystem.
In yet a further embodiment, the simulating includes means for performing one or more functions depending on the type of error. For instance, if the error is a channel error, a permanent error condition is indicated. Similarly, if the error is a unit check, a permanent error condition is indicated, and a selective reset is issued at a device of the subsystem. Further, if the error is an unsolicited error, an unsolicited device end indicator is set, and an isolation routine for a component of the subsystem is invoked.
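As a minimal sketch of the per-error-type handling just described, the selection of simulated functions could be expressed as a simple dispatch. The enumeration `ErrorType`, the function `simulate_status` and the action names it returns are hypothetical labels chosen for illustration; they merely stand in for the functions named above and do not correspond to actual operating system services.

```python
from enum import Enum, auto


class ErrorType(Enum):
    CHANNEL_ERROR = auto()
    UNIT_CHECK = auto()
    UNSOLICITED = auto()


def simulate_status(error_type: ErrorType) -> list[str]:
    """Return the (hypothetically named) functions performed when
    simulating status for the given type of error."""
    if error_type is ErrorType.CHANNEL_ERROR:
        # Channel error: indicate a permanent error condition.
        return ["indicate_permanent_error"]
    if error_type is ErrorType.UNIT_CHECK:
        # Unit check: indicate a permanent error and issue a
        # selective reset at a device of the subsystem.
        return ["indicate_permanent_error", "issue_selective_reset"]
    if error_type is ErrorType.UNSOLICITED:
        # Unsolicited error: set the unsolicited device end indicator
        # and invoke an isolation routine for a subsystem component.
        return ["set_unsolicited_device_end", "invoke_isolation_routine"]
    raise ValueError(f"unsupported error type: {error_type!r}")
```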
In a further aspect of the present invention, a system of recovering from errors in a computer environment is provided. The system includes, for example, means for determining whether an error rate is above a predefined threshold; means for determining whether a resource of the computer environment is below a predetermined threshold; and means for performing a recovery action when the error rate is above the predefined threshold and the resource is below the predetermined threshold.
In yet a further aspect of the present invention, a system of recovering from errors in a computer environment is provided, in which the system includes, for instance, a processing unit adapted to determine whether an error rate is above a predefined threshold and to determine whether there is at least a potential shortage of a resource of the computer environment. The processing unit is further adapted to perform a recovery action when the error rate is above the predefined threshold and the at least potential shortage exists.
In yet a further aspect of the present invention, a system of recovering from errors in a computer environment is provided. The system includes, for instance, a processing unit adapted to determine whether an error rate is above a predefined threshold, to determine whether a resource of the computer environment is below a predetermined threshold, and to perform a recovery action when the error rate is above the predefined threshold and the resource is below the predetermined threshold.
The error recovery capability of the present invention advantageously takes into account different types of error conditions. Additionally, it monitors critical system resources, and takes action to avoid exhaustion of those resources. The error recovery capability of the present invention advantageously uses a statistical threshold of the number of errors over time for deciding when a device is abnormally disrupting the computer environment. Further, the present invention is able to quiesce activity at a subsystem level. Additionally, the present invention advantageously limits any outages to those applications and subsystems using the devices in error. Thus, the present invention provides enhanced system availability.
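One possible way to realize a threshold on the number of errors over time is a sliding time window, sketched below for illustration. The sliding-window technique, the class `ErrorRateMonitor` and its parameters `window_seconds` and `max_errors` are assumptions introduced here; the manner in which the error rate is actually computed is not specified in this section.

```python
from collections import deque


class ErrorRateMonitor:
    """Hypothetical sliding-window counter of errors for one subsystem."""

    def __init__(self, window_seconds: float, max_errors: int) -> None:
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self._timestamps: deque[float] = deque()

    def _expire(self, now: float) -> None:
        # Drop errors that are older than the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()

    def record_error(self, now: float) -> None:
        # Timestamps are assumed to arrive in non-decreasing order.
        self._timestamps.append(now)
        self._expire(now)

    def is_disruptive(self, now: float) -> bool:
        # True when more than max_errors occurred within the window.
        self._expire(now)
        return len(self._timestamps) > self.max_errors
```

Under these assumptions, a burst of errors within the window marks the subsystem as disruptive, while the same number of errors spread over a longer period does not.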
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.