This invention relates, in general, to processing within a computer environment and, in particular, to determining error conditions within the computer environment and to recovering from those error conditions.
Increasing pressure to provide highly available and continuously available computer systems places a great deal of emphasis on error detection and recovery. Errors must be detected, and recovery performed, before the computer system crashes or is otherwise seriously impacted.
There are various types of errors and even more types of recovery processes. For example, missing interrupts and hot input/outputs (I/Os) are just two types of error conditions recognized by the Multiple Virtual Storage (or OS/390) operating system offered by International Business Machines Corporation.
A missing interrupt is an error that indicates that an input/output request has been initiated, but no response has been received for the request. A missing interrupt can be symptomatic of many different types of problems and there are different recovery processes to cover those different types of problems.
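By way of illustration only, the missing interrupt condition described above can be sketched as a timeout check over outstanding requests. The names PendingRequest and find_missing_interrupts, the device labels, and the timeout value are hypothetical and are not taken from any operating system; this is a minimal sketch, not the detection logic of any particular system.

```python
from dataclasses import dataclass


@dataclass
class PendingRequest:
    device: str        # hypothetical device label
    started_at: float  # time the I/O request was initiated, in seconds


def find_missing_interrupts(pending, now, timeout):
    """Return the devices whose requests have waited longer than `timeout`
    seconds without a response (assumed definition of a missing interrupt)."""
    return [r.device for r in pending if now - r.started_at > timeout]


# Two outstanding requests, checked at time 30 with an assumed 15-second timeout.
pending = [PendingRequest("dev_a", 0.0), PendingRequest("dev_b", 25.0)]
overdue = find_missing_interrupts(pending, now=30.0, timeout=15.0)
```

Here only the request to dev_a has been outstanding longer than the assumed timeout, so it alone would be handed to a recovery process.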
A hot I/O condition occurs when there are continuous unsolicited I/O interrupts. These interrupts are typically caused by an I/O device, control unit or channel path. Thus, recovery processes are provided to isolate the cause of the interrupts and to attempt to recover from it.
There are also other types of errors that do not fall within the above categories. These errors, as well as the above errors, may cause critical system resources to become exhausted, thereby causing the computer system to crash. This is particularly devastating when several computer systems are coupled to one another and all of the systems crash.
Therefore, a need exists for an enhanced recovery capability that takes into account different types of errors. Further, a need exists for a recovery capability that monitors critical system resources, and takes action to avoid exhaustion of those resources. A yet further need exists for a recovery capability that provides enhanced system availability.
The shortcomings of the prior art are overcome and additional advantages are provided through the provision of a method of recovering from errors in a computer environment. In one embodiment, the method includes determining whether an error rate is above a predefined threshold, determining whether there is at least a potential shortage of a resource of the computer environment, and performing a recovery action when the error rate is above the predefined threshold and there exists at least a potential for a shortage.
In one example, the resource is storage and the determining of whether at least a potential shortage exists comprises checking a storage indicator indicative of a level of available storage.
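The two determinations of this embodiment can be sketched as a single decision function. The sketch below is illustrative only: the function name should_recover, its parameter names, and the representation of the storage indicator as a boolean flag are assumptions made for the example and do not describe an actual implementation.

```python
def should_recover(error_rate, error_threshold, storage_shortage_indicated):
    """Decide whether a recovery action is to be performed.

    error_rate                 -- observed errors per unit time (assumed unit)
    error_threshold            -- the predefined threshold for the error rate
    storage_shortage_indicated -- assumed boolean form of the storage
                                  indicator: True when available storage is
                                  low enough to suggest a potential shortage
    """
    rate_exceeded = error_rate > error_threshold
    return rate_exceeded and storage_shortage_indicated


# Recovery is performed only when both conditions hold.
decision = should_recover(error_rate=12.0, error_threshold=5.0,
                          storage_shortage_indicated=True)
```

Under these assumptions, a high error rate alone, or a potential shortage alone, does not trigger the recovery action.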
In a further embodiment, the error rate is associated with a subsystem of the computer environment and, in one example, the subsystem is an input/output subsystem.
In one example, the recovery action includes at least one of the following: simulating status of an error detected for the subsystem, in which the simulating is devoid of a need for a large amount of the resource, and slowing down activity to the subsystem.
In another embodiment of the invention, a method of recovering from errors in a computer environment is provided. The method includes, for instance, determining whether an error rate is above a predefined threshold, determining whether a resource of the computer environment is below a predetermined threshold, and performing a recovery action when the error rate is above the predefined threshold and the resource is below the predetermined threshold.
In one example, the recovery action to be performed is based upon a severity level of the predetermined threshold.
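One way to picture a recovery action chosen by severity level is sketched below. The two storage levels (warning_level and critical_level), the Action names, and in particular the mapping of the critical level to simulating status and the warning level to slowing down activity are assumptions made for illustration; the text above does not fix which action corresponds to which level.

```python
from enum import Enum


class Action(Enum):
    NONE = "no recovery action"
    SLOW_DOWN = "slow down activity to the subsystem"
    SIMULATE_STATUS = "simulate status of the detected error"


def choose_action(error_rate, error_threshold, free_storage,
                  warning_level, critical_level):
    """Pick a recovery action from the error rate and the free storage.

    Assumes critical_level < warning_level, both in the same units as
    free_storage. The level-to-action mapping is hypothetical.
    """
    if error_rate <= error_threshold:
        return Action.NONE             # error rate is not above its threshold
    if free_storage < critical_level:
        return Action.SIMULATE_STATUS  # most severe level crossed
    if free_storage < warning_level:
        return Action.SLOW_DOWN        # less severe level crossed
    return Action.NONE                 # storage is not below any threshold


action = choose_action(error_rate=9.0, error_threshold=5.0,
                       free_storage=40, warning_level=100, critical_level=50)
```

In this sketch the free storage of 40 is below the assumed critical level of 50, so the more drastic action is selected.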
The error recovery capability of the present invention advantageously takes into account different types of error conditions. It also monitors critical system resources and takes action to avoid exhaustion of those resources. To decide when a device is abnormally disrupting the computer environment, it uses a statistical threshold on the number of errors over time. Further, the present invention is able to quiesce activity at a subsystem level, and it limits any outages to those applications and subsystems using the devices in error. Thus, the present invention provides enhanced system availability.
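A threshold on the number of errors over time can be illustrated with a sliding-window counter. This is a minimal sketch under stated assumptions: the class name ErrorRateMonitor, the fixed-length window, and the specific window and limit values are hypothetical, and the invention may compute its statistic differently.

```python
from collections import deque


class ErrorRateMonitor:
    """Counts errors seen within the most recent `window_seconds`."""

    def __init__(self, window_seconds, max_errors):
        self.window_seconds = window_seconds  # assumed window length
        self.max_errors = max_errors          # assumed threshold on the count
        self._timestamps = deque()

    def _expire(self, now):
        # Drop errors that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()

    def record_error(self, now):
        self._timestamps.append(now)
        self._expire(now)

    def is_above_threshold(self, now):
        self._expire(now)
        return len(self._timestamps) > self.max_errors


# Four errors within three seconds exceed an assumed limit of three per ten seconds.
monitor = ErrorRateMonitor(window_seconds=10.0, max_errors=3)
for t in (0.0, 1.0, 2.0, 3.0):
    monitor.record_error(t)
burst_detected = monitor.is_above_threshold(now=3.0)
quiet_later = monitor.is_above_threshold(now=20.0)
```

Once the burst has aged out of the window, the monitor no longer reports the device as exceeding the threshold.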
Additional features and advantages are realized through the techniques of the present invention. Other embodiments and aspects of the invention are described in detail herein and are considered a part of the claimed invention.