In mainframe commercial computing, recovery considerations must change when going from a batch to a transaction environment. With batch it is usually preferable to take whatever time is needed to recover. However, with on-line transactions there can be time constraints which make it better to complete a transaction, or part of a transaction, unsuccessfully but on time rather than successfully but late. An example is an on-line banking system.
Avoiding transaction delays when errors occur necessarily involves trading off full recovery actions for on-time response. Where errors involve hardened data, there is potential (on a WRITE operation) for loss of data. This problem can sometimes be alleviated by duplexing of data, which many computer installations do for availability in any case. Duplexing allows abbreviated recovery actions without loss of data and so allows transactions to complete normally and on time. However, there are environments where it is preferable to return an error rather than return late.
While transaction time constraints can affect how recovery is done in every area, I/O requests are a particularly important area to address. First, I/O requests are a major component of transaction response time, and second, I/O requests are susceptible to delays due to hardware and/or software retries or other factors. For example, retry operations internal to a DASD (Direct Access Storage Device) control unit can take 4 seconds or more. Multiple software retries down multiple paths can compound this, causing recovery of a DASD error to take minutes in some cases.
Some control programs (e.g., IBM's MVS/ESA) have a Missing Interrupt Handler (MIH) function which is responsible for detecting device failures and taking corrective actions to recover active I/O requests. A "fault" time limit is specified on a device basis and the MIH will interrupt an active I/O request at the device if the time limit is exceeded. The detection is accurate to 1/2 the fault value (on average). After the timed out I/O request is interrupted, it can often be retried. Since any software retry represents a newly active I/O request, it is allowed a full new MIH interval.
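The compounding effect described above can be illustrated with a small calculation. The following is a minimal sketch only: the function name, the assumption that every attempt runs to the full fault limit, and the numeric values are illustrative and do not come from any control program. Detection latency is ignored.

```python
def worst_case_recovery_seconds(fault_limit_s, retries_per_path, paths):
    """Rough bound on recovery time if every attempt runs to the fault limit.

    Each software retry is a newly active I/O request, so it is allowed a
    full new interval. Detection latency is ignored (simplifying assumption).
    """
    attempts_per_path = 1 + retries_per_path  # first attempt plus its retries
    return fault_limit_s * attempts_per_path * paths


# Illustrative values only: 15 s fault limit, 2 retries per path, 4 paths.
total = worst_case_recovery_seconds(15, 2, 4)
print(total)  # 180
```

Under these assumed values a single failing request can occupy three minutes, even though no individual attempt exceeds its own time limit.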
A known method for solving the problem of excessive I/O recovery delaying transactions relies on automated operations. An automation program can be written to detect MIH console messages. The device named in the message is checked against a list of key devices and, if there is a match, the program issues an operator command to force the device off-line. This effectively terminates all I/O requests to the device from that point on. An undesirable side effect is that the original software and/or hardware error diagnostics are lost. Also, because forcing the device off-line makes it unavailable to the system, the device can neither process subsequent I/O requests nor return asynchronous hardware diagnostics about the original error.
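The decision logic of such an automation program might be sketched as follows. The message layout, the pattern, the action name, and the device numbers are all hypothetical, chosen only for illustration; they are not the format of any actual console message or operator command.

```python
import re

# Hypothetical console message layout, assumed for illustration only.
MIH_PATTERN = re.compile(r"MISSING INTERRUPT .*DEVICE=(?P<dev>[0-9A-F]{3,4})")


def action_for_console_message(message, key_devices):
    """Return an action for a console message, or None if none is needed.

    An action is produced only when the message reports a missing
    interrupt and the device it names is on the key-device list.
    """
    match = MIH_PATTERN.search(message)
    if match is None:
        return None  # not an MIH message
    device = match.group("dev")
    if device in key_devices:
        # Placeholder for the operator command that forces the device off-line.
        return ("FORCE_OFFLINE", device)
    return None  # MIH message, but not for a key device


key_devices = {"0A3F", "0B10"}  # hypothetical device numbers
print(action_for_console_message("MISSING INTERRUPT DETECTED DEVICE=0A3F", key_devices))
```

Note that the sketch acts only on the console message, which is why this approach loses the diagnostics associated with the original error.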
It is an object of the present invention to provide a mechanism for controlling the total time allowed for I/O requests.
It is a further object of this invention to track I/O request time across I/O error recovery scenarios.
It is a further object of this invention to track I/O requests across their entire "life".
It is a further object of this invention to provide for setting time limits for I/O requests on an I/O request basis, on a device basis, on a dataset basis, or on a workload (address space) basis.
It is a further object of this invention to provide for improved transaction timer accuracy over that provided in such conventional mechanisms as those used in MIH processing.
It is a further object of this invention to provide for a threshold time value beyond which I/O retries will not be attempted.
It is a further object of this invention to provide for an I/O request time limit that cannot be overridden by device recovery support code.
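One possible reading of these objects is sketched below. It is a minimal illustration, not the mechanism of the invention: every name, the precedence order among limits, and the choice to check the clock only between attempts (rather than interrupting a running attempt) are assumptions. The point illustrated is that the clock starts once per request and is not reset on retry.

```python
import time


class IoTimeLimitExceeded(Exception):
    """Raised when a request may not be retried any further."""


def resolve_limit(request=None, dataset=None, device=None, workload=None):
    """Pick the first limit that is set. The precedence order is an assumption."""
    for limit in (request, dataset, device, workload):
        if limit is not None:
            return limit
    return None


def run_with_total_limit(attempt, total_limit_s, retry_threshold_s,
                         clock=time.monotonic):
    """Call attempt() until it succeeds, timing the request over its whole life.

    attempt() returns True on success and False on a recoverable error.
    Returns the elapsed time on success.
    """
    start = clock()  # started once; retries do not get a fresh interval
    while True:
        if attempt():
            return clock() - start
        elapsed = clock() - start
        if elapsed >= total_limit_s:
            raise IoTimeLimitExceeded("total time limit reached")
        if elapsed >= retry_threshold_s:
            raise IoTimeLimitExceeded("retry threshold passed; not retrying")


# Usage with illustrative values: no request-level limit, so the dataset limit applies.
limit = resolve_limit(request=None, dataset=2.0, device=30.0)
run_with_total_limit(lambda: True, total_limit_s=limit, retry_threshold_s=1.0)
```

Because the limit is enforced by the caller of `attempt` rather than inside it, recovery code behind `attempt` cannot extend the limit in this sketch.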