In all computing environments, interrupts from various I/O devices may go unrecognized by an OS that requested an operation by the device, because the I/O device failed to present the interrupt, because a component in the path from the device to the OS failed to properly transfer the interrupt, or because the OS failed to recognize the interrupt when presented.
Failure to detect a missing interrupt may cause operations in a data processing system to slow down and ultimately cease when a critical resource cannot be released until the interrupt occurs or the processing associated with the missing interrupt is terminated. Such detection failure may require an unscheduled system restart (IPL) to make the resource available again. Manual attempts to locate the request for the resource on some queue usually take longer than the requesting customer can afford to wait.
To reduce the catastrophic impact of a missing interrupt on the system, a method was developed to detect lost interrupts so that failed operations can be terminated, recovery mechanisms deployed, and the failed operations either restarted or the initiating job terminated with an error. This mechanism is called the Missing Interrupt Handler (MIH) and has the ability to "time" I/O operations that are in progress. Actually, this is not a time measurement but rather a limit on the length of time that is considered "normal" for the longest possible I/O operation to the device. This does not mean that all operations should take a long time but rather that all operations that exceed this time are to be considered abnormal. The Missing Interrupt Handler is therefore a "safety net" under the system to shield the host from the effects of a lost interrupt.
Initially, only a couple of timer values were established to differentiate between slower devices (unit record) and faster devices (DASD) which allowed only limited ability to tailor timer values for different machines. Two timer values were not adequate and additional individual MIH timer values have been implemented which can be adjusted to meet the needs and response requirements of various devices.
Over time, the capabilities of the MIH component have been expanded to allow dynamic modification of the MIH timeouts, including the ability to place time limits across all I/O request processing including queuing time and error recovery procedure (ERP), instead of just active time as originally implemented. However, today's Missing Interrupt Handler components of computer operating systems have deficiencies.
In today's operating systems there are various default MIH intervals based on device class (e.g. DASD, tape). However, within each device class there is a great disparity between the recommended MIH times for different device types. For example, on tape devices the recommended MIH detection interval varies from 3 minutes for an S/390 3420 tape device to 20 minutes for an S/390 3490E tape device. This variation in MIH time is due to the varying capacity a tape can contain and the maximum physical speed at which the medium can be moved.
The MIH detection interval must be greater than the time to execute the longest command at the device (e.g. forward space file, rewind/unload). Another example is for DASD devices. The MVS (Multiple Virtual Storage, IBM's premier operating system for S/390 machines) operating system has a 15 second default MIH time for DASD devices. This usually only needs adjustment by a system operator due to special characteristics of the work load or applications using a particular device. For example, a JES2 (job entry system 2) checkpoint data set may get reserved for long periods of time during initialization, but high availability applications need to be notified after only a few seconds if their I/O has not completed in order for the application to attempt an alternate device and still make transaction time requirements. However, new DASD characteristics further complicate the issue of choosing an MIH detection interval. The IBM 3990 DASD has internal error recovery functions that can take 30 seconds to complete. If an MIH condition is detected during this recovery, the host recovery actions can cause severe problems at the control unit. Thus, it is recommended for the 3990 DASD that system operators set a 30 second MIH interval. Additionally, other devices may be defined to the system as if they are DASD devices. An example of this may be the IBM 3995 optical devices; some operations on these devices require the mechanical removal and mounting of optical media, which can take several minutes. Complicating matters further, any time new devices are added to a computer system, the existing MIH customization information may need to be updated to ensure proper operation of such devices.
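The interval selection described above can be illustrated with a minimal Python sketch. The table layout, the names `CLASS_DEFAULTS`, `DEVICE_OVERRIDES` and `mih_interval_seconds`, and the lookup order are assumptions made for illustration only and do not represent any operating system interface; only the interval values (15 seconds for the DASD class, 30 seconds for the 3990, 3 minutes for the 3420, 20 minutes for the 3490E) come from the text.

```python
from typing import Optional

# Class-level default quoted in the text, in seconds.
# Other device classes would need their own entries (not given in the text).
CLASS_DEFAULTS = {"DASD": 15}

# Device-type recommendations quoted in the text, in seconds.
DEVICE_OVERRIDES = {
    "3990": 30,        # internal error recovery can take 30 seconds
    "3420": 3 * 60,    # tape, 3 minutes
    "3490E": 20 * 60,  # tape, 20 minutes
}


def mih_interval_seconds(device_class: str, device_type: Optional[str] = None) -> int:
    """Return the device-type recommendation if one exists, else the class default.

    Hypothetical helper: raises KeyError when neither table has an entry.
    """
    if device_type in DEVICE_OVERRIDES:
        return DEVICE_OVERRIDES[device_type]
    return CLASS_DEFAULTS[device_class]
```

A table of this kind has to be maintained by hand whenever a device type is added or replaced, which is the burden the surrounding discussion describes.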
An additional problem with operating system MIH handlers is that the MIH times are too long. As discussed above, MIH times need to be set by the customer for each device type based on the characteristics of that device. If the longest commands that can be executed are expected to take 20 minutes (e.g. a rewind/unload), then all hang conditions are detected only after the 20 minute interval, even though most simple data transfer commands can be expected to execute in seconds. Elongated error detection times impact the customer, in that they degrade system reliability and availability. If a new device with improved technology is substituted for the old device, the MIH time must be manually adjusted to meet the new requirements.
Additionally, with today's devices and MIH capability, all commands are timed at the same MIH interval. This applies also to special control and recovery commands that are used by the operating system during recovery and reconfiguration actions (e.g. set path-group-ID, sense path-group-ID, reset allegiance, assign/unassign, control access). When these commands are issued by the operating system, critical system resources may be held, which may delay the execution of other normal customer work. The addition of special timer code for the recovery of these commands is extremely costly to the development of the system and increases the cost of the product.
As described above, computer systems require the customer to manually set the MIH times based on the physical characteristics of the device. For example, the customer is responsible for knowing that an S/390 3995 Optical Library device is really defined as an S/390 DASD 3380 device and that the MIH intervals must be set high enough so that MIH conditions are not detected for normal staging/destaging of the optical media. Additionally, if a set of tape drives is added or upgraded, the MIH times need to be adjusted based on the speeds and capacities of the tape drives. This manual process is error prone. If the adjustment is accidentally omitted, false MIH conditions may be detected and jobs may fail.
Also, MIH specifications need to be synchronized with physical I/O configuration definitions, and updated across system configuration changes done both dynamically as well as by system restart. If the MIH times are not updated correctly, system RAS will be degraded. The fact that customers have to be aware of the MIH detection for different devices adds to the cost of systems management and the overall cost of computing.
System Environment: FIG. 5A shows a multiplicity of hosts (510). Each host is a general purpose computer system containing one or more central processing units (CPU (511)), responsible for executing programs consisting of central processor instructions, and an I/O channel sub-system (512) responsible for executing channel programs and managing the transfer of information over one or more channel paths (513) between the host (510) and one or more I/O subsystems (520). In the preferred embodiment, host computers are IBM S/370 or IBM S/390 computer systems attached to I/O devices via ESCON or OEMI I/O channel interfaces. However, the computers may be of any type and may in fact be a multiplicity of types. Similarly, the channel paths may be a multiplicity of types, provided the interface is supported by the attaching host and I/O subsystem. The topology of the channel paths is potentially unique to the type of I/O interface.
Each I/O subsystem (520) consists of a control unit (521) responsible for managing one or more devices (530) connected to one or more hosts (510). I/O devices are attached to the control units via one or more device paths (531) that are supported by the devices and the control units for the communication of information. In general, the control unit adapts the I/O interface supported by the device (i.e. device paths (531)) to the I/O interface supported by the host (i.e. channel paths (513)).
Internal to each control unit (521) are facilities which are used to manage the interaction between the multiplicity of hosts and the multiplicity of devices. Each channel path is attached to a channel adapter (524) within the control unit which contains the facilities required to communicate on the associated channel path. A shared memory (522) is present in each I/O subsystem (520) that is accessible by I/O processing elements within the I/O sub-system that control the channel adapters (524). This shared memory contains a block of information associated with each device which is referred to as "device n lock data" (523).
Within each host (510), an OS program is executed by any of its CPUs (511) which performs the operations to cause the channel subsystem (512) to issue I/O signals to a selected device (530) attached to a selected channel path or set of channel paths (513). The OS program is designed such that it monitors the duration of the I/O operation from the time the request is presented to the channel subsystem (512) until a response is received from the channel subsystem (512) indicating that the I/O operation has completed. If the elapsed time of an I/O operation exceeds some threshold, the program detects a missing-interrupt timeout as described in prior U.S. Pat. No. 5,388,254 assigned to the same assignee. This MIH program function is intended to detect I/O operations that have failed to complete due to some unreported condition, thereby avoiding an indefinite suspension of processes that depend on the completion of the I/O operation.
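The monitoring just described can be sketched as follows. This is an illustrative Python model only; the record layout `ActiveRequest`, the function `find_missing_interrupts`, and the device names in the usage example are hypothetical and are not drawn from the referenced patent or from any OS implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ActiveRequest:
    device: str        # hypothetical device identifier
    start_time: float  # when the request was presented to the channel subsystem (s)
    limit: float       # missing-interrupt threshold for this device (s)


def find_missing_interrupts(active: List[ActiveRequest], now: float) -> List[ActiveRequest]:
    """Return the requests whose elapsed time exceeds their threshold.

    A request appears here when no completion response has been received
    within the limit; the cause (lost interrupt, hung device, or queuing
    behind other hosts) is not visible to this check.
    """
    return [r for r in active if now - r.start_time > r.limit]
```

For example, with one DASD request limited to 15 seconds and one tape request limited to 20 minutes, both started at time 0, a scan at time 20 reports only the DASD request.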
Within the channel subsystem, an I/O request can be queued for a selected device. The I/O request causes a communication to be initiated over a channel path between the channel subsystem and a selected control unit as a result of queuing the I/O request while waiting for a requested device to perform the request. While an I/O request is being processed by the channel subsystem, the OS program is allowed to continue execution of other work. At the completion of the I/O operation, the channel subsystem interrupts the OS program to present the status of the completed I/O operation.
The control unit manages concurrent requests to each device it controls. If the control unit decides to allow an I/O request made by a host, the command is accepted from the channel and is processed for the device selected by the channel. If the control unit decides to not allow an I/O request made by a host because of concurrent activity, the command is rejected with a "busy" indication, causing the I/O request to be queued in the channel subsystem. When the control unit determines that it can perform a command for which it previously presented a busy indication, the control unit presents a "no-longer busy" indication to cause the channel subsystem to reissue the queued I/O request. The requesting OS program is not aware of this interaction except to the extent that its I/O request has not been signalled as having been completed. The algorithm normally used by a control unit to present a busy indication is discussed subsequently. This invention describes enhancements which increase system efficiency by allowing a reduction in the time limit used by the OS program to reliably detect missing interrupt signals when one or more OSs are making concurrent I/O requests to the same control unit.
Management of Concurrent I/O Requests: The control unit determines the number of concurrent I/O requests that can be in progress at the control unit for a given device. Often, devices have a requirement that I/O requests be serialized to ensure predictable results on a medium handled by the device. Other design constraints within the control unit may also place limits on the number of requests allowed to concurrently be performed for a device.
FIG. 5B shows a process that can be employed to limit to one the number of concurrent I/O requests accepted, which causes a serialized execution of concurrent requests at the device. This policy is enforced by a device lock protocol used by the channel adapters, in which the adapter performs an atomic "test and set" operation on a lock associated with the requested device in the lock data block (523) for the device. This lock test and set protocol is performed before beginning any I/O operation by the device. A channel adapter is successful when it obtains the lock for the needed device (by setting a lock bit associated with the device). If the test and set operation is successful (finds the device is available), the I/O operation is accepted and processed by the device. If the test and set operation is unsuccessful (the device is not available), the I/O request of the channel adapter is presented with a busy indication. At the completion of the device operation, each channel adapter that was signalled a busy indication is then signalled a "no longer busy" indication, so that I/O requests queued in the channel subsystem can again be reissued for the device. This cross channel adapter communication is indicated by the dotted line in FIG. 5B. The implementation of the presentation of the "no longer busy" indication may require consideration of "fairness" mechanisms to prevent certain hosts from continually preventing other hosts from accessing the device. Variations of this method may be provided for different interface architectures (e.g. SCSI untagged and tagged queuing) where a serialization queue is built in the control unit instead of using a channel subsystem queuing capability.
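The lock protocol of FIG. 5B can be modeled, under stated assumptions, by the following Python sketch. The class `DeviceLock`, its method names, and the use of a software mutex to stand in for an atomic test and set operation on shared memory are all hypothetical; no fairness mechanism is modeled, and every adapter that was presented busy is simply reported for a "no longer busy" indication on release.

```python
import threading
from collections import deque
from typing import List, Optional


class DeviceLock:
    """Hypothetical model of the lock held in one device's lock data block."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()  # stands in for hardware atomicity
        self._owner: Optional[str] = None
        self._busy_adapters: deque = deque()  # adapters that were presented busy

    def test_and_set(self, adapter: str) -> bool:
        """Try to obtain the lock. False means the request is presented busy."""
        with self._mutex:
            if self._owner is not None:
                if adapter not in self._busy_adapters:
                    self._busy_adapters.append(adapter)
                return False
            self._owner = adapter
            return True

    def release(self, adapter: str) -> List[str]:
        """Free the lock; return the adapters to signal "no longer busy"."""
        with self._mutex:
            if self._owner != adapter:
                raise ValueError("only the lock holder may release the device")
            self._owner = None
            notified = list(self._busy_adapters)
            self._busy_adapters.clear()
            return notified
```

In this model the notified adapters would each cause their channel subsystem to reissue the queued request, at which point they contend for the lock again through `test_and_set`.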
Bounding of Queuing Durations: A problem that arises with the busy/no longer busy method previously discussed for serialized concurrent I/O requests is that the time required to execute an I/O operation from the host's perspective is NOT a function of time needed for execution of the requested I/O operation at the control unit (CU). That is, the execution time from the host OS perspective is the actual CU/device execution time plus waiting time (during which the CU is executing other intervening I/O operations for other hosts). In effect, without OS knowledge of all I/O operations in the queue for the device, it is not possible to determine the duration of a requested I/O operation for the purpose of determining an appropriate missing-interrupt-timeout value. Given some degree of fairness in the resolution of concurrent accesses and some bound on the number of concurrent requests, a statistical analysis can be performed to pick a duration which will have a high probability of ensuring that the failure to detect the completion of an I/O operation is due to some failure condition and not as a result of concurrent access requests.
For example, assume a given disk device normally executes any command in less than 10 milliseconds. If most I/O requests have no more than 10 commands (10 I/O channel instructions), and there are generally no more than 10 hosts that will get relatively equal service, then multiplying the 10 milliseconds * 10 commands * 10 hosts gives an expectation that an I/O request should take no longer than about 1 second. This number would then be increased by some factor to handle exceptional conditions within some high degree of probability, say to 15 seconds.
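The arithmetic of this example can be written out as a short Python sketch. The function names are hypothetical, and the safety factor of 15 is inferred from the step from about 1 second to 15 seconds; the text itself says only that the number is increased "by some factor".

```python
def expected_request_ms(command_ms: int, commands_per_request: int, hosts: int) -> int:
    """Upper estimate of one request's duration when hosts get equal service."""
    return command_ms * commands_per_request * hosts


def mih_timeout_ms(expected_ms: int, safety_factor: int) -> int:
    """Inflate the estimate to cover exceptional conditions (factor is assumed)."""
    return expected_ms * safety_factor


expected = expected_request_ms(command_ms=10, commands_per_request=10, hosts=10)
timeout = mih_timeout_ms(expected, safety_factor=15)
print(expected, timeout)  # 1000 15000, i.e. about 1 second and 15 seconds
```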
The problem is compounded when there is a wide variation in the expected execution durations of the I/O operations for a given device because the statistical analysis for predicting a resolution of concurrent requests must consider the worst case I/O operation execution times, further increasing the discrepancy between the duration of a "short" I/O request and the missing-interrupt timeout.
For example, assume that for a given disk device, a typical command normally executes in less than 100 microseconds, but an outboard copy command (copy the content of this disk to another disk) executes in less than 5 minutes. If we determine statistically that an I/O request will not be queued for longer than eight concurrent requests (i.e. this host gets a turn at least once out of every eight I/O requests processed), then we could estimate that, worst case, the queuing time is 7 × 5 minutes = 35 minutes. If this host is executing a typical command, the missing interrupt timeout estimated as the sum of the queuing time and the command execution time would be 35 minutes and 0.0001 seconds, or alternatively, about 35 minutes. One might also consider the probability of having seven different hosts issue seven concurrent outboard copies and arrive at a conclusion that a smaller timeout limit than 35 minutes is possible, say n minutes where n < 35, based on the probability of all the sharing systems initiating a full copy at the same time.
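The worst-case estimate in this example can be reproduced with the following Python sketch. The function names are hypothetical; the inputs (eight concurrent requests, a 5 minute outboard copy, a 100 microsecond typical command) are those of the example.

```python
def worst_case_queuing_seconds(max_concurrent_requests: int,
                               longest_command_seconds: float) -> float:
    """This host waits behind at most (N - 1) other requests, each assumed
    to be the longest command the device supports."""
    return (max_concurrent_requests - 1) * longest_command_seconds


def missing_interrupt_timeout_seconds(queuing_seconds: float,
                                      own_command_seconds: float) -> float:
    """Timeout estimate: queuing time plus the host's own command time."""
    return queuing_seconds + own_command_seconds


queuing = worst_case_queuing_seconds(8, 5 * 60)                # 2100 s = 35 minutes
timeout = missing_interrupt_timeout_seconds(queuing, 100e-6)   # about 35 minutes
```

The result shows that the timeout is dominated almost entirely by the queuing term, so the short command inherits a limit sized for the longest command.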
Often, the program must perform additional I/O when a missing interrupt is detected. These I/O requests may be just as likely to encounter a queuing problem and consequently must have the same missing interrupt timeout applied. For the case where the device is in fact broken and is no longer capable of responding to the host, it is easy to see where it could take tens of minutes before the program comes to the conclusion that the job has failed and must be rerun.
In certain environments, the program may not be able to wait for the duration of time prescribed by such statistical methods and still meet its requirements for real time processing. In other cases, the missing interrupt timeout and resulting recovery is of such duration as to create operational difficulties (e.g. processing does not complete within required windows).
Prior MIH detection systems have not worked well for single OS data processing systems in which an MIH process can be the source of error indications when certain scenarios happen. For example, a false indication would occur in a scenario where a long I/O command (a command requiring a long period of I/O operation) was issued to an I/O control unit as the last command of an I/O program, and before the last I/O command is completed, another I/O program is attempted to be initiated by the OS with a short command to the same device. In the prior system, the long command would signal partial completion when the long command is accepted at the control unit. This caused the next request to the I/O device to wait until the device completed operations for the first request. (To those skilled in the prior S/390 I/O architecture, this is known as redriving on primary status, which allows prompt termination of a job and the initiation of a new job following a tape rewind/unload command). If the second I/O request were allowed to start before the first request finished, and the operating system tried to adjust the MIH timeout for the presumed active short command, a false indication would be detected because of the lack of OS knowledge concerning the execution time for the previous long command. This short timeout then would falsely indicate a missing interrupt. No interrupt was actually missing because the short command was not yet started by the device due to the device still performing the prior long command (and neither of these commands could yet provide any completion interrupt).
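The false-indication scenario can be illustrated with a small timing model in Python. The function name and all numeric values in the usage example are hypothetical; the model assumes only that the device serializes the two commands and that the timeout for the short command starts when the OS issues it.

```python
def is_false_mih_indication(long_remaining_seconds: float,
                            short_exec_seconds: float,
                            short_timeout_seconds: float) -> bool:
    """Return True when the timeout chosen for the short command expires
    before its completion interrupt can possibly arrive.

    The short command cannot start until the earlier long command finishes,
    so its completion is due only after both durations have elapsed. No
    interrupt is actually missing in that case.
    """
    earliest_completion = long_remaining_seconds + short_exec_seconds
    return short_timeout_seconds < earliest_completion
```

For instance, with 600 seconds of a rewind/unload remaining, a 1 second short command, and a 15 second timeout, the function returns True; with a 1200 second timeout it returns False.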