1. Field of the Invention
The present invention relates to a method, system, and article of manufacture for recovering from ungrouped logical path failures.
2. Description of the Related Art
In certain computing environments, a host computer may communicate with a storage control unit, where the storage control unit controls physical storage. The physical storage that is controlled by the storage control unit may be represented logically as a plurality of logical path resources within the storage control unit. Applications in the host computer may perform input/output (I/O) operations with respect to the logical path resources of the storage control unit. For example, an application in the host computer may write to logical path resources of the storage control unit. The storage control unit may maintain a correspondence between the logical path resources and storage media in the physical storage via logical and physical volumes. While data may be physically written to the storage media in the physical storage under the control of the storage control unit, as far as an application in the host computer is concerned, the application performs write operations with respect to the logical path resources in the storage control unit.
Logical path resources may be added, deleted, or otherwise modified within the storage control unit. Certain modifications to the logical path resources of the storage control unit, such as addition of a logical path resource when no path resources are available, may cause a failure of I/O operations that are sent from the host computer to the storage control unit.
There are instances in which logical paths are not grouped. For example, at system initial program load (IPL) time, not all logical paths from a host being loaded are grouped. It has been observed that, when logical paths are ungrouped, a single point of failure (the failure of one logical path) can prevent an operating system (such as the z/OS operating system available from International Business Machines, Inc.) from loading even if the other defined logical paths are stable.
To IPL a z/OS system attached to a storage controller, a customer often must have a physical path infrastructure in place between the host and the storage controller. The z/OS operating system is a multipath-capable operating system and so there are generally between two and eight logical paths to any given device on a storage controller subsystem.
To IPL a host, a customer first attaches a storage controller to a processor using several physical paths and then proceeds to the processor hardware management console (HMC) to initiate the IPL. The customer selects a single system residence volume (SYSRES) and an Input Output Definition File (IODF) device accessible in the Input Output Configuration Data Set (IOCDS), enters these devices as the Load address and Load parameters, and actuates LOAD to IPL the system.
Nucleus initialization processing starts executing and the z/OS host selects the first logical path of its available logical paths to start the IPL process. The host uses the logical path to access the production IODF device that contains the I/O configuration data that the host uses to IPL the system.
As long as the logical path is available during IPL of the host, the system loads properly. However, if there is a logical path failure, the host enters a failure mode of operation. A logical path failure may be temporary or permanent. A temporary logical path failure may last from a few milliseconds to one or two seconds. For direct connect links, any error that lasts under 1.5 seconds is considered a nonpermanent error. When a loss of light condition is detected, the channel starts a 1.5-second timer. If the link comes back within 1.5 seconds, the logical paths are not removed.
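The direct connect timer rule can be illustrated with a minimal Python sketch. The function and constant names are hypothetical, and treating an outage of exactly 1.5 seconds as permanent is an assumption, since the text does not state the boundary case.

```python
# Illustrative sketch only: names are hypothetical, not part of any channel firmware.
DIRECT_LINK_TIMER_S = 1.5  # timer the channel starts when loss of light is detected


def direct_link_paths_kept(outage_s: float) -> bool:
    """Return True if the logical paths survive an outage of `outage_s` seconds.

    Assumption: an outage of exactly 1.5 s counts as timer expiry.
    """
    if outage_s < 0:
        raise ValueError("outage duration cannot be negative")
    # The link came back before the timer expired, so the error is nonpermanent.
    return outage_s < DIRECT_LINK_TIMER_S


if __name__ == "__main__":
    for outage in (0.005, 1.0, 2.0):
        kept = direct_link_paths_kept(outage)
        print(f"outage {outage:>5} s -> logical paths {'kept' if kept else 'removed'}")
```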
For switched links, the time-out period is the time it takes for the state change to be propagated to the host from the switch. For example, hosts (such as the 390 hosts available from International Business Machines) wait for 2 seconds after the state change before they remove logical paths.
A permanent logical path error does not clear on its own, and its consequence is the removal of the logical path. For direct connect links, if the link is down for over 1.5 seconds, the channel will remove all logical paths on that physical link. For switched links, the time-out period is approximately 2 seconds before the channel will begin removing logical paths. One result of a temporary or permanent failure is the inability of a host to access the IODF device through the failed logical path. Since the host does not know the failure type, the host retries the I/O. For temporary failures that last a few milliseconds, the host might be able to retry the I/O successfully, and the host can continue its IPL process. For temporary failures that last seconds, a host may run out of retries within the failure window, and the host stops its IPL process. For permanent failures, a host may run out of retries, and the host stops its IPL process. After the host recovery is exhausted, the host aborts its IPL process and enters a disabled wait state.
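The three retry outcomes described above can be modeled with a short Python sketch. The retry count and retry interval are hypothetical values chosen for illustration; the text does not specify how many retries a host performs or how they are spaced.

```python
from typing import Optional


def ipl_outcome(
    failure_duration_s: Optional[float],
    max_retries: int = 5,           # hypothetical retry budget
    retry_interval_s: float = 0.1,  # hypothetical spacing between retries
) -> str:
    """Model a host retrying I/O on a single failed logical path.

    `failure_duration_s` is how long the path stays down; None models a
    permanent failure. The host cannot tell the two apart, so it simply
    retries until it succeeds or its retries are exhausted.
    """
    for attempt in range(1, max_retries + 1):
        elapsed_s = attempt * retry_interval_s
        path_recovered = (
            failure_duration_s is not None and elapsed_s >= failure_duration_s
        )
        if path_recovered:
            return "ipl continues"
    # Host recovery is exhausted: the IPL is aborted.
    return "disabled wait state"


if __name__ == "__main__":
    print(ipl_outcome(0.005))  # failure of a few milliseconds
    print(ipl_outcome(1.0))    # failure of about a second
    print(ipl_outcome(None))   # permanent failure
```

With these assumed defaults the retry window is 0.5 seconds, so a failure of a few milliseconds clears before the first retry, while a one-second or permanent failure outlasts the window.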
After an aborted IPL, the customer often must then spend time analyzing the wait state. One current solution to address an aborted IPL is to simply retry the IPL. There is a chance, however, that the IPL will fail again because of the same I/O error. This presents an issue because, after two failures, the customer will be hesitant to try a third time without initiating a customer support contact, which can dramatically prolong the outage. Another possible solution is to identify the failing logical path, configure the logical path off-line, and retry the IPL. An issue with this solution is that the customer may be required to generate a stand-alone dump and rely on support to analyze the dump and identify which logical path is causing the problem. Again, this process could extend the system downtime.
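The effect of the second workaround can be sketched in Python under simplifying assumptions: the host is modeled as always using the first online logical path, and the `LogicalPath` class, its fields, and the path names are hypothetical. The sketch does not model the difficult part, which is identifying the failing path in the first place.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LogicalPath:
    name: str             # hypothetical label for the path
    online: bool = True   # whether the path is configured on-line
    healthy: bool = True  # whether I/O over the path currently succeeds


def select_ipl_path(paths: List[LogicalPath]) -> Optional[LogicalPath]:
    """Return the first online logical path, mirroring the described behavior."""
    for path in paths:
        if path.online:
            return path
    return None


def ipl_succeeds(paths: List[LogicalPath]) -> bool:
    """The IPL depends entirely on the single path that was selected."""
    selected = select_ipl_path(paths)
    return selected is not None and selected.healthy


if __name__ == "__main__":
    paths = [LogicalPath("path-0", healthy=False), LogicalPath("path-1")]
    print(ipl_succeeds(paths))  # first path is failing, so a plain retry fails again
    paths[0].online = False     # configure the failing path off-line
    print(ipl_succeeds(paths))  # the next path is selected and the IPL can proceed
```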
To address this issue, certain customers have installed automated solutions such as the Geographically Dispersed Parallel Sysplex/synchronous mirroring technology (GDPS/PPRC) available from International Business Machines, which uses Base Control Program internal interface (BCPii) type automation to IPL systems. However, automated IPL solutions can also fail because of a single point of failure, which could then require the customer to identify the problem and intervene manually. The customer would then have suffered an extended outage and lost faith in the automated software solution, since manual intervention was necessary.