1. Field of the Invention
The present invention relates to a method, system, and article of manufacture for recovering from grouped logical path failures.
2. Description of the Related Art
In certain computing environments, a host computer may communicate with a storage control unit, where the storage control unit controls physical storage. The physical storage that is controlled by the storage control unit may be represented logically as a plurality of logical path resources within the storage control unit. Applications in the host computer may perform input/output (I/O) operations with respect to the logical path resources of the storage control unit. For example, an application in the host computer may write to logical path resources of the storage control unit. The storage control unit may maintain a correspondence between the logical path resources and storage media in the physical storage via logical and physical volumes. While data may be physically written to the storage media in the physical storage under the control of the storage control unit, as far as an application in the host computer is concerned, the application performs write operations with respect to the logical path resources in the storage control unit.
Logical path resources may be added, deleted, or otherwise modified within the storage control unit. Certain modifications to the logical path resources of the storage control unit, such as addition of a logical path resource when no path resources are available, may cause a failure of I/O operations that are sent from the host computer to the storage control unit.
It is known for a host to use logical paths to communicate with a storage controller. A host usually has multiple paths to access devices in a storage controller. The multiple path capability of a host comes into play after the host system performs an initial program load (IPL) operation, and the logical paths are grouped per device in a logical subsystem. A host may group between two and eight logical paths to any given device of a logical subsystem of a storage controller.
As long as the logical paths are available during a host input/output (I/O) operation, there is no problem. However, if a logical path failure occurs, the host enters into a path failure mode of operation. A logical path failure can be temporary or permanent. A temporary logical path failure may last from a few milliseconds to one or two seconds. In certain systems, for direct connect links, any error that lasts less than one and a half seconds is considered to be a nonpermanent error. When a loss of light condition is detected, the channel starts a timer. If the link returns to an operational state within 1.5 seconds, the logical paths associated with that link are not removed. For switched links, the time-out period is the time needed for the state change to be propagated to the host from the switch. The hosts, such as a System 390 type host, then wait for 2 seconds before removing logical paths from the available paths.
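By way of illustration only, the timing rules described above may be sketched as a small classifier. The function name `classify_link_outage`, the constant names, the treatment of the switched-link wait as a fixed 2.0 seconds, and the handling of an outage of exactly the time-out length as temporary are all assumptions made for this sketch and are not taken from any channel implementation.

```python
# Assumed thresholds, taken from the figures quoted in the text.
DIRECT_LINK_TIMEOUT_S = 1.5    # direct connect links
SWITCHED_LINK_TIMEOUT_S = 2.0  # switched links (assumed fixed host wait)


def classify_link_outage(outage_seconds: float, switched: bool = False) -> str:
    """Return 'temporary' if the link recovered within the time-out, else 'permanent'.

    An outage lasting exactly the time-out is treated as temporary here;
    that boundary choice is an assumption of this sketch.
    """
    if outage_seconds < 0:
        raise ValueError("outage duration cannot be negative")
    timeout = SWITCHED_LINK_TIMEOUT_S if switched else DIRECT_LINK_TIMEOUT_S
    return "temporary" if outage_seconds <= timeout else "permanent"
```

In this sketch, an outage classified as permanent corresponds to the case in which the logical paths on the affected link would be removed from the available paths.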
When a permanent logical path error is identified, this condition essentially lasts forever as far as the host is concerned. The consequence is thus removal of the logical path from the available logical paths. For direct connect links, if the link is in a failure condition for over one and a half seconds, the channel removes all logical paths on that physical link. For switched links, the time-out period is approximately two seconds before the channel will begin removing logical paths. One result of a temporary or permanent failure is an inability of a host to access devices via the failed logical path. Because the host does not have any knowledge of the failure type, the host retries the I/O operation. For temporary failures, the host might be able to retry the I/O operation successfully and the host can continue performing I/O operations to the device. For either temporary or permanent failures, a host may exceed a predetermined number of allowed retries within the failure window, in which case the host removes the logical path from its working logical path mask.
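The retry behavior described above may be sketched as follows. This is a minimal illustration under stated assumptions: the retry limit of three, the representation of the working logical path mask as an integer bitmask, and the names `issue_io_with_retries` and `attempt_io` are hypothetical and do not describe any particular host implementation.

```python
from typing import Callable, Tuple

MAX_RETRIES = 3  # hypothetical; the text only says "a predetermined number"


def issue_io_with_retries(path_id: int, path_mask: int,
                          attempt_io: Callable[[int], bool],
                          max_retries: int = MAX_RETRIES) -> Tuple[bool, int]:
    """Try an I/O on one logical path; drop the path from the mask if retries run out.

    Returns (succeeded, updated_path_mask). The host does not know whether the
    failure is temporary or permanent, so it simply retries up to the limit.
    """
    if not path_mask & (1 << path_id):
        raise ValueError("path is not in the working logical path mask")
    for _ in range(1 + max_retries):  # first attempt plus the allowed retries
        if attempt_io(path_id):
            return True, path_mask
    # Retries exhausted: clear this path's bit in the working path mask.
    return False, path_mask & ~(1 << path_id)
```

A temporary failure that clears before the limit is reached leaves the mask unchanged, whereas a failure that outlasts the retries removes the path regardless of its underlying cause.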
When a host detects a logical path failure, the host enters a path discovery mode of operation. If the host continues to detect a logical path failure while in the path discovery mode of operation, the host removes the logical path from its logical path mask, and the host does not use the logical path again. For each failure the host detects on a logical path, the host enters the path discovery mode of operation and removes the path from its mask if the logical path fails in the discovery process. It is possible, and it has been observed, that a host loses access to a device because access fails via all the logical paths of a path group. In a System 390 type environment, this case is called a boxed device.
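The path mask bookkeeping described above, and the resulting boxed device condition, may be sketched as follows. The bitmask representation and the names `remove_path`, `apply_discovery_failures`, and `is_boxed` are assumptions introduced for illustration; only the limit of eight logical paths per group is taken from the text.

```python
from typing import Iterable

MAX_PATHS_PER_GROUP = 8  # the text gives two to eight logical paths per device


def remove_path(path_mask: int, path_id: int) -> int:
    """Clear the bit for a logical path that failed path discovery."""
    if not 0 <= path_id < MAX_PATHS_PER_GROUP:
        raise ValueError("path_id out of range")
    return path_mask & ~(1 << path_id)


def apply_discovery_failures(path_mask: int, failed_paths: Iterable[int]) -> int:
    """Remove every path that failed discovery; removed paths are not used again."""
    for path_id in failed_paths:
        path_mask = remove_path(path_mask, path_id)
    return path_mask


def is_boxed(path_mask: int) -> bool:
    """A device is boxed once no logical path in its path group remains."""
    return path_mask == 0
```

For example, a path group that starts with four paths (mask `0b1111`) and loses paths 0 and 2 still has two usable paths; losing paths 1 and 3 as well leaves an empty mask, which corresponds to the boxed device case.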
In a zSeries type environment, if a boxed device occurs on a system pack (for example, the IPL device), this condition can result in an outage for the host and can require another IPL operation. The IPL operation also clears the boxed device condition if paths are physically available. If the paths are not physically available, the IPL operation fails, which can result in an extended outage of the computing environment. If the boxed device occurs on an application volume, the device often must be unboxed manually by the operator and the application must be recovered. Unboxing a device can be accomplished on a z/OS type system via, e.g., a VARY PATH or VARY ON-LINE command if the paths are physically available.