1. Field of the Invention
This invention relates to computers and computer system complexes, and operating systems for controlling them. More particularly, this invention describes a system and method for detecting errors in switching actions affecting I/O devices connected to main processors, and recovering from such errors.
2. Background Art
As data processing needs of system users grow, the number of peripheral devices supported by a data processing system also grows. Multiple data processing applications require a plurality of various peripheral devices. The term "device", as used herein, includes such components, whether or not directly addressable, as control units, peripheral cache memories, communications apparatus, data storage units such as direct access storage devices (DASD), tape recorders, and the like. It also includes separately maintainable portions thereof as will become apparent. These devices occasionally need maintenance of a type requiring the device to be temporarily disconnected from the data processing system.
The maintenance and switching of peripheral devices has become more difficult as data processing systems have become more complex. Often peripheral devices are in rooms or on floors of a building separated from the connected central processing units and device controllers. The maintenance of a particular peripheral device, or of a section of devices under the control of one control unit, requires coordination between the operator at the system console and the maintenance personnel at the control units and/or devices. When a maintenance action is required on a device, the central processing units (CPUs) must first be informed that the maintenance is to take place. Information about the extent of maintenance must be provided to the CPUs so that they can take the necessary action to quiesce, i.e., cease using, the subchannels and channel paths that relate to the portion of the device that is to be maintained "off-line". This action is necessary to maintain data integrity.
The data integrity exposure arises because communication with a device or device path must be terminated logically before the customer disables channel interfaces, restarts storage paths, re-IMLs control units, or performs any maintenance action that will cause a reset to occur. If this is not done, a device reservation (the hardware feature that allows multiple systems to serialize the use of data on a device) may be lost. If this happens, a partially updated record or data set could be accessed by a sharing system, allowing the data to be corrupted. An integrity exposure could also arise if a system used a certain device number/UCB (Unit Control Block) to refer to two different physical devices at two different times without having the system operator take the paths off-line (conventionally with a VARY command) before switching, so that successive I/O operations go to different devices.
The IBM Dynamic Pathing feature does not eliminate the integrity exposure due to the system reset. The Dynamic Pathing feature of, for example, the IBM 3880 Control Unit and 3380 devices, and the IBM 3990 Control Unit and 3380 devices (reference: IBM 3990 Storage Control Reference, GA32-0099) allows MVS to establish system-related reserves to devices. This means that every on-line path is part of a path group and is assumed to contain the reserve whenever the device is reserved. MVS takes advantage of this feature during its channel path recovery actions. When the software must issue a system reset to a channel path, the operator does not have to be told to stop the sharing processors if all the devices on the channel path have dynamic pathing active and an alternative on-line path that contains the reserves. If there are no alternative paths, the operator is given the opportunity to stop the sharing systems before the data can be corrupted.
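The recovery decision described above can be illustrated with a minimal sketch. It is not MVS code: the Device class, its fields, and the function name are hypothetical, and the model assumes that each device's set of on-line paths stands for its path group.

```python
from dataclasses import dataclass


@dataclass
class Device:
    """Hypothetical model of one device reachable over a channel path."""
    number: str
    dynamic_pathing: bool      # True if dynamic pathing is active
    online_paths: frozenset    # channel path IDs assumed to form the path group


def operator_must_intervene(chpid, devices):
    """Return True if the operator should be given the chance to stop the
    sharing systems before channel path `chpid` is reset.

    Per the text, no intervention is needed only when every device on the
    path has dynamic pathing active and at least one other on-line path.
    """
    for dev in devices:
        alternates = dev.online_paths - {chpid}
        if not dev.dynamic_pathing or not alternates:
            return True
    return False
```

For example, a device whose only on-line path is the one being reset forces intervention, whereas a device with dynamic pathing active and a second on-line path does not.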
If maintenance is to be performed on a device or control unit, the operator, or CE, must explicitly issue the proper reconfiguration commands to each attached system. This tells the systems that the path is not available for use and that the path does not belong to the path group and does not contain the reserve. If this is not done, the software may issue a system reset to recover a channel and not give the operator the opportunity to stop all the sharing processors. The reserve will then be lost, and the data corrupted.
Several steps must be taken to notify all of the central processing units or host systems of the maintenance action and to determine when the action can be performed. First, a service representative or other maintenance person determines the correlation between the physical portions of the device to be maintained and the device numbers and channel path identifiers that are affected for each attached CPU. Next, the service representative goes from CPU to CPU and enters appropriate reconfiguration commands at each CPU to quiesce the specified channel paths and I/O devices. Once a particular device has been electrically disconnected or logically isolated from the system of CPUs, the service representative then performs the required maintenance. Finally, upon completing the maintenance, the service representative goes from CPU to CPU and enters appropriate reconfiguration commands at each central processing unit to give notification that the just-maintained device is again available.
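The manual procedure above can be sketched as an ordered plan. This is an illustration only: the function name, the step tuples, and the sample CPU names and identifiers are hypothetical and do not represent actual reconfiguration command syntax. The mapping passed in corresponds to the correlation the service representative works out in the first step.

```python
def plan_maintenance(affected):
    """Build the ordered list of manual steps for one maintenance action.

    affected: dict mapping a CPU name to the list of
    (channel_path_id, device_number) pairs that reach the portion of the
    device to be maintained.
    """
    # Quiesce every affected path/device pair, one CPU after another.
    quiesce = [("quiesce", cpu, chpid, dev)
               for cpu, pairs in affected.items()
               for chpid, dev in pairs]
    # After maintenance, revisit each CPU and make the same pairs available.
    resume = [("resume", cpu, chpid, dev) for _, cpu, chpid, dev in quiesce]
    return quiesce + [("maintain",)] + resume


# Hypothetical correlation for two attached CPUs.
plan = plan_maintenance({
    "CPU_A": [("01", "0A0")],
    "CPU_B": [("07", "1A0")],
})
for step in plan:
    print(step)
```

The number of entries grows with the product of attached CPUs and affected paths, and every one of them is entered by hand, which is where an omitted or mistyped command can leave a path unquiesced.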
U.S. Pat. No. 4,195,344 discloses the use of a monitor center that permits the automatic supervision of the configuration of a data processing system. This patent is representative of the relevant art in that, if it is necessary to disconnect or reconnect devices during the operation of a data processing system for the purpose of maintenance, inspection, or repair, the operating system must be informed of the connection or disconnection by use of the identification number of the device. The operator communicates with the data processing system to report the disconnection and to order reconfiguration of the devices so that the data processing system can continue operation without them. This patent discloses a means for communicating system configuration to an operator to facilitate recognition of configuration errors. However, the criteria for defining an error are not stated, and the recognition is presumably to be done by the operator.
One technique by which a system could verify that all paths to a device actually accessed the proper device was to reread the volume label on the device. It was used in some instances, such as when an operator took a device off-line for maintenance and later reintroduced it. (The label would be initially read by MVS when the device was placed on-line, or made logically available for use.) This technique was limited in that it was appropriate for DASD devices only and was used to ensure that the correct DASD volume was still mounted; it therefore fell into disuse with the advent of nondemountable DASD devices such as the IBM 3380. The technique also depended on the installation preventing duplicate DASD labels.
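The label-reread check can be sketched as follows. The function name and the read_label callable are hypothetical stand-ins for the I/O that would read the volume label over a given path; the sample labels are invented for illustration.

```python
def mismatched_paths(recorded_label, read_label, paths):
    """Return the paths whose reread volume label differs from the label
    recorded when the device was placed on-line.

    read_label: callable taking a path ID and returning the label read
    over that path (a stand-in for the actual device I/O).
    """
    return [p for p in paths if read_label(p) != recorded_label]


# Hypothetical result of rereading the label over two paths after a device
# has been reintroduced: path "02" now reaches a different volume.
labels_seen = {"01": "VOL001", "02": "VOL002"}
bad = mismatched_paths("VOL001", labels_seen.get, ["01", "02"])
print(bad)
```

The sketch also makes the stated dependency visible: if two volumes carried the same label, the comparison would succeed even though a path reached the wrong device.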