1. Field of the Invention
This invention relates to data storage subsystems and improvements providing data integrity within these subsystems. More particularly, this invention comprises methods for detecting, recording and properly treating data that has been corrupted before it reaches the subsystem.
2. Description of the Related Art
In recent years, the direction of the data processing industry has placed particular emphasis on on-line workstations, distributed processing and the introduction of information processing technology into many new application areas. As a result, there has been a corresponding increase in the use of on-line database systems and a growth in the requirement for storage capacity and increased reliability and flexibility in data storage devices.
This need has been met by Direct Access Storage Device (DASD) subsystems including storage controllers such as the IBM 3990 Model 3. The storage controller is installed between host processors and the DASD devices themselves. The controller acts not only as a path director for data flowing between the host and the DASDs but also as a performance enhancer for the data processing system as a whole. This second activity is accomplished through the use of cached memory within the storage controller.
The IBM 3990 Model 3 is an example of a storage controller having a cache function. This controller can attach to 370, 370-XA and ESA/370 channels, which are all well known in the art. When the controller is operating with 370 channels, it provides path-independent device allocation. When, on the other hand, the 3990 is operating with 370-XA and ESA/370 channels, it provides both path-independent device allocation and dynamic path reconnection.
Communication between the host processor and the storage controller is accomplished through the use of a set of commands sent through channel devices which direct the controller to process specific data at specific locations. For example, in the IBM System/360 and System/370 host environments, the CPU issues a series of commands identified in 360/370 architecture as Channel Command Words (CCWs) which control the operation of the associated DASD through the storage controller.
A data transfer operation is initiated by the host CPU generating a START I/O instruction which is passed to the channel and causes control to be relinquished to a chain of CCWs. The CCW chain is then sent over the channel to the storage controller so that control operations can be effected and the proper storage device can be selected, allowing the data transfer to occur.
Each CCW is separately resident in the CPU main store and must be fetched by the channel program, decoded and transferred to the storage controller. The CCW specifies the command to be executed and the storage area, if any, to be used.
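By way of illustration only, the fetching and decoding of a CCW chain described above can be sketched as follows. The sketch assumes the 8-byte System/370 CCW layout (an 8-bit command code, a 24-bit data address, a flag byte, an unused byte and a 16-bit count) and a command-chaining flag of 0x40. The names CCW, CC_FLAG and fetch_chain, and the command codes used in the usage lines, are hypothetical and chosen for this example; they are not taken from this disclosure.

```python
import struct
from dataclasses import dataclass
from typing import List

CC_FLAG = 0x40  # assumed command-chaining flag: another CCW follows


@dataclass
class CCW:
    command: int       # 8-bit command code
    data_address: int  # 24-bit main-store address of the storage area
    flags: int         # 8-bit flag byte
    count: int         # 16-bit byte count

    def pack(self) -> bytes:
        # First word: command code in the high byte, address in the low 24 bits.
        word0 = ((self.command & 0xFF) << 24) | (self.data_address & 0xFFFFFF)
        # Second word: flag byte, one unused byte, 16-bit count (big-endian).
        return struct.pack(">IBBH", word0, self.flags, 0, self.count)

    @classmethod
    def unpack(cls, raw: bytes) -> "CCW":
        word0, flags, _unused, count = struct.unpack(">IBBH", raw)
        return cls(word0 >> 24, word0 & 0xFFFFFF, flags, count)


def fetch_chain(main_store: bytes, start: int) -> List[CCW]:
    """Fetch and decode consecutive CCWs until one lacks the chaining flag."""
    chain = []
    addr = start
    while True:
        ccw = CCW.unpack(main_store[addr:addr + 8])
        chain.append(ccw)
        if not ccw.flags & CC_FLAG:
            break
        addr += 8  # the next CCW occupies the following doubleword
    return chain


# Usage: two chained CCWs placed at address 0 of a toy main store.
program = CCW(0x07, 0x001000, CC_FLAG, 6).pack() + CCW(0x05, 0x002000, 0, 512).pack()
chain = fetch_chain(program, 0)
```

Each decoded entry carries the command to be executed and the storage area to be used, which is the information the channel passes on to the storage controller.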
When data is transferred between a host computer and a group of DASD devices through a storage controller, the data is transferred in a specific format comprising variable length records separated by fixed time gaps between the records and between the fields within the records. Each record typically includes several fields.
One of the problems inherent in large data processing systems is data integrity. This is especially true in large systems containing a storage subsystem with a DASD controller. Although various schemes, most often involving Error Checking and Correction (ECC), have been employed to detect data invalidity internal to various components (i.e. to the storage controller or to the host processor), there has, heretofore, been no satisfactory solution to data integrity problems that are created by data transfers between system components.
For example, when the host processor transfers data records through the channel to the storage controller for eventual storage on DASD, it is entirely conceivable that during that data transfer, a fault can occur. This can be due to a loss of power to the host processor or various system or software operating system failures such as a failed retry. Any of these events can cause the channel to transmit a Halt I/O, Selective Reset or System Reset command to the storage subsystem during data transfer of CCWs with or without user data.
If such a fault occurs during the transmission of user data, the user data can be truncated or improperly transmitted at the time of the fault. Worse yet, the data is never marked as bad so that during a subsequent read of the invalid data, neither the storage subsystem nor the host processor, when it receives the data, will be aware of any problem with the data.
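The problem can be illustrated with a deliberately simplified model. In the sketch below, the class NaiveSubsystem, its methods and the interrupted_after parameter are hypothetical names introduced for this example only; the model assumes a subsystem that stores whatever bytes have arrived when a reset interrupts the transfer and keeps no validity indication with the record.

```python
class NaiveSubsystem:
    """Toy model: stores the bytes received, with no validity marker."""

    def __init__(self):
        self.records = {}

    def write(self, record_id, data, interrupted_after=None):
        # interrupted_after models a reset arriving mid-transfer:
        # only the bytes received before the fault are stored.
        if interrupted_after is None:
            received = data
        else:
            received = data[:interrupted_after]
        self.records[record_id] = received

    def read(self, record_id):
        # Returns the stored bytes only; nothing tells the caller
        # whether the record is complete.
        return self.records[record_id]


subsystem = NaiveSubsystem()
subsystem.write("R1", b"ACCOUNT=1234;BALANCE=500", interrupted_after=12)
result = subsystem.read("R1")
```

The read completes normally and returns the truncated record, so neither the subsystem nor the host has any basis for treating the data as invalid.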
It should also be noted that various other events can take place within the data storage subsystem because of an external problem that would have the effect of producing invalid data within the storage subsystem without any indication that the data was, indeed, invalid. Such events include a situation where a retry fails or is not attempted, such as in the case of a channel overrun or a Bus Out Parity Check. Moreover, in the case of a Dual Copy under error, where the storage controller is directed to proceed with a dual copy even if the data cannot be read without error from the primary device, an exception can occur which will have the effect of damaging data.
It thus becomes necessary for the storage controller to handle these exception conditions in such a way that permits orderly control and recording of data defects as a result of exceptions. It is also desired that the overhead associated with the detection and recordation of invalid data be minimal both in terms of processing efficiency and freedom from hardware modifications. Moreover, it is important that the method be accurate in detection of invalid data.
This disclosure presents two solutions to the above-mentioned problem, each described as a preferred embodiment. The first solution requires some hardware modification because it changes the data format, but it provides accurate and efficient identification of bad data. Furthermore, it allows the storage controller to notify the host processor, via the channel interface, of corrupted data which is transmitted as a result of a read request by the host.
The second solution requires only a microcode change within the storage controller. It uses the storage controller's ability to operate asynchronously in order to solve the above-mentioned problem.