The present invention relates to an apparatus for recovering errors and a method thereof for a subsystem which are intended to recover the device errors of a magnetic tape unit provided as an input/output unit. More particularly, the present invention relates to an apparatus for recovering errors and a method thereof for a subsystem employing a magnetic tape unit in which a host computer executes dynamic device reconfiguration to cope with a device error which occurs while data is being written in the magnetic tape unit.
In a subsystem employing a magnetic tape unit which is used as an external storage of a host computer, a magnetic tape control unit receives the input/output commands issued by the host computer, and performs writing or reading on the magnetic tape units which are subordinate thereto. Transfer of commands and data between the host computer and the magnetic tape control unit is executed asynchronously with transfer of commands and data between the magnetic tape control unit and the magnetic tape unit.
When an error occurs while the magnetic tape control unit is executing writing on the magnetic tape unit based on the writing commands sequentially issued by the host computer, the host computer receives an input/output interrupt of a machine check based on the error generation, and executes the error recovering process known as dynamic device reconfiguration.
In this error recovering process, the host computer saves, in the main storage, the data which has not yet been written and hence remains in the buffer memory of the magnetic tape control unit, discharges the cassette from the magnetic tape unit which has generated the error, and then designates another, normal magnetic tape unit on which the cassette is to be mounted. Once the operator has manually mounted the cassette on the designated normal magnetic tape unit, the host computer writes the data saved in the main storage onto the mounted magnetic tape.
In recent years, the data transferred to the magnetic tape control unit has come to be compressed before it is stored in the buffer memory. When the data is read, it is expanded and then transferred from the buffer memory to the host computer. When such compression and expansion of data are performed, the data remaining in the buffer memory must be expanded before it is transferred to the save area. Since the size of the save area reserved in the main storage for error recovery is limited, it may be impossible to reserve the enormous save area required to hold the expanded data. Alternatively, execution of other jobs may be restricted by the reservation of such an enormous save area.
FIG. 1 illustrates a conventional subsystem which incorporates magnetic tape units. Magnetic tape control units 30-1 and 30-2 are connected, via channel buses, to channels 20-1 and 20-2 of a host computer 10, respectively. Magnetic tape units 32-1 through 32-8, indicated by device ID #1 through #8, are connected to a device path extending from the magnetic tape control unit 30-1, and magnetic tape units 32-9 through 32-16, indicated by device ID #9 through #16, are connected to a device path extending from the magnetic tape control unit 30-2.
The host computer 10 has an input/output processing function implemented by an operating system (OS) 34 to issue input/output commands to the magnetic tape control units 30-1 and 30-2. Assuming that the operating system 34 is sequentially issuing, to the magnetic tape control unit 30-1, writing commands WR1 through WR4 directed to the magnetic tape unit 32-1 having device ID #1, the writing commands WR1 through WR4 are stacked in a command queue of the magnetic tape control unit 30-1, and the data (in units of blocks) are stored in a buffer memory 62-1. The operating system 34 ends each input/output command by issuing a channel end at the end of the transfer. Asynchronously with this access by the host computer to the magnetic tape control unit 30-1, the magnetic tape control unit 30-1 fetches the writing commands WR1 through WR4 in sequence from the command queue, and executes writing on the magnetic tape unit 32-1 having device ID #1.
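The queueing behavior described above can be sketched as follows. This is only an illustrative model, not the control unit's actual design: the class `TapeControlUnit`, its methods `accept_write` and `write_next`, and the block contents are hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TapeControlUnit:
    """Hypothetical model of a magnetic tape control unit such as 30-1."""
    command_queue: deque = field(default_factory=deque)  # stacked writing commands
    buffer: dict = field(default_factory=dict)           # buffered data blocks
    tape: list = field(default_factory=list)             # blocks already on tape

    def accept_write(self, command_id: str, block: bytes) -> None:
        """Host side: stack the command and store its block in the buffer."""
        self.command_queue.append(command_id)
        self.buffer[command_id] = block

    def write_next(self) -> Optional[str]:
        """Device side: fetch the oldest command and write its block to tape."""
        if not self.command_queue:
            return None
        command_id = self.command_queue.popleft()
        self.tape.append(self.buffer.pop(command_id))
        return command_id


mtc = TapeControlUnit()
for i in range(1, 5):                       # host issues WR1 through WR4
    mtc.accept_write(f"WR{i}", f"block{i}".encode())
first = mtc.write_next()                    # device side runs independently
```

After the single call to `write_next`, the block for WR1 is on tape while the blocks for WR2 through WR4 still remain in the buffer, which is the state an error recovery has to deal with.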
As an effective means of recovering from an error which occurs during data transfer between the magnetic tape control unit and the magnetic tape unit based on the writing commands or reading commands, a function called dynamic device reconfiguration is provided in the operating system 34 of the host computer 10. A machine check input/output interruption is generated when an error that cannot be recovered by the subsystem has occurred in the magnetic tape unit 32-1 while the magnetic tape control unit 30-1 is performing writing on the magnetic tape unit 32-1 based on the successive writing commands WR1 through WR4. Dynamic device reconfiguration is executed when the operating system 34 receives error information through the machine check input/output interruption. In this dynamic device reconfiguration, the data present in the buffer memory 62-1 of the magnetic tape control unit 30-1 and corresponding to, for example, writing commands WR2 through WR4 is first read and saved in a main storage 38. After the cartridge has been manually shifted by the operator from the magnetic tape unit 32-1 which has generated the error to a magnetic tape unit subordinate to the other magnetic tape control unit 30-2, e.g., the magnetic tape unit 32-9 indicated by device ID #9, the data is rewritten from the main storage 38 to the magnetic tape unit 32-9 to which the cartridge has been shifted. Consequently, all the data corresponding to the writing commands WR1 through WR4 are written on the magnetic tape medium of the shifted cartridge, achieving error recovery.
Dynamic device reconfiguration executed to recover from an error which occurs while data is being read from the magnetic tape unit requires only that the cartridge be shifted from the magnetic tape unit which has generated the error to another magnetic tape unit, and does not require the data to be saved in the main storage.
In subsystems employing magnetic tape units that have become available in recent years, the data transferred from the host computer to the magnetic tape control unit for writing is compressed and stored in the buffer memory, and the compressed data is written on the magnetic tape unit in order to increase the storage capacity thereof. Data which is read out from the magnetic tape unit and stored in the buffer memory of the magnetic tape control unit is expanded and then transferred from the buffer memory to the host computer.
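The write-error recovery sequence described above (save, shift the cartridge, rewrite) can be sketched as below. It is a minimal illustration under stated assumptions: the function `recover_write_error` and its callback parameters `mount_cartridge` and `write_block` are hypothetical, and the operator's manual step is represented by a callback that simply returns the substitute unit.

```python
def recover_write_error(pending_blocks, mount_cartridge, write_block):
    """Hypothetical sketch of write-error recovery by dynamic device reconfiguration."""
    # 1. Save the unwritten blocks from the control unit's buffer in main storage.
    save_area = list(pending_blocks)
    # 2. Wait until the operator has shifted the cartridge to a substitute unit.
    substitute = mount_cartridge()
    # 3. Rewrite the saved blocks, in order, on the substitute unit.
    for block in save_area:
        write_block(substitute, block)
    return substitute, len(save_area)


tape = {"32-9": []}                          # stand-in for the shifted cartridge
drive, count = recover_write_error(
    [b"WR2", b"WR3", b"WR4"],                # data left in the buffer memory
    mount_cartridge=lambda: "32-9",          # operator mounts on unit 32-9
    write_block=lambda unit, block: tape[unit].append(block),
)
```

The sketch makes the cost visible: everything still in the buffer has to pass through a save area in main storage before it can reach the substitute unit.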
In such a subsystem in which compression and expansion of the data are performed in the magnetic tape control unit, the data remaining in the buffer memory is expanded and the expanded data is saved in the main storage in the dynamic device reconfiguration conducted to recover an error which occurs during writing.
Assuming that 512K bytes of compressed data, compressed to about 0.2% of the size of the original data, are present in the buffer memory for writing, reading that data from the buffer memory after expansion requires an enormous main storage area of 256M bytes. If the data present in the buffer memory is not compressed, reading it from the buffer memory for saving requires a main storage area of the same size as the data, namely 512K bytes.
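The two byte figures in this example can be checked with a short calculation. The sketch below assumes that "K" and "M" denote the binary multiples 1024 and 1024 × 1024; the function name `save_area_bytes` is hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def save_area_bytes(buffered_bytes: int, expansion_factor: int = 1) -> int:
    """Main storage needed to save buffered data after it has been expanded."""
    return buffered_bytes * expansion_factor


compressed = 512 * KIB                       # data held in the buffer memory
expanded = 256 * MIB                         # save area quoted for expanded data
factor = expanded // compressed              # expansion implied by the two figures

with_compression = save_area_bytes(compressed, factor)
without_compression = save_area_bytes(compressed)
```

Under that assumption the two figures imply an expansion factor of 512, so the same 512K bytes of buffer content needs a save area 512 times larger when it holds compressed data than when it does not.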
When an enormous area is reserved in the main storage for dynamic device reconfiguration, other jobs may be interrupted during the error recovery process because they cannot use the main storage, thus reducing the processing capability of the entire system. Further, if the enormous main storage area to be used for reading the data cannot be reserved, the error recovery process is terminated abnormally (abend). In that case, the job in which the error occurred must be rerun. Rerunning the job generally involves several cartridges, and the resulting error recovery work lasts for many hours.
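The two outcomes described above can be illustrated with a small sketch. The exception `AbnormalEnd`, the function `reserve_save_area`, and the free-storage figures of 300M and 128M bytes are all hypothetical values chosen only for illustration.

```python
class AbnormalEnd(Exception):
    """Hypothetical stand-in for an abend of the error recovery process."""


def reserve_save_area(required: int, free: int) -> int:
    """Return the main storage left for other jobs, or abend if it cannot fit."""
    if required > free:
        raise AbnormalEnd(f"cannot reserve {required} bytes; only {free} free")
    return free - required


MIB = 1024 * 1024

# Case 1: the area can be reserved, but little storage is left for other jobs.
left_for_jobs = reserve_save_area(256 * MIB, 300 * MIB)

# Case 2: the area cannot be reserved, so the recovery ends abnormally.
try:
    reserve_save_area(256 * MIB, 128 * MIB)
    abended = False
except AbnormalEnd:
    abended = True
```

In the first case recovery proceeds while other jobs are squeezed into the remaining storage; in the second the recovery itself fails and the job has to be rerun.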