This invention relates to a method for supporting recovery from a failure of a storage device in a computer system where a batch job is executed.
Generally, a batch job is executed on a computer system to process a large quantity of data collectively. The minimum data-set required for re-execution of jobs is retained on media such as magnetic tape. However, to reduce operation loads, shorten the operation time, and save resources such as magnetic disks and magnetic tapes, a large part of the data-sets is deleted or remains in external storage devices after processing.
After a failure occurs in an external storage device, it is generally impossible to confirm the recorded contents of that device. To select a procedure for recovering the external storage device, it is therefore necessary to trace a job control language (JCL) list that represents the execution history of the batch job, or a list of the media assigned to the data-sets, and to grasp comprehensively the relationship between the input and output data-sets among plural jobs. Because such processing is ordinarily performed by a person, the following problems arise. First, if an error is made partway through the tracing and arrangement, the relationship between an executed job and the subsequent transition of a data-set may be misunderstood. Accordingly, if a failure occurs after a large number of jobs have been executed, recovery from the failure is substantially impossible. Second, the contents of external storage devices are typically backed up to magnetic tape in a predetermined cycle so that the backup can be used for recovery when a failure occurs. However, because batch jobs are executed after the backup is acquired, the contents of the external storage devices at the time of failure differ from the contents at the time of backup. As a result, if the contents of the external storage devices are restored from a backup tape, a data-set having the same name may be created plural times, or a data-set that has been deleted may be restored. These factors delay recovery from a failure. If, to resolve these problems, a backup is acquired for each necessary data-set, the operation time is prolonged, the operation load increases, and maintainability worsens whenever a data-set or job is added or changed.
Further, even with such backups, procedures such as tracing the transition of a data-set that is changed by executing a batch job and grasping the correlation among a plurality of jobs cannot be omitted. Therefore, such backups contribute little to shortening the recovery time.
It is an object of the present invention to ensure that recovery from a failure of an external storage device can be easily realized regardless of the executed number of jobs.
To achieve the above object, the present invention provides a method for supporting recovery from a failure of a storage device in a computer system in which a batch job is executed on a central processing unit and in which input, output, and deletion of data-sets at the storage device are performed as a result of execution of the batch job. According to the present invention, for each executed job, the data-sets operated on by the job and the types of those operations are inspected on the basis of transition history information acquired during execution of the batch job, which includes information on the executed jobs and the data-sets operated on by those jobs. As a result of the inspection, jobs that should be executed in re-execution processing are extracted as direct re-execution jobs. For each data-set that has been operated on by a job extracted as a direct re-execution job, the operation type of the operation by that job and the operation types of the operations by other jobs are inspected. As a result of this inspection, a job that is necessary for execution of a direct re-execution job is extracted as an indirect re-execution job. For each data-set that has been operated on by a job extracted as a direct re-execution job or as an indirect re-execution job and that is managed in generations, a restoration generation number is determined for each operation on the data-set on the basis of the final generation of the data-set and the generation of the data-set in the relevant operation. Then, for each data-set that has been finally deleted, it is inspected whether the data-set has been operated on by a job extracted as a direct re-execution job or as an indirect re-execution job, and whether the storage devices that store the data-set include a failed storage device. The manner and timing of deletion of the data-set are determined in accordance with the result of this inspection.
Further, the operation history of each data-set that has been operated on by a job extracted as a direct re-execution job or as an indirect re-execution job is inspected. Based on the result of this inspection, the data-sets to be individually restored from a backup in advance of re-execution of the direct and indirect re-execution jobs are determined. Thereafter, in accordance with the results of this processing, information required to recover the failed storage device is output.
In one preferred embodiment of the present invention, the method includes a step of generating a job-data-set table that stores information relating to the data-sets of respective jobs, a job information table that stores information relating to execution of respective jobs as a batch job, and a data-set operation table that stores, for each data-set operation, information relating to that operation. Each of the processes, namely the extraction of a direct re-execution job, the extraction of an indirect re-execution job, the determination of a restoration generation number for a data-set, the determination of the manner and timing of deletion of a data-set, and the determination of a data-set to be restored, is executed by referring to at least one of the job-data-set table, the job information table, and the data-set operation table.
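As an illustration only, the three tables might be derived from the transition history roughly as follows. This is a minimal sketch: the record layout `(seq, job, dataset, gen, kind)`, the function name `build_tables`, and every field name are assumptions made for the example, not the table formats of the embodiment.

```python
from collections import defaultdict

def build_tables(history):
    """Build three lookup tables from transition history records.

    history: iterable of (seq, job, dataset, gen, kind) tuples in execution
    order, where kind is "input", "output", or "delete" (assumed layout).
    """
    job_dataset = defaultdict(list)  # job -> data-sets it touched
    job_info = {}                    # job -> execution information
    dataset_ops = []                 # one row per data-set operation
    for seq, job, dataset, gen, kind in history:
        job_dataset[job].append((dataset, gen, kind))
        info = job_info.setdefault(job, {"first_seq": seq, "last_seq": seq})
        info["last_seq"] = seq
        dataset_ops.append(
            {"seq": seq, "job": job, "dataset": dataset, "gen": gen, "kind": kind}
        )
    return dict(job_dataset), job_info, dataset_ops

history = [
    (1, "JOB1", "A", 0, "output"),
    (2, "JOB2", "A", 0, "input"),
    (3, "JOB2", "B", 0, "output"),
]
job_dataset, job_info, dataset_ops = build_tables(history)
print(job_dataset["JOB2"])  # [('A', 0, 'input'), ('B', 0, 'output')]
```

Keeping one row per operation in the data-set operation table lets the later steps compare operations on the same data-set name and generation by their position in the history.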
In the step of extracting a direct re-execution job, more specifically, a job that has outputted a data-set, which has not been deleted by any succeeding jobs, to the failed storage device is extracted as a direct re-execution job.
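The rule above can be sketched in a few lines. The `Op` record, its field names, and the volume labels are hypothetical; the sketch assumes each operation carries a sequence number giving its position in the execution history.

```python
from typing import NamedTuple

class Op(NamedTuple):
    seq: int      # position in the execution history (assumed field)
    job: str
    dataset: str
    gen: int
    kind: str     # "input", "output", or "delete"
    volume: str   # storage device holding the data-set

def direct_reexecution_jobs(ops, failed_volume):
    """Jobs whose output to the failed device was never deleted afterwards."""
    jobs = []
    for op in ops:
        if op.kind != "output" or op.volume != failed_volume:
            continue
        deleted_later = any(
            o.kind == "delete"
            and o.seq > op.seq
            and (o.dataset, o.gen) == (op.dataset, op.gen)
            for o in ops
        )
        if not deleted_later and op.job not in jobs:
            jobs.append(op.job)
    return jobs

history = [
    Op(1, "JOB1", "A", 0, "output", "VOL1"),
    Op(2, "JOB2", "A", 0, "input", "VOL1"),
    Op(3, "JOB2", "B", 0, "output", "VOL1"),
    Op(4, "JOB3", "A", 0, "delete", "VOL1"),
    Op(5, "JOB3", "C", 0, "output", "VOL2"),
]
print(direct_reexecution_jobs(history, "VOL1"))  # ['JOB2']
```

In this example JOB1 is not extracted because its output A is deleted later by JOB3, and JOB3 is not extracted because its output C resides on a device that did not fail.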
The step of extracting an indirect re-execution job is preferably carried out as follows. On the basis of information relating to operations on the data-sets that have been input or output by a job extracted as a direct re-execution job, for each data-set on which an input operation has been performed by the direct re-execution job, the operation type of each operation performed before or after the input operation on a data-set having the same name and the same generation as that data-set is inspected. If at least one output operation has been performed before the input operation and a deletion operation has been performed after the input operation, the job that executed the last output operation before the input operation is extracted as an indirect re-execution job. If output operations have been performed both before and after the input operation, the job that executed the last output operation before the input operation and the job that executed the last output operation are extracted as indirect re-execution jobs. If no output operation has been performed before the input operation, at least one output operation has been performed after the input operation, and the data-set output by that output operation has not been deleted, the job that executed the last output operation is extracted as an indirect re-execution job. Further, for each data-set on which an output operation has been performed by a job extracted as a direct re-execution job, the operations on a data-set having the same name and the same generation as that data-set are inspected. If any other output operation has been performed after the relevant output operation and the data-set having the same name and the same generation has not been deleted, the job that executed the last output operation is extracted as an indirect re-execution job.
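The case analysis above might be expressed as the following sketch. The `Op` record and its fields are hypothetical. Two interpretive assumptions are made and marked in the comments: a job already extracted as a direct re-execution job is not listed again, and "has not been deleted" is read as "no deletion follows the last output operation".

```python
from typing import NamedTuple

class Op(NamedTuple):
    seq: int      # position in the execution history (assumed field)
    job: str
    dataset: str
    gen: int
    kind: str     # "input", "output", or "delete"

def indirect_reexecution_jobs(ops, direct_jobs):
    found = []

    def add(op):
        # Assumption: a job already marked direct is not listed again.
        if op.job not in direct_jobs and op.job not in found:
            found.append(op.job)

    for op in ops:
        if op.job not in direct_jobs or op.kind == "delete":
            continue
        # Operations on the same data-set name and generation, in order.
        same = sorted(
            (o for o in ops if (o.dataset, o.gen) == (op.dataset, op.gen)),
            key=lambda o: o.seq,
        )
        outs_before = [o for o in same if o.kind == "output" and o.seq < op.seq]
        outs_after = [o for o in same if o.kind == "output" and o.seq > op.seq]

        def deleted_after(seq):
            return any(o.kind == "delete" and o.seq > seq for o in same)

        if op.kind == "input":
            if outs_before and deleted_after(op.seq):
                add(outs_before[-1])
            if outs_before and outs_after:
                add(outs_before[-1])
                add(outs_after[-1])
            # Assumption: "not deleted" means no delete after the last output.
            if not outs_before and outs_after and not deleted_after(outs_after[-1].seq):
                add(outs_after[-1])
        elif outs_after and not deleted_after(outs_after[-1].seq):
            add(outs_after[-1])
    return found

history = [
    Op(1, "JOB1", "A", 0, "output"),
    Op(2, "JOB2", "A", 0, "input"),
    Op(3, "JOB2", "B", 0, "output"),
    Op(4, "JOB3", "A", 0, "delete"),
]
print(indirect_reexecution_jobs(history, ["JOB2"]))  # ['JOB1']
```

Here JOB1 is extracted because it produced data-set A, which the direct re-execution job JOB2 reads and which was deleted afterwards, so A must be regenerated before JOB2 can run again.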
In the step of determining a restoration generation number, a restoration generation number is determined by determining, as to a respective operation on a data-set that has been operated on by a job extracted as a direct re-execution job or an indirect re-execution job, the difference between a final generation of the data-set and a generation relevant to the operation under inspection.
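A small worked example of this difference is given below. It is a sketch under two assumptions: the final generation of a data-set is taken to be the highest generation appearing in the history, and the difference is taken as final generation minus operation generation. The text fixes neither the sign convention nor how the final generation is obtained.

```python
def restoration_generation_numbers(ops):
    """ops: list of (dataset, generation) pairs, one per data-set operation."""
    final_gen = {}
    for name, gen in ops:
        # Assumption: the final generation is the highest one seen.
        final_gen[name] = max(gen, final_gen.get(name, gen))
    # Assumption: difference = final generation - generation of the operation.
    return [(name, gen, final_gen[name] - gen) for name, gen in ops]

ops = [("A", 1), ("A", 2), ("A", 3), ("B", 5)]
print(restoration_generation_numbers(ops))
# [('A', 1, 2), ('A', 2, 1), ('A', 3, 0), ('B', 5, 0)]
```

An operation on the final generation thus gets the number 0, and an operation on a generation two steps older gets 2.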
In the step of determining the manner and timing of deletion, each data-set that has not been operated on by a job extracted as a direct re-execution job or an indirect re-execution job, that has been output to the failed storage device, and that has been finally deleted is determined to be a data-set that can be deleted immediately. On the other hand, each data-set that has been operated on by a job extracted as a direct re-execution job or an indirect re-execution job and that has been finally deleted is determined to be a data-set that can be deleted after re-execution.
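The two deletion classes might be computed as in the sketch below. The `Op` record, the result labels, and the reading of "finally deleted" as "the last recorded operation on that data-set generation is a deletion" are assumptions for illustration.

```python
from typing import NamedTuple

class Op(NamedTuple):
    seq: int      # position in the execution history (assumed field)
    job: str
    dataset: str
    gen: int
    kind: str     # "input", "output", or "delete"
    volume: str   # storage device holding the data-set

def classify_deletions(ops, reexec_jobs, failed_volume):
    """Map each finally deleted (dataset, gen) to its deletion timing."""
    by_dataset = {}
    for op in sorted(ops, key=lambda o: o.seq):
        by_dataset.setdefault((op.dataset, op.gen), []).append(op)
    plan = {}
    for key, hist in by_dataset.items():
        if hist[-1].kind != "delete":
            continue  # assumption: "finally deleted" = last operation is a delete
        touched = any(o.job in reexec_jobs for o in hist)
        on_failed = any(
            o.kind == "output" and o.volume == failed_volume for o in hist
        )
        if touched:
            plan[key] = "after re-execution"
        elif on_failed:
            plan[key] = "immediately"
    return plan

history = [
    Op(1, "JOB1", "A", 0, "output", "VOL1"),
    Op(2, "JOB2", "A", 0, "input", "VOL1"),
    Op(3, "JOB2", "B", 0, "output", "VOL1"),
    Op(4, "JOB3", "A", 0, "delete", "VOL1"),
    Op(5, "JOB4", "T", 0, "output", "VOL1"),
    Op(6, "JOB4", "T", 0, "delete", "VOL1"),
]
print(classify_deletions(history, {"JOB1", "JOB2"}, "VOL1"))
# {('A', 0): 'after re-execution', ('T', 0): 'immediately'}
```

Data-set A must survive until the re-executed jobs have used it, whereas the temporary data-set T is not needed by any re-execution job and can be removed at once.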
In the step of determining a data-set to be restored, the operation type of each operation on a data-set is inspected for every generation of each data-set. A data-set related to an input operation is then determined to be a data-set to be restored when neither an output operation nor a deletion operation has been performed before the input operation and at least one output or deletion operation has been performed after the input operation.
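The restoration condition can be checked per data-set generation as in the following sketch. The `Op` record and its fields are hypothetical, and for brevity the sketch inspects every operation it is given instead of first limiting the history to data-sets touched by re-execution jobs.

```python
from typing import NamedTuple

class Op(NamedTuple):
    seq: int      # position in the execution history (assumed field)
    job: str
    dataset: str
    gen: int
    kind: str     # "input", "output", or "delete"

def datasets_to_restore(ops):
    """Data-set generations that were read before any job changed them."""
    restore = []
    for op in ops:
        if op.kind != "input":
            continue
        key = (op.dataset, op.gen)
        changes = [
            o for o in ops
            if (o.dataset, o.gen) == key and o.kind in ("output", "delete")
        ]
        changed_before = any(o.seq < op.seq for o in changes)
        changed_after = any(o.seq > op.seq for o in changes)
        if not changed_before and changed_after and key not in restore:
            restore.append(key)
    return restore

history = [
    Op(1, "JOB1", "M", 0, "input"),
    Op(2, "JOB1", "M", 0, "output"),
    Op(3, "JOB2", "N", 0, "output"),
    Op(4, "JOB3", "N", 0, "input"),
    Op(5, "JOB3", "P", 0, "input"),
]
print(datasets_to_restore(history))  # [('M', 0)]
```

Data-set M existed before the batch run and was overwritten afterwards, so its earlier contents can only come from a backup. N is regenerated by JOB2, and P is never changed, so neither is selected.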
A further understanding of the nature and advantages of the invention herein may be realized by reference to the remaining portions of this specification and the attached drawings.