Industry and commerce have become so dependent on computer systems with online or interactive applications that an interruption of only a few minutes in the availability of those applications can have serious consequences. Outages of more than a few hours can sometimes threaten a company's or an institution's existence. In some cases, regulatory requirements can impose fines or other penalties for disruptions or delays in services that are caused by application outages.
As a consequence of this growing intolerance for application outages, there is a keen interest in improving the availability of these applications during normal operations and in decreasing the amount of time needed to recover from equipment failure or other disastrous situations.
A common approach to disaster recovery is to use whatever remains after the disaster to provide a system on which all necessary applications have access to a recent copy of the data they need. A number of copy techniques are known that can provide such a copy after a disaster or other abnormal event. These techniques are known by a variety of names and differ in a number of respects, but they are similar in that they all copy data stored on one or more primary recording devices onto one or more secondary recording devices.
Practical implementations of disaster recovery mechanisms usually define pairs of primary and secondary recording devices and provide processes that facilitate operations with the pairs. Normal computer center operations typically require changes to the definition of one or more pairs and to the operational status of devices in the pairs so that copying between respective primary and secondary recording devices can be suspended or resumed. For example, changes in pair definitions may be needed to allow data to be moved from one recording device to another to improve system performance or to recover from a hardware failure. Changes in operational status may be needed to suspend copying so that data stored by the secondary recording devices can be accessed by a batch application, for example, or be copied to tertiary recording devices to provide an additional level of security for recovery purposes.
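The notion of a copy pair with a definition and an operational status can be illustrated with a minimal sketch. The class names, the two status values, and the device identifiers "DEV-A" and "DEV-B" are assumptions chosen for illustration and are not taken from any particular system.

```python
from dataclasses import dataclass
from enum import Enum


class PairStatus(Enum):
    ACTIVE = "active"        # copying from primary to secondary is under way
    SUSPENDED = "suspended"  # copying is paused; the secondary can be read or copied


@dataclass
class CopyPair:
    primary: str    # hypothetical identifier of the primary recording device
    secondary: str  # hypothetical identifier of the secondary recording device
    status: PairStatus = PairStatus.ACTIVE

    def suspend(self) -> None:
        """Change operational status so that copying stops."""
        self.status = PairStatus.SUSPENDED

    def resume(self) -> None:
        """Change operational status so that copying continues."""
        self.status = PairStatus.ACTIVE

    def redefine(self, primary: str, secondary: str) -> None:
        """Change the pair definition, for example after data is moved to another device."""
        self.primary = primary
        self.secondary = secondary


# Suspend copying (say, so a batch application can read the secondary), then resume.
pair = CopyPair(primary="DEV-A", secondary="DEV-B")
pair.suspend()
status_while_suspended = pair.status
pair.resume()
```

In this sketch a change of definition and a change of operational status are separate operations on the same pair, which mirrors the two kinds of change described above.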
In conventional mainframe computer systems, changes to copy pairs are initiated by software executing in a mainframe computer in response to commands specified by an operator. This approach is not attractive because errors are likely in the operator input required to specify these commands and because considerable time is needed to specify all of the parameters and other information needed by the commands. In addition, the time needed to carry out each command is excessive. For example, the time required in conventional systems to change the operational status of one copy pair can be on the order of ten seconds. In situations where an operational change must be carried out for thousands of recording devices, the total time needed to complete a requested change can easily exceed an hour. The needs of some applications require frequent changes in operational status; however, the time required to change status severely restricts when and how often such changes can be made.
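The arithmetic behind this estimate can be sketched as follows. The sketch assumes that status changes are carried out one after another and that each takes exactly ten seconds; both are simplifying assumptions, and the pair counts are illustrative only.

```python
SECONDS_PER_PAIR = 10  # "on the order of ten seconds" per status change (assumed exact here)


def total_change_time_seconds(num_pairs: int,
                              seconds_per_pair: float = SECONDS_PER_PAIR) -> float:
    """Total time when each pair's status change is carried out serially."""
    return num_pairs * seconds_per_pair


# Illustrative pair counts; none of these figures come from a real installation.
for num_pairs in (360, 1000, 5000):
    hours = total_change_time_seconds(num_pairs) / 3600
    print(f"{num_pairs} pairs: {hours:.2f} hours")
```

Under these assumptions, 360 pairs already take a full hour, and 1,000 pairs take 10,000 seconds, or roughly 2.8 hours.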
The disadvantages of the conventional approaches are compounded if the recovery mechanism includes a complex of multiple computer systems. Commands and parameters used to define copy pairs may be valid for only one of the computer systems in the complex. Additional commands and parameters must be specified for use in other computer systems in the complex to access either primary or secondary recording devices unless care is taken to ensure all relevant hardware and software identifications of the recording devices are the same for all of the computer systems. This is generally very difficult if not impossible to achieve. As a result, changes in one computer system that affect copy pair definitions must be reflected in changes to parameters for not only that particular computer system but also for all other computer systems in the complex. This additional requirement increases the likelihood that a mistake or oversight will introduce errors that can adversely affect the ability to recover from a disaster or other event.
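The duplication of parameters across a complex can be illustrated with a small sketch. The system names, device labels, addresses, and the "PAIR" parameter syntax are all hypothetical; the sketch only shows that one logical pair requires a separately maintained definition for each computer system that identifies the devices differently.

```python
from typing import Dict, List

# Hypothetical table: system name -> {device label -> address by which
# that system identifies the device}. The same two physical devices are
# known by different addresses in each system.
addresses: Dict[str, Dict[str, str]] = {
    "SYS1": {"disk-P": "0A10", "disk-S": "0B10"},
    "SYS2": {"disk-P": "1C40", "disk-S": "1D40"},
}


def pair_definitions(primary: str, secondary: str) -> List[str]:
    """Build one pair-definition parameter string per system in the complex."""
    return [
        f"{system}: PAIR primary={ids[primary]} secondary={ids[secondary]}"
        for system, ids in sorted(addresses.items())
    ]


definitions = pair_definitions("disk-P", "disk-S")
for line in definitions:
    print(line)
```

If the address of either device changes in one system, that system's entry in the table must be corrected, and the definitions for every other system must be reviewed as well. This is where a mistake or oversight can arise.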