1. Field of the Invention
This invention relates in general to data storage systems that use redundant data backup, and more particularly to a method, apparatus and program storage device for allowing continuous availability of data during volume set failures in a mirrored environment.
2. Description of Related Art
Due to advances in computer technology, there has been an ever-increasing need for data storage in data processing networks. In a typical data processing network, there has been an increase in the number of volumes of data storage and an increase in the number of hosts needing access to the volumes.
Fortunately for computer users, the cost of data storage has continued to decrease at a rate approximating the increase in need for storage. For example, economical and reliable data storage in a data network can be provided by a storage subsystem. However, as people's reliance upon machine readable data increases, they become more vulnerable to damage caused by data loss. Large institutional users of data processing systems that maintain large volumes of data, such as banks, insurance companies, and stock market traders, must and do take tremendous steps to ensure backup data availability in case of a major disaster. These institutions have recently developed a heightened awareness of the importance of data recovery and backup in view of world events. Consequently, data backup systems have never been more important.
Generally, data backup systems copy a designated group of source data, such as a file, volume, storage device, partition, etc. If the source data is lost, applications can use the backup copy instead of the original, source data. The similarity between the backup copy and the source data may vary, depending upon how often the backup copy is updated to match the source data.
Currently, data processing system users often maintain copies of their valuable data on site, either on removable storage media or in a secondary “mirrored” storage device located within the same physical confines as the main storage device. If the backup copy is updated in step with the source data, the copy is said to be a “mirror” of the source data, and is always “consistent” with the source data. Should a disaster such as fire, flood, or inaccessibility of a building occur, however, both the primary data and the secondary or backed up data will be unavailable to the user. Accordingly, more data processing system users are requiring the remote storage of backup data.
Some competing concerns in data backup systems are cost, speed, and data consistency. Systems that guarantee data consistency often cost more, and operate more slowly. On the other hand, many faster backup systems typically cost less while sacrificing absolute consistency. One conventional technique for recovering backup data involves the maintenance of data in “duplex pairs.” In a duplex pair configuration, each time data is written on a disk or some other storage media, a duplicate copy is written on a backup disk as well.
One example of a data backup system is the Extended Remote Copy (“XRC”) system, sold by International Business Machines Corp. (“IBM”). In addition to the usual primary and secondary storage devices, the XRC system uses a “data mover” machine coupled between primary and secondary devices. The data mover performs backup operations by copying data from the primary devices to the secondary devices. Storage operations in the XRC system are “asynchronous,” since primary storage operations are committed to primary storage without regard for whether the corresponding data has been stored in secondary storage.
The secondary devices are guaranteed to be consistent with the state of the primary devices at some specific time in the past. This is because the XRC system time stamps data updates stored in the primary devices, enabling the secondary devices to implement the updates in the same order. Time stamping in the XRC system is done with a timer that is shared among all hosts coupled to primary storage. Since the secondary devices are always consistent with a past state of the primary devices, only a limited amount of data is lost if the primary devices fail.
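The asynchronous, time-stamped behavior described above can be illustrated with a minimal sketch. It is not the XRC implementation; the class `AsyncMirror`, its methods, and the counter standing in for the shared timer are all hypothetical names chosen for illustration.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count


@dataclass(order=True)
class Update:
    timestamp: int                       # value read from the shared timer
    key: str = field(compare=False)
    value: str = field(compare=False)


class AsyncMirror:
    """Toy model of asynchronous, time-stamped remote copy (hypothetical)."""

    def __init__(self):
        self.clock = count(1)    # assumption: stands in for the timer shared by all hosts
        self.primary = {}
        self.secondary = {}
        self.pending = []        # updates not yet copied, kept as a min-heap by timestamp

    def write(self, key, value):
        # The write is committed to primary storage immediately,
        # without regard for whether the secondary has stored it.
        update = Update(next(self.clock), key, value)
        self.primary[key] = value
        heapq.heappush(self.pending, update)
        return update.timestamp

    def move(self, max_updates):
        # The "data mover" applies pending updates in timestamp order,
        # so the secondary always matches some past state of the primary.
        for _ in range(min(max_updates, len(self.pending))):
            update = heapq.heappop(self.pending)
            self.secondary[update.key] = update.value


mirror = AsyncMirror()
mirror.write("a", "1")
mirror.write("b", "2")
mirror.write("a", "3")
mirror.move(max_updates=2)   # secondary now reflects the primary as of timestamp 2
```

In this sketch, if the primary were lost after the call to `move`, only the third update (still in `pending`) would be missing from the secondary, which corresponds to the limited data loss noted above.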
A different data backup system is IBM's Peer-to-Peer Remote Copy (“PPRC”) system. The PPRC approach does not use a data mover machine. Instead, storage controllers of primary storage devices are coupled to controllers of counterpart secondary devices by suitable communications links, such as fiber optic cables. The primary storage devices send updates to their corresponding secondary controllers. With PPRC, a data storage operation does not succeed until updates to both primary and secondary devices complete. In contrast to the asynchronous XRC system, PPRC performs “synchronous” backups.
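By way of contrast, the synchronous behavior can be sketched as follows. This is an illustrative model only, not the PPRC implementation; `Volume`, `VolumeUnavailable`, and `sync_write` are hypothetical names, and the decision to undo the primary write when the secondary fails is an assumption made to keep the example self-consistent.

```python
class VolumeUnavailable(Exception):
    """Raised when a volume cannot accept a write (hypothetical)."""


class Volume:
    def __init__(self, online=True):
        self.online = online
        self.blocks = {}

    def store(self, block, data):
        if not self.online:
            raise VolumeUnavailable(f"cannot write block {block}")
        self.blocks[block] = data


_MISSING = object()  # sentinel meaning "block did not exist before the write"


def sync_write(primary, secondary, block, data):
    """Report success only after both the primary and secondary updates complete."""
    previous = primary.blocks.get(block, _MISSING)
    try:
        primary.store(block, data)
        secondary.store(block, data)   # forwarded over the link to the secondary controller
    except VolumeUnavailable:
        # Assumption: undo the primary update so both copies stay identical.
        if previous is _MISSING:
            primary.blocks.pop(block, None)
        else:
            primary.blocks[block] = previous
        return False
    return True


primary, secondary = Volume(), Volume()
first_ok = sync_write(primary, secondary, 0, b"payload")

secondary.online = False               # simulate a failed link or secondary device
second_ok = sync_write(primary, secondary, 1, b"more")
```

Here the first write succeeds because both copies complete, while the second reports failure because the secondary is unavailable. The cost of this guarantee is that every host write waits on the remote update.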
In many backup systems, recovery involves a common sequence of operations. First, backup data is used to restore user data to a known state, as of a known date and time. Next, “updates” to the primary storage subsystem that have not been transferred to the secondary storage subsystem are copied from the “log” where they are stored at the primary storage subsystem, and applied to the restored data. The logged updates represent data received after the last backup was made to the secondary storage subsystem, and are usually stored in the same chronological order according to when they were received by the primary storage subsystem. After applying the logged updates, the data is considered to be restored, and the user's application program is permitted to access the restored data.
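The recovery sequence described above (restore to a known state, then apply logged updates in chronological order) can be sketched as follows. The function `recover`, the tuple layout of the log entries, and the sample data are hypothetical and serve only as an illustration.

```python
def recover(backup, log, backup_time):
    """Restore from a backup, then replay logged updates made after it.

    backup      -- mapping of key to value as of backup_time
    log         -- iterable of (timestamp, key, value) entries kept at the primary
    backup_time -- timestamp of the last backup to the secondary
    """
    # Step 1: restore user data to the known state captured by the backup.
    restored = dict(backup)

    # Step 2: apply, in chronological order, only the updates that were
    # received after the last backup was made.
    for timestamp, key, value in sorted(log, key=lambda entry: entry[0]):
        if timestamp > backup_time:
            restored[key] = value

    # Step 3: the data is now considered restored and may be accessed.
    return restored


backup = {"x": 1, "y": 2}
log = [(120, "z", 7), (90, "x", 0), (110, "x", 5)]
restored = recover(backup, log, backup_time=100)
```

In this example the entry with timestamp 90 is skipped because the backup already reflects it, and the two later entries are applied in timestamp order regardless of their position in the log.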
Although many of the foregoing technologies constitute significant advances, and may even enjoy significant commercial success today, engineers are continually seeking to improve the performance and efficiency of today's data backup systems. One area of possible focus concerns remote mirroring. Remote mirroring provides a large amount of additional data protection above and beyond what is available in a standard RAID configuration. This includes remote copies of a user's data that can be used at a later point to recover from certain types of failures, including complete loss of a controller pair. The problem with these recovery scenarios is that the user does not have access to their data at the site that is being recovered until the recovery is complete. This can be a long period of time, during which operations are running at the remote site.
One of the more common failures in an array is the loss of a physical drive due to some sort of drive failure. Once a single drive has been lost, the array is exposed to potential data loss in the event of a second drive failure. This window of time for a potential data loss continues to grow as drives increase in size. Currently, in the event of a failure, the user must fail the host systems over to start using the hosts attached to the remote mirror controllers. However, this is a disruption of the data center and may have performance and other unintended consequences. Thus, all of the data must be restored to the failed volume set before any access is allowed to those volume sets by the hosts.
It can be seen then that there is a need for a method, apparatus and program storage device that allows the primary array to continue to service host I/O requests even while the volume set of the primary array has been marked OFFLINE.
It can also be seen that there is a need for a method, apparatus and program storage device for allowing continuous availability of data during volume set failures in a mirrored environment.