1. Field of the Invention
The invention relates generally to storage replication systems in an enterprise and service continuance using the replicated storage. More specifically, the invention relates to methods and structures for improving management of replicated storage systems and use of the replicated storage to continue service in a manner that is transparent to attached host systems.
2. Discussion of Related Art
In computer data storage applications that require high reliability and high availability, it is generally known to provide replication/duplication of the stored data and/or other forms of redundancy to permit continued operations in the presence of certain types of failures. In the context of a single site, a storage system may provide for redundant links between computing systems and storage systems as well as redundancy within the storage system (e.g., multiple storage controllers providing RAID storage management over multiple storage devices).
In a larger context, even with all the above identified lower level redundancy, it is possible for an entire computing site to fail in a catastrophic manner. For example, should a computing center flood or be destroyed in some manner, all the on-site redundancy may be insufficient to assure integrity of the data stored at that computing center. Thus, it is also known to provide for data replication at a larger scale. For example, data for an entire computing center may be replicated at one or more other computing centers to reduce the possibility of total loss of data due to destruction or other loss of one computing center. Disaster recovery plans and systems generally rely on such replication to permit continuation of services at a second site when a first site is destroyed or otherwise disabled.
In a disaster recovery configuration, business critical data at one physical location is remotely replicated to a site that is geographically separated from the first location. The remote data replication is typically volume-level remote replication provided by storage array vendors across SAN replication links. With this feature, a replication group consisting of a primary volume and one or more mirrored (or secondary) volumes is created, with each of the primary volume and the one or more secondary volumes residing at a different site. As presently practiced, the mirrored (i.e., secondary) volume(s) of the replication group is/are often “write-protected” to avoid data corruption. The primary volume and each mirrored volume have unique SCSI device identifiers. Traditionally, storage array vendors and disaster recovery solution vendors have supplied additional integration components and required manual end-user actions to overcome the limitations of this setup in a multi-site disaster recovery environment. Even then, the existing approaches leave end-users susceptible to short periods of application downtime.
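The replication group structure described above may be sketched as a minimal data model. This is an illustrative sketch only; the class names, fields, and identifier values are hypothetical and not drawn from any vendor's actual interface.

```python
# Hypothetical sketch of a replication group: a primary volume plus one or
# more mirrored (secondary) volumes, each at a different site, each with a
# unique SCSI device identifier, and secondaries write-protected.
from dataclasses import dataclass, field

@dataclass
class Volume:
    scsi_device_id: str    # unique per volume, even across mirrors
    site: str              # each volume of the group resides at a different site
    write_protected: bool = False

@dataclass
class ReplicationGroup:
    primary: Volume
    secondaries: list = field(default_factory=list)

    def add_secondary(self, volume: Volume) -> None:
        # As presently practiced, mirrored volumes are write-protected
        # to avoid data corruption.
        volume.write_protected = True
        self.secondaries.append(volume)

# Hypothetical identifiers for illustration only.
group = ReplicationGroup(primary=Volume("vol-id-primary-01", "site-A"))
group.add_secondary(Volume("vol-id-mirror-01", "site-B"))
```

Note that because the two volumes carry distinct identifiers, an attached host cannot by itself recognize them as copies of the same data; this is the correlation problem the tasks below address.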
When business critical data at one site becomes unavailable due to disaster, hardware failure, or other unpredictable conditions that disable the site, the business services continue at another site by using the replicated data on one of the replicated, mirrored, secondary volumes. Data management in a multiple-site disaster recovery scenario is a complicated process with existing solutions as presently practiced. The data management tasks for utilizing present replication techniques include:
Service management. The software applications utilizing the replication group must be provisioned at all sites.
Service resource management. Data storage is an application resource. The data storage resource must be provisioned and configured within the application. Since primary volumes and remote replication volumes have different SCSI device identifiers, the relationship between a primary volume and its replicated volume(s) must be correlated and saved into a site manager database. The correlation of volumes of the replication group requires storage vendor specific management interfaces and integration.
Resource failover/failback management. When an application service must be failed over to another site, its data storage must be failed over as well. The role changes for the volumes in the replication group need to be managed via storage vendor specific management interfaces and be integrated with the site management and application software.
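The service resource management task above hinges on persisting the correlation between a primary volume's identifier and those of its replicas. A minimal sketch of that bookkeeping, using a plain dictionary as a stand-in for the site manager database (all names and identifiers here are hypothetical):

```python
# Hypothetical stand-in for the site manager database: maps a primary
# volume's SCSI device identifier to the identifiers of its replicas.
site_manager_db = {}

def correlate(primary_id, secondary_ids):
    """Record the primary-to-secondary volume mapping for a replication group."""
    site_manager_db[primary_id] = list(secondary_ids)

def secondaries_for(primary_id):
    """Look up the replicated volume(s) that mirror a given primary volume."""
    return site_manager_db.get(primary_id, [])

# Because the identifiers differ, this mapping cannot be inferred from the
# volumes themselves; it must be obtained through vendor-specific interfaces.
correlate("primary-vol-01", ["mirror-vol-01a", "mirror-vol-01b"])
```

In practice this correlation is populated through storage vendor specific management interfaces, which is precisely the integration burden the related art imposes.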
Some vendors' site management software products, such as VMware vCenter Site Recovery Manager (SRM), have automated certain provisioning and management operations. The automation requires that each storage vendor provide SRM plugins which implement the VMware SRM specification. Though VMware SRM is popular and relatively common, many other site recovery management paradigms and specifications are known in the industry. Thus, storage system vendors may have to implement different plugins for each disaster recovery management product they want to support.
Most disaster recovery management products are not deeply integrated with storage vendors. Most storage/application provisioning and management operations are essentially manual procedures. The manual nature of the procedures to provision and manage the disaster recovery significantly affects both RTO (Recovery Time Objective) and RPO (Recovery Point Objective)—both being common metrics for disaster recovery products and techniques. The manual procedures also increase the TCO (Total Cost of Ownership) due to the human labor costs in performing the procedures.
For the volumes in a replication group, one volume is the source (primary) volume of the replication group and one or more other volumes are the target (secondary) volume(s). A secondary volume is generally either inaccessible from the servers or is protected as read-only. The read-only or inaccessible attributes of the secondary volume create a number of restrictions for application resource management. With server virtualization applications, the storage volume resources are deeply coupled with virtual machines. If one virtual machine needs to fail over, all virtual machines that rely on the same underlying storage resource must be failed over together. Similarly, if the underlying storage volume needs to be failed over, the affected virtual machines must be failed over as well. Since the secondary volume(s) is/are write-protected (e.g., read-only or totally inaccessible), the meta-data describing the replication group's current configuration and the failover policies may be difficult to update or even impossible to access without manual (operator) intervention.
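The coupling described above implies that the unit of failover is the set of virtual machines sharing an underlying volume, not an individual machine. A sketch of computing that set, with hypothetical VM and volume names:

```python
# Hypothetical sketch: given a mapping from virtual machines to their
# underlying storage volumes, compute the set of VMs that must be failed
# over together when any one of them fails over.
from collections import defaultdict

def failover_set(vm_to_volume, vm):
    """Return all VMs that share the given VM's underlying storage volume."""
    volume_to_vms = defaultdict(set)
    for name, volume in vm_to_volume.items():
        volume_to_vms[volume].add(name)
    return volume_to_vms[vm_to_volume[vm]]

# vm1 and vm2 share volA, so failing over vm1 forces vm2 along with it.
vms = {"vm1": "volA", "vm2": "volA", "vm3": "volB"}
```

This all-or-nothing grouping is one reason the write-protected secondary volumes complicate fine-grained resource management.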
Thus it is an ongoing challenge to provide simple, cost-effective management and failover processing for multi-site storage replication environments.