Recent years have seen a proliferation of computers and storage subsystems. Demand for storage capacity grows by over seventy-five percent each year. Early computer systems relied heavily on direct-attached storage (DAS) consisting of one or more disk drives coupled to a system bus. More recently, network-attached storage (NAS) and storage area network (SAN) technologies have been used to provide storage with greater capacity, higher reliability, and higher availability. The present invention is directed primarily to SAN systems that are designed to provide shared data storage that is beyond the ability of a single host computer to efficiently manage.
Mass data storage systems are implemented in networks or fabrics that provide means for communicating data between systems that use data and the storage systems that implement the physical storage. In many cases, host computers act as storage servers and are coupled to the network and configured with several disk drives that cumulatively provide more storage capacity or different storage functions (e.g., data protection) than could be implemented by a DAS system. For example, a server dedicated to data storage can provide various degrees of redundancy and mirroring to improve access performance, availability, and reliability of stored data. A large storage system can be formed by collecting storage sub-systems, each of which is managed by a separate server. More recently, virtualized storage systems such as the StorageWorks® Enterprise Virtual Array announced by Compaq Corporation in October 2001 provide storage controllers within a fabric or network that present virtualized storage to hosts that require data storage, in a manner that enables the host to be uninvolved in the physical configuration, allocation, and management of the storage devices. StorageWorks is a registered trademark of Compaq Computer Corporation. In this system, hosts simply access logical units of storage that appear to the host as a range of logical address space. Virtualization improves performance and utilization of storage.
SAN systems make it possible to store multiple copies or “replicas” of data at various physical locations throughout the system. Data replication across multiple sites is desirable for a variety of reasons. To provide disaster tolerance, copies of data stored at different physical locations are desired. When one copy becomes unavailable due to equipment failure, a local network outage, natural disaster or the like, a replica located at an alternate site can allow access to the data. Replicated data can also theoretically improve access in normal operation in that replicas can be accessed in parallel, avoiding bottlenecks associated with accessing a single copy of data from multiple systems. However, prior systems were organized such that one site had a primary role and another site was a replica. Access requests were handled by the primary site until failure, at which time the replica became active. In such an architecture, the replica provided little benefit until failure. Similarly, the resources allocated to creating and managing replicas provided minimal load balancing benefit that would enable data access requests to be directed intelligently to replicas such that resources were used more efficiently. Moreover, when multiple replicas are distributed throughout a network topology, it would be beneficial if network delays associated with accessing a topologically remote storage subsystem could be lessened.
In the past, managing a data replication system required significant time and expense. This time and expense was often related to tasks involved in setting up and configuring data replication on a SAN. Physical storage devices at the original and replica locations had to be closely matched, which could require knowledge at the spindle level to set up a storage site to hold a replica. Similarly, detailed knowledge of the physical devices at a storage site was required to set up logging of replication operations. Moreover, the logical structures used to represent, access, and manage the stored data had to be substantially identically reproduced at each storage site. Many of these operations required significant manual intervention, as prior data replication architectures were difficult to automate. This complexity made it difficult if not impossible to expand the size of a replicated volume of storage, as the changes on one site needed to be precisely replicated to the other site. A need exists to provide data replication systems in a SAN that enable functions involved in setup and configuration of a replication system to be automated, and allow the configuration to be readily expanded.
It is desirable to provide the ability for rapid recovery of user data from a disaster or significant error event at a data processing facility. This type of capability is often termed ‘disaster tolerance’. In a data storage environment, disaster tolerance requirements include providing for replicated data and redundant storage to support recovery after the event. In order to provide a safe physical distance between the original data and the backup copy, the data is migrated from one storage subsystem or physical site to another subsystem or site. It is also desirable for user applications to continue to run while data replication proceeds in the background. Data warehousing, ‘continuous computing’, and enterprise applications all benefit from remote copy capabilities.
Compaq Corporation introduced a data replication management product in its Array Controller Software (ACS) operating on an HSG80 storage controller and described in U.S. patent application Ser. No. 09/539,745, assigned to the assignee of the present application and incorporated herein by reference. This system implemented an architecture with redundant storage controllers at each site. Two sites could be paired to enable data replication. While effective, the HSG80 architecture defined relatively constrained roles for the components, which resulted in inflexibility.
For example, each of the controllers comprised one port that was dedicated to user data, and a separate port that was dedicated to data replication functions. Even where redundant fabrics were implemented, for a given controller both of these ports were coupled to a common fabric switch. Although each controller had two ports for communicating with other controllers, one of the ports was constrained to the role of handling user data, and the other port was constrained to the role of handling data replication. Failure of either port would be, in effect, a failure of the entire controller and would force migration of storage managed by the failed controller to the redundant controller. Similarly, failure of a communication link or fabric coupled to one port or the other would render the controller unable to perform its tasks and force migration to the redundant controller. Such migration was disruptive and typically required manual intervention and a period during which data was unavailable.
As another example, prior data replication management solutions simplified the implementation issues by assigning fixed roles to storage locations. A particular storage site would be designated as a primary when it handled operational data traffic, and another site would be designated as a secondary or backup site. Such architectures were unidirectional in that the backup site was not available for operational data transactions until the failure of the primary site. Such rigidly assigned roles limited the ability to share storage resources across multiple topologically distributed hosts. Moreover, configuration of such systems was complex as it was necessary to access and program storage controllers at both the primary and secondary sites specifically for their designated roles. This complexity made it impractical to expand data replication to more than two sites.
This lack of flexible configuration results in constraints on the configuration and functionality of data replication management (DRM) implementations. Most existing data replication solutions have specific constraints on the number of places to which the data may be copied and on the simultaneity of multi-directional copies. Further, the synchronicity of the data transmission between sites was set per controller, not per volume. As a result, all of the copy sets managed by a particular controller had the exact same initiator and target role designations. Also, all copy sets had to go in one direction such that if a controller was an initiator for one copy set, it was an initiator for all copy sets managed by that controller. Further, the replicas were not allowed to vary from the original in any material respect. The target disks were required to have the same size, data protection scheme, and the like as the original. Prior systems could not readily support dynamic changes in the size of storage volumes.
The lack of flexible configuration constrains the number of replicas that can be effectively created. Current systems allow an original data set to be replicated at a single location. While a single replica is beneficial for disaster tolerance, it does little to deliver the performance benefits that come from migrating or distributing data to locations closer to where the data is used. A need exists for a data replication system that improves the ability to fan out a larger number of replicas to improve geographic and topological diversity.
Therefore, there remains a need in the art for a data storage system capable of providing flexible data replication services without the direct involvement of the host computer. Moreover, a data storage system is needed that is readily extensible to provide multiple replication, load balancing, and disaster tolerance without limitations imposed by designating rigid roles for the system components.