The present invention relates generally to data consistency in data storage systems, and more specifically, to a method for pipelining a number of write commands between a sending site and a receiving site while providing command ordering during controller-based synchronous or asynchronous copy operations in a remote data replication system using a Storage Area Network.
It is desirable to provide the ability for rapid recovery of user data from a disaster or significant error event at a data processing facility. This type of capability is often termed 'disaster tolerance'. In a data storage environment, disaster tolerance requirements include providing for replicated data and redundant storage to support recovery after the event. In order to provide a safe physical distance between the original data and the data to be backed up, the data must be migrated from one storage subsystem or physical site to another subsystem or site. It is also desirable for user applications to continue to run while data replication proceeds in the background. Data warehousing, 'continuous computing', and Enterprise Applications all require remote copy capabilities.
Storage controllers are commonly utilized in computer systems to off-load from the host computer certain lower level processing functions relating to I/O operations, and to serve as an interface between the host computer and the physical storage media. Given the critical role played by the storage controller with respect to computer system I/O performance, it is desirable to minimize the potential for interrupted I/O service due to storage controller malfunction. Thus, prior workers in the art have developed various system design approaches in an attempt to achieve some degree of fault tolerance in the storage control function. One such prior approach requires that all system functions be "mirrored". While this type of approach is most effective in reducing interruption of I/O operations and lends itself to value-added fault isolation techniques, it has previously been costly to implement and heretofore has placed a heavy processing burden on the host computer.
One prior method of providing storage system fault tolerance accomplishes failover through the use of two controllers coupled in an active/passive configuration. During failover, the passive controller takes over for the active (failing) controller. A drawback to this type of dual configuration is that it cannot support load balancing to increase overall system performance, as only one controller is active, and thus utilized, at any given time. Furthermore, the passive controller presents an inefficient use of system resources.
Another approach to storage controller fault tolerance is based on a process called 'failover'. Failover is known in the art as a process by which a first storage controller, coupled to a second controller, assumes the responsibilities of the second controller when the second controller fails. 'Failback' is the reverse operation, wherein the second controller, having been either repaired or replaced, recovers control over its originally-attached storage devices. Since each controller is capable of accessing the storage devices attached to the other controller as a result of the failover, there is no need to store and maintain a duplicate copy of the data, i.e., one set stored on the first controller's attached devices and a second (redundant) copy on the second controller's devices.
U.S. Pat. No. 5,274,645 (Dec. 28, 1993), to Idleman et al. discloses a dual-active configuration of storage controllers capable of performing failover without the direct involvement of the host. However, the direction taken by Idleman requires a multi-level storage controller implementation. Each controller in the dual-redundant pair includes a two-level hierarchy of controllers. When the first level or host-interface controller of the first controller detects the failure of the second level or device interface controller of the second controller, it re-configures the data path such that the data is directed to the functioning second level controller of the second controller. In conjunction, a switching circuit re-configures the controller-device interconnections, thereby permitting the host to access the storage devices originally connected to the failed second level controller through the operating second level controller of the second controller. Thus, the presence of the first level controllers serves to isolate the host computer from the failover operation, but this isolation is obtained at added controller cost and complexity.
Other known failover techniques are based on proprietary buses. These techniques utilize existing host interconnect "hand-shaking" protocols, whereby the host and controller act in cooperative effort to effect a failover operation. Unfortunately, the "hooks" for this and other types of host-assisted failover mechanisms are not compatible with more recently developed, industry-standard interconnection protocols, such as SCSI, which were not developed with failover capability in mind. Consequently, support for dual-active failover in these proprietary bus techniques must be built into the host firmware via the host device drivers. Because SCSI, for example, is a popular industry standard interconnect, and there is a commercial need to support platforms not using proprietary buses, compatibility with industry standards such as SCSI is essential. Therefore, a vendor-unique device driver in the host is not a desirable option.
However, none of the above references disclose a disaster tolerant data storage system having a remote backup site connected to a host site via a dual fabric link, where the system replication and error recovery functions are controller-based. Furthermore, none of the above systems allows a number of write commands to be 'pipelined' (in transit and unacknowledged) between local and remote sites while ensuring the proper ordering of commands on remote media during synchronous or asynchronous operation. In addition, the prior technology fails to provide a mechanism for 'tuning' of links based on distance and performance requirements.
Therefore, there is a clearly felt need in the art for a disaster tolerant data replication system capable of optimally tunable inter-site performance, and which allows commands to be pipelined during operation, where the data replication functions are performed without the direct involvement of the host computer.
Accordingly, the above problems are solved, and an advance in the field is accomplished by the system of the present invention which provides a completely redundant configuration including dual Fibre Channel fabric links interconnecting each of the components of two data storage sites, wherein each site comprises a host computer and associated data storage array, with redundant array controllers and adapters. The present system is unique in that each array controller is capable of performing all of the data replication functions including the handling of failover functions.
In the situation wherein an array controller fails during an asynchronous copy operation, the partner array controller uses a 'micro log' stored in mirrored cache memory to recover transactions which were 'missed' by the backup storage array when the array controller failure occurred. The present system provides rapid and accurate recovery of backup data at the remote site by sending all logged commands and data from the logging site over the link to the backup site in order, while avoiding the overhead of a full copy operation.
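By way of illustration only, the in-order replay of logged writes may be sketched as follows. The names `LoggedWrite` and `MicroLog`, the per-write sequence number, and the cumulative acknowledgment scheme are assumptions made for this sketch and are not taken from the description above.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class LoggedWrite:
    seq: int      # hypothetical sequence number: order in which the write was issued
    block: int    # hypothetical block address on the backup array
    data: bytes


class MicroLog:
    """In-order record of writes not yet acknowledged by the backup site."""

    def __init__(self):
        self._entries = deque()

    def record(self, entry):
        # Entries are appended in the order the writes were issued.
        self._entries.append(entry)

    def acknowledge(self, seq):
        # Assumption: acknowledgments are cumulative, so every entry up to
        # and including `seq` can be dropped from the log.
        while self._entries and self._entries[0].seq <= seq:
            self._entries.popleft()

    def replay(self, send):
        # Resend the missed writes in their original order and
        # return how many were sent.
        count = 0
        for entry in self._entries:
            send(entry)
            count += 1
        return count
```

Only the unacknowledged tail of the log is resent, which is what allows recovery without a full copy of the volume.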
An important aspect of the present invention is the concept of a 'look-ahead limit', which, as implemented, allows a large number of 'outstanding' commands to be concurrently in transit between sites at any given time, while guaranteeing in-order operation of the data replication function. In addition, the present system automatically calculates an average transit/response time for each specific link, which is employed to determine a worst-case response time for effecting failover operations. A parameter representing the number of outstanding commands is user-adjustable, which allows tuning of links based on distance and system performance requirements.
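A minimal sketch of such a limit, given for illustration only, is shown below. The class names `PipelinedLink` and `InOrderReceiver` are hypothetical, and the use of sequence numbers with receiver-side reordering is an assumed way of preserving order; the description above does not specify the mechanism. The running mean is likewise one simple way to maintain an average response time.

```python
class PipelinedLink:
    """Sender side: caps unacknowledged commands at a tunable look-ahead limit."""

    def __init__(self, look_ahead_limit):
        self.look_ahead_limit = look_ahead_limit  # user-adjustable parameter
        self.next_seq = 0
        self.outstanding = {}      # seq -> time the command was sent
        self.samples = 0
        self.avg_response = 0.0    # running mean of observed response times

    def can_send(self):
        return len(self.outstanding) < self.look_ahead_limit

    def send(self, now):
        if not self.can_send():
            raise RuntimeError("look-ahead limit reached")
        seq = self.next_seq
        self.next_seq += 1
        self.outstanding[seq] = now
        return seq

    def acknowledge(self, seq, now):
        # Retire the command and fold its response time into the running mean.
        elapsed = now - self.outstanding.pop(seq)
        self.samples += 1
        self.avg_response += (elapsed - self.avg_response) / self.samples
        return elapsed


class InOrderReceiver:
    """Receiver side: applies commands strictly in sequence order."""

    def __init__(self):
        self.expected = 0
        self.pending = {}    # commands that arrived ahead of their turn
        self.applied = []

    def receive(self, seq, payload):
        self.pending[seq] = payload
        # Apply every command that is now contiguous with what was applied.
        while self.expected in self.pending:
            self.applied.append(self.pending.pop(self.expected))
            self.expected += 1
```

A larger `look_ahead_limit` keeps a long link busier at the cost of more commands to recover after a failure, which is the trade-off that tuning by distance addresses.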
The 'mirroring' of data for backup purposes is the basis for RAID ('Redundant Array of Independent [or Inexpensive] Disks') Level 1 systems, wherein all data is replicated on N separate disks, with N usually having a value of 2. Although the concept of storing copies of data at a long distance from each other (i.e., long distance mirroring) is known, the use of a switched, dual-fabric, Fibre Channel configuration as described herein is a novel approach to disaster tolerant storage systems. Mirroring requires that the data be consistent across all volumes. In prior art systems which use host-based mirroring (where each host computer sees multiple units), the host maintains consistency across the units. For those systems which employ controller-based mirroring (where the host computer sees only a single unit), the host is not signaled completion of a command until the controller has updated all pertinent volumes. The present invention is, in one aspect, distinguished over the previous two types of systems in that the host computer sees multiple volumes, but the data replication function is performed by the controller. Therefore, a mechanism is required to communicate the association between volumes to the controller. To maintain this consistency between volumes, the system of the present invention provides a mechanism of associating a set of volumes to synchronize the logging to the set of volumes so that the log is consistent when it is "played back" to the remote site.
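One way to picture an association of volumes with synchronized logging is a single ordered log shared by every member volume. The following is a sketch under that assumption only; the name `AssociationSet` and its methods are hypothetical and do not represent the disclosed implementation.

```python
import itertools


class AssociationSet:
    """Groups volumes so that their writes share one ordered log."""

    def __init__(self, volumes):
        self.volumes = set(volumes)
        self._seq = itertools.count()   # one counter for the whole set
        self.log = []

    def log_write(self, volume, block, data):
        if volume not in self.volumes:
            raise KeyError(volume)
        # A single sequence across all member volumes preserves the
        # relative order of writes made to different volumes.
        self.log.append((next(self._seq), volume, block, data))

    def play_back(self, apply):
        # Replay to the remote site in the order the writes were logged.
        for _seq, volume, block, data in self.log:
            apply(volume, block, data)
```

Because the ordering is kept per set rather than per volume, a write to one volume that depends on an earlier write to another volume is replayed after it.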
Each array controller in the present system has a dedicated link via a fabric to a partner on the remote side of the long-distance link between fabric elements. Each dedicated link does not appear to any host as an available link for data access; however, it is visible to the partner array controllers involved in data replication operations. These links are managed by each partner array controller as if being 'clustered' with a reliable data link between them.
The fabrics comprise two components, a local element and a remote element. An important aspect of the present invention is the fact that the fabrics are 'extended' by standard e-ports (extension ports). The use of e-ports allows for standard Fibre Channel cable to be run between the fabric elements or the use of a conversion box to convert the data to a form such as telco ATM or IP. The extended fabric allows the entire system to be viewable by both the hosts and storage.
The dual fabrics, as well as the dual array controllers, dual adapters in hosts, and dual links between fabrics, provide high-availability and present no single point of failure. A distinction here over the prior art is that previous systems typically use other kinds of links to provide the data replication, resulting in the storage not being readily exposed to hosts on both sides of a link. The present configuration allows for extended clustering where local and remote site hosts are actually sharing data across the link from one or more storage subsystems with dual array controllers within each subsystem.
The present system is further distinguished over the prior art by other additional features, including independent discovery of initiator to target system and automatic rediscovery after link failure. In addition, device failures, such as controller and link failures, are detected by 'heartbeat' monitoring by each array controller. Furthermore, no special host software is required to implement the above features because all replication functionality is self-contained within each array controller and is performed automatically without user intervention.
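Heartbeat monitoring of this kind may be sketched, for illustration only, as a table of last-seen times checked against a timeout. The class `HeartbeatMonitor`, the fixed `timeout` parameter, and the peer names are assumptions for this sketch; the description above does not state how the timeout is chosen.

```python
class HeartbeatMonitor:
    """Flags peers whose most recent heartbeat is older than a timeout."""

    def __init__(self, timeout):
        self.timeout = timeout     # assumed: seconds of silence tolerated
        self.last_seen = {}        # peer name -> time of last heartbeat

    def beat(self, peer, now):
        # Record that a heartbeat from `peer` arrived at time `now`.
        self.last_seen[peer] = now

    def failed_peers(self, now):
        # A peer is presumed failed once it has been silent longer than the timeout.
        return sorted(
            peer for peer, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

In use, each controller would call `beat` on every heartbeat received and poll `failed_peers` periodically to decide whether a controller or link needs attention.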
An additional aspect of the present system is the ability to function over two links with data replication traffic. If failure of a link occurs, as detected by the 'initiator' array controller, that array controller will automatically 'failover', or move the base of data replication operations to its partner controller. At this time, all transfers in flight are discarded, and therefore discarded to the host. The host simply sees a controller failover at the host OS (operating system) level, causing the OS to retry the operations to the partner controller. The array controller partner continues all 'initiator' operations from that point forward. The array controller whose link failed will continuously watch the status of its link to the same controller on the other 'far' side of the link. That status changes to a 'good' link when the array controllers have established reliable communications between each other. When this occurs, the array controller 'initiator' partner will 'failback' the link, moving operations back to the newly reliable link. This procedure re-establishes load balance for data replication operations automatically, without requiring additional features in the array controller or host beyond what is minimally required to allow controller failover.
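The failover and failback sequence described above can be pictured as a small state machine. The following sketch is illustrative only; the class `ReplicationPair`, its attribute names, and the controller labels are hypothetical, and the host retry itself is outside the sketch.

```python
class ReplicationPair:
    """Tracks which of two partner controllers currently initiates replication."""

    def __init__(self, home, partner):
        self.home = home            # controller that normally initiates
        self.partner = partner      # its partner in the same subsystem
        self.active = home
        self.link_good = {home: True, partner: True}
        self.in_flight = []         # transfers sent but not yet acknowledged

    def link_status(self, controller, good):
        """Record a link status change and return the active initiator."""
        self.link_good[controller] = good
        if not good and controller == self.active:
            other = self.partner if controller == self.home else self.home
            if self.link_good[other]:
                # Failover: discard in-flight transfers; the host OS is
                # expected to retry them through the partner controller.
                self.in_flight.clear()
                self.active = other
        elif good and controller == self.home and self.active != self.home:
            # Failback: the home controller's link is reliable again.
            self.active = self.home
        return self.active
```

For example, reporting a bad link on the home controller moves the initiator role to the partner, and a later good status for the same link moves it back.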