The present invention relates generally to data consistency in data storage systems, and more specifically, to a system for providing controller-based transaction logging to provide data recovery after an error event in a remote data replication system using a Storage Area Network.
It is desirable to provide the ability for rapid recovery of user data from a disaster or significant error event at a data processing facility. This type of capability is often termed 'disaster tolerance'. In a data storage environment, disaster tolerance requirements include providing for replicated data and redundant storage to support recovery after the event. In order to provide a safe physical distance between the original data and the data to be backed up, the data must be migrated from one storage subsystem or physical site to another subsystem or site. It is also desirable for user applications to continue to run while data replication proceeds in the background. Data warehousing, 'continuous computing', and Enterprise Applications all require remote copy capabilities.
Storage controllers are commonly utilized in computer systems to off-load from the host computer certain lower level processing functions relating to I/O operations, and to serve as an interface between the host computer and the physical storage media. Given the critical role played by the storage controller with respect to computer system I/O performance, it is desirable to minimize the potential for interrupted I/O service due to storage controller malfunction. Thus, prior workers in the art have developed various system design approaches in an attempt to achieve some degree of fault tolerance in the storage control function. One such prior approach requires that all system functions be 'mirrored'. While this type of approach is most effective in reducing interruption of I/O operations and lends itself to value-added fault isolation techniques, it has previously been costly to implement and heretofore has placed a heavy processing burden on the host computer.
One prior method of providing storage system fault tolerance accomplishes failover through the use of two controllers coupled in an active/passive configuration. During failover, the passive controller takes over for the active (failing) controller. A drawback to this type of dual configuration is that it cannot support load balancing to increase overall system performance, as only one controller is active, and thus utilized, at any given time. Furthermore, the passive controller represents an inefficient use of system resources.
Another approach to storage controller fault tolerance is based on a process called 'failover'. Failover is known in the art as a process by which a first storage controller, coupled to a second controller, assumes the responsibilities of the second controller when the second controller fails. 'Failback' is the reverse operation, wherein the second controller, having been either repaired or replaced, recovers control over its originally-attached storage devices. Since each controller is capable of accessing the storage devices attached to the other controller as a result of the failover, there is no need to store and maintain a duplicate copy of the data, i.e., one set stored on the first controller's attached devices and a second (redundant) copy on the second controller's devices.
U.S. Pat. No. 5,274,645 (Dec. 28, 1993), to Idleman et al. discloses a dual-active configuration of storage controllers capable of performing failover without the direct involvement of the host. However, the direction taken by Idleman requires a multi-level storage controller implementation. Each controller in the dual-redundant pair includes a two-level hierarchy of controllers. When the first level or host-interface controller of the first controller detects the failure of the second level or device interface controller of the second controller, it re-configures the data path such that the data is directed to the functioning second level controller of the second controller. In conjunction, a switching circuit re-configures the controller-device interconnections, thereby permitting the host to access the storage devices originally connected to the failed second level controller through the operating second level controller of the second controller. Thus, the presence of the first level controllers serves to isolate the host computer from the failover operation, but this isolation is obtained at added controller cost and complexity.
Other known failover techniques are based on proprietary buses. These techniques utilize existing host interconnect "hand-shaking" protocols, whereby the host and controller act in cooperative effort to effect a failover operation. Unfortunately, the "hooks" for this and other types of host-assisted failover mechanisms are not compatible with more recently developed, industry-standard interconnection protocols, such as SCSI, which were not developed with failover capability in mind. Consequently, support for dual-active failover in these proprietary bus techniques must be built into the host firmware via the host device drivers. Because SCSI, for example, is a popular industry standard interconnect, and there is a commercial need to support platforms not using proprietary buses, compatibility with industry standards such as SCSI is essential. Therefore, a vendor-unique device driver in the host is not a desirable option.
U.S. patent application, Ser. No. 08/071,710 to Sicola et al., describes a dual-active, redundant storage controller configuration in which each storage controller communicates directly with the host and its own attached devices, the access of which is shared with the other controller. Thus, a failover operation may be executed by one of the storage controllers without the assistance of an intermediary controller and without the physical reconfiguration of the data path at the device interface.
U.S. Pat. No. 5,790,775 (Aug. 4, 1998) to Marks et al., discloses a system comprising a host CPU and a pair of storage controllers in a dual-active, redundant configuration. The pair of storage controllers reside on a common host side SCSI bus, which serves to couple each controller to the host CPU. Each controller is configured by a system user to service zero or more preferred host side SCSI IDs, each host side ID associating the controller with one or more units located thereon and used by the host CPU to identify the controller when accessing one of the associated units. If one of the storage controllers in the dual-active, redundant configuration fails, the surviving one of the storage controllers automatically assumes control of all of the host side SCSI IDs and subsequently responds to any host requests directed to the preferred host side SCSI IDs and associated units of the failed controller. When the surviving controller senses the return of the other controller, it releases to the returning other controller control of the preferred SCSI IDs of the failed controller. In another aspect of the Marks invention, the failover is made to appear to the host CPU as simply a re-initialization of the failed controller. Consequently, all transfers outstanding are retried by the host CPU after time outs have occurred. Marks discloses 'transparent failover', which is an automatic technique that allows for continued operation by a partner controller on the same storage bus as the failed controller. This technique is useful in situations where the host operating system trying to access storage does not have the capability to adequately handle multiple paths to the same storage volumes. Transparent failover makes the failover event look like a 'power-on reset' of the storage device. However, transparent failover has a significant flaw: it is not fault tolerant to the storage bus. If the storage bus fails, all access to the storage device is lost.
However, none of the above references discloses a system having a remote backup site connected to a host site via a dual fabric link, where the system provides in-order operations while a link is down as well as when the link returns to operation while the transaction log is 'merged' back with the remote site.
Therefore, there is a clearly felt need in the art for a disaster tolerant data storage system capable of rapid recovery from disruptions such as short-term link failure and remote site failover, without the direct involvement of the host computer, wherein both original and backup copies of user data are returned to the same state without incurring the overhead of a full copy operation.
Accordingly, the above problems are solved, and an advance in the field is accomplished by the system of the present invention which provides a completely redundant configuration including dual Fibre Channel fabric links interconnecting each of the components of two data storage sites, wherein each site comprises a host computer and associated data storage array, with redundant array controllers and adapters. The present system is unique in that each array controller is capable of performing all of the data replication functions, and each host 'sees' remote data as if it were local. The array controllers also perform a command and data logging function which stores all host write commands and data 'missed' by the backup storage array during a situation wherein the links between the sites are down, the remote site is down, or where a site failover to the remote site has occurred. Log 'units' are used to store, in order, all commands and data for every transaction which was 'missed' by the backup storage array when one of the above system error conditions has occurred. The present system provides rapid and accurate recovery of backup data at the remote site by sending all logged commands and data from the logging site over the link to the backup site in order, while avoiding the overhead of a full copy operation.
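The logging behavior described above may be illustrated by the following minimal sketch. It is not the controller implementation of the present system; the names `WriteCommand`, `ReplicatingController`, `host_write` and `merge_log` are hypothetical, and the storage arrays are modeled as in-memory dictionaries purely for illustration. The sketch assumes that, while any logged entries remain, new host writes are appended to the log rather than sent directly, so that the backup copy always receives writes in their original order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteCommand:
    """One host write: hypothetical fields chosen for illustration only."""
    volume: str
    lba: int
    data: bytes


class ReplicatingController:
    """Toy model of a controller that logs writes missed by the backup array."""

    def __init__(self):
        self.local = {}      # stands in for the local storage array
        self.remote = {}     # stands in for the backup storage array
        self.link_up = True  # state of the inter-site link
        self.log = deque()   # ordered log of writes the backup has missed

    def host_write(self, cmd):
        # The local copy is always updated.
        self.local[(cmd.volume, cmd.lba)] = cmd.data
        if self.link_up and not self.log:
            # Normal operation: replicate immediately.
            self.remote[(cmd.volume, cmd.lba)] = cmd.data
        else:
            # Link down, or a merge is still pending: log to preserve order.
            self.log.append(cmd)

    def merge_log(self):
        """Replay logged writes to the backup in order; return the count sent."""
        replayed = 0
        while self.link_up and self.log:
            cmd = self.log.popleft()
            self.remote[(cmd.volume, cmd.lba)] = cmd.data
            replayed += 1
        return replayed
```

In this sketch only the missed writes cross the link during recovery, which is the sense in which a full copy operation is avoided.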
The 'mirroring' of data for backup purposes is the basis for RAID ('Redundant Array of Independent [or Inexpensive] Disks') Level 1 systems, wherein all data is replicated on N separate disks, with N usually having a value of 2. Although the concept of storing copies of data at a long distance from each other (i.e., long distance mirroring) is known, the use of a switched, dual-fabric, Fibre Channel configuration as described herein is a novel approach to disaster tolerant storage systems. Mirroring requires that the data be consistent across all volumes. In prior art systems which use host-based mirroring (where each host computer sees multiple units), the host maintains consistency across the units. For those systems which employ controller-based mirroring (where the host computer sees only a single unit), the host is not signaled completion of a command until the controller has updated all pertinent volumes. The present invention is, in one aspect, distinguished over the previous two types of systems in that the host computer sees multiple volumes, but the data replication function is performed by the controller. Therefore, a mechanism is required to communicate the association between volumes to the controller. To maintain this consistency between volumes, the system of the present invention provides a mechanism of associating a set of volumes to synchronize the logging to the set of volumes, so that the log is consistent when it is "played back" to the remote site.
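One way to picture the association of a set of volumes with a common log is the following sketch. It is an illustrative assumption rather than the disclosed mechanism: the names `AssociationSet`, `LogEntry`, `log_write` and `playback`, and the use of a single sequence counter shared by all member volumes, are hypothetical. The point shown is that writes to different volumes of the set are recorded in one ordered log, so that playback preserves the order in which the writes occurred across volumes.

```python
from typing import NamedTuple


class LogEntry(NamedTuple):
    seq: int      # position in the shared log (hypothetical field)
    volume: str
    lba: int
    data: bytes


class AssociationSet:
    """Toy model: a group of volumes that share a single ordered log."""

    def __init__(self, volumes):
        self.volumes = frozenset(volumes)
        self.log = []
        self._seq = 0

    def log_write(self, volume, lba, data):
        # Only member volumes may log through this set.
        if volume not in self.volumes:
            raise KeyError(f"{volume!r} is not a member of this set")
        self._seq += 1
        self.log.append(LogEntry(self._seq, volume, lba, data))
        return self._seq

    def playback(self):
        """Yield entries in the order they were logged, across all volumes."""
        yield from self.log
```

For example, if an application writes to a journal volume, then a data volume, then the journal volume again, playback in this sketch yields the three writes in that same order, regardless of which volume each write targeted.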
Each array controller in the present system has a dedicated link via a fabric to a partner on the remote side of the long-distance link between fabric elements. Each dedicated link does not appear to any host as an available link for data access; however, it is visible to the partner array controllers involved in data replication operations. These links are managed by each partner array controller as if being 'clustered' with a reliable data link between them.
The fabrics comprise two components, a local element and a remote element. An important aspect of the present invention is the fact that the fabrics are 'extended' by standard e-ports (extension ports). The use of e-ports allows for standard Fibre Channel cable to be run between the fabric elements or the use of a conversion box to convert the data to a form such as telco ATM or IP. The extended fabric allows the entire system to be viewable by both the hosts and storage.
The dual fabrics, as well as the dual array controllers, dual adapters in hosts, and dual links between fabrics, provide high-availability and present no single point of failure. A distinction here over the prior art is that previous systems typically use other kinds of links to provide the data replication, resulting in the storage not being readily exposed to hosts on both sides of a link. The present configuration allows for extended clustering where local and remote site hosts are actually sharing data across the link from one or more storage subsystems with dual array controllers within each subsystem.
The present system is further distinguished over the prior art by other additional features, including independent discovery of initiator to target system and automatic rediscovery after link failure. In addition, device failures are detected by 'heartbeat' monitoring by each array controller. Furthermore, no special host software is required to implement the above features because all replication functionality is totally self-contained within each array controller and automatically done without user intervention.
An additional aspect of the present system is the ability to function over two links with data replication traffic. If failure of a link occurs, as detected by the 'initiator' array controller, that array controller will automatically 'failover', or move the base of data replication operations to its partner controller. At this time, all transfers in flight are discarded, and are therefore not completed to the host. The host simply sees a controller failover at the host OS (operating system) level, causing the OS to retry the operations to the partner controller. The array controller partner continues all 'initiator' operations from that point forward. The array controller whose link failed will continuously watch the status of its link to the same controller on the other 'far' side of the link. That status changes to a 'good' link when the array controllers have established reliable communications between each other. When this occurs, the array controller 'initiator' partner will 'failback' the link, moving operations back to the newly reliable link. This procedure re-establishes load balance for data replication operations automatically, without requiring additional features in the array controller or host beyond what is minimally required to allow controller failover.
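The link failover and failback sequence described above may be sketched as a small state model. This is a simplified illustration under stated assumptions, not the controller firmware: the class `ReplicationPair`, the controller names "A" and "B", and the method `report_link` are hypothetical, and the model tracks the replication traffic of only one controller of the pair rather than the balanced load of both.

```python
class ReplicationPair:
    """Toy model of which controller of a pair carries replication traffic."""

    def __init__(self, preferred="A"):
        self.link_good = {"A": True, "B": True}  # status of each dedicated link
        self.preferred = preferred               # controller that normally initiates
        self.active = preferred                  # controller currently initiating
        self.in_flight = []                      # transfers not yet completed

    @staticmethod
    def _partner(name):
        return "B" if name == "A" else "A"

    def report_link(self, controller, good):
        """Record a link status change and return the resulting action."""
        self.link_good[controller] = good
        partner = self._partner(controller)
        if not good and controller == self.active and self.link_good[partner]:
            # Failover: discard transfers in flight (the host OS retries them)
            # and move replication operations to the partner controller.
            self.in_flight.clear()
            self.active = partner
            return "failover"
        if good and controller == self.preferred and self.active != self.preferred:
            # Failback: the original link is reliable again, so move back.
            self.active = self.preferred
            return "failback"
        return "no change"
```

In this sketch a link failure on the active controller returns "failover" and a later report that the same link is good returns "failback"; no host-side logic appears in the model beyond the retry of discarded transfers.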
Because the present system provides a logging mechanism for storing all commands and data for every transaction that occurs in the failure situations described above, the system is thus capable of rapid recovery from disruptions such as short-term link failure and remote site failover, without the direct involvement of the host computer, and without incurring the overhead of a full copy operation. Furthermore, the present system's method of logging upon site failover provides for rapid site failback and resynchronization in situations wherein there is only a temporary downtime at the primary site.