Computer systems are subject to any number of operational and environmental faults, ranging from disk failures and power outages to earthquakes and floods. While repair or replacement of damaged equipment is costly, the interruption of access to critical data may be far more severe. For this reason, businesses are taking great precautions to ensure the availability of their data.
The simplest guard against failure is replication. When a system component is replicated, a spare is ready to take over if the primary should fail. Replication can occur at many levels, according to the faults it guards against.
The simplest way to replicate only data is with tape backups. Tape backups are a popular replication strategy because they are simple and inexpensive. They ensure that data is safe if a disk or entire machine is damaged or destroyed. Further, if tapes are taken off-site or stored in a protective vault, tape backups can protect data against site-wide disasters. However, tape backups only guard against the ultimate unavailability--data loss. Restoring data from a tape can take hours, or even days, and all changes since the most recent tape backup are lost.
Replicating disks, through widespread strategies such as RAID, protects against the failure of a single disk. Many vendors offer disk replication solutions that are efficient and easy to manage. With disk replication, recovery from a disk failure can be fast and invisible to applications. However, disk replication does not account for the failure of the host machine or destruction of the entire site. Used in conjunction with tape backups, disk replication can prevent data loss, but availability will still suffer under higher-level failures.
Replication of a server machine protects against hardware and software errors on the data server. Disks can be dual-ported, allowing more than one machine direct access to raw data. Along with disk replication strategies, a replicated server can provide high availability even after single disk and single server failures. Just as with replicated disks, tape backups can guard against data loss in a site-wide failure, but extended downtime will still occur.
Replicating an entire site across extended distances, called "geographic replication," increases data availability by accounting for site-wide faults, such as extended power outages, fires, earthquakes, or even terrorist attacks. In a geographic replication system, normal system operation occurs at a local site. Data is mirrored to a remote site, which can take over system functions if the local site is lost. Geographic replication does not mirror application address spaces or any other volatile memory; only data written to stable storage devices is transmitted to the remote site. Distributing cluster storage across extended distances is complex and time-consuming; consequently, failover to the remote site cannot be performed as efficiently and invisibly as failover to a secondary server or hot-swapping a new disk into a storage array. Geographic replication provides blanket protection for high availability; i.e., when all other techniques fail, a complete site failover can still occur under a geographic replication regime.
A generic geographic replication system 100 is shown in FIG. 1. This system has a local site 102 comprising a file server 104, file storage 106 (e.g., a hard disk drive), and clients 108, 110. Note that the term "local" as used in the present application is relative; i.e., the local site is simply the site whose server normally serves the clients 108, 110. The local site 102 is coupled to a remote site 112, possibly by a wide area network (WAN). The remote site 112 includes a file server 114 and file storage 116. Data is mirrored from the local disk 106 to the remote disk 116 in the course of normal operation of the local server 104 so that, if a failure should occur, the remote server 114 is able to serve file requests from the clients 108 or 110 with minimal or no loss of file system state.
A geographic replication system must be able to capture all state changes (hereafter referred to as writes) to file systems and raw devices. Self-consistency must always be maintained at the remote site. Even if the remote site is not current with the primary site, it must be internally consistent. Geographic replication of data must be invisible to applications. The replication system must support at least two levels of data safety: 1-safe and 2-safe (for more information, see Jim Gray and Andreas Reuter, "Transaction Processing: Concepts and Techniques," Morgan Kaufmann, San Francisco, Calif., 1993, which is entirely incorporated herein by reference).
In 1-safe, or asynchronous, mode, a replication system logs operations at the primary site and periodically replicates the data to the remote site. In 1-safe mode, the log of operations not yet applied to the remote site must be serializable and consistent with the operations applied to the local site. Thus, although the remote site may lag behind the local site, it is almost impossible for an operation to be applied at the remote site that was not applied to the local site, and it is almost impossible for operations to be applied at the remote site in a different order than they were applied at the local site. At start-up, the local and remote sites must automatically synchronize their data so that any future mutually applied operations result in identical states. The geographic replication system must be compatible with any replication services provided by databases (for more information, see Oracle, "Oracle7 Distributed Database Technology and Symmetric Replication," available at: http://www.oracle.com/products/oracle7/server/whitepapers/replication/html/index.html) or other applications.
2-safe, or synchronous, mode copies data to the remote site before an operation on the local site is allowed to complete. The replication system could also support an additional level of data consistency called very safe mode. Very safe mode enhances 2-safe mode, adding a two-phase commit protocol to ensure consistency between the local and remote sites. The synchronization (or resynchronization) of local and remote sites that occurs in very safe mode should not require the local site to be taken off-line.
Read-only access to the remote site should be available during normal operation. The replication service should automatically configure and start itself at system boot. This can be accomplished using boot scripts and user-level programs that invoke the replication API. The replication service should provide file deletion protection.
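The difference between the 1-safe and 2-safe modes described above can be illustrated with a minimal Python sketch. The class names `Site` and `Replicator` and their methods are hypothetical, chosen only for illustration; the sketch ignores networking, failures, and the two-phase commit of very safe mode, and is not a description of any particular product.

```python
from collections import deque


class Site:
    """Hypothetical stand-in for a site's stable storage."""

    def __init__(self):
        self.applied = []  # operations applied at this site, in order

    def apply(self, op):
        self.applied.append(op)


class Replicator:
    """Toy model of 1-safe and 2-safe replication (illustrative only)."""

    def __init__(self, local, remote, mode="1-safe"):
        if mode not in ("1-safe", "2-safe"):
            raise ValueError("mode must be '1-safe' or '2-safe'")
        self.local = local
        self.remote = remote
        self.mode = mode
        self.log = deque()  # operations not yet applied at the remote site

    def write(self, op):
        self.local.apply(op)
        if self.mode == "2-safe":
            # Synchronous: the remote copy is made before write() returns,
            # i.e. before the local operation is allowed to complete.
            self.remote.apply(op)
        else:
            # Asynchronous: only log the operation; the remote site lags.
            self.log.append(op)

    def flush(self):
        """Periodic replication: drain the log in the order it was written."""
        while self.log:
            self.remote.apply(self.log.popleft())
```

In 1-safe mode the remote site sees nothing until `flush()` runs, but it then receives exactly the logged operations in their original order; in 2-safe mode the two sites hold the same operations as soon as each `write()` returns.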
Replicating data across geographically separated sites is not a new idea. Several vendors already offer geographic replication solutions, which are now briefly described.
EMC
EMC supports geographic replication in its Symmetrix product (for more information, see EMC, "Symmetrix 3000 and 5000 ICDA Product Description Guide," available at: http://www.emc.com/products/hardware/enterprise/new5000/new5000.htm and EMC, "SRDF--Symmetrix Remote Data Facility," available at: http://www.emc.com/products/software/buscont/srdf/srdf2.htm). Symmetrix is a storage hardware unit compatible with Sun servers and the Solaris operating system. The Symmetrix Remote Data Facility (SRDF) provides geographic replication for Symmetrix customers. SRDF requires use of a Symmetrix storage system at both the local and remote sites. The local Symmetrix unit is connected to the remote Symmetrix unit with an ESCON fibre link. Basic ESCON links are limited to 60 kilometers, but with an additional device on the sending and receiving ends, ESCON data can be transmitted over wide area networks.
SRDF is implemented entirely within the Symmetrix unit. Writes are applied to the disk on the local site and transmitted to the remote site along the ESCON link either synchronously or non-synchronously, depending on the mode of operation. SRDF documentation makes no mention of a stable log, meaning that transactions might be lost if a crash occurs before transmission can occur.
Further, SRDF does not perform well over long distances. SRDF supports non-synchronous replication in two ways: semi-synchronous and adaptive copy. In adaptive copy mode, data is transferred from the local site to the remote site with no return acknowledgments. In semi-synchronous mode, an I/O operation is performed at the local site, after which control is returned to the application. The written data is then asynchronously copied to the remote site. No other write requests for the affected logical volume are accepted until the transfer of the initial request has been acknowledged. Since SRDF is implemented in the storage unit, I/O operations are expressed as low-level SCSI or ESCON directives. A write system call could translate to several commands to the storage system, some modifying data and others modifying file system metadata. If each of these individual commands must be acknowledged across a wide area network before the next can proceed, performance at the local site will suffer.
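The cost of acknowledging each low-level command across a wide area network can be estimated with a rough calculation. The following Python sketch uses a deliberately simplified model: every command costs a fixed local I/O time, the transfer and acknowledgment of each command cost one network round trip, and all commands target the same logical volume. The function names and the numbers in the usage note are assumptions for illustration, not measurements of SRDF.

```python
def semi_sync_elapsed_ms(n_commands, local_io_ms, wan_rtt_ms):
    """Time until the last of n back-to-back commands to one logical
    volume completes locally, under the simplified model above."""
    if n_commands < 1:
        raise ValueError("n_commands must be at least 1")
    # Every command except the last holds up its successor for one local
    # I/O plus one round trip; the last command needs only its local I/O.
    return (n_commands - 1) * (local_io_ms + wan_rtt_ms) + local_io_ms


def local_only_elapsed_ms(n_commands, local_io_ms):
    """Same sequence of commands with no remote acknowledgment at all."""
    if n_commands < 1:
        raise ValueError("n_commands must be at least 1")
    return n_commands * local_io_ms
```

For example, if one write system call becomes four storage commands, each taking 1 ms locally, and the round trip is 50 ms, `semi_sync_elapsed_ms(4, 1, 50)` gives 154 ms, compared with 4 ms from `local_only_elapsed_ms(4, 1)`.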
SRDF does include a synchronous mode of operation. Updates are first applied to the local site. The data is then transmitted to the remote site. The operation on the local site cannot return until an acknowledgment has been received from the remote site. This synchronous mode is 2-safe, but not very safe. If the local site were to fail after committing the update but before transmitting it to the remote site, then the two sites would be inconsistent. Further, SRDF provides no log by which to determine the transactions that were lost in a site failure.
Implementing replication at such a low level has other disadvantages. First, since SRDF connects two Symmetrix storage units, only the storage system is replicated at the remote site. If a disaster incapacitates the local site, a server will have to be bootstrapped at the remote site, reconstructing the file system, before data will be available. A second problem with the low-level approach is that replication occurs on the granularity of entire volumes, rather than files and directories. Also, the hardware for mirrored volumes must be symmetric at the two sites. Finally, SRDF is a mixed hardware and software solution--all components of the storage system must be purchased from EMC.
Uniq
Uniq takes a high-level approach to replication with a new file system called UPFS (for more information, see Uniq Software Services, "UPFS--A Highly Available File System," Jul. 21, 1997, White Paper available at: http://www.uniq.com.au/products/upts/UPFS-WhitePaper/UPFS-WhitePaper-1.html). Based on VFS, UPFS does not require specialized hardware. It transparently manages several file systems in parallel, locally using native file systems and remotely using NFS. Thus, geographic replication is performed using NFS protocols over Unix networking protocols.
Unfortunately, NFS may not be ideally suited for geographic replication. NFS protocols do not provide good utilization of a wide area network. For instance, name lookup occurs one component at a time. Opening a file deep in the directory hierarchy requires a large number of RPCs, incurring a significant latency over an extended distance. Also, every successful write operation returns a complete set of file attributes, consuming precious bandwidth (for more information, see Nowicki, Bill, "NFS: Network File System Protocol Specification," RFC 1094, March 1989, available at: http://www.internic.net/rfc/rfc1094.txt). Another potential shortcoming of NFS is that it does not support exporting and mounting of raw devices. For efficiency, many databases operate on raw devices rather than files in a structured file system. Since NFS does not support operations on raw devices, UPFS cannot provide geographic replication for these products.
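The latency of component-at-a-time name lookup can be approximated with a short Python sketch. It assumes one lookup RPC per path component below the mount point, one full round trip per RPC, and no client-side caching; the function names, the sample path, and the round-trip time are hypothetical.

```python
from pathlib import PurePosixPath


def lookup_rpcs(path_below_mount):
    """Number of lookup RPCs needed to resolve a path, assuming one RPC
    per path component and no cached directory entries."""
    parts = PurePosixPath(path_below_mount).parts
    # Ignore a leading "/" so absolute and relative spellings agree.
    return len([p for p in parts if p != "/"])


def lookup_latency_ms(path_below_mount, wan_rtt_ms):
    """Lower bound on name-resolution latency: one round trip per RPC."""
    return lookup_rpcs(path_below_mount) * wan_rtt_ms
```

Under these assumptions, resolving `home/alice/projects/src/main.c` takes five lookups, so a 50 ms round trip adds at least 250 ms before the file can even be opened.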
In addition, Uniq makes no mention of 2-safe or very safe capabilities. Replication is performed asynchronously to optimize performance on the local site.
Qualix
Qualix implements geographic replication with its DataStar product (for more information, see Qualix, "Qualix DataStar Primer and Product Overview," April 1997, White Paper available at: http://www.qualix.com/html/datastar_wp.html). DataStar uses a special Solaris device driver installed between the file system and regular device drivers to intercept writes to raw and block devices. DataStar logs these writes, and periodically a daemon process transmits the log to the remote site via TCP/IP. The log is chronologically ordered for all disk volumes within user-defined logical groups.
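The idea of a single chronologically ordered log shared by all volumes in a logical group can be sketched in Python as follows. This is an illustration of the general technique only, not DataStar's implementation; the class `GroupedWriteLog`, its methods, and the volume and group names are all hypothetical, and the actual transmission over TCP/IP is omitted.

```python
import itertools
from collections import defaultdict


class GroupedWriteLog:
    """Toy write log that keeps one ordered stream per logical group."""

    def __init__(self, volume_to_group):
        self._group_of = dict(volume_to_group)  # volume name -> group name
        self._seq = itertools.count(1)          # global arrival order
        self._pending = defaultdict(list)       # group name -> log entries

    def record(self, volume, offset, data):
        """Log an intercepted write and return its sequence number."""
        group = self._group_of[volume]
        entry = (next(self._seq), volume, offset, data)
        # Appending in arrival order keeps each group's log chronological
        # across all of the volumes that belong to it.
        self._pending[group].append(entry)
        return entry[0]

    def drain(self, group):
        """Return and clear the pending entries for one group, as a
        periodic transmission daemon might before sending them."""
        return self._pending.pop(group, [])
```

Because ordering is kept per group rather than per volume, writes to two volumes that together hold, say, a database and its journal would be replayed remotely in the same relative order in which they were intercepted.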
DataStar captures I/O commands below the file system, which controls the layout of data and metadata on the disk volume. This requires a restriction on the symmetry of the local and remote sites. Specifically, a replicated logical device on the local site must be mapped to a logical device on the remote site, and, of course, the device on the remote site must be at least as big as the device on the local site. The one-to-one mapping is not particularly restrictive until a change is necessary. For instance, enlarging a replicated file system or adding new replicated file systems could require disruptive repartitioning at the backup site.
Qualix makes no mention of 2-safe or very safe modes of operation. However, DataStar logs replicated operations at the local site, allowing a retrieval of the transactions that were lost in a site failure.
DataStar shares another characteristic with other low-level approaches to replication in that decisions must be made on the granularity of entire volumes rather than directories or files.