Businesses employ large-scale data processing systems for processing and reliably storing their business critical data. FIG. 1 shows in block diagram form relevant components of an exemplary data processing system 10. Data processing system 10 includes a primary node PExample in data communication with asynchronous secondary node ASExample and synchronous secondary node SSExample. As will be more fully described below, primary node PExample stores a data volume VExample while asynchronous secondary node ASExample and synchronous secondary node SSExample store replicas RVAExample and RVSExample, respectively. Replicas RVAExample and RVSExample are maintained as real-time (or near real-time) copies of volume VExample using asynchronous and synchronous replication techniques, respectively.
The devices of primary node PExample are reliable in general. Reliable devices are nonetheless subject to failure as a result of natural disasters, acts of terrorism, hardware failure, or software failure. If primary node PExample fails as a result of, for example, an act of terrorism, data volume VExample will be inaccessible. Replication is one mechanism used by many businesses to ensure reliable access to a data volume. Data replication is well known in the art. Essentially, replication is a process of creating real or near real-time copies (replicas) of a data volume such as data volume VExample shown in FIG. 1. Replicas are typically created and maintained at remotely located secondary nodes such as ASExample and SSExample. In the event of failure of primary node PExample as a result of, for example, a natural disaster, requests from client computer systems (not shown) to access data of volume VExample can be redirected to and serviced by one of the secondary nodes ASExample or SSExample. For example, suppose a client computer system generates a request to read data from volume VExample after primary host 12 is rendered inoperable. The read request can be redirected to secondary node ASExample or SSExample using mechanisms well known in the art. In response to receiving the read request, secondary nodes ASExample or SSExample can read and return a copy of requested data from replica RVAExample or RVSExample.
Primary node PExample includes a primary host (e.g., server computer system) 12 coupled to data storage subsystems 16-20 via storage interconnect 22. For purposes of explanation, storage interconnect 22 will take form in a storage area network (SAN), it being understood that the term storage interconnect should not be limited thereto. SAN 22 may include devices (e.g., switches, routers, hubs, etc.) that cooperate to transmit input/output (IO) transactions between primary host 12 and storage subsystems 16-20.
Each of the data storage subsystems 16-20 includes several physical storage devices. For purposes of explanation, the physical storage devices of subsystems 16-20 take form in hard disks, it being understood that the term physical storage device should not be limited to hard disks. Data storage subsystems 16-20 may take different forms. For example, data storage subsystem 16 may consist of “just a bunch of disks” (JBOD) connected to an array controller card. Data storage subsystem 18 may consist of an intelligent disk array. Data storage subsystem 20 may consist of a block server appliance. For purposes of explanation, each of the data storage subsystems 16-20 takes form in a disk array, it being understood that the term data storage subsystem should not be limited thereto.
As noted, each of the disk arrays 16-20 includes several hard disks. The hard disk is the most popular, permanent storage device currently used. A hard disk's total storage capacity is divided into many small chunks called physical memory blocks. For example, a 10 GB hard disk contains 20 million physical memory blocks, with each block able to hold 512 bytes of data. Any random physical memory block can be written to or read from in about the same amount of time, without having to first read from or write to other physical memory blocks. Once written, a physical memory block continues to hold data even after the hard disk is powered down.
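The block arithmetic above can be checked with a short sketch. The function names are hypothetical, and the two interpretations of "10 GB" (decimal and binary) are an assumption added here for illustration; the text does not say which one is meant.

```python
BLOCK_SIZE = 512  # bytes per physical memory block, as stated in the text


def block_count(capacity_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the number of whole physical memory blocks in the given capacity."""
    return capacity_bytes // block_size


def block_byte_offset(block_index: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the byte offset at which a block starts (enables random access)."""
    return block_index * block_size


# Assumption: "10 GB" may mean 10 * 10**9 bytes or 10 * 2**30 bytes.
decimal_blocks = block_count(10 * 10**9)  # 19,531,250 blocks
binary_blocks = block_count(10 * 2**30)   # 20,971,520 blocks
```

Under either interpretation the count is close to the 20 million figure quoted above.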
Primary host 12 includes an application 24 and a storage manager 26. Storage manager 26 includes a replicator component. Storage manager 26 may take form in software instructions executing on one or more processors of primary host 12. Volume Manager™ provided by VERITAS Software Corporation of Mountain View, Calif., is an exemplary storage manager, it being understood that the term storage manager should not be limited thereto. Although many of the examples described herein will emphasize virtualization architecture and terminology associated with the VERITAS Volume Manager™, the software and techniques described herein can be used with a variety of different storage managers and architectures.
Storage managers perform several functions including the creation of virtualized data storage. Storage virtualization is the technique of aggregating hard disks into virtual disks that typically have better characteristics (e.g., higher storage capacity, greater effective data transfer rates, etc.) than individual hard disks. Hereinafter, virtual disks will be referred to as “storage objects.” Storage managers can also aggregate storage objects to create higher level storage objects. Storage objects are abstractions and can be logically viewed as an array of logical memory blocks that store or are configured to store data. While it is said that a logical memory block stores or is configured to store data, in reality the data is stored in at least one physical memory block of a hard disk allocated directly or indirectly to the storage object.
Storage objects aggregated from hard disks can themselves be aggregated to form storage objects called logical data volumes. FIG. 1 illustrates a visual representation of volume VExample employed in primary node PExample. Volume VExample includes nmax logical memory blocks that store or are configured to store data. FIG. 1 also illustrates replicas RVAExample and RVSExample created and employed in secondary nodes ASExample and SSExample, respectively. Replicas RVAExample and RVSExample also contain nmax logical memory blocks that store or are configured to store data. Given that replicas RVAExample and RVSExample are maintained as real-time (or near real-time) copies of volume VExample, respective logical memory blocks n of volume VExample and replicas RVAExample and RVSExample store or are configured to store identical data.
Properties of a data volume depend on how its underlying storage objects or hard disks are aggregated. In other words, the method of aggregation determines the storage object type. In theory, there are a large number of possible methods of aggregation. The more common forms of aggregation include concatenated storage, striped storage, RAID storage, and mirrored storage. A more thorough discussion of how storage objects or hard disks can be aggregated can be found within Dilip M. Ranade [2002], “Shared Data Clusters,” Wiley Publishing, Inc., which is incorporated herein by reference in its entirety. To illustrate, presume volume VExample is a two-way mirrored storage object created by storage manager 26 from underlying storage objects M1 and M2. Further, presume storage object M1 is formed by concatenating hard disks d1 and d2 (not shown) of disk array 16, and that storage object M2 is formed by concatenating hard disks d3 and d4 (not shown) of disk array 18. Data volume VExample consists of nmax logical memory blocks, and each storage object M1 and M2 consists of nmax logical memory blocks as shown in FIG. 1. Because storage objects M1 and M2 are mirrors of volume VExample, logical memory blocks n of M1 and M2 store or are configured to store identical data. It is noted that replicas RVAExample and RVSExample need not be created as two-way mirrored volumes like volume VExample. Nonetheless, respective logical memory blocks n of volume VExample, replica RVAExample, and replica RVSExample store or are configured to store identical data.
Storage managers typically create storage object descriptions that describe the relationship between storage objects, their underlying storage objects or hard disks, or other storage objects. Storage object descriptions may identify the existence of one or more replicas of the corresponding storage object. Additionally, storage object descriptions often include configuration maps. A configuration map maps a logical memory block of a corresponding storage object to one or more logical memory blocks of one or more underlying storage objects or to one or more physical memory blocks of one or more hard disks. Storage managers use configuration maps to translate IO transactions directed to one storage object into one or more IO transactions that access data of one or more underlying storage objects or hard disks.
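To make the idea of a configuration map concrete, the following is a minimal sketch of one for a concatenated storage object, such as one formed from two hard disks. The class name `ConcatMap`, the disk names, and the disk sizes are hypothetical and chosen only for illustration; an actual storage manager may represent its maps quite differently.

```python
from typing import List, Tuple


class ConcatMap:
    """Hypothetical configuration map for a concatenated storage object."""

    def __init__(self, disks: List[Tuple[str, int]]) -> None:
        # Each entry is (disk name, number of physical blocks on that disk),
        # listed in concatenation order.
        self.disks = disks

    def translate(self, n: int) -> Tuple[str, int]:
        """Map logical memory block n to a (disk, physical block) pair."""
        if n < 0:
            raise IndexError("logical block index must be non-negative")
        offset = n
        for name, size in self.disks:
            if offset < size:
                return (name, offset)
            offset -= size  # skip past this disk and keep looking
        raise IndexError("logical block index beyond end of storage object")


# Assumed sizes: two disks of 4 blocks each, giving 8 logical blocks in total.
m1_map = ConcatMap([("d1", 4), ("d2", 4)])
first = m1_map.translate(0)   # lands on the first disk
sixth = m1_map.translate(5)   # lands on the second disk, block 1
```

An IO transaction directed at logical memory block 5 of the storage object would, under these assumed sizes, be translated into an IO transaction on physical block 1 of the second disk.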
Storage manager 26 creates and stores a description for volume VExample. This description indicates or is updated to indicate that replicas RVAExample and RVSExample of volume VExample exist in secondary nodes ASExample and SSExample, respectively. The description also includes a configuration map for volume VExample. The configuration map maps each logical memory block n of volume VExample to respective logical memory blocks n of storage objects M1 and M2.
Application 24 generates IO transactions to write data to volume VExample not knowing that volume VExample is an aggregation of underlying storage objects M1 and M2 and not knowing of the existence of replicas RVAExample and RVSExample. IO transactions generated by application 24 are provided to storage manager 26. When storage manager 26 receives an IO transaction from application 24 to write data Dnew to logical memory block n of volume VExample, storage manager 26 accesses the description for volume VExample. From the configuration map of the description, storage manager 26 learns that logical memory block n is mapped to logical memory block n in storage objects M1 and M2. Accordingly, storage manager 26 generates first and second IO transactions to write data Dnew to logical memory block n in storage objects M1 and M2, respectively. From the description, storage manager 26 also learns that after data of logical block n is modified, copies of data in logical memory block n of volume VExample should be sent to secondary nodes ASExample and SSExample to enable updating of replicas RVAExample and RVSExample, respectively, and thereby maintain consistency between volume VExample and replicas RVAExample and RVSExample as will be more fully described below.
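The write path just described can be sketched as follows. This is an in-memory illustration only: the class names, the list-based block storage, and the returned list of pending replica updates are assumptions made for the example and are not taken from any particular storage manager.

```python
from typing import List, Tuple


class StorageObject:
    """Hypothetical stand-in for an array of logical memory blocks."""

    def __init__(self, name: str, nmax: int) -> None:
        self.name = name
        self.blocks: List[bytes] = [b""] * nmax

    def write(self, n: int, data: bytes) -> None:
        self.blocks[n] = data


class MirroredVolume:
    """Two-way mirror: block n of the volume maps to block n of each mirror."""

    def __init__(self, mirrors: List[StorageObject], replica_nodes: List[str]) -> None:
        self.mirrors = mirrors
        self.replica_nodes = replica_nodes  # nodes named in the volume description

    def write(self, n: int, data: bytes) -> List[Tuple[str, int, bytes]]:
        # One IO transaction from the application becomes one per mirror.
        for mirror in self.mirrors:
            mirror.write(n, data)
        # Report which replicas must later receive a copy of block n.
        return [(node, n, data) for node in self.replica_nodes]


m1 = StorageObject("M1", nmax=8)
m2 = StorageObject("M2", nmax=8)
volume = MirroredVolume([m1, m2], ["ASExample", "SSExample"])
pending = volume.write(3, b"Dnew")
```

The application calls only `volume.write`; the fan-out to M1 and M2 and the list of replica updates stay hidden behind the volume, mirroring the text's point that application 24 is unaware of both.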
FIG. 1 shows primary node PExample in data communication with asynchronous secondary node ASExample and synchronous secondary node SSExample via communication network 32 and data link 30, respectively. Secondary nodes ASExample and SSExample include secondary hosts (e.g., server computer systems) 38 and 52, respectively. Secondary host 38 is coupled to data storage subsystems 40 and 42 via storage interconnect 44. For purposes of explanation, storage interconnect 44 takes form in a SAN. SAN 44 transmits IO transactions between host 38 and storage subsystems 40 and 42. Likewise, secondary host 52 of secondary node SSExample is coupled to data storage subsystems 54 and 56 via storage interconnect 58. For purposes of explanation, storage interconnect 58 takes form in a SAN. SAN 58 transmits IO transactions between secondary host 52 and storage subsystems 54 and 56. For purposes of explanation, each of the data storage subsystems in secondary nodes SSExample and ASExample will take form in a disk array. Each of the disk arrays in secondary nodes SSExample and ASExample includes several hard disks. The hard disks of secondary node ASExample are allocated to store the contents of replica RVAExample, while the hard disks of secondary node SSExample are allocated to store the contents of replica RVSExample. Secondary hosts 38 and 52 include applications 46 and 60, respectively, each operating on one or more processors. Applications 46 and 60 are configured to generate IO transactions for accessing data in replicas RVAExample and RVSExample, respectively, in response to requests from client computer systems. Lastly, hosts 38 and 52 also include storage managers 48 and 62, respectively.
Replicas RVAExample and RVSExample are maintained as real-time or near real-time copies of volume VExample using asynchronous and synchronous replication techniques, respectively. As noted above, application 24 generates IO transactions to write data to volume VExample not knowing of the existence of replicas RVAExample and RVSExample. In general, when data of a logical memory block n of volume VExample is modified according to an IO write transaction generated by application 24, the primary node PExample, operating through the replicator component of storage manager 26, can transmit a copy of the data in block n after modification to each of the secondary nodes ASExample and SSExample. Each of the secondary nodes ASExample and SSExample, in turn, operating through storage managers 48 and 62, respectively, overwrites data existing in logical memory block n of replicas RVAExample and RVSExample, respectively, with the copy received from the primary node PExample. In synchronous replication, the IO transaction that modifies data of logical memory block n of data volume VExample is not considered complete until secondary node SSExample acknowledges receipt of the copy of the data of modified logical memory block n. In asynchronous replication, the replicator component of storage manager 26 logs a copy of each data block of the volume VExample that is modified by an IO transaction. Eventually, copies of the logged, modified data blocks are transmitted asynchronously to secondary node ASExample. The IO transaction that modifies data of logical memory block n of volume VExample is considered complete when a copy of the modified data block is logged for subsequent transmission to secondary node ASExample. Asynchronous replication requires ordering of dependent data modifications to ensure consistency of data between replica RVAExample and volume VExample. Synchronous replication does not require ordering.
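The difference between the two completion rules can be illustrated with a small in-memory sketch. The class names, the boolean acknowledgement, and the first-in, first-out log are assumptions made for this example; a real replicator would send data over a link and persist its log.

```python
from collections import deque
from typing import Deque, List, Tuple


class SecondaryNode:
    """Hypothetical secondary node holding a replica as a list of blocks."""

    def __init__(self, nmax: int) -> None:
        self.replica: List[bytes] = [b""] * nmax

    def apply(self, n: int, data: bytes) -> bool:
        self.replica[n] = data
        return True  # stands in for the acknowledgement of receipt


class Replicator:
    def __init__(self, sync_node: SecondaryNode, async_node: SecondaryNode) -> None:
        self.sync_node = sync_node
        self.async_node = async_node
        self.log: Deque[Tuple[int, bytes]] = deque()  # pending asynchronous updates

    def replicate_write(self, n: int, data: bytes) -> None:
        # Asynchronous side: the write counts as complete once it is logged.
        self.log.append((n, data))
        # Synchronous side: the write is not complete until acknowledged.
        if not self.sync_node.apply(n, data):
            raise IOError("synchronous secondary did not acknowledge")

    def drain(self) -> None:
        # Sending in log order preserves the ordering of dependent writes.
        while self.log:
            n, data = self.log.popleft()
            self.async_node.apply(n, data)


ss = SecondaryNode(nmax=4)
as_node = SecondaryNode(nmax=4)
replicator = Replicator(sync_node=ss, async_node=as_node)
replicator.replicate_write(0, b"A")
replicator.replicate_write(0, b"B")
lag_before_drain = (ss.replica[0], as_node.replica[0])
replicator.drain()
```

Before `drain` runs, the synchronous replica already holds the latest data while the asynchronous replica lags behind. Draining the log in order brings the asynchronous replica to the same state.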
Copies of data in modified logical memory blocks of volume VExample can be transmitted from primary node PExample to each of the secondary nodes SSExample and ASExample in separate transactions via the data link 30 and communication network 32, respectively. Each of the transactions transmitted to the secondary nodes ASExample and SSExample may include a single modified block of data or multiple modified blocks of data. Either way, each of the secondary nodes ASExample and SSExample receives a copy of each logical block n in volume VExample containing modified data. Existing data in replicas RVAExample and RVSExample are updated with corresponding data received from primary node PExample. In other words, data in logical memory block n of replicas RVAExample and RVSExample are overwritten with copies of data in modified logical memory blocks of volume VExample received from primary node PExample.
The time needed to update replicas RVAExample and RVSExample depends on the bandwidth of communication network 32 and data link 30, respectively. Data link 30 is presumed to have a higher data communication bandwidth when compared to communication network 32. Data link 30 may be able to transmit data between primary node PExample and synchronous secondary node SSExample. For purposes of explanation, communication network 32 will take form in the Internet, it being understood that the term “communication network” should not be limited thereto.
Techniques like check-summing or data compression can be used to reduce the amount of data to be transferred from the primary node PExample to the secondary node ASExample via lower bandwidth communication network 32. To illustrate, a Run Length Encoding (RLE) compression algorithm executing on primary host 12 may compress data (e.g., the contents of a logical memory block of volume VExample containing data modified by an IO transaction) before the data is transmitted to asynchronous secondary node ASExample. It is noted that either the data of a modified logical memory block n of volume VExample, or the differences between the data of a modified logical memory block and the data of the logical memory block before modification, can be compressed and transmitted to asynchronous secondary node ASExample. Unfortunately, techniques like data compression are processor and/or IO intensive and can unduly burden the processing bandwidth of primary host 12. In other words, while primary host 12 is performing data compression, host 12 may not be able to respond quickly to requests from client computer systems to read data from or write data to volume VExample. Accordingly, compressing data before it is transmitted to asynchronous secondary node ASExample via communication network 32 may increase the time needed to respond to access requests from client computer systems.
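As an illustration of the compression step, the following is a minimal byte-level RLE sketch. The (count, value) pair encoding, the 255-byte run limit, and the use of XOR to compute block differences are assumptions made for this example; the text does not specify an encoding or how differences are computed.

```python
def rle_encode(data: bytes) -> bytes:
    """Encode data as (run length, byte value) pairs, with runs capped at 255."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and run < 255 and data[i + run] == data[i]:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)


def rle_decode(encoded: bytes) -> bytes:
    """Invert rle_encode by expanding each (run length, byte value) pair."""
    out = bytearray()
    for k in range(0, len(encoded), 2):
        out += bytes([encoded[k + 1]]) * encoded[k]
    return bytes(out)


def xor_delta(old: bytes, new: bytes) -> bytes:
    """Assumed difference format: bytewise XOR of two equal-length blocks."""
    return bytes(a ^ b for a, b in zip(old, new))


# A 512-byte block in which a single byte was modified.
old_block = bytes(512)
new_block = bytearray(old_block)
new_block[10] = 7
new_block = bytes(new_block)

compressed = rle_encode(xor_delta(old_block, new_block))
restored = xor_delta(old_block, rle_decode(compressed))
```

In this example the difference between the old and new block is almost entirely zeros, so it compresses from 512 bytes to 8. Data with few repeated bytes can instead grow to twice its size under this encoding, and the encoding loop itself consumes processor time on the host that runs it.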