Large-scale data processing systems typically include several data storage subsystems (e.g., disk arrays), each containing many physical storage devices (e.g., hard disks) for storing critical data. Data processing systems often employ storage management systems to aggregate these physical storage devices to create highly reliable data storage. There are many types of highly reliable storage. Mirrored volume storage is an example. Mirrored volume storage replicates data over two or more mirrors of equal size. A logical memory block n of a mirrored volume maps to the same logical memory block n of each mirror. In turn, each logical memory block n of the mirrors maps directly or indirectly to one or more disk blocks of one or more physical devices. Mirrored volumes provide data redundancy. If an application is unable to access data of one mirror, a duplicate of the data sought should be available in an alternate mirror.
FIG. 1 illustrates relevant components of an exemplary data processing system 10 that employs a two-way mirrored volume. While the present invention will be described with reference to a two-way mirrored volume, the present invention should not be limited thereto. The present invention may find use with other types of redundant storage including, for example, a three-way mirrored storage volume. Data processing system 10 includes a host (e.g., server computer system) 12 coupled to data storage subsystems 14 and 16 via storage interconnect 20. For purposes of explanation, storage interconnect 20 will take form in a storage area network (SAN), it being understood that the term storage interconnect should not be limited thereto. SAN 20 may include devices (e.g., switches, routers, hubs, etc.) that cooperate to transmit input/output (IO) transactions between host 12 and storage subsystems 14 and 16.
Each of the data storage subsystems 14 and 16 includes several physical storage devices. For purposes of explanation, data storage subsystems 14 and 16 are assumed to include several hard disks. The term physical storage device should not be limited to hard disks. Data storage subsystems 14 and 16 may take different forms. For example, data storage subsystem 14 may consist of “Just a Bunch of Disks” (JBOD) connected to an array controller card. Data storage subsystem 16 may consist of a block server appliance. For purposes of explanation, each of the data storage subsystems 14 and 16 will take form in an intelligent disk array, it being understood that the term data storage subsystem should not be limited thereto.
As noted, each of the disk arrays 14 and 16 includes several hard disks. The hard disk is the most popular permanent storage device currently used. A hard disk's total storage capacity is divided into many small chunks called physical memory blocks or disk blocks. For example, a 10 GB hard disk contains 20 million disk blocks, with each block able to hold 512 bytes of data. Any random disk block can be written to or read from in about the same time, without first having to read or write other disk blocks. Once written, a disk block continues to hold data even after the hard disk is powered down. While hard disks in general are reliable, they are subject to occasional failure. Data systems employ data redundancy schemes such as mirrored volumes to protect against occasional failure of a hard disk.
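The block count given above can be checked with simple arithmetic. The following Python sketch is illustrative only; the function names are hypothetical, and the interpretation of "10 GB" (decimal or binary) is an assumption, since the text does not say which is meant. Either reading yields roughly 20 million 512-byte disk blocks.

```python
BLOCK_SIZE = 512  # bytes per disk block, as stated in the text


def disk_block_count(capacity_bytes, block_size=BLOCK_SIZE):
    """Number of whole disk blocks that fit in the given capacity."""
    return capacity_bytes // block_size


def block_byte_offset(block_number, block_size=BLOCK_SIZE):
    """Byte offset of a disk block; any block can be addressed directly."""
    return block_number * block_size


decimal_10gb = 10 * 10**9  # assumption: "GB" read as 10^9 bytes
binary_10gb = 10 * 2**30   # assumption: "GB" read as 2^30 bytes

print(disk_block_count(decimal_10gb))  # 19531250
print(disk_block_count(binary_10gb))   # 20971520
```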
Host 12 includes an application 22 executing on one or more processors. Application 22 generates IO transactions to access critical data in response to receiving requests from client computer systems (not shown) coupled to host 12. In addition to application 22, host 12 includes a storage manager 24 executing on one or more processors. Volume Manager™ provided by VERITAS Software Corporation of Mountain View, Calif., is an exemplary storage manager. Although many of the examples described herein will emphasize virtualization architecture and terminology associated with the VERITAS Volume Manager™, the software and techniques described herein can be used with a variety of different storage managers and architectures.
Storage managers can perform several functions. More particularly, storage managers can create storage objects (also known as virtualized disks) by aggregating hard disks such as those of disk arrays 14 and 16, underlying storage objects, or both. A storage object is an abstraction. FIG. 2 shows a visual representation of exemplary storage objects VExample, M0Example, and M1Example created for use in data processing system 10. Each of the storage objects VExample, M0Example, and M1Example in FIG. 2 consists of an array of nmax logical memory blocks that store or are configured to store data. While it is said that a logical memory block stores or is configured to store data, in reality the data is stored in one or more disk blocks of hard disks allocated directly or indirectly to the logical memory block.
Storage objects aggregated from hard disks can themselves be aggregated to form storage objects called logical data volumes. Logical data volumes are typically presented for direct or indirect use by an application such as application 22 executing on host 12. Thus, application 22 generates IO transactions to read data from or write data to one or more logical memory blocks of a data volume not knowing that the data volume is an aggregation of underlying storage objects, which in turn are aggregations of hard disks. Properties of storage objects depend on how the underlying storage objects or hard disks are aggregated. In other words, the method of aggregation determines the storage object type. In theory, there are a large number of possible methods of aggregation. The more common forms of aggregation include concatenated storage, striped storage, RAID storage, or mirrored storage. A more thorough discussion of how storage objects or hard disks can be aggregated can be found within Dilip M. Ranade [2002], “Shared Data Clusters” Wiley Publishing, Inc., which is incorporated herein by reference in its entirety.
Mirrored volumes provide highly reliable access to critical data. VExample of FIG. 2 is an exemplary two-way mirrored volume. VExample was created by aggregating underlying storage objects (hereinafter mirrors) M0Example and M1Example. Mirrors M0Example and M1Example were created by concatenating disk blocks from hard disks d0Example and d1Example (not shown) in disk arrays 14 and 16, respectively.
Storage managers can create storage object descriptions that describe the relationship between storage objects and their underlying storage objects or hard disks. These storage object descriptions typically include configuration maps. It is noted that storage object descriptions may include other information such as information indicating that a storage object is a snapshot copy of another storage object.
A configuration map maps a logical memory block of a corresponding storage object to one or more logical memory blocks of one or more underlying storage objects or to one or more disk blocks of one or more hard disks. To illustrate, configuration maps CMVExample, CMM0Example, and CMM1Example are created for mirrored volume VExample and underlying mirrors M0Example and M1Example, respectively. Configuration map CMVExample maps each logical memory block n of VExample to logical memory blocks n of mirrors M0Example and M1Example. Configuration map CMM0Example maps each logical memory block n of mirror M0Example to a disk block x in hard disk d0Example, while configuration map CMM1Example maps each logical memory block n of mirror M1Example to a disk block y in hard disk d1Example. Configuration map CMVExample can be provided for use by storage manager 24, while configuration maps CMM0Example and CMM1Example can be provided for use by storage managers 34 and 36 executing on one or more processors in disk arrays 14 and 16, respectively.
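For purposes of illustration only, the three configuration maps can be pictured as simple lookup tables. The Python sketch below is not the format used by any particular storage manager; the dictionary layout, the value of N_MAX, the disk block offsets, and the helper resolve() are all hypothetical choices made for this example.

```python
N_MAX = 8  # hypothetical number of logical memory blocks per storage object

# CMVExample: block n of VExample maps to block n of each mirror.
cm_v = {n: [("M0Example", n), ("M1Example", n)] for n in range(N_MAX)}

# CMM0Example and CMM1Example: block n of a mirror maps to one disk block.
# The offsets 197 and 297 are made up so that block 3 lands on 200 and 300.
cm_m0 = {n: ("d0Example", 197 + n) for n in range(N_MAX)}
cm_m1 = {n: ("d1Example", 297 + n) for n in range(N_MAX)}

mirror_maps = {"M0Example": cm_m0, "M1Example": cm_m1}


def resolve(volume_block):
    """Follow both map levels from a volume block down to disk blocks."""
    return [mirror_maps[mirror][block] for mirror, block in cm_v[volume_block]]


print(resolve(3))  # [('d0Example', 200), ('d1Example', 300)]
```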
Storage managers use configuration maps to translate IO transactions directed to one storage object into one or more IO transactions that access data of one or more underlying storage objects or hard disks. To illustrate, presume an IO transaction is generated by application 22 to write data D to logical memory block 3 of data volume VExample. This IO transaction is received directly or indirectly by storage manager 24. In turn, storage manager 24 accesses configuration map CMVExample and learns that logical memory block 3 is mapped to logical block 3 in both mirrors M0Example and M1Example. It is noted that storage manager 24 may not receive the exact IO transaction generated by application 22. However, the transaction that storage manager 24 receives will indicate that data D is to be written to block 3 of mirrored volume VExample.
Storage manager 24 then generates first and second IO transactions to write data D to logical blocks 3 of mirrors M0Example and M1Example, respectively. The IO transactions generated by storage manager 24 are transmitted to disk arrays 14 and 16, respectively, via SAN 20. Storage managers 34 and 36 of disk arrays 14 and 16, respectively, receive, directly or indirectly, the IO transactions sent by storage manager 24. It is noted that the IO transactions received by storage managers 34 and 36 may not be the exact IO transactions generated and sent by storage manager 24. Nonetheless, storage managers 34 and 36 will each receive an IO transaction to write data D to logical block 3 in mirrors M0Example and M1Example, respectively. Storage manager 34, in response to receiving the IO transaction, accesses configuration map CMM0Example to learn that block 3 of mirror M0Example is mapped to, for example, disk block 200 within hard disk d0. In response, a transaction is generated to write data D to disk block 200 within disk d0. Storage manager 36 accesses configuration map CMM1Example to learn that logical block 3 of mirror M1Example is mapped to, for example, disk block 300 within hard disk d1. In response, an IO transaction is generated for writing data D to disk block 300 within hard disk d1.
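The two-level translation just described can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual behavior of any storage manager product: the class names HostStorageManager and ArrayStorageManager are hypothetical, SAN 20 is reduced to a direct method call, and each hard disk is modeled as a dictionary keyed by disk block number.

```python
class ArrayStorageManager:
    """Stands in for storage manager 34 or 36 inside a disk array."""

    def __init__(self, mirror_map):
        self.mirror_map = mirror_map  # logical block of mirror -> disk block
        self.disk_blocks = {}         # disk block -> data (models the hard disk)

    def write(self, logical_block, data):
        disk_block = self.mirror_map[logical_block]
        self.disk_blocks[disk_block] = data
        return True  # acknowledgement that the write completed


class HostStorageManager:
    """Stands in for storage manager 24 on host 12."""

    def __init__(self, volume_map, array_managers):
        self.volume_map = volume_map          # volume block -> [(mirror, block)]
        self.array_managers = array_managers  # mirror name -> array manager

    def write(self, volume_block, data):
        # One IO transaction per mirror, as given by the volume's map.
        acks = [
            self.array_managers[mirror].write(block, data)
            for mirror, block in self.volume_map[volume_block]
        ]
        return all(acks)


sm34 = ArrayStorageManager({3: 200})  # block 3 of M0Example -> disk block 200
sm36 = ArrayStorageManager({3: 300})  # block 3 of M1Example -> disk block 300
sm24 = HostStorageManager(
    {3: [("M0Example", 3), ("M1Example", 3)]},
    {"M0Example": sm34, "M1Example": sm36},
)

ok = sm24.write(3, "D")
print(ok, sm34.disk_blocks, sm36.disk_blocks)
```

In this model, a single write to block 3 of the volume results in data D being stored in disk block 200 of one array and disk block 300 of the other.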
As noted above, while hard disks are reliable, hard disks are subject to failure. Data may be inaccessible within a failed hard disk. For this reason and others, administrators create mirrored data volumes. Unfortunately, data within mirrors of a mirrored volume may get into an inconsistent state if, for example, there is a server crash, storage power failure, or other problem which prevents data from being properly written to a hard disk. Consider the exemplary mirrored volume VExample. Presume storage manager 24 generates first and second IO transactions to write data D to logical blocks 3 in mirrors M0Example and M1Example in response to the IO transaction generated by application 22. Further, presume host 12 fails after the first IO transaction is transmitted to disk array 14, but before the second IO transaction is transmitted to disk array 16. As a result, data D is written to disk block 200 within disk d0 of disk array 14, but data D is not written to disk block 300 within hard disk d1. When host 12 is restarted and exemplary volume VExample is made available again to application 22, mirrors M0Example and M1Example are said to be out of sync. In other words, mirrors M0Example and M1Example are no longer identical since at least block 3 in each contains different data. An IO transaction to read from logical memory block 3 of mirrored volume VExample could return either old or new data depending on whether the data is read from disk block 200 of hard disk d0 or disk block 300 of hard disk d1. Mirrors M0Example and M1Example should be resynchronized before either is accessed again.
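The failure scenario above can be reproduced in a small simulation. The Python sketch below is a toy model under stated assumptions: the mirrors are plain dictionaries, the host failure is modeled by a hypothetical HostCrash exception raised between the two writes, and the value "old" stands for whatever block 3 held before the write.

```python
class HostCrash(Exception):
    """Models host 12 failing partway through a mirrored write."""


m0 = {3: "old"}  # block 3 of mirror M0Example
m1 = {3: "old"}  # block 3 of mirror M1Example


def mirrored_write(block, data, crash_after_first=False):
    m0[block] = data  # first IO transaction reaches disk array 14
    if crash_after_first:
        raise HostCrash("host failed before the second IO transaction")
    m1[block] = data  # second IO transaction reaches disk array 16


def read(block, mirror):
    """A read may be served from either mirror."""
    return mirror[block]


try:
    mirrored_write(3, "D", crash_after_first=True)
except HostCrash:
    pass  # host restarts; the mirrors are left as they were at the crash

print(read(3, m0), read(3, m1))  # D old
```

After the simulated restart, the result of a read of block 3 depends on which mirror services it, which is the out-of-sync condition described above.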
A brute force method to resynchronize mirrors M0Example and M1Example is simply to presume that one mirror (e.g., M0Example) contains correct data and copy the contents of the one mirror to the other (e.g., M1Example). It can take hours to resynchronize using this method. A smarter resynchronization technique is possible, but it requires some preparation. This alternate technique involves using what is called a dirty region map. FIG. 2 illustrates a visual representation of an exemplary dirty region map 40 consisting of nmax entries corresponding to the nmax logical memory blocks within mirrors M0Example and M1Example. Each entry of the dirty region map 40 indicates whether the corresponding blocks in mirrors are considered synchronized. For example, if entry n is set to logical 1, then blocks n in the mirrors are considered out of synchronization, and if entry n is set to logical 0, blocks n in the mirrors are considered in synchronization. Entry n in dirty region map 40 is set to logical 1 when the application 22 generates a transaction for writing data to logical block n of volume VExample. Entry n is maintained in the logical 1 state until acknowledgement is received from both disk arrays 14 and 16 that data D has been written successfully to the disk blocks allocated to logical memory block n. However, if, using the example above, the first IO transaction to write data D to block 3 of mirror M0Example succeeds while the second IO transaction to write to block 3 of M1Example fails, then entry 3 of dirty region map 40 will be maintained as a logical 1, indicating that logical blocks 3 in the mirrors are out of synchronization. Mirrors M0Example and M1Example can be resynchronized by copying data from M0Example to mirror M1Example, but for only logical memory blocks corresponding to dirty region map entries set to logical 1.
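A minimal Python sketch of the dirty region map technique follows. It is illustrative only and makes several assumptions not found in the description above: the mirrors and the map are in-memory lists, N_MAX is a hypothetical small value, a failed second write is modeled by a second_write_fails flag, and the map is assumed to survive the failure (in practice it would have to be kept in persistent storage).

```python
N_MAX = 8  # hypothetical number of logical memory blocks

m0 = ["old"] * N_MAX  # mirror M0Example
m1 = ["old"] * N_MAX  # mirror M1Example
dirty = [0] * N_MAX   # dirty region map 40, one entry per block


def mirrored_write(n, data, second_write_fails=False):
    dirty[n] = 1       # mark block n before either mirror is written
    m0[n] = data       # first IO transaction succeeds
    if second_write_fails:
        return False   # no second acknowledgement; entry n stays at 1
    m1[n] = data       # second IO transaction succeeds
    dirty[n] = 0       # both acknowledgements received
    return True


def resynchronize():
    """Copy M0Example to M1Example only for blocks whose entry is 1."""
    copied = []
    for n, flag in enumerate(dirty):
        if flag == 1:
            m1[n] = m0[n]
            dirty[n] = 0
            copied.append(n)
    return copied


mirrored_write(1, "A")                           # completes on both mirrors
mirrored_write(3, "D", second_write_fails=True)  # leaves block 3 dirty
copied = resynchronize()
print(copied)  # [3]
```

Only block 3 is copied during resynchronization, in contrast to the brute force method, which would copy all N_MAX blocks regardless of their state.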