The disclosed embodiments of the present invention relate to unified data sets stored on a first number of disk array devices that are mirrored to a second number of disk array devices in order to allow immediate failover in the event of failure of one or more disk array devices. In the described embodiment, the disk devices that store the unified data set together compose a unified data set device group (“UDSDG”), the individual disk drives of which may reside in one or more disk arrays. Therefore, background information about disk and disk-array technologies is provided below.
FIG. 1 is a block diagram of a standard disk drive. The disk drive 101 receives I/O requests from remote computers via a communications medium 102 such as a computer bus, fibre channel, or other such electronic communications medium. For many types of storage devices, including the disk drive 101 illustrated in FIG. 1, the vast majority of I/O requests are either READ or WRITE requests. A READ request requests that the storage device return to the requesting remote computer some requested amount of electronic data stored within the storage device. A WRITE request requests that the storage device store electronic data furnished by the remote computer within the storage device. Thus, as a result of a READ operation carried out by the storage device, data is returned via communications medium 102 to a remote computer, and as a result of a WRITE operation, data is received from a remote computer by the storage device via communications medium 102 and stored within the storage device.
The disk drive storage device illustrated in FIG. 1 includes controller hardware and logic 103 including electronic memory, one or more processors or processing circuits, and controller firmware, and also includes a number of disk platters 104 coated with a magnetic medium for storing electronic data. The disk drive contains many other components not shown in FIG. 1, including read/write heads, a high-speed electronic motor, a drive shaft, and other electronic, mechanical, and electromechanical components. The memory within the disk drive includes a request/reply buffer 105, which stores I/O requests received from remote computers, and an I/O queue 106 that stores internal I/O commands corresponding to the I/O requests stored within the request/reply buffer 105. Communication between remote computers and the disk drive, translation of I/O requests into internal I/O commands, and management of the I/O queue, among other things, are carried out by the disk drive I/O controller as specified by disk drive I/O controller firmware 107. Translation of internal I/O commands into electromechanical disk operations, in which data is stored onto, or retrieved from, the disk platters 104, is carried out by the disk drive I/O controller as specified by disk media read/write management firmware 108. Thus, the disk drive I/O control firmware 107 and the disk media read/write management firmware 108, along with the processors and memory that enable execution of the firmware, compose the disk drive controller.
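The flow described above — buffering an incoming I/O request, translating it into an internal I/O command, and servicing that command against the platters — can be sketched roughly as follows. All class, method, and attribute names here are illustrative inventions, not part of any actual disk-drive firmware:

```python
from collections import deque


class DiskDriveController:
    """Illustrative model of the request/reply buffer, I/O queue,
    and read/write management described for FIG. 1."""

    def __init__(self):
        self.request_reply_buffer = deque()  # I/O requests from remote computers
        self.io_queue = deque()              # internal I/O commands
        self.platters = {}                   # logical block address -> stored data

    def receive(self, request):
        # An I/O request arrives over the communications medium and is buffered;
        # controller firmware translates it into an internal I/O command.
        self.request_reply_buffer.append(request)
        op, lba, data = request
        self.io_queue.append((op, lba, data))

    def service_one(self):
        # Read/write management firmware turns the next command into a
        # (simulated) electromechanical disk operation.
        op, lba, data = self.io_queue.popleft()
        if op == "WRITE":
            self.platters[lba] = data
            return None
        return self.platters.get(lba)  # READ: return the stored data
```

In this toy model, a WRITE followed by a READ of the same logical block returns the written data, mirroring the request/response behavior described above.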
Individual disk drives, such as the disk drive illustrated in FIG. 1, are normally connected to, and used by, a single remote computer, although it has been common to provide dual-ported disk drives for use by two remote computers and multi-port disk drives that can be accessed by numerous remote computers via a communications medium such as a fibre channel. However, the amount of electronic data that can be stored in a single disk drive is limited. In order to provide much larger-capacity electronic data-storage devices that can be efficiently accessed by numerous remote computers, disk manufacturers commonly combine many different individual disk drives, such as the disk drive illustrated in FIG. 1, into a disk array device, increasing both the storage capacity as well as increasing the capacity for parallel I/O request servicing by concurrent operation of the multiple disk drives contained within the disk array.
FIG. 2 is a simple block diagram of a disk array. The disk array 202 includes a number of disk drive devices 203, 204, and 205. In FIG. 2, for simplicity of illustration, only three individual disk drives are shown within the disk array, but disk arrays may contain many tens or hundreds of individual disk drives. A disk array contains a disk array controller 206 and cache memory 207. Generally, data retrieved from disk drives in response to READ requests may be stored within the cache memory 207 so that subsequent requests for the same data can be more quickly satisfied by reading the data from the quickly accessible cache memory rather than from the much slower electromechanical disk drives. Various elaborate mechanisms are employed to maintain, within the cache memory 207, data that has the greatest chance of being subsequently re-requested within a reasonable amount of time. The disk array controller may also store data contained within WRITE requests in cache memory 207, in the event that the data may be subsequently requested via READ requests, or in order to defer slower writing of the data to the physical storage medium.
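One simple instance of such a cache-retention mechanism is least-recently-used (LRU) eviction, which keeps recently re-requested data and discards data that has gone longest without access. The following sketch is purely illustrative; the class name and capacity are assumptions, not details of the disk array described above:

```python
from collections import OrderedDict


class ReadCache:
    """Hypothetical LRU read cache, one simple example of the
    cache-retention mechanisms alluded to in the description of FIG. 2."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = OrderedDict()  # address -> data, least recently used first

    def lookup(self, address):
        if address in self.entries:
            self.entries.move_to_end(address)  # recently used: retain longer
            return self.entries[address]
        return None  # cache miss: caller must read the slower disk drive

    def insert(self, address, data):
        self.entries[address] = data
        self.entries.move_to_end(address)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used entry
```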
Electronic data is stored within a disk array at specific addressable locations. Because a disk array may contain many different individual disk drives, the address space represented by a disk array is immense, generally many thousands of gigabytes. The overall address space is normally partitioned among a number of abstract data storage resources called logical units (“LUNs”). A LUN includes a defined amount of electronic data storage space, mapped to the data storage space of one or more disk drives within the disk array, and may be associated with various logical parameters including access privileges, backup frequencies, and mirror coordination with one or more LUNs. LUNs may also be based on random access memory (“RAM”), mass storage devices other than hard disks, or combinations of memory, hard disks, and/or other types of mass storage devices. Remote computers generally access data within a disk array through one of the many abstract LUNs 208–215 provided by the disk array via internal disk drives 203–205 and the disk array controller 206. Thus, a remote computer may specify a particular unit quantity of data, such as a byte, word, or block, using a bus communications medium address corresponding to the disk array, a LUN specifier, normally a 64-bit integer, and a 32-bit, 64-bit, or 128-bit data address that specifies the location of the data within the logical data-address partition allocated to the LUN. The disk array controller translates such a data specification into an indication of a particular disk drive within the disk array and a logical data address within the disk drive. A disk drive controller within the disk drive finally translates the logical address to a physical medium address.
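The two-stage translation described above — array controller mapping a (LUN, data address) pair to a drive-local logical address, then the drive controller mapping that to a position on the physical medium — can be sketched as follows. The mapping scheme, block counts, and drive geometry here are invented solely for illustration:

```python
# Hypothetical fixed layout: each drive holds BLOCKS_PER_DRIVE addressable blocks.
BLOCKS_PER_DRIVE = 1000


def array_translate(lun_map, lun, data_address):
    """Disk array controller stage: (LUN, data address) ->
    (drive number, drive-local logical address)."""
    base_drive, base_block = lun_map[lun]  # start of the LUN's allocation
    absolute = base_block + data_address
    return (base_drive + absolute // BLOCKS_PER_DRIVE,
            absolute % BLOCKS_PER_DRIVE)


def drive_translate(drive_address, sectors_per_track=63, heads=16):
    """Disk drive controller stage: drive-local logical address ->
    (cylinder, head, sector) physical medium address."""
    per_cylinder = sectors_per_track * heads
    cylinder = drive_address // per_cylinder
    remainder = drive_address % per_cylinder
    return cylinder, remainder // sectors_per_track, remainder % sectors_per_track
```

For example, with a LUN allocated starting at drive 0, block 0, data address 1500 falls 500 blocks into the second drive, and the drive controller then resolves that logical block to a cylinder/head/sector triple.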
Normally, electronic data is read and written as one or more blocks of contiguous 32-bit or 64-bit computer words, the exact details of the granularity of access depending on the hardware and firmware capabilities within the disk array and individual disk drives as well as the operating system of the remote computers generating I/O requests and characteristics of the communication medium interconnecting the disk array with the remote computers.
In many computer applications and systems that need to reliably store and retrieve data from a mass storage device, such as a disk array, a primary data object, such as a file or database, is normally backed up to backup copies of the primary data object on physically discrete mass storage devices or media so that if, during operation of the application or system, the primary data object becomes corrupted, inaccessible, or is overwritten or deleted, the primary data object can be restored by copying a backup copy of the primary data object from the mass storage device. Many different techniques and methodologies for maintaining backup copies have been developed. In one well-known technique, a primary data object is mirrored. FIG. 3 illustrates object-level mirroring. In FIG. 3, a primary data object “O3” 301 is stored on LUN A 302. The mirror object, or backup copy, “O3” 303 is stored on LUN B 304. The arrows in FIG. 3, such as arrow 305, indicate I/O write operations directed to various objects stored on a LUN. I/O write operations directed to object “O3” are represented by arrow 306. When object-level mirroring is enabled, the disk array controller providing LUNs A and B automatically generates a second I/O write operation from each I/O write operation 306 directed to LUN A, and directs the second generated I/O write operation via path 307, switch “S1” 308, and path 309 to the mirror object “O3” 303 stored on LUN B 304. In FIG. 3, enablement of mirroring is logically represented by switch “S1” 308 being on. Thus, when object-level mirroring is enabled, any I/O write operation, or any other type of I/O operation that changes the representation of object “O3” 301 on LUN A, is automatically mirrored by the disk array controller to identically change the mirror object “O3” 303. Mirroring can be disabled, represented in FIG. 3 by switch “S1” 308 being in an off position. 
In that case, changes to the primary data object “O3” 301 are no longer automatically reflected in the mirror object “O3” 303. Thus, at the point that mirroring is disabled, the stored representation, or state, of the primary data object “O3” 301 may diverge from the stored representation, or state, of the mirror object “O3” 303. Once the primary and mirror copies of an object have diverged, the two copies can be brought back to identical representations, or states, by a resync operation represented in FIG. 3 by switch “S2” 310 being in an on position. In the normal mirroring operation, switch “S2” 310 is in the off position. During the resync operation, any I/O operations that occurred after mirroring was disabled are logically issued by the disk array controller to the mirror copy of the object via path 311, switch “S2,” and path 309. During resync, switch “S1” is in the off position. Once the resync operation is complete, logical switch “S2” is disabled and logical switch “S1” 308 can be turned on in order to reenable mirroring so that subsequent I/O write operations or other I/O operations that change the storage state of primary data object “O3,” are automatically reflected to the mirror object “O3” 303.
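The switch behavior of FIG. 3 — mirroring while “S1” is on, divergence while it is off, and replay of deferred operations during resync — might be modeled as in the following sketch. The class and attribute names are hypothetical illustrations, not part of any disclosed controller implementation:

```python
class MirroredLUN:
    """Toy model of the S1/S2 switch logic of FIG. 3."""

    def __init__(self):
        self.primary = {}   # LUN A: object name -> stored state
        self.mirror = {}    # LUN B: mirror copies
        self.s1 = True      # switch S1: mirroring enabled
        self.pending = []   # operations deferred while mirroring was disabled

    def write(self, obj, state):
        self.primary[obj] = state
        if self.s1:
            # S1 on: the controller generates a second WRITE to the mirror.
            self.mirror[obj] = state
        else:
            # S1 off: record the operation so a later resync can replay it.
            self.pending.append((obj, state))

    def resync(self):
        # S2 on, S1 off: replay deferred operations to the mirror copy,
        # then re-enable S1 so subsequent writes are mirrored again.
        for obj, state in self.pending:
            self.mirror[obj] = state
        self.pending.clear()
        self.s1 = True
```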
A unified data set (“UDS”) is a set of data stored within a group of I/O devices, such as disk drives, or a group of portions of different disk drives, that is accessed by host computers as a seamless, internally consistent data set with regard to I/O requests. In other words, the host computer can execute I/O requests against the unified data set as if the unified data set were stored on a single I/O device, without concern for managing I/O request sequencing with respect to multiple I/O devices and corresponding mirror devices. A unified data set device group (“UDSDG”) is a group of I/O devices, such as disks, that are identified by a host computer to the controller of an array of devices, such as a disk array, as a group of devices which the host intends to treat as a UDSDG. The controller of the disk array then manages and maintains data consistency of the UDSDG at an I/O request level, and, along with a second disk array, maintains data consistency among devices to which the UDSDG is mirrored.
FIG. 4 shows a block diagram of an example hardware platform that supports a unified data set. In FIG. 4, a host computer 402 exchanges data via a communications medium 404 with a local disk array 406. Four disk drives 408–411 of the local disk array 406 are shown in FIG. 4 within a box 412 with boundaries denoted by dashed lines. This box 412 represents a UDSDG comprising the four disk drives 408–411. Incoming I/O requests to the disk array 406 are received by the disk array controller 414 and queued to an input queue 416. The disk array controller 414 removes I/O requests queued to the I/O request queue 416 and passes each dequeued I/O request to an appropriate disk drive from among disk drives 408–411. The local disk array controller 414 presents a LUN-based interface to the host computer 402, but additionally manages I/O-request execution with respect to the disk drives 408–411 so that the UDS distributed among them is consistent with respect to the order of I/O requests directed to the UDS by the host computer. The local disk array controller 414 also mirrors WRITE requests directed to the UDS, and other I/O requests and commands that may result in updating or changing the UDS, to a second, remote disk array 418 that includes four disk devices 420–423 that together compose a mirror UDSDG 426 within the second disk array 418. Should one or more disk drives 408–411 of the UDSDG 412 fail, equivalent data can be obtained by the local disk array controller 414 from the mirror UDSDG 426 on the remote disk array 418. Should the entire local disk array 406 fail, the host computer 402 may, in some cases, establish a communications path directly to the remote disk array 418 that contains the mirror UDSDG 426, and proceed by using the mirror UDSDG 426 and remote disk array 418 as the UDSDG and local disk array, respectively.
For many critical applications, such as database management systems (“DBMSs”), the consistency of data stored within an inter-array unified data set (“IUDS”) distributed across two or more local disk arrays, and the consistency of the data of the IUDS mirrored to one or more remote disk arrays, is extremely important. In general, it is acceptable for updates to the mirrored data set stored on the one or more remote disk arrays to lag behind updates to the IUDS stored within two or more local disk arrays. Should a catastrophic failure in communication systems or disk-array hardware occur, a host computer may fail-over to the mirrored data stored on the one or more remote disk arrays and resume operation using a somewhat stale, or, in other words, not completely updated data set. However, in such cases, it is vital that the mirrored data set, although somewhat stale, be consistent, or, in other words, be equivalent to a previous or current data state of the IUDS.
Data inconsistencies may arise from processing I/O requests in an order different from the order in which they are issued by a host computer or from failure to execute one or more I/O requests within a series of I/O requests issued by a host computer. FIGS. 5A–I abstractly illustrate data inconsistency problems arising from out-of-order execution of a sequence of operations and from omission of operations within a sequence of operations. FIGS. 5A–I employ representational conventions illustrated in FIG. 5A. The contents of a portion of a disk drive are represented by a large square 502 that includes smaller, internal subsquares 503–506. A command to alter the contents of the portion of the disk represented by large square 502 is represented by a smaller square 508. Each subsquare 503–506 of the portion of the disk represented by square 502 can have one of two states, a first state represented in FIGS. 5A–I by a darkened, or filled, subsquare and a second state represented by an empty subsquare. Initially, the portion of the disk has the state shown by large square 502 in FIG. 5A, namely all small subsquares have the state “empty.” A command, such as command 508, has small internal subsquares 509–512 that represent subsquares 503–506 of the portion of the disk represented by large square 502. When a subsquare within a command is darkened, or filled, the command directs the corresponding large subsquare within the portion of the disk 502 to be placed in the darkened, or filled, state. Thus, command 508 directs filling of subsquare 504. When a small subsquare within a command is cross-hatched, such as small internal subsquares 509 and 511, the state of the corresponding large subsquares is unchanged by the command. Finally, when a small internal subsquare of the command is empty, or unfilled, such as subsquare 512, the command directs the corresponding large internal subsquare of the portion of the disk represented by large square 502 to be empty.
FIGS. 5B–D show application of three successive commands 514–516 to an initially empty portion of a disk as represented by large square 502 in FIG. 5A. Application of command 514 to large square 502 of FIG. 5A produces the modified large square 518 in FIG. 5B. Application of command 515 to modify large square 518 produces modified large square 520 in FIG. 5C, and, similarly, application of command 516 to modified large square 520 produces the final state represented by large square 522 in FIG. 5D. For the series of commands 514–516, modified large squares 518, 520, and 522 represent successive consistent states.
FIG. 5E represents the application of commands 514–516 to an initially empty data state, with the final data state 522 shown at the right of FIG. 5E. Thus, FIG. 5E encapsulates the separate operations illustrated in FIGS. 5B–D.
FIG. 5F illustrates application of commands 514–516 in an order different from the order of application shown in FIG. 5E. In FIG. 5F, command 515 is first applied, followed by 514, and finally by command 516. Note that the final data state 524 is different from the final data state 522 produced by invocation of the commands in-order, as illustrated in FIG. 5E. Thus, by changing the order in which commands 514–516 are applied, a different final data state is obtained. Similarly, FIG. 5G illustrates omission of command 515 from the sequence of commands. In FIG. 5G, command 514 is first applied, followed by command 516. Note that the final data state 526 is different from the final data state 522 produced by the in-order application of the commands illustrated in FIG. 5E. Two final examples, shown in FIGS. 5H and 5I, respectively, illustrate that commands may be applied out of order, or commands may be omitted, without changing the final data state. In FIG. 5H command 514 is first applied, followed by command 516 and finally by command 515, and yet the final data state 528 is identical to the final data state 522 produced by in-order application of the commands, as illustrated in FIG. 5E. FIG. 5I shows that omission of command 514 does not alter the final data state 530.
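The order dependence illustrated in FIGS. 5E–G can be demonstrated concretely. In the sketch below, a data state is a tuple of four subsquare flags, and a command marks each subsquare as filled, emptied, or unchanged; the three example commands are hypothetical stand-ins, not the exact contents of commands 514–516:

```python
def apply(state, command):
    """Apply one FIG. 5-style command: per subsquare, 'fill' darkens it,
    'empty' clears it, and 'keep' (cross-hatched) leaves it unchanged."""
    return tuple(
        True if c == "fill" else False if c == "empty" else s
        for s, c in zip(state, command)
    )


def run(commands, state=(False,) * 4):
    """Apply a sequence of commands to an initially empty data state."""
    for command in commands:
        state = apply(state, command)
    return state


# Hypothetical commands chosen to exhibit order dependence.
c1 = ("fill", "keep", "keep", "keep")
c2 = ("keep", "fill", "keep", "empty")
c3 = ("empty", "keep", "fill", "keep")
```

Running `run([c1, c2, c3])`, `run([c2, c3, c1])`, and `run([c1, c3])` yields three different final states, concretely reproducing the reordering and omission inconsistencies of FIGS. 5F and 5G.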
In order to achieve consistent and meaningful data states, DBMSs normally enforce serialization of primitive database operations, resulting in a corresponding serialization of corresponding WRITE requests generated by the DBMS to update the state of the database stored on one or more disk drives. The state of the database, as stored on the one or more disk drives, normally changes from one fully consistent data state to another fully consistent data state. Mirrored disks can be used to ensure that a redundant fully consistent data state is contained within the mirror disks, generally lagging, to some small degree, the most current state of the database stored on local disks. In simplified terms, referring to FIGS. 5B–D, the mirror disk may lag the local disks by some number of operations, so that, for example, the mirror disk, at a particular instant in time, may have data state 518 while the local disks may have, at the same instant in time, data state 522. Thus, in this example, commands 515 and 516 have been applied to the local disk before they have been applied to the mirror disk. However, both the mirror disk and the local disk, at any given instant in time, have consistent data states. In other words, the instantaneous state of the data is a data state that would be expected to occur by in-order processing of a series of commands. If the local disk fails, the mirror disk is consistent, although some number of commands may need to be re-issued in order to bring the mirror disk up to the state of the local disk at the time of failure. However, if commands are issued in different orders to the local disk and the mirror disk, then either the local disk or the mirror disk may end up in an inconsistent state. For example, if commands 514–516 are applied to the local disk in order, as illustrated in FIG. 5E, and are applied to the mirror disk out-of-order, as illustrated in FIG. 5F, then the mirror disk will have a state 524 inconsistent with the data state 522 resulting from application of the commands to the local disk.
A DBMS may produce extremely complex database states, with millions or billions of data entities organized in extremely complex interrelationships. The DBMS manages these data states so that all the data interdependencies are consistent and reflect the result of actual database operations carried out over time in a real-world system. The consistency of a database depends on WRITE operations issued by the DBMS to the disk drives containing the database being carried out in a specified order, without omissions and without mistakes. If the WRITE operations are carried out in an order different from that specified by the DBMS, or WRITE operations are omitted or fail and the omissions and failures go undetected by the DBMS, the database state may quickly become inconsistent, and the intricate interdependencies of the data contained within the database may quickly become hopelessly and unrecoverably confused.
Data consistency problems may arise within a UDSDG group that are analogous to the above-described data consistency problems. The local disk array controller 414 can overcome such data consistency problems by ensuring that WRITE requests, and other I/O requests and commands that can update or change data within the UDS, are executed by the local disk array in precisely the order in which they are issued by the host computer, and that WRITE requests, and other I/O requests and commands that can update or change data of the UDS, are executed by the remote disk array in precisely the order that they are issued by the disk array controller of the local disk array. In the following discussion, the term “WRITE request” indicates either a WRITE request or any other type of I/O request or command that can update or change data within a UDS.
Guaranteeing execution of WRITE requests by the local disk array 406 in FIG. 4 in the same order as the WRITE requests are issued by the host computer 402 in FIG. 4 is generally accomplished within the communications protocol through which data is exchanged via the communications medium 404 in FIG. 4, or by the host computer waiting for acknowledgement for each WRITE request before issuing a subsequent WRITE request.
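The second technique — the host waiting for an acknowledgement of each WRITE request before issuing the next — can be sketched as follows. The toy classes and names are assumptions introduced purely for illustration:

```python
import queue
import threading


class DiskArraySim:
    """Toy disk array: applies WRITE requests in arrival order and
    acknowledges each one back to the host."""

    def __init__(self):
        self.log = []                 # order in which WRITEs were applied
        self.inbox = queue.Queue()
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            write, ack = self.inbox.get()
            self.log.append(write)    # apply the WRITE request
            ack.set()                 # acknowledge completion to the host

    def submit(self, write):
        ack = threading.Event()
        self.inbox.put((write, ack))
        return ack


def host_issue(array, writes):
    """Host side: block on each acknowledgement before issuing the next
    WRITE, so the array necessarily executes them in issue order."""
    for w in writes:
        array.submit(w).wait()
```

Because the host never has more than one WRITE outstanding, the array's execution order is forced to match the issue order, at the cost of serializing the requests.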
These I/O-request-sequencing techniques are sufficient to maintain data consistency within and between the UDSDG 412 and the mirror UDSDG 426. However, it may be desirable to implement UDSDGs having alternative configurations. FIG. 6 is a block diagram of a distributed, inter-array unified data set device group (“IUDSDG”). FIG. 6 shows an IUDSDG 602a and 602b distributed between two local disk arrays 604a and 604b. The IUDSDG 602a–b is mirrored to a mirror UDSDG 606 on a remote disk array 608. As will be discussed in detail, below, the sequencing techniques described above for the UDSDG configuration illustrated in FIG. 4 are inadequate to guarantee data consistency within and between the IUDSDG 602a–b and mirror UDSDG 606. Various disk array manufacturers and distributed data storage providers have attempted to provide data consistency solutions that would allow for an IUDSDG such as that shown in FIG. 6. Many of these techniques have attempted to rely on time-stamping of WRITE requests issued by local disk arrays 604a–b to provide sequencing of WRITE requests received by remote disk array 608. Unfortunately, problems associated with maintaining time stamp consistency between two different disk arrays have proved to be extremely difficult to solve, and the prior-art techniques have not provided reliable data consistency in IUDSDGs such as the IUDSDG shown in FIG. 6. For this reason, manufacturers of disk arrays and distributed data storage solution providers have recognized a need for a method and hardware platform to guarantee data consistency within IUDSDGs.