The present invention is related to copying logical units within mass-storage devices that provide multiple logical units, such as disk arrays. An embodiment of the present invention, discussed below, involves disk-array mass-storage devices. To facilitate that discussion, a general description of disk drives and disk arrays is first provided.
The most commonly used non-volatile mass-storage device in the computer industry is the magnetic disk drive. In the magnetic disk drive, data is stored in tiny magnetized regions within an iron-oxide coating on the surface of the disk platter. A modern disk drive comprises a number of platters horizontally stacked within an enclosure. The data within a disk drive is hierarchically organized within various logical units of data. The surface of a disk platter is logically divided into tiny, annular tracks nested one within another. FIG. 1A illustrates tracks on the surface of a disk platter. Note that, although only a few tracks are shown in FIG. 1A, such as track 101, an actual disk platter may contain many thousands of tracks. Each track is divided into radial sectors. FIG. 1B illustrates sectors within a single track on the surface of the disk platter. Again, a given disk track on an actual magnetic disk platter may contain many tens or hundreds of sectors. Each sector generally contains a fixed number of bytes. The number of bytes within a sector is generally operating-system dependent, and normally ranges from 512 bytes per sector to 4096 bytes per sector. Data is normally retrieved from, and stored to, a hard disk drive in units of sectors.
The modern disk drive generally contains a number of magnetic disk platters aligned in parallel along a spindle passed through the center of each platter. FIG. 2 illustrates a number of stacked disk platters aligned within a modern magnetic disk drive. In general, both surfaces of each platter are employed for data storage. The magnetic disk drive generally contains a comb-like array with mechanical READ/WRITE heads 201 that can be moved along a radial line from the outer edge of the disk platters toward the spindle of the disk platters. Each discrete position along the radial line defines a set of tracks on both surfaces of each disk platter. The set of tracks within which ganged READ/WRITE heads are positioned at some point along the radial line is referred to as a cylinder. In FIG. 2, the tracks 202-210 beneath the READ/WRITE heads together comprise a cylinder, which is graphically represented in FIG. 2 by the dashed lines of a cylinder 212.
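By way of illustration only, the hierarchy of cylinders, surfaces, and sectors described above can be sketched as a simple geometry calculation. The class name `DiskGeometry`, its fields, and the linear numbering scheme are hypothetical and are not taken from any particular disk drive; the sketch also assumes, for simplicity, that every track holds the same number of sectors, which real drives typically do not.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiskGeometry:
    """Hypothetical, simplified geometry: same sector count on every track."""
    cylinders: int               # discrete head positions along the radial line
    surfaces: int                # recording surfaces (two per platter), one head each
    sectors_per_track: int
    bytes_per_sector: int = 512  # the text cites 512 to 4096 bytes per sector

    def capacity_bytes(self) -> int:
        # Total capacity is the product of all four dimensions.
        return (self.cylinders * self.surfaces
                * self.sectors_per_track * self.bytes_per_sector)

    def linear_sector(self, cylinder: int, surface: int, sector: int) -> int:
        """Map zero-based (cylinder, surface, sector) to a zero-based sector number."""
        if not (0 <= cylinder < self.cylinders
                and 0 <= surface < self.surfaces
                and 0 <= sector < self.sectors_per_track):
            raise ValueError("address outside disk geometry")
        # All tracks of one cylinder are numbered before moving to the next,
        # because they can be reached without moving the ganged heads.
        return (cylinder * self.surfaces + surface) * self.sectors_per_track + sector
```

Numbering all tracks of a cylinder consecutively reflects the fact that the ganged READ/WRITE heads reach every track of a cylinder from a single radial position.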
FIG. 3 is a block diagram of a standard disk drive. The disk drive 301 receives input/output (“I/O”) requests from remote computers via a communications medium 302 such as a computer bus, fibre channel, or other such electronic communications medium. For many types of storage devices, including the disk drive 301 illustrated in FIG. 3, the vast majority of I/O requests are either READ or WRITE requests. A READ request requests that the storage device return to the requesting remote computer some requested amount of electronic data stored within the storage device. A WRITE request requests that the storage device store electronic data furnished by the remote computer within the storage device. Thus, as a result of a READ operation carried out by the storage device, data is returned via communications medium 302 to a remote computer, and as a result of a WRITE operation, data is received from a remote computer by the storage device via communications medium 302 and stored within the storage device.
The disk drive storage device illustrated in FIG. 3 includes controller hardware and logic 303 including electronic memory, one or more processors or processing circuits, and controller firmware, and also includes a number of disk platters 304 coated with a magnetic medium for storing electronic data. The disk drive contains many other components not shown in FIG. 3, including READ/WRITE heads, a high-speed electronic motor, a drive shaft, and other electronic, mechanical, and electromechanical components. The memory within the disk drive includes a request/reply buffer 305, which stores I/O requests received from remote computers, and an I/O queue 306 that stores internal I/O commands corresponding to the I/O requests stored within the request/reply buffer 305. Communication between remote computers and the disk drive, translation of I/O requests into internal I/O commands, and management of the I/O queue, among other things, are carried out by the disk drive I/O controller as specified by disk drive I/O controller firmware 307. Translation of internal I/O commands into electromechanical disk operations in which data is stored onto, or retrieved from, the disk platters 304 is carried out by the disk drive I/O controller as specified by disk media read/write management firmware 308. Thus, the disk drive I/O control firmware 307 and the disk media read/write management firmware 308, along with the processors and memory that enable execution of the firmware, compose the disk drive controller.
Individual disk drives, such as the disk drive illustrated in FIG. 3, are normally connected to, and used by, a single remote computer, although it has been common to provide dual-ported disk drives for concurrent use by two computers and multi-host-accessible disk drives that can be accessed by numerous remote computers via a communications medium such as a fibre channel. However, the amount of electronic data that can be stored in a single disk drive is limited. In order to provide much larger-capacity electronic data-storage devices that can be efficiently accessed by numerous remote computers, disk manufacturers commonly combine many different individual disk drives, such as the disk drive illustrated in FIG. 3, into a disk array device, increasing both the storage capacity and the capacity for parallel I/O request servicing by concurrent operation of the multiple disk drives contained within the disk array.
FIG. 4 is a simple block diagram of a disk array. The disk array 402 includes a number of disk drive devices 403, 404, and 405. In FIG. 4, for simplicity of illustration, only three individual disk drives are shown within the disk array, but disk arrays may contain many tens or hundreds of individual disk drives. A disk array contains a disk array controller 406 and cache memory 407. Generally, data retrieved from disk drives in response to READ requests may be stored within the cache memory 407 so that subsequent requests for the same data can be more quickly satisfied by reading the data from the quickly accessible cache memory rather than from the much slower electromechanical disk drives. Various elaborate mechanisms are employed to maintain, within the cache memory 407, data that has the greatest chance of being subsequently re-requested within a reasonable amount of time. The disk array may also save the data of recent WRITE requests in cache memory 407, either in the event that the data is subsequently requested via READ requests or in order to defer the slower writing of the data to the physical storage medium.
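As a minimal sketch of the caching behavior just described, the following example keeps recently read blocks in memory and defers the writing of recently written blocks. It uses a plain least-recently-used policy as a stand-in for the more elaborate mechanisms mentioned above; the class name `ArrayCache`, its methods, and the use of a dictionary to represent the disk drives are all assumptions made for illustration.

```python
from collections import OrderedDict


class ArrayCache:
    """Hypothetical LRU cache in front of slow storage (a dict stands in for disks)."""

    def __init__(self, backing: dict, capacity: int):
        self.backing = backing          # block number -> data on "disk"
        self.capacity = capacity        # maximum number of cached blocks
        self.entries = OrderedDict()    # block number -> (data, dirty flag)
        self.disk_reads = 0             # counts READs that had to go to disk

    def read(self, block):
        if block in self.entries:
            self.entries.move_to_end(block)      # mark as most recently used
            return self.entries[block][0]
        self.disk_reads += 1                     # cache miss: fetch from disk
        data = self.backing[block]
        self._install(block, data, dirty=False)
        return data

    def write(self, block, data):
        # Deferred write: the data reaches disk only on eviction or flush.
        self._install(block, data, dirty=True)

    def _install(self, block, data, dirty):
        self.entries[block] = (data, dirty)
        self.entries.move_to_end(block)
        while len(self.entries) > self.capacity:
            old_block, (old_data, old_dirty) = self.entries.popitem(last=False)
            if old_dirty:
                self.backing[old_block] = old_data   # write back before discarding

    def flush(self):
        # Write every deferred block to disk and mark it clean.
        for block, (data, dirty) in list(self.entries.items()):
            if dirty:
                self.backing[block] = data
                self.entries[block] = (data, False)
```

A repeated READ of the same block is served from memory, and a WRITE becomes visible to later READs immediately even though the backing store is updated later.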
Electronic data is stored within a disk array at specific addressable locations. Because a disk array may contain many different individual disk drives, the address space represented by a disk array is immense, generally many thousands of gigabytes. The overall address space is normally partitioned among a number of abstract data storage resources called logical units (“LUNs”). A LUN includes a defined amount of electronic data storage space, mapped to the data storage space of one or more disk drives within the disk array, and may be associated with various logical parameters including access privileges, backup frequencies, and mirror coordination with one or more LUNs. LUNs may also be based on random access memory (“RAM”), mass-storage devices other than hard disks, or combinations of memory, hard disks, and/or other types of mass-storage devices. Remote computers generally access data within a disk array through one of the many abstract LUNs 408-415 provided by the disk array via internal disk drives 403-405 and the disk array controller 406. Thus, a remote computer may specify a particular unit quantity of data, such as a byte, word, or block, using a bus communications media address corresponding to a disk array, a LUN specifier, normally a 64-bit integer, that specifies a LUN, and a 32-bit, 64-bit, or 128-bit data address that specifies a location within the logical data address partition allocated to the LUN. The disk array controller translates such a data specification into an indication of a particular disk drive within the disk array and a logical data address within the disk drive. A disk drive controller within the disk drive finally translates the logical address to a physical medium address. 
Normally, electronic data is read and written as one or more blocks of contiguous 32-bit or 64-bit computer words, the exact details of the granularity of access depending on the hardware and firmware capabilities within the disk array and individual disk drives as well as the operating system of the remote computers generating I/O requests and characteristics of the communication medium interconnecting the disk array with the remote computers.
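The translation carried out by the disk array controller can be sketched, under simplifying assumptions, as a lookup in a table of extents. The names `Extent` and `LunMap`, the block-granular addressing, and the layout of each LUN as a simple concatenation of extents are hypothetical; an actual disk array may stripe, mirror, or otherwise distribute a LUN across its disk drives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    """Hypothetical contiguous run of blocks on one disk drive."""
    drive: int    # index of a disk drive within the array
    start: int    # first logical block of the extent on that drive
    length: int   # number of blocks in the extent


class LunMap:
    """Maps each LUN specifier to an ordered list of extents (an assumed layout)."""

    def __init__(self, luns: dict):
        self.luns = luns   # LUN specifier -> list of Extent

    def translate(self, lun: int, address: int):
        """Return (drive, logical block on that drive) for a LUN-relative address."""
        if address < 0:
            raise ValueError("negative address")
        remaining = address
        for extent in self.luns[lun]:
            if remaining < extent.length:
                return extent.drive, extent.start + remaining
            remaining -= extent.length   # skip past this extent
        raise ValueError("address outside the partition allocated to the LUN")
```

For example, a LUN built from ten blocks on drive 0 followed by five blocks on drive 2 would resolve LUN-relative address 10 to the first block of the extent on drive 2. The final translation from the logical address to a physical medium address is left to the disk drive controller and is not modeled here.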
In many computer applications and systems that need to reliably store and retrieve data from a mass-storage device, such as a disk array, a primary data object, such as a file or database, is normally backed up to backup copies of the primary data object on physically discrete mass-storage devices or media so that if, during operation of the application or system, the primary data object becomes corrupted, inaccessible, or is overwritten or deleted, the primary data object can be restored by copying a backup copy of the primary data object from the mass-storage device. Many different techniques and methodologies for maintaining backup copies have been developed. In one well-known technique, a primary data object is mirrored. FIG. 5 illustrates object-level mirroring. In FIG. 5, a primary data object “O3” 501 is stored on LUN A 502. The mirror object, or backup copy, “O3” 503 is stored on LUN B 504. The arrows in FIG. 5, such as arrow 505, indicate I/O write operations directed to various objects stored on a LUN. I/O write operations directed to object “O3” are represented by arrow 506. When object-level mirroring is enabled, the disk array controller providing LUNs A and B automatically generates a second I/O write operation from each I/O write operation 506 directed to LUN A, and directs the second generated I/O write operation via path 507, switch “S1” 508, and path 509 to the mirror object “O3” 503 stored on LUN B 504. In FIG. 5, enablement of mirroring is logically represented by switch “S1” 508 being on. Thus, when object-level mirroring is enabled, any I/O write operation, or any other type of I/O operation that changes the representation of object “O3” 501 on LUN A, is automatically mirrored by the disk array controller to identically change the mirror object “O3” 503. Mirroring can be disabled, represented in FIG. 5 by switch “S1” 508 being in an off position. 
In that case, changes to the primary data object “O3” 501 are no longer automatically reflected in the mirror object “O3” 503. Thus, at the point that mirroring is disabled, the stored representation, or state, of the primary data object “O3” 501 may diverge from the stored representation, or state, of the mirror object “O3” 503. Once the primary and mirror copies of an object have diverged, the two copies can be brought back to identical representations, or states, by a resync operation represented in FIG. 5 by switch “S2” 510 being in an on position. In the normal mirroring operation, switch “S2” 510 is in the off position. During the resync operation, any I/O operations that occurred after mirroring was disabled are logically issued by the disk array controller to the mirror copy of the object via path 511, switch “S2,” and path 509. During resync, switch “S1” is in the off position. Once the resync operation is complete, logical switch “S2” is disabled and logical switch “S1” 508 can be turned on in order to reenable mirroring so that subsequent I/O write operations, or other I/O operations that change the storage state of primary data object “O3,” are automatically reflected to the mirror object “O3” 503.
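The roles of logical switches “S1” and “S2” can be sketched as follows. This is an illustrative model only: the class name `MirroredObject`, the dictionaries standing in for the two stored objects, and the list used to remember writes made while mirroring is disabled are assumptions; the description above states only that such operations are logically issued to the mirror during resync, not how they are tracked.

```python
class MirroredObject:
    """Hypothetical model of a primary object and its mirror copy."""

    def __init__(self):
        self.primary = {}       # address -> data on LUN A
        self.mirror = {}        # address -> data on LUN B
        self.mirroring = True   # corresponds to switch S1 being on
        self.pending = []       # writes made while S1 is off (assumed bookkeeping)

    def write(self, address, data):
        self.primary[address] = data
        if self.mirroring:
            # S1 on: a second write is generated and sent to the mirror.
            self.mirror[address] = data
        else:
            # S1 off: the copies may now diverge; remember the write.
            self.pending.append((address, data))

    def disable_mirroring(self):
        self.mirroring = False

    def resync(self):
        # Corresponds to S2 on, S1 off: replay the remembered writes in order.
        if self.mirroring:
            raise RuntimeError("resync is performed while mirroring is disabled")
        for address, data in self.pending:
            self.mirror[address] = data
        self.pending.clear()

    def enable_mirroring(self):
        # S1 may be turned back on only once the copies are identical again.
        if self.pending:
            raise RuntimeError("resync required before mirroring is reenabled")
        self.mirroring = True
```

Replaying the remembered writes in their original order leaves the mirror in the same state as the primary, after which mirroring can be reenabled.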
FIG. 6 illustrates a dominant LUN coupled to a remote-mirror LUN. In FIG. 6, a number of computers and computer servers 601-608 are interconnected by various communications media 610-612 that are themselves interconnected by additional communications media 613-614. In order to provide fault tolerance and high availability for a large data set stored within a dominant LUN on a disk array 616 coupled to server computer 604, the dominant LUN 616 is mirrored to a remote-mirror LUN provided by a remote disk array 618. The two disk arrays are separately interconnected by a dedicated communications medium 620. Note that the disk arrays may be linked to server computers, as with disk arrays 616 and 618, or may be directly linked to communications medium 610. The dominant LUN 616 is the target for READ, WRITE, and other disk requests. All WRITE requests directed to the dominant LUN 616 are transmitted by the dominant LUN 616 to the remote-mirror LUN 618, so that the remote-mirror LUN faithfully mirrors the data stored within the dominant LUN. If the dominant LUN fails, the requests that would have been directed to the dominant LUN can be redirected to the mirror LUN without a perceptible interruption in request servicing. When operation of the dominant LUN 616 is restored, the dominant LUN 616 may become the remote-mirror LUN for the previous remote-mirror LUN 618, which becomes the new dominant LUN, and may be resynchronized to become a faithful copy of the new dominant LUN 618. Alternatively, the restored dominant LUN 616 may be brought up to the same data state as the remote-mirror LUN 618 via data copies from the remote-mirror LUN and then resume operating as the dominant LUN. Various types of dominant-LUN/remote-mirror-LUN pairs have been devised. Some operate entirely synchronously, while others allow for asynchronous operation and reasonably slight discrepancies between the data states of the dominant LUN and mirror LUN.
In certain cases, LUN mirroring is launched immediately after configuring the LUNs, so that, from the very first unit of data written to a mirror pair, each LUN of the mirror pair is identically updated. In such cases, LUN copying, or replication, is generally not needed. In other cases, a non-mirrored primary LUN is first replicated, to create a copy LUN, and then the primary LUN and copy LUN are paired together as a mirror pair, which can be later split for independent access and updating. In still other cases, rather than using mirror LUNs, a system achieves reliability by periodically creating full, or snapshot, copies of a primary LUN to serve as consistent, backup copy LUNs that can be used to restore the primary LUN in the event that the physical-storage devices containing the primary LUN fail or become inaccessible. LUN replication is thus a fundamental operation in many, and perhaps most, high-availability and disaster-recovery-capable mass-storage-device systems.
In general, mass-storage devices, such as disk arrays, support two basic types of LUN replication: (1) full copy; and (2) snapshot copy. A full copy involves faithfully copying each sector within a primary LUN to a copy LUN. In general, the copy LUN is statically allocated in advance of the replication operation. While the replication operation is underway, the primary LUN continues to provide READ and WRITE access, while the copy LUN remains in a READ-only state until the replication operation is complete. For large LUNs, the replication operation may take considerable amounts of time, and create considerable, although temporary, performance bottlenecks within a mass-storage device such as a disk array.
A snapshot copy, by contrast, can be created essentially immediately. In a snapshot copy, a cache-resident meta-data map indicates whether a particular unit of data, such as a sector or block, of the copy LUN is resident in the primary LUN or in a delta-data LUN used to store data written to the copy LUN following a split of the primary-LUN/copy-LUN pair. Initially, the copy LUN is identical to the primary LUN, and so each reference within the meta-data map points to a sector, or block, of data within the primary LUN. After the primary LUN is split from the copy LUN, data written to the copy LUN begins to fill the delta-data LUN, and corresponding meta-data references point to blocks, or sectors, within the delta-data LUN. The snapshot copy is thus a virtual copy of the primary LUN implemented as a combination of the primary LUN, the meta-data map, and the delta-data LUN. Unlike a full copy, a snapshot copy cannot be used to restore a failed primary LUN following a primary-LUN catastrophe. However, snapshot copies require allocation of fewer internal resources than full copies, and can be used for creating virtual copies to offload accesses from the primary LUN for purposes such as backing up the snapshot data to a secondary mass-storage device.
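The combination of primary LUN, meta-data map, and delta-data LUN can be sketched as shown below. The class name `SnapshotCopy`, the list-based representation of each LUN, and the tuple format of the map entries are hypothetical. The sketch models only writes directed to the copy LUN, as described above, and does not address writes made to the primary LUN after the split.

```python
class SnapshotCopy:
    """Hypothetical virtual copy LUN: primary LUN + meta-data map + delta-data LUN."""

    def __init__(self, primary: list):
        self.primary = primary   # blocks of the primary LUN (shared, not copied)
        self.delta = []          # delta-data LUN, filled as the copy is written
        # Initially every block of the copy refers to the same block of the primary.
        self.meta = [("primary", i) for i in range(len(primary))]

    def read(self, block):
        where, index = self.meta[block]
        return self.primary[index] if where == "primary" else self.delta[index]

    def write(self, block, data):
        where, index = self.meta[block]
        if where == "delta":
            self.delta[index] = data          # block already diverged: overwrite
        else:
            self.delta.append(data)           # first write: allocate delta space
            self.meta[block] = ("delta", len(self.delta) - 1)
```

Creating the snapshot touches no data blocks, which is why creation is essentially immediate. Every unwritten block is still read from the primary LUN, which is also why the snapshot cannot survive the loss of the primary LUN.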
As the demands on mass-storage devices, such as disk arrays, increase, and with increasing demands on high-availability and fault-tolerant systems, the shortcomings of full copies and snapshot copies have become less tolerable to high-availability and fault-tolerant systems designers, manufacturers, and users. For disaster-recovery-capable and high-availability systems, snapshot copies cannot be used for robust backups because much of the data virtually associated with a snapshot-copy LUN actually remains stored on the primary LUN. Thus, the primary LUN remains a single point of failure. However, the significant overhead in time and resource utilization attendant with a full LUN copy may seriously impact the performance of a high-availability system. For these, and additional reasons to be discussed below, designers, manufacturers, and users of mass-storage devices, and of high-availability and fault-tolerant systems built around them, have recognized the need for a more efficient LUN copy operation that provides the robustness of a full LUN copy.