Many approaches have been developed for protecting critical data stored in digital data systems against loss resulting from incidents such as power failure or power transients, equipment malfunction, human error or other events of comparable effect. In one approach, normal operations on a data processing system are stopped so that all or selected portions of the data stored on a particular drive (e.g., a disk drive) can be transferred to a backup medium, such as a magnetic tape, thereby backing up the memory system by providing a "snapshot" of the memory system at the time of the backup. Successive backups may then either copy onto the backup media the entire contents of the memory system or only the incremental changes that have been made since the prior backup.
This approach is still used in data processing systems. However, even in personal computer systems, the time to complete such a backup may require an hour or more. It may also take a significant time to restore the information from the backup medium following a failure of the primary storage system, particularly if a storage system, such as a disk drive, fails completely. While such approaches may be acceptable for providing redundancy in home and small office systems, in recent years there has arisen another category of data processing systems that requires essentially full-time availability of the data and that incorporates large memory systems. Conventional backup procedures simply cannot be used with such systems without introducing significant service interruptions that can lead to unacceptable intervals during which the data processing system is not available for its normal operations.
In such systems, the data storage system often includes multiple disk controllers, each having the capability of controlling multiple disk drives or other storage units. In some prior art systems, not only is a data file written to a specific disk drive, as a primary disk drive, through its corresponding disk controller, but also the file is written to another disk, as a secondary disk drive, connected to the same or another disk controller. This provides full redundancy. However, the "host" data processing system serviced by this mass storage subsystem must perform two writing operations instead of one. If there is a single communications path between the host system and the mass storage subsystem, these write operations must be performed sequentially. The need to execute sequential operations can affect both the performance and operation of the data processing system. For example, each copy of the data to be stored may be written randomly on each disk using the available parts of the media, as a result of which the file can become significantly fragmented. This condition, in turn, can produce undesirably long retrieval times. Moreover, in such systems, all normal reading operations involve the primary disk drive. No attempt is made to read from the secondary disk drive unless a problem occurs in the primary disk drive. This is somewhat dangerous inasmuch as the condition of the data on the secondary disk drive is unknown until it is needed, and if it is not error-free at that time, there is no other source from which to retrieve the needed file.
U.S. Pat. No. 5,390,313 issued to Yanai, et al., and assigned to the assignee of this application, discloses a data storage system with data storage redundancy. The system includes at least one pair of disk storage devices. Each device has a plurality of generally identical data records. These are "mirrored" disks or storage media. Each medium includes position indicators for providing one or more indications of rotational position of each of the rotating data storage media with respect to its associated fixed position read/write mechanism. A position monitor receives the rotational position indications from each rotating data storage medium and computes and monitors the rotational position of each rotating storage medium with respect to its associated read/write mechanism. After receiving a request for access to one or more data records stored on the pair of rotating data storage media, the system computes projected data access times for retrieving the requested data record on each of the rotating data storage media and commands retrieval of the requested data record to the rotating data storage medium having the shortest projected data access time based upon rotational position and state of the respective data storage medium. Consequently, unlike the previously discussed file copy systems, data can be, and is, read from either of the mirrored memories.
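The selection rule described above can be illustrated with a minimal sketch. It is not taken from the cited patent: the names `MirrorState`, `projected_access_ms` and `pick_mirror`, the linear seek model, and the timing constants are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass


@dataclass
class MirrorState:
    name: str
    head_track: int    # track currently under the read/write head
    angle_deg: float   # current rotational position of the medium, 0 <= angle < 360


# Hypothetical drive parameters, chosen only for illustration.
SEEK_MS_PER_TRACK = 0.1
REVOLUTION_MS = 10.0


def projected_access_ms(drive, target_track, target_angle_deg):
    """Seek time plus the rotational delay until the record reaches the head."""
    seek = abs(target_track - drive.head_track) * SEEK_MS_PER_TRACK
    # The medium keeps rotating while the head seeks.
    angle_after_seek = (drive.angle_deg + 360.0 * seek / REVOLUTION_MS) % 360.0
    wait_deg = (target_angle_deg - angle_after_seek) % 360.0
    return seek + REVOLUTION_MS * wait_deg / 360.0


def pick_mirror(drives, target_track, target_angle_deg):
    """Return the mirror with the shortest projected access time."""
    return min(drives, key=lambda d: projected_access_ms(d, target_track, target_angle_deg))
```

With two mirrors on the same track but half a revolution apart, the sketch picks the one whose medium will bring the requested record under the head sooner.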
U.S. Pat. No. 5,212,784 issued to Sparks discloses another type of automated backup system in which separate logical buses couple a primary controller to at least a set of paired mirrored memories or shadowed primary data storage devices. A backup device controller attaches to one of the logical buses and a backup device. In normal operation, the primary controller writes data to both the primary data storage devices to produce mirrored copies. The backup device controller transfers data that is read from a designated one of the primary data storage devices to the backup storage device. After backup is complete, the primary controller re-synchronizes the primary data storage devices so that data that has been written on the continuously operational data storage device is copied onto the designated data storage device. In an alternative embodiment, separate logical buses couple the primary controller to at least a set of triplet or quadruplet mirrored or shadowed primary data storage devices. Triplet devices permit backup operation while retaining the redundancy characteristic of the mirrored storage devices. Quadruplet devices permit continuous backup operations of two alternating storage devices while retaining the redundancy characteristic of mirrored storage devices.
U.S. Pat. No. 5,423,046 issued to Nunnelley et al. discloses a high capacity data storage system with a large array of small disk files. Three storage managers control (1) the allocation of data to the array, (2) access to the data and (3) the power status of disk files within the disk array. More specifically, the allocation manager controls, inter alia, the type of protection desired to include redundancy by mirroring. The access manager interprets incoming read requests to determine the location of the stored data. That is, the access manager determines which cluster or clusters in the data memories contain the requested data set and then passes that cluster list to the power manager. The power manager determines which disk files must be activated to fulfill the request.
U.S. Pat. No. 5,392,244 issued to Jacobson et al. discloses memory systems with data storage redundancy utilizing both mirroring and parity redundancy. The memory system places more critical data in the mirrored areas and less critical data in the parity area. Consequently, the system effectively tunes the storage resources of the memory system according to the application or user requirements. Alternatively, the tuning can be made on the basis of accesses to the data such that the mirrored areas store recently accessed data while the parity RAID area stores the remaining data.
U.S. Pat. No. 5,432,922 issued to Polyzois et al. discloses a storage system using a process of alternating deferred updating of mirrored storage disks. Data blocks or pages to be written are accumulated and sorted into an order for writing on the disk efficiently. The individual disks of a mirrored pair are operated out of phase with each other so that while one disk is in the read mode the other is in the write mode. Updated blocks are written out to the disk that is in the write mode in sorted order. Read performance is provided by directing all read operations to the other disk, which is in the read mode. When a batch of updates has been applied to one disk of a mirrored pair, the mirrored pair switch their modes and the other disk, which had been in the read mode, is updated.
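A minimal sketch of alternating deferred updating follows. It is an illustration under stated assumptions rather than the method of the cited patent: the class name `AlternatingMirrorPair` is hypothetical, disks are modeled as dictionaries, updates not yet on disk are served from memory, and the disk that was in read mode is brought up to date at the next swap rather than immediately.

```python
class AlternatingMirrorPair:
    """Toy model: one disk serves reads while the other absorbs sorted writes."""

    def __init__(self):
        self.disks = [{}, {}]   # block number -> data, one dict per disk
        self.read_disk = 0      # index of the disk currently in read mode
        self.pending = {}       # updates accumulated since the last swap
        self.catch_up = {}      # last batch, still missing from the disk now in write mode

    def write(self, block, data):
        # Writes are deferred: they only accumulate in memory here.
        self.pending[block] = data

    def read(self, block):
        # Assumption: updates not yet on disk are served from memory.
        if block in self.pending:
            return self.pending[block]
        return self.disks[self.read_disk].get(block)

    def flush_and_swap(self):
        """Apply the batch in sorted block order to the write-mode disk, then swap modes."""
        write_disk = 1 - self.read_disk
        batch = {**self.catch_up, **self.pending}   # newer pending data wins
        order = sorted(batch)
        for block in order:
            self.disks[write_disk][block] = batch[block]
        self.catch_up = self.pending   # the other disk still needs this batch
        self.pending = {}
        self.read_disk = write_disk
        return order
```

Writing in sorted block order stands in for the efficient head movement that the text attributes to sorting. The disk leaving read mode receives the batch it missed together with the next batch.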
U.S. Pat. No. 5,435,004 issued to Cox et al. discloses yet another redundant storage variant. A computerized data backup system dynamically preserves a consistent state of primary data stored in a logical volume of a disk volume management system. A file system command invokes a cloning of the logical volume, thereby reserving a portion for shadow-paged blocks. A read/write translation map establishes a correspondence between unshadowed and shadowed pages in a reserved portion. Upon generating a read command for a page in a logical volume, a map search detects that a shadowed page is allocated to the shadowed page blocks corresponding to the page and effects the read. Backup occurs while the system is operating, thereby facilitating reading from the non-shadow page blocks during such a backup.
In still another system that has been utilized by the assignee of this invention, each of two mirrored individual disk drives, as physical disk volumes, is divided into blocks of consecutive tracks in order. Typically the number of tracks in each block is fixed and is not dependent upon any boundary with respect to any file or data stored on the blocks. A typical block size might include four tracks. Assume for purposes of explanation that the blocks were numbered consecutively (i.e., 0, 1, 2, . . . ), with block 0 comprising tracks 0 through 3; block 1, tracks 4 through 7; etc. During each reading operation, the data system reads all data from odd-numbered blocks (i.e., blocks 1, 3, . . . ) from the first mirrored physical disk drive and all the even-numbered blocks (i.e., blocks 0, 2, 4 . . . ) from the second mirrored physical disk drive. However, when a read operation recovers a data block that resides on consecutive blocks of tracks, for example, track blocks 1 and 2, the reading operation from the first physical disk drive must stop at track 7. Then the second disk drive must move its head to the appropriate track, track 8 in this example, to retrieve the next block. This interval, or "seek time", and a corresponding "latency", which represents the time required for the beginning of a track to reach a read/write head, determine the total access time. By contrast, continuing the reading operation with the first disk drive might introduce a one-track seek time and one-revolution latency. Such a total access time will interrupt the transfer and can significantly affect the overall rate at which data is transferred from the physical disk drives.
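The track-to-drive mapping in this interleaved arrangement can be sketched as follows, using the four-track block size and the odd/even rule given above. The function names and the representation of a read as a list of (drive, first track, last track) runs are assumptions made for illustration only.

```python
TRACKS_PER_BLOCK = 4  # block size used in the example above


def block_of(track):
    """Block 0 holds tracks 0-3, block 1 holds tracks 4-7, and so on."""
    return track // TRACKS_PER_BLOCK


def drive_for_track(track):
    """Odd-numbered blocks are read from drive 1, even-numbered blocks from drive 2."""
    return 1 if block_of(track) % 2 == 1 else 2


def read_plan(first_track, last_track):
    """Split a track range into runs served by one drive each.

    Every boundary between runs is a point where the transfer must switch
    drives and may incur a seek and rotational latency on the other drive.
    """
    plan = []
    for track in range(first_track, last_track + 1):
        drive = drive_for_track(track)
        if plan and plan[-1][0] == drive:
            plan[-1] = (drive, plan[-1][1], track)   # extend the current run
        else:
            plan.append((drive, track, track))       # start a new run on the other drive
    return plan
```

For the example in the text, `read_plan(4, 11)` yields one run on drive 1 for tracks 4 through 7 and one run on drive 2 for tracks 8 through 11, so a single logical read is split across both drives at the block boundary.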
Collectively the foregoing prior art discloses various approaches for minimizing the risk of data loss in a data processing system, particularly through the use of mirrored memory devices. This prior art also discloses various approaches for enabling reading operations from both physical disk drives in a mirrored pair. However, in these systems the decision on which of the mirrored pair will be used during a reading operation rests generally on the physical attributes of the disk drive rather than the data content of the drive. For example, the assignee's prior art system divides the physical drive into arbitrary blocks of contiguous disk tracks and then interleaves the reading operations according to the location of the data on a particular track. Another of the assignee's systems selects a particular one of the mirrored physical disk drives based upon the time it will take to initiate an actual transfer. Still others make a determination based upon whether one or the other of the mirrored disk pair is involved in a backup operation, in which case the reading operation is caused to occur from the other physical disk drive. Experience is demonstrating that while these approaches work effectively in some environments, they can actually slow the effective transfer rate of a particular block of data as defined in a file or in a like block in other environments that are now becoming more prevalent in commercial applications.
In yet another system that has been utilized by the assignee of this invention, physical disk drives in a mirrored pair are divided into logical volumes such that the mirrored logical volumes have identical data structures within the physical disks. A memory controller responds to a read command and includes a correspondence that assigns to each logical volume the identity of one of the first and second physical disk drives from which that logical volume is to be read. A data transfer controller responds to a read command by transferring the data in the logical volume from the identified physical disk drive that the correspondence assigns to the logical volume.
In accordance with another aspect of that system, there is provided a data processing system which includes, as components, at least one host adapter, a system memory including buffer memory, a command memory and a memory manager, first and second disk drives from which data is read, and first and second device controllers for controlling transfers with the first and second disk drives and interconnecting the first and second disk drives. A system bus interconnects these components. The host adapter includes a system memory manager that effects the transfer of a read command to the command memory over the system bus. Each of the first and second disk drives is divided identically into a plurality of logical volumes comprising a number of contiguous tracks, whereby the first and second disk drives are mirrors of each other. Each device controller includes a memory manager for controlling transfers between the corresponding device controller and the system memory. A buffer in each device controller stores data being transferred with the disk drive and a control connects to the buffer for controlling transfers between the disk drive and buffer. A correspondence table comprises an entry for each logical volume connected to the device controller. Each entry includes a read mode field and the control responds to the receipt of a read command by identifying a logical volume by using the correspondence table to connect the drive control for effecting a transfer from the connected one of the mirrored disk drives when the read mode field has a first value and for excluding any response when the read mode field has a second value.
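The role of the correspondence table and its read mode field can be sketched as shown below. This is only an illustration: the names `ReadMode` and `DeviceController`, the volume labels, and the string returned in place of a real data transfer are all hypothetical, and the two enumeration members merely stand in for the "first value" and "second value" of the read mode field.

```python
from enum import Enum


class ReadMode(Enum):
    SERVE = 1    # stands in for the "first value": this controller effects the transfer
    IGNORE = 2   # stands in for the "second value": this controller does not respond


class DeviceController:
    def __init__(self, drive_name, table):
        self.drive_name = drive_name
        self.table = dict(table)   # correspondence table: logical volume -> ReadMode

    def handle_read(self, volume):
        """Return a transfer description if this controller serves the volume, else None."""
        if self.table.get(volume) is ReadMode.SERVE:
            return f"{self.drive_name}:{volume}"
        return None
```

If the two controllers of a mirrored pair hold complementary tables, for example one with `{"LV0": ReadMode.SERVE, "LV1": ReadMode.IGNORE}` and the other with the values reversed, then exactly one controller responds to a read of any given logical volume, and reads of different volumes are spread across both drives.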
In general, therefore, there will be seen to be two goals addressed with respect to the performance of such mass storage systems: (1) to increase reliability of data storage and retrieval and (2) to improve data availability--i.e., to reduce the time required to access and retrieve or store data. With these goals in mind, let us consider specifically the most appropriate of the architectures discussed above.
Turning to FIG. 1, there is shown a so-called RAID (standing for redundant array of independent disks) level 1 approach (see The RAID Primer: An Introduction to RAID Technology (First ed.), The Raid Advisory Board, Inc., 13 Marie lane, St. Peter, Minn. 56082-9423, March, 1994, incorporated by reference herein). In a RAID level 1 system, such as that shown at 10, a pair of drives is provided but presented to the host 14 as but a single drive. The write data is written to both drives and can be read from either drive. That is, each drive is mirrored to another (in the Figure, the only other) drive, the mirroring drive being "invisible" to the host processor which is storing data to or retrieving data from the mirrored drive. Thus, the host system sees the storage subsystem as a single "black box;" the drive mirroring is accomplished out of view of the host, inside the black box. The host issues only one write command or one read command; it does not have to manage the mirrored drives separately. The drive controller (also called a drive adaptor) manages the drives for the host.
In FIG. 1, an (optional) read/write memory cache 16 is interposed between the host processor 14 and the mass storage subsystem 10. All read and write operations are funneled through the cache; indeed, through a single cache location in common for both drive members of a mirroring pair. This is illustrated figuratively in FIG. 1 by the dashed lines inside the cache, indicating the write path passes through the cache 16 and the cache then sends a copy of the write data to both of drives 11 and 12 (either sequentially or concurrently). It will be understood by those skilled in the art that a drive controller must be present, also, to control each drive; but the drive controllers are not shown to simplify the discussion.
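The host's view of such a RAID level 1 arrangement can be sketched as follows. The class name `Raid1Volume`, the use of dictionaries as stand-ins for drives 11 and 12 and for cache 16, and the round-robin choice of drive on a cache miss are assumptions made for illustration; the text above states only that data is written to both drives and can be read from either.

```python
class Raid1Volume:
    """Toy model of a mirrored pair presented to the host as a single drive."""

    def __init__(self):
        self.drives = [{}, {}]   # stand-ins for the two mirrored drives
        self.cache = {}          # one cache slot per block, shared by both mirrors
        self._next = 0           # round-robin read pointer (an assumption)

    def write(self, block, data):
        # The host issues a single write; the subsystem fans it out to both drives.
        self.cache[block] = data
        for drive in self.drives:
            drive[block] = data

    def read(self, block):
        # The host issues a single read; it may be satisfied by the cache or either drive.
        if block in self.cache:
            return self.cache[block]
        drive = self.drives[self._next]
        self._next = 1 - self._next
        data = drive[block]
        self.cache[block] = data
        return data
```

The host calls only `write` and `read` and never addresses an individual drive, which corresponds to the "black box" view described above.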
Note that this system is vulnerable to a single failure at several points, including the host/drive interface, the drive or system controller, and the cache memory. Since, as stated above, one of the principal objectives of mirroring is to increase reliability, a corollary is that it is desirable to eliminate such single-point failure possibilities.
With attention to FIG. 2, this is achieved by systems which provide a redundant interconnection between the host system 14 and the storage subsystem 20, allowing the host to access directly each of the (mirrored and mirroring) drives (22-1 through 22-T, where there are T drives present) in the storage subsystem. Each interconnection is accomplished via a system adaptor, or controller, labeled SA, 24 and 25 (only one exemplary controller being shown). Indeed, each drive may be controlled by its own, separate controller, DA, as well. Failure of one controller or one interconnection will not cause the entire mass storage system to fail. In this type of arrangement, the host system itself typically will control and effectuate the mirroring operation. That is, when the host has a block of data to be stored in the mass storage system, it separately writes the data to each of the two or more mirrored drives, first writing to one of the drives via a first connection, first system adaptor or controller, SA, and first path through cache memory (if cache is employed) to a first disk controller; then via a separate interconnection and separate controllers to the other disk drive(s) in the mirroring arrangement. The host has the responsibility of monitoring and maintaining the individual drive conditions within the storage subsystem.
Thus, if a write operation fails with respect to a specific block of data for a particular drive in a mirrored pair of drives, the host must make another attempt to write to the drive that failed the operation. If appropriate, the host may have to first read the data from another drive. Pending resolution of the failure, when the host desires to read that data from the mass storage system, it must ensure that the data block is read from the one of the drives that had correctly executed the write operation. The host, therefore, must keep track of the data that is valid and invalid on each drive. It also must, when a failed mirror drive or controller is replaced, initiate and supervise the process of writing to that mirror drive the information which the system expects to be present there. This may require that the host read the missing data from another of the paired mirror drives so that it can then be written to the drive whose contents must be updated.
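The bookkeeping that falls on the host in this arrangement can be sketched as shown below. The class `HostMirrorManager`, its fields, and the modeling of a failed drive as an `online` flag are hypothetical; the sketch shows only the three duties described above: recording which blocks are invalid on which drive, steering reads to a drive holding valid data, and rebuilding a replaced drive from its mirror.

```python
class HostMirrorManager:
    """Toy model of host-managed mirroring with per-drive validity tracking."""

    def __init__(self, n_drives=2):
        self.drives = [{} for _ in range(n_drives)]       # block -> data
        self.online = [True] * n_drives                   # False models a failed drive
        self.invalid = [set() for _ in range(n_drives)]   # blocks known to be stale per drive

    def write(self, block, data):
        # The host writes to each mirror separately and records any failure.
        for i, drive in enumerate(self.drives):
            if self.online[i]:
                drive[block] = data
                self.invalid[i].discard(block)
            else:
                self.invalid[i].add(block)

    def read(self, block):
        # Read only from a drive that correctly executed the last write of this block.
        for i, drive in enumerate(self.drives):
            if self.online[i] and block not in self.invalid[i]:
                return drive[block]
        raise IOError(f"no valid copy of block {block}")

    def resync(self, i):
        """After drive i is replaced, copy every stale block from a valid mirror."""
        self.online[i] = True
        for block in sorted(self.invalid[i]):
            # read() skips drive i because the block is still marked invalid there.
            self.drives[i][block] = self.read(block)
        self.invalid[i].clear()
```

Every write touches the host's validity records, and every resynchronization requires the host to read from one drive and write to another, which is the kind of overhead on the host and on the host-to-storage interface that this arrangement entails.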
This arrangement, therefore, imposes considerable overhead on the host processor, on the various controllers involved in the operations, and on the communications interface between the host and the storage subsystem. When the storage subsystem includes a cache which is duplicated for each mirrored drive, further overhead may be created: for example, two write operations to the cache will result in two writes pending in the cache and to be executed and cleared separately. That is, overhead and performance have been sacrificed somewhat to achieve higher reliability.
Accordingly, there exists a need for a drive-mirroring mass storage system with both high performance and high reliability, and achieving reduced operational overhead. This system should be usable with RAID architectures as the same are becoming popular and widely employed.