The term "disk array" is used herein to refer to a class of computer systems in which multiple magnetic disks are used, in parallel, to improve performance and reliability in the storage of data. The prior art has suggested that redundant arrays of inexpensive disks (RAID) offer an attractive alternative to large expensive disk drives and promise performance improvements of an order of magnitude. Patterson et al. in "A Case for Redundant Arrays of Inexpensive Disks (RAID)", ACM Sigmod Conference, Chicago, Ill., Jun. 1-3, 1988, pages 109-116, describe multiple levels of RAID systems to enable data redundancy and improve the reliability of disk array systems.
The concept of "striping" is described by Patterson et al. and refers to the interleaving of data across a plurality of disk drives. The interleaving may be by bit, byte, word or block, with succeeding data elements placed upon succeeding disk drives in a "stripe" arrangement. In block striping systems, each block is written to a single disk, but subsequent blocks are scattered to other disks. Striping techniques improve performance, but not reliability.
Increased reliability may be obtained through the storage of redundant error correcting codes on multiple drives of the disk array. Should an individual disk fail, such codes may be used to reconstruct the lost data, which is then written to a replacement disk when one becomes available. The RAID system arrangements described by Patterson et al. include several examples of redundancy that enable data recovery. One such RAID system employs "mirroring" and another employs "parity". Mirroring systems improve reliability by writing each block of data on at least two separate disk drives. Should one drive fail, there is then at least one other drive containing the same data; the system may continue to run with the remaining drive, or the duplicate data may be reconstructed by copying it to a replacement drive.
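By way of illustration only, the block striping arrangement described above may be sketched as a simple address mapping. The function name `locate_block` and the round-robin placement rule are assumptions made for this sketch; Patterson et al. do not prescribe this particular mapping.

```python
def locate_block(logical_block: int, num_disks: int) -> tuple[int, int]:
    """Map a logical block number to (disk index, block offset on that disk).

    Assumes round-robin block striping: succeeding logical blocks are
    placed on succeeding disks, wrapping around after the last disk.
    """
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    if logical_block < 0:
        raise ValueError("logical_block must be non-negative")
    disk = logical_block % num_disks      # which drive holds the block
    offset = logical_block // num_disks   # position of the block on that drive
    return disk, offset


# Hypothetical four-disk array: blocks 0..3 fall on disks 0..3 at offset 0,
# and block 4 wraps around to disk 0 at offset 1.
placements = [locate_block(n, 4) for n in range(6)]
```

Under this mapping, consecutive blocks reside on different drives and may therefore be transferred in parallel, which is the source of the performance improvement. No block is stored more than once, so the mapping by itself adds no reliability.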
Parity disk systems provide redundancy by grouping several disks into a "parity group". All but one of the disk drives contain ordinary disk blocks, whereas blocks on the remaining disk drive are written to contain the bitwise exclusive OR sum (modulo 2) of the data in the corresponding blocks on the other drives. Then, if any single drive is lost, its data may be reconstructed by exclusive ORing the data on the remaining drives. Updates may be made by writing new data in the appropriate positions on the data disk drives and adding the differences, using an exclusive OR function, of the old and new data to the corresponding block on the parity disk drive.
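The parity computation, reconstruction and update operations described above may be sketched as follows. This is a minimal illustration only; the function names and the in-memory representation of blocks as `bytes` objects are assumptions of the sketch, not features of any particular prior art system.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Return the bitwise exclusive OR of one or more equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields, for each byte position, the bytes at that position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def compute_parity(data_blocks: list[bytes]) -> bytes:
    """Parity block: XOR of the corresponding blocks on all data drives."""
    return xor_blocks(*data_blocks)


def reconstruct(surviving_blocks: list[bytes]) -> bytes:
    """Rebuild the block of a single lost drive from all remaining blocks
    of the parity group (remaining data blocks plus the parity block)."""
    return xor_blocks(*surviving_blocks)


def update_parity(old_parity: bytes, old_data: bytes, new_data: bytes) -> bytes:
    """New parity = old parity XOR (old data XOR new data)."""
    return xor_blocks(old_parity, old_data, new_data)


# Hypothetical parity group of three data drives and one parity drive.
data = [b"\x0f\x10", b"\xf0\x01", b"\x33\x22"]
parity = compute_parity(data)

# Drive 1 is lost; its block is recovered from the other drives.
recovered = reconstruct([data[0], data[2], parity])

# Drive 1 is overwritten; parity is updated without reading drives 0 and 2.
new_block = b"\xaa\x55"
new_parity = update_parity(parity, data[1], new_block)
```

The update path reads only the old data block and the old parity block, so its cost does not grow with the number of drives in the parity group.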
In such mirror and parity systems, a synchronization problem arises upon the occurrence of a power failure and a subsequent power-up. Data is said to be "synchronized" across redundant disks when identical data is present on redundant disks. The data is said to be "unsynchronized" if data, which should be identical, is not. The problem on power-up is to determine, across redundant disks, which data is synchronized and which is not.
Error correction techniques employing mirroring and parity systems therefore work well only if the data and the error correction codes are written consistently. It must never be the case, particularly after a system failure, that data and the error correction code which protects that data are inconsistent. At the very least, it must be possible for the system to efficiently identify and correct any inconsistencies.
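The kind of inconsistency at issue may be illustrated with a short sketch in which a data block is rewritten but power is lost before the corresponding parity block is updated. The function `stripe_is_consistent` is hypothetical and represents a naive check that recomputes parity from every data block; it is offered only to show the problem, not as an efficient method of identifying inconsistencies.

```python
def xor_pair(a: bytes, b: bytes) -> bytes:
    """Bitwise exclusive OR of two blocks (assumed to be of equal length)."""
    return bytes(x ^ y for x, y in zip(a, b))


def stripe_is_consistent(data_blocks: list[bytes], parity: bytes) -> bool:
    """Recompute parity from the data blocks and compare with the stored parity."""
    expected = bytes(len(parity))  # all-zero block of the same length
    for block in data_blocks:
        expected = xor_pair(expected, block)
    return expected == parity


# Hypothetical parity group before the failure: data and parity agree.
data = [b"\x01", b"\x02"]
parity = b"\x03"
before = stripe_is_consistent(data, parity)

# The new data reaches drive 0, but power fails before the parity drive
# is updated, so the stored parity still reflects the old data.
data[0] = b"\x05"
after = stripe_is_consistent(data, parity)
```

Such a check reveals that the group is inconsistent but not which block is stale, and applying it to every block of every drive on power-up requires reading the entire array.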
If all data is written by a central shared controller and if that controller obtains sufficient early warning that power is to be lost, synchronization may be ensured by the controller prior to loss of power. However, shared controllers of this sort tend to represent a single point of failure, thereby limiting the overall reliability of the system. They also limit flexibility in sharing standby replacement drives among a large group of operational disks.
The problem becomes more complex in architectures wherein multiple disks are written asynchronously, typically by separate controllers which can reside on separate processing nodes that are connected by a communication network. Where such disks are used for transaction processing systems, the prior art has made provision for using high level software transaction logs to enable resynchronization of the various disks following a system failure.
High speed random access memories (RAMs) have been used as buffers to improve the performance of disk systems. Such buffers, or caches, are often allocated from the main memory of the central processor controlling the disk drive, or they may be contained in and private to the disk controller circuitry. In either case, such buffers eliminate the need to repeatedly re-read frequently accessed data from the disk. Data is placed in the high speed buffer when first accessed and retained there as long as possible. When data is altered, the high speed buffer is modified and the changed data is written immediately to disk to avoid the possibility of losing the data should the system, as a whole, fail. These types of data caches are known as "write through" caches.
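The behavior of a write through cache as described above may be sketched as follows. The class name, the use of a dictionary to stand in for the disk, and the absence of any eviction policy are simplifying assumptions of this sketch.

```python
class WriteThroughCache:
    """Minimal write through cache in front of a block store.

    The 'disk' is modeled as a dict mapping block number to bytes.
    Buffer entries are never evicted in this sketch; a real cache has
    finite capacity and must choose which entries to discard.
    """

    def __init__(self, disk: dict[int, bytes]) -> None:
        self.disk = disk
        self.buffer: dict[int, bytes] = {}
        self.disk_reads = 0  # counts reads that actually reach the disk

    def read(self, block_no: int) -> bytes:
        # On first access, fetch from disk and retain the block in the buffer.
        if block_no not in self.buffer:
            self.buffer[block_no] = self.disk[block_no]
            self.disk_reads += 1
        return self.buffer[block_no]

    def write(self, block_no: int, data: bytes) -> None:
        # Modify the buffer and write the change to disk immediately,
        # so that no update exists only in volatile memory.
        self.buffer[block_no] = data
        self.disk[block_no] = data


# Hypothetical usage: two reads of the same block cause one disk read.
disk = {0: b"old"}
cache = WriteThroughCache(disk)
cache.read(0)
cache.read(0)
cache.write(0, b"new")
```

Because every write reaches the disk before the operation completes, a write through cache improves read performance only; writes proceed at disk speed.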
Random access memories now exist whose contents are retained in the event of system failures. Certain semiconductor memories may be used as non-volatile memories through the use of battery back up or other reliable power sources. Other semiconductor memories are naturally stable when power is removed and will retain stored data in the event of a power failure without battery back up. The term "non-volatile RAM" (NVRAM) will be used herein to refer to all such memory devices. The IBM 3990 disk controller, for example, uses NVRAM to cache data for an extended period before writing it to disk. Should the system fail and be restarted, any information in the NVRAM cache is presumed to be more current than corresponding data on the disk itself.
The patent prior art contains a number of teachings concerning parity protected disk arrays and other redundant disk arrangements. U.S. Pat. No. 4,761,785 to Clarke et al. introduces the concept of parity protected disk arrays and associated optimizations. Clarke et al. also describe a technique whereby version information stored in data "headers" on disk supports detection of version mismatches between data and associated parity blocks. Correction of such mismatches, in the event of a power failure and a subsequent power-up, is not considered.
In U.S. Pat. No. 4,654,819 to Stiffler et al., redundant RAM is used in the implementation of a fault tolerant computer system. Updates to main memory are buffered, with changes being stored in a special cache unit until the program context switches or the cache fills. At that point, a two phase update protocol is employed to update first one, and then a second main memory, always leaving sufficient information that the operation can be restarted or aborted if either memory fails.
U.S. Pat. No. 4,942,579 to Goodlander et al. describes various RAID arrangements of striping across multiple disk drives and techniques for reconstruction of data should one such drive fail. The architecture described by Goodlander et al. employs a single, battery-backed "cache memory" for caching and fast writing. No provision is made, however, for back-up of the "cache" memory in the event of its failure.
In U.S. Pat. No. 5,051,887 to Berger et al., a mainframe disk controller is described that includes mirroring functions. NVRAM is used therein as a fast write and synchronization mechanism. The Berger et al. system includes a single NVRAM as a fast write buffer, along with a second volatile cache. The system ensures that data resides in NVRAM and cache, or in NVRAM and one disk, before acknowledging a fast write. This protects against loss of a cache or NVRAM while the system operates and provides recovery should power be lost. The Berger et al. system does not provide any teaching as to how recovery is possible if NVRAM or a drive is lost during a power outage. Further, there is no teaching in Berger et al. regarding provision for synchronizing data in the event of a failure in various RAID arrangements.
U.S. Pat. No. 4,603,406 to Miyazaki et al. teaches the resynchronizing of two memories, each of which has separate battery back-up and normally stores the same data. If battery back-up is lost at any point during a power outage, the contents of the corresponding memory are not trustworthy. When the system later reacquires memory, a means is provided for noting which memories have lost their contents.
In U.S. Pat. No. 4,530,054 to Hamstra et al., a disk controller is described with a central cache wherein updates are accomplished in a store-in manner (the updates are retained in the cache for some period before being written to a permanent backing store). The Hamstra et al. system indicates to a host processor approximately how much data is at risk in cache should there be a power failure.
Other teachings regarding the performance of redundant data storage can be found in the following documents: Defensive publication T932,005 of V. J. Kruskal; U.S. Pat. Nos. 4,493,083 to Kinoshita, 4,393,500 to Imezaki et al., 5,043,871 to Nishigaki et al., 4,419,725 to George et al., 4,697,266 to Finley, 4,410,942 to Milligan et al.; PCT International Application WO 90/06550 to Hotle; and Japanese Patent 62-194557.
Accordingly, it is an object of this invention to provide a redundant arrangement of disk drives with an improved method of data synchronization in the event of power failure.
It is another object of this invention to provide an improved method of data synchronization for a redundant array of disk drives wherein the failure of a drive or cache memory during a power failure can be detected and recovered from.