1. Field of the Invention
This invention relates to computer system data storage, and more particularly to a system for synchronizing the information stored in the reserved area within each storage unit of a redundant array system.
2. Description of Related Art
A typical data processing system generally includes one or more storage units which are connected to a Central Processing Unit (CPU) either directly or through a control unit and a channel. The function of the storage units is to store data and programs which the CPU uses in performing particular data processing tasks.
Various types of storage units are used in current data processing systems. A typical system may include one or more large capacity tape units and/or disk drives (magnetic, optical, or semiconductor) connected to the system through respective control units for storing data.
In such systems, a problem exists if one of the large capacity storage units fails such that information contained in that unit is no longer available to the system. Often, such a failure will shut down the entire computer system.
The prior art has suggested several ways of solving the problem of providing reliable data storage. In systems where records are relatively small, it is possible to use error correcting codes which generate ECC syndrome bits that are appended to each data record within a storage unit. With such codes, it is possible to correct a small amount of data that may be read erroneously. However, such codes are generally not suitable for correcting or recreating long records which are in error, and provide no remedy at all if a complete storage unit fails. Therefore, a need exists for providing data reliability external to individual storage units.
A number of approaches to such "external" reliability have been described in the art. A research group at the University of California, Berkeley, in a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)", Patterson, et al., Proc. ACM SIGMOD, June 1988, has catalogued five different approaches for providing such reliability when using disk drives as failure-independent storage units. Arrays of relatively low cost disk drives are characterized in one of five architectures, under the acronym "RAID" (for Redundant Arrays of Inexpensive Disks). FIG. 1 shows one such system in which a multiplicity of storage units 1 are connected to a controller 2. The controller 2 is coupled to a Central Processing Unit (CPU) 3 by standard bus 4.
In such systems, in which arrays of low cost storage units are provided as the means for storing large quantities of data with very high fault tolerance, it is common to allocate a number of sections of each storage unit as general data storage areas 5 for general use by the system, and to allocate one section of each storage unit as a "Reserved Area" (RA) 6. Each RA 6 is used to store items such as: system configuration information; primary software which can be recovered and loaded upon initiation of the system; secondary software which can be recovered and loaded at a time after initiation of the system; an "error log" in which an error history can be saved for use by diagnostic routines; temporary data for a host "scratch-pad"; and diagnostic software routines which can be executed or loaded as needed by the system.
In FIG. 1, the RAs 6 within each data storage unit 1 are shown allocated as described above. In such systems, it is a common practice to replicate the Reserved Area information in corresponding blocks within all of the storage units of the array. (A "block" of data is that grouping of data which is typically managed as a unit when performing reading and writing operations, and typically is the smallest addressable unit of data.) The practice of replicating the Reserved Area information allows a failure of one or more of the individual storage units to have a minimal impact upon the operation of the system. Additionally, there is no need to determine which of the storage units currently maintains the Reserved Area information. However, since each storage unit is equally responsible for maintaining an RA 6, it is vital that the information in the RA 6 of every storage unit be exactly the same as the information in the RA 6 of every other storage unit. The process of maintaining the same information in each RA 6 is known in the art as "synchronizing" the data storage units. When the Reserved Area information stored in the various data storage units is not identical, the data storage units are said to be "unsynchronized." In an unsynchronized system it may be difficult or impossible to verify which RAs 6 contain valid information.
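By way of illustration only, the notion of a "synchronized" array described above can be sketched in Python. The sketch assumes each RA can be read as a byte string; the names `ra_digest` and `is_synchronized` are hypothetical and do not appear in the art described here.

```python
import hashlib
from typing import Sequence


def ra_digest(ra_bytes: bytes) -> str:
    """Return a SHA-256 digest of one Reserved Area's contents."""
    return hashlib.sha256(ra_bytes).hexdigest()


def is_synchronized(reserved_areas: Sequence[bytes]) -> bool:
    """True when every storage unit holds byte-identical RA contents."""
    # Identical contents yield identical digests, so a synchronized
    # array collapses to at most one distinct digest.
    return len({ra_digest(ra) for ra in reserved_areas}) <= 1


# Hypothetical four-unit array, each unit holding the same RA image.
units = [b"config-v1"] * 4
print(is_synchronized(units))  # True

# One unit now differs, so the array is "unsynchronized".
units[2] = b"config-v2"
print(is_synchronized(units))  # False
```

Note that such a check reports only whether the RAs agree; it does not indicate which RA holds the valid information.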
A common problem which faces such systems is how to perform a reserved area update process in a manner that can tolerate interruptions and prevent the system from becoming unsynchronized, and thereby prevent uncertainty as to which storage units contain valid Reserved Area information.
For example, if a power supply failure occurs while the RAs of a particular system are being synchronized in response to a change in the system configuration, then the synchronization process will abort before it is complete. Upon return of power to the system, it may be impossible to reliably determine which of the storage units has valid data and which has old or otherwise unreliable data. Comparing the data in each of the RAs would indicate that a problem had occurred in the synchronization process, but would give no indication as to whether the problem occurred early in the process, in which case only a few of the storage units would be updated with the new RA data, or near completion of the process, in which case most of the data storage units would contain the new data and only a few would contain the old data. Additionally, some storage units might contain corrupt data due to the sudden loss of power to a storage unit during a Write sequence.
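The ambiguity described above can be illustrated with a simplified Python simulation. It assumes a naive update that writes the new RA image to each unit in turn, with a power failure modeled as stopping after a given number of writes; the names `update_all` and `survey`, and the labels `b"old"` and `b"new"`, are hypothetical. Corrupt partial writes are not modeled.

```python
from collections import Counter


def update_all(units, new_ra, fail_after):
    """Write new_ra to each unit in order; stop once `fail_after`
    writes have completed, mimicking a loss of power mid-update.
    Returns True only if every unit was written."""
    for i in range(len(units)):
        if i == fail_after:
            return False  # power lost before unit i was written
        units[i] = new_ra
    return True


def survey(units):
    """Count how many units hold each distinct RA image."""
    return Counter(units)


# Failure early in the update: only one unit receives the new data.
early = [b"old"] * 5
update_all(early, b"new", fail_after=1)

# Failure near completion: all but one unit receive the new data.
late = [b"old"] * 5
update_all(late, b"new", fail_after=4)

print(survey(early))  # 4 units hold b"old", 1 holds b"new"
print(survey(late))   # 1 unit holds b"old", 4 hold b"new"
```

In a real array the images carry no "old" or "new" label, so after power returns both cases present simply as two disagreeing groups of units. Taking the majority image would select the stale data in the first case and the updated data in the second, which is why comparison alone cannot establish validity.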
Because it is vital for many of the systems which employ failure-independent data storage arrays to rely upon the data stored in the RAs, it is desirable to provide a system that can very reliably synchronize all storage units during an update to the RAs. It is also desirable to provide a system that can determine the last state of the update process if a failure occurs, so that the update can be restarted without loss of data or uncertainty as to the validity of any data in the RAs. The present invention provides such a system.