The present invention relates to maintaining mirrored copies of computer data in redundant data storage units, and in particular to determining which units contain current data when re-initializing the system.
The extensive data storage needs of modern computer systems require large capacity mass data storage devices. A common storage device is the magnetic disk drive, a complex piece of machinery containing many parts which are susceptible to failure. A typical computer system will contain several such units. As users increase their need for data storage, systems are configured with larger numbers of storage units. The failure of a single storage unit can be a very disruptive event for the system. Many systems are unable to operate until the defective unit is repaired or replaced, and the lost data restored. An increased number of storage units increases the probability that any one unit will fail, leading to system failure. At the same time, computer users are relying more and more on the consistent availability of their systems. It therefore becomes essential to find improved methods of sustaining system operations in the presence of a storage unit failure, and restoring the system to normal operating mode when the failure condition has been corrected.
One method of addressing these problems is known as "mirroring". This method involves maintaining a duplicate set of storage devices, which contains the same data as the original. The duplicate set is available to assume the task of providing data to the system should any unit in the original set fail. A system may have a duplicate set of all stored data ("fully mirrored"), or of some subset of the data ("partially mirrored"). Mirroring is becoming increasingly attractive as computer users demand improved system reliability and availability.
A user with a system containing mirrored storage will expect the utmost in reliability from that storage. Since the essence of mirroring is that if one storage unit fails, another is available to take its place, the system must necessarily be able to operate with only one of a pair of mirrored units functioning. When both units of a mirrored pair are functioning and contain current data, the units are said to be synchronized. If one of a mirrored pair of storage units fails, and the other continues to operate, the data in the failing unit will soon become obsolete. The "failure" of a unit simply means that data can no longer be read from or written to the unit. This could mean that the storage unit itself is not operating, or that some other component of the system, such as an I/O processor, is not functioning. Restoring the failing storage unit to operation may leave the data on the storage medium intact, as when a circuit card containing control logic is replaced.
Because a system may operate when the disk units of a mirrored pair are no longer synchronized, it must know the state of the mirrored pair, i.e., which unit or units contain current data. If the system is powered down for any reason, it must be able to reconstruct the state of its storage units when power is restored and the system re-initializes itself. If a failing storage unit was repaired or replaced while the system was down, upon re-initialization the operating system must be able to ascertain that data contained on the repaired or replaced unit is unreliable, and initiate a process to re-synchronize the units, which brings the data on the repaired or replaced unit current with that on the non-failing (current) unit.
One method of ascertaining the state of a mirrored pair of storage units is to store state information on both units. On re-initialization, the system reads this state information. If both units are functioning and the state information stored on each unit indicates that the units are synchronized with each other, the system determines that this is the case. In the event of a single storage unit failure while the operating system is up and running, where all other devices operate properly, the operating system will recognize that the non-failing unit alone has current data, and record this new state information on the non-failing unit. When the system is re-initialized after repair, the state information on the non-failing unit will indicate that it alone has current data, while the failing unit's state record may indicate that both units are synchronized or some unknown state. The operating system is able to determine in this situation that only the non-failing unit contains current data.
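The state-record scheme just described can be sketched as follows. This is a minimal illustration only: the names `UnitState` and `resolve_pair`, and the three-value encoding of the state record, are assumptions made for the example and are not taken from any particular system. The sketch also assumes that both units respond during re-initialization.

```python
from enum import Enum
from typing import Optional


class UnitState(Enum):
    """Hypothetical state record stored on each unit of a mirrored pair."""
    SYNCHRONIZED = "both units current"
    SELF_ONLY = "this unit alone is current"
    UNKNOWN = "unknown"


def resolve_pair(a: UnitState, b: UnitState) -> Optional[str]:
    """Return which unit(s) hold current data, given both state records.

    Returns "both", "A", "B", or None when the records do not settle it.
    """
    if a is UnitState.SYNCHRONIZED and b is UnitState.SYNCHRONIZED:
        return "both"
    # A unit that recorded SELF_ONLY kept running after its partner failed,
    # so its record is newer than whatever the partner last wrote.
    if a is UnitState.SELF_ONLY and b is not UnitState.SELF_ONLY:
        return "A"
    if b is UnitState.SELF_ONLY and a is not UnitState.SELF_ONLY:
        return "B"
    return None
```

For example, if unit B failed while the system was running, unit A would carry `SELF_ONLY` and unit B would still carry its stale `SYNCHRONIZED` record, so `resolve_pair` would report that only A is current.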
However, during re-initialization of the system, it is not uncommon for one of a mirrored pair of units to report that both units are synchronized, while the other unit does not respond. In this case, the system cannot determine the state of the mirrored pair with certainty. It is possible that both units were synchronized when the system was powered down, as claimed by the responding unit. But the same situation can arise, for example, when the "A" unit fails, is repaired without loss of its obsolete data, the system is re-initialized, and the "B" unit does not respond. Note that a failure to respond during re-initialization does not necessarily mean that a storage unit is broken. The power switch may be off, or any number of other circumstances may prevent the unit from responding, particularly where a repair action has taken place while the system was powered down.
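The ambiguity can be made concrete with a small simulation. The sketch below is illustrative only: the event names, the two-value record encoding, and the function `observed_at_restart` are all assumptions introduced for the example. It replays two different histories and shows that the system sees the same thing at restart in both.

```python
def observed_at_restart(history):
    """Replay events and return what the system reads from each unit.

    Each unit's record is "sync" or "self_only"; a unit that is down
    cannot have its record updated. Hypothetical events:
      ("fail", u)         unit u stops accepting reads and writes
      ("repair", u)       unit u works again, data and record intact
      ("no_response", u)  unit u does not answer at re-initialization
    """
    record = {"A": "sync", "B": "sync"}
    up = {"A": True, "B": True}
    silent = set()
    for event, unit in history:
        if event == "fail":
            up[unit] = False
            other = "B" if unit == "A" else "A"
            if up[other]:
                # The surviving unit notes that it alone is current.
                record[other] = "self_only"
        elif event == "repair":
            up[unit] = True
        elif event == "no_response":
            silent.add(unit)
    # A silent or failed unit yields no record at all (None).
    return {u: (None if u in silent or not up[u] else record[u]) for u in "AB"}


# History 1: both units were current; B merely fails to respond.
both_current = observed_at_restart([("no_response", "B")])

# History 2: A failed and was repaired with obsolete data; B fails to respond.
a_obsolete = observed_at_restart(
    [("fail", "A"), ("repair", "A"), ("no_response", "B")]
)
```

In both histories unit A reports that the pair is synchronized and unit B reports nothing, yet A holds current data only in the first. The record on A alone cannot distinguish the two cases.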
In the above-mentioned situations, the operating system will either be unable to make a state determination, or will guess, possibly making an incorrect state determination. If the operating system is unable to make a state determination, it will generally query the user for the correct state. Because there may be a large number of storage units, and the association of logical address to physical location will not necessarily be obvious, querying the user is a very unreliable method of determining state. Guessing the state and simply not knowing the state are both clearly undesirable for a mirrored or fault tolerant computer system, since the user does not receive the expected reliability and availability.
It is therefore an object of the present invention to provide an enhanced method and apparatus for determining the state of a mirrored pair of data storage units.
It is a further object of this invention to provide an enhanced method and apparatus for determining the state of a mirrored pair of data storage units where multiple device failures occur.
It is also an object of this invention to provide greater redundancy and reliability in information tracking the state of mirrored storage units of a data processing system.
Another object of this invention is to provide a method and apparatus for determining the state of a mirrored pair of data storage units which is less prone to human error.