This invention relates to the field of computer systems. More particularly, a system and methods are provided for recovering from the failure of a disk in a set of mirrored boot disks.
Many computing environments require high availability of storage resources. This is often accomplished by mirroring data across multiple physical storage devices (e.g., disk drives). A computer system may read from and write to a set of mirrored disks as a single logical disk drive. However, the contents that are written to the logical drive are actually written to each of the physical devices, and a read may be made from any of the devices. As a result, if one disk in the set fails, another can be used in its place, and provide the same data, without halting the system.
Traditionally, at least three devices have been included in a mirror set in order to allow the host computer system to continue operating after one device fails, or to reboot before a failed device is replaced. Systems usually require a minimum of three devices because they operate on a quorum basis, with each device having one “vote.” A majority of devices (e.g., two) must be available so that the system can identify which device(s) contain valid data or distinguish stale data (or a device holding stale data) from fresh data. If only one disk is available (e.g., two have failed), the system cannot determine whether it contains fresh or stale data, and the system may be configured to cease or prevent operation.
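By way of illustration only, the majority rule described above may be sketched as follows. The function name `has_quorum` and its parameters are hypothetical and are not drawn from any particular mirroring product; the sketch assumes one vote per device and a strict-majority requirement.

```python
def has_quorum(total_devices: int, available_devices: int) -> bool:
    """Return True when a strict majority of the mirrored devices is available.

    Assumption: each device carries exactly one vote, as in the quorum
    schemes described above.
    """
    return 2 * available_devices > total_devices


# Three-device mirror: losing one device still leaves a majority.
three_way_one_failed = has_quorum(total_devices=3, available_devices=2)

# Three-device mirror: losing two devices leaves no majority.
three_way_two_failed = has_quorum(total_devices=3, available_devices=1)

# Two-device mirror: a single surviving device is only half the votes.
two_way_one_failed = has_quorum(total_devices=2, available_devices=1)
```

Under this rule a two-device mirror loses its majority as soon as either device fails, which is why a scheme limited to two devices cannot rely on a simple majority vote alone.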
Further, existing disk-mirroring schemes suffer from the possibility that the host computer system may boot from a device having stale data. In a computer system that can boot from multiple devices (e.g., any one of a set of mirrored boot disks), the order in which the system should attempt to boot from each device is usually specified. With existing schemes, if the first device in the order fails during operation of the computer system and is not repaired or replaced before the system reboots, the system may still attempt to boot from that failed device. And, if the device exhibits only intermittent failures, the system may be able to boot from it. However, because the device was considered failed before rebooting, it may not have received all data updates or configuration changes, in which case the system will boot and operate with stale data.
Also, some computing environments that require high availability of storage or boot devices may have space limitations that make it difficult to install or accommodate more than two devices. For example, many computer systems employed in such environments are configured to contain two internal disk drives and, if more are needed, the additional drives must be attached externally or housed in a separate enclosure.
The procedures or instructions for mirroring a set of disks are often encoded in hardware (e.g., firmware), such as within a controller that controls the disks or within a subsystem that includes the disks. However, this arrangement limits the flexibility of the mirroring operations. If, for example, the instructions include the automatic performance of one or more procedures or commands, a system operator typically cannot override them in order to perform them manually (e.g., with different parameters or in a different order).
Thus, in one embodiment of the invention a system and methods are provided for facilitating the mirroring of a limited number of storage devices (e.g., two), such that the system can continue operation if one device fails and, if the system is rebooted prior to repair or replacement of the failed device, the system will not attempt to boot from the failed device, thus preventing it from operating with stale data. Further, the methods may be implemented in software, thus enabling flexibility in the operation of the mirroring and recovery from a failed device. For example, a recovery procedure may be performed manually or automatically, or selected portions of the procedure may be accomplished manually, while other portions are performed automatically.
In one embodiment of the invention, a method of recovering from the failure of a mirrored boot device includes a set of compensating actions that is performed after the failure is detected. Then, after the device is repaired or replaced, a set of reintegrating actions is performed.
Compensating actions may include removing the failed device from a set of devices from which the system may boot, and attempting to remove the failed device from the mirroring scheme. Removing a device from the mirroring scheme may require the deletion of mirror configuration or status data from the device and the updating of such data on the remaining device(s) in the scheme.
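The compensating actions described above may be sketched, purely for illustration, as shown below. The `Device` class, its `reachable` and `mirror_members` fields, and the `compensate` function are hypothetical names introduced for this sketch; a real implementation would act on actual boot settings and on-disk mirror metadata rather than on in-memory objects.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Device:
    name: str
    reachable: bool = True                       # hypothetical health flag
    mirror_members: Optional[List[str]] = None   # stand-in for mirror status data


def compensate(devices: Dict[str, Device], boot_order: List[str],
               failed_name: str) -> bool:
    """Apply compensating actions; return True if the failed device was cleaned."""
    # 1. Remove the failed device from the set of boot devices.
    if failed_name in boot_order:
        boot_order.remove(failed_name)

    # 2. Attempt to delete mirror data from the failed device. This is only
    #    an attempt: a failed device may not accept the update.
    failed = devices[failed_name]
    if failed.reachable:
        failed.mirror_members = None

    # 3. Update the mirror data held by the remaining device(s).
    for dev in devices.values():
        if dev.name != failed_name and dev.mirror_members is not None:
            dev.mirror_members = [m for m in dev.mirror_members
                                  if m != failed_name]

    return failed.mirror_members is None


devices = {
    "disk0": Device("disk0", mirror_members=["disk0", "disk1"]),
    "disk1": Device("disk1", reachable=False,
                    mirror_members=["disk0", "disk1"]),
}
boot_order = ["disk1", "disk0"]
cleaned = compensate(devices, boot_order, "disk1")
```

In this sketch the boot order is updated first, so that even if the failed device cannot be cleaned, a reboot would not select it.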
Reintegrating actions may include retrieving mirror status data from another mirrored device and recreating the necessary configuration on the repaired or replacement device. After the device joins the mirroring scheme, it may then be added to the set of devices from which the system may boot.
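The reintegrating actions may likewise be sketched as follows. The `reintegrate` function and the dictionary layout (a device name mapped to a configuration holding a `members` list, or to `None` when the device holds no mirror data) are assumptions made for illustration only.

```python
import copy
from typing import Dict, List, Optional


def reintegrate(mirror_configs: Dict[str, Optional[dict]],
                boot_order: List[str], new_name: str) -> None:
    """Bring a repaired or replacement device back into the mirror."""
    # 1. Retrieve mirror status data from another, still-mirrored device.
    source = next(cfg for name, cfg in mirror_configs.items()
                  if name != new_name and cfg is not None)

    # 2. Recreate the configuration on the repaired or replacement device.
    new_cfg = copy.deepcopy(source)
    if new_name not in new_cfg["members"]:
        new_cfg["members"].append(new_name)
    mirror_configs[new_name] = new_cfg

    # 3. Record the new member on the other device(s) in the scheme.
    for cfg in mirror_configs.values():
        if cfg is not None and new_name not in cfg["members"]:
            cfg["members"].append(new_name)

    # 4. Only after the device has joined the mirror, allow booting from it.
    if new_name not in boot_order:
        boot_order.append(new_name)


mirror_configs = {"disk0": {"members": ["disk0"]}, "disk1": None}
boot_order = ["disk0"]
reintegrate(mirror_configs, boot_order, "disk1")
```

The device is appended to the boot order last, mirroring the ordering in the text: it becomes a boot candidate only after it has rejoined the mirroring scheme. The sketch does not model the resynchronization of data contents.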
In various embodiments of the invention, different phases of a recovery procedure (e.g., detection of failure, compensating actions, reintegrating actions) may be performed manually or automatically. A system administrator or operator may, therefore, select an appropriate policy specifying the manner (e.g., automatic or manual) in which each phase should be performed.
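A per-phase policy of this kind may be represented, as one hypothetical example, by a mapping from each phase to a mode. The names `Mode`, `PHASES`, and `plan_recovery` are illustrative, and the choice to treat an unspecified phase as manual is an assumption of this sketch rather than a requirement of any embodiment.

```python
from enum import Enum
from typing import Dict, List


class Mode(Enum):
    AUTOMATIC = "automatic"
    MANUAL = "manual"


# The three phases named above, in the order they occur.
PHASES = ("detection", "compensation", "reintegration")


def plan_recovery(policy: Dict[str, Mode]) -> List[str]:
    """Describe how each phase will be carried out under the given policy."""
    steps = []
    for phase in PHASES:
        # Assumption: a phase missing from the policy is left to the operator.
        mode = policy.get(phase, Mode.MANUAL)
        if mode is Mode.AUTOMATIC:
            steps.append(f"run {phase} automatically")
        else:
            steps.append(f"wait for operator to perform {phase}")
    return steps


# Example policy: detect and compensate automatically, reintegrate by hand.
steps = plan_recovery({
    "detection": Mode.AUTOMATIC,
    "compensation": Mode.AUTOMATIC,
})
```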