Maintaining a high level of reliability in data storage systems is a major concern for system administrators. As organizations increasingly rely upon distributed computing systems, the need for reliable network components increases as well. Distributed systems are especially vulnerable to failures because of the interdependency of their processing nodes. Consequently, organizations adopt fault tolerance measures not only to recover a failed computing node, but also to minimize the system downtime and lost revenue that result when a node fails. This is especially true of critical network nodes that require a high level of availability, such as file servers, application servers, Web servers, and electronic mail (e-mail) servers.
It is especially important to provide redundant data storage devices to prevent the loss of irreplaceable data in the event of device failure. For example, failure of the storage device upon which the operating system is maintained can render the entire system inoperable. Additionally, data corruption often occurs when the drive upon which the operating system is stored fails.
One method for providing redundant data storage is to use a Redundant Array of Independent Disks (RAID). RAID is a way of combining multiple disk drives into a single entity to improve performance and/or reliability and, in some cases, to provide data recovery if a single drive in the array fails. There are a variety of configurations for combining the disks into a single entity, which are commonly referred to as RAID levels. For example, RAID level 1, which is also referred to as disk mirroring, stores duplicate sets of identical data on multiple disk drives. RAID can be implemented as either a hardware storage solution (e.g., using a special disk controller) or a software storage solution. RAID hardware usually comprises a RAID disk controller, such as an adapter card, to which the data cables of the disk drives are connected.
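The disk mirroring behavior of RAID level 1 can be illustrated with a minimal in-memory sketch. The class name MirroredArray and its methods are hypothetical names chosen for illustration only; they do not correspond to any particular RAID implementation, and a real implementation operates on block devices rather than Python dictionaries.

```python
class MirroredArray:
    """Toy in-memory model of RAID level 1 (disk mirroring).

    Hypothetical illustration only: each "disk" is a dict that maps a
    block number to the bytes stored in that block.
    """

    def __init__(self, num_disks=2):
        if num_disks < 2:
            raise ValueError("mirroring needs at least two disks")
        self.disks = [{} for _ in range(num_disks)]
        self.failed = set()  # indices of disks that have failed

    def write(self, block, data):
        # Every write is duplicated onto each working disk.
        for index, disk in enumerate(self.disks):
            if index not in self.failed:
                disk[block] = data

    def fail(self, index):
        # Simulate a device failure: the disk's contents become unavailable.
        self.failed.add(index)
        self.disks[index].clear()

    def read(self, block):
        # Any surviving mirror that holds the block can serve the read.
        for index, disk in enumerate(self.disks):
            if index not in self.failed and block in disk:
                return disk[block]
        raise OSError("block unavailable on every mirror")


array = MirroredArray(num_disks=2)
array.write(0, b"boot sector")
array.fail(0)               # the first drive fails
recovered = array.read(0)   # the data is still served by the second drive
```

Because each block is written to every working drive, the read after the simulated failure is served by the surviving mirror. This is the data recovery property described above for the case in which a single drive in the array fails.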
Software RAID is usually implemented as a set of modules and management utilities that provide RAID functionality without requiring extra hardware. For example, the RAID software can be implemented as a kernel-level component layer which resides between the computer's file system and the lower-level disk drivers. Software RAID implementations tend to be more flexible than hardware implementations because more configuration options are available to the system administrator.
However, configuration and administration of RAID is a complex task which is beyond the skill level of many system administrators. For example, a system administrator executes a series of manual operations, such as modifying startup scripts and file system configurations, in order to configure the RAID storage. Because these operations are performed manually, errors are common, and if one of the operations is executed incorrectly, the system may not start up again after being shut down. Additionally, the RAID configuration cannot presently be performed while installing operating system software because the file management software is not available until after the operating system is installed. This leads to increased administrator workload because the RAID must be configured after the operating system installation.
In the interest of making computer software more user-friendly, software is now relied upon to provide system administration solutions which were, at one time, performed manually by highly skilled system administrators. As a result, less technically proficient people are now often employed to perform system administration tasks. This allows organizations to conserve resources by employing less skilled, and therefore less expensive, personnel to perform system administration duties. Alternatively, larger organizations may save resources by employing one or two highly skilled system administrators and a greater number of less skilled assistants who perform more mundane tasks such as system backups and software installation.
Therefore, in order to perform a relatively complex operation such as configuring a RAID system, a more highly skilled system administrator is needed either to configure the system manually or to coach a less skilled assistant through the process. Either way, the task consumes the time of the organization's most skilled personnel, which is regarded as a waste of resources.
To this end, it would be advantageous to provide a straightforward recovery system to handle cases in which the system disk of the data storage system fails or becomes corrupted. Using such a recovery system, technicians could recover from system disk failures without needing to consult experienced system administrators.