In response to organizations' increasing reliance on network-based server computers and the increasing cost of computer downtime, manufacturers developed fault-tolerant or redundant systems designed to reduce downtime. Such systems typically use memory backup and redundant components in attempting to provide continuous system operation. Many redundant systems can be found in the prior art.
For example, U.S. Pat. No. 4,607,365 to Greig, et al., discloses a system that automatically selects secondary components as needed to compensate for faults in the system. Similarly, U.S. Pat. No. 4,727,516 to Yoshida, et al., discloses redundant memory arrays, and U.S. Pat. Nos. 4,484,275 and 4,378,588 to Katzman et al. teach multiple processors. While those redundant computer systems may prevent a complete server failure in some cases, they do not address many causes of computer downtime.
Studies show that a significant percentage of network server downtime is caused by transient faults in the I/O subsystem. These faults may be due, for example, to adapter card firmware or hardware that does not properly handle concurrent errors, and they often cause servers to crash or hang. Diagnosing intermittent errors can be a frustrating and time-consuming process. The result is hours of downtime per failure, while a system administrator discovers the failure, takes some action, and manually reboots the server. The computer systems of the prior art do not provide a computer system manager with the tools needed to keep computers running while failed parts are removed and repaired or while upgrades are performed.
Moreover, even if hardware components of a server computer can withstand being added or removed without shutting down the server computer or making it unavailable, a system manager could not simply remove a piece of hardware and plug in another piece without causing immense disruption of the software. Such a physical swap would cause hundreds or thousands of error conditions every few seconds, likely resulting in corruption of data and possibly even system-wide software failure. Low-level software modules, particularly device drivers, must be carefully administered during any change to the hardware components they service. Making matters more difficult, device drivers are among the most complicated and least understood classes of software; few of them are alike, yet nearly all have arbitrary and arcane command sets.
Without some tool to provide guidance and uniformity, network administrators could only add components to, or remove components from, an operating computer by issuing precise sequences of arcane, error-prone commands having difficult-to-remember, numeric-range parameter values, interspersed with a variety of hardware manipulations, with little or no feedback during the entire process to indicate successful progress. Moreover, completely different sets of commands and parameter values may be required to perform hot plug operations on differing components, or on similar components from differing vendors. Both the high possibility of making mistakes and the steep learning curve make manual performance of hot plug operations impractical at best.
Industry focus and cooperation on computer system management have prompted the development of standards for performing routine management operations on computers. Today's standards generally provide databases containing a wide variety of management information needed to carry out many computer system management tasks. While the standard practices used to manage computers are becoming more uniform and effective as growing numbers of computer system managers learn, implement, and improve these standards, there has been little, if any, focus on adding components to, or removing components from, a running computer.