The spectacular proliferation of electronic devices, particularly computers, in modern society, both in numbers and complexity, demands that such devices satisfy ever increasing standards of reliability and serviceability to avoid degeneration into chaos. In the early days of the computer industry, relatively high failure rates and corresponding "down time" when the computer system was unavailable to perform useful work were accepted as the norm. As the industry has matured, computers have become more reliable, and users have come to depend on these systems being available when needed. This dependence has become so profound that, for many businesses, the mere unavailability of the computer system for any appreciable length of time can cause significant commercial injury.
In the early days of the computer industry, a computer component was replaced by shutting off power to the system, replacing the component, and re-powering the system. This is, of course, a logical way to fix a toaster, but the complexity of modern computers makes this undesirable. It is not possible to simply shut off power and turn it back on as one would a light bulb. A computer system's state and data must be saved when it is powered down. Its software must be re-loaded, and its state restored, when it is re-powered. For a large modern computer system, these operations can take a very significant amount of time, during which the system is unavailable to its customers.
Computer manufacturers are well aware of the dependence of their customers, and have accordingly devoted considerable attention to these problems. As a result, many modern computer systems have some degree of fault tolerance, and are capable of concurrent maintenance. Fault tolerance means simply that a single component of the computer system may fail without bringing the entire system down, although in some cases performance of the system or some other characteristic may be adversely affected. Concurrent maintenance is the capability to repair or replace some component of a computer system without shutting down the entire system, i.e., the system can continue to operate and perform useful work (although possibly in a diminished capacity) while the repair is being performed. A system which is both fault tolerant and capable of concurrent maintenance can, in theory, be kept running 24 hours a day for an indefinite length of time. In fact, few, if any, systems achieve this level of reliability with respect to every component which may possibly fail.
One example of this type of fault tolerance is an array of storage devices known as a "RAID", i.e., redundant array of independent disks. A RAID stores data on multiple storage devices in a redundant fashion, such that any data can be recovered in the event of failure of any single storage device in the array. RAIDs are usually constructed with rotating magnetic hard disk drive storage devices, but may be constructed with other storage devices such as optical drives, tape drives, etc. Various types of RAIDs providing different levels of redundancy or other operating characteristics are described in a paper entitled "A Case for Redundant Arrays of Inexpensive Disks (RAID)", by Patterson, Gibson & Katz, presented at the ACM SIGMOD conference (June, 1988). Patterson, et al., classify five types of RAID, designated levels 1 through 5. The Patterson nomenclature has become standard in the industry. RAIDs have proliferated to the point where an industry trade group called the RAID Advisory Board has attempted to establish standards for RAID characteristics. Further information regarding RAIDs can be found in The RAIDbook, A Source Book for Disk Array Technology, published by the RAID Advisory Board (5th Ed. February 1996).
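By way of illustration only, the recovery of data following failure of a single storage device may be sketched using byte-wise exclusive-OR parity, a technique employed in several of the parity-based RAID levels. The sketch below is a simplified assumption and does not represent any particular RAID product or level; the function names xor_blocks, make_stripe and rebuild are hypothetical and chosen for this example, and each "device" is modeled as a block of bytes of equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a sequence of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Hypothetical helper: append one parity block to the data blocks.

    Each element of the returned list stands for the block held by one
    storage device in the array.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, failed_index):
    """Reconstruct the block of a single failed device from the survivors.

    Because the XOR of all blocks in a stripe (data plus parity) is zero,
    the XOR of the surviving blocks equals the missing block.
    """
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


# Three data devices plus one parity device (illustrative values only).
stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])
recovered = rebuild(stripe, failed_index=1)
```

In this sketch, the block held by any one device, whether a data block or the parity block, can be regenerated from the remaining blocks, which corresponds to the single-device failure tolerance described above. Loss of two devices in the same stripe is not recoverable under this simple scheme.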
Frequently, a RAID is manufactured and marketed as a stand-alone storage subsystem, which is housed in its own cabinet with its own power supply and supporting hardware and software, and which communicates through a standard communications interface with a host computer system. Since it is desirable to make data available to the host system at all times, even if a single storage device in the RAID subsystem fails, the subsystem will frequently have its own on-board data recovery capability, which may include temporary spare drives for storage of recovered data. The RAID subsystem may additionally have redundant power supplies or other redundant components.
Electronic systems frequently use backplane circuit cards for distribution of power, data signals, and/or mounting of active or passive circuit elements and connectors. Such a card typically contains multiple parallel layers for embedded circuit patterns, grounds, or power distribution. Pluggable connectors couple the backplane to other modules which make up the electronic system, such as power supply modules, storage devices, or logic cards. Often, such a backplane card acts primarily as a distribution medium, i.e., it conveys power and/or data signals from one module to another, and contains relatively few functional components attached directly to the backplane itself. However, the backplane may contain functional components.
Because the backplane typically contains no moving parts and relatively few functional components, the probability of backplane failure is normally significantly less than the probability of failure of a disk drive storage device or a power supply. Accordingly, system designers have typically ignored the possibility of backplane failure, concentrating instead on such matters as the recovery of data from a nonfunctioning storage device. However, it is possible for a backplane to fail, albeit rarely. Because the backplane often forms the center of a web of communications among other modules, failure of the backplane, when it does occur, may be catastrophic. Systems are not typically designed to continue operation in the face of such a failure. Often, the system must be shut down, and numerous (perhaps all) modules in the system must be removed in order to replace a defective backplane. All of this takes precious time, time during which the system will be unavailable to its users.
As modern computer systems improve in sophistication and reliability, and users rely ever more heavily on the continuing availability of their systems, it is increasingly important to provide improved redundancy and concurrent maintenance capability in computer systems.