The spectacular proliferation of electronic devices, particularly computers, in modern society, both in numbers and complexity, demands that such devices satisfy ever-increasing standards of reliability and serviceability to avoid degeneration into chaos. In the early days of the computer industry, relatively high failure rates and corresponding "down time", when the computer system was unavailable to perform useful work, were accepted as the norm. As the industry has matured, computers have become more reliable, and users have come to depend on these systems being available when needed. This dependence has become so profound that, for many businesses, the mere unavailability of the computer system for any appreciable length of time can cause significant commercial injury.
Electronic devices generally, and computers in particular, tend to be vulnerable to "transients", i.e., sudden large voltage and/or current impulses of typically short duration. These can cause data loss by corrupting any of numerous signals being transmitted among various components of a computer system, or can cause changes of state in memory cells, or can even cause hardware damage. Transients arise from a variety of environmental sources. Some of these transients can be reduced with proper system design techniques, although it is impossible to eliminate all transients. Likewise, electronic systems can be made less susceptible to transient-induced failures with proper design techniques, but it is impossible to eliminate all vulnerability to such failures. Of particular interest herein, transients and failures may occur as a result of attempts to concurrently use and maintain a system.
In the early days of the computer industry, a computer component was replaced by shutting off power to the system, replacing the component, and re-powering the system. This is, of course, a logical way to fix a toaster, but the complexity of a modern computer, and its importance to its users, make this undesirable. It is not possible to simply shut off power and turn it back on as one would a light bulb. A computer system's state and data must be saved when it is powered down. Its software must be re-loaded, and its state restored, when it is re-powered. For a large modern computer system, these operations can take a very significant amount of time, during which the system is unavailable to its users.
Computer manufacturers are well aware of the dependence of their customers, and have accordingly devoted considerable attention to these problems. As a result, many modern computer systems have some degree of fault tolerance, and are capable of concurrent maintenance. Fault tolerance means simply that a single component of the computer system may fail without bringing the entire system down, although in some cases performance of the system or some other characteristic may be adversely affected. Concurrent maintenance is the capability to repair or replace some component of a computer system without shutting down the entire system, i.e., the system can continue to operate and perform useful work (although possibly in a diminished capacity) while the repair is being performed. A system which is both fault tolerant and capable of concurrent maintenance can, in theory, be kept running 24 hours a day for an indefinite length of time. In fact, few, if any, systems achieve this level of reliability with respect to every component which may possibly fail.
One example of this type of fault tolerance is an array of storage devices known as a "RAID", i.e. redundant array of independent disks. A RAID stores data on multiple storage devices in a redundant fashion, such that any data can be recovered in the event of failure of any single storage device in the array. RAIDs are usually constructed with rotating magnetic hard disk drive storage devices, but may be constructed with other storage devices such as optical drives, tape drives, etc. Various types of RAIDs providing different levels of redundancy or other operating characteristics are described in a paper entitled, "A Case for Redundant Arrays of Inexpensive Disks (RAID)", by Patterson, Gibson & Katz, presented at the ACM SIGMOD conference (June, 1988). Patterson, et al., classify five types of RAID, designated levels 1 through 5. The Patterson nomenclature has become standard in the industry. RAIDs have proliferated to the point where an industry trade group called the RAID Advisory Board has attempted to establish standards for RAID characteristics. Further information regarding RAIDs can be found in The RAIDbook, A Source Book for Disk Array Technology, published by the RAID Advisory Board (5th Ed. February 1996).
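The recovery property described above can be illustrated with a minimal sketch of XOR parity, one common way in which parity-based RAID levels achieve single-device redundancy. The function names (`xor_blocks`, `make_stripe`, `recover`) and the sample data are hypothetical and chosen only for illustration; they do not represent any particular RAID implementation, and real arrays add striping, rotation of parity, and error handling not shown here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length byte blocks."""
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Append one parity block to the data blocks (hypothetical layout)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, failed_index):
    """Rebuild the block at failed_index from all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


if __name__ == "__main__":
    # Three data "devices" plus one parity "device" (illustrative data).
    stripe = make_stripe([b"abcd", b"efgh", b"ijkl"])
    # Losing any single block, data or parity, is recoverable.
    for i, original in enumerate(stripe):
        assert recover(stripe, i) == original
```

Because XOR is its own inverse, the same operation both generates the parity block and reconstructs a missing block, which is why the loss of any one device in the stripe can be tolerated.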
Frequently, a RAID is manufactured and marketed as a stand-alone storage subsystem, which is housed in its own cabinet with its own power supply and supporting hardware and software, and which communicates through a standard communications interface with a host computer system. Since it is desirable to make data available to the host system at all times, even if a single storage device in the RAID subsystem fails, the subsystem will frequently have its own on-board data recovery capability, which may include temporary spare drives for storage of recovered data. The RAID subsystem may additionally have redundant power supplies or other redundant components.
Whatever the level of redundancy in a RAID, other subsystem or computer system, it may sometimes be necessary to replace a component. In this case, it is frequently desirable to keep the system or subsystem operating while the component replacement is being performed. Some modern systems have a "hot plugging" capability, whereby a replaced component plugs into a functioning system and immediately receives power from the system, thereby becoming powered-on and potentially operational the instant it is plugged in.
Hot plugging is not a simple matter. If an arbitrary component is replaced by means of hot plugging in an operating computer system, the act of removing the old component and inserting the new one can cause serious problems. Arcing, current spikes, and induced transients may potentially cause loss of data or even physical damage to system components. Logic signals may be in undefined states. Moreover, because the component's normal functions must be stopped or restarted, removal of one component or installation of a replacement component may cause additional problems for system components physically or logically adjacent to the removed or inserted component. Therefore, systems which support hot plugging must be specially designed to handle these problems.
Components may be hot plugged into a backplane power and signal distribution card. Such a backplane contains attachment couplings for one or more other components, which may be cards, or other modules such as disk drives, power supplies, etc. One of the functions of the backplane card (in fact, often the only function) is the distribution of power, data or other signals among the various modules attached to it.
If a powered-on component is suddenly removed from or inserted into a backplane card, the power lines on the backplane will experience a sudden change in current distribution and/or load. The change in supply current not only affects the power lines themselves, but can also induce noise on adjacent data lines due to electromagnetic effects. Such phenomena can be transmitted to other components attached to the same backplane through power or other lines, potentially damaging the components or causing data loss.
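As a rough illustration of why a sudden change in current matters, the voltage developed across a conductor having parasitic inductance is proportional to the rate of change of the current through it. The symbols and numerical values below are illustrative assumptions only and are not taken from any particular system.

```latex
v(t) = L\,\frac{di}{dt},
\qquad
v \approx L\,\frac{\Delta i}{\Delta t}
  = (10\ \mathrm{nH}) \cdot \frac{1\ \mathrm{A}}{10\ \mathrm{ns}}
  = 1\ \mathrm{V}
```

Under these assumed values, a load step of 1 A occurring in 10 ns across only 10 nH of line inductance produces a transient of roughly 1 V, which may be a substantial fraction of a logic supply voltage. Noise induced on an adjacent line follows the same form, with the mutual inductance between the two lines taking the place of L.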
Several options are available in the prior art to support hot plugging components into a backplane card of a computer system or subsystem. It is possible, for example, to design the system so that selected components and bus connections can be logically isolated and gradually powered off while the system continues to operate. This requires special switching circuitry which adds to the cost of the system, and typically requires correct operator intervention (i.e., the correct unit(s) must be powered off). It is further possible to design the system with special circuits and logic which detect the impending removal of a component, and take similar action automatically, but this again adds to the cost of the system. Additionally, components which plug into the backplane might be designed to tolerate larger voltage spikes, but this places additional constraints on the design of components and generally adds to the cost. It is alternatively possible to reduce induced noise by distributing power on dedicated power cables separate from a backplane carrying data or other signals, but this may increase cost, require additional space, or interfere with other design considerations.
While concurrent maintenance is a major consideration, electronic systems comprising pluggable modules in a backplane circuit card may also be subject to transients from other sources, such as the catastrophic failure of one of the modules. It is desirable to reduce the vulnerability of pluggable component systems to such transients without the attendant cost associated with prior art techniques.