The invention relates generally to fault tolerant computer systems and, more particularly, to mechanisms for fault tolerant access to system-critical devices on peripheral busses.
Fault-tolerant computer systems are employed in situations and environments that demand high reliability and minimal downtime. Such computer systems may be employed in the tracking of financial markets, the control and routing of telecommunications and in other mission-critical functions such as air traffic control.
A common technique for incorporating fault-tolerance into a computer system is to provide a degree of redundancy to various components. In other words, important components are often paired with one or more backup components of the same type. As such, two or more components may operate in a so-called lockstep mode in which each component performs the same task at the same time, while only one is typically called upon for delivery of information. Where data collisions, race conditions and other complications may limit the use of lockstep architecture, redundant components may be employed in a failover mode. In failover mode, one component is selected as a primary component that operates under normal circumstances. If a failure in the primary component is detected, then the primary component is bypassed and the secondary (or tertiary) redundant component is brought on line. A variety of initialization and switchover techniques are employed to make a transition from one component to another during runtime of the computer system. A primary goal of these techniques is to minimize downtime and corresponding loss of function and/or data.
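The failover mode described above can be illustrated with a minimal sketch. It is not taken from any particular system: the names `Component`, `FailoverGroup`, and `check_and_failover` are hypothetical, and failure detection is reduced to a simple `healthy` flag for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    """Hypothetical redundant component; 'healthy' stands in for real failure detection."""
    name: str
    healthy: bool = True


@dataclass
class FailoverGroup:
    """Ordered set of redundant components; index 0 is the primary."""
    components: List[Component]
    active_index: int = 0

    def active(self) -> Component:
        return self.components[self.active_index]

    def check_and_failover(self) -> Component:
        """Bypass a failed active component and bring the next healthy one on line."""
        if self.active().healthy:
            return self.active()
        for index, component in enumerate(self.components):
            if component.healthy:
                self.active_index = index
                return component
        raise RuntimeError("no healthy component available")


group = FailoverGroup(
    [Component("primary"), Component("secondary"), Component("tertiary")]
)
group.components[0].healthy = False  # simulate a detected failure in the primary
print(group.check_and_failover().name)
```

In this sketch the switchover is a single index update; a real system would also need the initialization and state-transfer steps mentioned above.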
Fault-tolerant computer systems are often costly to implement since many commercially available components are not specifically designed for use in redundant systems. It is desirable to adapt conventional components and their built-in architecture whenever possible.
To reduce downtime, fault tolerant systems are designed to include redundancy for connections and operations that would otherwise be single points of failure for the system. Accordingly, the fault tolerant system may include redundant CPUs and storage devices. Certain devices on peripheral busses may also be single points of failure for the system. In a system that uses a Windows operating system, for example, the loss of a controller for peripheral busses and/or a video controller results in a system failure.
Devices such as a keyboard, mouse, monitor, floppy drives, CD ROM drives, and so forth typically communicate with a system I/O bus, such as a PCI bus, over a variety of peripheral busses such as a USB and an ISA/IDE bus. The various peripheral busses connect to the PCI bus through a peripheral bus controller, such as an Intel PCI to ISA/IDE Xcelerator. The Windows operating systems require that the peripheral bus controller plug into location 0 on the system PCI bus, or what is commonly referred to as “PCI bus 0.”
A PCI-to-PCI bridge may be used to provide additional slots on a PCI bus. A bridge for use with PCI bus 0, for example, provides slots for the system-critical peripheral bus controller and video controller, and various other devices. The PCI-to-PCI bridge is then a single point of failure, as are the peripheral bus controller and the video controller. While it is desirable to provide fault tolerance by including redundant paths to the peripheral devices, through redundant PCI-to-PCI bridges and associated peripheral bus controllers and video controllers, the operating system is not equipped to handle them. The operating system requires that all of the peripheral bus controllers connect to PCI bus 0, and redundant controllers alone thus cannot provide the desired, fully redundant paths to the peripheral devices. Accordingly, what is needed is a mechanism to achieve such redundancy within the confines of the commercially available operating systems.
The inventive system essentially hides redundant paths to the peripheral devices from the operating system, by reporting a single “virtual” path to the peripheral busses over PCI bus 0. The virtual path includes at least a virtual peripheral bus controller and a virtual video controller. The system also tells the operating system that the real controllers are on another PCI bus on an opposite side of a PCI-to-PCI bridge connected also to PCI bus 0. An I/O system manager selects one of the actual paths, which may, but need not, be connected to PCI bus 0, to handle communications with the peripheral devices.
The I/O system manager maintains the controllers on the unselected path in an off-line or standby mode, in case of a failure of one or more of the controllers on the selected path. If a failure occurs, the I/O system manager performs a fail-over operation to change the selection of controllers, as discussed in more detail below. The operating system does not respond to the controller failure by declaring a system failure, however, because the operating system continues to look to the virtual path, with its virtual controllers, as a valid path to the peripheral devices. Accordingly, the fail-over operation does not adversely affect the overall operations of the system.
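The relationship between the virtual path seen by the operating system and the actual paths selected by the I/O system manager can be sketched as follows. This is an illustrative model only, not the implementation described in this document: the class names `ControllerPath` and `IOSystemManager`, the method names, and the dictionary returned by `virtual_path` are all assumptions, and controller health is reduced to boolean flags.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ControllerPath:
    """Hypothetical actual path: a bridge with its peripheral bus and video controllers."""
    name: str
    bus_controller_ok: bool = True
    video_controller_ok: bool = True
    online: bool = False

    def healthy(self) -> bool:
        return self.bus_controller_ok and self.video_controller_ok


class IOSystemManager:
    """Selects one actual path and keeps the others in standby (illustrative only)."""

    def __init__(self, paths: List[ControllerPath]) -> None:
        if not paths:
            raise ValueError("at least one actual path is required")
        self.paths = paths
        self.selected = paths[0]
        self.selected.online = True

    def virtual_path(self) -> Dict[str, str]:
        # What the operating system is told; it does not depend on the selection.
        return {
            "bus": "PCI bus 0",
            "peripheral_bus_controller": "virtual",
            "video_controller": "virtual",
        }

    def handle_failure(self) -> ControllerPath:
        """Fail over to a healthy standby path if the selected path has failed."""
        if self.selected.healthy():
            return self.selected
        for path in self.paths:
            if path is not self.selected and path.healthy():
                self.selected.online = False
                path.online = True
                self.selected = path
                return path
        raise RuntimeError("no healthy path to the peripheral devices")


manager = IOSystemManager([ControllerPath("path A"), ControllerPath("path B")])
view_before = manager.virtual_path()
manager.paths[0].video_controller_ok = False  # simulate a controller failure
manager.handle_failure()
view_after = manager.virtual_path()
print(manager.selected.name, view_before == view_after)
```

The point of the sketch is that `virtual_path` returns the same description before and after the fail-over, which is why the operating system in this model has no reason to declare a system failure.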
As discussed in more detail below, the system also allows hot swapping of PCI bridges, and associated devices on the PCI bus and the peripheral busses.