The technical field is computer systems having redundant subsystems and components.
Current multi-processor computer systems are typically supplied with one or more redundant or spare devices that can be used in the event of failure of the primary device. For example, a computer system may come equipped with two ethernet cards so that upon failure of the first ethernet card, the second (spare) card can be used with no, or minimal, computer downtime. To provide adequate redundancy, these current computer systems may include spare devices for each of multiple partitions into which the computer system is divided. Thus, a computer system with three partitions may include one primary and one spare device for each of the three partitions. This arrangement of primary and spare devices adds to the cost of the computer system and places additional space constraints on the computer system layout.
A method and a mechanism are described herein that are capable of generating a virtual hardware path to allow transactions addressed to a failed computer system component to be claimed by a substitute computer system component. In an embodiment, the components are input/output (I/O) devices, such as ethernet cards, or other I/O devices. However, the method and mechanism may be adapted for use by computer components other than I/O devices.
The original and the substitute components are preferably of a same type. The substitute component may be currently used for other computer system functions (i.e., the substitute component is active in the computer system). Alternatively, the substitute component may be inactive, such as an installed spare, for example.
In an embodiment, hardware is used to make a path to/from a failing or failed component look identical to a path to/from a substitute component. The same physical path to/from the failed component is maintained, but a virtual path is established for the substitute component. Software may then be used to suspend activities to/from the failed component, reconstruct a state of the failed component in the substitute component, and resume operation on the substitute component. Then, all transactions or activities for the failed component will go to the substitute component. To ensure this transfer, address translation mapping is invoked using a set of range registers. When a processor generates an address that goes to a component, the address is checked against the range registers to determine the component to which the transaction should be routed. If the transaction needs to be rerouted because of a component failure, a map table will indicate the reroute destination address pointed to by the range registers.
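The address check described above can be illustrated with a minimal software sketch. It is not the hardware mechanism itself, only a model of the lookup under stated assumptions: the names RangeRegister, decode, and route, the inclusive base/limit encoding of a range register, and the representation of the map table as a dictionary from original module identification to substitute module identification are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class RangeRegister:
    """Hypothetical range register: addresses in [base, limit] go to module_id."""
    base: int
    limit: int
    module_id: int


def decode(address: int, range_registers: List[RangeRegister]) -> Optional[int]:
    """Return the module that originally owns the address, or None if unclaimed."""
    for reg in range_registers:
        if reg.base <= address <= reg.limit:
            return reg.module_id
    return None


def route(address: int,
          range_registers: List[RangeRegister],
          reroute_map: Dict[int, int]) -> Optional[int]:
    """Return the module that should claim the transaction.

    reroute_map is an assumed stand-in for the map table: it holds an entry
    only for original modules that have failed and been remapped.
    """
    original = decode(address, range_registers)
    if original is None:
        return None
    # No entry means no failure was recorded, so the original module claims it.
    return reroute_map.get(original, original)


# Illustrative values only: two modules with adjacent address ranges.
registers = [
    RangeRegister(base=0x1000, limit=0x1FFF, module_id=1),
    RangeRegister(base=0x2000, limit=0x2FFF, module_id=2),
]
reroute_map: Dict[int, int] = {}

before_failure = route(0x1800, registers, reroute_map)  # claimed by module 1
reroute_map[1] = 2                                      # module 1 fails; module 2 substitutes
after_failure = route(0x1800, registers, reroute_map)   # same address, now claimed by module 2
```

In this sketch the range registers themselves are never modified on failure. Only the map entry changes, which mirrors the statement that the same physical path is maintained while a virtual path is established for the substitute component.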
In particular, identification information for the original (failed) and the substitute components may be stored in a reroute module identification block, and the identification information may be related, such as by use of the map table, for example, so that when an original component fails, the appropriate substitute component may be identified by reference to the reroute module identification block. The substitute component includes programming used to claim transactions addressed to the failed component, and to copy a state of the failed component to the substitute component.
In an embodiment, a virtual input/output (I/O) interconnect mechanism for use in a computer system having a plurality of I/O devices and a plurality of processing units, where I/O devices and processing units are coupled by one or more bridge units, includes an address decode block having a multiplexer that multiplexes inputs to produce an address, where the address relates to a transaction related to a processor unit, a range register decoder that receives the address and provides a destination address of a module to receive the transaction related to the address, and a reroute module identification block that receives the destination address. The reroute module identification block includes an original module identification that provides an address of one or more original modules in the computer system, and a remapped module identification that provides logical destination module identifications of substitute modules in the computer system, where a substitute module replaces functions of an original module in the computer system.
In an embodiment, a method for substituting operating components for failed components in a computer system includes the steps of detecting a failed component, and determining if a component of a same type as the failed component exists. If a substitute component exists, the method includes suspending all activities, such as direct memory access, going to or coming from the failed component, copying a state of the failed component to the substitute component, deconfiguring the failed component, updating reroute module identification to remap a hardware path for the failed component to the substitute component, updating configuration registers of the substitute component, and resuming activities, such as direct memory access, addressed to the failed component, which are then claimed by the substitute component. If a substitute component does not exist, the method invokes an error handler.
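The sequence of steps in the method above can be sketched in software as follows. This is a simplified model under stated assumptions, not the claimed implementation: the Component class, its fields, the suspended set, the adopted_state dictionary, the NoSubstituteError exception (standing in for the error handler), and the fail_over function are all hypothetical names introduced for illustration. Detection of the failure is assumed to have already occurred.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


class NoSubstituteError(Exception):
    """Stands in for the error handler invoked when no substitute exists."""


@dataclass
class Component:
    """Hypothetical model of a component such as an I/O device."""
    module_id: int
    kind: str                      # e.g. "ethernet"
    healthy: bool = True
    configured: bool = True
    state: dict = field(default_factory=dict)
    # State copied in from failed components, keyed by their module_id.
    adopted_state: Dict[int, dict] = field(default_factory=dict)


def fail_over(failed: Component,
              components: List[Component],
              reroute_map: Dict[int, int],
              suspended: Set[int]) -> Component:
    """Substitute a healthy component of the same type for a failed one."""
    # Determine whether a component of the same type exists.
    substitute = next(
        (c for c in components
         if c is not failed and c.kind == failed.kind and c.healthy),
        None,
    )
    if substitute is None:
        raise NoSubstituteError(f"no substitute for module {failed.module_id}")

    # Suspend activities (such as DMA) going to or coming from the failed component.
    suspended.add(failed.module_id)
    # Copy the state of the failed component to the substitute component.
    substitute.adopted_state[failed.module_id] = dict(failed.state)
    # Deconfigure the failed component.
    failed.configured = False
    # Update the reroute module identification: original id -> substitute id.
    reroute_map[failed.module_id] = substitute.module_id
    # Update the substitute's configuration (modeled here as a single flag).
    substitute.configured = True
    # Resume activities; traffic for the failed module is now claimed by the substitute.
    suspended.discard(failed.module_id)
    return substitute


# Illustrative usage with made-up state values.
card_a = Component(module_id=1, kind="ethernet", state={"rx_ring": 3})
card_b = Component(module_id=2, kind="ethernet")
card_a.healthy = False  # failure assumed to have been detected elsewhere

reroute_map: Dict[int, int] = {}
suspended: Set[int] = set()
chosen = fail_over(card_a, [card_a, card_b], reroute_map, suspended)
```

The substitute in this sketch may be an active component or an installed spare; the model does not distinguish between the two, and it keeps the adopted state separate from the substitute's own state so that an active substitute does not lose its existing context.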