In many computer applications, a high degree of reliability is essential. A known way of achieving this is to use a fault tolerant computer, such as that taught in U.S. Pat. No. 6,167,477 (referred to hereinafter as the '477 patent). The system described in the '477 patent has two (or more) processing sets that can be configured to perform identical operations. The processing sets are joined by a bridge (sometimes known as a voting bridge), which in turn is linked to a bus. Various input/output devices (e.g. disk storage, network communications, etc.) are then attached to the bus.
The design philosophy of this fault tolerant system is to try to prevent any error that might occur within one of the processing sets from propagating into the outside world, in other words from the bridge onto the bus, and from there to the various external devices. To achieve this, the outputs from the processing sets are compared at the bridge. Any discrepancy between the outputs from the different processing sets is detected and serves as an indication of potential error.
At this point, the system generally tries to determine which of the processing sets is responsible for the error. There are various techniques for doing this. For example, if one processing set is trying to perform an illegal operation, such as writing to a non-existent address on the bus, it is assumed that this is the processing set that is malfunctioning. Alternatively, in systems having more than two processing sets, it can be sensible to go with the verdict of the majority of the processing sets. In more complicated situations, the bridge can ask the processing sets to run various diagnostic tests on themselves, in order to try to pinpoint the source of the error.
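The first two techniques described above can be illustrated with a minimal sketch. It is not taken from the '477 patent: the names `BusWrite`, `VALID_ADDRESSES` and `find_suspects`, and the address range chosen, are assumptions made purely for illustration, and the diagnostic self-test fallback is left out.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical range of bus addresses that actually exist (assumption).
VALID_ADDRESSES = range(0x1000, 0x2000)


@dataclass(frozen=True)
class BusWrite:
    """One output presented to the bridge by a processing set."""
    address: int
    data: int


def find_suspects(outputs: dict[str, BusWrite]) -> Optional[set[str]]:
    """Return the processing sets suspected of malfunctioning.

    An empty set means all outputs agree. None means the bridge cannot
    decide from the outputs alone, so diagnostics would be needed.
    """
    if len(set(outputs.values())) == 1:
        return set()  # no discrepancy detected

    # Technique 1: a set attempting an illegal operation is assumed faulty.
    illegal = {name for name, w in outputs.items()
               if w.address not in VALID_ADDRESSES}
    if illegal and len(illegal) < len(outputs):
        return illegal

    # Technique 2: with more than two sets, go with the majority verdict.
    winner, votes = Counter(outputs.values()).most_common(1)[0]
    if votes > len(outputs) / 2:
        return {name for name, w in outputs.items() if w != winner}

    return None  # undetermined
```

With only two processing sets that both issue legal but different writes, the function returns `None`, which corresponds to the more complicated situations in which self-diagnostics would be requested.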
Assuming that the malfunctioning processing set can indeed be successfully identified, the system can continue in operation using only the processing set or sets that are behaving correctly, with the output from the malfunctioning processing set being ignored.
The system in the '477 patent also supports a split mode, in which the two or more processing sets can be operated independently. Such a mode is particularly useful after an error has been detected, in order to allow a faulty system to be properly investigated, repaired, and then brought up to date. Another possible reason for adopting split mode is to increase system throughput, since the processing sets are no longer duplicating each other's work, but can each perform useful (separate) calculations.
In the system taught by the '477 patent, each of the various devices attached to the bus is formally allocated in split mode to one or another of the processing sets. This ownership information is maintained in a set of registers within the bridge. Accordingly, in a programmed I/O operation a processing set can only read from or write to a device that it owns. Likewise, a given device can only perform DMA (direct memory access) for memory belonging to its assigned processing set.
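The ownership checks just described might be modelled as follows. This is a rough sketch under stated assumptions, not the register layout of the '477 patent: the class `SplitModeBridge`, the exception `OwnershipError`, the use of slot numbers to identify devices, and the representation of each processing set's memory as a single address range are all hypothetical.

```python
class OwnershipError(Exception):
    """Raised when an access crosses a split-mode ownership boundary."""


class SplitModeBridge:
    """Toy model of bridge ownership registers in split mode (hypothetical)."""

    def __init__(self, slot_owner, memory_ranges):
        # Device slot number -> name of the owning processing set.
        self.slot_owner = dict(slot_owner)
        # Processing set name -> range of addresses in that set's memory.
        self.memory_ranges = dict(memory_ranges)

    def programmed_io(self, pset, slot):
        """Allow programmed I/O only from the set that owns the device."""
        if self.slot_owner.get(slot) != pset:
            raise OwnershipError(f"{pset} does not own the device in slot {slot}")
        return True

    def dma(self, slot, address):
        """Allow DMA only into memory of the device's assigned set."""
        owner = self.slot_owner.get(slot)
        if owner is None or address not in self.memory_ranges[owner]:
            raise OwnershipError(f"slot {slot} may not access {address:#x}")
        return owner
```

For instance, with slot 0 owned by processing set "A" and slot 1 by "B", a call such as `programmed_io("A", 1)` raises `OwnershipError`, as does a DMA request from slot 1 that targets an address in the memory of "A".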
This approach helps to ensure proper independence between the processing sets in split mode, so that the operations of one processing set do not impact the operations of the other processing set(s). This is particularly important given that the main reason for entering split mode is the detection of an error, indicating a potential malfunction in one of the processing sets.
However, the implementation of split mode in known systems has not generally achieved full independence between the different processing sets. One reason for this is that interrupts from the devices have generally been transmitted as out-of-band events via a special-purpose chip, bypassing the bridge and therefore the device ownership information. Consequently, each interrupt is transmitted to all of the processing sets, irrespective of which particular processing set actually owns the device from which the interrupt originated. Accordingly, each processing set has to investigate the source of all incoming interrupts. If the source of an interrupt corresponds to a device that the processing set owns, then that processing set is responsible for servicing the interrupt; if not, the interrupt can be discarded. It will be appreciated that as a result of this configuration, a processing set must process all interrupts, at least initially, rather than just those from its own devices, and this can have a negative impact on overall system performance.
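The overhead of this broadcast arrangement can be made concrete with a small sketch. The function `broadcast_interrupts` and its data layout are hypothetical and serve only to count how many interrupts each processing set has to inspect compared with how many it actually services.

```python
def broadcast_interrupts(interrupt_slots, slot_owner, processing_sets):
    """Deliver every interrupt to every processing set, as in known systems.

    interrupt_slots: sequence of slot numbers that raised an interrupt.
    slot_owner:      device slot number -> owning processing set.
    Returns, per processing set, how many interrupts it inspected and
    how many of those it actually serviced.
    """
    stats = {p: {"inspected": 0, "serviced": 0} for p in processing_sets}
    for slot in interrupt_slots:
        for p in processing_sets:  # out-of-band delivery reaches every set
            stats[p]["inspected"] += 1  # source must be investigated
            if slot_owner.get(slot) == p:
                stats[p]["serviced"] += 1  # owner services the interrupt
            # otherwise the interrupt is discarded by this set
    return stats
```

With two processing sets and three interrupts, two from a device owned by "A" and one from a device owned by "B", each set inspects all three interrupts, while "A" services two and "B" services only one.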
A potentially more serious problem is that known systems also support a maintenance bus, which is typically used for controlling various physical operating parameters of a device in the system, such as fan speed for cooling purposes, and power supply regulation. Device ownership by the various processing sets is not enforced on the maintenance bus. As a result, it is possible for one processing set to switch (power) off a device supposedly belonging to another processing set. It will be appreciated that this would be undesirable, and could lead to error, or at least to the unexpected inability of a processing set to continue operations.
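The hazard can be pictured with a deliberately simplified sketch of a maintenance bus that performs no ownership lookup. The class `MaintenanceBus` and its `power_off` method are hypothetical names, not an interface of any real system.

```python
class MaintenanceBus:
    """Toy maintenance bus with no ownership enforcement (hypothetical)."""

    def __init__(self, slots):
        # Every device starts powered on.
        self.powered = {slot: True for slot in slots}

    def power_off(self, requester, slot):
        # The requesting processing set is never checked against any
        # ownership information, so any set can power off any device.
        self.powered[slot] = False
```

Here a request such as `power_off("A", 1)` succeeds even if the device in slot 1 has been allocated to a different processing set.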