In many computer applications, it is very important to provide a high degree of reliability. A known way of achieving this is to use a fault tolerant computer, such as that taught in U.S. Pat. No. 6,167,477. The system described in this patent has two (or more) processing sets that are configured to perform identical operations. The processing sets are joined by a bridge (known as a voting bridge), which in turn is linked to a bus. Various input/output devices are then attached to the bus (e.g. disk storage, network communications, etc.). The design philosophy of this fault tolerant system is to try to prevent any error that might occur within one of the processing sets from propagating into the outside world, in other words from the bridge onto the bus and from there to the various external devices.
In order to achieve this, the output from the processing sets is compared at the bridge. Any discrepancy between the outputs from the different processing sets is detected and serves as an indication of potential error.
At this point, the system generally tries to determine which of the processing sets is responsible for the error. In some cases this can be relatively straightforward. For example, if one processing set is trying to perform an illegal operation, such as writing to a non-existent address on the bus, or is producing output with a parity error, then it is a reasonable assumption that it is this processing set that is malfunctioning. Also, in systems having more than two processing sets, it is sensible to go with the verdict of the majority of the processing sets. For example, in a system with three processing sets, these can effectively be regarded as voting two-to-one in favour of the correct output (assuming only a single processing set is faulty at a given time).
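The majority rule described above can be illustrated with a minimal sketch. It is not taken from the referenced patent; the function name `vote` and the processing set labels are hypothetical, and the outputs are modelled simply as comparable values.

```python
from collections import Counter


def vote(outputs):
    """Compare one output from each processing set.

    `outputs` maps a (hypothetical) processing set label to the value it
    produced. Returns (agreed_value, suspects). agreed_value is None when
    no strict majority exists, in which case every set is a suspect.
    """
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        # No strict majority, e.g. two sets that simply disagree:
        # the comparison alone cannot say which one is wrong.
        return None, sorted(outputs)
    suspects = sorted(name for name, v in outputs.items() if v != value)
    return value, suspects
```

With three sets, `vote({"A": 0x12, "B": 0x12, "C": 0x13})` singles out set C as the suspect, whereas with only two disagreeing sets the function returns no agreed value, which corresponds to the harder case where further diagnosis is needed.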
On the other hand, if the discrepancy simply represents a difference in output data between two processing sets, then it is difficult for the bridge to determine, a priori, which one of the two sets is behaving incorrectly. In this case, the bridge can ask the two processing sets to run various diagnostic tests on themselves, in order to try to pinpoint the source of the error.
Assuming that the malfunctioning processing set can indeed be successfully identified, the system can continue in operation by using only the processing set or sets that are behaving correctly, with the output from the faulty processing set being ignored. Of course, if there were originally only two processing sets, then at this point the system is no longer fault tolerant, since there is no comparison of output from different processing sets. Therefore, in some such situations it may be desirable for the fault tolerant system to simply stop operation once a discrepancy is detected, pending suitable maintenance or restorative action. Such a fault tolerant system can still be regarded as robust because it can detect the presence of a defective processing set, and prevent this from having any unwanted external manifestation or consequence. Conversely, if the system originally included more than two processing sets, then it can continue in fault tolerant mode even after one of the processing sets has to be discounted as faulty.
It will be appreciated that in a system initially having two processing sets, the particular circumstances of the system will determine whether it is better to react to an error by stopping processing altogether until corrective action is taken, or instead to automatically continue operations on the basis of the one remaining good processing set. For example, if the system is being used for nightly reconciliation of various accounting records, it may be better to defer operations until fault tolerance can be restored. On the other hand, if the system is being used for the real-time control of a distribution network, there may be no option but to continue on the basis of the single processing set.
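The choice between these reactions can be summarised in a short sketch. The names `Action` and `react` and the flag `must_keep_running` are hypothetical, and the decision to halt when the faulty set cannot be identified is an assumption of this sketch rather than something stated for the referenced system.

```python
from enum import Enum


class Action(Enum):
    CONTINUE_FAULT_TOLERANT = "continue, outputs still compared"
    CONTINUE_DEGRADED = "continue on one set, no comparison possible"
    HALT = "stop pending maintenance or restorative action"


def react(total_sets, faulty_identified, must_keep_running):
    """Pick a reaction after the bridge has detected a discrepancy."""
    if not faulty_identified:
        # Assumption of this sketch: with no known-good set, stop.
        return Action.HALT
    remaining = total_sets - 1  # assumes a single faulty set at a time
    if remaining >= 2:
        return Action.CONTINUE_FAULT_TOLERANT
    # Only one good set remains, so the application decides.
    return Action.CONTINUE_DEGRADED if must_keep_running else Action.HALT
```

Under these assumptions, a two-set system used for nightly reconciliation would call `react(2, True, False)` and halt, while a two-set system controlling a distribution network in real time would call `react(2, True, True)` and continue on the single remaining set.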
Note that in some situations it is possible for there to be a discrepancy in the outputs from the processing sets even if they are all still operating correctly. This can arise when the output depends on the particular processing set that produced it, for example because it incorporates the serial number of the CPU chip. However, this problem is rather rare, and in practice can normally be worked around.
A much more significant issue concerns the relative timing of the processing sets. If the processing sets do not stay in proper synchronisation with one another, then their outputs will appear to be different despite the fact that they are all operating correctly. In order to address this problem, the fault tolerant system of the above-referenced patent drives all the processing sets with a single clock signal. More particularly, each processing set has its own internal clock, which is regulated by a phase-locked loop (PLL) oscillator. An appropriate control signal is then supplied to the various clocks to ensure that their respective processing sets are properly maintained in lock step with one another. In addition, the bridge has its own clock, and this is similarly kept in synchronisation with the clocks of the processing sets. As a result of this common timing, the bridge should receive the outputs from the processing sets effectively simultaneously, and so can perform a proper comparison of them.
The processing sets are also arranged to have a purely synchronous (lockstep) design, avoiding metastable states. This serves to eliminate differences that might be caused, for example, by an input signal arriving exactly on a clock boundary, and so ensures that such a signal is ultimately resolved in the same fashion by each processing set. This avoids different processing sets allocating the input signal to different clock cycles due to (inevitable) physical variations in supposedly identical components of the different processing sets.
One limitation of the above approach is that it requires timing variations on the links from the processing sets to the bridge to be small compared to the clock cycle, in order that fully synchronous lockstep behaviour can be maintained as regards the communications between the bridge and the different processing sets. This is conventionally achieved by placing the processing sets and the bridge together in the same chassis or rack assembly to ensure that the outputs from the processing sets are properly received at the bridge at the same clock value.
However, there may be certain situations in which the outputs from the processing sets arrive at the bridge at significantly different times, in other words where jitter or other timing variations in transmission are not small compared to the clock interval. This can then lead to a loss of synchronisation of the outputs, and consequently the detection of discrepancies by the bridge, even though the processing sets themselves are still functioning correctly.
Typically there are two particular circumstances which can give rise to this problem (these may occur either singly or in combination with one another). The first is the desire for ever-increasing clock speeds to boost processing performance. Thus what might be an acceptably small amount of jitter at a relatively low clock speed may become more troublesome as the system is pushed towards higher clock speeds, even if the physical configuration of the system is otherwise unchanged (for example the devices are all located within the same rack assembly).
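The effect of clock speed on a fixed amount of timing variation can be made concrete with a small sketch. The figures used (2 ns of skew, clocks of 100 MHz and 500 MHz, an on-time output arriving mid-cycle) are purely illustrative assumptions, and the model of the bridge assigning each arrival to the clock cycle in which it occurs is a simplification.

```python
def period_ps(freq_mhz):
    """Clock period in picoseconds for a frequency given in MHz."""
    return 1_000_000 // freq_mhz


def arrival_cycle(arrival_ps, clock_period_ps):
    """Index of the clock cycle in which the bridge sees an arrival."""
    return arrival_ps // clock_period_ps


def same_cycle(nominal_ps, skew_ps, clock_period_ps):
    """True if an on-time output and one delayed by skew_ps
    are seen by the bridge in the same clock cycle."""
    on_time = arrival_cycle(nominal_ps, clock_period_ps)
    delayed = arrival_cycle(nominal_ps + skew_ps, clock_period_ps)
    return on_time == delayed


SKEW_PS = 2000  # illustrative: 2 ns of skew between two links

for freq_mhz in (100, 500):
    p = period_ps(freq_mhz)
    # The on-time output is assumed to arrive in the middle of a cycle.
    print(freq_mhz, "MHz:", same_cycle(p // 2, SKEW_PS, p))
```

In this model the two outputs fall in the same cycle at 100 MHz (10 ns period) but in different cycles at 500 MHz (2 ns period), although the skew itself is unchanged.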
A second cause of difficulty is that it may be desirable to locate the bridge in a different chassis from the processing sets. Indeed, in some installations the bridge may be located in a different room, a different building, or even at a different geographical site from the processing sets. Likewise, the processing sets may themselves be geographically separated from one another. Such a distributed architecture allows a much more flexible use of hardware, but it also leads to significantly increased transmission delays on the links between the processing sets and the bridge. Such lengthy connections are also much more susceptible to increased jitter and timing variations.
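A rough calculation shows the scale of the transmission delays involved. The sketch below assumes a propagation delay of about 5 ns per metre of cable (a common rule of thumb, corresponding to roughly two-thirds of the speed of light) and an illustrative 100 MHz clock; the link lengths and the name `delay_in_cycles` are hypothetical and are not taken from the referenced patent.

```python
NS_PER_METRE = 5  # assumed rule-of-thumb propagation delay in cable


def delay_in_cycles(length_m, freq_mhz, ns_per_m=NS_PER_METRE):
    """One-way link delay expressed in clock cycles.

    delay in ns = length_m * ns_per_m; clock period in ns = 1000 / freq_mhz.
    """
    return length_m * ns_per_m * freq_mhz / 1000


# Illustrative link lengths: same rack, another building, another site.
for length_m in (1, 100, 10_000):
    print(length_m, "m:", delay_in_cycles(length_m, 100), "cycles")
```

On these assumptions a 1 m link costs half a clock cycle, a 100 m link 50 cycles, and a 10 km link 5000 cycles, so even a modest difference in length between the links from two processing sets would separate their outputs by many clock cycles at the bridge.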
Consequently, it is no longer possible in such systems to guarantee that the outputs from different processing sets will arrive simultaneously at the bridge. This in turn will prevent the bridge from properly checking that the outputs match one another, and from confirming that the processing sets are indeed operating correctly. Therefore, it will be seen that a potential loss of synchronisation can undermine the fault tolerant behaviour of the system.