This invention is directed to the problem of designing a computer system that will continue to be able to exercise control despite the occurrence of a component fault. There are numerous applications where computer survival is critical, including military, space, and transportation applications. There are others where computers would be used if they were more dependable, including medical and nuclear applications.
The earliest approach to enhanced computer system reliability was to have two or three computers, each capable of control, and to switch from one to another when one failed. The problem is how to find out that one has failed, and how to restart the job on the next computer.
Another approach has been to have two computers running in synchronism. They will disagree when one has failed, thus solving the first part of the problem. Still another approach, three in synchronism, will not only show disagreement when one has failed but will indicate which one it was. The latter principle was employed by the Saturn V Launch Vehicle Digital Computer in the mid-1960s.
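By way of illustration only, the difference between two units and three units in synchronism can be sketched as follows. The function names `compare_pair` and `vote_triplet`, and the use of integers as unit outputs, are assumptions made for this sketch and are not taken from any of the systems described above.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple


def compare_pair(a: int, b: int) -> bool:
    """Two units in synchronism: report whether a fault is present.

    A mismatch shows that one unit has failed, but not which one.
    """
    return a != b


def vote_triplet(outputs: Sequence[int]) -> Tuple[Optional[int], Optional[int]]:
    """Three units in synchronism: majority vote.

    Returns (voted value, index of the dissenting unit). The index is
    None when all three agree. Both are None when no two units agree,
    which a single fault cannot cause.
    """
    if len(outputs) != 3:
        raise ValueError("exactly three outputs are required")
    value, count = Counter(outputs).most_common(1)[0]
    if count == 3:
        return value, None
    if count == 2:
        dissenter = next(i for i, out in enumerate(outputs) if out != value)
        return value, dissenter
    return None, None


if __name__ == "__main__":
    print(compare_pair(5, 7))
    print(vote_triplet([5, 7, 5]))
```

With two units, a mismatch can only be reported. With three, the majority both supplies a usable output and points to the unit that disagreed.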
Another approach to the problem is to use coded representations for data that will be altered in an identifiable manner by any component fault. This has been used in various projects, notably the JPL STAR computer designed and built at the Jet Propulsion Laboratory. This approach is designed to avoid the expense of replicating hardware to detect and correct faults. The disadvantages are that many non-standard circuits must be designed, manufactured, and understood by maintenance personnel, and also that it is difficult to verify at an arbitrary instant of time that all of the assumed protection is indeed present and in working order.
There are various ways in which one can employ the triple-redundancy principle. One is to triplicate small parts of a computer and vote on every input to each part. This is characteristic of the Saturn V computer design. Another way is to triplicate an entire system and vote on all inputs to the system. This represents an extreme measure, rather than a practical approach.
There are other, more realistic approaches in which parts of the system are triplicated with voting at chosen points. Some systems, notably for aircraft, have used more than three of each part of a system in order to achieve immunity to more than one failure.
Another proposal was a computer system composed of numerous small processors and memory modules interconnected by a time-shared bus, in which three units operate together to perform a part of the total system job. Any triplet can fail, in which case the failure will be detected and the information necessary for restart will be salvaged and passed along to another triplet as soon as one is available.
Yet another approach was suggested, which comprised a group of individual non-redundant units each connected to a common bus system. The shortcoming of both of these last two approaches has been that no mechanism had been devised that would allow units to be connected to the bus in such a way that they could be disconnected and reconnected when necessary despite the presence of faults, and in such a way that this ability to disconnect and reconnect could be dynamically verified.
Without such a mechanism, there was the possibility that a single failure would either bring down the entire system, or else would go undetected until a second failure occurred which, together with the first, would bring down the system. In either case the desired fault tolerance is not achieved.
The invention presented here departs from the above in that it allows data connections to be reliably made and broken between processors, memory units, and other units on the one hand, and members of a redundant bus on the other. Such connections can be changed only by two or three processors acting in synchronism. No single processor can change its own connection status, nor that of any other unit.
When a unit persistently disagrees with its assigned partners, it will have its power switched off and will be logically disconnected from the system by other, correctly functioning processors acting in synchronism.
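The connection rule stated above can be sketched in software, purely as an illustration and not as a description of the actual circuitry. The class `BusSwitch`, its method names, the unit names, and the threshold `DISAGREEMENT_LIMIT` are all hypothetical. In particular, the number of disagreements that counts as "persistent" is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

QUORUM = 2              # "two or three processors acting in synchronism"
DISAGREEMENT_LIMIT = 3  # hypothetical: disagreements that count as "persistent"


@dataclass
class BusSwitch:
    """Hypothetical model of the connection and power state of each unit."""
    connected: Dict[str, bool] = field(default_factory=dict)
    powered: Dict[str, bool] = field(default_factory=dict)
    strikes: Dict[str, int] = field(default_factory=dict)

    def add_unit(self, unit: str) -> None:
        self.connected[unit] = True
        self.powered[unit] = True
        self.strikes[unit] = 0

    def set_connection(self, target: str, connect: bool,
                       requesters: Set[str]) -> bool:
        """Change a connection only if enough processors issue the command.

        A command backed by a single processor is refused, so no processor
        can alter its own status or that of any other unit alone.
        """
        if len(requesters) < QUORUM:
            return False
        self.connected[target] = connect
        return True

    def report_disagreement(self, unit: str, partners: Set[str]) -> bool:
        """Record one disagreement; remove the unit once it is persistent.

        Returns True when the unit has been disconnected and powered off.
        """
        self.strikes[unit] += 1
        if self.strikes[unit] < DISAGREEMENT_LIMIT:
            return False
        if not self.set_connection(unit, False, partners):
            return False
        self.powered[unit] = False
        return True


if __name__ == "__main__":
    switch = BusSwitch()
    for name in ("p1", "p2", "p3"):
        switch.add_unit(name)
    print(switch.set_connection("p3", False, {"p3"}))
    for _ in range(DISAGREEMENT_LIMIT):
        removed = switch.report_disagreement("p3", {"p1", "p2"})
    print(removed, switch.connected["p3"], switch.powered["p3"])
```

In this sketch the removal of a unit passes through the same quorum check as any other connection change, so a single faulty processor cannot disconnect a healthy one.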