This invention relates to digital computing apparatus and methods that provide essentially continuous operation in the event of numerous fault conditions. The invention thus provides a computer system that is unusually reliable. The computer system is also highly flexible in its configuration and is easy to use, in that it spares the user from concern when such fault conditions occur. The system further provides ease of use through programming simplifications and through relatively low-cost hardware that handles numerous operations.
Faults are inevitable in digital computer systems due, at least in part, to the complexity of the circuits and of the associated electromechanical devices, and to programming complexity. There accordingly has long been a need to maintain the integrity of the data being processed in a computer in the event of a fault, while maintaining essentially continuous operation, at least from the standpoint of the user. To meet this need, the art has developed a variety of error-correcting codes and apparatus for operation with such codes. The art has also developed various configurations of equipment redundancies. One example of this art is set forth in U.S. Pat. No. 4,228,496 for "multiprocessor system". That patent provides pairs of redundant processing modules, each of which has at least a processing unit and a memory unit and operates with peripheral control units. A fault anywhere in one processing module can disable the entire module and require the module paired with it to continue operation alone. A fault anywhere in the latter module can disable it as well, so that as few as two faults, one in each module, can disable the entire module pair.
This and other prior practices have met with limited success. Efforts to simplify computer hardware have often led to unduly complex software, i.e., machine programming. Efforts to simplify software, on the other hand, have led to excessive equipment redundancy, with attendant high cost and complexity.
It is accordingly a general object of this invention to provide a digital computer system which operates with improved tolerance to faults and hence with improved reliability.
Another object of the invention is to provide digital computer apparatus and methods for detecting faults and for effecting remedial action, and for continuing operation, with assured data integrity and essentially without disturbance to the user.
It is also an object of the invention to provide fault-tolerant digital computer apparatus and methods having both relatively uncomplicated software and a relatively efficient level of hardware duplication.
A further object of the invention is to provide fault-tolerant digital computer apparatus and methods which have a relatively high degree of decentralization of error detection and which operate with relatively simple corrective action in the event of an error-producing fault.
A further object of the invention is to provide fault-tolerant digital computer apparatus and methods of the above character which employ different error detection methods and structures for different system components for obtaining cost economies and hardware simplifications.
A more specific object of the invention is to provide a fault-tolerant computer system having a processor module with redundant elements in the bus structure and in the processing, the memory, and the peripheral control units, so arranged that the module can continue valid operation essentially uninterrupted even in case of faults in multiple elements of the module.
Other general and specific objects of the invention will in part be obvious and will in part appear hereinafter.