This invention relates generally to data processing equipment and processes and, more particularly, to a method and apparatus for fault-tolerant communication of information among a plurality of information processing elements.
Computers have become an important tool for many businesses because of the ability of even small computers to process vast amounts of data in a short time. In many applications it is important, and often crucial, that the data processing not be interrupted. A failure of a computer system, especially during data communication between the processor and a permanent storage device, can shut down a portion of the related business and can cause considerable loss of data and money. Accordingly, computing systems must provide not only sufficient computing ability to process large amounts of data, but also a mode of operation which permits data processing to continue without interruption in the event that some component of the system fails.
Information is communicated among elements in a computing system through buses. A bus may comprise a single wire, in which case all information is transferred among the elements in bit-serial format, or it may comprise a plurality of wires, which enables information to be transferred in byte-parallel format. In either case, a fault in a single wire or in a component attached to that wire may cause a failure of the entire computer system. Accordingly, some form of redundancy should be implemented for reliable system operation.
One approach to redundancy is to provide a duplicate bus to which the system may switch in the event of primary bus failure. Duplication may be economically feasible when a single-wire, bit-serial bus is used, but such systems are inadequate for applications that frequently require high data throughput, and the apparent complexity of implementing serial arbitration has discouraged many attempts at constructing a working system. In parallel bus systems, the addition of an extra parallel bus multiplies the amount of hardware necessary for proper operation and greatly increases cost.
Another approach, used in byte-parallel systems, is to provide one or more lines as spares in the event that one of the primary lines fails. However, such approaches are often inadequate because the modules may fail when the controlling module attempts to communicate error and reconfiguration information to them over a byte-parallel bus known to have a faulty wire. In some cases, communication with an I/O module may be impossible, and the entire system fails.
Another shortcoming of sparing is the difficulty of testing the spare wires before they are actually needed. Conventional devices may not adequately test the spare wires, so that a faulty spare is not detected until a module attempts to use it, at which point it is too late.
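By way of illustration only, the sparing scheme and the spare-testing problem discussed above may be modeled with a minimal sketch. The class name `SparedBus`, its methods, and the wire numbering are hypothetical and are not taken from any particular system; the electrical test of a wire is simulated by a simple set lookup.

```python
from typing import List, Set


class SparedBus:
    """Hypothetical model of a byte-parallel bus with spare wires."""

    def __init__(self, data_lines: int, spare_lines: int) -> None:
        # Logical line i is initially carried by physical wire i.
        self.mapping: List[int] = list(range(data_lines))
        # Spare wires occupy the physical positions after the primary wires.
        self.free_spares: List[int] = list(
            range(data_lines, data_lines + spare_lines)
        )
        self.faulty: Set[int] = set()

    def mark_faulty(self, wire: int) -> None:
        """Simulate a fault on one physical wire."""
        self.faulty.add(wire)

    def wire_ok(self, wire: int) -> bool:
        """Stand-in for driving a test pattern on a single wire."""
        return wire not in self.faulty

    def test_spares(self) -> List[int]:
        """Exercise idle spares so a bad spare is found before it is needed."""
        bad = [w for w in self.free_spares if not self.wire_ok(w)]
        self.free_spares = [w for w in self.free_spares if self.wire_ok(w)]
        return bad

    def spare_out(self, logical_line: int) -> int:
        """Reroute a logical line from its failed wire to the next free spare."""
        if not self.free_spares:
            raise RuntimeError("no working spare wire available")
        new_wire = self.free_spares.pop(0)
        self.mapping[logical_line] = new_wire
        return new_wire


bus = SparedBus(data_lines=8, spare_lines=2)
bus.mark_faulty(3)  # primary wire 3 fails
bus.mark_faulty(8)  # the first spare is also defective
bad_spares = bus.test_spares()   # detects the defective spare in advance
replacement = bus.spare_out(3)   # logical line 3 moves to a working spare
```

In this sketch, omitting the call to `test_spares` would cause logical line 3 to be rerouted onto the defective spare, which corresponds to the late-detection problem noted above.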
Once a fault has been detected and analyzed, system operation must often be suspended to fully effect repair. For example, a faulty bus driver in a module usually necessitates module replacement. System operation must often be suspended to initialize and otherwise accommodate the newly inserted module. If the faulty bus driver also caused one of the lines in the bus to fail, so that sparing is required, the problem is compounded by the fact that the newly inserted module is unaware of the new system configuration. At best, this requires the operator to provide sparing information to the newly inserted module and, at worst, the system fails again because the newly inserted module cannot communicate with the rest of the system.
Finally, if a fault arises from an unsynchronized module, it is desirable that the computing system resynchronize the module automatically, without operator intervention. Known approaches to this problem often require programs having complicated algorithms and intricate hardware to implement them. This increases cost, and the additional hardware increases the likelihood of further errors.