The electronic circuitry in most modern computers and data processing machines can be organized into three well-defined logical groups or logic modules, each of which performs a specific subfunction in the accomplishment of the computer's overall function of processing data. For example, most computers contain a central processing module, a memory module and an input/output module.
The central processing module in a computer system typically performs the timing and control operations of the computer as well as the actual data manipulation or computations required. The memory module is used to store initial data and the results of computations generated by the central processing module. Finally, the input/output module receives data from the world outside the computer system and forwards it to the central processing module and the memory module, and it transmits to the outside world the results of the computations carried out by the computer system. Each of the three types of logic modules in a typical computer is a microcosm of the computer itself and may, in turn, be broken down into three submodules or units which have functions similar to the functions of the three main computer modules.
For example, a typical central processing module may be broken into three units: the data processing and control unit, the memory unit and the input/output unit.
The data processing and control unit generates the sequence of signals needed to control the module's operation or carries out the actual data computations or calculations. An arithmetic and logic unit in a central processing module is an example of this functional unit. The timing and address generator in a memory module is another example of this type of functional unit.
A second functional unit is the memory unit which temporarily stores the data produced by the data processing and control unit. Examples of a memory functional unit are a cache memory in a central processing module, a memory array in a memory module, or data and command buffer memories located in the input/output module.
The third functional unit is an interface unit, which connects a module to an information transfer bus that links the module to other modules or to the outside world. Examples of interface units are data bus drivers in central processing and memory modules and input/output bus drivers in an input/output module.
In fault-tolerant computers, which can tolerate a circuit malfunction or fault without losing data integrity, it is necessary to detect faults in all three types of functional units. After a fault is detected, the computer system must respond quickly enough that it never produces an erroneous output without also generating some type of alarm, so that an erroneous output is not accepted as accurate. In addition, the computer system must prevent corruption of its internally stored data base by faulty inputs or outputs that the fault itself may generate, so that the computations which were being performed when the fault occurred can be restarted.
Conventional fault detection methods are of two types: error-detecting coding and duplication/comparison. It has long been recognized that error-detecting codes provide an efficient means for monitoring the operability of memory functional units and interface functional units. It is also well known that error-detecting codes are not practical for monitoring the operability of data processing and control functional units. Accordingly, error-detecting codes have often been used in fault-tolerant computers in environments which require only limited fault detection and monitoring, such as when only minimal fault detection is necessary or when fault detection is desired only insofar as it can be achieved at a small incremental cost over the cost of the basic non-fault-tolerant computer.
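As a minimal illustration of why error-detecting codes suit memory and interface units, the sketch below uses a single even-parity bit per word. The choice of parity is an assumption for illustration only; the text does not name a particular code. A memory or bus only stores or moves a word, so a check bit computed when the word is written can be recomputed and compared when it is read. A data processing unit, by contrast, transforms its operands, so the check bit of a result is generally not a simple function of the operands' check bits, which is one commonly cited reason such codes are impractical there.

```python
"""Minimal sketch of an error-detecting code: one even-parity bit per word.

Assumption: even parity over a small integer word is used purely for
illustration; the text does not specify which error-detecting code is used.
"""


def parity_bit(word: int) -> int:
    """Return the even-parity check bit for a non-negative integer word."""
    return bin(word).count("1") % 2


def encode(word: int) -> tuple[int, int]:
    """Store the word together with its check bit, as a memory or bus would."""
    return word, parity_bit(word)


def check(stored: tuple[int, int]) -> bool:
    """Return True if the stored word is consistent with its check bit."""
    word, p = stored
    return parity_bit(word) == p


good = encode(0b1011_0010)
bad = (good[0] ^ 0b0000_1000, good[1])  # simulate a single-bit fault in storage
print(check(good), check(bad))  # True False
```

A single parity bit detects any single-bit error but not an even number of flipped bits; stronger codes trade extra check bits for wider coverage.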
Prior art fault-tolerant computers which have required a high degree of fault tolerance have utilized error-detecting coding for protection of the memory and interface units and a duplication and compare technique for protection of the data processing and control units. Some prior art fault-tolerant computers utilize duplication and compare techniques at the logical module level, but more frequently, such techniques are used at the system output level.
In a conventional duplicate and compare computer configuration used at the logical module level, two identical modules are used, each of which is functionally complete (that is, each module can perform complete calculations or data manipulations without any additional circuitry). In particular, each of the duplicated modules contains a complete data processing and control unit, a complete memory unit with sufficient memory capacity to service the data processing and control unit, and one or more interface units.
During operation, the data processing and control unit in each of the duplicated modules operates solely with its local memory unit over internal address buses and data buses. Fault detection and monitoring are achieved by dedicating the two identical modules to the same function and comparing their outputs on a continuous or regular basis. In many prior art systems, in order to accomplish the required comparison, the two identical modules are interconnected by external buses. Each of the identical modules accepts inputs over these buses both from its companion module and from the remainder of the computer system. To achieve high reliability, the external buses must also be duplicated.
In many prior art arrangements one of the identical modules generates outputs which are normally used by the remainder of the system while the second module generates outputs which are used solely for comparison to the first module's outputs. In other prior art systems the outputs of both modules are used for comparison purposes and by the remainder of the system.
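The first arrangement described above can be sketched as follows. This is a minimal model under stated assumptions, not the circuitry of any particular prior art system: each module is represented as a hypothetical Python callable, the comparator signals an alarm by returning a flag, and the primary module's output is the one forwarded to the rest of the system while the checker's output is used only for comparison. The duplicated local memories and external buses are not modelled.

```python
"""Minimal sketch of a duplicate-and-compare module pair.

Assumptions (hypothetical, for illustration only): a module is modelled as a
Python callable, and the comparator raises an alarm by returning a flag.
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class DuplicatedPair:
    primary: Callable[[int], int]  # outputs used by the rest of the system
    checker: Callable[[int], int]  # outputs used only for comparison

    def step(self, data: int) -> tuple[int, bool]:
        """Feed identical input to both modules and compare their outputs.

        Returns the primary module's output and an alarm flag that is True
        when the two outputs disagree, so that a faulty result is not
        silently accepted as accurate.
        """
        out_a = self.primary(data)
        out_b = self.checker(data)
        return out_a, out_a != out_b


def healthy(x: int) -> int:
    """Hypothetical module function: an 8-bit computation."""
    return (x * 3 + 1) & 0xFF


def faulty(x: int) -> int:
    """Same function with a hypothetical stuck-at-one fault on output bit 0."""
    return healthy(x) | 0x01


pair_ok = DuplicatedPair(healthy, healthy)
pair_bad = DuplicatedPair(healthy, faulty)
print(pair_ok.step(5))   # (16, False)
print(pair_bad.step(5))  # (16, True)
```

Note that the comparison only reveals a fault when it actually changes an output: with input 4 the correct result (13) already has bit 0 set, so the stuck-at-one fault is masked and no alarm is raised.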
One problem with such a conventional duplicate and compare scheme is that it is wasteful in its utilization of circuitry. In particular, prior art duplicate and compare schemes use twice as much memory as would be required for a comparable non-fault-tolerant system. In addition, two external buses must be used for each module pair to obtain the same data throughput as a non-fault-tolerant system using only a single bus. In small computer systems this added complexity may be acceptable; in large computer systems, however, it substantially increases both cost and the amount of circuitry, which in turn raises manufacturing costs and the likelihood of circuit failures and replacements.
Accordingly, it is an object of the present invention to simplify fault detection and monitoring circuitry in a computer system.
It is another object of the present invention to provide simplified fault detection circuitry which has a fault detection and monitoring capability equivalent to that of conventional duplication and compare techniques.
It is yet another object of the present invention to provide fault detection and monitoring circuitry which can detect all failures resulting from a single circuit component failure.
It is a further object of the present invention to provide a fault-tolerant and self-checking computer circuit which uses no more total memory than would be required in a non-fault-tolerant computer system.
It is still another object of the present invention to provide a fault-tolerant and self-checking computer circuit in which the external buses need only be of the same width and number as required in a non-fault-tolerant computer system.