This invention relates to a fault-tolerant computer system.
The traditional approaches to system reliability attempt to prevent the occurrence of faults through improved design methodologies, strict quality control, and various other measures designed to shield system components from external environmental effects (e.g., hardening, radiation shielding). Fault tolerance methodologies, by contrast, assume that system faults will occur and attempt to design systems which will continue to operate in the presence of such faults. In other words, fault-tolerant systems are designed to tolerate undesired changes in their internal structure or their external environment without resulting in system failure. Fault-tolerant systems utilize a variety of schemes to achieve this goal. Once a fault is detected, various combinations of structural and informational redundancy make it possible to mask it (e.g., through replication of system elements) or correct it (e.g., by dynamic system reconfiguration or some other recovery process). By combining such fault tolerance techniques with traditional fault prevention techniques, even greater increases in overall system reliability may be realized.
According to the invention, a fault-tolerant computer architecture is provided wherein the effect of hardware faults is diminished. The architecture employs a main data bus having a plurality of interface slots for interconnecting conventional computer sub-systems. Such sub-systems may include a magnetic disk sub-system and a serial communication sub-system. The number and type of sub-systems may vary considerably; however, a central processor sub-system, which encompasses the inventive elements, is always included.
The central processor sub-system employs a plurality of central processing modules operating in parallel in a substantially synchronized manner. One of the central processing modules operates as a master central processing module, and is the only module capable of reading data from and writing data to the main data bus. The master central processing module is initially chosen arbitrarily from among the central processing modules.
Each central processing module comprises a means by which the module can compare data on the main data bus with data on a secondary bus within each module in order to determine if there is an inconsistency indicating a hardware fault. If such an inconsistency is detected, each module generates state outputs which reflect the probability that a particular module is the source of the fault. A synchronization bus which is separate from the main data bus interconnects the central processing modules and transmits the state outputs from each module to every other central processing module.
More specifically, each central processing module comprises a shared data bus connected to the main data bus through a first bus interface. A number of hardware elements are connected to the shared data bus including a read/write memory, an asynchronous receiver/transmitter circuit, a timer circuit, a plurality of control and status registers, and a special purpose read/write memory, the purpose of which is to store data corresponding to main data bus interface slots having a defective or absent computer sub-system.
Each module further comprises a comparator circuit which is part of the first bus interface, the purpose of which is to compare data on the main data bus with data on the shared data bus and generate state outputs in response thereto. A parity checking circuit is also part of the first bus interface and monitors data lines in the main data bus, generating a parity output which is used as an input to the comparator circuit.
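The relationship between the parity checking circuit and the comparator circuit can be illustrated in software. The following is a minimal sketch only, not the circuit of the invention: the function names, the 8-bit bus width, the use of even parity, and the two-field form of the state outputs are all assumptions made for illustration.

```python
def even_parity_ok(data: int, parity_bit: int, width: int = 8) -> bool:
    """Return True when the data bits plus the parity bit hold an even number of ones.

    Even parity and the 8-bit width are assumptions, not taken from the source.
    """
    mask = (1 << width) - 1
    ones = bin(data & mask).count("1") + (parity_bit & 1)
    return ones % 2 == 0


def compare_buses(main_data: int, main_parity: int, shared_data: int,
                  width: int = 8) -> dict:
    """Hypothetical comparator: report parity and mismatch conditions.

    The parity result is fed into the comparison outcome, mirroring the text,
    where the parity output is an input to the comparator circuit.
    """
    mask = (1 << width) - 1
    return {
        "parity_error": not even_parity_ok(main_data, main_parity, width),
        "mismatch": (main_data & mask) != (shared_data & mask),
    }
```

In this sketch a module whose shared data bus disagrees with the main data bus reports a mismatch, while a parity error points instead at corruption on the main data bus lines themselves.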
A private data bus is connected to the shared data bus through a second bus interface. The private data bus is also connected to a plurality of hardware elements which may include a read/write memory, a read-only memory, and a “dirty” memory. The purpose of the “dirty” memory is to store data corresponding to memory locations in the read/write memory to which information has been written. As will become clear, this facilitates the copying of data from one central processing module to another. Also connected to the private data bus and controlling the operation of each central processing module is a central processing unit which operates in a substantially synchronized manner with the central processing units in other central processing modules.
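The way a record of written locations can reduce the work of copying memory between modules may be sketched as follows. This is an illustrative software model under stated assumptions, not the hardware described here: the class name `DirtyTrackedMemory`, the 16-byte block granularity, and the `copy_dirty_to` method are hypothetical.

```python
class DirtyTrackedMemory:
    """Read/write memory with a hypothetical per-block record of writes."""

    def __init__(self, size: int, block_size: int = 16):
        self.data = bytearray(size)
        self.block_size = block_size  # tracking granularity (assumed)
        self.dirty = set()            # indices of blocks written since last copy

    def write(self, addr: int, value: int) -> None:
        """Store one byte and mark its containing block as written."""
        self.data[addr] = value & 0xFF
        self.dirty.add(addr // self.block_size)

    def copy_dirty_to(self, other: "DirtyTrackedMemory") -> int:
        """Copy only the written blocks to another memory; return blocks copied."""
        copied = 0
        for block in sorted(self.dirty):
            start = block * self.block_size
            end = min(start + self.block_size, len(self.data))
            other.data[start:end] = self.data[start:end]
            copied += 1
        self.dirty.clear()
        return copied
```

With two writes landing in different blocks of a 64-byte memory, only two of the four blocks are transferred, which is the saving the record of written locations is meant to provide.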
Finally, each central processing module contains a control logic circuit which is connected to and controls the first and second bus interfaces. The control logic circuit receives as its inputs the state outputs generated by the comparator circuits in every central processing module. The circuit, using these and other control signals described more specifically below, generates, among other things, control logic signals which indicate to the central processing unit whether a fault has occurred. If a fault is detected, each module then executes a routine which identifies the location of the fault, disables the failed module or sub-system, and then returns to the instruction being executed at the time the fault was detected.
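The sequence of detecting a fault, identifying its location, and disabling the failed module can be outlined in software. The sketch below is purely illustrative and covers only the case of a failed central processing module: the numeric form of the state outputs, the rule of disabling the most suspected module, and the name `handle_fault` are assumptions, and the actual routine is the one described in the embodiments.

```python
from typing import Dict, Optional, Set


def handle_fault(state_outputs: Dict[str, int], enabled: Set[str]) -> Optional[str]:
    """Hypothetical fault routine: disable the most suspected enabled module.

    state_outputs maps a module identifier to a suspicion score (assumed form).
    Returns the identifier of the disabled module, or None when no fault is
    indicated. Ties are broken by identifier order so the result is deterministic.
    """
    # A fault is indicated when any enabled module carries a non-zero score.
    if not any(state_outputs.get(m, 0) > 0 for m in enabled):
        return None
    suspect = max(sorted(enabled), key=lambda m: state_outputs.get(m, 0))
    enabled.discard(suspect)  # the failed module is taken out of service
    return suspect
```

After the call returns, the remaining enabled modules would resume at the instruction that was executing when the fault was detected; that resumption step is outside the scope of this sketch.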
Embodiments of this invention will now be described by way of example only and with reference to the accompanying drawings.