The invention disclosed and claimed herein relates to digital data processing and, particularly, to methods and apparatus for fault-detecting and fault-tolerant computing.
A shortcoming of many conventional digital data processors is their inability to detect and correct a range of data transfer and operational faults. Personal computers and workstations, for example, are typically configured to detect only a single class of data transfer faults, such as parity errors. Larger computer systems often incorporate error-correcting codes, at least in peripheral device communications, to detect and correct single-bit errors.
Computer systems marketed and described in prior patents to the assignee hereof are capable of detecting and correcting a wider range of faults. U.S. Pat. No. 4,750,177, for example, discloses a fault-tolerant digital data processor having a first functional unit, such as a central processor, with duplicate processing sections coupled to a common system bus for identically processing signals received from other functional units, such as the memory or peripheral device controller. In addition to checking the received data, the first functional unit compares the output generated by the sections while one of them--the so-called "drive" section--transmits processed data to the bus. When the section outputs do not agree, the functional unit drives an error signal onto the bus and takes itself off-line. According to the patent, a functional unit such as a central processor can have a redundant partner that is constructed and operates identically to the original. In such a configuration, if one of the partners is taken off-line due to error, processing continues with the partner.
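By way of illustration only, the self-checking and partnering behavior summarized above might be modeled in software as follows. This is a simplified sketch and not the circuitry of the referenced patent: the names FunctionalUnit, section_a, section_b and run_pair are hypothetical, the two sections are modeled as ordinary functions, and the partner is consulted only after the first unit goes off-line, whereas partnered hardware units would operate concurrently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FunctionalUnit:
    """Hypothetical model of a unit with duplicate processing sections."""
    name: str
    section_a: Callable[[int], int]  # stands in for the "drive" section
    section_b: Callable[[int], int]  # stands in for the checking section
    online: bool = True

    def process(self, value: int) -> Optional[int]:
        # Both sections process the same input identically.
        out_a = self.section_a(value)
        out_b = self.section_b(value)
        if out_a != out_b:
            # Outputs disagree: signal the error by going off-line.
            self.online = False
            return None
        return out_a


def run_pair(primary: FunctionalUnit, partner: FunctionalUnit, value: int) -> int:
    """Return the output of the first on-line unit whose sections agree."""
    for unit in (primary, partner):
        if unit.online:
            result = unit.process(value)
            if result is not None:
                return result
    raise RuntimeError("both partners are off-line")
```

In this sketch, a unit whose sections disagree removes itself from service and the call falls through to its partner, so that processing continues without the faulty unit.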
According to U.S. Pat. No. 4,931,922, also assigned to the assignee hereof, redundant peripheral control units, each with duplicate processing sections, control the latching of data and signaling of errors vis-a-vis data transfers with attached peripheral devices. For this purpose, data signals applied to the peripheral device bus by either the control units or peripheral devices are captured and compared by processing sections within each control unit. The results of those comparisons are shared between the controllers. If the comparisons indicate that the data captured by both control units agrees, then the control units generate a "strobe" signal that causes the data to be latched. If the units do not agree after successive retries, the control units withhold issuance of the "strobe" signal and enter an error-handling state for determining the source of the error.
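The compare, retry and strobe sequence summarized above can be sketched, under stated assumptions, as a short routine. The function name transfer, the capture callables and the retry limit of three are hypothetical and are chosen only for illustration; the referenced patent is not quoted here as specifying them, and each control unit's internal comparison is reduced to a single captured value.

```python
from typing import Callable, Optional, Tuple


def transfer(
    capture_a: Callable[[], int],
    capture_b: Callable[[], int],
    max_retries: int = 3,  # assumed limit, for illustration only
) -> Tuple[bool, Optional[int]]:
    """Return (strobe_issued, latched_data) for one bus transfer."""
    for _ in range(max_retries):
        data_a = capture_a()  # value captured by the first control unit
        data_b = capture_b()  # value captured by the second control unit
        if data_a == data_b:
            # Captures agree: issue the strobe so the data is latched.
            return True, data_a
    # Persistent disagreement: withhold the strobe. The caller would
    # then enter an error-handling state to locate the fault.
    return False, None
```

A caller receiving (False, None) would treat the transfer as failed and begin fault isolation rather than use the data.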
Co-pending, commonly assigned U.S. patent application Ser. No. 07/926,857 discloses, in one aspect, a digital data processing system having dual, redundant processor elements, each including dual, redundant processing sections that operate in lock-step synchronism. Failure detection stages in each section assemble signals generated by that section into first/second and third/fourth groups, respectively. Normally, signals in the first group match those of the third group, while the signals in the second group match those of the fourth group. The groups are routed between the sections along conductor sets separate from the system bus. To detect a fault, each stage compares a respective one of the groups with its counterpart. If either stage detects a mismatch, it signals an error to its respective processing section, which can take the corresponding processor element off-line.
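The group-wise comparison summarized above may be illustrated with the following sketch. It rests on assumptions not drawn from the referenced application: each section's signals are modeled as a list of bits, the two groups are formed by splitting that list in half, and the stage in the first section is taken to check the first group against the third while the stage in the second section checks the second against the fourth. The names split_groups and detect_fault are hypothetical.

```python
from typing import Sequence, Tuple


def split_groups(signals: Sequence[int]) -> Tuple[tuple, tuple]:
    """Assemble one section's signals into two groups (assumed: halves)."""
    mid = len(signals) // 2
    return tuple(signals[:mid]), tuple(signals[mid:])


def detect_fault(
    section1_signals: Sequence[int],
    section2_signals: Sequence[int],
) -> Tuple[bool, bool]:
    """Return (stage1_error, stage2_error) for one lock-step cycle."""
    first, second = split_groups(section1_signals)
    third, fourth = split_groups(section2_signals)
    # Each stage compares one group with its counterpart from the
    # other section; in normal operation both comparisons match.
    stage1_error = first != third
    stage2_error = second != fourth
    return stage1_error, stage2_error
```

If either returned flag is true, the corresponding section would be notified and could take its processor element off-line.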
Although the methods and apparatus described in the above-mentioned patents and patent applications provide higher degrees of fault detection and fault tolerance than previously attained, still further improvement in this regard is desirable.
An object of this invention, therefore, is to provide digital data processing apparatus and methods having still greater fault-detecting and fault-tolerant capacity than the prior art.
A further object of the invention is to provide apparatus and methods for fault detection that may be readily implemented without consuming excessive hardware resources such as chip pins or board conductors.
A related object is to provide such apparatus and methods as can be implemented in high-speed hardware elements, such as ASIC's.
Another object is to provide such apparatus and methods as can be exercised without incurring excessive signal processing delays.
Still another object of the invention is to provide apparatus and methods for fault detection as can be readily used in fault tolerant processing configurations.