1. Field of the Invention
The present invention relates to a fault-tolerant computer system with emphasis on the connection between central processing units (CPUs) and input/output adapters. More particularly, the invention relates to connection controls on the side of input/output adapters in a dual-structure fault-tolerant computer system, wherein not only the input/output adapters but also processors and memories are furnished on a duplex basis.
2. Description of the Related Art
Recent years have seen computer technology gain widespread use. In particular, traffic control systems, banking systems, and other critical infrastructure are now supported by computers. Because computers are deployed so ubiquitously, the failure of any one of them can seriously disrupt the functioning of society. Given the potential for such far-reaching adverse effects, computers are required to ensure ever-higher reliability.
Sustained demands for enhanced computer reliability exist notably in the field of electronic control. Such demands have been met in part by multiple-computer systems such as the one disclosed in Japanese Patent Laid-Open No. Sho 57-20847. The disclosed system has a plurality of computers perform the same calculations and compares their results at the point of data output so that only the correct results are output. This presupposes that output timings are synchronized by software for comparison. The proposed scheme is suitable for use in control systems of relatively small scale; it cannot be applied to today's complicated, large-scale application programs because comparing data while such programs run requires enormous effort.
Recently, however, a number of proposals regarding fault-tolerance technology have been made for data comparison based primarily on hardware. One scheme of fault-tolerance technology is disclosed illustratively in U.S. Pat. Nos. 5,317,726 and 5,384,906. As its precondition, the scheme involves having typically three identical CPUs execute the same command stream and deciding, by majority, the results of the command execution. In connection with this scheme, when the processors taking part in the majority-based decision operate on independent clocks, appropriate measures are needed to synchronize these processors in operation.
Traditionally, multi-processor computers have been used extensively to meet the demand for higher processing performance. A typical fault-tolerance technique used by the multi-processor computer is the so-called pair-and-spare method. The method, described illustratively in Nikkei Electronics, May 9, 1983, pp. 197-202, involves the use of a pair of wired boards loaded with memories having self-diagnostic functions, and with processors operating in cooperation. With this method, a fault that may occur in one of the two wired boards is bypassed by the circuits on the other board, which keeps functioning. Because the operation continues even in the event of a fault, there is no need to execute a check point restart, in which the processing is restarted from a suitable check point preceding the point in time at which the fault occurred.
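The pair-and-spare behavior described above may be pictured with the following minimal sketch. It is an illustration only and is not taken from the cited article: the class `SelfCheckingBoard`, the function `run_pair`, and the board names are hypothetical, and the sequential loop is a simplification of hardware in which both boards operate at the same time.

```python
class SelfCheckingBoard:
    """Hypothetical wired board that knows whether its self-diagnosis passed."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def execute(self, operation, operand):
        # A board that failed self-diagnosis refuses to deliver a result.
        if not self.healthy:
            raise RuntimeError(f"{self.name} failed self-diagnosis")
        return operation(operand)


def run_pair(boards, operation, operand):
    """Return (result, board name) from the first board still functioning."""
    for board in boards:
        try:
            return board.execute(operation, operand), board.name
        except RuntimeError:
            continue  # bypass the faulty board; no check point restart occurs
    raise RuntimeError("both boards have failed")


primary, partner = SelfCheckingBoard("board-A"), SelfCheckingBoard("board-B")
primary.healthy = False  # inject a fault into one board of the pair
result, served_by = run_pair([primary, partner], lambda x: x + 1, 41)
```

In this sketch the operation completes on the surviving board without being restarted from an earlier point, which is the property the method relies on.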
Another example of fault-tolerance technology is a dual rail processor disclosed in U.S. Pat. Nos. 4,907,228 and 5,255,367. The disclosed dual rail processor constitutes a fault-tolerant computer system comprising two processors having data paths extending therefrom (i.e., as a dual rail). Shared resources such as a memory are connected to the paths. At the entry to the shared resources are a pair of basic data processors capable of detecting an error by comparing signals from two data buses. An input/output adapter shared by the two data processors has error detecting means for detecting errors at the entry.
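The comparison performed at the entry to a shared resource may be sketched as follows. This is an assumption-laden illustration rather than the circuit of the cited patents: `checked_entry`, `guarded_write`, `RailMismatch`, and the dictionary standing in for a shared memory are all hypothetical names.

```python
class RailMismatch(Exception):
    """Raised when the two rails disagree at the entry to a shared resource."""


def checked_entry(rail_a, rail_b):
    """Forward a value to the shared resource only if both rails agree."""
    if rail_a != rail_b:
        raise RailMismatch(f"rail A={rail_a:#x}, rail B={rail_b:#x}")
    return rail_a


shared_memory = {}  # stand-in for a shared resource such as a memory


def guarded_write(address, rail_a, rail_b):
    """Write to the shared memory through the comparison at its entry."""
    shared_memory[address] = checked_entry(rail_a, rail_b)


guarded_write(0x10, 0xBEEF, 0xBEEF)      # rails agree: the write is accepted
try:
    guarded_write(0x14, 0xBEEF, 0xBEEE)  # rails differ: the write is rejected
    mismatch_detected = False
except RailMismatch:
    mismatch_detected = True
```

Comparing two rails detects that an error occurred but does not by itself indicate which rail is wrong.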
A further example of fault-tolerance technology is a computer system disclosed in Japanese Patent Laid-Open No. Hei 4-241039. The disclosed computer system involves the use of a number of wired boards (i.e., replacement units) comprising processor units (BPUs) each equipped with a fault-tolerance function. If a fault occurs in a BPU during operation, its fault-tolerance function maintains normal operation until the next "appropriate" point (called a check point hereunder for convenience) at which the faulty BPU is taken over by another BPU. Check points are established illustratively at points of task changeover. The components making up each BPU are furnished in a multiple (i.e., redundant) structure so that a component becoming defective in the BPU is compensated for by a combination of the normal components, enabling normal operation to continue up to the next check point. Cache memories whose faults may be detected by parity check are provided in a dual structure. If one of the memories fails, the other normal memory takes over. If general-purpose MPUs are incorporated, they cannot be equipped with self-checking functions. In such cases, the MPUs are furnished in a triple or quadruple structure so that the output signals from these units are compared to select normal units.
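The selection of normal units by comparing the outputs of MPUs in a triple or quadruple structure may be sketched as a majority vote. The function `majority_vote` and the sample values are hypothetical; the disclosure cited above does not specify the comparison logic, so this is only one plausible reading.

```python
from collections import Counter


def majority_vote(results):
    """Return (agreed value, indexes of units that disagreed with it).

    Raises ValueError when no value is held by a strict majority,
    for example a two-against-two split in a quadruple structure.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among unit outputs")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters


# Three hypothetical MPUs produced outputs; unit 2 delivered a wrong value.
value, faulty_units = majority_vote([0x2A, 0x2A, 0x2B])
```

The list of dissenting indexes is what allows the faulty unit to be excluded while the remaining normal units carry on.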
As described, if a BPU on a replacement-unit wired board develops an internal fault, the normal processing is still allowed to continue until the next check point is reached. This avoids the deterioration in performance attributable to the conventional practice of preserving the check point status in preparation for a later check point restart upon fault. In addition, the absence of paired BPUs eliminates the need for signal lines that are conventionally necessary for clock synchronization between different BPUs. With no clock synchronization required between BPUs, the clock rate can be raised. Since the MPUs constituting replacement units operate using the same clock signal, no specific operations are required to synchronize the MPUs in operation, which further helps to sustain processing performance.
The conventional techniques outlined above are designed to constitute processors and memories, the minimum environment for executing software, in a multiple structure such that, if any one of these key components fails in operation, it is disconnected on a hardware basis to ensure an uninterrupted program run. That is, any fault that may occur in the processors or memories remains completely transparent to programs. In that respect, these techniques are important for alleviating the burden of special programming for building a fault-tolerant system.
In an effort to make the input/output arrangement highly reliable, the above pair-and-spare method proposes operating a pair of wired boards comprising self-diagnostic input/output adapters. If a circuit fault occurs in one of the two wired boards, the other board takes over and continues normal processing. With the operation kept uninterrupted despite a fault, there is no need to perform the conventional check point restart of input/output processing, whereby the processing would be restarted at a check point preceding the point in time at which the fault occurred. However, this method requires preparing specialized input/output adapters.
The scheme disclosed in U.S. Pat. Nos. 4,907,228 and 5,255,367 proposes an input/output adapter which, shared by a pair of data processors, has error detecting means for error detection at the entry. When a fault is detected, the characteristic of the fault is verified. If the verification reveals that the operation cannot continue, the input/output adapter is disconnected from the system; if the operation is allowed to continue upon verification, a fault processing routine is executed to restore the input/output adapter from the fault. This scheme also requires the use of a specially designed input/output adapter. One disadvantage of the scheme is that if the input/output adapter is found to be inoperable upon fault, that device is disconnected from the system and the processing can no longer continue from that point on.
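The two-way decision taken after a fault is detected may be sketched as follows. The fault kinds in `RECOVERABLE_KINDS` are invented purely for illustration, and `handle_adapter_fault` and `Action` are hypothetical names; the cited patents do not state how the fault characteristic is classified.

```python
from enum import Enum


class Action(Enum):
    RECOVER = "run fault processing routine"
    DISCONNECT = "disconnect adapter from system"


# Assumed classification: fault kinds after which operation may continue.
RECOVERABLE_KINDS = {"parity", "timeout"}


def handle_adapter_fault(kind):
    """Verify the fault characteristic, then choose the follow-up action."""
    if kind in RECOVERABLE_KINDS:
        return Action.RECOVER
    return Action.DISCONNECT


decision_soft = handle_adapter_fault("parity")
decision_hard = handle_adapter_fault("stuck-bus")
```

The `DISCONNECT` branch is where the disadvantage noted above arises: once the single shared adapter is removed, nothing remains to carry the input/output processing forward.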
The above-described conventional schemes and methods utilize general-purpose processors, but require specially designed peripheral circuits to constitute a multiple CPU arrangement. Compared with ordinary data processors, workstations, and personal computers having the same general-purpose processors, the conventional fault-tolerance techniques are noted for their inevitably high costs and increased overhead in both hardware and software.
Today's general-purpose processors are rapidly rising in performance. The cycle in which ordinary data processors, workstations, and personal computers based on such high-speed processors are developed is getting shorter than ever before. This trend creates a growing gap in cost-performance between ordinary data processors, workstations, and personal computers on the one hand, and fault-tolerant computers using the same processors but requiring special peripheral circuits on the other.
More specifically, the conventional pair-and-spare method and the scheme disclosed in U.S. Pat. Nos. 4,907,228 and 5,255,367 require specially designed hardware to build fault-tolerant input/output adapters. On the other hand, simply combining ordinary input/output adapters with ordinary data processors, workstations, or personal computers does not provide fault tolerance in processors or input/output adapters. In particular, the following problems have been encountered:
(1) Some ultra-high-speed processors used by ordinary data processors, workstations, and personal computers perform I/O access operations asynchronously with respect to I/O access instruction execution. That is, when an I/O access fault is detected and reported, the program can be executing an instruction far ahead of the I/O access instruction that developed the fault.
(2) The input/output adapter used by ordinary data processors, workstations, and personal computers is generally not provided in a dual structure. Nevertheless, two input/output adapters need to be connected illustratively to the data processor as well as to any input/output modules. If a fault occurs in one input/output adapter, that adapter is disconnected as instructed by the data processor, and the other input/output adapter takes over. This setup, however, must always be accompanied by special means that disconnect the faulty input/output adapter from the input/output modules connected thereto, if a fault occurs in an input/output adapter or on the input/output bus.
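Problem (1) above may be made concrete with a small simulation. It is a sketch under stated assumptions: the function `run_program`, the instruction names, and the fixed `io_latency` are hypothetical and model no particular processor. An I/O access is issued at one instruction and its outcome is reported only some instructions later.

```python
from collections import deque


def run_program(instructions, io_latency, faulty_pc):
    """Simulate I/O accesses that complete `io_latency` instructions later.

    Returns (position of the faulting access, position at which the fault
    was reported), or None if no fault is reported before the program ends.
    """
    pending = deque()  # entries are (completion position, issuing position)
    for pc, instr in enumerate(instructions):
        # Report every I/O access whose completion time has arrived.
        while pending and pending[0][0] <= pc:
            _, issued_at = pending.popleft()
            if issued_at == faulty_pc:
                return issued_at, pc
        if instr == "io":
            pending.append((pc + io_latency, pc))
    return None


program = ["io", "add", "add", "add", "add", "add"]
issued_at, reported_at = run_program(program, io_latency=4, faulty_pc=0)
```

Here the access issued at position 0 is reported as faulty only when the program has reached position 4, so simply re-executing the instruction at the reporting point would not repeat the failed access.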