1. Field of the Invention
The present invention relates to computer systems. More particularly, the present invention relates to fault-tolerant computer systems using multiple CPUs.
2. Background Information
Highly reliable digital processing is of critical importance in a number of applications, ranging from industrial process control, aerospace, and other applications involving human safety to banking and financial services. Increased reliability has been achieved in various computer architectures by employing redundancy. For example, triple modular redundancy (TMR) systems provide high reliability by employing three central processing units (CPUs) executing the same instruction stream, with the respective CPU outputs being cross-checked by "voting" hardware or software.
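By way of illustration only, the cross-checking of three CPU outputs may be sketched in software as a simple majority vote. The function name `tmr_vote` and its return convention are assumptions introduced for this sketch and are not taken from any particular TMR system; actual voting hardware or software may differ considerably.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def tmr_vote(outputs: Sequence[Hashable]) -> Tuple[Optional[Hashable], List[int]]:
    """Majority-vote three CPU outputs (hypothetical helper).

    Returns (agreed value, indices of CPUs that disagreed with it).
    If all three outputs differ, there is no majority: the value is None
    and all three CPUs are reported as suspect.
    """
    if len(outputs) != 3:
        raise ValueError("TMR voting expects exactly three outputs")
    value, count = Counter(outputs).most_common(1)[0]
    if count < 2:
        return None, [0, 1, 2]
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters


if __name__ == "__main__":
    # CPU 1 has produced a divergent I/O request and is outvoted.
    print(tmr_vote([0x1F40, 0x1F41, 0x1F40]))
```

In this sketch a single faulty CPU is both masked (the majority value is still returned) and identified (its index is reported), which is the property that makes the redundancy useful.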
In such TMR systems, voting circuitry or software typically compares the input/output (I/O) requests from each CPU. For example, when the CPUs request access to a memory which is external to the CPUs, i.e., clocked separately from the local CPU clocks, voting may be performed. Also, in general, whenever CPUs need to access devices clocked separately therefrom, synchronization is necessary in order to allow CPU output comparison and thus to ensure fault tolerance. Such synchronization and voting operations, while important in providing a high degree of fault tolerance to the overall system, have a significant impact on the system performance. Thus, the more frequently a fault tolerant computer system has to access external devices, such as external memory, and perform synchronization and voting operations, the slower the overall performance of the system.
The frequency of access by CPUs to an external memory in a fault tolerant computer system may be reduced by employing a local memory for each CPU which is synchronous to its local CPU. Such local memory may store the operating system kernel as well as some of the most frequently accessed data, thereby reducing the number of occasions where the CPU must access external memory and undergo voting and synchronization operations. To provide the desired combination of memory capacity, cost and speed, a dynamic random access memory (DRAM) is preferably employed for such local memory.
A problem is introduced by the use of DRAMs for local memories, however, because such dynamic memories must be periodically refreshed. That is, currently conventional DRAM designs require a periodic refresh of the charge held in their memory cells to prevent loss of data through charge leakage. The refresh requirement is specified for any given commercially available DRAM design as a certain number of refresh cycles within a predetermined period of time, usually measured in milliseconds. The specific timing of the refreshes for any given DRAM is not predetermined, however, as long as the required number of refreshes always occurs within the predetermined interval of time. The refreshes are typically triggered by a refresh timer which issues refresh requests sufficiently frequently to ensure that the DRAM specifications for refreshes are met.
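The relationship between a DRAM refresh specification and the refresh timer setting may be illustrated by a short calculation. The figures used below (512 refresh cycles per 8 ms, a 25 MHz timer clock) and the function names are assumptions chosen purely for illustration; the applicable values must be taken from the data sheet of the particular DRAM employed.

```python
import math


def max_refresh_period_us(refresh_cycles: int, window_ms: float) -> float:
    """Longest allowable average spacing between refresh requests, in microseconds."""
    if refresh_cycles <= 0 or window_ms <= 0:
        raise ValueError("refresh_cycles and window_ms must be positive")
    return window_ms * 1000.0 / refresh_cycles


def timer_reload_ticks(refresh_cycles: int, window_ms: float, clock_mhz: float) -> int:
    """Largest whole number of timer clock ticks between requests that still
    meets the specification (rounded down so refreshes are never too sparse)."""
    return math.floor(max_refresh_period_us(refresh_cycles, window_ms) * clock_mhz)


if __name__ == "__main__":
    # Hypothetical device: 512 refresh cycles required every 8 ms.
    print(max_refresh_period_us(512, 8.0))      # spacing in microseconds
    print(timer_reload_ticks(512, 8.0, 25.0))   # ticks of an assumed 25 MHz clock
```

Rounding the tick count down makes the timer fire slightly more often than strictly required, which keeps the sketch on the safe side of the specification.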
Refresh requests based on a fixed refresh timer could thus introduce refreshes in a manner which is asynchronous between independently clocked CPUs in a fault tolerant computer system. Such asynchronous refreshes can introduce divergences between the CPUs in a multiple CPU fault tolerant system. Additional hardware and/or software is required to take into account these asynchronous refreshes between the CPUs to prevent possible errors when I/O requests from the CPUs are voted.
In addition to the problem of possible divergence between CPUs due to asynchronous local memory refreshes in each CPU, the refreshes have a negative impact on system performance. In particular, refresh of local memory may use 1% to 3% of all possible memory cycles. This results in a reduction in system speed, since the local processor must stall waiting for the refresh to complete when a local memory access is requested during a refresh cycle. Additionally, since the number of refresh cycles may differ between CPUs, and the number of stall cycles for the processor in the respective CPUs may thus also differ, the local memory refreshes may increase the rate at which the CPUs drift out of real-time synchronization. This in turn may require more frequent synchronization of the CPUs, thereby offsetting the gains provided by the local memory in reduction of voting and synchronization frequency.
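The manner in which asynchronous refreshes cause CPUs executing the same instruction stream to drift apart may be illustrated by a simplified cycle-count model. The model, the function name `run_with_refresh`, and all numerical values below are assumptions made for this sketch; in particular, the refresh duty cycle is greatly exaggerated relative to the 1% to 3% figure noted above so that the effect is visible over a few accesses.

```python
from typing import Iterable, Tuple


def run_with_refresh(gaps: Iterable[int], period: int, length: int, phase: int) -> Tuple[int, int]:
    """Simplified model of one CPU with a refreshed local memory.

    gaps   : compute cycles preceding each local memory access
    period : cycles between the starts of successive refreshes
    length : cycles each refresh occupies the memory
    phase  : offset of this CPU's refresh timer (differs between CPUs)

    Returns (total elapsed cycles, total stall cycles).
    """
    t = 0
    stalls = 0
    for gap in gaps:
        t += gap
        offset = (t - phase) % period   # position within the current refresh period
        if offset < length:             # access collides with a refresh in progress
            wait = length - offset
            stalls += wait
            t += wait
        t += 1                          # the memory access itself (one cycle assumed)
    return t, stalls


if __name__ == "__main__":
    stream = [3] * 10  # identical instruction stream on both CPUs
    cpu_a = run_with_refresh(stream, period=10, length=2, phase=0)
    cpu_b = run_with_refresh(stream, period=10, length=2, phase=5)
    print(cpu_a, cpu_b)  # elapsed cycles and stalls differ between the two CPUs
```

Under these assumptions the two CPUs finish the same instruction stream at different cycle counts solely because their refresh timers are out of phase, which is the drift that additional synchronization would then have to correct.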
Accordingly, it will be appreciated that a need presently exists for a high speed fault tolerant computer system which has reduced need for synchronization and/or voting operations. More particularly, it will be appreciated that a need presently exists for a fault tolerant computer system which can exploit the use of a local memory in each CPU module to a maximum degree relative to speed and synchronization overhead.