1. Field of Use
The present invention relates to data processing systems and more particularly to fault tolerant systems.
2. Prior Art
As data processing systems become entrusted with increasingly critical tasks requiring high dependability, the need for such systems to be fault tolerant increases. The goal of a fault tolerant design has been stated as one to improve dependability by enabling a system to perform its intended function in the presence of a given number of faults.
Fault tolerance in data processing systems has been achieved through redundancies in hardware, software, information and/or computations. A fault tolerance strategy is considered to include one or more of the following elements: masking (dynamic correction of generated errors); detection of errors (symptoms of faults); containment (prevention of error propagation across defined boundaries); diagnosis (identification of the faulty module responsible for a detected error); repair/reconfiguration (elimination or replacement of a faulty component or a mechanism for bypassing it); and system recovery (correction of the system to a state acceptable for continued operation).
Recovery is defined as the continuation of system operation with data integrity after an error or fault occurs. Some consider dynamic recovery essential to a fault tolerant data processing system if information integrity is to be preserved as a consequence of minimum error propagation.
Dynamic recovery has employed self-testing and dynamically checked logic circuits, most specifically in the form of interface error checkers. Dynamic hardware retry has been used to overcome the effects of transient hardware faults transparently to the rest of the system.
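The retry principle described above can be illustrated with a minimal software sketch. It is only an analogy for hardware retry, not a description of any particular prior art circuit. The names `DetectedError`, `run_with_retry`, `make_flaky_operation` and the `max_retries` parameter are hypothetical and introduced solely for illustration.

```python
class DetectedError(Exception):
    """Hypothetical signal raised when a checker detects an error."""


def run_with_retry(operation, max_retries=3):
    """Re-run `operation` until it completes without a detected error.

    Returns a (result, attempts) pair. If the error is still present
    after `max_retries` retries, the exception is re-raised, which
    models a fault that is not transient and needs other handling.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return operation(), attempts
        except DetectedError:
            # The first attempt is not a retry, so allow
            # max_retries additional attempts before giving up.
            if attempts > max_retries:
                raise


def make_flaky_operation(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise DetectedError("checker flagged an error")
        return "ok"

    return operation
```

In this sketch the caller sees only the final result, so a fault that clears within the retry budget is hidden from it, which loosely mirrors retry being transparent to the rest of the system. A fault that outlasts the budget surfaces as an exception.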
One approach in implementing dynamic recovery has involved the use of a recovery control unit (RCU). The RCU is used to analyze error signals, sometimes to perform rapid hardware diagnosis and sometimes to initiate interrupts for additional program aid such as diagnostics. As such, the RCU provides the basis for dynamic interaction between hardware and software for recovery. This approach is described in greater detail in the article entitled "Logic Design for Dynamic Interactive Recovery" by W. C. Carter, et al., published in the November, 1971 issue of IEEE Transactions on Computers, Vol. C-20, No. 11, pages 1300-1305.
There have been rapid changes in computer architectures, with increased integration in VLSI devices and extensive use of pipelining techniques. These have resulted in significantly more complex designs involving the use of VLSI customized chips in implementing the different pipelined stages of a processing unit which execute several instructions in parallel at high speed.
This complexity and speed have made it exceedingly difficult to implement a fault tolerance strategy because of difficulties in defining interface boundaries and in the detection of faults without the introduction of substantial duplication of hardware.
Faults are generally classified in terms of their duration, nature and extent. The duration of a fault can be transient, intermittent or permanent. A transient fault, often the result of external disturbances, exists for a finite length of time and is nonrecurring. A system with an intermittent fault oscillates between faulty and fault-free operation, which usually results from marginal or unstable (metastable) device operation. Permanent or so-called "hard" faults are device conditions that do not change or correct with time. They result from component failures, physical damage or design errors. It has been observed that intermittent faults (so-called soft faults) typically occur with greater frequency than "hard" faults. Such faults are more difficult to detect, since they may disappear after producing errors.
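The duration classes above can be made concrete with a short sketch that labels a sequence of observations. It rests on simplifying assumptions that do not come from the cited literature: the function name `classify_fault_duration`, the boolean observation format and the decision rule are all hypothetical. In particular, "permanent" here only means that the error was still present at the end of the observation window.

```python
from itertools import groupby


def classify_fault_duration(observations):
    """Label a fault's duration from a list of booleans.

    Each entry is True if an error was observed in that interval.
    Illustrative rule (an assumption, not a standard definition):
      - no errors at all                      -> "none"
      - several separate error runs           -> "intermittent"
      - one error run lasting to the end      -> "permanent"
      - one error run that ends and stays off -> "transient"
    """
    # Collapse consecutive equal observations into runs and keep
    # only the runs in which an error was observed.
    error_runs = [flag for flag, _ in groupby(observations) if flag]
    if not error_runs:
        return "none"
    if len(error_runs) > 1:
        return "intermittent"
    # Exactly one error run: check whether it was still active
    # when observation stopped.
    return "permanent" if observations[-1] else "transient"
```

For example, under these assumptions `[False, True, False, True, False]` is labeled intermittent because the error appears, disappears and reappears, which is also why such faults are hard to catch with a single check.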
As mentioned, all of the above-discussed changes in computer architectures make it very difficult to implement recovery strategies for various types of faults without adding to design complexity. These difficulties are further compounded in the case of tightly coupled multiprocessor systems containing several VLSI processing units, particularly when these units are microprogrammed or firmware controlled. One such multiprocessor system responds to certain errors, such as firmware parity errors or shadow chip miscomparison errors, by causing the processing unit to come to a halt. This approach has provided little or no opportunity for the software to recover. Further, halting the processor in such instances could result in the system being placed in an unrecoverable state.
For further information regarding fault classifications and certain commercial computers incorporating fault tolerance strategies, reference may be made to the articles entitled "Fault-Tolerant Computing: Fundamental Concepts," by Victor P. Nelson and "Fault Tolerance in Commercial Computers," by Daniel P. Siewiorek, both of which appeared in the July, 1990 issue of "Computer" published by IEEE.
Accordingly, it is a primary object of the present invention to provide a method and apparatus for carrying out a fault tolerance strategy which does not increase the complexity of a computer architecture.
It is a more specific object of the present invention to provide a method and apparatus for enabling reliable recovery from faults occurring within the stages of a pipelined data processing unit subsystem of a multiprocessor system.