This invention pertains to fault-tolerant computer systems and particularly to dynamically reconfigurable voting systems.
As computers take on increasingly critical roles in society, simply raising the quality of components and improving design techniques cannot provide adequate computer reliability to meet established needs. Computers must be able to continue functioning when failures are encountered. In today's computer parlance, this translates into computer fault tolerance.
Fault tolerance is essential in environments that require on-line continuous processing, such as banking, finance, telecommunications, air-traffic control, military applications, and retailing. Since the start of the last decade, commercial computer manufacturers have devised a variety of architectures to support fault tolerance.
With the advent of microprocessors, computer systems have moved from the clean environments of computer rooms to industrial environments. In a harsher environment, the environmental variables of a computer system, e.g., temperature, humidity, primary power supply, and electromagnetic interference, fluctuate dramatically. These factors will likely induce computer failures.
A second reason for computer failure that is commonly cited by computer industry experts is inadvertent user abuse, which results from the proliferation of computers and the lowered level of computer literacy among average computer users.
Thirdly, as computer systems grow ever larger, there are more components in the system and, consequently, the probability of system failure due to the malfunction of any one component increases.
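The effect of component count can be illustrated with a minimal sketch. It assumes, purely for illustration, that every component fails independently with the same probability and that the system fails when any one component fails; neither the function name nor the sample figures come from this description.

```python
def system_failure_probability(p_component: float, n_components: int) -> float:
    """Probability that at least one of n components fails.

    Assumption (not from the description): components fail independently,
    each with the same probability p_component, and a single component
    failure is enough to fail the whole system.
    """
    return 1.0 - (1.0 - p_component) ** n_components


if __name__ == "__main__":
    # Illustrative per-component failure probability of 0.1 percent.
    for n in (10, 100, 1000):
        print(n, system_failure_probability(0.001, n))
```

Under these assumptions the system failure probability rises steadily with the number of components, which is the trend the paragraph above describes.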
In a computer system, failures can originate at several levels that parallel the levels of a digital computer system. The first is the circuit level. This level consists of such components as resistors, capacitors, inductors, and power sources. The second level consists of logic gates and data operators built out of gates. Logic gates can be further divided into two categories: combinational and sequential. The next level is the program level. The program level is unique to digital computers. At this level a sequence of instructions in the device is interpreted and causes action upon a data structure. This is the Instruction Set Processor (ISP) sublevel. The ISP description is used in turn to create software components that programmers can manipulate easily; this is the high-level-language sublevel. The result is software, such as operating systems, runtime systems, application programs, and application systems. The last level in a computer system is the processor-memory-switch (PMS) level. This level includes the input/output devices, memories, mass storage, communications, and processors. They are all interconnected to form a computer system.
The sources of computer failures can be categorized in the following way:
Failure. Physical change in hardware.
Fault. Erroneous state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design.
Error. Manifestation of a fault within a program or data structure. The error may occur some distance from the fault site.
Permanent. Describes a failure, fault, or error that is continuous and stable. In hardware, permanent failure reflects an irreversible physical change. The word hard is used interchangeably with permanent.
Intermittent. Describes a fault or error that is only occasionally present due to unstable hardware or varying hardware or software states.
Transient. Describes a fault or error resulting from temporary environmental conditions. The word soft is used interchangeably with transient.
The relationship of these classes of computer error is illustrated in FIG. 1.
A fault-tolerance strategy draws on the following elements:
Error detection, masking, and correction. Errors are detected by comparing data from redundant hardware components. Faulty data are dynamically corrected. The system must also be able to prevent the propagation of faulty data across the defined boundaries.
Diagnosis. This is the process in which the computer system identifies the faulty hardware components (e.g., module, data path, section on logic board, etc.).
Repair/reconfiguration. A system is repaired either by replacing the failed module with a spare or by reconfiguring the system structure or work load distribution to circumvent the module. Replacement may be "hot" or "cold". A hot spare concurrently performs the same operations as the module it is to replace, needing no initialization when it is switched into the system. A cold spare is either not powered or used for other tasks, requiring initialization when switched into the system.
Recovery. When a fault occurs in the computer system, corrections must be made to restore the system to a state from which operation can continue.
There are two important measures of the fault-tolerance capability of a computer system. The first is availability. Availability is a function of time: it is the probability that the system is operational at a given instant of time t. The second measure is reliability. It is the conditional probability that the system has survived the interval [0, t], given that it was operational at time t=0. Reliability is used to describe systems in which repair cannot take place (as in satellite computers), in which the computer is serving a critical function and cannot be lost even for the duration of a repair (as in flight computers on aircraft), or in which repair is prohibitively expensive.
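The two measures can be made concrete with a short sketch. It relies on two common modeling assumptions that this description does not itself state: a constant failure rate, which gives an exponential reliability curve, and the steady-state form of availability expressed through mean time to failure (MTTF) and mean time to repair (MTTR). The function names and sample figures are illustrative only.

```python
import math


def reliability(failure_rate: float, t: float) -> float:
    """Probability of surviving [0, t] given operation at t = 0.

    Assumption: constant failure rate (exponential model).
    """
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Long-run fraction of time the system is operational.

    Assumption: the system alternates between operating periods of mean
    length mttf and repair periods of mean length mttr.
    """
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    # Hypothetical figures: one failure per 10,000 hours, 1,000-hour mission.
    print(reliability(1e-4, 1000.0))
    # Hypothetical figures: 5,000 hours between failures, 2 hours to repair.
    print(steady_state_availability(5000.0, 2.0))
```

Reliability ignores repair altogether, which is why it suits the unrepairable or mission-critical systems named above, whereas availability credits the system for every repair that returns it to service.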
Aside from the apparent benefit of having a computer system that can function without interruption in the event of a failure, owning a fault-tolerant computer system is economically sound. The cost of a computer system is not limited to initial purchase; significant costs recur during the life of a system. A fault-tolerant computer system not only reduces the cost of maintenance but also significantly reduces the cost of downtime.
Software-fault tolerance and hardware-fault tolerance share many common features, such as redundancy and self-checking, but differ in focus. Software-fault tolerance aims at eliminating computer errors due to faulty software design and programming. Hardware-fault tolerance aims at eliminating computer errors due to hardware failures.
Both experimental and real-life fault tolerant systems have begun to use design diversity to tolerate software faults. Such systems focus strongly on design faults, where the term "design" encompasses everything from system requirements to realization during both initial production and future modifications. Design faults are a source of common-mode failures, which defeat fault-tolerant strategies based on strict replication and generally have catastrophic consequences.
In a diversified design, the different systems produced from a common service specification are called variants. A diversified design has at least two variants plus a decider, which monitors the results of variant execution, given consistent initial conditions and inputs. The common specification must explicitly address the decision points, that is, it must state when to make decisions and what data to base them on.
The best-documented techniques for tolerating software design faults are the recovery block (RB) approach and N-version programming (NVP). In the first approach, the variants are called alternates and the decider is an acceptance test, which is applied sequentially to the alternates' results. If the results of the primary alternate do not satisfy the acceptance test, the secondary alternate executes. In the second approach, the variants are called versions, and the decider is a vote based on all versions' results.
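The two deciders can be sketched as follows. This is a minimal illustration, not an implementation taken from this description: the function names, the use of a strict majority as the voting rule, and the integer square root example are all assumptions, and the alternates are pure functions so that no state has to be restored before the next alternate runs.

```python
import math
from collections import Counter
from typing import Callable, Sequence


def recovery_block(alternates: Sequence[Callable[[int], int]],
                   acceptance_test: Callable[[int, int], bool],
                   x: int) -> int:
    """Run alternates in order and return the first accepted result."""
    for alternate in alternates:
        try:
            result = alternate(x)
        except Exception:
            continue  # a crashing alternate counts as a failed alternate
        if acceptance_test(x, result):
            return result
    raise RuntimeError("every alternate failed the acceptance test")


def n_version(versions: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every version and return the strict-majority result."""
    results = []
    for version in versions:
        try:
            results.append(version(x))
        except Exception:
            pass  # a crashing version simply casts no vote
    if not results:
        raise RuntimeError("no version produced a result")
    value, count = Counter(results).most_common(1)[0]
    if 2 * count <= len(versions):
        raise RuntimeError("no majority among the versions")
    return value


def is_integer_sqrt(n: int, r: int) -> bool:
    """Acceptance test: r is the integer square root of n."""
    return r >= 0 and r * r <= n < (r + 1) * (r + 1)


def buggy_sqrt(n: int) -> int:
    """Deliberately faulty variant used to exercise the deciders."""
    return n // 2


if __name__ == "__main__":
    print(recovery_block([buggy_sqrt, math.isqrt], is_integer_sqrt, 16))
    print(n_version([math.isqrt, lambda n: int(n ** 0.5), buggy_sqrt], 16))
```

In the recovery block the secondary alternate runs only after the primary is rejected, whereas N-version programming always pays for every version but needs no acceptance test tailored to the application.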
Most real-life systems implement neither the recovery block approach nor N-version programming. They are based instead on self-checking software, which consists of either one variant and an acceptance test or two variants and a comparison algorithm.
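Both forms of self-checking component can be sketched in a few lines. The names below, and the choice to signal a detected error by raising an exception, are assumptions made for illustration; a self-checking component only detects the error, and what the surrounding system does next is left open here.

```python
from typing import Callable


class SelfCheckError(Exception):
    """Raised when a self-checking component detects an error."""


def with_acceptance_test(variant: Callable[[int], int],
                         acceptance_test: Callable[[int, int], bool]
                         ) -> Callable[[int], int]:
    """Self-checking component built from one variant and an acceptance test."""
    def component(x: int) -> int:
        result = variant(x)
        if not acceptance_test(x, result):
            raise SelfCheckError(f"result {result!r} rejected for input {x!r}")
        return result
    return component


def with_comparison(variant_a: Callable[[int], int],
                    variant_b: Callable[[int], int]
                    ) -> Callable[[int], int]:
    """Self-checking component built from two variants and a comparison."""
    def component(x: int) -> int:
        a, b = variant_a(x), variant_b(x)
        if a != b:
            raise SelfCheckError(f"variants disagree: {a!r} != {b!r}")
        return a
    return component


if __name__ == "__main__":
    doubled = with_comparison(lambda x: 2 * x, lambda x: x + x)
    print(doubled(21))
    checked = with_acceptance_test(abs, lambda x, r: r >= 0)
    print(checked(-7))
```

Unlike the recovery block or N-version programming, neither form masks the error by itself: the comparison form cannot tell which of its two variants is wrong.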
In implementing design diversity, there are two main considerations. The first is the number of variants. Aside from economic considerations, the number of variants for a given software fault-tolerance method is directly related to the number of faults to be tolerated. The soft or solid nature of the software faults significantly affects the architecture only when it must tolerate more than one fault. Also note that an architecture tolerating a solid fault can also tolerate an infinite sequence of soft faults, provided there are no fault coincidences.
The second consideration is the level of fault-tolerance application. The level of application involves two questions: How far should the system be decomposed into components to be diversified? And which layers (application software, executive, hardware) must be diversified? The answer to the first question involves a trade-off between two opposing considerations: smaller components allow a better mastering of the decision algorithms, but larger components aid diversity. In addition, the decision points are "non-diversity" points (and synchronization points for N-self-checking programming (NSCP) and NVP); as such, they must be limited. Decision points are necessary only for interactions with the environment (sensor data acquisition, delivering orders to actuators, operator interaction, etc.). However, performance considerations could prompt additional compromises.
Software-fault tolerance primarily deals with faults that arise from errors in software design and programming. Hardware-fault tolerance deals with errors caused by hardware design flaws or circuit failures. Fault tolerance can be achieved through redundancy in hardware, software, information, and/or computations. A fault-tolerance strategy includes one or more of the elements described above: error detection, masking, and correction; diagnosis; repair/reconfiguration; and recovery.
The spare-and-pair implementation resorts to hardware redundancy to achieve fault tolerance. By comparing critical signals from two identical hardware components, the system detects errors and removes faulty components from the system.
FIG. 2 illustrates a typical implementation of this scheme. In a Stratus On-Line Continuous Processing System, all hardware components are mirrored, meaning that two identical components execute the same function. In FIG. 2, there are two CPU boards, CPU1 and CPU2. On each board, there are two identical data paths, called the A Side and the B Side, each with its own processor. Each side receives inputs from its own bus and drives its own bus. Each bus is the wired-OR of one half of each board. On each board, the two sides are constantly compared with each other. In the case of a disagreement, the red light is turned on and the board is removed from the system. In that event, the other board becomes the sole CPU board in the system.
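The behavior described for FIG. 2 can be modeled with a minimal software sketch. It is not a description of the Stratus hardware: the class names, the modeling of each side as a function, and the simplification of the wired-OR buses to taking the output of the first board still online are all assumptions made for illustration.

```python
from typing import Callable, List, Optional


class Board:
    """A CPU board with two identical sides that are compared on every step."""

    def __init__(self, name: str,
                 side_a: Callable[[int], int],
                 side_b: Callable[[int], int]) -> None:
        self.name = name
        self.side_a = side_a
        self.side_b = side_b
        self.online = True
        self.red_light = False

    def step(self, x: int) -> Optional[int]:
        a, b = self.side_a(x), self.side_b(x)
        if a != b:
            # Disagreement: turn on the red light and leave the system.
            self.red_light = True
            self.online = False
            return None
        return a


class DuplexCpu:
    """Two mirrored boards; the system runs while at least one is online."""

    def __init__(self, boards: List[Board]) -> None:
        self.boards = boards

    def step(self, x: int) -> int:
        outputs = []
        for board in self.boards:
            if not board.online:
                continue
            out = board.step(x)
            if board.online:
                outputs.append(out)
        if not outputs:
            raise RuntimeError("no CPU board left online")
        # Simplification: boards still online agree, so take the first output.
        return outputs[0]


def good(x: int) -> int:
    return x + 1


def faulty(x: int) -> int:
    """Hypothetical side that produces a wrong value for one input."""
    return x + 2 if x == 3 else x + 1


if __name__ == "__main__":
    cpu1 = Board("CPU1", good, faulty)
    cpu2 = Board("CPU2", good, good)
    system = DuplexCpu([cpu1, cpu2])
    print([system.step(x) for x in range(1, 6)])
    print(cpu1.online, cpu1.red_light, cpu2.online)
```

In this sketch the faulty board removes itself on the step at which its two sides first disagree, and the remaining board supplies the output for that same step, so the caller sees no interruption.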
The objective of this invention is to provide an improved fault-tolerant computer.