There exists a paradox in contemporary computing systems: they are used to protect various critical infrastructures of our society, such as electrical power, transportation, communication, etc., but they possess no identifiable error-detecting and -correcting infrasystem of their own. Such an infrasystem, or support architecture, should be (1) generic—that is, suitable for a variety of “client” computing systems—and (2) transparent to the client's software, but able to communicate with it, (3) compatible with and able to support other defenses that the client system employs, and (4) fully self-protected by ability to tolerate faults, immune to the client system's faults, including design faults, and to attacks by malicious software.
My present invention is an error-detecting and -controlling architecture with the above properties, imparting resilience to the overall computing system. My invention is preferably implemented entirely in hardware, to maximize system error-controlling dependability.
The following sections discuss conventional methods of noting and controlling errors in high-availability systems, and their shortcomings, and a principle of design I call “the immune system paradigm”.
Error-sensing and -correction defenses in current high-performance, high-availability platforms (servers, communication processors, etc.) are found at four different hardware levels: component (chip, cartridge, etc.), board, platform (chassis), and group of platforms (cluster).
COMPONENT LEVEL FAULT TOLERANCE—At the component level, the error detection and recovery mechanisms are part of the component's hardware and firmware. In general, we find very limited error detection and poor error containment in commercially available processors. For example, in Pentium and P6 processors error detection by parity only covers the memory caches and buses, except for the data bus which has an error correcting code, as does the main memory. All the complex logic of arithmetic and instruction processing remains unchecked. Recovery choices are “reset” actions of varying severity. The cancellation in April 1998 of the duplex “FRC” mode of operation eliminated most of the error containment boundary. All internal error detection and recovery logic remains entirely unchecked as well.
Similar error coverage and containment deficiencies are found in the high-end processors of most other manufacturers. The exceptions are the IBM S/390 G5 and G6 processors, which internally duplicate the arithmetic and instruction handling units and provide extensive error detection or correction and transient recovery for the entire processor. Near 100% error detection and containment are attained in the G5 and G6 processors, which carry on the legacy of fault tolerance from the IBM 9020, the last mainframe.
Intel too has an early record of building error-removal systems. The original design of the Pentium and P6 family of processors (except P4) includes the FRC (functional redundancy checking) mode of operation. In FRC mode two processors operate as a master/checker pair in which both receive the same inputs. The checker internally compares its outputs to those of the master and issues the FRCERR signal when a disagreement is found. The master enters the Machine Check state upon receiving FRCERR. Operation in FRC mode provides near 100% error detection at the master's output and also makes the output an error containment boundary.
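The master/checker comparison described above can be illustrated in software. The following is only a minimal sketch of the principle, not Intel's implementation: the names `frc_step`, `MachineCheck`, `healthy_alu` and `faulty_alu` are hypothetical, and each processor is modeled as a plain function of its inputs.

```python
from typing import Callable


class MachineCheck(Exception):
    """Models the master entering the Machine Check state on FRCERR."""


def frc_step(master: Callable[[int], int],
             checker: Callable[[int], int],
             inputs: int) -> int:
    """Run one lockstep step of a hypothetical master/checker pair."""
    master_out = master(inputs)    # the master drives the external output
    checker_out = checker(inputs)  # the checker computes the same result privately
    if checker_out != master_out:  # disagreement: model FRCERR being asserted
        raise MachineCheck(f"FRCERR: master={master_out}, checker={checker_out}")
    return master_out              # the output leaves the pair only if both agree


def healthy_alu(x: int) -> int:
    return x + 1


def faulty_alu(x: int) -> int:
    return (x + 1) ^ 0b100  # simulated fault: bit 2 of the result is flipped
```

Calling `frc_step(healthy_alu, healthy_alu, 3)` returns the agreed result, while `frc_step(faulty_alu, healthy_alu, 3)` raises `MachineCheck`. The sketch also shows the limit of a pair: the comparison reveals that an error exists, but not which of the two units produced it.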
In April 1998, however, a set of specification changes was issued by Intel: “The XXX processor will not use the FRCERR pin. All references to these pins will be removed from the specification.” (The “XXX” stands for all processors from Pentium to Pentium III.) Deletion of FRC mode left the Pentium and P6 processors with very limited error detection and containment. No further explanation was provided by Intel for the change. My conjecture is that asynchronous inputs could not be properly handled in the FRC mode. Intel did not reply to my inquiry about the cause of FRC deletion.
Processors that do not have adequate error detection and containment can be made fault-tolerant by forming a self-checking pair with comparison (e.g., the FRC mode) or a triplet with majority voting on the outputs. Since the FRC mode deletion, there is a second deficiency of contemporary processors: they do not (or cannot) provide hardware support for comparison or voting operations.
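The voting operation of a triplet can be sketched in the same way. This is an illustrative software model under my own assumptions, not a description of any particular voter hardware: the function name `majority_vote` is hypothetical, and channel outputs are modeled as integers.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def majority_vote(outputs: Sequence[int]) -> Tuple[Optional[int], List[int]]:
    """Vote on the outputs of redundant channels.

    Returns (voted value, indices of dissenting channels). If no value is
    held by a strict majority, the voted value is None and every channel
    is reported as suspect.
    """
    if not outputs:
        raise ValueError("at least one channel output is required")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):  # no strict majority exists
        return None, list(range(len(outputs)))
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

With three channels, `majority_vote([7, 7, 9])` yields the value 7 and identifies channel 2 as the dissenter, so a single faulty channel is both masked and located. A self-checking pair, by contrast, can only detect the disagreement.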
FAULT TOLERANCE AT HIGHER LEVELS—At the board level, completely redundant hardware as well as software components are used to assure very high availability. The “hot standby” approach is especially widely used in the fields of embedded systems and telecommunications. Hot standby duplexing selectively duplicates the most critical subsystems, such as the CPU, power supplies, cooling fans, etc. Less costly error-containment techniques such as ECC, RAID, N+1 sparing, etc. are used for the remaining subsystems. The CPU boards present the greatest challenge: to detect faults in both CPUs and to execute a rapid switchover to the hot standby CPU when the active CPU is faulty. A good example of the state of the art is the Ziatech high availability architecture. The critical elements that execute CPU switchover are three hardware modules and four software modules for each CPU. These modules must be operational to make a switchover possible, but they are not protected by fault tolerance themselves.
At the platform level a widely used technique is Intelligent Platform Management (IPM) that requires the introduction of the IPM hardware subsystem into the platform(s). It consists of additional on-the-market hardware (buses and controllers) and firmware that provides autonomous monitoring and recovery functions. Also provided are logging and inventory functions. The effectiveness of the IPM monitoring and recovery actions is limited by the error information outputs and recovery commands of the conventional processors of the platform. For example, the P6 processors have only a set of “reset” commands and five error signals (after deletion of FRCERR) whose coverage was estimated to be very limited.
Known IPM subsystems are not themselves protected by error containment or resilience. The cost of adding such attributes may be high because of the multiple functions of the IPM. Version 1.5 of the IPM Interface Specification (implementation independent) has 395 pages, which represent a lot of functionality to be protected. Furthermore, the IPM does not support comparison or voting for redundant multichannel (duplex or triplex) computing.
At the cluster level, a cluster is a group of two or more complete platforms (nodes) in a network configuration. Upon failure of one node, its workload is distributed among the remaining nodes. There are many different implementations of the generic concept of “clustering”. Their common characteristic is that they are managed by cluster software such as Microsoft Cluster Service, Extreme Linux, etc. The main disadvantages for telecommunication or embedded systems are: (1) the relatively long recovery time (seconds); and (2) the cost of complete replication, including power consumption, replication of peripherals, etc.
The four approaches discussed above are at different levels and can be implemented in different combinations. The integration of the different error detection, recovery and logging techniques is a major challenge when two or more approaches are combined in the same platform.
THE DESIGN-FAULT PROBLEM—None of the approaches described above addresses the problem of tolerating design faults in hardware (“errata”) and in software (“bugs”) of the COTS processors. Yet a study of eight models of the Pentium and P6 processors shows that by April 1999 from 45 to 101 errata had been discovered, and from 30 to 60 had remained unfixed in the latest versions (“steppings”) of the processors. The discovery of errata is a continuing process. For example, consider the Pentium III. The first specification update (March 1999) listed 44 errata, of which 36 remained unfixed in five steppings (until May 2001). From March 1999 to May 2001, 35 new errata were discovered, of which 22 remained unfixed. Other manufacturers also publish errata lists, but those of Intel are the most comprehensive and well organized.
Most of the errata are triggered by rare events and are unlikely to cause system failures; yet the designers of error-sensitive systems cannot ignore their existence and the fact that more errata will be discovered after a system design is complete. Continuing growth of processor complexity and the advent of new technologies indicate that the errata problem will remain and may get worse in the future.
The most effective method of error detection and containment, particularly including errors arising in the design stage, is design diversity, i.e., multichannel computing in which each channel employs independently designed hardware and software, as in the Boeing 777 Primary Flight Control Computer. The Boeing and other diverse designs employ diverse COTS processors and custom hardware and software because few or no commercially available processors support multichannel computing. Design-spawned error correction by means of design diversity will become much less costly if it can be supported by commercially available hardware elements. It is also important to note that design diversity provides support for the detection and neutralization of malicious logic.
LIMITATIONS OF THE FOUR APPROACHES—The implementation of defenses at all or some of the above described four levels has led to the market appearance of many high-availability platforms (advertised as 99.999% or better) for server, telecommunications, embedded and other applications; however, all four approaches show deficiencies that impose limits on their effectiveness.
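To put the advertised availability figure in concrete terms, the arithmetic can be sketched as follows. This is an illustration only: the function names are hypothetical, a 365-day year is assumed, and the 10-second recovery time in the usage note is an assumed value chosen to represent a recovery time on the order of seconds.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime permitted by a given availability percentage."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


def outages_within_budget(availability_percent: float,
                          recovery_seconds: float) -> float:
    """Number of outages per year that fit in the downtime budget,
    if each outage lasts recovery_seconds (an assumed figure)."""
    budget_seconds = downtime_minutes_per_year(availability_percent) * 60.0
    return budget_seconds / recovery_seconds
```

An availability of 99.999% corresponds to about 5.26 minutes of downtime per year. Under the assumed 10-second recovery time, `outages_within_budget(99.999, 10.0)` gives roughly 31 recoveries per year before that budget is exhausted, which indicates why recovery time matters as much as fault coverage.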
At the component level the Intel P6 and Itanium processors, as well as those of most other manufacturers (except IBM's G5 and G6), have a low error detection and containment coverage, leaving instruction handling and arithmetic entirely unchecked. After executing the Reset command most of the existing checks (bus ECC, parity, etc.) are disabled and must be enabled by software that sets bits in the (unprotected) Power-On Configuration register. In general, critical recovery decisions are handed off to software.
At the board level, such as in “hot standby”, unprotected “hard core” hardware and software elements handle the critical switchover procedure.
At the platform level the Intelligent Platform Management (IPM) hardware subsystem handles both critical recovery decisions and voluminous logging, configuration record keeping and communication management operations. The critical IPM element is the Baseboard Management Controller (BMC) that is itself unprotected and depends on interaction with software to carry out its functions.
At the cluster level, which is software-controlled, the disadvantages are long recovery times and the high cost of complete system (chassis) replication.
In summary, the weaknesses are:
1. the presence of unprotected “hard core” elements, especially in the error detection and recovery management hardware and software;
2. the commingling of hardware and software defenses: both must succeed in order to attain recovery;
3. the absence of built-in support for multiple-channel computing that provides high coverage and containment, especially when design diversity is employed to attain design-fault tolerance.
It is my conclusion that during the explosive evolution of hardware over the past 25 years, computer hardware has not been adequately utilized for the assurance of error containment. This observation is not intended to unduly criticize or derogate the efforts of the very many talented and diligent workers in this field; however, these are exceedingly difficult problems, and the state of the art does leave ample room for refinements.