1. Field of the Invention
This invention relates generally to robustness (resistance to failure) in computer systems; and more particularly to novel apparatus and methods for shielding and preserving computer systems—which can be substantially conventional systems—from failure.
2. Related Art
(a) Earlier publications—Listed below, and wholly incorporated by reference into the present document, are earlier materials in this field that will be helpful in orienting the reader. Cross-references to these publications, by number in the following list, appear enclosed in square brackets in the present document:
    [1] Intel Corp., Intel's Quality System Databook (January 1998), Order No. 210997-007.
    [2] A. Avižienis and Y. He, "Microprocessor entomology: A taxonomy of design faults in COTS microprocessors", in J. Rushby and C. B. Weinstock, editors, Dependable Computing for Critical Applications 7, IEEE Computer Society Press (1999).
    [3] A. Avižienis and J. P. J. Kelly, "Fault tolerance by design diversity: concepts and experiments", Computer, 17(8):67-80 (August 1984).
    [4] A. Avižienis, "The N-version approach to fault-tolerant software", IEEE Trans. Software Eng., SE-11(12):1491-1501 (December 1985).
    [5] M. K. Joseph and A. Avižienis, "Software fault tolerance and computer security: A shared problem", in Proc. of the Annual National Joint Conference and Tutorial on Software Quality and Reliability, pages 428-36 (March 1989).
    [6] Y. He, An Investigation of Commercial Off-the-Shelf (COTS) Based Fault Tolerance, PhD thesis, Computer Science Department, University of California, Los Angeles (September 1999).
    [7] Y. He and A. Avižienis, "Assessment of the applicability of COTS microprocessors in high-confidence computing systems: A case study", in Proceedings of ICDSN 2000 (June 2000).
    [8] Intel Corp., The Pentium II Xeon Processor Server Platform System Management Guide (June 1998), Order No. 243835-001.
    [9] A. Avižienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr, and D. K. Rubin, "The STAR (Self-Testing-and-Repairing) computer: An investigation of the theory and practice of fault-tolerant computer design", IEEE Trans. Comp., C-20(11):1312-21 (November 1971).
    [10] T. B. Smith, "Fault-tolerant clocking system", in Digest of FTCS-11, pages 262-64 (June 1981).
    [11] Intel Corp., P6 Family of Processors Hardware Developer's Manual (September 1998), Order No. 244001-001.
    [12] A. Avižienis, "Toward systematic design of fault-tolerant systems", Computer, 30(4):51-58 (April 1997).
    [13] "Special report: Sending astronauts to Mars", Scientific American, 282(3):40-63 (March 2000).
    [14] NASA, "Conference on enabling technology and required scientific developments for interstellar missions", OSS Advanced Concepts Newsletter, page 3 (March 1999).
(b) Failure of computer systems—The purpose of a computer system is to deliver information processing services according to a specification. Such a system is said to "fail" when the service it delivers either stops or becomes incorrect, that is, deviates from the specified service.
There are five major causes of system failure ("F"):
    (F1) permanent physical failures (changes) of its hardware components [1];
    (F2) interference with the operation of the system by external environmental factors, such as cosmic rays, electromagnetic radiation, and excessive temperature;
    (F3) previously undetected design faults (also called "bugs", "errata", etc.) in the hardware and software components of the system that manifest themselves during operation [2-4];
    (F4) malicious actions by humans that cause the cessation or alteration of correct service, such as the introduction of computer "viruses", "worms", and other software that maliciously affects system operation [5]; and
    (F5) unintentional mistakes by human operators or maintenance personnel that lead to the loss of, or undesirable changes to, system service.
Commercial-off-the-shelf ("COTS") hardware components (memories, microprocessors, etc.) for computer systems have a low probability of failure from cause F1 above [1]. They provide, however, very limited protection, or none at all, against causes F2 through F5 listed above [6, 7].
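One known countermeasure against design faults (cause F3) in COTS components is the N-version approach cited above [4]: independently developed versions of the same function are executed and their results compared by majority vote. The following minimal sketch illustrates that cited concept only; the function and version names are hypothetical, and the sketch is not a description of the apparatus of the present invention.

```python
from collections import Counter

def n_version_execute(versions, x):
    """Run independently developed versions of the same function on the
    same input and return the majority result, masking a design fault
    in any single version (cf. the N-version approach of [4])."""
    results = [v(x) for v in versions]
    winner, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority: versions disagree")
    return winner

# Three hypothetical, independently written versions of a doubling
# function; version_c contains a deliberate design fault ("bug").
version_a = lambda x: x * 2
version_b = lambda x: x + x
version_c = lambda x: x * 2 + 1  # faulty version

print(n_version_execute([version_a, version_b, version_c], 21))  # 42
```

The fault in the single faulty version is outvoted by the two correct versions, so correct service continues despite an undetected design fault in one component.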
Accordingly, the related art remains subject to major problems, and the efforts outlined in the cited publications, though praiseworthy, have left room for considerable refinement.