1. Field of the Invention
The invention concerns memory for computer systems generally and more specifically concerns memory which is shared among processes which have concurrent access to the memory.
2. Description of the Prior Art
Increasingly, computer-implemented systems are designed so that a number of processes performing various system functions share memory. In such shared memory systems, the memory is concurrently accessed by the processes, and the processes communicate with one another by setting and reading values in the shared memory.
A major goal in the design of modern computer-implemented systems is fault tolerance, that is, the ability of the system to keep on functioning in the presence of the failure of some of its components. An important part of fault tolerance is benign failure modes: when a component of a system finally does fail, the failure should be such that the rest of the system can easily deal with it. An example of a benign failure mode for a processor is the stop failure mode: when the processor fails, it simply stops, rather than continuing to operate in an erroneous fashion.
In shared memory systems, it is of course crucial that the shared memory be fault tolerant, and it is desirable that it have a benign failure mode. Recent research has demonstrated that it is possible to build a fault-tolerant shared memory from potentially faulty components, as long as the number of faulty components does not exceed a threshold value. See Yehuda Afek, David Greenberg, Michael Merritt, and Gadi Taubenfeld, "Computing with faulty shared memory", in Proceedings of the 11th ACM Symposium on Principles of Distributed Computing, August, 1992.
Two types of benign failure of shared memories have also been investigated: the crash mode and the omission mode. See P. Jayanti, T. Chandra, and S. Toueg, "Fault-tolerant wait-free shared objects", in 33rd Annual Symposium on the Foundations of Computer Science, IEEE Computer Society Press, October, 1992. The crash mode is like the stop mode for processes: the failure of the memory appears to the processes which are sharing it to be complete and instantaneous; memory operations performed by a process preceding the failure behave correctly; memory operations performed after the failure return a special value which indicates that the operation was not performed. The special value is represented in the following by the * symbol. In the omission mode, memory operations performed on the shared memory by a process during and after a failure do not behave correctly, but instead return a special value which indicates that the operation may or may not have been performed. The special value is represented in the following by the ? symbol. The difficulty with the omission mode is that once a failure has occurred, a process never again knows definitely whether a memory operation was performed. The omission mode is thus substantially less benign than the crash mode.
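The difference between the two failure modes can be illustrated with a minimal sketch of a single shared register. The class names, the `fail` method, the `"ok"` return value, and the 50% chance that a post-failure write takes effect are all assumptions made for illustration; they do not come from the cited references, and the sketch ignores concurrency entirely.

```python
import random

CRASHED = "*"   # special value: the operation was definitely NOT performed
UNKNOWN = "?"   # special value: the operation may or may not have been performed
OK = "ok"       # hypothetical acknowledgement for a successful write


class CrashRegister:
    """Register with the crash failure mode (illustrative only)."""

    def __init__(self, value=0):
        self._value = value
        self._failed = False

    def fail(self):
        # The failure appears complete and instantaneous to all processes.
        self._failed = True

    def read(self):
        return CRASHED if self._failed else self._value

    def write(self, value):
        if self._failed:
            return CRASHED          # the stored value is left untouched
        self._value = value
        return OK


class OmissionRegister:
    """Register with the omission failure mode (illustrative only)."""

    def __init__(self, value=0, rng=None):
        self._value = value
        self._failed = False
        self._rng = rng or random.Random()

    def fail(self):
        self._failed = True

    def read(self):
        return UNKNOWN if self._failed else self._value

    def write(self, value):
        if self._failed:
            # Assumption: the write takes effect with probability 1/2.
            # The caller cannot tell which case occurred.
            if self._rng.random() < 0.5:
                self._value = value
            return UNKNOWN
        self._value = value
        return OK
```

After a crash failure, a process that receives `*` knows the register still holds the last successfully written value. After an omission failure, a process that receives `?` cannot tell whether its write changed the register, which is why the omission mode is the less benign of the two.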
A problem which the foregoing work has not solved is the construction of gracefully-degrading fault-tolerant memories from components having very benign failure modes. A gracefully-degrading fault-tolerant memory is one whose failure mode is at least as benign as the failure modes of its components. The Jayanti reference above demonstrates that it is not possible to construct a gracefully-degrading fault-tolerant memory from components having the crash failure mode, but that it is possible to construct a gracefully-degrading fault-tolerant memory from components having the omission mode.
What is needed, and what is provided by the present invention, is memories with the following properties:
they have failure modes which are more benign than the omission mode; and
they may be used to construct fault-tolerant memories which are gracefully degrading.