1. Field of the Invention
The present invention relates to a memory state recovering apparatus capable of restoring the contents of a memory to the original state in a computer system.
2. Description of the Related Art
After having executed a program and finished the process, ordinary computers generally cannot return control to the preceding state and then restart the process.
In the various applications listed below, however, it is desirable to have a function of returning the contents of the memory to the preceding state and resuming the process from that point in time (the memory state recovering function).
(1) Software debugging
If any error occurred during the execution of a program, returning control to the preceding state would enable the cause of the error to be analyzed.
(2) Fault tolerance
If the process stopped due to a failure during the operation of the system, the operation would be allowed to continue without stopping the system, by returning control to the preceding state and resuming the process therefrom.
Such fault tolerance techniques have been disclosed in, for example, Philip A. Bernstein, "Sequoia: A Fault-Tolerant Tightly Coupled Multiprocessor for Transaction Processing," IEEE Computer, Vol. 21, No. 2, 1988.
(3) Back tracking
In logic programming languages, the back tracking of the executed state is a basic operation. Use of the function of returning the contents of the memory to the preceding state realizes back tracking.
One technique considered to be a method of realizing the aforementioned memory state recovering function is a backward recovery method.
FIG. 1 shows a block diagram of a system using the backward recovery method. The system of FIG. 1 comprises a processor 30, a memory control section 31, a main memory 32, and a before image buffer 33.
The before image buffer 33 is a memory for retaining the preceding state of the main memory 32 under the control of the memory control section 31. A single entry (also called a before image element) consists of a main memory address and data.
An example of the operation of the system constructed as shown in FIG. 1 will be explained below.
Now, consider a case where the processor 30 writes data "Dnew" into location "A" in the main memory 32.
After having received a request for a "Write" process from the processor 30, the memory control section 31, before updating the main memory 32, reads data "Dold" stored in the same location "A" and stores it together with the address value "A" of the location in the before image buffer 33. Thereafter, the memory control section 31 writes data "Dnew" into location "A" in the main memory 32.
Each time it receives a request for a "Write" process from the processor 30, the memory control section 31 repeats the operation and stores an address in the main memory 32 and the data held at that address in another entry in the before image buffer 33 in sequence.
To bring the main memory 32 into the preceding state, the memory control section 31 sequentially reads the entries (addresses "A" and data "Dold") stored in the before image buffer 33, starting with the latest one, and writes data "Dold" in the memory locations with the addresses "A" in sequence.
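The write and recovery operations described above can be sketched as follows. This is only an illustrative software model, not the circuit of FIG. 1: the class name BeforeImageMemory, its methods, and the word-sized list used for the main memory are assumptions introduced here. The example writes location 1 twice to show why the entries must be read back starting with the latest one.

```python
class BeforeImageMemory:
    """Toy model of main memory 32 plus before image buffer 33 (hypothetical names)."""

    def __init__(self, size):
        self.main_memory = [0] * size
        # Each entry is (address, old data); the oldest entry comes first.
        self.before_images = []

    def write(self, address, new_data):
        # Save "Dold" with its address before updating the main memory.
        self.before_images.append((address, self.main_memory[address]))
        self.main_memory[address] = new_data

    def recover(self):
        # Read the entries starting with the latest one and write "Dold" back.
        while self.before_images:
            address, old_data = self.before_images.pop()
            self.main_memory[address] = old_data


mem = BeforeImageMemory(size=4)
mem.write(1, "D1")
mem.write(1, "D2")  # second write to the same location saves "D1"
mem.write(3, "D3")
mem.recover()       # restores 3, then 1 to "D1", then 1 to its original 0
```

If the entries were replayed oldest first, location 1 would end up holding "D1" rather than its original value.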
In general, to resume the execution of the program from a certain state, not only the preceding contents of the main memory 32 but also the preceding internal state of the processor 30 are required. One of the methods of retaining the internal state of the processor 30 is a checkpoint method that stores the internal state in the main memory 32 at suitable time intervals. In the checkpoint method, the timing of storing the internal state is referred to as a checkpoint, and the act of storing the contents of the main memory 32 and the internal state of the processor 30 is referred to as performing a checkpoint.
In performing a checkpoint, the before image buffer 33 is cleared at the same time. As a result, in the before image buffer 33, the original values of the locations (addresses) in the main memory 32 updated from the latest checkpoint up to now are stored.
This makes it possible to return control of the program from any point in time to the latest checkpoint.
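A minimal sketch of the checkpoint method follows, assuming a single processor whose internal state is reduced to a small register dictionary. The names CheckpointedSystem, registers, and saved_registers are hypothetical, and for brevity the saved internal state is kept in a separate attribute rather than in the main memory itself.

```python
class CheckpointedSystem:
    """Toy model: main memory, before image buffer, and processor state (assumed names)."""

    def __init__(self, size):
        self.main_memory = [0] * size
        self.registers = {"pc": 0}          # stand-in for the internal state
        self.saved_registers = dict(self.registers)
        self.before_images = []

    def write(self, address, data):
        self.before_images.append((address, self.main_memory[address]))
        self.main_memory[address] = data

    def checkpoint(self):
        # Store the internal state and clear the before image buffer together.
        self.saved_registers = dict(self.registers)
        self.before_images.clear()

    def rollback(self):
        # Undo every update made since the latest checkpoint, latest entry first.
        while self.before_images:
            address, old_data = self.before_images.pop()
            self.main_memory[address] = old_data
        self.registers = dict(self.saved_registers)


system = CheckpointedSystem(size=4)
system.write(0, 10)
system.registers["pc"] = 5
system.checkpoint()
system.write(0, 20)
system.write(2, 30)
system.registers["pc"] = 9
system.rollback()
```

After the rollback, the memory and the registers hold the values they had when checkpoint() was called, regardless of how many writes followed it.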
Such techniques have been disclosed in, for example, Rok Sosic, "History Cache: Hardware Support for Reverse Execution," Computer Architecture News, Vol. 22, No. 5, 1994.
Next, an example of applying the above-described memory state recovering function to a multiprocessor system will be explained.
FIG. 2 shows a multiprocessor system where n processors 30-1 to 30-n are connected to each other via a bus 34. A memory control section 31 receives a processing request from each of the processors 30-1 to 30-n via the bus 34.
In the multiprocessor system of FIG. 2, too, the operation of the memory control section 31, main memory 32, and before image buffer 33 can be controlled similarly to the configuration of FIG. 1.
Specifically, each time it receives a request for a "Write" process from any of the processors 30-1 to 30-n, the memory control section 31 reads the relevant data "Dold" from the main memory 32 before updating the main memory 32 and sequentially stores it together with the address in the before image buffer 33.
To bring the main memory 32 into the preceding state, the memory control section 31 sequentially reads the entries (addresses "A" and data "Dold") stored in the before image buffer 33, starting with the latest one, and writes data "Dold" in the memory locations with the addresses "A" in sequence.
By performing a checkpoint at suitable time intervals and storing the internal states of all of the processors 30-1 to 30-n, it is possible to return control from any point in time to the checkpoint and resume the process.
It is a common practice that today's processors have cache memory to speed up memory access. Cache memories come in two types: write-through cache memories and copy-back cache memories.
In the case of the write-through cache, when the processor has executed a write process, the value of the data stored in the cache is updated and at the same time, the data stored in the main memory is updated to the value retained in the cache. Therefore, the contents of the cache provided for the processor coincide with the contents of the main memory, so that the memory state recovering function can be realized using the same techniques as described above. The checkpoint processing can be effected in the same manner.
In the case of the copy-back cache, when the processor has executed a write process, only the value in the cache is updated, and the updated contents are not reflected immediately in the main memory. The contents of the main memory are updated only later, when the updated data is written into the main memory as a result of the cache entry being replaced. When the contents of the cache are written into the main memory (in the case of a "Write-Line" process), the writing is usually effected on a cache line basis, each cache line consisting of a plurality of words.
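The difference between the two cache types can be illustrated with the following sketch. It is a simplification under stated assumptions: each cache line holds a single word, the main memory is a plain dictionary, and the class names WriteThroughCache and CopyBackCache and the evict method are hypothetical.

```python
class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}

    def write(self, address, data):
        # The cache and the main memory are updated at the same time.
        self.lines[address] = data
        self.memory[address] = data


class CopyBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.lines = {}
        self.dirty = set()  # addresses updated in the cache only

    def write(self, address, data):
        # Only the cache is updated; the main memory keeps the old value.
        self.lines[address] = data
        self.dirty.add(address)

    def evict(self, address):
        # Replacing the entry triggers the write-back ("Write-Line").
        if address in self.dirty:
            self.memory[address] = self.lines[address]
            self.dirty.discard(address)
        self.lines.pop(address, None)


wt_memory = {0: "old"}
WriteThroughCache(wt_memory).write(0, "new")

cb_memory = {0: "old"}
cb_cache = CopyBackCache(cb_memory)
cb_cache.write(0, "new")
value_before_eviction = cb_memory[0]  # main memory is still stale here
cb_cache.evict(0)
```

With the copy-back cache, the main memory sees the new value only at eviction, which is why the before image must be captured at the "Write-Line" request rather than at the processor's write.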
FIG. 3 shows a multiprocessor system where n processors 30-1 to 30-n are provided with caches 40-1 to 40-n, respectively. When the caches 40-1 to 40-n are copy-back caches, the multiprocessor system of FIG. 3 operates as follows in order to realize the memory state recovering function.
At the time of a checkpoint, as shown in FIG. 4A, not only the internal state of the processor but also all of the updated data items ("A", "B", "C") that are held in the caches and not reflected in the main memory 32 are written back into the main memory 32, thereby storing the system state at this checkpoint. The process for writing the updated data items ("A", "B", "C") back into the main memory 32 is carried out in the same manner as the case that will be described below, in which the before image is retained. Thereafter, the before image buffer 33 is cleared.
When a cache has issued a request for a "Write-Line" process to the memory control section 31 (shown in FIG. 3) after the checkpoint, that is, when the data (the cache line including "a") updated in the cache is to be reflected in the main memory 32, the memory control section 31 transfers the corresponding old data from the main memory 32 to the before image buffer 33 to save the contents retained at the checkpoint.
Specifically, when receiving a request for a "Write-Line" process from the cache, the memory control section 31 reads, from the address in the main memory 32 that is about to be written back to, the line data that includes data "A", and stores the line data together with its address value in the before image buffer 33 (a single line entry stored in the before image buffer 33 consists of a line address in the main memory 32 and line data). Thereafter, the memory control section 31 writes the updated data in the cache (the cache line including data "a") back into the main memory 32.
To return the contents of the main memory 32 to the preceding state (the state at the immediately preceding checkpoint), the memory control section 31 sequentially reads the entries (addresses and the corresponding line data items) stored in the before image buffer 33, starting with the latest one, and writes the line data items into the memory locations with the corresponding line addresses in sequence. This makes it possible to return the main memory 32 to the state at the preceding checkpoint (provided that only the data in the main memory 32 is considered).
At the time of a checkpoint, however, all of the data updated in the cache and not reflected in the main memory 32 must be written back in unison into the main memory 32. As a result, many requests for a "Write-Line" process are issued intensively to the memory control section 31. Because the data is written back into the main memory 32, the old data held in the locations of the main memory 32 that are about to be overwritten must also be stored in the before image buffer 33 in unison.
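The line-based operation described above can be sketched as follows. The model is hypothetical: the class MemoryControl, the constant LINE_WORDS (four words per line), and the dictionary of dirty lines passed to checkpoint() are assumptions chosen for illustration, and the processor internal state is omitted.

```python
LINE_WORDS = 4  # assumed number of words per cache line


class MemoryControl:
    """Toy model of memory control section 31 handling "Write-Line" requests."""

    def __init__(self, num_lines):
        self.main_memory = [[0] * LINE_WORDS for _ in range(num_lines)]
        self.before_images = []  # entries of (line address, old line data)

    def write_line(self, line_address, line_data):
        old_line = list(self.main_memory[line_address])      # read main memory
        self.before_images.append((line_address, old_line))  # write buffer
        self.main_memory[line_address] = list(line_data)     # write main memory

    def checkpoint(self, dirty_lines):
        # Write back every updated line, then clear the before image buffer.
        for line_address, line_data in dirty_lines.items():
            self.write_line(line_address, line_data)
        self.before_images.clear()

    def recover(self):
        # Restore the line data starting with the latest entry.
        while self.before_images:
            line_address, old_line = self.before_images.pop()
            self.main_memory[line_address] = old_line


control = MemoryControl(num_lines=2)
control.checkpoint({0: ["A", 0, 0, 0]})   # state at the checkpoint
control.write_line(0, ["a", 0, 0, 0])     # later write-back of the updated line
control.write_line(1, ["x", "y", 0, 0])
control.recover()
```

Note that checkpoint() routes every dirty line through write_line(), so each write-back at a checkpoint also pays for the read of the old line and the write to the buffer.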
With the memory state recovering function as described above, since a single "Write-Line" process requires two accesses for reading and writing data from and into the main memory 32 and a write access to the before image buffer 33, very many memory accesses occur at a checkpoint where many "Write-Line" processes take place intensively.
At a checkpoint where many memory accesses take place, the system appears to stop in the meantime and cannot perform the remaining ordinary processes. Consequently, if much time is spent on the checkpoint process, the processing efficiency of the entire system decreases.
The tendency gets more noticeable as the number of processors increases or the number of cache lines to be written at the time of a checkpoint increases because of an increase in the capacity of a cache. This raises a serious problem in constructing a large-scale and high-performance system.
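The growth in memory accesses can be made concrete with a small calculation based on the counts given above (two main memory accesses and one before image buffer write per "Write-Line" process). The function name checkpoint_accesses and the processor and line counts used below are illustrative assumptions, not measured figures, and the sketch assumes every cache holds the same number of updated lines.

```python
def checkpoint_accesses(num_processors, dirty_lines_per_cache):
    """Count the accesses caused by writing back all updated lines at a checkpoint."""
    write_lines = num_processors * dirty_lines_per_cache
    return {
        "write_line_requests": write_lines,
        "main_memory_accesses": 2 * write_lines,      # one read plus one write
        "before_image_buffer_writes": write_lines,    # one write per line
        "total_accesses": 3 * write_lines,
    }


small = checkpoint_accesses(num_processors=2, dirty_lines_per_cache=1024)
large = checkpoint_accesses(num_processors=8, dirty_lines_per_cache=4096)
growth = large["total_accesses"] // small["total_accesses"]
```

In this example, quadrupling both the number of processors and the number of updated lines per cache multiplies the accesses at a checkpoint by sixteen.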
As explained above, with a conventional memory state recovering apparatus, when the line data is written from the cache into the main memory, the old data is read from the main memory and is retained in the before image buffer 33. Further, when a checkpoint is performed in a multiprocessor system using copy-back caches, all of the updated data stored in the caches is written back into the main memory in unison.
Consequently, the process of saving the relevant data items from the main memory 32 into the before image buffer 33 is concentrated at the time of copying back. Therefore, the time required for the checkpoint process increases and no other process can be executed during the checkpoint process. As a result, the system performance is degraded.